# Importing Required Libraries

This code block imports the libraries used throughout the notebook.

- `os`: This library provides a portable way of using operating system dependent functionality.

- `zipfile`: This library provides tools to create, read, write, append, and list a ZIP file.

- `requests`: This library is used for making HTTP requests to retrieve data from a specified URL.

- `functools`: This library provides higher-order function tools such as `partial`, which is used later to force decoding of the raw response stream.

- `pathlib`: This library provides an object-oriented way of working with file paths.

- `tqdm`: This library provides a progress bar to indicate the progress of a task.

- `pandas`: This library provides data structures and data analysis tools for handling and manipulating numerical tables and time series data.

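- `nest_asyncio`: This library patches `asyncio` so that an event loop can be re-entered while it is already running, which is needed in Jupyter because the notebook itself runs inside an event loop.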

- `aiohttp`: This library provides an asynchronous HTTP client/server for asyncio.

- `time`: This library provides functions to work with time.

- `asyncio`: This library provides asynchronous I/O, event loop, and concurrency tools.

- `csv`: This library provides functionality to read from and write to CSV (Comma-Separated Values) files.

- `load_dotenv`: This function is imported from the `dotenv` library and is used to load environment variables from a .env file.


In [2]:
!pip install requests tqdm pandas pillow nest_asyncio aiohttp python-dotenv





In [3]:
import os
import zipfile
import requests
import functools
import pathlib
from tqdm import tqdm
import pandas as pd
import nest_asyncio
import aiohttp
import time
import asyncio
import csv
from dotenv import load_dotenv
load_dotenv()

True

# Setting Paths and Variables
This code block sets several paths and variables that are used throughout the code.

- `output_path`: This variable stores the base folder path for the project.

- `images_path`, `metadata_path`, `include_path`: These variables store the paths for the images, metadata, and include folders, respectively. The paths are constructed by joining the base folder path with the respective sub-folder names.

- `list_of_paths`: This list stores the paths of the output, images, metadata, and include folders.

- `dataset_url`: This variable stores the base URL for the dataset.

- `num_images`: This variable stores the number of images to be downloaded.


In [15]:
# Set the base folder path for the project
output_path = "../output"
images_path = os.path.join(output_path, "images")
metadata_path = os.path.join(output_path, "metadata")
include_path = os.path.join(output_path, "include")

list_of_paths = [output_path, images_path, metadata_path, include_path]

# Set the base URL for the dataset
dataset_url = "https://unsplash.com/data/lite/latest"

# Set the number of images to download
num_images = 100

# Creating a Folder
This function `create_folder` creates a folder at the specified path.

**Function Parameters:**

- `path (str)`: The path of the folder to be created.

**Function Returns:**

- `None`

**Function Behavior:**

- The function uses the `os.mkdir` method to create the folder at the specified path.
- If the folder already exists, the function prints a message saying so.
- If there is an error creating the folder, the function prints the error message.


In [5]:
def create_folder(path):
    """
    This function creates a folder at the specified path.
    If the folder already exists, it will print a message saying so.
    If there is an error creating the folder, it will print the error message.

    Parameters:
        :param path (str): The path of the folder to be created.

    Returns:
    None
    """
    try:
        # Use os.mkdir to create the folder at the specified path
        os.mkdir(path)
        print(f"Folder {path} created")
    except FileExistsError:
        # If the folder already exists, print a message saying so
        print(f"Folder {path} already exists")
    except Exception as e:
        # If there is an error creating the folder, print the error message
        print(f"Error creating folder {path}: {e}")
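
`os.mkdir` creates only the last component of a path, so it fails when a parent directory is missing. As a minimal alternative sketch, assuming nested folders should be created in a single call, the hypothetical helper `ensure_folder` below (not part of the original notebook) uses `os.makedirs` with `exist_ok=True`:

```python
import os
import tempfile


def ensure_folder(path):
    """Create `path` and any missing parents; return True if it was newly created.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    existed = os.path.isdir(path)
    # exist_ok=True suppresses FileExistsError when the folder is already there
    os.makedirs(path, exist_ok=True)
    return not existed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        nested = os.path.join(base, "output", "images")
        print(ensure_folder(nested))  # first call creates output/ and output/images/
        print(ensure_folder(nested))  # second call finds the folder already in place
```

With this variant the order of `list_of_paths` would no longer matter, since sub-folders can be created before their parent.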

# Initializing Folders
This function `init_folder` initializes the specified folders.

**Function Parameters:**

- `folder_names (list)`: A list of folder names to be created.

**Function Behavior:**

- The function iterates over the list of folder names and calls the `create_folder` function for each name.
- This function is used to create the required output, images, metadata, and include folders.

In [6]:
def init_folder(folder_names: list):
    for folder_name in folder_names:
        create_folder(folder_name)

In [7]:
init_folder(list_of_paths)

Folder ../output created
Folder ../output\images created
Folder ../output\metadata created
Folder ../output\include created


# Downloading a File
This function `download` downloads a file from a given URL and saves it to a specified filename.

**Function Parameters:**

- `url (str)`: The URL of the file to be downloaded.
- `filename (str)`: The filename to save the file as.

**Function Returns:**

- `path (pathlib.Path)`: The absolute path of the downloaded file, or `None` if the download failed.

**Function Behavior:**

- The function creates a `requests.Session` object and mounts an `HTTPAdapter` with `max_retries=3` for the URL, so that failed connections are retried up to 3 times.
- The function then sends a GET request to the URL to start the download.
- The function raises an error if the response is not 200 OK.
- The function retrieves the file size from the `Content-Length` header and uses it to display the download progress using the `tqdm` library.
- The function opens the target file in binary write mode and writes each chunk of data from the response to the file.
- The function returns the path to the downloaded file.
- If an HTTP error occurs while downloading the file, the function prints an error message and implicitly returns `None`.
- If any other error occurs while downloading the file, the function prints a general error message and implicitly returns `None`.

In [8]:
def download(url, filename):
    """
    Download a file from a given URL and save it to a specified filename.

    Parameters:
        :param url (str): The URL of the file to be downloaded.
        :param filename (str): The filename to save the file as.

    Returns:
    path (str): The path of the downloaded file.
    """
    try:
        # Create a session object to persist the state of connection
        s = requests.Session()
        # Retry connecting to the URL up to 3 times
        s.mount(url, requests.adapters.HTTPAdapter(max_retries=3))
        # Send a GET request to the URL to start the download
        r = s.get(url, stream=True, allow_redirects=True)
        # Raise an error if the response is not 200 OK
        r.raise_for_status()
        # Get the file size from the Content-Length header, default to 0 if not present
        file_size = int(r.headers.get('Content-Length', 0))
        # Get the absolute path to the target file
        path = pathlib.Path(filename).expanduser().resolve()
        # Create parent directories if they don't exist
        path.parent.mkdir(parents=True, exist_ok=True)
        # Set the description to display while downloading, "(Unknown total file size)" if file size is 0
        desc = "(Unknown total file size)" if file_size == 0 else ""
        # Enable decoding the response content
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        # Use tqdm to display the download progress
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=desc) as pbar:
            # Open the target file in binary write mode
            with path.open("wb") as f:
                # Write each chunk of data from the response to the file
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
                    pbar.update(len(chunk))
        # Return the path to the downloaded file
        return path
    # Handle HTTP error if the response is not 200 OK
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred while downloading dataset: {e}")
    # Handle any other exceptions that might occur while downloading the file
    except Exception as e:
        print(f"Error occurred while downloading dataset: {e}")
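
The chunked write at the core of `download` can be exercised without a network connection. The sketch below is an illustration only: `write_chunks` is a hypothetical helper, and its `chunks` argument stands in for `r.iter_content(chunk_size=1024)`.

```python
import pathlib
import tempfile


def write_chunks(chunks, filename):
    """Write an iterable of byte chunks to `filename`; return (path, bytes_written).

    Hypothetical helper for illustration; `chunks` stands in for
    `r.iter_content(chunk_size=1024)` in the notebook's `download` function.
    """
    path = pathlib.Path(filename).expanduser().resolve()
    # Create parent directories if they don't exist
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return path, written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        target = pathlib.Path(base) / "nested" / "archive.bin"
        path, written = write_chunks([b"abc", b"de"], target)
        print(path.name, written)
```

Returning the byte count alongside the path gives the caller a cheap way to compare against `Content-Length` when that header is present.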

# Downloading and Extracting the Dataset
This function `download_dataset` downloads the dataset from a given URL, unzips it, and stores the images in a specified image path.

**Function Parameters:**

- `dataset_url (str)`: The URL of the dataset to be downloaded.
- `image_path (str)`: The path to store the images after unzipping the dataset.

**Function Behavior:**

- The function checks whether an `archive.zip` file already exists in the working directory; the images folder itself is not inspected.
- If `archive.zip` does not exist, the function downloads it from the given URL using the `download` function.
- The function then uses the `zipfile` library to extract the contents of the `archive.zip` file to the specified image path.
- The function then removes the `archive.zip` file.
- If an error occurs while unzipping the dataset, the function prints an error message.
- If an error occurs while removing the `archive.zip` file, the function prints an error message.

In [9]:
def download_dataset(dataset_url, image_path):
    """
    Downloads the dataset from the given URL, unzips it, and stores the images in the specified image path.

    Args:
        :param dataset_url (str): URL of the dataset to be downloaded
        :param image_path (str): Path to store the images after unzipping the dataset
    """
    # Only proceed if archive.zip is not already present in the working directory
    # (the images folder itself is not inspected)
    if not os.path.exists('archive.zip'):
        # Download the dataset from the given url; download() returns None on failure
        if download(dataset_url, 'archive.zip') is None:
            return
        print("Dataset downloaded!")
        try:
            # Extract the contents of the archive.zip to the specified image path
            with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
                zip_ref.extractall(image_path)
            print("Dataset unzipped")
        except Exception as e:
            print(f"Error occurred while unzipping dataset: {e}")
        try:
            # Remove the archive.zip file
            os.remove('archive.zip')
            print("archive.zip removed")
        except Exception as e:
            print(f"Error occurred while removing archive.zip: {e}")
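
Because `archive.zip` is deleted after extraction, the existence check on the archive cannot detect an earlier successful run. One possible way to make the step idempotent is sketched below. Treating `photos.tsv000` as a marker that extraction already happened is an assumption, and `extract_if_needed` is a hypothetical helper, not the notebook's logic.

```python
import os
import tempfile
import zipfile


def extract_if_needed(archive_path, dest_dir, marker="photos.tsv000"):
    """Extract `archive_path` into `dest_dir` unless `marker` is already there.

    Hypothetical helper for illustration. Using `marker` as a sign that the
    dataset was already extracted is an assumption, not the notebook's logic.
    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.exists(os.path.join(dest_dir, marker)):
        return False
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        archive = os.path.join(base, "archive.zip")
        # Build a tiny stand-in archive instead of downloading the real dataset
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("photos.tsv000", "photo_id\tphoto_image_url\n")
        dest = os.path.join(base, "images")
        print(extract_if_needed(archive, dest))  # extraction runs
        print(extract_if_needed(archive, dest))  # marker found, extraction skipped
```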

In [10]:
download_dataset(dataset_url, images_path)

100%|██████████| 632M/632M [00:20<00:00, 31.0MB/s] 


Dataset downloaded!
Dataset unzipped
archive.zip removed


# Reading and Processing the Photos Data
This code block reads the `photos.tsv000` file in the images folder and processes the data.

- The `pd.read_csv` method is used to read the `photos.tsv000` file, and the `sep` parameter is set to `'\t'` to indicate that the data is separated by tabs.
- The resulting data is stored in a Pandas DataFrame called `photo_df`.
- The `photo_df` DataFrame is then modified to only include the `photo_id` and `photo_image_url` columns.
- The `print(photo_df.head())` statement is used to display the first 5 rows of the `photo_df` DataFrame.

In [11]:
nest_asyncio.apply()

In [12]:
# Read the tab-separated photos.tsv000 file from the images folder
photo_df = pd.read_csv(os.path.join(images_path, 'photos.tsv000'), sep='\t')
# Keep only the photo_id and photo_image_url columns
photo_df = photo_df[['photo_id', 'photo_image_url']]

print(photo_df.head())

      photo_id                                    photo_image_url
0  XMyPniM9LF0  https://images.unsplash.com/uploads/1411949294...
1  rDLBArZUl1c  https://images.unsplash.com/photo-141633941111...
2  cNDGZ2sQ3Bo  https://images.unsplash.com/photo-142014251503...
3  iuZ_D1eoq9k  https://images.unsplash.com/photo-141487280988...
4  BeD3vjQ8SI0  https://images.unsplash.com/photo-141700759404...
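
For a quick look at the same two columns without pandas, the file can also be read with the standard `csv` module. The sketch below runs on a made-up in-memory sample; `read_photo_urls` is a hypothetical helper, and only the column names `photo_id` and `photo_image_url` come from the notebook.

```python
import csv
import io


def read_photo_urls(tsv_file, limit=None):
    """Return [(photo_id, photo_image_url), ...] from a tab-separated file object.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    rows = []
    for row in reader:
        if limit is not None and len(rows) >= limit:
            break
        rows.append((row["photo_id"], row["photo_image_url"]))
    return rows


# Made-up sample: the two column names used in the notebook plus one extra column
sample = io.StringIO(
    "photo_id\tphoto_image_url\tother\n"
    "id_a\thttps://example.com/a\tx\n"
    "id_b\thttps://example.com/b\ty\n"
)
print(read_photo_urls(sample, limit=1))
```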


# Downloading an Image Asynchronously
This is an asynchronous function `download_image` that downloads an image from a given URL and saves it to the local file system.

**Function Parameters:**

- `session (aiohttp.ClientSession)`: An `aiohttp` client session that manages HTTP requests and responses.
- `url (str)`: The URL of the image to download.
- `i (int)`: An integer representing the index of the image to download.
- `err_cnt (int, optional)`: An optional integer representing the number of times that the download has failed due to a client error. Defaults to 0.

**Function Returns:**

- This function does not return anything.

**Function Behavior:**

- The function uses the `session.get` method from the `aiohttp` library to send a GET request to the URL of the image.
- The function opens a file in binary write mode with the filename `image_i.jpg`, where `i` is the index of the image, and writes the content of the response to the file.
- If an `aiohttp.ClientError` occurs while downloading the image, the function prints the error, waits 10 seconds, and retries, up to 10 retries in total.
- If the download still fails after 10 retries, the function gives up and returns without raising.
- The HTTP status code is not checked, so the body of an error response would still be written to the `.jpg` file.

In [13]:
async def download_image(session: aiohttp.ClientSession, url: str, i: int, err_cnt=None):
    """
    Downloads an image from the given URL using an aiohttp client session and saves it to the local file system.

    Args:
        session: An aiohttp client session that manages HTTP requests and responses.
        url: The URL of the image to download.
        i: An integer representing the index of the image to download.
        err_cnt: An optional integer representing the number of times that the download has failed due to a client error.
                 If not provided, it defaults to 0.

    Raises:
        Only aiohttp.ClientError is handled; other exceptions (for example OSError when writing the file) propagate to the caller.

    Returns:
        None.
    """
    if err_cnt is None:
        err_cnt = 0
    try:
        async with session.get(url) as response:
            filename = os.path.join(images_path, "image_" + str(i) + ".jpg")
            with open(filename, 'wb') as f:
                f.write(await response.content.read())
            print(f"Downloaded {url} to {filename} idx: {i}")
    except aiohttp.ClientError as e:
        print(f"Error occurred while downloading {url}: {e}")
        if err_cnt == 10:
            return
        await asyncio.sleep(10)
        err_cnt += 1
        await download_image(session, url, i, err_cnt)
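
The retry logic above recurses once per failure. The same behaviour can be written as a loop, as in the sketch below. `retry_async` is a hypothetical helper for illustration, and `operation` stands for any zero-argument coroutine function, such as a wrapper around a single download attempt.

```python
import asyncio


async def retry_async(operation, max_retries=10, delay=10.0, retry_on=(Exception,)):
    """Await `operation()` and retry on failure, up to `max_retries` extra attempts.

    Hypothetical helper for illustration. Returns the operation's result, or
    None once the initial attempt and all retries have failed.
    """
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except retry_on as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries:
                # Pause before the next attempt, as the notebook does with sleep(10)
                await asyncio.sleep(delay)
    return None


if __name__ == "__main__":
    attempts = {"n": 0}

    async def flaky():
        # Simulated download that fails twice before succeeding
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(asyncio.run(retry_async(flaky, max_retries=5, delay=0)))
```

Passing `retry_on=(aiohttp.ClientError,)` would restrict retries to the error type the notebook handles.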

# Download Images

This code defines an asynchronous function `download_images` that downloads a list of images from a set of given URLs. The function uses the `aiohttp` library to manage the HTTP requests and responses during the download process. The function takes two arguments: `image_urls`, a list of strings representing the URLs of the images to be downloaded, and `images_ids`, a list of integers representing the indices of the images to be downloaded.

The function starts by creating a new `aiohttp` client session, which is shared by all requests. A semaphore with 5000 permits caps the number of downloads in flight at any one time; with the 5000-URL chunks used later, this cap is never reached. The function then loops through the `image_urls` list: for each URL it first acquires a permit from the semaphore, then creates a task with `asyncio.ensure_future` that calls `download_image` to download the image and save it to the local file system. A done-callback attached to each task releases the permit when the task completes.

Once all the tasks are created, the function waits for all download tasks to complete using the `asyncio.wait` method. The results of all the tasks are then gathered using the `asyncio.gather` method, although this step is not necessary as the tasks have already completed.

In [16]:
async def download_images(image_urls, images_ids):
    """
    Downloads a list of images from the given URLs using an aiohttp client session and saves them to the local file system.

    Args:
        image_urls: A list of strings representing the URLs of the images to download.
        images_ids: A list of integers representing the indices of the images to download.

    Raises:
        This method does not raise any exceptions.

    Returns:
        None.
    """
    # Create a new aiohttp client session to manage HTTP requests and responses
    async with aiohttp.ClientSession() as session:
        tasks = []  # Create an empty list to hold the tasks that will download the images
        semaphore = asyncio.Semaphore(5000)  # Create a semaphore to limit the number of concurrent downloads
        # Loop through the image URLs and create a new task for each one
        for i, url in enumerate(image_urls):
            try:
                await semaphore.acquire()  # Acquire a permit from the semaphore to limit concurrency
                #url = url + "?w=1000&fm=jpg&fit=max"  # Append query parameters to resize and optimize the image
                task = asyncio.ensure_future(download_image(session, url, images_ids[i]))  # Create a new download task
                task.add_done_callback(
                    lambda x: semaphore.release())  # Release the semaphore permit when the task completes
                tasks.append(task)  # Add the task to the list of download tasks
            except Exception:
                print(f"Error occurred while downloading {url}")
                semaphore.release()  # Release the semaphore permit if an exception occurs
        # Wait for all download tasks to complete
        await asyncio.wait(tasks)
        # Gather the results of all download tasks (not necessary because the tasks have already completed)
        await asyncio.gather(*tasks)
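
The acquire/release bookkeeping can be kept inside each worker by using the semaphore as an async context manager. The sketch below illustrates this pattern with simulated work instead of real HTTP requests. `run_bounded` is a hypothetical helper, and the limit used in the demo is an arbitrary value chosen for illustration.

```python
import asyncio


async def run_bounded(coroutine_factories, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Hypothetical helper for illustration; returns results in input order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(factory):
        # The permit is released automatically when the block exits, even on error
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(worker(f) for f in coroutine_factories))


if __name__ == "__main__":
    async def fake_download(i):
        # Stand-in for a real download; sleeps briefly and returns its index
        await asyncio.sleep(0.01)
        return i

    jobs = [lambda i=i: fake_download(i) for i in range(10)]
    print(asyncio.run(run_bounded(jobs, limit=3)))
```

A single `asyncio.gather` both waits for the tasks and collects their results, so a separate `asyncio.wait` call is not needed in this variant.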

In [17]:
# Get the list of image urls and image ids
image_urls = photo_df['photo_image_url'].values.tolist()[:num_images]
# Image ids run from 0 to len(image_urls) - 1
images_ids = list(range(len(image_urls)))
# Keep only the (url, id) pairs whose image file is not already in the folder,
# filtering both lists together so that each url stays paired with its id
pending = [(url, image_id) for url, image_id in zip(image_urls, images_ids)
           if not os.path.exists(os.path.join(images_path, "image_" + str(image_id) + ".jpg"))]
image_urls = [url for url, _ in pending]
images_ids = [image_id for _, image_id in pending]
print(f"Number of images to download: {len(image_urls)}")

Number of images to download: 100


The code filters a list of image URLs and their respective image IDs by checking if the images already exist in a specified folder.

1. The `image_urls` list is populated with the values of the `photo_image_url` column of the `photo_df` dataframe. The list is then sliced to a specified number of images using the `[:num_images]` syntax.

2. The `images_ids` list is created by generating a range of integers from 0 to the length of the `image_urls` list and slicing it to the specified number of images.

3. The code then creates a new list `image_urls` that only contains URLs of images that do not already exist in the specified folder. This is done by looping through the `image_urls` and `images_ids` lists and checking if the image file with the corresponding ID exists in the folder. If it does, the URL is not added to the new list.

4. Finally, the code prints the number of images that will be downloaded.
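
When filtering, each URL has to stay paired with its id; otherwise a partially filled folder would cause the remaining URLs to be saved under the wrong file names. A minimal sketch of a pair-preserving filter follows. `pending_downloads` is a hypothetical helper, and it assumes the notebook's `image_<id>.jpg` naming.

```python
import os
import tempfile


def pending_downloads(urls, folder):
    """Return [(image_id, url), ...] for images not yet saved in `folder`.

    Hypothetical helper for illustration. Ids are positions in `urls`, and
    files follow the notebook's `image_<id>.jpg` naming.
    """
    return [
        (image_id, url)
        for image_id, url in enumerate(urls)
        if not os.path.exists(os.path.join(folder, f"image_{image_id}.jpg"))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Pretend image 1 was downloaded in an earlier run
        open(os.path.join(folder, "image_1.jpg"), "wb").close()
        urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
        print(pending_downloads(urls, folder))
```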

In [18]:
# Split the list of image urls into chunks of at most 5000 (no timeout is applied)
chunks = [image_urls[i:i + 5000] for i in range(0, len(image_urls), 5000)]
start_t = time.time()
loop = None
for i, chunk in enumerate(chunks):
    start = time.time()
    try:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(download_images(chunk, images_ids[i * 5000:(i + 1) * 5000]))
    except Exception as e:
        print(f"Error occurred while downloading chunk {i}: {e}")
    finally:
        loop.close()
        print(f"[Chunk {i}] Downloaded {len(chunk)} images in {time.time() - start} seconds")

print(f'Downloaded {len(image_urls)} images in {time.time() - start_t} seconds')

Downloaded https://images.unsplash.com/photo-1428550670225-15f007f6f1ba to ../output\images\image_7.jpg idx: 7
Downloaded https://images.unsplash.com/reserve/m6rT4MYFQ7CT8j9m2AEC_JakeGivens%20-%20Sunset%20in%20the%20Park.JPG to ../output\images\image_6.jpg idx: 6
Downloaded https://images.unsplash.com/uploads/1411476843343e89a8f76/bda95c47 to ../output\images\image_5.jpg idx: 5
Downloaded https://images.unsplash.com/photo-1501275578-61eff411d8bf to ../output\images\image_67.jpg idx: 67
Downloaded https://images.unsplash.com/photo-1492644233019-549102aa44f7 to ../output\images\image_61.jpg idx: 61
Downloaded https://images.unsplash.com/photo-1420142515034-86cc8c508475 to ../output\images\image_2.jpg idx: 2
Downloaded https://images.unsplash.com/photo-1455655100973-2a0d5f567386 to ../output\images\image_16.jpg idx: 16
Downloaded https://images.unsplash.com/photo-1502977249166-824b3a8a4d6d to ../output\images\image_47.jpg idx: 47
Downloaded https://images.unsplash.com/photo-1505962372142-
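
The chunking step can be factored into a small generator. The sketch below is an illustration under the assumption that ids and URLs are carried together as pairs, so that no separate slicing of the id list is needed. `chunked` is a hypothetical helper.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Made-up (image_id, url) pairs; ids need not be contiguous after filtering
    pairs = [(0, "https://example.com/a"), (2, "https://example.com/c"), (5, "https://example.com/f")]
    for index, chunk in enumerate(chunked(pairs, 2)):
        print(index, [image_id for image_id, _ in chunk])
```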

This code removes all files in the `images_path` directory except for `.jpg` files. It uses the `os.listdir()` method to get a list of all files in the directory and loops through each file. If the file ends with `.jpg`, it continues to the next file. If the file does not end with `.jpg`, it uses the `os.remove()` method to delete the file. However, if the file is named `TERMS.md`, it skips it and does not delete it. If an exception occurs while removing a file, the code continues to the next file.


In [19]:
# Remove all files except images
for file in os.listdir(images_path):
    if file.endswith('.jpg'):
        continue
    else:
        try:
            # Don't delete TERMS.md
            if file == 'TERMS.md':
                continue
            os.remove(os.path.join(images_path, file))
        except Exception as e:
            continue
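
The same cleanup can be expressed with `pathlib`, returning the names it removed so that the result can be inspected. This is a sketch under stated assumptions: `remove_non_images` is a hypothetical helper, it compares suffixes case-insensitively (the loop above is case-sensitive), and it skips sub-directories explicitly.

```python
import pathlib
import tempfile


def remove_non_images(folder, keep=("TERMS.md",), image_suffixes=(".jpg",)):
    """Delete files in `folder` that are neither images nor listed in `keep`.

    Hypothetical helper for illustration; returns the sorted names it removed.
    """
    removed = []
    for entry in pathlib.Path(folder).iterdir():
        if not entry.is_file():
            continue  # leave sub-directories alone
        if entry.suffix.lower() in image_suffixes or entry.name in keep:
            continue
        try:
            entry.unlink()
            removed.append(entry.name)
        except OSError as e:
            # Report the failure instead of silently ignoring it
            print(f"Could not remove {entry}: {e}")
    return sorted(removed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        base = pathlib.Path(folder)
        for name in ("image_0.jpg", "TERMS.md", "photos.tsv000"):
            (base / name).write_text("x")
        print(remove_non_images(folder))
```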

## get_all_images
This function returns a list of full paths to all the images with .png or .jpg extensions under the given path, including subdirectories. If an error occurs while fetching images, the function prints the error message and returns an empty list.

### Args
- path (str): The path to the directory containing the images.

### Returns
- list: A list of full path to all the images with .png or .jpg extensions.
- empty list: An empty list if an error occurred while fetching images.

In [20]:
def get_all_images(path):
    """Get all images from the given path.

    Args:
    param: path (str): path to the directory containing the images.

    Returns:
    - list: a list of full path to all the images with png or jpg extensions.
    - empty list: an empty list if an error occurred while fetching images.
    """
    try:
        # use os.walk to traverse all the subdirectories and get all images
        return [os.path.join(root, name)
                for root, dirs, files in os.walk(path)
                for name in files
                if name.endswith((".png", ".jpg"))]
    except Exception as e:
        # return an empty list and log the error message if an error occurred
        print(f"An error occurred while fetching images: {e}")
        return []
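
A `pathlib` variant of the same traversal is sketched below. `find_images` is a hypothetical helper for illustration. Unlike `get_all_images`, it compares suffixes case-insensitively and returns the paths sorted, both of which are assumptions about what a caller might want.

```python
import pathlib
import tempfile


def find_images(path, suffixes=(".png", ".jpg")):
    """Return sorted full paths of files under `path` whose suffix is in `suffixes`.

    Hypothetical variant for illustration; the suffix comparison ignores case.
    """
    root = pathlib.Path(path)
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        base = pathlib.Path(folder)
        (base / "sub").mkdir()
        (base / "a.jpg").write_bytes(b"")
        (base / "sub" / "b.PNG").write_bytes(b"")
        (base / "notes.txt").write_text("x")
        print([pathlib.Path(p).name for p in find_images(folder)])
```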

The `get_all_metadata` coroutine function extracts metadata from all images in a directory and saves it in CSV format. The function takes one parameter, `images_path`, the path to the directory where the images are stored; the output location comes from the module-level `metadata_path` variable, and the binary location from `include_path`.

The function starts by executing the binary `exifextract` from the `include_path` and passing `images_path` and `metadata_path/metadata.csv` as arguments. The function then waits for the process to terminate and checks if the process terminated successfully. If the process is not successful, a `subprocess.CalledProcessError` is raised.

Once the metadata has been extracted, the function opens the `metadata.csv` file and loads the metadata using a `csv.reader` object. The metadata is stored in a list and the first row of the list is treated as the header. The metadata is then processed and stored in a dictionary where each key represents the index of a metadata item and the value is another dictionary containing the metadata information. The `filename` is also added to the metadata for each item.

Finally, the metadata dictionary is converted to a pandas dataframe, and the dataframe is saved to a `metadata.csv` file in the `metadata_path` directory.


# Note: the `exifextract` binary must be built from https://github.com/TeissierYannis/cpe-bigdata-project-cpp-dependencies and placed in the include folder

In [21]:
async def get_all_metadata(images_path):
    """
    This coroutine extracts metadata from all images in a directory and saves the metadata information as a CSV file.

    Parameters:
    images_path (str): The path to the directory where the images are stored.

    The output is written to metadata.csv in the module-level metadata_path directory.

    Returns:
    None
    """
    import subprocess
    # Use the binary exifextract from include path
    # (build it from https://github.com/TeissierYannis/cpe-bigdata-project-cpp-dependencies)
    binary = include_path + '/exifextract'
    command = [binary, images_path, metadata_path + '/metadata.csv']
    # Execute the command and wait for the process to terminate;
    # check=True raises subprocess.CalledProcessError on a non-zero return code
    subprocess.run(command, stdout=subprocess.PIPE, check=True)

    # load metadata from csv
    with open(metadata_path + '/metadata.csv', 'r') as f:
        reader = csv.reader(f)
        metadata = list(reader)
        header = metadata[0]

    metadata = metadata[1:]
    metadata_dict = {}
    for i, row in enumerate(metadata):
        metadata_dict[i] = {}
        for j in range(1, len(header)):
            metadata_dict[i][header[j]] = row[j]
        # add filename to metadata
        # remove ' from row[0]
        row[0] = row[0].replace("'", '')
        metadata_dict[i]['filename'] = row[0]

    # convert dict to dataframe
    metadata_df = pd.DataFrame.from_dict(metadata_dict, orient='index')
    # save metadata to csv
    metadata_df.to_csv(metadata_path + '/metadata.csv', index=False)
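
Although `get_all_metadata` is declared as a coroutine, the external process is run with blocking calls. If non-blocking execution is wanted, `asyncio.create_subprocess_exec` is one option. The sketch below is an illustration only: `run_tool` is a hypothetical helper, and the demo launches the Python interpreter as a stand-in because the `exifextract` binary may not be available.

```python
import asyncio
import subprocess
import sys


async def run_tool(*command):
    """Run `command` without blocking the event loop and return its stdout bytes.

    Hypothetical helper for illustration. Raises subprocess.CalledProcessError
    when the process exits with a non-zero return code.
    """
    process = await asyncio.create_subprocess_exec(
        *command, stdout=asyncio.subprocess.PIPE
    )
    # communicate() reads stdout to the end and waits for the process to exit
    stdout, _ = await process.communicate()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, command)
    return stdout


if __name__ == "__main__":
    # Stand-in command: the Python interpreter instead of the exifextract binary
    print(asyncio.run(run_tool(sys.executable, "-c", "print('ok')")))
```

A missing binary still surfaces as `FileNotFoundError` from the launch call, so checking that the executable exists beforehand gives a clearer message.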

In [22]:
asyncio.run(get_all_metadata(images_path))

FileNotFoundError: [WinError 2] The system cannot find the file specified

# Read metadata from CSV file

In [23]:
read_metadata = pd.read_csv(metadata_path + '/metadata.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../output\\metadata/metadata.csv'

In [None]:
read_metadata.head()