# Imports
The code imports several libraries to work correctly. The libraries are as follows:

- tqdm is a library that provides a progress bar for loops and iteration, which is helpful for keeping track of the progress of long-running processes.
- PIL (Python Imaging Library) is a library used for opening, manipulating, and saving image files.
- The pickle library allows the user to serialize and deserialize Python objects, meaning that the user can save a Python object to a file and then load it back later.
- The os library provides a way of using operating system dependent functionality like reading or writing to the file system.
- The zipfile library provides functionality to create, read, write, append, and list a ZIP file.
- requests library is used for sending HTTP requests to web servers and downloading content.
- functools is a module that implements higher-order functions. Higher-order functions are functions that take other functions as inputs, or that return functions as output.
- pathlib is a library for working with file paths, which is part of the Python Standard Library starting from Python 3.4.
- Finally, tqdm.auto automatically enables or disables the progress bar depending on the context of use, so that it only shows when the output is connected to a terminal or not.

In [2]:
!pip install nest_asyncio requests pathlib tqdm json sqlite3 pandas PIL

[31mERROR: Could not find a version that satisfies the requirement pickle (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pickle[0m[31m
[0m

In [3]:
import pickle
import os
import zipfile
import requests
import functools
import pathlib
from tqdm import tqdm
import json
import sqlite3
import pandas as pd
from PIL import Image
from PIL.ExifTags import TAGS
import nest_asyncio

# Settings base variables and paths
For this project, we used the unsplash dataset, which is a large-scale image dataset. The dataset contains over 25,000 images.
The code sets the base variables and paths for the project. The variables are as follows:

In [4]:
# Set the base folder path for the project
output_path = "../output"
images_path = os.path.join(output_path, "images")
metadata_path = os.path.join(output_path, "metadata")
config_path = os.path.join(output_path, "config")

list_of_paths = [output_path, images_path, metadata_path, config_path]

# Set the base URL for the dataset
dataset_url = "https://unsplash-datasets.s3.amazonaws.com/lite/latest/unsplash-research-dataset-lite-latest.zip"
# metadata mode (used to save metadata)
metadata_mode = "sqlite"

# Create folder structure
The code creates the folder structure for the project. The folder structure is as follows:
- output
    - images
    - metadata
    - config

This method creates a folder with the given path if it doesn't already exist, It also outputs a message to inform the user if the folder was created or if it already exists.
This is useful for organizing and managing files in a project. By creating a folder to store data and resources, it keeps the working directory tidy and makes it easier to locate files. Additionally, by checking if the folder exists before creating it, it prevents the program from overwriting existing data or throwing an error.

In [5]:
def create_folder(path):
    """
    This function creates a folder at the specified path.
    If the folder already exists, it will print a message saying so.
    If there is an error creating the folder, it will print the error message.

    Parameters:
        :param path (str): The path of the folder to be created.

    Returns:
    None
    """
    try:
        # Use os.mkdir to create the folder at the specified path
        os.mkdir(path)
        print(f"Folder {path} created")
    except FileExistsError:
        # If the folder already exists, print a message saying so
        print(f"Folder {path} already exists")
    except Exception as e:
        # If there is an error creating the folder, print the error message
        print(f"Error creating folder {path}: {e}")

# Create the folder structure
This method initializes a list of folders by calling the create_folder method for each folder in the list.
The purpose of this method is to make sure that all necessary folders exist before the program continues its execution.
If a folder does not exist, the create_folder method will create it. If a folder already exists, the method will simply print a message indicating that the folder already exists. In case of any other error, the method will print the error message.

In [6]:
def init_folder(folder_names: list):
    for folder_name in folder_names:
        create_folder(folder_name)

In [7]:
init_folder(list_of_paths)

Folder ../output already exists
Folder ../output/images already exists
Folder ../output/metadata already exists
Folder ../output/config already exists


# Define methods for downloading the dataset
The following code block is a method to download a file from a given URL and save it to a specified filename.

The method starts by creating a session (s = requests.Session()) and then mounting it to the URL (s.mount(url, requests.adapters.HTTPAdapter(max_retries=3))). This sets the maximum number of retries to 3 if the connection to the URL fails.

Then, the method makes a GET request to the URL (r = s.get(url, stream=True, allow_redirects=True)) and checks if it returns a successful response (r.raise_for_status()). If there was an HTTP error during the request, the error message is printed (print(f"HTTP error occurred while downloading dataset: {e}")).

The method also checks the file size specified in the response headers and assigns it to the variable file_size (file_size = int(r.headers.get('Content-Length', 0))). If the file size is 0, a default value of "(Unknown total file size)" is assigned to the variable desc; otherwise, the variable desc is left empty.

Next, the method resolves the file path and creates a directory if it doesn't already exist (path.parent.mkdir(parents=True, exist_ok=True)). The method then creates a tqdm progress bar to show the download progress (with tqdm.tqdm(total=file_size, unit='B', unit_scale=True, desc=desc) as pbar:).

Finally, the method writes the contents of the file to disk in chunks (for chunk in r.iter_content(chunk_size=1024):), updating the progress bar for each chunk that is written to disk (pbar.update(len(chunk))). If an error occurred during the download, a message with the error is printed (print(f"Error occurred while downloading dataset: {e}")). The file path is returned when the method is finished.

In [7]:
def download(url, filename):
    """
    This download a file from a given URL and save it to a specified filename.

    Parameters:
        :param url (str): The URL of the file to be downloaded.
        :param filename (str): The filename to save the file as.

    Returns:
    path (str): The path of the downloaded file.
    """
    try:
        # Create a session object to persist the state of connection
        s = requests.Session()
        # Retry connecting to the URL up to 3 times
        s.mount(url, requests.adapters.HTTPAdapter(max_retries=3))
        # Send a GET request to the URL to start the download
        r = s.get(url, stream=True, allow_redirects=True)
        # Raise an error if the response is not 200 OK
        r.raise_for_status()
        # Get the file size from the Content-Length header, default to 0 if not present
        file_size = int(r.headers.get('Content-Length', 0))
        # Get the absolute path to the target file
        path = pathlib.Path(filename).expanduser().resolve()
        # Create parent directories if they don't exist
        path.parent.mkdir(parents=True, exist_ok=True)
        # Set the description to display while downloading, "(Unknown total file size)" if file size is 0
        desc = "(Unknown total file size)" if file_size == 0 else ""
        # Enable decoding the response content
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        # Use tqdm to display the download progress
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=desc) as pbar:
            # Open the target file in binary write mode
            with path.open("wb") as f:
                # Write each chunk of data from the response to the file
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
                    pbar.update(len(chunk))
        # Return the path to the downloaded file
        return path
    # Handle HTTP error if the response is not 200 OK
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred while downloading dataset: {e}")
    # Handle any other exceptions that might occur while downloading the file
    except Exception as e:
        print(f"Error occurred while downloading dataset: {e}")

# Download the dataset
The following code block downloads the dataset from the URL and saves it to the specified filename. The method also prints a message to inform the user that the download is complete.

In [8]:
def download_dataset(dataset_url, image_path):
    """
    Downloads the dataset from the given URL, unzips it, and stores the images in the specified image path.

    Args:
        :param dataset_url (str): URL of the dataset to be downloaded
        :param image_path (str): Path to store the images after unzipping the dataset
    """
    # Check if the dataset has already been downloaded
    # Check if the archive.zip file exists or if the images folder is empty
    if not os.path.exists('archive.zip'):
        # Download the dataset from the given url
        download(dataset_url, 'archive.zip')
        print("Dataset downloaded!")
        try:
            # Extract the contents of the archive.zip to the specified image path
            with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
                zip_ref.extractall(image_path)
            print("Dataset unzipped")
        except Exception as e:
            print(f"Error occurred while unzipping dataset: {e}")
        try:
            # Remove the archive.zip file
            os.remove('archive.zip')
            print("archive.zip removed")
        except Exception as e:
            print(f"Error occurred while removing archive.zip: {e}")

In [61]:
download_dataset(dataset_url, images_path)


  0%|          | 835k/632M [00:03<39:33, 266kB/s]   


KeyboardInterrupt: 

In [9]:
nest_asyncio.apply()

In [10]:
# Read photo.tsv file in images folder
photo_df = pd.read_csv(os.path.join(images_path, 'photos.tsv000'), sep='\t')
# read photo_image_url column and photo_id in index
photo_df = photo_df[['photo_id', 'photo_image_url']]
print(photo_df.head())

FileNotFoundError: [Errno 2] No such file or directory: '../output/images/photos.tsv000'

In [12]:
import asyncio
import aiohttp
import time


async def download_image(session, url, i, err_cnt=None):
    if err_cnt is None:
        err_cnt = 0
    try:
        async with session.get(url) as response:
            filename = os.path.join(images_path, "image_" + str(i) + ".jpg")
            with open(filename, 'wb') as f:
                f.write(await response.content.read())
            print(f"Downloaded {url} to {filename} idx: {i}")
    except aiohttp.ClientError as e:
        print(f"Error occurred while downloading {url}: {e}")
        if err_cnt == 10:
            return
        await asyncio.sleep(10)
        err_cnt += 1
        await download_image(session, url, i, err_cnt)


async def download_images(image_urls, images_ids):
    async with aiohttp.ClientSession() as session:
        tasks = []
        semaphore = asyncio.Semaphore(5000)
        # add index
        for i, url in enumerate(image_urls):
            try:
                await semaphore.acquire()
                url = url + "?w=1000&fm=jpg&fit=max"
                task = asyncio.ensure_future(download_image(session, url, images_ids[i]))
                task.add_done_callback(lambda x: semaphore.release())
                tasks.append(task)
            except Exception:
                print(f"Error occurred while downloading {url}")
                semaphore.release()
        # Wait for all tasks to complete
        await asyncio.wait(tasks)
        await asyncio.gather(*tasks)


image_urls = photo_df['photo_image_url'].values.tolist()
# img id is from 0 to size of the list
images_ids = [i for i in range(len(image_urls))]
# filter by looking if the image already exist in fact of the image_id is already in the folder
# Loop on the image_id and check if the image exist in the folder
image_urls = [url for url, image_id in zip(image_urls, images_ids) if
              not os.path.exists(os.path.join(images_path, "image_" + str(image_id) + ".jpg"))]
print(f"Number of images to download: {len(image_urls)}")

NameError: name 'photo_df' is not defined

In [20]:
# Split the list of image urls into chunks of max and add a timeout of 30 seconds
chunks = [image_urls[i:i + 5000] for i in range(0, len(image_urls), 5000)]
start_t = time.time()
loop = None
for i, chunk in enumerate(chunks):
    start = time.time()
    try:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(download_images(chunk, images_ids[i * 5000:(i + 1) * 5000]))
    except Exception as e:
        print(f"Error occurred while downloading chunk {i}: {e}")
    finally:
        loop.close()
        print(f"[Chunk {i}] Downloaded {len(chunk)} images in {time.time() - start} seconds")

print(f'Downloaded {len(image_urls)} images in {time.time() - start_t} seconds')

Exception ignored in: <coroutine object download_images at 0x17b5c2f40>
Traceback (most recent call last):
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/futures.py", line 82, in __init__
    self._callbacks = []
RuntimeError: coroutine ignored GeneratorExit
Task was destroyed but it is pending!
task: <Task pending name='Task-35621' coro=<download_image() running at /var/folders/3m/wyqs41z16v53xn4gssj1tz2h0000gn/T/ipykernel_1308/3998157083.py:10> wait_for=<Future pending cb=[Task.__wakeup()]> cb=[download_images.<locals>.<lambda>() at /var/folders/3m/wyqs41z16v53xn4gssj1tz2h0000gn/T/ipykernel_1308/3998157083.py:34]>
Task was destroyed but it is pending!
task: <Task pending name='Task-32972' coro=<download_image() running at /var/folders/3m/wyqs41z16v53xn4gssj1tz2h0000gn/T/ipykernel_1308/3998157083.py:10> cb=[download_images.<locals>.<lambda>() at /var/folders/3m/wyqs41z16v53xn4gssj1tz2h0000gn/T/ipykernel_1308/3

Downloaded https://images.unsplash.com/uploads/1411476843343e89a8f76/bda95c47?w=1000&fm=jpg&fit=max to ../output/images/image_5.jpg idx: 5
Downloaded https://images.unsplash.com/reserve/m6rT4MYFQ7CT8j9m2AEC_JakeGivens%20-%20Sunset%20in%20the%20Park.JPG?w=1000&fm=jpg&fit=max to ../output/images/image_6.jpg idx: 6
Downloaded https://images.unsplash.com/photo-1428550670225-15f007f6f1ba?w=1000&fm=jpg&fit=max to ../output/images/image_7.jpg idx: 7
Downloaded https://images.unsplash.com/photo-1429270958905-78f90e2dca75?w=1000&fm=jpg&fit=max to ../output/images/image_8.jpg idx: 8
Downloaded https://images.unsplash.com/photo-1430826032205-b84a31521ea6?w=1000&fm=jpg&fit=max to ../output/images/image_10.jpg idx: 10
Downloaded https://images.unsplash.com/photo-1433621611134-008713dc0321?w=1000&fm=jpg&fit=max to ../output/images/image_11.jpg idx: 11
Downloaded https://images.unsplash.com/photo-1445264760308-f80de105f42d?w=1000&fm=jpg&fit=max to ../output/images/image_13.jpg idx: 13
Downloaded http

In [22]:

# Remove all files except images
for file in os.listdir(images_path):
    if file.endswith('.jpg'):
        continue
    else:
        try:
            # Don't delete TERMS.md
            if file == 'TERMS.md':
                continue
            os.remove(os.path.join(images_path, file))
        except Exception as e:
            continue


# Define methods to get all the image paths
The get_all_images method is used to retrieve all images present in the specified image path. It uses the os.walk function to traverse through all subdirectories within the image path and collects the file names that end with either '.png' or '.jpg' extensions. The full path of each image is then generated by joining the root directory and the file name. The method returns a list of all images' full paths. In case of any error, an error message is printed and an empty list is returned.

In [11]:
def get_all_images(path):
    """Get all images from the given path.

    Args:
    param: image_path (str): path to the directory containing the images.

    Returns:
    - list: a list of full path to all the images with png or jpg extensions.
    - empty list: an empty list if an error occurred while fetching images.
    """
    try:
        # use os.walk to traverse all the subdirectories and get all images
        return [os.path.join(root, name)
                for root, dirs, files in os.walk(path)
                for name in files
                if name.endswith((".png", ".jpg"))]
    except Exception as e:
        # return an empty list and log the error message if an error occurred
        print(f"An error occurred while fetching images: {e}")
        return []

# Define methods to set a checkpoint system

## Method: create_checkpoint
This method is used to create a checkpoint file containing the latest processed image.

The method first tries to open a file named checkpoint.txt in write mode. Then it writes the latest_file into it. If any error occurs during this process, the error message is printed with the message "An error occurred while creating checkpoint: [error message]".

In [12]:
# Method to create a checkpoint with the latest file
def create_checkpoint(latest_file):
    """
   Creates a checkpoint file containing the latest processed file name.

   Parameters:
       :param latest_file (str): The name of the latest processed file.

   Returns:
       None
   """
    try:
        # Open a file in write mode
        with open('checkpoint.txt', 'w') as f:
            # Write the latest file to the checkpoint
            f.write(latest_file)
    except Exception as e:
        # Print error message
        print(f"An error occurred while creating checkpoint: {e}")

## Method: load_checkpoint
This method is used to load the checkpoint file to get the latest processed image.

The method first checks if the checkpoint file exists by verifying the existence of checkpoint.txt. If the checkpoint file exists, it opens the file in read mode, reads the content, and returns it. If the checkpoint file does not exist, it prints "Checkpoint not found" and returns None. If any error occurs during this process, the error message is printed with the message "An error occurred while loading checkpoint: [error message]".

In [13]:
# Method to load a checkpoint
def load_checkpoint():
    """
    Loads the checkpoint file if it exists.

    Returns:
        str: The name of the latest processed file, None if checkpoint file not found.
    """
    try:
        # Check if checkpoint exists
        if os.path.exists('checkpoint.txt'):
            # Open the checkpoint in read mode
            with open('checkpoint.txt', 'r') as f:
                # Return the contents of the checkpoint
                return f.read()
        else:
            # Print message if checkpoint not found
            print("Checkpoint not found")
            return None
    except Exception as e:
        # Print error message
        print(f"An error occurred while loading checkpoint: {e}")
        return None

## Method: remove_checkpoint
This method is used to remove the checkpoint file.

The method first checks if the checkpoint file exists by verifying the existence of checkpoint.txt. If the checkpoint file exists, it removes the file and prints "Checkpoint removed successfully". If the checkpoint file does not exist, it prints "Checkpoint not found". If any error occurs during this process, the error message is printed with the message "An error occurred while removing checkpoint: [error message]".

In [14]:
# Method to remove a checkpoint
def remove_checkpoint():
    """
    Removes the checkpoint file if it exists.

    Returns:
        None
    """
    try:
        # Check if checkpoint exists
        if os.path.exists('checkpoint.txt'):
            # Remove the checkpoint
            os.remove('checkpoint.txt')
            # Print success message
            print("Checkpoint removed successfully")
        else:
            # Print message if checkpoint not found
            print("Checkpoint not found")
    except Exception as e:
        # Print error message
        print(f"An error occurred while removing checkpoint: {e}")

# Define methods to set test dataset
The set_test_dataset method is used to limit the number of images in a given directory. The method takes two arguments, image_path and amount. The image_path argument specifies the directory containing the images, while the amount argument specifies the number of images that should be kept in the directory. The method works by looping through all the files in the directory and removing any file if the number of files exceeds the value specified by amount. If an error occurs during the execution of the method, an error message will be printed.

In [15]:
def set_test_dataset(image_path, amount=100):
    """
    This function removes all images from the given image_path except the first amount images.

    Parameters:
    :param image_path (str): The path to the image folder.
    :param amount (int, optional): The number of images to keep in the folder. Defaults to 100.

    Returns:
    None
    """
    try:

        # loop through the images in the directory using tqdm
        for root, dirs, files in os.walk(image_path, topdown=False):
            for name in tqdm(files, desc="Removing images"):
                # check if the number of images is greater than the amount specified
                if len(os.listdir(image_path)) > amount:
                    # remove the image if the number of images is greater than the specified amount
                    os.remove(os.path.join(root, name))
        # print message indicating that images were removed successfully
        print("All images removed except " + str(amount) + " images")
    except Exception as e:
        # print error message if there was an error during removal of images
        print(f"An error occurred while setting test dataset: {e}")

# Define methods to process dataset
The arrange_dataset method performs several tasks to clean up and organize a directory containing images. The steps are as follows:

- Get a list of all the image files in the directory using the get_all_images method.

- Load the latest checkpoint using the load_checkpoint method, which is used to keep track of the last image file processed.

- For each image file in the list, the method checks if it is the same as the latest checkpoint. If it is, it skips it and sets the checkpoint to None. If the checkpoint is not None, the method skips it. If the checkpoint is None, it moves the file to the image_path directory using the os.rename method. It also creates a checkpoint using the create_checkpoint method to keep track of the last image processed.

- After processing all the image files, the method removes the checkpoint using the remove_checkpoint method.

- The method then removes all subdirectories in the image_path directory using os.rmdir.

- If the is_test argument is set to True, the method calls the set_test_dataset method to remove all images except for a certain number specified by the amount argument (defaults to 100).

The method includes a try-except block to catch any errors that may occur while processing the images. If an error occurs, it prints a message indicating that an error occurred while arranging the dataset.


In [16]:
def arrange_dataset(image_path, is_test=False):
    """
    Arrange the dataset stored in `image_path`.

    :param image_path: path to the dataset folder.
    :param is_test: If True, the dataset will be set to a test set with only 100 images.
    """
    try:
        # Get a list of all images in the path
        img_files = get_all_images(image_path)
        # Load the last checkpoint if it exists
        checkpoint = load_checkpoint()
        # Iterate over all image files
        for file in tqdm(img_files, desc="Moving all file to images folder"):
            # Check if the current file matches the checkpoint
            if checkpoint == file:
                # If it does, reset the checkpoint
                checkpoint = None
                continue
            # If the checkpoint is not None, skip this file
            elif checkpoint is not None:
                continue
            # If neither of the above conditions are met, move the file
            else:
                os.rename(file, os.path.join(image_path, os.path.basename(file)))
                # Create a new checkpoint after moving the file
                create_checkpoint(file)
        # Print a message indicating that all files have been moved
        print("All files moved to images folder")
        # Remove the checkpoint since all files have been moved
        remove_checkpoint()

        # Remove all subfolders in the image path
        for root, dirs, files in os.walk(image_path, topdown=False):
            for name in dirs:
                os.rmdir(os.path.join(root, name))
        # Print a message indicating that all subfolders have been removed
        print("All subfolders removed")

        print(is_test)
        # If is_test is True, set the test dataset
        if is_test:
            print("Setting test dataset 13&3" + image_path)
            set_test_dataset(image_path)
            print("Test dataset set successfully")
    # Catch any exceptions that may occur
    except Exception as e:
        print("An error occurred while arranging the dataset: ", e)

In [31]:
arrange_dataset(images_path, False)

Checkpoint not found


Moving all file to images folder: 100%|██████████| 24980/24980 [00:02<00:00, 11293.59it/s]


All files moved to images folder
Checkpoint removed successfully
All subfolders removed


# Define methods to get metadata
The method get_metadata is used to extract metadata information from an image file and return it in a dictionary format. The method takes two parameters: img_file and image_path. img_file is the file name of the image and image_path is the path to the directory where the image is stored.

The method uses the Image module from the Python Imaging Library (PIL) to open the image file. Then it gets the EXIF data from the image and stores it in a dictionary along with other metadata information such as the file name, size, height, width, format, and mode of the image.

If an error occurs while processing the image file, the method will print an error message and return None. Otherwise, the method returns the metadata information in the form of a dictionary.


In [17]:
import subprocess

# Define path to image file
image_path = "../output/images/image_0.jpg"

# Define exiftool command
exiftool_cmd = ["exiftool", image_path]

# Execute exiftool command and capture output
output = subprocess.check_output(exiftool_cmd)

# Convert output to string and split into lines
output_str = output.decode("utf-8")
output_lines = output_str.strip().split("\n")

# Create dictionary to store Exif data
exif_data = {}

# Parse Exif data from output lines
for line in output_lines:
    # Split line into tag and value
    try:
        tag, value = line.split(": ")
    except ValueError:
        continue
    # Strip whitespace from tag and value
    tag = tag.strip()
    value = value.strip()
    # Add tag and value to Exif data dictionary
    exif_data[tag] = value

# Print Exif data
print(exif_data)


{'ExifTool Version Number': '12.50', 'File Name': 'image_0.jpg', 'Directory': '../output/images', 'File Size': '229 kB', 'File Modification Date/Time': '2023:02:28 17:25:01+01:00', 'File Access Date/Time': '2023:02:28 17:25:01+01:00', 'File Inode Change Date/Time': '2023:02:28 17:25:01+01:00', 'File Permissions': '-rw-r--r--', 'File Type': 'JPEG', 'File Type Extension': 'jpg', 'MIME Type': 'image/jpeg', 'JFIF Version': '1.01', 'Resolution Unit': 'inches', 'X Resolution': '72', 'Y Resolution': '72', 'Profile CMM Type': 'Linotronic', 'Profile Version': '2.1.0', 'Profile Class': 'Display Device Profile', 'Color Space Data': 'RGB', 'Profile Connection Space': 'XYZ', 'Profile Date Time': '1998:02:09 06:49:00', 'Profile File Signature': 'acsp', 'Primary Platform': 'Microsoft Corporation', 'CMM Flags': 'Not Embedded, Independent', 'Device Manufacturer': 'Hewlett-Packard', 'Device Model': 'sRGB', 'Device Attributes': 'Reflective, Glossy, Positive, Color', 'Rendering Intent': 'Perceptual', 'Con

In [36]:
import subprocess


def get_metadata(img_file):
    """
    This function extracts metadata information from an image file and returns it in a dictionary format.

    Parameters:
    img_file (str): The file name of the image.
    image_path (str): The path to the directory where the image is stored.

    Returns:
    dict: A dictionary containing the metadata information of the image. If an error occurs, the function returns None.
    """
    try:
        # Define exiftool command
        exiftool_cmd = ["exiftool", img_file]

        # Execute exiftool command and capture output
        output = subprocess.check_output(exiftool_cmd)

        # Convert output to string and split into lines
        output_str = output.decode("utf-8")
        output_lines = output_str.strip().split("\n")

        # Create dictionary to store Exif data
        metadata = {}

        # Parse Exif data from output lines
        for line in output_lines:
            # Split line into tag and value
            try:
                tag, value = line.split(": ")
            except ValueError:
                continue
            # Strip whitespace from tag and value
            tag = tag.strip()
            value = value.strip()
            # Add tag and value to Exif data dictionary
            metadata[tag] = value

        # get the image width and height
        width, height = Image.open(img_file).size

        # add the image width and height to the metadata dictionary
        metadata['width'] = width
        metadata['height'] = height

    except Exception as e:
        # print an error message if an error occurs
        print(f"An error occurred while processing {img_file}: {str(e)}")
        return None

    # return the metadata information
    return metadata



# Define the method to save the metadata
The method save_metadata is used to save the metadata information of an image in either pickle or json format. The method takes in 4 arguments:

- metadata (dict): The metadata information of an image.
- img_name (str): The file name of the image.
- metadata_path (str): The path to the directory where the metadata will be saved.
- save_format (str): The format in which the metadata will be saved. The default is 'pickle'.

The method first checks if the save_format argument is equal to 'pickle'. If it is, it uses the pickle module to save the metadata information in pickle format. The file name of the saved metadata is created by joining the metadata_path with the base name of the image file (obtained using os.path.basename) and the '.pickle' extension.

If the save_format argument is equal to 'json', the method uses the json module to save the metadata information in json format. The file name of the saved metadata is created by joining the metadata_path with the base name of the image file (obtained using os.path.basename) and the '.json' extension.

If the save_format argument is neither 'pickle' nor 'json', the method raises a ValueError with the message "Invalid save format".

The method uses a try-except block to catch any exceptions that may occur during the saving of the metadata. If an error occurs, it prints an error message that includes the image file name and the error message.


In [43]:
def save_metadata(metadata, img_name, metadata_path, save_format='pickle'):
    """
    This function saves the metadata information of an image in either pickle or json format.
    Parameters:
    metadata (dict): The metadata information of an image.
    img_name (str): The file name of the image.
    metadata_path (str): The path to the directory where the metadata will be saved.
    save_format (str): The format in which the metadata will be saved. The default is 'pickle'.

    Returns:
    None
    """
    try:
        if save_format == 'sqlite':
            # Get only the file name of the image
            img_name = os.path.basename(img_name)
            # Open a connection to the database
            conn = sqlite3.connect(os.path.join(metadata_path, 'metadata.db'))
            # Create a cursor
            c = conn.cursor()
            # Create a table if it doesn't exist : filename, key, value
            c.execute('''CREATE TABLE IF NOT EXISTS metadata (filename text, key text, value text)''')
            # Insert the metadata into the table
            for key, value in metadata.items():
                # Convert key, value to string
                key = str(key)
                value = str(value)
                # Check if the key is already in the table
                c.execute("SELECT * FROM metadata WHERE filename=? AND key=?", (img_name, key))
                # If the key is already in the table, update the value
                if c.fetchone():
                    c.execute("UPDATE metadata SET value=? WHERE filename=? AND key=?", (value, img_name, key))
                    # Commit the changes
                    conn.commit()
                # If the key is not in the table, insert the key, value pair
                else:
                    c.execute("INSERT INTO metadata VALUES (?, ?, ?)", (img_name, key, value))
                    # Commit the changes
                    conn.commit()
            # Close the connection
            conn.close()
        else:
            raise ValueError("Invalid save format")
    except Exception as e:
        # print an error message if an error occurs
        print(f"An error occurred while saving metadata for {img_name}: {str(e)}")


# Get all the metadata
The method get_all_metadata is used to extract metadata information from all images in a directory and save the metadata information in either pickle or json format. The method takes in 3 arguments:
- image_path (str): The path to the directory where the images are stored.
- metadata_path (str): The path to the directory where the metadata will be saved.
- save_format (str): The format in which the metadata will be saved. The default is 'pickle'.

In [22]:
def get_all_metadata(image_path, metadata_path, save_format='pickle'):
    """
    This function extracts metadata from all images in a directory and saves the metadata information in either pickle or json format.
    Parameters:
    image_path (str): The path to the directory where the images are stored.
    metadata_path (str): The path to the directory where the metadata will be saved.

    Returns:
    None
    """
    # Get a list of all images in the directory
    img_files = get_all_images(image_path)
    # Create a progress bar to track the progress of processing all images
    checkpoint = load_checkpoint()
    for img in tqdm(img_files, desc="Get all metadata of the images and save it"):
        # Check if the metadata extraction process was interrupted previously
        if checkpoint == img:
            # If the checkpoint is found, set it to None and continue processing the remaining images
            checkpoint = None
            continue
        elif checkpoint is not None:
            # If the checkpoint is not None and not equal to the current image, continue to the next image
            continue
        else:
            # Extract the metadata of the current image
            metadata = get_metadata(img)
            if metadata:
                # Save the metadata of the current image
                save_metadata(metadata, img, metadata_path, save_format)
                # Create a checkpoint to track the progress of the metadata extraction process
                create_checkpoint(img)

    # Remove the checkpoint file once all metadata have been extracted and saved
    remove_checkpoint()


In [43]:
get_all_metadata(images_path, metadata_path, save_format=metadata_mode)

Get all metadata of the images and save it:  89%|████████▉ | 22212/24997 [00:05<00:00, 4188.51it/s]


KeyboardInterrupt: 

# How to look at the metadata (pickle format)
This is the way to look at the metadata information of an image in pickle format.

In [None]:
# Get the first file of metadata directory
metadata_file = os.listdir(metadata_path)[0]
# Load the metadata
with open(os.path.join(metadata_path, metadata_file), 'rb') as f:
    metadata = pickle.load(f)
# Print the metadata
print(metadata)

# How to look at the metadata (json format)
This is the way to look at the metadata information of an image in json format.

In [None]:
# Get the first file of metadata directory
metadata_file = os.listdir(metadata_path)[0]
# Load the metadata
with open(os.path.join(metadata_path, metadata_file), 'r') as f:
    metadata = json.load(f)
# Print the metadata
print(metadata)

# How to look at the metadata (sqlite format)
This is the way to look at the metadata information of an image in sqlite database format.

In [56]:
# Open a connection to the database
conn = sqlite3.connect(os.path.join(metadata_path, 'metadata.db'))
# Create a cursor
c = conn.cursor()
# Get a name of the first file in the images directory
metadata_file = os.listdir(images_path)[0]
# Get the metadata of the first file
c.execute("SELECT * FROM metadata WHERE filename=?", (metadata_file,))
# Print the metadata
#Convert result to format: filename: [{key: value}...]
metadata = c.fetchall()
result = {}
for row in metadata:
    if row[0] not in result:
        result[row[0]] = []
    result[row[0]].append({row[1]: row[2]})

print(result)

# Close the connection
conn.close()

image_TCfCXuhgG4c.jpg
{'image_TCfCXuhgG4c.jpg': [{'ResolutionUnit': '2'}, {'ExifOffset': '214'}, {'Make': 'Canon'}, {'Model': 'Canon EOS Rebel T6'}, {'Software': 'Adobe Photoshop Camera Raw 10.5 (Windows)'}, {'DateTime': '2018:12:24 14:33:30'}, {'XResolution': '72.0'}, {'YResolution': '72.0'}, {'ExifVersion': "b'0230'"}, {'ShutterSpeedValue': '5.643856'}, {'ApertureValue': '4.33985'}, {'DateTimeOriginal': '2018:12:22 02:45:53'}, {'DateTimeDigitized': '2018:12:22 02:45:53'}, {'ExposureBiasValue': '0.0'}, {'MaxApertureValue': '4.25'}, {'MeteringMode': '5'}, {'ColorSpace': '65535'}, {'Flash': '16'}, {'FocalLength': '33.0'}, {'ExposureMode': '0'}, {'WhiteBalance': '0'}, {'SceneCaptureType': '0'}, {'FocalPlaneXResolution': '5728.176795580111'}, {'FocalPlaneYResolution': '5808.403361344538'}, {'FocalPlaneResolutionUnit': '2'}, {'SubsecTimeOriginal': '00'}, {'SubsecTimeDigitized': '00'}, {'ExposureTime': '0.02'}, {'FNumber': '4.5'}, {'ExposureProgram': '0'}, {'CustomRendered': '0'}, {'ISOSpee

In [None]:
# This cell is dedicated to improving the gathering of metadata using multiprocessing and multithreading techniques.


In [64]:
import asyncio
import subprocess
from PIL import Image
from tqdm.asyncio import tqdm_asyncio

async def get_metadata(img_file, sem):
    """
    This coroutine extracts metadata information from an image file and returns it in a dictionary format.

    Parameters:
    img_file (str): The file name of the image.
    sem (Semaphore): A Semaphore object to limit the number of concurrent calls.

    Returns:
    dict: A dictionary containing the metadata information of the image. If an error occurs, the coroutine returns None.
    """
    try:
        # Acquire the semaphore
        await sem.acquire()

        # Define exiftool command
        exiftool_cmd = ["exiftool", img_file]

        # Execute exiftool command and capture output
        process = await asyncio.create_subprocess_exec(*exiftool_cmd, stdout=asyncio.subprocess.PIPE)
        output, _ = await process.communicate()

        # Convert output to string and split into lines
        output_str = output.decode("utf-8")
        output_lines = output_str.strip().split("\n")

        # Create dictionary to store Exif data
        metadata = {}

        # Parse Exif data from output lines
        for line in output_lines:
            # Split line into tag and value
            try:
                tag, value = line.split(": ")
            except ValueError:
                continue
            # Strip whitespace from tag and value
            tag = tag.strip()
            value = value.strip()
            # Add tag and value to Exif data dictionary
            metadata[tag] = value

        # get the image width and height
        with Image.open(img_file) as img:
            width, height = img.size

        # add the image width and height to the metadata dictionary
        metadata['width'] = width
        metadata['height'] = height

    except Exception as e:
        # print an error message if an error occurs
        print(f"An error occurred while processing {img_file}: {str(e)}")
        return None

    finally:
        # Release the semaphore
        sem.release()

    # return the metadata information
    return metadata


def gen_sql_requests(metadatas):
    """
    This function generates a list of SQL requests to insert metadata into a database.

    Parameters:
    metadatas (list): A list of dictionaries containing the metadata information of the images.

    Returns:
    list: A list of SQL requests to insert metadata into a database.
    """
    # Create a list to store SQL requests
    sql_requests = []

    # Loop over all metadata
    for metadata in tqdm(metadatas, desc="Generating SQL requests"):
        # Get the filename of the image
        filename = metadata['File Name']

        # Loop over all metadata items
        for key, value in metadata.items():
            # Create SQL request to insert metadata into database
            sql_request = f"INSERT INTO metadata VALUES ('{filename}', '{key}', '{value}')"
            # Add SQL request to list
            sql_requests.append(sql_request)

    # Return the list of SQL requests
    return sql_requests

async def get_all_metadata(image_path, metadata_path, save_format='pickle'):
    """
    This coroutine extracts metadata from all images in a directory and saves the metadata information in either pickle or json format.

    Parameters:
    image_path (str): The path to the directory where the images are stored.
    metadata_path (str): The path to the directory where the metadata will be saved.

    Returns:
    None
    """
    # Get a list of all images in the directory
    img_files = get_all_images(image_path)
    metadatas = []

    # Create a semaphore to limit the number of simultaneous coroutines to 100
    semaphore = asyncio.Semaphore(5000)

    # Create a progress bar to track the progress of processing all images
    with tqdm_asyncio(total=len(img_files), desc="(Aprox : 5min) Get all metadata of the images and save it") as progress:
        # Create a list of coroutines to extract metadata for all images
        coroutines = [get_metadata(img, semaphore) for img in img_files]

        # Execute the coroutines concurrently with a maximum of 100 coroutines at a time
        for coroutine in asyncio.as_completed(coroutines):
            metadata = await coroutine
            if metadata:
                metadatas.append(metadata)
            # Update the progress bar
            progress.update(1)

    queries = gen_sql_requests(metadatas)

    def execute_query(query):
        conn = sqlite3.connect(os.path.join(metadata_path, 'metadata.db'))
        # Insert the metadata into the database
        conn.execute(query)
        # Commit the changes
        conn.commit()
        # Close the connection
        conn.close()

    for query in tqdm(queries, desc="Inserting metadata into the database"):
        execute_query(query)

In [65]:
asyncio.run(get_all_metadata(images_path, metadata_path, save_format=metadata_mode))

(Aprox : 5min) Get all metadata of the images and save it: 100%|██████████| 24975/24975 [05:43<00:00, 72.63it/s] 
Generating SQL requests: 100%|██████████| 24975/24975 [00:00<00:00, 48239.61it/s]
Inserting metadata into the database: 100%|██████████| 1550177/1550177 [21:54<00:00, 1179.27it/s]
