# Download processing

This notebook is the first step of the workflow: it downloads a dataset from IIIF manifests, so that you can then follow the rest of the workflow.

**Requirements**
This workflow takes as input a CSV file with two columns that lists the manifest URLs.
The column headers MUST be:
- 'Manifest_URL': the URL of the manifest,
- 'Image_basename': the name to give to the folder that will contain the downloaded images, also used as the base name of the image files.

**Warning 1**
The entire workflow is built on CSV files that use '**;**' as the separator.
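
The requirements and the separator warning above can be checked before any download starts. The following is a minimal sketch, not part of the original notebook: the helper name `read_manifest_list` is hypothetical, and it uses the standard `csv` module rather than pandas only to stay dependency-free.

```python
import csv
import io

# Column headers required by the workflow
REQUIRED_COLUMNS = {'Manifest_URL', 'Image_basename'}

def read_manifest_list(csv_file):
    """Read a ';'-separated manifest list and check its headers (hypothetical helper)."""
    reader = csv.DictReader(csv_file, delimiter=';')
    headers = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - headers
    if missing:
        raise ValueError(f"Missing column(s): {sorted(missing)}; found: {sorted(headers)}")
    return list(reader)

# In-memory stand-in for the CSV file, using a sample row from this notebook
sample = io.StringIO(
    "Image_basename;Manifest_URL\n"
    "MS Typ 1006;https://fragmentarium.ms/metadata/iiif/F-y5pq/manifest.json\n"
)
rows = read_manifest_list(sample)
print(rows[0]['Image_basename'])
```

A file saved with commas instead of semicolons is read as a single column, so the check raises a `ValueError` instead of failing later with a less obvious `KeyError`.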

**Warning 2**
The naming of 'Image_basename' is free. It is not constrained by the URL of the manifest or any other embedded metadata. It is specific to the project. This is the identifier that will be used during the entire workflow. The manifest copy will also be named according to 'Image_basename'.
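
To make the role of 'Image_basename' concrete, the sketch below rebuilds the paths that the functions in this notebook derive from it. The helper `expected_paths` and the folder name `dataset` are illustrative assumptions; the naming patterns themselves are taken from the code further down.

```python
import os

def expected_paths(dataset_folder, image_basename, image_number):
    """Return the paths the workflow derives from 'Image_basename' (hypothetical helper)."""
    folder = os.path.join(dataset_folder, image_basename)
    return {
        'folder': folder,
        'manifest': os.path.join(folder, image_basename + '_manifest.json'),
        'image': os.path.join(folder, f'{image_basename}_{image_number}.jpg'),
        'csv': os.path.join(folder, image_basename + '_image_data.csv'),
    }

# 'dataset' is a placeholder folder; 'MS Typ 1006' comes from the sample CSV
paths = expected_paths('dataset', 'MS Typ 1006', 1)
for kind, path in paths.items():
    print(f'{kind}: {path}')
```

Because the same string is used for the folder and for every file name, it should only contain characters that are valid in file names on the target system.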



## Environment

In [None]:
import os
import pandas as pd
import requests
import shutil
import time
import json
import cv2
from PIL import Image

## Functions

### Download images from URL

In [None]:
def download_image_from_url(image_url, dir_path, request_pause):
    """
    Download an image from a URL ('image_url') and save it to the file path 'dir_path'.
    After a successful download, or after a 429 (rate limit) response, the function pauses
    for 'request_pause' seconds before returning. A 429 response is not retried.
    """
    print(f'Downloading image from {image_url}')

    r = requests.get(image_url, stream=True)
    print(r.status_code)
    # Check if image was retrieved successfully
    if r.status_code == 200:
        r.raw.decode_content = True

        with open(dir_path, 'wb') as image_file:
            shutil.copyfileobj(r.raw, image_file)
        time.sleep(request_pause)
    
    elif r.status_code == 429:
        print(f"Error {r.status_code}. Rate limit exceeded. Pausing for {request_pause} seconds.")
        time.sleep(request_pause)
    else:
        print(f"Error {r.status_code}")
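
The function above pauses on a 429 response but does not request the image again. One possible way to add a retry is sketched here. It is an illustration only: `fetch_with_retry`, `max_attempts` and the growing pause are assumptions, not part of the notebook. The `fetch` argument is any callable that returns an object with a `status_code` attribute, such as `requests.get`; a fake response stands in for the network so the example runs offline.

```python
import time

def fetch_with_retry(fetch, url, request_pause, max_attempts=3, sleep=time.sleep):
    """
    Call fetch(url) until the status code is no longer 429, pausing between attempts.
    Returns the last response received (hypothetical helper).
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts:
            # Wait longer after each refusal: request_pause, 2 * request_pause, ...
            sleep(request_pause * attempt)
    return response

class FakeResponse:
    """Stand-in for a requests.Response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate a server that refuses twice, then accepts
statuses = iter([429, 429, 200])
pauses = []  # records the pauses instead of actually sleeping
result = fetch_with_retry(
    lambda url: FakeResponse(next(statuses)),
    'https://example.org/image.jpg',  # placeholder URL
    3,
    sleep=pauses.append,
)
print(result.status_code, pauses)
```

With a real server, `fetch` could be `lambda url: requests.get(url, stream=True)` and the `sleep` argument can be left at its default.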

### Open Json file

In [None]:
def open_json_file(json_file_name):
    '''
    The function opens a JSON file and returns its content as a dictionary. It is used to process the IIIF manifest.
    '''
    with open(json_file_name, 'r') as json_file:
        return json.load(json_file)

### Download images from an IIIF manifest and create a CSV with all image data

#### Uses of the function

Downloads all the images in an IIIF manifest. Each image is named after the 'Image_basename' given for the manifest, followed by an underscore and the number of the image canvas (starting at 1), and is saved with a '.jpg' extension.

Creates a CSV file with all the information about each downloaded image:
- 'manifestURL': URL of the manifest,
- 'canvasId': canvas ID,
- 'urlImage': image URL
- 'folderPath': name of the folder to which the image is downloaded,
- 'imageLabel': name of the image as declared in the manifest,
- 'imageWidthAsDeclared': width of the image as declared in the manifest,
- 'imageHeightAsDeclared': height of the image as declared in the manifest,
- 'htmlCode': HTTP status code response,
- 'imageFileName': name of the downloaded image
- 'imageWidthAsDownloaded': width of the image calculated after downloading,
- 'imageHeightAsDownloaded': height of the image calculated after downloading.
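
The declared and downloaded dimensions recorded in this CSV can be compared to spot images that are missing or were delivered at another size. The sketch below is one way to do it, under stated assumptions: the helper `find_size_mismatches` is hypothetical, the sample rows are invented for the demonstration, and the file is read as comma-separated because the function below writes it with `df.to_csv(csv_filename, index=False)`.

```python
import csv
import io

def to_int(value):
    """Convert a CSV cell to an int, or None when the cell is empty."""
    return int(float(value)) if value else None

def find_size_mismatches(csv_file):
    """Return the images whose downloaded size is missing or differs from the declared size."""
    mismatches = []
    for row in csv.DictReader(csv_file):
        declared = (to_int(row['imageWidthAsDeclared']), to_int(row['imageHeightAsDeclared']))
        downloaded = (to_int(row['imageWidthAsDownloaded']), to_int(row['imageHeightAsDownloaded']))
        if declared != downloaded:
            # Fall back on the canvas ID when no file name was recorded
            mismatches.append(row['imageFileName'] or row['canvasId'])
    return mismatches

# Invented rows: one matching image, one resized image, one failed download
sample = io.StringIO(
    "canvasId,imageWidthAsDeclared,imageHeightAsDeclared,"
    "imageFileName,imageWidthAsDownloaded,imageHeightAsDownloaded\n"
    "c1,1000,1500,ms_1.jpg,1000,1500\n"
    "c2,1000,1500,ms_2.jpg,800,1200\n"
    "c3,1000,1500,,,\n"
)
mismatches = find_size_mismatches(sample)
print(mismatches)
```

A difference is not necessarily an error: the declared values are read from the canvas, and a server may legitimately deliver an image at another resolution.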

#### Function

In [None]:
def download_images_from_manifest(manifest_path, dataset_folder, ms_base_name, request_pause):
    """
    Download all the images in an IIIF manifest. Each image is named with 'ms_base_name', followed by an underscore and the number of the image canvas.
    
    Parameters:
    'manifest_path': path to the downloaded IIIF manifest
    'dataset_folder': path to the folder where the data will be stored
    'ms_base_name': folder name and base name of the downloaded images
    'request_pause': pause in seconds between the download of two images (to avoid 429 errors)
    
    Creates a CSV file with all the information about each downloaded image:

    'manifestURL': URL of the manifest,
    'canvasId': canvas ID,
    'urlImage': image URL
    'folderPath': name of the folder to which the image is downloaded,
    'imageLabel': name of the image as declared in the manifest,
    'imageWidthAsDeclared': width of the image as declared in the manifest,
    'imageHeightAsDeclared': height of the image as declared in the manifest,
    'htmlCode': HTTP status code response,
    'imageFileName': name of the downloaded image
    'imageWidthAsDownloaded': width of the image calculated after downloading,
    'imageHeightAsDownloaded': height of the image calculated after downloading.

    """
    print(f'Downloading all images from {manifest_path}')
    
    # Read the manifest content from the locally downloaded file
    with open(manifest_path, 'r') as manifest_file:
        manifest_content = json.load(manifest_file)
    
    if manifest_content:
        iiif_manifest = manifest_content
    else:
        print("Failed to read the manifest file.")
        return
    # Create a list to store image information
    images_data = []

    # Download images
    for i, canvas in enumerate(iiif_manifest['sequences'][0]['canvases']):
        manifestURL = iiif_manifest['@id']
        canvasId = canvas['@id']
        urlImage = canvas['images'][0]['resource']['@id']
        imageLabel = canvas['label']
        imageWidthAsDeclared = canvas['width']
        imageHeightAsDeclared = canvas['height']
        image_id = f"{i+1}"
        # Images are always saved with a '.jpg' extension, whatever the extension in the URL
        path_to_store_image = os.path.join(dataset_folder, ms_base_name + '_' + image_id + '.jpg')

        # If the image is already on disk, do not download it again
        if os.path.exists(path_to_store_image):
            print(f"Image {path_to_store_image} already downloaded")
            # The status code is assumed to be 200, since no request is sent
            htmlCode = 200
            imageFileName = path_to_store_image
            with Image.open(imageFileName) as img:
                imageWidthAsDownloaded, imageHeightAsDownloaded = img.size
        # Otherwise download it and record the post-download data
        else:
            download_image_from_url(urlImage, path_to_store_image, request_pause)

            try:
                # Note: this sends a second request to the server, only to record its status code
                htmlCode = requests.get(urlImage).status_code
                imageFileName = path_to_store_image
                with Image.open(imageFileName) as img:
                    imageWidthAsDownloaded, imageHeightAsDownloaded = img.size

            except Exception:
                # The request failed, or the file is missing or unreadable: leave the fields empty
                htmlCode = ""
                imageFileName = ""
                imageWidthAsDownloaded = ""
                imageHeightAsDownloaded = ""

        # Add image data to the list
        image_data = {
            'manifestURL': manifestURL,
            'canvasId': canvasId,
            'urlImage': urlImage,
            'folderPath': os.path.join(dataset_folder),
            'imageLabel': imageLabel,
            'imageWidthAsDeclared': imageWidthAsDeclared,
            'imageHeightAsDeclared': imageHeightAsDeclared,
            'htmlCode': htmlCode,
            'imageFileName': imageFileName,
            'imageWidthAsDownloaded': imageWidthAsDownloaded,
            'imageHeightAsDownloaded': imageHeightAsDownloaded
        }
        images_data.append(image_data)
    
    # Create a DataFrame from the image data list
    df = pd.DataFrame(images_data)
    
    # Save DataFrame to a CSV file
    csv_filename = os.path.join(dataset_folder, ms_base_name + '_image_data.csv')
    df.to_csv(csv_filename, index=False)
    
    print(f"Image data saved to {csv_filename}")
    print('Downloads complete')
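
The keys read by the function above (`sequences`, `canvases`, `images`, `resource`, `@id`) follow the layout of IIIF Presentation API 2 manifests. The sketch below shows the minimal structure the function relies on. The helper `list_canvas_images` and the `example.org` URLs are made up for the illustration.

```python
def list_canvas_images(manifest):
    """Yield (number, canvas ID, label, image URL) for each canvas of a Presentation 2 manifest."""
    for i, canvas in enumerate(manifest['sequences'][0]['canvases']):
        image_url = canvas['images'][0]['resource']['@id']
        yield i + 1, canvas['@id'], canvas['label'], image_url

# Minimal invented manifest containing only the keys used in this notebook
minimal_manifest = {
    '@id': 'https://example.org/iiif/book1/manifest.json',
    'sequences': [{
        'canvases': [{
            '@id': 'https://example.org/iiif/book1/canvas/1',
            'label': 'f. 1r',
            'width': 1000,
            'height': 1500,
            'images': [{
                'resource': {'@id': 'https://example.org/iiif/book1/f1r/full/full/0/default.jpg'},
            }],
        }],
    }],
}

canvas_rows = list(list_canvas_images(minimal_manifest))
print(canvas_rows)
```

Manifests that follow Presentation API 3 nest canvases under `items` and use `id` instead of `@id`, so they would raise a `KeyError` here and would need a separate code path.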

### Download data (manifest, images and create a CSV file for each manifest)

In [None]:
def download_data(list_of_manifests, dataset_folder):
    """
    This function launches the download of all the manifests listed in a CSV file (parameter 'list_of_manifests').
    For each manifest, a folder is created in 'dataset_folder' containing a copy of the IIIF manifest file,
    the images and a CSV file documenting the downloads.
    """
    df = pd.read_csv(list_of_manifests, sep=';')

    # Browse each line of the CSV file
    for i, row in df.iterrows():
        url_manifest = row['Manifest_URL']
        ms_name = str(row['Image_basename'])

        outputfolder = os.path.join(dataset_folder, ms_name)
        
        # Check if the destination folder exists, if not create it
        if not os.path.exists(outputfolder):
            os.makedirs(outputfolder)
        
        # Download the manifest.json file
        response = requests.get(url_manifest)
        if response.status_code == 200:
            # Build the full path to the destination file
            manifest_path = os.path.join(outputfolder, ms_name + '_manifest.json')
            
            # Save the manifest.json file in the destination folder
            with open(manifest_path, 'wb') as f:
                f.write(response.content)
                
            print(f'The manifest.json file has been downloaded and saved in {outputfolder}.')
            print()
            print(f'Downloading images from {manifest_path}')

            # Download the manifest images.
            # Gallica's API is limited to 5 IIIF images per minute, so for manifests from Gallica
            # the recommended request_pause is 12 seconds. Cf. https://api.bnf.fr/fr/node/232
            download_images_from_manifest(manifest_path, outputfolder, ms_name, 3)
        else:
            print(f'Failed to download manifest.json file for {ms_name}')
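
The 12 seconds recommended for Gallica in the comment above is simply 60 seconds divided by the limit of 5 images per minute. The same arithmetic can be applied to other rate limits, as in this small sketch. The helper `pause_for_rate_limit` and its `requests_per_image` parameter are assumptions added for the illustration.

```python
def pause_for_rate_limit(max_requests_per_minute, requests_per_image=1):
    """
    Return the pause in seconds to apply after each image so that the number of
    requests per minute stays within the limit (hypothetical helper).
    """
    return 60 * requests_per_image / max_requests_per_minute

gallica_pause = pause_for_rate_limit(5)
print(gallica_pause)
```

If the code sends more than one request per image, `requests_per_image` should be raised accordingly: with two requests per image and the same limit, the pause becomes 24 seconds.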


## Download

In [None]:
list_of_manifests = PATHTOCSVWITHDATA  # to be changed: path to the CSV file listing the manifests
dataset_folder = PATHTODOWNLOADFOLDER  # to be changed: path to the download folder

In [None]:
list_of_manifests = '/home/jovyan/work/Books_in_books/_Download_scripts/Books_in_Books.csv'
dataset_folder = '/home/jovyan/work/Books_in_books'

'''
EXAMPLE OF CSV FILE:
Image_basename;Manifest_URL
MS Typ 1006;https://fragmentarium.ms/metadata/iiif/F-y5pq/manifest.json
MS Lat 451;https://fragmentarium.ms/metadata/iiif/F-j8m5/manifest.json
'''

In [None]:
%%time

download_data(list_of_manifests, dataset_folder)