# Dataset Download

This dataset was provided by the "Dust and Data" project.

In this notebook, the dataset is downloaded into a folder. Further processing is handled in a separate notebook.

In [None]:
import pandas as pd
import os
import urllib.request
from time import sleep
import logging
from PIL import Image

## Loading the Dataset

In [None]:
database_file = os.path.join(os.getcwd(), 'xmlkultur.xlsx')
assert os.path.exists(database_file), 'The Dataset is not in the expected directory'

target_dir = os.path.join(os.getcwd(), 'images')
if not os.path.exists(target_dir):
    os.makedirs(target_dir)

df = pd.read_excel(database_file, index_col=0)
df.head()


## Checking assumptions
To make sure the dataset contains no images that fall outside the expected license or publisher, I check the three related columns.

I also check whether I have to handle multiple file formats and whether the filenames I plan to use are actually unique identifiers.

In [None]:
rights = df['Rights'].unique()
ds_license = df['CreativeCommons'].unique()
publisher = df['Publisher'].unique()
file_formats = df['Format'].unique()

assert len(rights) == 1 and len(ds_license) == 1 and len(publisher) == 1 and \
       rights[0] == 'Österreichische Galerie Belvedere' and \
       ds_license[0] == 'https://creativecommons.org/licenses/by-sa/4.0/' and \
       publisher[0] == 'Österreichische Galerie Belvedere', 'UNEXPECTED LEGAL STATUS'
print('Legal status is as expected.')

assert len(file_formats) == 1 and file_formats[0] == 'image/jpeg', 'Unexpected file format'
print('File formats are as expected.')

assert len(df['Identifier'].unique()) == df.shape[0], 'Object Id is not a unique identifier'
print('Naming scheme is as expected.')

Although the images are provided as JPEG files, I save them as PNGs. I do not yet know how I will process the images later in the project, and because JPEG compression is lossy, JPEG files degrade each time they are modified and saved again.

## Downloading the images
Since I do not want to risk overloading the server by sending thousands of requests in quick succession, I introduce a delay between downloads. This slows the process down significantly, but it prevents an unintended denial-of-service attack.
Since there does not seem to be a robots.txt file for sammlung.belvedere.at, I use a delay of 10 seconds per request, which puts barely any load on the server.
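
If a robots.txt file did exist, its `Crawl-delay` directive could be used in place of a hard-coded value. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `choose_delay` and its `fallback` parameter are illustrative assumptions and not part of this notebook's code.

```python
from urllib.robotparser import RobotFileParser


def choose_delay(robots_lines, user_agent="*", fallback=10.0):
    """Return the Crawl-delay for user_agent, or fallback if none is declared.

    robots_lines is the content of a robots.txt file as a list of lines.
    An empty list stands for a missing robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else fallback


# No robots.txt content: the fallback is used
print(choose_delay([]))

# A robots.txt file that declares a crawl delay for all user agents
print(choose_delay(["User-agent: *", "Crawl-delay: 5"]))
```

Passing the lines in directly keeps the sketch independent of the network. For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the file instead.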

In [None]:
log_file = os.path.join(os.getcwd(), 'download_errors.log')
logger = logging.getLogger('failed_download')
logger.addHandler(logging.FileHandler(log_file))

failed_downloads = []
with open(log_file) as f:
    for line in f.readlines():
        failed_downloads.append(line.rstrip())

for _, row in df.iterrows():
    obj_name_jpg = row['Identifier'] + '.jpg'
    obj_name_png = row['Identifier'] + '.png'

    # Skip images that were already downloaded or that failed before
    if row['Identifier'] in failed_downloads or os.path.exists(os.path.join(target_dir, obj_name_png)):
        continue

    try:
        urllib.request.urlretrieve(row['Object'], os.path.join(target_dir, obj_name_jpg))
        sleep(10)
    except Exception:
        # Log the identifier so that the image is skipped on later runs
        logger.error(f'{row["Identifier"]}')
        sleep(10)
        continue

    # Converting to png
    im = Image.open(os.path.join(target_dir, obj_name_jpg))
    im.save(os.path.join(target_dir, obj_name_png))
    os.remove(os.path.join(target_dir, obj_name_jpg))
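
The loop above repeats the `sleep(10)` call in both branches. One possible restructuring, sketched here with hypothetical names (`download_all`, an injectable `fetch` callable and an injectable `sleep`), moves the pause into a `finally` clause so that it runs after every attempt, whether it succeeded or failed. This is an illustration of the idea, not the code used for the actual download.

```python
import time


def download_all(items, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Call fetch(identifier, url) for each item and pause after every attempt.

    items is an iterable of (identifier, url) pairs.
    Returns the identifiers for which fetch raised an exception.
    """
    failed = []
    for identifier, url in items:
        try:
            fetch(identifier, url)
        except Exception:
            # Remember the failure instead of aborting the whole run
            failed.append(identifier)
        finally:
            # The pause applies to successful and failed attempts alike
            sleep(delay_seconds)
    return failed


# Stand-ins for the real download and the real pause, so the sketch runs offline
def fake_fetch(identifier, url):
    if identifier == "b":
        raise OSError("simulated download error")


pauses = []
failed = download_all(
    [("a", "http://example.invalid/a"), ("b", "http://example.invalid/b")],
    fake_fetch,
    sleep=pauses.append,
)
print(failed, len(pauses))
```

In the notebook, `fetch` could wrap `urllib.request.urlretrieve` together with the PNG conversion, and the returned list could be written to the log file.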

## Checking the download
Checking if an attempt for downloading was made for every image:

In [None]:
# Failed download set is reloaded to account for the errors of the last cell
with open(log_file) as f:
    failed_downloads = [line.rstrip() for line in f]

for f in os.listdir(target_dir):
    assert os.path.isfile(os.path.join(target_dir, f)) and f.endswith('.png'), f'Unexpected finding (either folder or file that is not a png): {f}'
    assert (df['Identifier'] == f[:-4]).any(), f'No matching identifier was found in the dataset for the picture {f}'

successful_downloads = len(os.listdir(target_dir))

accounted_for = len(set(failed_downloads)) + successful_downloads

assert accounted_for == df.shape[0], f'Mismatch: {df.shape[0]} expected, but found {accounted_for}: {successful_downloads} successfully downloaded and {len(set(failed_downloads))} accounted for as failed'
print(f'{len(set(failed_downloads))} images are counted as missing. Not considering these, the dataset is complete. If there are missing images, they are referenced in the log file.')
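
The check above compares counts only. A set-based comparison would additionally name the identifiers that are unaccounted for. The following is a minimal sketch with a hypothetical helper `reconcile`; the identifiers in the usage lines are made up for illustration.

```python
def reconcile(expected, downloaded, failed):
    """Compare the expected identifiers with the downloaded and failed ones.

    Returns a pair of sets:
    - identifiers that were neither downloaded nor logged as failed
    - identifiers that were downloaded or logged but are not in the dataset
    """
    expected, downloaded, failed = set(expected), set(downloaded), set(failed)
    unaccounted = expected - downloaded - failed
    unexpected = (downloaded | failed) - expected
    return unaccounted, unexpected


# Made-up identifiers: "c" was never attempted, and "b" was logged twice
unaccounted, unexpected = reconcile(["a", "b", "c"], ["a"], ["b", "b"])
print(unaccounted, unexpected)
```

In the notebook, `expected` would correspond to `df['Identifier']`, `downloaded` to the file names in `target_dir` without their `.png` suffix, and `failed` to `failed_downloads`.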

## Missing images
For the following identifiers, the images are still missing:

In [None]:
print(*set(failed_downloads), sep=' ')