### Dataset understanding

| Column Name                           | Type         | UNIQUE   | Description |
| ------------------------------------- | ------------ | -------- | ----------- |
| `record identifiers`                  | LINK         | NO       | A link to an object (e.g. the set of sheets of a book) |
| `persistent identifier`               | LINK         | YES      | A link to a subobject of an object (e.g. one sheet of paper from this book). Sometimes similar to `record identifiers` |
| `identification labels`               | TEXT         | NO       | A label, text description |
| `digital images/downloads (files)`    | SET OF LINKS | PROBABLY | A set of links for downloading all images corresponding to this subobject (e.g. all sides of this sheet of paper) |
| `digital images/archival description` | LINK         | PROBABLY | A link to another page in the archive (purpose unclear) |
| `microfiche/downloads (files)`        | SET OF LINKS | PROBABLY | Links for downloading pictures from the archive. These are probably scans of the original objects and books and photos of sculptures, with metadata on them. Could serve as a style-transfer and metadata resource when the metadata is not in the `identification labels` column |
| `microfiche/archival description`     | LINK         | PROBABLY | Links to another archive |

<img src="dataset logic.jpg" width="500">


In [2]:
import pandas as pd
import requests
import re
import numpy as np
import os

In [3]:
file_name = 'DATASETS_DONT_TOUCH/dataset_CLEAN.csv'
df = pd.read_csv(file_name, sep=';', encoding='utf-8')

# print(df.shape)
# df.head()

Converting the strings in columns `image_links` through `microfiche_archive_links` back into lists of strings

In [4]:
# Took ~20 seconds

columns_to_convert = df.columns[3:]

# Each cell holds the string form of a list, e.g. "['link1', 'link2']"
for i in range(df.shape[0]):
    for column in columns_to_convert:
        cell = df.at[i, column]
        if cell == '[]':
            df.at[i, column] = []
        else:
            # Strip the leading "['" and trailing "']", then split on "', '"
            df.at[i, column] = cell[2:-2].split('\', \'')
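
The slicing above depends on the exact `', '` separator. A minimal alternative sketch, assuming every cell was written as a valid Python list literal (for example by `str(list)`), is to parse it with `ast.literal_eval`. The helper name `parse_list_cell` is hypothetical and not part of the original notebook.

```python
import ast

def parse_list_cell(cell):
    """Turn a string such as "['a', 'b']" back into a Python list."""
    if isinstance(cell, list):
        return cell  # already converted, leave untouched
    value = ast.literal_eval(cell)  # parses literals only, never runs code
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value

# Toy cells standing in for values from the CSV
print(parse_list_cell('[]'))
print(parse_list_cell("['http://a/1', 'http://a/2']"))
```

With the DataFrame from this notebook it could be applied per column, e.g. `df[column] = df[column].apply(parse_list_cell)` for each `column` in `df.columns[3:]`. Cells that are not list literals (such as missing values) would raise an error and need separate handling.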


In [5]:
# df.iloc[100]

### Image downloading
Prof. Bell said that the target column is `image_links`. However, the `microfiche_links` column also contains pictures to download: scans of photos and documents. I will download the images from these two sources into separate folders.

In [6]:
sum_img, sum_scans, sum_equal = 0, 0, 0

for i in range(df.shape[0]):
    num_img, num_scans = len(df.iloc[i].image_links), len(df.iloc[i].microfiche_links)
    sum_img += num_img
    sum_scans += num_scans
    # Counts objects whose two columns hold the same number of links
    sum_equal += num_img == num_scans


print(f'Amount of objects in dataset: {df.shape[0]}')
print(f'Amount of images in \"image_links\": {sum_img}')
print(f'Amount of images in \"microfiche_links\": {sum_scans}')
print(f'Objects with the same number of images and scans: {sum_equal}')

Amount of objects in dataset: 38200
Amount of images in "image_links": 57094
Amount of images in "microfiche_links": 36177
Objects with the same number of images and scans: 5793
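
To count explicitly the objects that have only images or only scans, the comparison has to check for empty lists rather than equal lengths. The sketch below illustrates this on toy data; `summarize_link_coverage` and `toy_rows` are hypothetical names, and the toy values are not taken from the dataset.

```python
def summarize_link_coverage(pairs):
    """Tally link counts for an iterable of (image_links, scan_links) pairs."""
    summary = {
        'objects': 0, 'images': 0, 'scans': 0,
        'same_count': 0, 'only_images': 0, 'only_scans': 0,
    }
    for image_links, scan_links in pairs:
        num_img, num_scans = len(image_links), len(scan_links)
        summary['objects'] += 1
        summary['images'] += num_img
        summary['scans'] += num_scans
        # Booleans add as 0 or 1
        summary['same_count'] += num_img == num_scans
        summary['only_images'] += num_img > 0 and num_scans == 0
        summary['only_scans'] += num_scans > 0 and num_img == 0
    return summary

# Toy rows standing in for (df.image_links, df.microfiche_links)
toy_rows = [
    (['a', 'b'], ['s1']),   # both sources, different counts
    (['c'], []),            # images only
    ([], ['s2', 's3']),     # scans only
    (['d'], ['s4']),        # both sources, same count
]
print(summarize_link_coverage(toy_rows))
```

On the real data the pairs could come from `zip(df.image_links, df.microfiche_links)`, assuming both columns already hold lists.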


In [7]:
def download_image(url, save_path):
    try:
        # Send a GET request to the URL; the timeout keeps a stalled server from hanging the loop
        response = requests.get(url, timeout=30)
        
        # Check if the request was successful
        if response.status_code == 200:
            # Open a file in binary mode and write the image content
            with open(save_path, 'wb') as file:
                file.write(response.content)
            # print(f"Image successfully downloaded: {save_path}")
        else:
            print(f"Failed to retrieve image from {url}. Status code: {response.status_code}")
    except Exception as e:
        print(f"An error occurred while downloading {url}: {e}")

def create_folder(folder_name):
    # Check if the folder exists
    if not os.path.exists(folder_name):
        # Create the folder if it doesn't exist
        os.makedirs(folder_name)
        print(f"Folder '{folder_name}' created.")
    else:
        print(f"Folder '{folder_name}' already exists.")
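
For long runs it may help to skip files that were already downloaded and to avoid leaving half-written files behind. The following is only a sketch under assumptions: `fetch_to_file` is a hypothetical helper, and `session` is assumed to be a `requests.Session` or any object with a compatible `get(url, timeout=...)` method.

```python
import os

def fetch_to_file(session, url, save_path, timeout=30.0):
    """Download url to save_path; return True on success, False otherwise."""
    if os.path.exists(save_path):
        return True  # already downloaded in an earlier run
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as errors
    except Exception as exc:
        print(f"Failed to retrieve {url}: {exc}")
        return False
    # Write under a temporary name, then rename, so an interrupted run
    # never leaves a truncated file at save_path
    tmp_path = save_path + '.part'
    with open(tmp_path, 'wb') as file:
        file.write(response.content)
    os.replace(tmp_path, save_path)
    return True
```

A call could look like `fetch_to_file(requests.Session(), link, save_path)`; reusing one session across many downloads also reuses the underlying connection.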

Choose the number of objects to download and, if needed, change the folder name.

In [8]:
# Download images from image_links to your computer

# One object can have several images
OBJECTS_TO_DOWNLOAD = 5

create_folder('images')

for i in range(OBJECTS_TO_DOWNLOAD):
    for link in df.iloc[i].image_links:

        # Build a name for the image: last segment of the persistent ID, then last segment of the link
        name = ''
        line = df.iloc[i].id_persistent
        if line != 'NO_ID_PERSISTENT':
            name += re.findall(r'.*/([^/]*)$', line)[0] + '-'
        else:
            name += 'NO_ID_PERSISTENT-'
        name += re.findall(r'.*/([^/]*)$', link)[0]

        save_path = f'images/img-{name}.jpg'
        download_image(link, save_path)


Folder 'images' already exists.


#### Same procedure for downloading scans (column `microfiche_links`)


In [9]:
# Download scans from microfiche_links to your computer

create_folder('scans')

# One object can have several scans
OBJECTS_TO_DOWNLOAD = 10

for i in range(OBJECTS_TO_DOWNLOAD):
    for link in df.iloc[i].microfiche_links:

        # Build a name for the scan: last segment of the persistent ID, then last segment of the link
        name = ''
        line = df.iloc[i].id_persistent
        if line != 'NO_ID_PERSISTENT':
            name += re.findall(r'.*/([^/]*)$', line)[0] + '-'
        else:
            name += 'NO_ID_PERSISTENT-'
        name += re.findall(r'.*/([^/]*)$', link)[0]

        save_path = f'scans/scn-{name}.jpg'
        download_image(link, save_path)


Folder 'scans' already exists.
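
The two download cells differ only in the column, the folder, and the file prefix, so the shared logic could be factored into one helper. This is a sketch, not the notebook's own code: `build_save_path` and `download_column` are hypothetical names, and `rsplit` is used in place of the regular expression (it returns the whole string when there is no `/`, where the regex lookup would raise an error).

```python
def build_save_path(folder, prefix, id_persistent, link):
    """Compose '<folder>/<prefix>-<id part>-<file part>.jpg' as in the cells above."""
    if id_persistent != 'NO_ID_PERSISTENT':
        id_part = id_persistent.rsplit('/', 1)[-1]  # text after the last '/'
    else:
        id_part = 'NO_ID_PERSISTENT'
    file_part = link.rsplit('/', 1)[-1]
    return f'{folder}/{prefix}-{id_part}-{file_part}.jpg'

def download_column(rows, folder, prefix, download, limit):
    """Download the links of the first `limit` objects.

    rows: iterable of (id_persistent, links) pairs.
    download: callable taking (url, save_path).
    Returns the list of target paths that were requested.
    """
    requested = []
    for index, (id_persistent, links) in enumerate(rows):
        if index >= limit:
            break
        for link in links:
            save_path = build_save_path(folder, prefix, id_persistent, link)
            download(link, save_path)
            requested.append(save_path)
    return requested
```

With the notebook's objects, the two cells would then reduce to calls such as `download_column(zip(df.id_persistent, df.image_links), 'images', 'img', download_image, limit=5)` and the same call with `df.microfiche_links`, `'scans'`, and `'scn'`.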
