## Load dataset

We load a version of the custom dataset from my Google Drive into Google Colab, then change to the correct directory to work with it. We also clean up some useless files.

In [9]:
# from google.colab import drive
# Mount the google drive
# drive.mount('/content/drive')

# Change directory to the data directory
# %cd /Volumes/ArabaFenice/Thesis_Data

/Volumes/ArabaFenice/Thesis_Data


### Formatting the dataset labels

We import the dataset labels, which were given to us as a `csv` file, and load them into a pandas dataframe so they are easier to work with. We also do some cleanup of unlabeled (NaN) values. Looking around, I found that 204 rows have a NaN `mediaType`:

```python
# Get the unique values of the 'mediaType' column
media_types = df['mediaType'].unique()

# Print the unique values
print(media_types)

none_count = df['mediaType'].isnull().sum()

print(none_count) # 204
```

In [28]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import json
import os

# Change directory
%cd /Volumes/ArabaFenice/Thesis_Data
# %cd ~/Downloads

# Read the csv file
df = pd.read_csv('Pisa_metadata.csv', delimiter=';')

# Select only the first 5 rows of the dataframe
# df = df.head(5)

column_names = df.columns.tolist()

# Print the column names
print(column_names)

# print(df)

# Replace "[ ... ]" with an empty string only in the "locality" column
df['locality'] = df['locality'].replace(r'\[.*?\]', '', regex=True)

# Replace "sine loco" with an empty string
df['locality'] = df['locality'].replace('sine loco', '', regex=True)

# Remove any leading or trailing spaces
df['locality'] = df['locality'].str.strip()

# Replace NaN values with empty strings
df = df.fillna('')

# Convert day, month, and year columns to strings without decimals
df['day'] = df['day'].astype(str).str.split('.').str[0]
df['month'] = df['month'].astype(str).str.split('.').str[0]
df['year'] = df['year'].astype(str).str.split('.').str[0]
df['elevation'] = df['elevation'].astype(str).str.split('.').str[0]

unique_values = df['verbatimScientificName'].value_counts()
print(unique_values)

unique_locations = df['locality'].value_counts()
# print(unique_locations)

# Count the distinct elevation values (already converted to strings above)
unique_elevation = df['elevation'].value_counts()
# print(unique_elevation)

# names_with_count_1 = unique_values[unique_values == 1].index.tolist()
# print(names_with_count_1)

/Volumes/ArabaFenice/Thesis_Data
['catalogNumber', 'recordedBy', 'eventDate', 'year', 'month', 'day', 'verbatimEventDate', 'locality', 'decimalLatitude', 'decimalLongitude', 'identifiedBy', 'elevation', 'elevationAccuracy', 'mediaType', 'verbatimScientificName']
verbatimScientificName
Dianthus virgineus L.                                      235
Pulmonaria hirta L.                                        220
Alchemilla glaucescens Wallr.                              143
Trifolium L.                                               140
Armeria arenaria subsp. praecox (Jord.) Kerguélen          132
                                                          ... 
Rubus katrenkensis Kupcsok                                   1
Rubus lindebergii P. J. Müll.                                1
Rubus macrostemon (Focke) Caflisch                           1
Rubus bifrons Vest ex Tratt. x Rubus foliolosus D. Don       1
Trifolium resupinatum subsp. suaveolens (Willd.) Ponert      1
Name: count, Length: 

### Creating the metadata.jsonl

We have a folder with images and want to create the `metadata.jsonl` file, which associates text from the dataframe with each image as ground truth. This is necessary for the `imagefolder` feature of `datasets`.

The final `metadata.jsonl` should look similar to the example below, with one JSON object per line.

```json
{"file_name": "0001.png", "ground_truth": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "ground_truth": "A german shepherd"}
```

In our case the `"ground_truth"` key holds the label text for the image (the selected dataframe fields serialized as a JSON string), which will later be used to create the Donut-specific format.

In [29]:
from pathlib import Path

# define paths
image_path = "img"

# define metadata list
metadata_list = []

# loop through rows of dataframe
for index, row in df.iterrows():
    # Only process still-image records; the stricter null checks are left commented out
    if row['mediaType'] == "StillImage": # and pd.isnull(row['Determinavit']) and pd.isnull(row['Legit']):
       
        # Fill any NaN in the row with a single space
        row = row.fillna(' ')
        
        # create dictionary with metadata for this row
        metadata_dict = {
            "Nome_verbatim": row['verbatimScientificName'],
            "Locality": row['locality'],
            "Elevation": row['elevation'],
            "Day": row['day'],
            "Month": row['month'],
            "Year": row['year'],
        }
        # append an entry with "ground_truth" and "file_name" keys
        metadata_list.append({
            "ground_truth": json.dumps(metadata_dict),
            "file_name": f"pi_{str(row['catalogNumber']).zfill(6)}.jpg"
        })

# write jsonline file to the image_path
jsonl_file_path = os.path.join(image_path, 'metadata.jsonl')
with open(jsonl_file_path, 'w') as outfile:
    for entry in metadata_list:
        json.dump(entry, outfile)
        outfile.write('\n')
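
As a quick check of the file format, here is a minimal sketch of a JSON Lines round trip using only the standard library. The helpers `write_jsonl` and `read_jsonl` and the sample entries are hypothetical and not part of the notebook; the entries only mirror the shape produced by the cell above.

```python
import json
import os
import tempfile


def write_jsonl(path, entries):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample entries in the same shape as the metadata written above
sample = [
    {"ground_truth": json.dumps({"Nome_verbatim": "Trifolium L.", "Locality": ""}),
     "file_name": "pi_000001.jpg"},
    {"ground_truth": json.dumps({"Nome_verbatim": "Pulmonaria hirta L.", "Locality": ""}),
     "file_name": "pi_000002.jpg"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.jsonl")
    write_jsonl(path, sample)
    loaded = read_jsonl(path)

# The ground truth is itself a JSON string, so it needs a second parse
parsed = json.loads(loaded[0]["ground_truth"])
print(parsed["Nome_verbatim"])
```

Note that `ground_truth` is stored as a string inside each line, so reading a record back takes two parsing steps: one for the line and one for the nested string.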

#### Delete images that are not in the metadata.jsonl

---

After creating a backup copy, I can delete from the working folder the images that have no entry in the metadata.

In [30]:
import shutil
# create a copy of the 'data' directory as 'img_copy'
# shutil.copytree('/content/drive/MyDrive/data/img', '/content/drive/MyDrive/img_copy')

metadata_file = "img/metadata.jsonl"
image_path = "img/"

# Load the list of image files from the metadata file
with open(metadata_file, 'r') as f:
    metadata_list = [json.loads(line)['file_name'] for line in f]

# Count the number of deleted files
deleted_count = 0

# Create a progress bar
with tqdm(total=len(os.listdir(image_path)), desc="Going through files") as pbar:
    # Delete image files that don't have metadata
    for file_name in os.listdir(image_path):
        if file_name.endswith('.jpg') and file_name not in metadata_list:
            os.remove(os.path.join(image_path, file_name))
            deleted_count += 1
        pbar.update(1)

print(f"Number of files deleted: {deleted_count}")

Going through files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45954/45954 [00:09<00:00, 4682.99it/s]

Number of files deleted: 0
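
Since deletion cannot be undone, it can help to list the candidates first. The following is a minimal dry-run sketch under the same file layout as above; `find_orphan_images` is a hypothetical helper, and the temporary folder with two empty files only stands in for the real `img/` directory.

```python
import json
import os
import tempfile


def find_orphan_images(image_dir, metadata_path, extension=".jpg"):
    """Return image files in image_dir that have no entry in the metadata file."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        # A set makes the membership test cheap for large folders
        listed = {json.loads(line)["file_name"] for line in f if line.strip()}
    return sorted(
        name for name in os.listdir(image_dir)
        if name.endswith(extension) and name not in listed
    )


with tempfile.TemporaryDirectory() as tmp:
    # Two placeholder images, only one of which is listed in the metadata
    for name in ("pi_000001.jpg", "pi_000002.jpg"):
        open(os.path.join(tmp, name), "wb").close()
    meta = os.path.join(tmp, "metadata.jsonl")
    with open(meta, "w", encoding="utf-8") as f:
        f.write(json.dumps({"file_name": "pi_000001.jpg", "ground_truth": "{}"}) + "\n")

    orphans = find_orphan_images(tmp, meta)

print(orphans)
```

Only once the printed list looks right would the files be passed to `os.remove`.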





#### Show everything that is not JPEG

In [31]:
image_path = "img/"

# Show files that are not JPEG
for filename in os.listdir(image_path):
    if not filename.endswith(".jpeg") and not filename.endswith(".jpg"):
        print(filename)

metadata.jsonl


#### Convert tiff images to jpeg

In [32]:
from PIL import Image

# Convert TIFF images to JPEG and remove the original TIFF files
tiff_files = [filename for filename in os.listdir(image_path) if filename.endswith((".tiff", ".tif"))]

with tqdm(total=len(tiff_files), desc="Converting images") as pbar:
    for filename in tiff_files:
        tiff_path = os.path.join(image_path, filename)
        jpeg_path = os.path.join(image_path, os.path.splitext(filename)[0] + ".jpg")
        im = Image.open(tiff_path)
        
        # Convert image mode to RGB
        im = im.convert("RGB")
        
        im.save(jpeg_path, "JPEG")
        os.remove(tiff_path)
        pbar.update(1)

print("Conversion complete.")

Converting images: 0it [00:00, ?it/s]

Conversion complete.





#### Remove metadata entries whose image is not present

---
Just to be sure there is nothing weird left, drop the entries whose image is missing as well as any duplicated file names.

In [33]:
# Get the list of jpg files in the directory
jpg_files = set([f for f in os.listdir(image_path) if f.endswith('.jpg')])

# Create a temporary file to store the filtered metadata
temp_file = os.path.join(image_path, "temp_metadata.jsonl")

# Initialize a counter for the number of metadata entries kept
num_entries_kept = 0
encountered_file_names = set()

# Read the metadata file line by line and filter the metadata
with open(metadata_file, 'r') as f:
    lines = f.readlines()

with open(temp_file, 'w') as temp_f:
    for line in tqdm(lines, desc="Filtering metadata", total=len(lines)):
        metadata = json.loads(line)
        file_name = metadata['file_name']
        if file_name in jpg_files and file_name not in encountered_file_names:
            temp_f.write(line)
            num_entries_kept += 1
            encountered_file_names.add(file_name)

# Replace the original metadata file with the filtered version
os.replace(temp_file, metadata_file)

print(f"Number of metadata entries kept: {num_entries_kept}")

Filtering metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52274/52274 [00:00<00:00, 490214.95it/s]

Number of metadata entries kept: 45953





#### Smaller image size dataset

Create a copy of the images at half the size, rotating them if necessary, and show a progress bar while doing so.

In [34]:
import json
from PIL import Image, UnidentifiedImageError
import shutil
import os
from tqdm import tqdm

input_dir = "img/"
output_dir = "img_small" # 960 * 1280
# output_dir = "img_tiny" # 480 * 640

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

image_files = [filename for filename in os.listdir(input_dir) if filename.endswith(".jpg")]

metadata_file = input_dir + "metadata.jsonl"
metadata = []

with open(metadata_file, "r") as f:
    for line in f:
        metadata.append(json.loads(line))

updated_metadata = []

for filename in tqdm(image_files, desc="Processing images"):
    try:
        with Image.open(os.path.join(input_dir, filename)) as img:
            resized_img = img.resize((960, 1280))
            
            exif_data = img.info.get('exif')
            if exif_data is not None:
                resized_img.save(os.path.join(output_dir, filename), exif=exif_data)
            else:
                resized_img.save(os.path.join(output_dir, filename))

            for entry in metadata:
                if entry["file_name"] == filename:
                    updated_metadata.append(entry)
                    break
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except UnidentifiedImageError:
        print(f"Cannot identify image file: {filename}")

with open(os.path.join(output_dir, "metadata.jsonl"), "w") as f:
    for entry in updated_metadata:
        f.write(json.dumps(entry) + "\n")

# shutil.copyfile(metadata_file, "img_resized/metadata.jsonl")

Processing images:  15%|████████████████████▌                                                                                                                   | 6963/45953 [3:03:49<6:45:13,  1.60it/s]

Cannot identify image file: pi_017168.jpg
Cannot identify image file: pi_017169.jpg


Processing images: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45953/45953 [14:56:24<00:00,  1.17s/it]
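
The cell above resizes every image to a fixed 960 × 1280 and copies the EXIF block, but it does not rotate anything itself. A minimal sketch of one way to decide on a rotation from the image dimensions follows. `plan_resize` is a hypothetical helper, and treating every landscape image as needing a quarter turn is an assumption that may not hold for this dataset.

```python
def plan_resize(width, height, portrait_size=(960, 1280)):
    """Return (needs_rotation, target_size) for an image of the given size.

    Assumption: all images should end up in portrait orientation, so an
    image that is wider than it is tall is flagged for a quarter turn.
    """
    needs_rotation = width > height
    return needs_rotation, portrait_size


# A portrait scan is only resized, a landscape scan is flagged for rotation first
print(plan_resize(1920, 2560))
print(plan_resize(2560, 1920))
```

With Pillow, a flagged image could be turned with `img.rotate(90, expand=True)` before calling `img.resize(size)`; whether the turn should be clockwise or counter-clockwise would have to be checked by eye on a few samples.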


# General part for preprocessing of the data

### Split the dataset

In [35]:
import random
from pathlib import Path
from sklearn.model_selection import train_test_split
from tqdm import tqdm

image_path = "img_small/"

# Set random seed for reproducibility
seed = 1337

# load metadata from JSONL file
metadata_list = []
with open(os.path.join(image_path, 'metadata.jsonl'), 'r') as infile:
    for line in infile:
        metadata_list.append(json.loads(line.strip()))

# Split the dataset into train, validation and test
train_metadata, temp_metadata = train_test_split(metadata_list, test_size=0.3, random_state=seed)
val_metadata, test_metadata = train_test_split(temp_metadata, test_size=0.5, random_state=seed)

# create directories
os.makedirs(os.path.join(image_path, 'train'), exist_ok=True)
os.makedirs(os.path.join(image_path, 'val'), exist_ok=True)
os.makedirs(os.path.join(image_path, 'test'), exist_ok=True)

# Define directories and corresponding metadata
dirs = ['train', 'val', 'test']
metadata = [train_metadata, val_metadata, test_metadata]

# Loop over directories and metadata, move images and create metadata file
for directory, data in zip(dirs, metadata):
    metadata_list = []
    for entry in tqdm(data, desc=f"Processing {directory}"):
        src_file_path = os.path.join(image_path, entry['file_name'])
        dst_file_path = os.path.join(image_path, directory, entry['file_name'])
        os.rename(src_file_path, dst_file_path)
        metadata_list.append(entry)
    
    with open(os.path.join(image_path, directory, 'metadata.jsonl'), 'w') as outfile:
        for entry in metadata_list:
            json.dump(entry, outfile)
            outfile.write('\n')

Processing train: 100%|██████████████████████████████████| 32165/32165 [00:08<00:00, 3642.17it/s]
Processing val: 100%|██████████████████████████████████████| 6893/6893 [00:01<00:00, 4020.05it/s]
Processing test: 100%|█████████████████████████████████████| 6893/6893 [00:01<00:00, 3655.87it/s]
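
The progress bars above report 32165, 6893 and 6893 images, which add up to 45951. As a worked check, the sketch below reproduces those sizes, assuming that `train_test_split` rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the other part. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part gets ceil(test_size * n_samples) samples
    and the first part gets whatever is left.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 32165 + 6893 + 6893  # images reported by the progress bars above

# First split: 70% train, 30% held out
n_train, n_temp = split_sizes(n_total, 0.3)
# Second split: the held-out part is divided evenly into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

The result is a 70/15/15 split. The total is two fewer than the 45953 metadata entries kept earlier, which matches the two images that could not be identified during resizing.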


### Creating the custom dataset

In [None]:
# Delete in case of cached dataset
!rm -rf /root/.cache/huggingface/datasets/imagefolder/default-5a4ceb57f781cbf0/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f
# !rm -rf /content/drive/MyDrive/data/img_resized

In [None]:
from pathlib import Path
from datasets import load_dataset
import json

# image_path = base_path.joinpath("img")
image_path = "../data/img_resized"

dataset = load_dataset(image_path)

print(f'this is the dataset {dataset}')

#### Show an example

Now let's take a closer look at our dataset by showing an example.

In [None]:
import random

random_sample = random.randint(0, len(dataset['train']) - 1)  # randint includes both endpoints

example = dataset['train'][random_sample]
image = example['image']
ground_truth = example['ground_truth']
 
# Print the index of the sample
print(f"Random sample is {random_sample}")
        
# let's load the corresponding JSON dictionary (as string representation)
print(f"OCR text is {ground_truth}")

# let's make the image a bit smaller when visualizing
width, height = image.size
display(image.resize((int(width*0.3), int(height*0.3))))

We can also parse the string as a Python dictionary using `ast.literal_eval`. In this dataset the ground truth is the JSON string written in the metadata step, so the resulting dictionary contains the keys `Nome_verbatim`, `Locality`, `Elevation`, `Day`, `Month` and `Year`:

In [None]:
from ast import literal_eval

literal_eval(ground_truth)
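
Because the ground truth was written with `json.dumps`, `json.loads` is an alternative to `literal_eval`. The sketch below compares the two on a hypothetical record with the same keys as the metadata step; the record values are made up for illustration.

```python
import json
from ast import literal_eval

# Hypothetical record with the same keys as the metadata written earlier
record = {
    "Nome_verbatim": "Trifolium L.",
    "Locality": "",
    "Elevation": "",
    "Day": "",
    "Month": "",
    "Year": "",
}
ground_truth = json.dumps(record)

# Both parsers agree as long as the string only holds strings and numbers
via_json = json.loads(ground_truth)
via_literal = literal_eval(ground_truth)

# JSON-only literals such as null are not valid Python literals
try:
    literal_eval('{"flag": null}')
    literal_eval_handles_null = True
except ValueError:
    literal_eval_handles_null = False

print(via_json == via_literal, literal_eval_handles_null)
```

For these records both approaches give the same dictionary; `json.loads` is the safer choice if a value could ever be serialized as `null`, `true` or `false`.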

Let's check which tokens are added:

In [None]:
print(added_tokens)

In [None]:
# the vocab_size attribute stays constant (might be a bit unintuitive, but it doesn't include the added special tokens)
print("Original number of tokens:", processor.tokenizer.vocab_size)
print("Number of tokens after adding special tokens:", len(processor.tokenizer))

As always, it's very important to verify whether our data is prepared correctly. Let's check the first training example:

In [None]:
pixel_values, labels, target_sequence = train_dataset[0]

This returns the `pixel_values` (the image, but prepared for the model as a PyTorch tensor), the `labels` (which are the encoded `input_ids` of the target sequence, which we want Donut to learn to generate) and the original `target_sequence`. The reason we also return the latter is because this will allow us to compute metrics between the generated sequences and the ground truth target sequences.

In [None]:
print(pixel_values.shape)
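
The dataset class that builds `labels` is not shown in this section. A common pattern, assumed here, is to copy the encoded `input_ids` and replace padding positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The token ids and the `mask_padding` helper below are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding token ids so they do not contribute to the loss."""
    return [ignore_index if token == pad_token_id else token for token in input_ids]


pad_token_id = 1                   # hypothetical pad id
input_ids = [57, 12, 9, 2, 1, 1]   # hypothetical encoded target, padded to length 6

labels = mask_padding(input_ids, pad_token_id)
print(labels)
```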

Next, we set two additional attributes in the configuration of the model. This is not strictly required, but it lets us train the model by providing only the decoder targets, without having to provide any decoder inputs.

The model will automatically create the `decoder_input_ids` (the decoder inputs) based on the `labels`, by shifting them one position to the right and prepending the decoder_start_token_id. I recommend checking [this video](https://www.youtube.com/watch?v=IGu7ivuy1Ag&t=888s&ab_channel=NielsRogge) if you want to understand how models like Donut automatically create decoder_input_ids - and more broadly how Donut works.

In [None]:
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s_herbarium>'])[0]
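
To illustrate the shifting described above, here is a minimal sketch on plain Python lists. It is a simplified stand-in for what the model does internally on tensors; the token ids are hypothetical, and restoring the pad id where the labels hold `-100` is an assumption about how ignored positions are treated.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels by shifting one position to the right."""
    # Prepend the start token and drop the last label to keep the same length
    shifted = [decoder_start_token_id] + labels[:-1]
    # Positions ignored by the loss (-100) become regular padding in the inputs
    return [pad_token_id if token == -100 else token for token in shifted]


pad_token_id = 1             # hypothetical pad id
decoder_start_token_id = 0   # hypothetical id of the task start token
labels = [57, 12, 2, -100, -100]

decoder_input_ids = shift_tokens_right(labels, pad_token_id, decoder_start_token_id)
print(decoder_input_ids)
```

At each position the decoder therefore sees the tokens before the one it has to predict, which is why both `pad_token_id` and `decoder_start_token_id` have to be set in the configuration.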

In [None]:
# sanity check
print("Pad token:", processor.decode([model.config.pad_token_id]))
print("Decoder start token:", processor.decode([model.config.decoder_start_token_id]))

In [None]:
from torch.utils.data import DataLoader

# feel free to increase the batch size if you have a lot of memory
# I'm fine-tuning on Colab and given the large image size, batch size > 1 is not feasible
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=4)
val_dataloader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=4)

Let's verify a batch:

In [None]:
batch = next(iter(train_dataloader))
pixel_values, labels, target_sequences = batch
print(pixel_values.shape)

In [None]:
for token_id in labels.squeeze().tolist()[:30]:
  if token_id != -100:
    print(processor.decode([token_id]))
  else:
    print(token_id)

In [None]:
print(len(train_dataset))
print(len(val_dataset))

In [None]:
# let's check the first validation batch
batch = next(iter(val_dataloader))
pixel_values, labels, target_sequences = batch
print(pixel_values.shape)

In [None]:
print(target_sequences[0])

# More analyses on possible tokens

#### First look at the data to see if there is any interesting pattern

In [None]:
import pandas as pd

data_path = "/Users/jaczac/Github/Thesis/data/"
file_name = "italy_names.csv"
file_path = data_path + file_name

# Specify the delimiter and encoding parameters
df = pd.read_csv(file_path, delimiter=';', encoding='utf-8')

# get the count of unique values for each column
unique_counts = df.nunique()
# get the count of non-empty elements for each column
non_empty_counts = df.count()

# create a dataframe with the counts and column names
counts_df = pd.DataFrame({'Unique Counts': unique_counts, 'Non-Empty Counts': non_empty_counts, 'Column Names': unique_counts.index})

# Calculate the mode for each column, excluding NaN values
column_modes = df.mode(dropna=True).transpose()

# Add the most frequent value and count to the counts_df dataframe
counts_df['Most Frequent Value'] = column_modes[0]
counts_df['Most Frequent Count'] = df.apply(lambda col: col.value_counts().iloc[0], axis=0)

# Display the dataframe
counts_df

In [None]:
import pandas as pd

data_path = "/Users/jaczac/Github/Thesis/data/"
file_name = "italy_names.csv"
file_path = data_path + file_name

# Specify the delimiter and encoding parameters
df = pd.read_csv(file_path, delimiter=';', encoding='utf-8')

# select the most relevant columns, those with a reasonable frequency
genus = df['genus']
species = df['species']
species_auth = df['species_auth']
infrasp1_rank= df['infrasp1_rank']
infrasp1 = df['infrasp1']
infrasp1_auth = df['infrasp1_auth']
infrasp2_rank = df['infrasp2_rank']
infrasp2 = df['infrasp2']
infrasp2_auth = df['infrasp2_auth']

# most relevant ones, with a frequency of around 10

# Get the most frequent scientific-name parts and their counts.
# Concatenate the columns before counting so that a value appearing in
# several columns gets a single combined count (NaN values are dropped).
most_frequent_names = pd.concat([
    species_auth, species, genus, infrasp1_rank, infrasp1,
    infrasp1_auth, infrasp2_rank, infrasp2, infrasp2_auth,
]).value_counts()

# Create a list of tokens from the most frequent scientific names
tokens = most_frequent_names.index.tolist()

# check for duplicate
def has_duplicates(seq):
    return len(seq) != len(set(seq))

print(has_duplicates(tokens))

print(len(tokens))
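
To keep only tokens above a frequency threshold, the counts can also be combined without pandas. The sketch below uses `collections.Counter`; the helper `frequent_tokens`, the sample columns and the threshold are hypothetical and only illustrate the idea.

```python
from collections import Counter


def frequent_tokens(columns, min_count):
    """Return the values that occur at least min_count times across all columns."""
    counts = Counter()
    for values in columns:
        # Skip empty strings and missing values
        counts.update(value for value in values if value)
    return [token for token, count in counts.most_common() if count >= min_count]


# Hypothetical columns standing in for genus, species_auth and infrasp1_rank
columns = [
    ["Rubus", "Rubus", "Trifolium", ""],
    ["L.", "L.", "L.", None],
    ["subsp.", "Rubus"],
]

tokens = frequent_tokens(columns, min_count=2)
print(tokens)
```

Because a `Counter` holds each value only once, the resulting token list cannot contain duplicates.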