# Data cleaning and processing

This notebook outlines the different steps of data cleaning and processing performed on the annotated data (CSV files exported from Label Studio, one for each archival collection) to prepare it for the [dataset creation](./2_dataset_creation.ipynb) phase.

## Table of contents
1. [Annotated data](#annotated-data), for an analysis of the annotated data
2. [Data cleaning](#data-cleaning), for the cleaning process 
3. [Image processing](#image-processing), for checking for corrupted images and conversion 



## Annotated data

The annotated data are various CSV files, one for each archival collection used:
| Institution | # pictures | Notes | File name |
| :---: | :---: | :---: | :---: |
| Imperial War Museum (IWM) | 199 | Batch1 | rawIWM1.csv |
| Geheugen van Nederland (GvN) | 357 (355) | Selection of pictures with higher resolution | rawGVN.csv | 
| Universiteit Leiden (UL) | 1981 (1978) | From the collection of the [Southeast Asian & Caribbean Images (KITLV)](https://digitalcollections.universiteitleiden.nl/imagecollection-kitlv?solr_nav%5Bid%5D=9b3e2e2b7974226a5774&solr_nav%5Bpage%5D=2&solr_nav%5Boffset%5D=19), with download available: [ULNI - Indigenous administration](https://digitalcollections.universiteitleiden.nl/search?type=dismax&islandora_solr_search_navigation=1&f%5B0%5D=RELS_EXT_isMemberOfCollection_uri_ms%3A%22info%5C%3Afedora%5C/collection%5C%3Akitlv_photos%22&f%5B1%5D=mods_subject_topic_Material_Type_ms%3A%22Photographs%22&f%5B2%5D=mods_accessCondition_restriction_on_access_ms%3A%22Download%5C%20provided.%22&f%5B3%5D=mods_subject_topic_ms%3A%22Indigenous%5C%20Administration%22) | rawULNI.csv |
| **Total (provisional)** | 2537 | | |


The CSV files are exported directly from Label Studio, with the following structure:
| annotation_id | annotator | choice | created_at | id | image | lead_time | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | Sensitive content | 2023-09-19T14:53:45.965682Z | 1 | /data/upload/1/6ba74cc4-Q_017042.jpg | 44.225 | 2023-09-19T14:53:45.965682Z |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4 | 1 | Dubious content | 2023-09-19T14:55:19.854238Z | 4 | /data/upload/1/613fc759-Q_017045.jpg | 100.020 | 2023-10-03T11:00:38.615262Z |
| 5 | 1 | Not-sensitive content | 2023-09-19T14:55:26.779273Z | 5 | /data/upload/1/9cdb6487-Q_017046.jpg | 6.623 | 2023-09-19T14:55:26.779273Z |
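
A quick way to analyse an export with this structure is to count the annotations per class. The sketch below is only illustrative: it reads the three example rows shown above from an in-memory string instead of the real files, and the helper name `class_distribution` is hypothetical.

```python
import io
import pandas as pd

# The three example rows from the table above, as they would appear in the CSV export
sample_csv = io.StringIO(
    "annotation_id,annotator,choice,created_at,id,image,lead_time,updated_at\n"
    "1,1,Sensitive content,2023-09-19T14:53:45.965682Z,1,/data/upload/1/6ba74cc4-Q_017042.jpg,44.225,2023-09-19T14:53:45.965682Z\n"
    "4,1,Dubious content,2023-09-19T14:55:19.854238Z,4,/data/upload/1/613fc759-Q_017045.jpg,100.020,2023-10-03T11:00:38.615262Z\n"
    "5,1,Not-sensitive content,2023-09-19T14:55:26.779273Z,5,/data/upload/1/9cdb6487-Q_017046.jpg,6.623,2023-09-19T14:55:26.779273Z\n"
)

def class_distribution(df):
    """Return the number of annotations per class, ignoring letter case."""
    return df['choice'].str.lower().value_counts()

sample_df = pd.read_csv(sample_csv)
counts = class_distribution(sample_df)
print(counts)
```

On the real files, the same function could be applied to each DataFrame returned by `pd.read_csv` to compare the class balance across collections.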


## Data cleaning
1. **Remove the unnecessary columns** (`'annotator'`, as there's only one, and `'lead_time'`, as the time spent on each annotation is not relevant here)
    - The columns `'annotation_id'` and `'id'` may seem redundant, but their values diverge in the long run (e.g., `'annotation_id'` skips a few numbers between the values 27 and 31, probably because of how the data was managed during the annotation phase)
2. **Add columns** `'provenance'`, to specify the original collection of the picture, and `'set'`, useful for the dataset split (which will be performed in the [2_dataset_creation.ipynb](./2_dataset_creation.ipynb) file)
    - The `'provenance'` information is contained in the CSV file name
    - The `'set'` column will be empty, for the moment
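
Since the `'provenance'` information is contained in the file name, it can also be extracted with a single regular expression instead of one check per collection. This is a minimal sketch, assuming every file follows the `raw<CODE><optional batch number>.csv` naming seen in the table above; the helper `provenance_from_path` is hypothetical and not part of the notebook.

```python
import os
import re

def provenance_from_path(path):
    """Extract the collection code from names like 'rawIWM1.csv' (hypothetical helper)."""
    # 'raw', then the upper-case collection code, then an optional batch number
    match = re.fullmatch(r'raw([A-Z]+)\d*\.csv', os.path.basename(path))
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return match.group(1)

for csv_path in ['./raw-csv/rawIWM1.csv', './raw-csv/rawULNI.csv', './raw-csv/rawGVN.csv']:
    print(csv_path, '->', provenance_from_path(csv_path))
```

Raising an error on unexpected names makes a misnamed file visible immediately, rather than leaving the column silently empty.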

In [1]:
# ---- IMPORT LIBRARIES
# Data cleaning
import pandas as pd
import re
# Image processing
from PIL import Image, UnidentifiedImageError
import os

In [2]:
# ---- TASKS 1, 2: REMOVE/ADD COLUMNS

def processCSV(path):
    df = pd.read_csv(path, header=0, encoding='utf-8')
    # Drop unnecessary columns (drop selected columns ONLY IF they are in the df)
    del_cols = ['annotator','lead_time']    # May be worth it to also drop 'created_at' and 'updated_at' columns
    df = df.drop([x for x in del_cols if x in df.columns], axis=1)
      
    # Add column 'provenance' (pd.Series): first update its values, then attach it to the df
    provenance = pd.Series(dtype='str', index=df.index)
    if "IWM" in path:
        provenance.fillna('IWM', inplace=True)
    elif "ULNI" in path:
        provenance.fillna('ULNI', inplace=True)
    elif "ULIA" in path:
        provenance.fillna('ULIA', inplace=True)
    elif "GVN" in path:
        provenance.fillna('GVN', inplace=True)
    df['provenance'] = provenance

    # Add column 'set' (pd.Series) -> it will be all NaN values
    df['set'] = pd.Series(dtype='str', index=df.index)

    # Change 'choice' column to lowercase
    df['choice'] = df['choice'].str.lower()

    return df

df_IWM1 = processCSV('./raw-csv/rawIWM1.csv')
df_ULNI = processCSV('./raw-csv/rawULNI.csv')
df_GVN = processCSV('./raw-csv/rawGVN.csv')
# Check out example (the output below shows the ULNI collection)
df_ULNI


| | annotation_id | choice | created_at | id | image | updated_at | provenance | set |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 348 | not-sensitive content | 2023-10-16T09:13:55.378150Z | 3421 | http://localhost:8081/UL_NI_1000.png | 2023-10-16T09:13:55.378150Z | ULNI | |
| 1 | 347 | not-sensitive content | 2023-10-16T09:13:49.523270Z | 3422 | http://localhost:8081/UL_NI_1001.png | 2023-10-16T09:13:49.523270Z | ULNI | |
| 2 | 827 | not-sensitive content | 2023-10-16T14:41:37.152651Z | 3423 | http://localhost:8081/UL_NI_1002.png | 2023-10-16T14:41:37.152651Z | ULNI | |
| 3 | 828 | not-sensitive content | 2023-10-16T14:41:41.650469Z | 3424 | http://localhost:8081/UL_NI_1003.png | 2023-10-16T14:41:41.650469Z | ULNI | |
| 4 | 829 | dubious content | 2023-10-16T14:41:52.564087Z | 3425 | http://localhost:8081/UL_NI_1004.png | 2023-10-26T16:50:39.192520Z | ULNI | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1973 | 670 | not-sensitive content | 2023-10-16T14:18:35.153646Z | 5397 | http://localhost:8081/UL_NI_998.png | 2023-10-16T14:18:35.153646Z | ULNI | |
| 1974 | 2244 | not-sensitive content | 2023-10-18T10:26:04.778189Z | 5398 | http://localhost:8081/UL_NI_999.png | 2023-10-18T10:26:04.778189Z | ULNI | |
| 1975 | 370 | not-sensitive content | 2023-10-16T13:19:18.871754Z | 5399 | /data/upload/6/0c18f096-UL_NI_1.png | 2023-10-16T13:19:18.871754Z | ULNI | |
| 1976 | 1705 | not-sensitive content | 2023-10-18T08:39:03.427205Z | 5400 | /data/upload/6/268b43b3-UL_NI_10.png | 2023-10-18T08:39:03.427205Z | ULNI | |


### Further processing
**During the training phase, we ran into a problem**: training failed because of corrupted images (`OSError: image file is truncated`).
<br>
The corruption might have occurred after the mode conversion of the images. As it affected only three pictures, all from the most frequent class ('not-sensitive content': UL_NI_992.png, UL_NI_1190.png, and UL_NI_1493.png), I decided to simply delete them from the dataset so that the research could proceed: first by deleting the image files, then by modifying `rawULNI.csv` and the pandas `DataFrame`.

#### Deleting the image files
I deleted the faulty images, then selected all the remaining image files and batch-renamed them to 'UL_NI_ (1)', 'UL_NI_ (2)', 'UL_NI_ (3)', and so on. After that, I ran a quick Python script to remove the spaces and parentheses from the file names through simple string replacement:
```
import os
for filename in os.listdir("."):
    if filename.startswith("UL_NI_"):
        new_filename = filename.replace("_ (","_")
        newer_filename = new_filename.replace(")","")
        os.rename(filename, newer_filename)
```  

#### Modifying the dataframe
I manually deleted the rows with the faulty data, then replaced the columns `'image'` and `'id'` with new pandas Series containing the renumbered values.
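
The manual deletion of the rows could also be done programmatically. The following is a minimal sketch rather than the procedure actually used: it runs on a toy DataFrame, the helper `drop_faulty_rows` is hypothetical, and it assumes every value of `'image'` contains a `UL_NI_<number>.png` file name.

```python
import pandas as pd

# Numbers of the three corrupted pictures (UL_NI_992.png, UL_NI_1190.png, UL_NI_1493.png)
faulty_numbers = [992, 1190, 1493]

def drop_faulty_rows(df, faulty):
    """Return a copy of df without the rows whose UL_NI picture number is in `faulty`."""
    # Pull the picture number out of the path; assumes every row matches the pattern
    numbers = df['image'].str.extract(r'UL_NI_(\d+)\.png', expand=False).astype(int)
    return df[~numbers.isin(faulty)].reset_index(drop=True)

# Toy data in the two path styles found in the export
toy_df = pd.DataFrame({'image': [
    'http://localhost:8081/UL_NI_991.png',
    'http://localhost:8081/UL_NI_992.png',
    '/data/upload/6/0c18f096-UL_NI_1.png',
]})
cleaned_df = drop_faulty_rows(toy_df, faulty_numbers)
print(cleaned_df)
```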


In [3]:
# ---- MODIFY RAWULNI.CSV

image_ULNI = pd.Series(dtype='str', index=df_ULNI.index)
id_ULNI = pd.Series(dtype='str', index=df_ULNI.index)

for idx,row in df_ULNI.iterrows():
    # Picture number taken from the file name (raw string avoids an invalid-escape warning)
    old_number = int(re.search(r'_\d+',row['image']).group().replace("_",""))
    if old_number<=992:
        new_number = old_number
        new_id = row['id']
    elif old_number>=993 and old_number<1190:
        new_number = old_number-1
        new_id = int(row['id'])-1
    elif old_number>=1190 and old_number<1493:
        new_number = old_number-2
        new_id = int(row['id'])-2
    elif old_number>=1493:
        new_number = old_number-3
        new_id = int(row['id'])-3
    #print(old_number,"\t>>>>>>\t",new_number)
    #print(row['image'])
    id_ULNI[idx] = new_id
    # Replace only the number in the file name, not other digits that may occur in the path
    image_ULNI[idx] = row['image'].replace(f"_{old_number}.png", f"_{new_number}.png")

# Drop the columns with the old values
df_ULNI.drop(['image','id'], axis=1, inplace=True)
# Add the columns with the new values
df_ULNI['image'] = image_ULNI
df_ULNI['id'] = id_ULNI

df_ULNI

| | annotation_id | choice | created_at | updated_at | provenance | set | image | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 348 | not-sensitive content | 2023-10-16T09:13:55.378150Z | 2023-10-16T09:13:55.378150Z | ULNI | | http://localhost:8081/UL_NI_999.png | 3420 |
| 1 | 347 | not-sensitive content | 2023-10-16T09:13:49.523270Z | 2023-10-16T09:13:49.523270Z | ULNI | | http://localhost:8081/UL_NI_1000.png | 3421 |
| 2 | 827 | not-sensitive content | 2023-10-16T14:41:37.152651Z | 2023-10-16T14:41:37.152651Z | ULNI | | http://localhost:8081/UL_NI_1001.png | 3422 |
| 3 | 828 | not-sensitive content | 2023-10-16T14:41:41.650469Z | 2023-10-16T14:41:41.650469Z | ULNI | | http://localhost:8081/UL_NI_1002.png | 3423 |
| 4 | 829 | dubious content | 2023-10-16T14:41:52.564087Z | 2023-10-26T16:50:39.192520Z | ULNI | | http://localhost:8081/UL_NI_1003.png | 3424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1973 | 670 | not-sensitive content | 2023-10-16T14:18:35.153646Z | 2023-10-16T14:18:35.153646Z | ULNI | | http://localhost:8081/UL_NI_997.png | 5396 |
| 1974 | 2244 | not-sensitive content | 2023-10-18T10:26:04.778189Z | 2023-10-18T10:26:04.778189Z | ULNI | | http://localhost:8081/UL_NI_998.png | 5397 |
| 1975 | 370 | not-sensitive content | 2023-10-16T13:19:18.871754Z | 2023-10-16T13:19:18.871754Z | ULNI | | /data/upload/6/0c18f096-UL_NI_1.png | 5399 |
| 1976 | 1705 | not-sensitive content | 2023-10-18T08:39:03.427205Z | 2023-10-18T08:39:03.427205Z | ULNI | | /data/upload/6/268b43b3-UL_NI_10.png | 5400 |
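
The renumbering in the cell above boils down to one rule: each number is shifted down by the count of deleted pictures that came before it. As a sketch of that rule (the function `renumber` is hypothetical, not part of the notebook), the threshold checks can be replaced by a binary search over the sorted list of deleted numbers, which also scales to more deletions:

```python
from bisect import bisect_left

# Numbers of the deleted pictures, in ascending order
deleted = [992, 1190, 1493]

def renumber(old_number, deleted_numbers):
    """Shift a picture number down by the count of deleted numbers below it."""
    # bisect_left returns how many entries of the sorted list are smaller than old_number
    return old_number - bisect_left(deleted_numbers, old_number)

for old in [10, 998, 1000, 1191, 1494]:
    print(old, '->', renumber(old, deleted))
```

The same shift applies to the `'id'` column, since in the cell above the ids are decreased by the same amount as the picture numbers.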


## Image processing
1. Merge the different dataframes into a generic one and export it into CSV (`'./index.csv'`)
    - Change path of the image (column `'image'`) to the appropriate local folder (`'./pictures'` folder, where all the pictures are temporarily stored before the dataset creation)
2. Check for corrupted images
3. Convert images to the same mode (RGB) to avoid complications during the training phase (see: [3_training.ipynb](./3_training.ipynb))

In [4]:
# ---- TASK 1: MERGE THE DATAFRAMES AND CHANGE IMAGE PATH

def createDataset(df_list):
    dataset_df = pd.DataFrame()

    # Update paths for each df
    for df in df_list:
        # Create new Series to update column 'image' (path to the images)
        image = pd.Series(dtype='str', index=df.index)

        # Iterate through df
        for idx, row in df.iterrows():

            # -- Get image name to fix source path
            # .group() returns the string matching the regex, e.g. Q_017042.jpg
            if row['provenance'] == 'IWM':
                # regex to get the Q_ name typical of IWM pics
                img_name = re.search(r'Q_\d+\.jpg', row['image']).group()

            elif 'UL' in row['provenance']:
                # regex to get the UL_ name typical of UL pics
                img_name = re.search(r'UL_\w+_\d+\.png', row['image']).group()

            elif row['provenance'] == 'GVN':
                # regex to get the GVN_ name typical of GvN pics
                img_name = re.search(r'GVN_\d+\.jpg', row['image']).group()

            else:
                raise ValueError(f"Unknown provenance: {row['provenance']}")

            # -- Update path (relative to the local './pictures' folder)
            image[idx] = './pictures/' + img_name

        # Replace the column with the old paths by the one with the new paths
        # (done once per df, after all its rows have been processed)
        clean_df = df.drop('image', axis=1)
        clean_df['image'] = image

        dataset_df = pd.concat([dataset_df,clean_df], ignore_index=True)

    return dataset_df

dataset = createDataset([df_IWM1,df_ULNI,df_GVN])
dataset.to_csv('./index.csv', index=False)
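
The row-by-row loop in `createDataset` can also be written as a single vectorised operation. The sketch below is an alternative under stated assumptions, not the code used to build `index.csv`: the helper `local_paths` is hypothetical, the three file-name patterns are the ones from the cell above combined into one regex, and the GVN file name in the toy data is made up for illustration.

```python
import pandas as pd

# The three file-name patterns used in createDataset, combined into one capture group
NAME_PATTERN = r'(Q_\d+\.jpg|UL_\w+_\d+\.png|GVN_\d+\.jpg)'

def local_paths(images):
    """Map a Series of source paths to paths inside the local './pictures' folder."""
    # str.extract returns NaN for rows where no pattern matches
    names = images.str.extract(NAME_PATTERN, expand=False)
    return './pictures/' + names

toy_images = pd.Series([
    '/data/upload/1/6ba74cc4-Q_017042.jpg',
    'http://localhost:8081/UL_NI_1000.png',
    '/data/upload/3/abcd1234-GVN_12.jpg',   # made-up GVN name, for illustration only
])
new_paths = local_paths(toy_images)
print(new_paths.tolist())
```

Rows that match no pattern end up as missing values, so a call to `isna().any()` on the result is a cheap way to spot unexpected file names.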



In [5]:
# ---- TASK 2: CHECK CORRUPTED IMAGES
# 33,1 GB runtime: 11m 37.8s - 12m 4.8s

def findCorruption():
    # Iterate through each file of the folder using the os.listdir() function
    for filename in os.listdir('./pictures/'):
        # For each file, try to open and fully decode the image with PIL:
        # Image.open() only reads the header, while .load() reads the pixel data
        try:
            with Image.open(os.path.join('./pictures', filename)) as img:
                img.load()

        except Exception as e:
            print(f"Error in file {filename}: {e}")

    print("Done!")

findCorruption()

Done!
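
Because `findCorruption` only prints its findings, the faulty file names have to be copied by hand. A possible variant, sketched here under the assumption that the same full-decode check is wanted, collects them in a list so that they can be passed on to later steps; the function name `find_corrupted` is hypothetical.

```python
import os
from PIL import Image

def find_corrupted(folder):
    """Return (filename, error message) pairs for the files PIL cannot fully decode."""
    corrupted = []
    for filename in sorted(os.listdir(folder)):
        try:
            with Image.open(os.path.join(folder, filename)) as img:
                img.load()  # full decode: this is where a truncated file raises an error
        except Exception as e:
            corrupted.append((filename, str(e)))
    return corrupted

# Only run the check if the pictures folder is present
if os.path.isdir('./pictures'):
    for name, error in find_corrupted('./pictures'):
        print(f"{name}: {error}")
```

Running such a check again after the mode conversion would show whether the truncation reported during training was introduced by that step.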


In [6]:
# ---- TASK 3: CONVERT IMAGES
# The image conversion is done to prevent further issues in the processing phase before the training,
# as the pictures have different modes and different numbers of channels depending on their source
# (while some pictures were provided by their hosting institutions, others were webscraped)
def convertImage(path):
    for file in os.listdir(path):
        file_path = os.path.join(path, file)
        with Image.open(file_path) as image:
            # convert() fully loads the picture, so the file can be overwritten afterwards
            converted_image = image.convert('RGB') if image.mode != 'RGB' else None
        if converted_image is not None:
            converted_image.save(file_path)

convertImage('./pictures')