# Data processing and preparation
This notebook outlines the data processing steps that prepare the raw data (a CSV file with annotations downloaded directly from Label Studio) for fine-tuning an ML model:
- [Data cleaning](#data-cleaning)
- [Dataset creation](#dataset-creation)
- [Dataset split](#dataset-split)

Before we start, let's import the necessary libraries.

In [1]:
# Import necessary libraries
import pandas as pd
import re

## Data cleaning
1. Read the CSV as a pandas dataframe
2. **Remove the unnecessary columns**: `'annotator'` (there is only one annotator) and `'lead_time'` (the time spent on each annotation is not relevant here)
    - The columns `'annotation_id'` and `'id'` may seem redundant, but their values diverge in the long run, as `'annotation_id'` skips a few numbers between the values 27 and 31
3. **Add columns**: `'provenance'`, which records the original collection of the picture, and `'set'`, which stays empty for now; both are used later when splitting the dataset
    - The `'provenance'` information comes either from the CSV file path (for pictures from the Imperial War Museum, IWM) or from the temporary path stored in the `'image'` column (for pictures from the different collections of the Universiteit Leiden, UL)

In [2]:
# TASKS: 1, 2, 3

def processCSV(path):
    df = pd.read_csv(path, header=0, encoding='utf-8')
    # Drop unnecessary columns (drop selected columns ONLY IF they are in the df)
    del_cols = ['annotator','lead_time']    # May be worth it to also drop 'created_at' and 'updated_at' columns
    df = df.drop([x for x in del_cols if x in df.columns], axis=1)

    # Add column 'provenance' (pd.Series): first update its values, then attach it to the df
    provenance = pd.Series(dtype='str', index=df.index)
    if "IWM" in path:
        # Every picture in an IWM file comes from the Imperial War Museum
        provenance[:] = 'IWM'
    elif "UL" in path:
        # UL files can mix collections, so each row is labelled from its own image path
        # (calling fillna() inside the loop would give every row the label of the first row)
        for idx in df.index:
            if 'UL_NI' in df.at[idx, 'image']:
                provenance.at[idx] = 'ULNI'
            elif 'UL_IA' in df.at[idx, 'image']:
                provenance.at[idx] = 'ULIA'
            else:
                provenance.at[idx] = 'ULCA'
    df['provenance'] = provenance

    # Add column 'set' (pd.Series) -> it will be all NaN values
    df['set'] = pd.Series(dtype='str', index=df.index)

    return df

df_IWM1 = processCSV('../data/IWM/rawIWM1.csv')
#df_ULNI = processCSV('../data/UL/rawULNI.csv')
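
The provenance rule can also be expressed without an explicit row loop. The sketch below is only an illustration: the helper name `infer_provenance` and the sample paths are made up, and it assumes that the markers `UL_NI` and `UL_IA` appear literally in the image path, as in the function above.

```python
import pandas as pd

def infer_provenance(image_paths: pd.Series, csv_path: str) -> pd.Series:
    """Return one provenance label per row (hypothetical helper, not part of the notebook)."""
    if "IWM" in csv_path:
        # All rows of an IWM file share the same label
        return pd.Series("IWM", index=image_paths.index, dtype="object")

    # Default UL label, overwritten where a more specific marker is found
    labels = pd.Series("ULCA", index=image_paths.index, dtype="object")
    # 'UL_NI' is applied last so that it takes priority, as in the if/elif chain
    labels[image_paths.str.contains("UL_IA", regex=False, na=False)] = "ULIA"
    labels[image_paths.str.contains("UL_NI", regex=False, na=False)] = "ULNI"
    return labels

# Made-up paths, for illustration only
paths = pd.Series(["/tmp/UL_NI_272.png", "/tmp/UL_IA_015.png", "/tmp/UL_CA_003.png"])
provenance = infer_provenance(paths, "../data/UL/rawUL.csv")
print(provenance.tolist())
```

Because each row is tested against its own path, a file that mixes several UL collections gets one label per picture.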

## Dataset creation
We now have to create the dataset that we will use for the fine-tuning!

This is the **structure of the dataset**:
1. A `pictures` folder in which we are storing all the images of our raw dataset
2. Three different CSV files for the three different subsets `train`, `validation`, `test`
    - Create a `dataset_df` with all the information coming from the different collections (and the updated path to the images)
    - Split the `dataset_df` creating the three different dataframes and updating the values under the `set` column (possible values: strings `train`, `validation`, `test`)
    - Export the three dataframes as CSV files

```
dataset/
├── pictures/
│   ├── Q_017044.jpg
│   ├── UL_NI_272.png
│   └── ...
├── train.csv
├── validation.csv
└── test.csv
```

We now create the `dataset_df` by merging the collection-specific dataframes obtained from the different CSV files, and we update the path stored in the `image` column.

In [3]:

def createDataset(df_list):
    # Base URL of the pictures folder in the Hugging Face dataset repository
    base_url = "https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/"
    clean_dfs = []

    # Update paths for each df
    for df in df_list:
        # Create new Series to update column 'image' (path to the images)
        image = pd.Series(dtype='str', index=df.index)

        # Iterate through df
        for idx, row in df.iterrows():

            # -- Get image name to fix source path
            # Raw strings with an escaped dot, so '.' only matches the extension separator
            if row['provenance'] == 'IWM':
                # regex to get the Q_ name typical of IWM pics
                # using .group() returns the string matching the regex, e.g. Q_017042.jpg
                img_name = re.search(r'Q_\d+\.jpg', row['image']).group()
            elif 'UL' in row['provenance']:
                # regex to get the UL_ name typical of UL pics
                img_name = re.search(r'UL_\w+_\d+\.png', row['image']).group()
            else:
                raise ValueError('Unknown provenance: %s' % row['provenance'])

            # -- Update path
            image[idx] = base_url + img_name

        # Replace the column with the old path by the one with the new path (once per df)
        clean_df = df.drop('image', axis=1)
        clean_df['image'] = image
        clean_dfs.append(clean_df)

    dataset_df = pd.concat(clean_dfs, ignore_index=True)

    # Modify the 'choice' column values: move everything to lowercase
    dataset_df['choice'] = dataset_df['choice'].str.lower()

    return dataset_df


dataset = createDataset([df_IWM1])
dataset



Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
0,1,sensitive,2023-09-19T14:53:45.965682Z,1,2023-09-19T14:53:45.965682Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
1,2,sensitive,2023-09-19T14:53:52.546617Z,2,2023-09-19T14:54:01.018338Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
2,3,sensitive,2023-09-19T14:54:22.608619Z,3,2023-09-26T14:33:31.759760Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
3,4,dubious,2023-09-19T14:55:19.854238Z,4,2023-10-03T11:00:38.615262Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
4,5,not-sensitive,2023-09-19T14:55:26.779273Z,5,2023-09-19T14:55:26.779273Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
...,...,...,...,...,...,...,...,...
194,201,not-sensitive,2023-09-26T16:10:27.352364Z,195,2023-09-26T16:10:27.352364Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
195,202,sensitive,2023-09-26T16:10:39.927971Z,196,2023-10-05T13:37:16.164102Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
196,203,sensitive,2023-09-26T16:10:46.119657Z,197,2023-09-26T16:10:46.119657Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
197,204,sensitive,2023-09-26T16:10:50.707118Z,198,2023-09-26T16:10:50.707118Z,IWM,,https://huggingface.co/datasets/ombrr/rcm-1/bl...
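
The file-name extraction used in `createDataset` can be checked in isolation. The sketch below is illustrative: the two upload paths are invented (only the file names `Q_017044.jpg` and `UL_NI_272.png` come from the dataset structure shown earlier), and the patterns are written as raw strings with an escaped dot, which is an assumption about the intended match rather than a copy of the notebook's code.

```python
import re

# Raw strings with an escaped dot, so '.' only matches a literal dot
IWM_PATTERN = re.compile(r'Q_\d+\.jpg')
UL_PATTERN = re.compile(r'UL_\w+_\d+\.png')

# Hypothetical temporary paths (the prefixes are made up)
samples = [
    '/data/upload/1/3f2a1b-Q_017044.jpg',
    '/data/upload/2/9c8d7e-UL_NI_272.png',
]

names = []
for path in samples:
    # Try the IWM pattern first, then fall back to the UL pattern
    match = IWM_PATTERN.search(path) or UL_PATTERN.search(path)
    names.append(match.group() if match else None)

print(names)
```

A path that matches neither pattern yields `None` here, which makes unexpected file names visible instead of raising an `AttributeError` on `.group()`.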


## Dataset split

Finally, we split the data into the three sets and save each of them as a CSV file. First, we count how often each value of the target variable `choice` occurs.

The counts show that the classes are strongly imbalanced in favour of the "not-sensitive" value (the second cell gives the corresponding percentages).

In [4]:
from sklearn.model_selection import train_test_split

dataset['choice'].value_counts()

not-sensitive    115
sensitive         67
dubious           17
Name: choice, dtype: int64

In [5]:
dataset['choice'].value_counts(normalize=True)*100

not-sensitive    57.788945
sensitive        33.668342
dubious           8.542714
Name: choice, dtype: float64

With a dataset this imbalanced, a model tends to favour the class "not-sensitive" simply because it sees many more samples of that class during training.

When developing an ML model, we therefore have to make sure that the train, validation and test sets each preserve the distribution of the target variable.
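
To put the imbalance into numbers, the short sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the `value_counts()` output above; the variable names are illustrative.

```python
# Class counts as reported by value_counts() above
counts = {'not-sensitive': 115, 'sensitive': 67, 'dubious': 17}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Always predicting the majority class is right exactly as often as that class occurs
baseline_accuracy = counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy:.1%} of {total} samples")
```

Any fine-tuned model should be compared against this baseline, since plain accuracy alone can look acceptable while the smaller classes are ignored.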


### Stratified split
We apply a stratified split, which keeps approximately the same proportion of each value of the target class (`'choice'`) in all three sets.


In [6]:
def split_stratified_into_train_val_test(df_input, stratify_colname,
                                         frac_train, frac_val, frac_test,
                                         random_state):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.
    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().
    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''
    # Compare with a small tolerance: exact equality on floats can fail for valid fractions
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))
    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.
    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)
    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)
    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

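    # --- Fill the column 'set' of each df
    # assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
    df_train = df_train.assign(set='train')
    df_val = df_val.assign(set='validation')
    df_test = df_test.assign(set='test')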

    return df_train, df_val, df_test

In [7]:
dataset_train, dataset_val, dataset_test = split_stratified_into_train_val_test(dataset,'choice', 0.70, 0.15, 0.15, 1)

In [8]:
dataset_train

Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
73,77,not-sensitive,2023-09-26T15:19:16.351587Z,74,2023-09-26T15:19:16.351587Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
148,155,not-sensitive,2023-09-26T15:56:03.352460Z,149,2023-09-26T15:56:03.352460Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
1,2,sensitive,2023-09-19T14:53:52.546617Z,2,2023-09-19T14:54:01.018338Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
31,35,sensitive,2023-09-26T14:31:08.523693Z,32,2023-10-05T13:16:55.648851Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
186,193,sensitive,2023-09-26T16:08:31.733400Z,187,2023-09-26T16:08:31.733400Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
...,...,...,...,...,...,...,...,...
40,44,not-sensitive,2023-09-26T14:46:34.792739Z,41,2023-09-26T14:46:34.792739Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
113,114,sensitive,2023-09-26T15:38:34.417061Z,114,2023-10-05T13:30:41.391578Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
13,14,sensitive,2023-09-19T14:58:36.950031Z,14,2023-09-19T14:58:36.950031Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...
177,184,not-sensitive,2023-09-26T16:06:43.876635Z,178,2023-09-26T16:06:43.876635Z,IWM,train,https://huggingface.co/datasets/ombrr/rcm-1/bl...


In [9]:
dataset_val

Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
166,173,not-sensitive,2023-09-26T16:02:29.474967Z,167,2023-09-28T13:00:20.592173Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
2,3,sensitive,2023-09-19T14:54:22.608619Z,3,2023-09-26T14:33:31.759760Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
120,121,dubious,2023-09-26T15:41:00.391828Z,121,2023-10-05T13:31:11.702176Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
174,181,not-sensitive,2023-09-26T16:05:57.071807Z,175,2023-09-26T16:05:57.071807Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
17,18,sensitive,2023-09-19T14:59:44.115073Z,18,2023-09-19T15:01:25.203772Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
85,30,sensitive,2023-09-20T08:59:14.322293Z,86,2023-09-20T08:59:14.323293Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
191,198,sensitive,2023-09-26T16:09:44.194300Z,192,2023-09-26T16:09:44.194300Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
141,148,not-sensitive,2023-09-26T15:54:50.704472Z,142,2023-09-26T15:54:50.704472Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
60,64,sensitive,2023-09-26T15:14:16.557178Z,61,2023-09-26T15:14:16.557178Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...
162,169,not-sensitive,2023-09-26T16:01:20.267701Z,163,2023-10-05T13:33:56.716809Z,IWM,validation,https://huggingface.co/datasets/ombrr/rcm-1/bl...


In [10]:
dataset_test

Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
77,29,sensitive,2023-09-20T08:59:00.674439Z,78,2023-10-05T13:21:56.021595Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
114,115,sensitive,2023-09-26T15:38:39.340779Z,115,2023-09-26T15:38:39.340779Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
53,57,not-sensitive,2023-09-26T15:10:20.471265Z,54,2023-09-26T15:10:20.471265Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
170,177,not-sensitive,2023-09-26T16:04:27.946205Z,171,2023-10-05T13:35:07.228792Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
195,202,sensitive,2023-09-26T16:10:39.927971Z,196,2023-10-05T13:37:16.164102Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
58,62,not-sensitive,2023-09-26T15:13:49.507191Z,59,2023-09-26T15:13:49.507191Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
128,135,not-sensitive,2023-09-26T15:51:20.407958Z,129,2023-09-26T15:51:20.407958Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
21,22,sensitive,2023-09-19T15:02:10.558165Z,22,2023-09-19T15:02:10.558165Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
34,38,not-sensitive,2023-09-26T14:38:26.451707Z,35,2023-09-28T12:19:10.153397Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...
33,37,not-sensitive,2023-09-26T14:31:26.517135Z,34,2023-09-26T14:31:55.407939Z,IWM,test,https://huggingface.co/datasets/ombrr/rcm-1/bl...


In [11]:
dataset_train.shape

(139, 8)

In [12]:
dataset_val.shape

(30, 8)

In [13]:
dataset_test.shape

(30, 8)
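
The three shapes can be reproduced by hand from the fractions passed to the function. The sketch below assumes that `train_test_split()` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the other part; under that assumption it recovers the sizes shown above.

```python
import math

n_samples = 199                                  # rows in the merged dataset
frac_train, frac_val, frac_test = 0.70, 0.15, 0.15

# First split: train vs. temp (validation + test together)
n_temp = math.ceil((1.0 - frac_train) * n_samples)
n_train = n_samples - n_temp

# Second split: the test share is expressed relative to the temp set
relative_frac_test = frac_test / (frac_val + frac_test)
n_test = math.ceil(relative_frac_test * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)
```

The relative fraction is 0.5 because validation and test each take half of the 30% left over after the first split.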

In [14]:
dataset_train.choice.value_counts(normalize=True)*100

not-sensitive    57.553957
sensitive        33.812950
dubious           8.633094
Name: choice, dtype: float64

In [15]:
dataset_val.choice.value_counts(normalize=True)*100

not-sensitive    56.666667
sensitive        33.333333
dubious          10.000000
Name: choice, dtype: float64

In [16]:
dataset_test.choice.value_counts(normalize=True)*100

not-sensitive    60.000000
sensitive        33.333333
dubious           6.666667
Name: choice, dtype: float64
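
The three percentage outputs are easier to compare side by side. The following sketch builds one table with a column per split; the helper name `compare_class_shares` and the toy dataframes are made up for illustration, and the same call could be applied to `dataset_train`, `dataset_val` and `dataset_test`.

```python
import pandas as pd

def compare_class_shares(splits: dict, column: str = 'choice') -> pd.DataFrame:
    """Return the percentage of each class per split (hypothetical helper)."""
    shares = {
        name: df[column].value_counts(normalize=True) * 100
        for name, df in splits.items()
    }
    # Classes missing from a split appear as NaN after alignment, so report them as 0
    return pd.DataFrame(shares).fillna(0).round(2)

# Toy data, for illustration only
toy_train = pd.DataFrame({'choice': ['sensitive', 'not-sensitive', 'not-sensitive', 'dubious']})
toy_test = pd.DataFrame({'choice': ['sensitive', 'not-sensitive']})

table = compare_class_shares({'train': toy_train, 'test': toy_test})
print(table)
```

With sets as small as 30 rows, the smallest class can only change in steps of one sample, which is why the "dubious" share differs slightly between the validation and test sets.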

Now we save the three dataframes in three different CSV files.



In [17]:
dataset_train.to_csv('./rcm-1/train.csv', index=False)
dataset_val.to_csv('./rcm-1/validation.csv', index=False)
dataset_test.to_csv('./rcm-1/test.csv', index=False)
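
`to_csv()` does not create missing folders, so the export fails if the target directory does not exist yet. The sketch below is one possible way to guard against that; the helper name `export_splits` and the toy dataframe are assumptions, and a temporary directory stands in for `./rcm-1/`.

```python
from pathlib import Path
import tempfile

import pandas as pd

def export_splits(splits: dict, out_dir) -> list:
    """Write one CSV per split into out_dir and return the written paths (hypothetical helper)."""
    out_dir = Path(out_dir)
    # Create the target folder (and its parents) if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, df in splits.items():
        target = out_dir / f"{name}.csv"
        df.to_csv(target, index=False)
        written.append(target)
    return written

# Toy data written to a temporary folder, for illustration only
toy = pd.DataFrame({'choice': ['sensitive'], 'set': ['train']})
with tempfile.TemporaryDirectory() as tmp:
    files = export_splits({'train': toy}, Path(tmp) / 'rcm-1')
    names = [f.name for f in files]
    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(files[0])

print(names, reloaded['set'].tolist())
```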