# Dataset creation

This notebook outlines the steps for creating the dataset. We first inspect the current state of the data (the images are all stored in a temporary `pictures` folder and all the information is contained in an `index.csv` file), then split the dataset into three stratified sets, and finally create the folder-based dataset.

## Table of contents
1. [Temporary data structure](#temporary-data-structure), for inspecting the current state of our data
2. [Dataset split](#dataset-split), for splitting the dataset into train, validation and test sets
3. [Dataset creation](#dataset-creation), for creating the folder-based dataset

In [3]:
# ---- IMPORT LIBRARIES
import pandas as pd
# Dataset split
from sklearn.model_selection import train_test_split
# Dataset creation
import os

## Temporary data structure
The current and temporary structure of our data is the following:
1. A `pictures` folder in which we are storing all the images of the dataset
2. An `index.csv` file containing all the information from all the different collections

```
project/
├── pictures/
│   ├── Q_017044.jpg
│   ├── UL_NI_272.png
│   ├── GVN_1.jpg
│   └── ...
└── index.csv
```

Let's inspect the `index.csv` file as a pandas DataFrame:

In [4]:
# ---- DATA INSPECTION

dataset = pd.read_csv('./index.csv', header=0, encoding='utf-8')
dataset

Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
0,1,sensitive content,2023-09-19T14:53:45.965682Z,1,2023-09-19T14:53:45.965682Z,IWM,,./pictures/Q_017042.jpg
1,2,sensitive content,2023-09-19T14:53:52.546617Z,2,2023-09-19T14:54:01.018338Z,IWM,,./pictures/Q_017043.jpg
2,3,sensitive content,2023-09-19T14:54:22.608619Z,3,2023-09-26T14:33:31.759760Z,IWM,,./pictures/Q_017044.jpg
3,4,dubious content,2023-09-19T14:55:19.854238Z,4,2023-10-03T11:00:38.615262Z,IWM,,./pictures/Q_017045.jpg
4,5,not-sensitive content,2023-09-19T14:55:26.779273Z,5,2023-09-19T14:55:26.779273Z,IWM,,./pictures/Q_017046.jpg
...,...,...,...,...,...,...,...,...
2527,2826,sensitive content,2023-11-27T16:10:25.437511Z,7089,2023-11-27T16:10:25.438513Z,GVN,,./pictures/GVN_351.jpg
2528,2683,dubious content,2023-11-27T16:04:50.352530Z,7090,2023-11-27T16:04:50.352530Z,GVN,,./pictures/GVN_352.jpg
2529,2759,sensitive content,2023-11-27T16:07:25.362035Z,7091,2023-11-27T16:07:25.362035Z,GVN,,./pictures/GVN_353.jpg
2530,2744,dubious content,2023-11-27T16:06:55.218316Z,7092,2023-11-27T16:06:55.218316Z,GVN,,./pictures/GVN_354.jpg
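
Before splitting, it can be worth confirming that every path in the `image` column points to a file that actually exists. The following is a minimal sketch using only the standard library. The function name `find_missing_images` is hypothetical, and the demo runs on a throwaway temporary folder rather than on the real data:

```python
import csv
import os
import tempfile


def find_missing_images(index_csv, base_dir):
    """Return the 'image' paths listed in index_csv that are not files under base_dir."""
    missing = []
    with open(index_csv, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            # Paths in the index are relative (e.g. './pictures/Q_017044.jpg')
            if not os.path.isfile(os.path.join(base_dir, row['image'])):
                missing.append(row['image'])
    return missing


# Demo on a temporary project folder with one image present and one absent
with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, 'pictures'))
    open(os.path.join(project, 'pictures', 'a.jpg'), 'wb').close()
    index_path = os.path.join(project, 'index.csv')
    with open(index_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['choice', 'image'])
        writer.writerow(['sensitive content', './pictures/a.jpg'])
        writer.writerow(['dubious content', './pictures/b.jpg'])
    missing_images = find_missing_images(index_path, project)

print(missing_images)
```

On the real data, the call would presumably be `find_missing_images('./index.csv', '.')`, run from the project folder. An empty list means the index and the `pictures` folder agree.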


## Dataset split
First of all, we have to **split the data into three sets** (train, validation, test) and update the values in the `'set'` column of our `dataset` dataframe accordingly (possible values: the strings `train`, `validation`, `test`).

We start by **counting the occurrences** of the three possible values of the target variable `choice`.
<br>
The classes are heavily **unbalanced** in favour of the "not-sensitive content" value (we can also check the percentages).

In [5]:
dataset['choice'].value_counts()


choice
not-sensitive content    1939
dubious content           330
sensitive content         263
Name: count, dtype: int64

In [6]:
dataset['choice'].value_counts(normalize=True)*100

choice
not-sensitive content    76.579779
dubious content          13.033175
sensitive content        10.387046
Name: proportion, dtype: float64

With such an unbalanced dataset, a purely random split at this stage is risky: the class with the fewest samples (in our case, 'sensitive content') could end up under-represented in the train set or, in the worst case, completely absent from it!
<br>
The model would then see too few examples of this class (or none at all) and might not learn how to recognise it, leading to **incomplete and incorrect learning**.


To prevent this, we will apply a **stratified split**, which preserves the class proportions of the whole dataset in each of the three sets.
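
To make the idea concrete, here is a library-free sketch of stratification: the samples are grouped by class, each group is shuffled, and the same fraction is taken from every group. This is only an illustration of the principle, not how `train_test_split` is implemented. The helper names `stratified_split` and `percentages` are hypothetical, and the toy labels simply reuse the class counts shown above:

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    # Group the sample indices by class
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def percentages(labels, idx):
    """Percentage of each class among the samples selected by idx."""
    counts = Counter(labels[i] for i in idx)
    return {k: round(100 * v / len(idx), 1) for k, v in counts.items()}


# Toy labels with the same class counts as the dataset above
labels = (['not-sensitive content'] * 1939
          + ['dubious content'] * 330
          + ['sensitive content'] * 263)
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=1)

print(percentages(labels, train_idx))
print(percentages(labels, test_idx))
```

Both printed dictionaries should show nearly identical percentages, because every class contributes the same share of its samples to each side.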

In [7]:
# ---- DATA SPLIT

def split_stratified_into_train_val_test(df_input, stratify_colname,
                                         frac_train, frac_val, frac_test,
                                         random_state):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.
    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().
    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''
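    # Compare with a tolerance: exact float equality can fail for fractions
    # that mathematically sum to 1.0
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))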
    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))
    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.
    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train), #0.3
                                                          random_state=random_state)
    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test, #0.5
                                                      random_state=random_state)
    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    # --- Add the column 'set' to each df and reset the index
    # (assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning)
    df_train = df_train.assign(set='train').reset_index(drop=True)
    df_val = df_val.assign(set='validation').reset_index(drop=True)
    df_test = df_test.assign(set='test').reset_index(drop=True)
    
    return df_train, df_val, df_test

dataset_train, dataset_val, dataset_test = split_stratified_into_train_val_test(dataset,'choice', 0.70, 0.15, 0.15, 1)
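
The two-step split can be followed with plain arithmetic. The sketch below assumes that `train_test_split` rounds the size of the "test" portion up and gives the remaining rows to the other portion. That rounding rule is an assumption about the library's behaviour, although the resulting sizes agree with the shapes printed in the next cells:

```python
import math

n_samples = 1939 + 330 + 263          # rows in index.csv (sum of the class counts)
frac_train, frac_val, frac_test = 0.70, 0.15, 0.15

# First call: train vs. temp (validation + test together)
n_temp = math.ceil((1.0 - frac_train) * n_samples)   # assumed: test size rounded up
n_train = n_samples - n_temp

# Second call: the temp set is split again, so the test fraction
# must be expressed relative to the temp set, not to the whole dataset
relative_frac_test = frac_test / (frac_val + frac_test)
n_test = math.ceil(relative_frac_test * n_temp)
n_val = n_temp - n_test

print(relative_frac_test, n_train, n_val, n_test)
```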

Let's check the class proportions in the three splits:

In [8]:
print("""dataset_train
      .shape:\t{shape}
      .percentages:\t{percentages}""".format(shape=dataset_train.shape, percentages=dataset_train.choice.value_counts(normalize=True)*100))
dataset_train


dataset_train
      .shape:	(1772, 8)
      .percentages:	choice
not-sensitive content    76.580135
dubious content          13.036117
sensitive content        10.383747
Name: proportion, dtype: float64


Unnamed: 0,annotation_id,choice,created_at,id,updated_at,provenance,set,image
0,1969,not-sensitive content,2023-10-18T09:47:11.226933Z,5192,2023-10-18T09:47:11.226933Z,ULNI,train,./pictures/UL_NI_812.png
1,823,not-sensitive content,2023-10-16T14:39:37.915522Z,4390,2023-10-16T14:39:37.916523Z,ULNI,train,./pictures/UL_NI_1874.png
2,1958,not-sensitive content,2023-10-18T09:46:44.052315Z,4996,2023-10-18T09:46:44.052315Z,ULNI,train,./pictures/UL_NI_636.png
3,1621,not-sensitive content,2023-10-17T11:31:17.606093Z,3926,2023-10-17T11:31:17.606093Z,ULNI,train,./pictures/UL_NI_1456.png
4,698,not-sensitive content,2023-10-16T14:22:25.432749Z,4662,2023-10-16T14:22:25.432749Z,ULNI,train,./pictures/UL_NI_335.png
...,...,...,...,...,...,...,...,...
1767,951,not-sensitive content,2023-10-16T14:57:11.973000Z,3582,2023-10-16T14:57:11.973000Z,ULNI,train,./pictures/UL_NI_1146.png
1768,1094,not-sensitive content,2023-10-16T15:07:27.005712Z,3754,2023-10-16T15:07:27.005712Z,ULNI,train,./pictures/UL_NI_1300.png
1769,960,not-sensitive content,2023-10-16T14:57:40.776071Z,3594,2023-10-16T14:57:40.776071Z,ULNI,train,./pictures/UL_NI_1157.png
1770,2869,dubious content,2023-11-27T16:11:57.331738Z,6995,2023-11-27T16:11:57.331738Z,GVN,train,./pictures/GVN_257.jpg


In [9]:
print("""dataset_val
      .shape:\t{shape}
      .percentages:\t{percentages}""".format(shape=dataset_val.shape, percentages=dataset_val.choice.value_counts(normalize=True)*100))

dataset_val
      .shape:	(380, 8)
      .percentages:	choice
not-sensitive content    76.578947
dubious content          12.894737
sensitive content        10.526316
Name: proportion, dtype: float64


In [10]:
print("""dataset_test
      .shape:\t{shape}
      .percentages:\t{percentages}""".format(shape=dataset_test.shape, percentages=dataset_test.choice.value_counts(normalize=True)*100))

dataset_test
      .shape:	(380, 8)
      .percentages:	choice
not-sensitive content    76.578947
dubious content          13.157895
sensitive content        10.263158
Name: proportion, dtype: float64
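
Besides the proportions, it may be useful to confirm that the three splits do not overlap and that together they cover the whole dataset. A minimal sketch, where `check_partition` is a hypothetical helper and the identifiers are toy values:

```python
def check_partition(all_ids, *splits):
    """Return True if the splits are pairwise disjoint and together cover all_ids."""
    seen = set()
    for split in splits:
        split = set(split)
        if seen & split:          # an id appears in more than one split
            return False
        seen |= split
    return seen == set(all_ids)


all_ids = range(10)
train_ids, val_ids, test_ids = [0, 1, 2, 3, 4, 5, 6], [7, 8], [9]

ok = check_partition(all_ids, train_ids, val_ids, test_ids)
leaky = check_partition(all_ids, train_ids, [6, 7, 8], test_ids)  # 6 is in two splits
print(ok, leaky)
```

With the dataframes above, one could pass `dataset['id']`, `dataset_train['id']`, `dataset_val['id']` and `dataset_test['id']`, assuming the `id` column is unique per row. Because the helper works on sets, duplicates inside a single split would go unnoticed.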


## Dataset creation
For the scope of this project, we are giving our dataset a folder-based structure: the images will be split into different folders based on their set (train/validation/test) and on their class (sensitive/not-sensitive/dubious).

The dataset will have this structure:
```
dataset/
├── train/
│   ├── sensitive/
│   │   └── ...
│   ├── not-sensitive/
│   │   └── ...
│   └── dubious/
│       └── ...
├── validation/
│   ├── sensitive/
│   ├── not-sensitive/
│   └── dubious/
└── test/
    ├── sensitive/
    ├── not-sensitive/
    └── dubious/
```
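
The folder skeleton above could be created programmatically before any image is moved. This is a sketch under stated assumptions: `create_dataset_folders` is a hypothetical helper, the root folder is passed as a parameter, and the demo uses a temporary directory so that nothing is written to the project:

```python
import os
import tempfile

SETS = ('train', 'validation', 'test')
CLASSES = ('sensitive', 'not-sensitive', 'dubious')


def create_dataset_folders(root):
    """Create root/<set>/<class>/ for every combination; return the created paths."""
    created = []
    for subset in SETS:
        for class_name in CLASSES:
            path = os.path.join(root, subset, class_name)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    folders = create_dataset_folders(os.path.join(tmp, 'dataset'))
    all_exist = all(os.path.isdir(p) for p in folders)

print(len(folders), all_exist)
```

Creating the folders up front matters because `os.rename` does not create missing destination directories.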

To keep track of our data, we also export the updated `index.csv` file again, along with one CSV file per split.
```
[project folder]/
├── train.csv
├── validation.csv
├── test.csv
└── index.csv
```


In [11]:
# ---- DATASET CREATION: MOVING IMAGES

def moveData(df):
    # All rows of df belong to the same split, so read its name from the first row
    subset = df['set'].iloc[0]
    # Map each value of 'choice' to its destination folder
    dests = {
        'sensitive content': os.path.join(subset, 'sensitive'),
        'not-sensitive content': os.path.join(subset, 'not-sensitive'),
        'dubious content': os.path.join(subset, 'dubious'),
    }
    # Make sure the destination folders exist (os.rename does not create them)
    for dest in dests.values():
        os.makedirs(dest, exist_ok=True)

    # Create new Series to update column 'image' with the new path
    image = pd.Series(dtype='str', index=df.index)

    # Scroll through df and move each picture from the source path to the new path
    for idx, row in df.iterrows():
        src_path = row['image']
        # Get image name
        img_name = os.path.basename(src_path)

        # Rows with an unexpected 'choice' value are left where they are
        dest = dests.get(row['choice'])
        if dest is None:
            continue

        # Move picture
        dst_path = os.path.join(dest, img_name)
        os.rename(src_path, dst_path)
        image[idx] = os.path.abspath(dst_path)

    # New dataframe with updated path (drop column with old path and add column with new path)
    # Built once, after the loop, so it is defined even for an empty df
    new_df = df.drop('image', axis=1)
    new_df['image'] = image

    return new_df

train = moveData(dataset_train)
validation = moveData(dataset_val)
test = moveData(dataset_test)

# ---- DATASET CREATION: UPDATING AND SAVING INFORMATION IN CSV FILES

# Merge the three dataframes together and save index.csv file
sets_list = [train, validation, test]
index_df = pd.concat(sets_list, ignore_index=True)
index_df.to_csv('./index.csv', index=False)

# Save the different sets' dataframes into CSV files
train.to_csv('./train.csv', index=False)
validation.to_csv('./validation.csv', index=False)
test.to_csv('./test.csv', index=False)


subset:	 train
dests:	 train/sensitive/ train/not-sensitive/ train/dubious/
image name UL_NI_812.png
source path ./pictures/UL_NI_812.png
image name UL_NI_1874.png
source path ./pictures/UL_NI_1874.png
image name UL_NI_636.png
source path ./pictures/UL_NI_636.png
image name UL_NI_1456.png
source path ./pictures/UL_NI_1456.png
image name UL_NI_335.png
source path ./pictures/UL_NI_335.png
image name Q_017068.jpg
source path ./pictures/Q_017068.jpg
image name UL_NI_1446.png
source path ./pictures/UL_NI_1446.png
image name UL_NI_1602.png
source path ./pictures/UL_NI_1602.png
image name UL_NI_1783.png
source path ./pictures/UL_NI_1783.png
image name UL_NI_1644.png
source path ./pictures/UL_NI_1644.png
image name UL_NI_460.png
source path ./pictures/UL_NI_460.png
image name GVN_172.jpg
source path ./pictures/GVN_172.jpg
image name Q_052463.jpg
source path ./pictures/Q_052463.jpg
image name UL_NI_753.png
source path ./pictures/UL_NI_753.png
image name UL_NI_859.png
source path ./pictures/UL_N