# DATA20001 Deep Learning - Group Project
## Image project
## Explore Data

This notebook is dedicated to exploratory data analysis of the label distributions in the provided dataset.
- See [these PyTorch custom dataset examples](https://github.com/utkuozbulak/pytorch-custom-dataset-examples) for creating a custom data loader

### Exploring:
- Basic summary statistics
- Whether the data is already shuffled, i.e. whether it can be split directly while preserving the label distributions

In [1]:
# automatically reload dependencies and repository content so that kernel need not be restarted
%load_ext autoreload
%autoreload 2

In [2]:
# Import dependencies
import utils

import numpy as np
import pandas as pd
from matplotlib.pyplot import imread

### Build the image-file to labels mapping

Build the image filename <-> labels CSV if it doesn't already exist. The file is saved as `./file_to_labels_table.csv`.

In [3]:
# check how long it takes to construct the csv
import time
start = time.time()

utils.build_imgfile_to_labels_csv()

end = time.time()
print(f"elapsed time {round(end-start, 2)} s")

elapsed time 0.15 s
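
`utils.build_imgfile_to_labels_csv` is not shown in this notebook. Purely as an illustration of the build-only-if-missing pattern, here is a sketch with a hypothetical function `build_table_if_missing` and a hypothetical input format (a dict from label name to the set of filenames carrying that label). The project's actual helper may work quite differently.

```python
from pathlib import Path
import tempfile

import pandas as pd


def build_table_if_missing(csv_path, filenames, label_to_files):
    """Write a multi-hot filename/label table unless csv_path already exists.

    Hypothetical helper: returns True if the file was written,
    False if it was already there and the build was skipped.
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        return False
    table = pd.DataFrame({"filename": filenames})
    for label, files in label_to_files.items():
        # 1 if the image carries the label, 0 otherwise
        table[label] = table["filename"].isin(files).astype(int)
    table.to_csv(csv_path, index=False)
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "file_to_labels_table.csv"
    names = ["im1.jpg", "im2.jpg", "im3.jpg"]           # made-up filenames
    labels = {"people": {"im1.jpg"}, "tree": {"im1.jpg", "im3.jpg"}}
    first = build_table_if_missing(path, names, labels)   # builds the file
    second = build_table_if_missing(path, names, labels)  # skips the rebuild
    table = pd.read_csv(path)

print(first, second)
print(table)
```

Skipping the rebuild when the file exists would be consistent with the very short elapsed time reported above, although the notebook does not say whether the file was already present on that run.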


### Load the data

In [4]:
df = pd.read_csv("file_to_labels_table.csv")
df.head()

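| | filename | clouds | male | bird | dog | river | portrait | baby | night | people | female | sea | tree | car | flower |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | im1.jpg | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | im2.jpg | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | im3.jpg | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | im4.jpg | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | im5.jpg | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |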


### Basic summary stats


In [5]:
_df = df.iloc[:, 1:]  # remove img-filename-col
col_label_counts = _df.sum(axis=0)
row_label_counts = _df.sum(axis=1)
zero_rows = (row_label_counts == 0)
nonzero_rows = (row_label_counts != 0)

print(f"# of samples: {_df.shape[0]}")
print(f"# of samples with no labels at all: {zero_rows.sum()}")
print(f"# of samples with at least one label: {nonzero_rows.sum()}")
print()
print(f"Summary stats for samples with labels:\
    \n# samples: {nonzero_rows.sum()}\
    \n# labels in total: {row_label_counts[nonzero_rows].sum()}\
    \n# labels mean (for sample): {round(row_label_counts[nonzero_rows].mean(), 3)}\
    \n# labels median (for sample): {row_label_counts[nonzero_rows].median()}\
    \nMax # of labels (for sample): {row_label_counts[nonzero_rows].max()}\
    \nMin # of labels (for sample): {row_label_counts[nonzero_rows].min()}"
)
print()


# of samples: 20000
# of samples with no labels at all: 9824
# of samples with at least one label: 10176

Summary stats for samples with labels:    
# samples: 10176    
# labels in total: 20224    
# labels mean (for sample): 1.987    
# labels median (for sample): 2.0    
Max # of labels (for sample): 5    
Min # of labels (for sample): 1
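
The mean and median do not show how the per-sample label counts are spread between the minimum and maximum. A minimal sketch of the full histogram, using a tiny made-up label table in place of `_df` (the `toy` frame and its values are illustrative assumptions, not project data):

```python
import pandas as pd

# Toy multi-hot label table standing in for `_df` (values are made up).
toy = pd.DataFrame(
    {
        "people": [1, 0, 1, 1, 0],
        "male":   [0, 0, 1, 1, 0],
        "tree":   [0, 0, 0, 1, 1],
    }
)

# Number of labels attached to each sample.
labels_per_sample = toy.sum(axis=1)

# How many samples carry 0, 1, 2, ... labels, ordered by label count.
histogram = labels_per_sample.value_counts().sort_index()
print(histogram)
```

Applied to `_df`, the same two lines would show how the samples divide among 0 to 5 labels.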



In [6]:
# counts and percentages as two rows ('#', '%'), one column per label
label_df = pd.concat(
    [col_label_counts, col_label_counts / _df.shape[0] * 100],
    axis=1, keys=['#', '%'],
).transpose()
label_df

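| | clouds | male | bird | dog | river | portrait | baby | night | people | female | sea | tree | car | flower |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | 1095.0 | 2979.0 | 360.0 | 448.0 | 120.0 | 3121.0 | 95.0 | 598.0 | 6403.0 | 3227.0 | 173.0 | 525.0 | 319.0 | 761.0 |
| % | 5.475 | 14.895 | 1.8 | 2.24 | 0.6 | 15.605 | 0.475 | 2.99 | 32.015 | 16.135 | 0.865 | 2.625 | 1.595 | 3.805 |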
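
Per-label counts do not show which labels appear together on the same image, for example `people` with `male`, `female` or `portrait` as in the first rows of the loaded table. As a sketch, a co-occurrence matrix can be computed from a multi-hot table with a single matrix product; the `toy` frame below is a made-up stand-in for `_df`:

```python
import pandas as pd

# Toy multi-hot label table standing in for `_df` (values are made up).
toy = pd.DataFrame(
    {
        "people":   [1, 1, 1, 0],
        "male":     [1, 0, 1, 0],
        "portrait": [0, 1, 1, 0],
    }
)

# Entry (a, b) counts the samples tagged with both a and b;
# the diagonal holds the plain per-label counts.
cooccurrence = toy.T.dot(toy)
print(cooccurrence)
```

The matrix is symmetric, so only the upper or lower triangle needs to be inspected.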


### Check whether a simple train/val split preserves the original label distribution

We don't know whether the provided data is shuffled.

- Check the label distribution of the last 2k samples in the original order, _before_ shuffling
- Check the label distribution of a random 2k-sample (10%) validation split taken _after_ shuffling

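**Conclusion**: The per-label percentages of the last 2k rows and of a random 10% sample both stay within roughly 1.2 percentage points of the full-dataset percentages. This suggests that the data is already shuffled (or at least not ordered by label) and that simply sampling from the complete dataset preserves the label distribution reasonably well. Rare labels vary more in relative terms, e.g. `sea` at 0.45% and 1.15% against 0.865% overall.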
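
If a random split is used, seeding the generator keeps the train/validation partition reproducible across runs. A minimal sketch, assuming a hypothetical helper `split_indices` and an arbitrary seed value (neither appears in this notebook's code):

```python
import numpy as np


def split_indices(n_samples, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx) drawn from a seeded random permutation.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    # the first n_val shuffled indices form the validation set
    return perm[n_val:], perm[:n_val]


train_idx, val_idx = split_indices(20000, val_fraction=0.1, seed=42)
print(len(train_idx), len(val_idx))
```

The returned index arrays can be passed to `DataFrame.iloc` to select the corresponding rows.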

In [7]:
# see whether splitting into training/validation sets requires shuffling the df first
val_df_not_shuffled = _df.iloc[-2000:, :]
col_label_counts_not_shuffled = val_df_not_shuffled.sum(axis=0)

# shuffle the row indices and perform the train/validation split
arr = np.arange(_df.shape[0])
np.random.shuffle(arr)
# first 90% is 'training', remaining 10% is viewed as the validation set
split_index = int(arr.shape[0]*.9)
idx = arr[split_index:]
val_df_shuffled = _df.iloc[idx, :]
col_label_counts_shuffled = val_df_shuffled.sum(axis=0)

# stack counts and percentages of both candidate validation sets into one table
_label_df = pd.concat(
    [
        col_label_counts_not_shuffled,
        col_label_counts_not_shuffled / val_df_not_shuffled.shape[0] * 100,
        col_label_counts_shuffled,
        col_label_counts_shuffled / val_df_shuffled.shape[0] * 100,
    ],
    axis=1,
    keys=['NotShuffled #', 'NotShuffled %', 'Shuffled #', 'Shuffled %'],
).transpose()
_label_df

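| | clouds | male | bird | dog | river | portrait | baby | night | people | female | sea | tree | car | flower |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NotShuffled # | 128.0 | 295.0 | 40.0 | 51.0 | 8.0 | 305.0 | 11.0 | 36.0 | 659.0 | 303.0 | 9.0 | 36.0 | 26.0 | 74.0 |
| NotShuffled % | 6.4 | 14.75 | 2.0 | 2.55 | 0.4 | 15.25 | 0.55 | 1.8 | 32.95 | 15.15 | 0.45 | 1.8 | 1.3 | 3.7 |
| Shuffled # | 117.0 | 289.0 | 39.0 | 54.0 | 11.0 | 296.0 | 6.0 | 58.0 | 618.0 | 315.0 | 23.0 | 65.0 | 26.0 | 72.0 |
| Shuffled % | 5.85 | 14.45 | 1.95 | 2.7 | 0.55 | 14.8 | 0.3 | 2.9 | 30.9 | 15.75 | 1.15 | 3.25 | 1.3 | 3.6 |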
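
The comparison can be condensed into one number per candidate validation set: the largest absolute gap, in percentage points, between its label percentages and those of the full dataset. The sketch below hard-codes the percentages printed in this notebook's output tables; the variable names are illustrative.

```python
import pandas as pd

labels = ["clouds", "male", "bird", "dog", "river", "portrait", "baby",
          "night", "people", "female", "sea", "tree", "car", "flower"]

# Percentages copied from the output tables in this notebook.
full = pd.Series(
    [5.475, 14.895, 1.8, 2.24, 0.6, 15.605, 0.475,
     2.99, 32.015, 16.135, 0.865, 2.625, 1.595, 3.805], index=labels)
not_shuffled = pd.Series(
    [6.4, 14.75, 2.0, 2.55, 0.4, 15.25, 0.55,
     1.8, 32.95, 15.15, 0.45, 1.8, 1.3, 3.7], index=labels)
shuffled = pd.Series(
    [5.85, 14.45, 1.95, 2.7, 0.55, 14.8, 0.3,
     2.9, 30.9, 15.75, 1.15, 3.25, 1.3, 3.6], index=labels)

for name, split in [("NotShuffled", not_shuffled), ("Shuffled", shuffled)]:
    gap = (split - full).abs()  # per-label gap in percentage points
    print(f"{name}: largest gap {gap.max():.3f} pp for '{gap.idxmax()}'")
```

With these values the largest gap is about 1.19 percentage points (`night`) for the unshuffled tail and about 1.12 (`people`) for the random sample. Since the shuffle in the notebook is unseeded, the second figure will differ between runs.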
