# Hendrycks and Gimpel - Prepare data

Our dataset is composed of 7 classes (Alt, Big, Mac, Mil, Myc, Pse, Syl). Each class is divided into two subsets: typical (in) and atypical (out).

> **Question**  
> Can the Symptomless class be divided into typical and atypical images?

To use the Hendrycks and Gimpel (HnG) method, we need to divide the dataset into 4 parts:
1. Train: 50% of in
2. Validation: 20% of in
3. Threshold: 20% of in and 80% of out
4. Test: 10% of in and 20% of out
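
The four-way split above can be sketched as a shuffle followed by cuts at the cumulative ratios. This is only an illustration: `split_by_ratio`, `IN_RATIOS`, `OUT_RATIOS` and the toy index arrays are hypothetical names introduced here, and the notebook's own `prepare_data.split` may be implemented differently.

```python
import numpy as np

# Ratios per part (train, val, thres, test), taken from the list above.
IN_RATIOS = (0.5, 0.2, 0.2, 0.1)    # typical images
OUT_RATIOS = (0.0, 0.0, 0.8, 0.2)   # atypical images


def split_by_ratio(items, ratios, seed=0):
    """Shuffle `items` and cut them into len(ratios) consecutive parts."""
    items = np.asarray(items)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(items)
    # Cumulative cut points; the small epsilon guards against float
    # round-off when the ratios are summed (e.g. 0.7 + 0.2).
    cuts = np.floor(np.cumsum(ratios)[:-1] * len(items) + 1e-9).astype(int)
    return np.split(shuffled, cuts)


# Toy example: 100 typical and 50 atypical image indices (assumed sizes).
in_parts = split_by_ratio(np.arange(100), IN_RATIOS)
out_parts = split_by_ratio(np.arange(50), OUT_RATIOS)
print([len(p) for p in in_parts])    # [50, 20, 20, 10]
print([len(p) for p in out_parts])   # [0, 0, 40, 10]
```

Cutting at cumulative positions means every item lands in exactly one part, so the part sizes always sum to the dataset size even when a ratio does not divide it evenly.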

## 1. Initialization and load

In [1]:
# Add directory to load personal module
import sys
sys.path.append('../../structuration/python')
print(sys.path)

import os
import numpy as np
import glob
import pandas as pd
# Load module to create clusters of data
import prepare_data as pdata
# Module to manipulate directory
import tools_file as tsf
# Function to split data
from sklearn.model_selection import StratifiedShuffleSplit as sss

['/home/mpalerme/Documents/atipical-exploi/poc/HnG', '/usr/local/lib/python38.zip', '/usr/local/lib/python3.8', '/usr/local/lib/python3.8/lib-dynload', '', '/home/mpalerme/Documents/atipical-exploi/ENV/lib/python3.8/site-packages', '/home/mpalerme/Documents/atipical-exploi/ENV/lib/python3.8/site-packages/IPython/extensions', '/home/mpalerme/.ipython', '../../structuration/python']


In [2]:
# Define the paths where the images are stored
t_image_path = '/mnt/stockage/dataset_atipical_resize/'
at_image_path = '/mnt/stockage/'

# Directory names of parts
PARTS = ("train", "val", "thres", "test")
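
The `PARTS` tuple determines the output layout: one directory per part, then per label, then per image side. A minimal sketch of that layout, assuming a `<root>/<part>/<label>/<recto|verso>` tree and using `os.makedirs` in place of the personal `tools_file` helper, whose behavior is not shown here. `build_layout`, `SIDES` and the temporary root are hypothetical.

```python
import os
import tempfile

PARTS = ("train", "val", "thres", "test")
SIDES = ("recto", "verso")  # assumed names of the two image sides


def build_layout(root, labels, parts=PARTS, sides=SIDES):
    """Create root/<part>/<label>/<side> and return the created paths."""
    created = []
    for part in parts:
        for label in labels:
            for side in sides:
                path = os.path.join(root, part, label, side)
                # exist_ok=True makes the call safe to repeat
                os.makedirs(path, exist_ok=True)
                created.append(path)
    return created


with tempfile.TemporaryDirectory() as root:
    paths = build_layout(root, labels=("Alt", "Big"))
    print(len(paths))  # 16 = 4 parts x 2 labels x 2 sides
```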

## 2. Split data
We distribute the images of each label into 4 directories (train, val, thres, test).
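
Each recto image is linked together with its verso counterpart. The sketch below shows one way to derive the verso path and create both symbolic links, assuming a `<label>/recto/<file>` and `<label>/verso/<file>` source layout with `recto`/`Recto` possibly appearing in the file name. `verso_path`, `link_pair` and the file names are hypothetical, not part of the notebook's modules.

```python
import os
import tempfile
from pathlib import Path


def verso_path(recto_file):
    """Return the verso counterpart of a recto image path.

    Only the side directory and the file name are rewritten, so a 'recto'
    substring elsewhere in the path (e.g. inside 'directory') is untouched.
    """
    recto_file = Path(recto_file)
    name = recto_file.name.replace("recto", "verso").replace("Recto", "Verso")
    return recto_file.parent.parent / "verso" / name


def link_pair(recto_file, dest_label_dir):
    """Symlink a recto image and its verso counterpart under dest_label_dir."""
    links = []
    sources = (("recto", Path(recto_file)), ("verso", verso_path(recto_file)))
    for side, src in sources:
        dest = Path(dest_label_dir) / side / src.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.is_symlink():  # skip links left by a previous run
            os.symlink(src.resolve(), dest)
        links.append(dest)
    return links


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src" / "Myc"
    for side, name in (("recto", "img1_Recto.jpg"), ("verso", "img1_Verso.jpg")):
        (src / side).mkdir(parents=True, exist_ok=True)
        (src / side / name).touch()
    links = link_pair(src / "recto" / "img1_Recto.jpg",
                      Path(tmp) / "data" / "train" / "Myc")
    print([link.name for link in links])  # ['img1_Recto.jpg', 'img1_Verso.jpg']
```

Checking `is_symlink()` before linking is a design choice for notebooks: `os.symlink` raises `FileExistsError` when the link already exists, which would otherwise interrupt a cell that is run twice.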

In [3]:
def distribute(dir_in, splits=(0.5, 0.2, 0.2, 0.1)):
    """
    Distribute images into 4 directories (train, val, thres, test).

    Parameters
    ----------
    dir_in : str
        Directory containing one subdirectory of images per label.

    splits : tuple
        Ratio of the dataset assigned to each directory.

    Returns
    -------
    None
    """
    # Take the absolute path of the input directory
    abs_dir_in = os.path.abspath(os.path.expanduser(dir_in))

    # Get all elements in input directory
    all_elemt = glob.glob(os.path.join(abs_dir_in, '*'))

    # Get all directories of input directory
    all_dir = [x for x in all_elemt if os.path.isdir(x)]

    # Collect the image names of each label
    for one_dir in all_dir:
        # Extract label
        label = os.path.basename(one_dir)
        print("label: " + label)
        # Get all recto file names
        all_files = glob.glob(os.path.join(one_dir, 'recto', '*.*'))
        # Create dictionary
        tmp = {'image': all_files, 'label': [label] * len(all_files)}
        print('total files: ' + str(len(all_files)))
        # Convert the dictionary into a DataFrame
        tmp = pd.DataFrame(tmp)
        
        # Define distribution
        distri = pdata.split(splits, tmp)
        
        for part, info_images in zip(PARTS, distri):
            print(part + ": " + str(len(info_images)) + " images is " +
                  str(len(info_images)/len(all_files)) + "%")
            
            # create directories for label
            tsf.create_directory(os.path.join("./data", part, label))
            tsf.create_directory(os.path.join("./data", part, label, 'recto'))
            tsf.create_directory(os.path.join("./data", part, label, 'verso'))

            for _, row in info_images.iterrows():
                # Get name of image
                name_image = os.path.basename(row.image)
                
                # Create symbolic link of recto
                os.symlink(row.image,
                           os.path.join("./data", part, label, 'recto',
                                        name_image))
                
                # Build the path of the verso image from the recto path
                verso_image = row.image.replace('recto', 'verso')
                verso_image = verso_image.replace('Recto', 'Verso')

                # Get name of the verso image
                name_image = os.path.basename(verso_image)

                # Create symbolic link of verso
                os.symlink(verso_image,
                           os.path.join("./data", part, label, 'verso',
                                        name_image))


# Distribute typical dataset
distribute(t_image_path)

label: Myc
total files: 768
train: 384 images is 0.5%
val: 153 images is 0.19921875%
thres: 154 images is 0.20052083333333334%
test: 77 images is 0.10026041666666667%
label: Mil
total files: 100
train: 50 images is 0.5%
val: 20 images is 0.2%
thres: 20 images is 0.2%
test: 10 images is 0.1%
label: Big
total files: 665
train: 332 images is 0.4992481203007519%
val: 133 images is 0.2%
thres: 133 images is 0.2%
test: 67 images is 0.10075187969924812%
label: Pse
total files: 780
train: 390 images is 0.5%
val: 156 images is 0.2%
thres: 156 images is 0.2%
test: 78 images is 0.1%
label: Syl
total files: 857
train: 428 images is 0.49941656942823803%
val: 171 images is 0.19953325554259044%
thres: 172 images is 0.20070011668611434%
test: 86 images is 0.10035005834305717%
label: Alt
total files: 988
train: 494 images is 0.5%
val: 197 images is 0.19939271255060728%
thres: 198 images is 0.20040485829959515%
test: 99 images is 0.10020242914979757%
label: Mac
total files: 750
train: 375 images is 0.5%