# Data Preparation Notebook

This notebook shuffles the anemic and non-anemic images and splits each class into a training set and a validation set.

Prerequisites:
- Sort the images into anemic and non-anemic directories in advance (this can be done manually).
- Make sure neither directory contains subdirectories, because every entry in them is treated as an image file to copy.
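
The second prerequisite can be checked before running the rest of the notebook. The helper below is a minimal sketch; `find_subdirectories` is a hypothetical name that is not part of the original notebook, and it only inspects the top level of the directory it is given.

```python
import os


def find_subdirectories(class_dir):
    """Return the sorted names of entries in class_dir that are directories."""
    # os.scandir yields one entry per item directly inside class_dir.
    return sorted(entry.name for entry in os.scandir(class_dir) if entry.is_dir())
```

Calling it on `./anemic/` and `./non-anemic/` should return an empty list for both when the prerequisite holds.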

The notebook produces a dataset directory with the following structure:

```
dataset/
|-- train/
|   |-- anemic/
|   |   |-- .jpg
|   |   |-- .jpg
|   |   |-- ...
|   |
|   |-- non-anemic/
|       |-- .jpg
|       |-- .jpg
|       |-- ...
|
|-- valid/
|   |-- anemic/
|   |   |-- .jpg
|   |   |-- .jpg
|   |   |-- ...
|   |
|   |-- non-anemic/
|       |-- .jpg
|       |-- .jpg
|       |-- ...
```

In [1]:
import os
import shutil
from random import shuffle

## Main Directories

In [16]:
anemic_dir = "./anemic/"
non_anemic_dir = "./non-anemic/"

## Create Dataset Directories

In [23]:
dataset_dir = "./dataset/"
os.makedirs(dataset_dir, exist_ok=True)  # does not fail if the directory already exists

In [24]:
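# Create train/ and valid/, each with anemic/ and non-anemic/ subdirectories.
for subdir_1 in ["train/", "valid/"]:
    path_1 = os.path.join(dataset_dir, subdir_1)
    os.makedirs(path_1, exist_ok=True)
    for subdir_2 in ["anemic/", "non-anemic/"]:
        path_2 = os.path.join(path_1, subdir_2)
        os.makedirs(path_2, exist_ok=True)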

## List the file paths, shuffle, and split (70% train, 30% validation)

In [29]:
anemic_path_list = [os.path.join(anemic_dir, file) for file in os.listdir(anemic_dir)]
non_anemic_path_list = [os.path.join(non_anemic_dir, file) for file in os.listdir(non_anemic_dir)]

In [31]:
shuffle(anemic_path_list)
shuffle(non_anemic_path_list)
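
The shuffle above produces a different split on every run. If a reproducible split is wanted, one option is a seeded generator, sketched below. The function name `shuffled_copy` and the seed value `42` are assumptions made for illustration and do not come from the original notebook.

```python
import random


def shuffled_copy(paths, seed=42):
    """Return a shuffled copy of paths; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    result = sorted(paths)     # sort first, since os.listdir order is arbitrary
    rng.shuffle(result)
    return result
```

Unlike `shuffle`, this returns a new list instead of modifying its argument in place, so the result has to be assigned, for example `anemic_path_list = shuffled_copy(anemic_path_list)`.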

In [32]:
# int() truncates, so any leftover file goes to the validation split.
train_size_anemic = int(len(anemic_path_list) * 0.7)
train_size_non_anemic = int(len(non_anemic_path_list) * 0.7)
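
Because `int()` truncates toward zero, the training share is rounded down and the validation set receives the remainder. The sketch below makes that arithmetic explicit; `split_sizes` is a hypothetical helper, not part of the original notebook.

```python
def split_sizes(n_files, train_fraction=0.7):
    """Return (train_count, valid_count) for a class with n_files images."""
    train_count = int(n_files * train_fraction)  # rounded down
    valid_count = n_files - train_count          # everything that is left
    return train_count, valid_count
```

For very small classes the rounding matters: a class with a single image gets 0 training files and 1 validation file.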

In [34]:
# Each list of source files is paired with the destination at the same index.
source_paths = [anemic_path_list[:train_size_anemic],
                anemic_path_list[train_size_anemic:],
                non_anemic_path_list[:train_size_non_anemic],
                non_anemic_path_list[train_size_non_anemic:]]

destination_path = ["./dataset/train/anemic/",
                    "./dataset/valid/anemic/",
                    "./dataset/train/non-anemic/",
                    "./dataset/valid/non-anemic/"]

# Copy (not move) the files, so the original directories stay intact.
for sources, destination in zip(source_paths, destination_path):
    for source_path in sources:
        shutil.copy(source_path, destination)

In [38]:
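# Check that each training directory holds the expected number of files.
if train_size_anemic == len(os.listdir(destination_path[0])):
    print("OK")
else:
    print("Mismatch in", destination_path[0])

if train_size_non_anemic == len(os.listdir(destination_path[2])):
    print("OK")
else:
    print("Mismatch in", destination_path[2])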

OK
OK
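
The check above covers only the two training directories. A slightly broader sketch that compares expected and actual counts for any set of directories might look like this. Both `count_files` and `verify_split` are hypothetical helpers introduced for illustration.

```python
import os


def count_files(directory):
    """Count the regular files directly inside directory."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())


def verify_split(expected_counts):
    """Compare expected file counts with what is on disk.

    expected_counts maps a directory path to the expected number of files.
    Returns a dict of {directory: (expected, actual)} for every mismatch,
    so an empty dict means all directories match.
    """
    mismatches = {}
    for directory, expected in expected_counts.items():
        actual = count_files(directory)
        if actual != expected:
            mismatches[directory] = (expected, actual)
    return mismatches
```

In this notebook, the expected count for each destination would be the length of the corresponding list in `source_paths`. The comparison is only meaningful if the destination directories were empty before copying.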
