# Cacao Datasets Data Cleaning Process

In this data cleaning process, our team balanced the number of images in each class of the dataset.

In [16]:
from tensorflow.keras.utils import image_dataset_from_directory
from random import sample
import os

## Data Description

In [6]:
data_dir = image_dataset_from_directory(
    'cacao_photos/',
    label_mode='int')


Found 4390 files belonging to 3 classes.


In [7]:
class_names = data_dir.class_names

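def count_data_per_class() -> None:
    """Print the number of files in each class folder."""
    for class_name in class_names:
        print(class_name, ":", len(os.listdir(os.path.join("cacao_photos", class_name))))

def get_class_count(class_name: str) -> int:
    """Return the number of files in a class folder, or 0 for an unknown class."""
    if class_name in class_names:
        return len(os.listdir(os.path.join("cacao_photos", class_name)))
    return 0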

count_data_per_class()
get_class_count('black_pod_rot')

black_pod_rot : 943
healthy : 3344
pod_borer : 103


943

The counts above show that the classes are imbalanced. Our team therefore decided to *decrease the number of images in the healthy class* down to the size of the black_pod_rot class (943) and to *increase the number of images in the pod_borer class up to 943* using **data augmentation**.
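The size of each adjustment follows directly from the printed counts. The short sketch below works through that arithmetic; the `counts` dictionary simply restates the output above, and the names `target` and `adjustments` are introduced here for illustration only.

```python
# Class counts as printed by count_data_per_class() above.
counts = {"black_pod_rot": 943, "healthy": 3344, "pod_borer": 103}

# black_pod_rot is the reference class, as stated in the text.
target = counts["black_pod_rot"]

# Positive values are images to add, negative values are images to remove.
adjustments = {name: target - n for name, n in counts.items()}

for name, delta in adjustments.items():
    action = "add" if delta > 0 else "remove" if delta < 0 else "keep"
    print(f"{name}: {action} {abs(delta)}")

# black_pod_rot: keep 0
# healthy: remove 2401
# pod_borer: add 840
```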

In [10]:
healthy_path = "cacao_photos/healthy"
healthy = os.listdir(healthy_path)
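# Randomly choose the surplus healthy images and delete them permanently from disk.
surplus = get_class_count('healthy') - get_class_count('black_pod_rot')
for file in sample(healthy, surplus):
    os.remove(os.path.join(healthy_path, file))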

Reduce the healthy image count from 3344 to 943.
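Because `os.remove` deletes files permanently and `sample` picks a different subset on every run, this step cannot be undone or reproduced exactly. As a possible alternative (not what the notebook does), the sketch below moves the surplus files to a backup folder and uses a fixed seed. The function name `undersample_by_moving`, its parameters, and the backup location are all hypothetical.

```python
import os
import random
import shutil


def undersample_by_moving(class_dir, target_count, backup_dir, seed=0):
    """Move randomly chosen surplus files out of class_dir; return how many were moved."""
    # Sort first so the selection does not depend on the order os.listdir returns.
    files = sorted(
        f for f in os.listdir(class_dir)
        if os.path.isfile(os.path.join(class_dir, f))
    )
    surplus = len(files) - target_count
    if surplus <= 0:
        return 0  # already at or below the target, nothing to do

    os.makedirs(backup_dir, exist_ok=True)
    rng = random.Random(seed)  # fixed seed so the same files are picked on every run
    for name in rng.sample(files, surplus):
        shutil.move(os.path.join(class_dir, name), os.path.join(backup_dir, name))
    return surplus
```

A call such as `undersample_by_moving("cacao_photos/healthy", 943, "cacao_backup/healthy")` would leave 943 files in the class folder. The backup folder should sit outside `cacao_photos/`, since `image_dataset_from_directory` treats every sub-folder there as a class.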

In [13]:

count_data_per_class()

black_pod_rot : 943
healthy : 943
pod_borer : 103


Augment the pod_borer dataset to increase its count from 103 to 943.

In [14]:
import Augmentor

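# The output path appears to be resolved relative to source_directory (see the log below),
# so it points back to cacao_photos/pod_borer and the new images are saved beside the originals.
pod_borer_augmentation_pipeline = Augmentor.Pipeline(source_directory="cacao_photos/pod_borer",
                                                    output_directory="cacao_photos/pod_borer/../../")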

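# Each operation is applied to a generated image with the given probability.
pod_borer_augmentation_pipeline.rotate(probability=0.6, max_left_rotation=10, max_right_rotation=10)
pod_borer_augmentation_pipeline.skew_top_bottom(probability=0.3, magnitude=0.7)
pod_borer_augmentation_pipeline.skew_left_right(probability=0.3, magnitude=0.7)
pod_borer_augmentation_pipeline.flip_random(probability=0.3)

# Generate the 840 images needed to bring pod_borer from 103 up to 943.
pod_borer_augmentation_pipeline.sample(get_class_count('black_pod_rot') - get_class_count('pod_borer'))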

Initialised with 103 image(s) found.
Output directory set to cacao_photos/pod_borer\cacao_photos/pod_borer/../../.

Processing <PIL.Image.Image image mode=RGB size=1080x1080 at 0x20D9F7520B0>: 100%|██████████| 840/840 [00:56<00:00, 14.78 Samples/s]                  
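The "Output directory set to" line suggests that Augmentor joined the relative `output_directory` onto `source_directory`. Under that assumption, the sketch below shows why the augmented images end up in the `pod_borer` folder itself. It uses only `posixpath` from the standard library and does not call Augmentor.

```python
import posixpath

source_directory = "cacao_photos/pod_borer"
output_directory = "cacao_photos/pod_borer/../../"

# Assumption: a relative output_directory is joined onto source_directory,
# which is what the "Output directory set to ..." log line suggests.
joined = posixpath.join(source_directory, output_directory)

# Collapsing the two ".." segments cancels the repeated cacao_photos/pod_borer part.
resolved = posixpath.normpath(joined)

print(joined)    # cacao_photos/pod_borer/cacao_photos/pod_borer/../../
print(resolved)  # cacao_photos/pod_borer
```

If that assumption holds, `output_directory="."` would name the same folder more directly.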


Check whether the dataset is now balanced.

In [15]:

count_data_per_class()

black_pod_rot : 943
healthy : 943
pod_borer : 943
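The check above is done by reading the printed counts. A minimal sketch of an automated version follows. It counts only files with common image extensions, so stray sub-folders or non-image files do not inflate a class. The helper names `image_counts` and `is_balanced` and the extension list are assumptions made for illustration.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed file types


def image_counts(root):
    """Return {class_name: number of image files} for each sub-folder of root."""
    counts = {}
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip loose files sitting next to the class folders
        counts[class_name] = sum(
            1 for f in os.listdir(class_dir)
            if f.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


def is_balanced(counts):
    """True when every class has the same number of images."""
    return len(set(counts.values())) <= 1
```

For this dataset, `is_balanced(image_counts("cacao_photos"))` would be expected to return `True` after the steps above.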
