# **Data Preprocessing**

## 1. Introduction

## Objectives

* Clean the dataset by removing non-image files, then split it into train, validation and test sets

## Inputs

* We need the **cherry_leaves_dataset** for this task

## Outputs

* The **cherry leaves dataset** will be clean and ready for visualisation


---

## 2. Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [4]:
import os
current_dir = os.getcwd()
print(f"\x1b[32m{current_dir}\x1b[0m")

/workspaces/Portfolio-Project-5/jupyter_notebooks


We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [5]:
os.chdir(os.path.dirname(current_dir))
print("\x1b[32mYou set a new current directory\x1b[0m")

You set a new current directory


Confirm the new current directory

In [6]:
current_dir = os.getcwd()
print(f"\x1b[32m{current_dir}\x1b[0m")

[32m/workspaces/Portfolio-Project-5[0m


## 3. Delete non-image files

In this section, we'll perform dataset cleaning by removing non-image files.
- We'll start by defining a function for this task in the first cell
- In the second cell we define our dataset and execute the cleaning process

*This function was taken from the Code Institute walkthrough project "Malaria Detector".*

In [11]:
def remove_non_image_file(my_data_dir):
    """Delete every file in each sub-folder of my_data_dir that is not an image.

    Assumes my_data_dir holds one folder per class and that those folders contain only files.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [12]:
my_data_dir='inputs/datasets/cherry-leaves'
remove_non_image_file(my_data_dir)
print("\x1b[32mTask completed.\x1b[0m")

IsADirectoryError: [Errno 21] Is a directory: 'inputs/datasets/cherry-leaves/validation/powdery_mildew'
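
The `IsADirectoryError` above points at `validation/powdery_mildew`, which suggests the cell was run after the dataset had already been split, so the first-level folders contain class folders rather than image files. A minimal sketch that handles either layout by walking the whole tree with `os.walk`; the function name and the `dry_run` parameter are hypothetical additions, not part of the original function:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def remove_non_image_files_recursive(data_dir, dry_run=True):
    """Count image files and collect (optionally delete) non-image files at any depth."""
    kept = 0
    removed = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                kept += 1
            else:
                path = os.path.join(root, name)
                if not dry_run:
                    os.remove(path)  # only files reach this point, never folders
                removed.append(path)
    return kept, removed
```

With `dry_run=True` nothing is deleted, so the returned list can be reviewed before running the function again with `dry_run=False`.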

---

## 4. Data Splitting

In this section we will split our **cherry_leaves_dataset** into 3 datasets.

- **Train (70%)**
    - Healthy
    - Powdery Mildew
- **Validation (10%)**
    - Healthy
    - Powdery Mildew
- **Test (20%)**
    - Healthy
    - Powdery Mildew

*This function was taken from the Code Institute walkthrough project "Malaria Detector".*

In [6]:
import os
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a small tolerance, since float sums are not always exact
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels; my_data_dir should contain only the class folders
    labels = os.listdir(my_data_dir)
    if 'test' not in labels:  # do nothing if the dataset has already been split
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

---
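
`random.shuffle` in the function above is unseeded, so each run on a fresh copy of the dataset produces a different split. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project; the helper name `shuffled_copy` and the seed value `42` are hypothetical:

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that is repeatable for a given seed."""
    rng = random.Random(seed)   # private generator, leaves the global random state untouched
    files = sorted(file_names)  # sort first so the result does not depend on os.listdir order
    rng.shuffle(files)
    return files
```

One way to apply it would be to replace `random.shuffle(files)` with `files = shuffled_copy(files)` inside the split function.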

Running the following cell splits the data into the train, validation and test sets.

In [9]:
split_train_validation_test_images(my_data_dir,
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )

The following cell counts the images in each set and calculates each set's share of the total.

In [127]:
dir_path = 'inputs/datasets/cherry-leaves'
folders = os.listdir(dir_path)
validation  = 0
train = 0
test = 0

for folder in folders:
    count = 0
    for _root, _dirs, files in os.walk(os.path.join(dir_path, folder)):
        count += len(files)
    if folder == "validation":
        validation = count
    elif folder == "test":
        test = count
    elif folder == "train":
        train = count

total = test + train + validation
testp = f"{test / total:.0%}"
validationp = f"{validation / total:.0%}"
trainp = f"{train / total:.0%}"

print(f"Total amount - {total}")
print(f"Train - {trainp} - {train} images")
print(f"Validation - {validationp} - {validation} images")
print(f"Test - {testp} - {test} images")

Total amount - 4208
Train - 70% - 2944 images
Validation - 10% - 420 images
Test - 20% - 844 images
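
The test set holds 844 images rather than a rounded 20% of 4208 because the split function truncates the train and validation counts with `int()` and moves all remaining files to the test set. A minimal sketch of that arithmetic; the helper name `split_counts` is hypothetical, and the figure of 2104 images per class is an assumption that is consistent with the totals above:

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Mirror the counting logic of the split function for one class folder."""
    train = int(n_files * train_ratio)            # truncated, not rounded
    validation = int(n_files * validation_ratio)  # truncated, not rounded
    test = n_files - train - validation           # everything left over
    return train, validation, test


# Assumption: two classes with 2104 images each (4208 in total)
train, validation, test = split_counts(2104, 0.7, 0.1)
print(train * 2, validation * 2, test * 2)  # 2944 420 844
```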


---

## 5. Conclusions and Next Steps

In this notebook, we cleaned the data and divided it into three sets: the training set, the validation set and the test set. This preparation is needed to train and evaluate machine learning models. The next steps are data visualisation and modelling.