# **3 – Data Preparation**

## Objectives

* Clean and prepare image data for training
* Ensure consistency in image dimensions and file types
* Split data into training, validation, and test sets
* Prepare the folder structure needed for modelling

## Inputs

* inputs/dataset/raw/cherry-leaves/

## Outputs

* Cleaned images in appropriate folders for modelling

---

# Change working directory

Change the working directory from its current folder to its parent folder

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\amyno\\OneDrive\\Documents\\CherryLeafProject\\milestone-project-mildew-detection-in-cherry-leaves\\jupyter_notebooks'

Make the parent of the current directory the new current directory:
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\amyno\\OneDrive\\Documents\\CherryLeafProject\\milestone-project-mildew-detection-in-cherry-leaves'

---

# Identify and remove any non-image files

Whilst I expect a Kaggle dataset to be relatively uniform, any non-image files could cause errors during image processing later on. This step ensures that only files ending in .jpg, .jpeg, or .png are used.

Define the current image path and types of data allowed

In [4]:
raw_path = os.path.join("inputs", "dataset", "raw", "cherry-leaves")

valid_extensions = [".jpg", ".jpeg", ".png"]

Create a list to track non-image files

In [5]:
non_image_files = []

Loop through each class folder and record any file whose extension is not in the allowed list

In [6]:
for class_name in os.listdir(raw_path):
    class_folder = os.path.join(raw_path, class_name)

    for file in os.listdir(class_folder):
        file_path = os.path.join(class_folder, file)

        if os.path.splitext(file)[1].lower() not in valid_extensions:
            non_image_files.append(file_path)

non_image_files

[]

The code returned an empty list, meaning there are no non-image files to deal with

In [7]:
for file_path in non_image_files:
    os.remove(file_path)

print(f"{len(non_image_files)} non image files removed.")

0 non image files removed.


This confirms there were no non-image files, as none were deleted
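The extension check used above can also be wrapped in a small helper, which makes it easy to try on a few file names before touching the dataset. This is a minimal sketch: `is_image_file` and the example file names are hypothetical and are not part of the notebook.

```python
import os

# Same extensions as valid_extensions above, stored as a set for fast lookup
VALID_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_image_file(filename, valid_extensions=VALID_EXTENSIONS):
    """Return True when the file name ends in one of the allowed extensions."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in valid_extensions


# Hypothetical file names, used only for illustration
examples = ["leaf_001.JPG", "leaf_002.png", "notes.txt", "Thumbs.db", "archive.tar.gz"]
flagged = [name for name in examples if not is_image_file(name)]
print(flagged)
```

Lower-casing the extension means files such as `leaf_001.JPG` are still accepted. Note that this only inspects the file name; it does not verify that the file contents are a valid image.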

---

# Check image dimensions

Check whether all images in the dataset are the same size, as consistent dimensions make it easier for the model to identify the images and therefore make predictions. If they are not, they will need to be standardised first.

Import the Image class from PIL (Pillow)

In [8]:
from PIL import Image

Track the current sizes of all images in the dataset

In [9]:
image_sizes = []

for class_name in os.listdir(raw_path):
    class_folder = os.path.join(raw_path, class_name)  # approach adapted from the Python documentation (ref. in readme)

    for img_name in os.listdir(class_folder):
        img_path = os.path.join(class_folder, img_name)
        with Image.open(img_path) as img:
            image_sizes.append(img.size)

set(image_sizes)

{(256, 256)}

Every image has now been iterated over and its size recorded. The set above contains a single size, (256, 256), so all images are the same size: square, 256 pixels on each side.
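If a dataset did contain more than one size, counting how often each size occurs would show how much standardising is needed. The sketch below assumes a list of `(width, height)` tuples like `image_sizes`; the `summarise_sizes` helper and the sample values are hypothetical and only for illustration.

```python
from collections import Counter


def summarise_sizes(sizes):
    """Count how many images share each (width, height) pair."""
    return Counter(sizes)


# Hypothetical sizes for illustration; the real dataset only contains (256, 256)
sample_sizes = [(256, 256), (256, 256), (512, 384)]
size_counts = summarise_sizes(sample_sizes)

# More than one distinct size means the images would need resizing
needs_resizing = len(size_counts) > 1
print(size_counts.most_common())
print(needs_resizing)
```

For this dataset the same check would report a single size, so no resizing is required.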

---

# Splitting the data

To evaluate the model as reliably as possible, the dataset will be split into 3 different folders. They will be used to train the model, validate the model and then test the model.

The splits between the groups will be:
* 70% training
* 15% validation
* 15% testing
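The arithmetic behind a 70/15/15 split can be sketched with the standard library alone. This is only an illustration and not the notebook's method, which imports `train_test_split` from scikit-learn; the `three_way_split` function, the seed value, and the file names are assumptions made for the example.

```python
import random


def three_way_split(items, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Shuffle items and cut them into train, validation, and test lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after train and val
    return train, val, test


# Hypothetical file names, used only for illustration
file_names = [f"img_{i}.jpg" for i in range(100)]
train, val, test = three_way_split(file_names)
print(len(train), len(val), len(test))
```

With `train_test_split`, the same proportions are usually reached with two calls: the first holds out 30% of the data, and the second divides that hold-out in half to give the validation and test sets.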

Import shutil for copying and moving files, and train_test_split for dividing the dataset

In [10]:
import shutil
from sklearn.model_selection import train_test_split

Set the input and output folders

In [11]:
raw_dir = os.path.join("inputs", "dataset", "raw", "cherry-leaves")
output_base_dir = os.path.join("inputs", "dataset")

Define the split ratios

In [12]:
train_ratio = 0.7  # approach adapted from Geeks for Geeks (ref. in credits at the end of this notebook)
val_ratio = 0.15
test_ratio = 0.15

Create target folders

In [13]:
for split in ["train", "val", "test"]:
    for class_name in os.listdir(raw_dir):
        split_dir = os.path.join(output_base_dir, split, class_name)
        os.makedirs(split_dir, exist_ok=True)
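Once the target folders exist, files can be copied into them with shutil. The following is a minimal sketch, assuming a list of file names has already been assigned to a split; the `copy_files` helper, the `healthy` class folder, and the placeholder files are hypothetical, and the demonstration runs in a temporary directory so no real data is touched.

```python
import os
import shutil
import tempfile


def copy_files(file_names, source_dir, target_dir):
    """Copy each named file from source_dir into target_dir, creating it if needed."""
    os.makedirs(target_dir, exist_ok=True)
    for name in file_names:
        shutil.copy2(os.path.join(source_dir, name), os.path.join(target_dir, name))


with tempfile.TemporaryDirectory() as tmp:
    # Build a small hypothetical class folder containing two placeholder files
    source = os.path.join(tmp, "raw", "healthy")
    os.makedirs(source)
    for name in ["a.jpg", "b.jpg"]:
        with open(os.path.join(source, name), "wb") as handle:
            handle.write(b"placeholder")

    # Copy them into the matching train folder
    target = os.path.join(tmp, "train", "healthy")
    copy_files(["a.jpg", "b.jpg"], source, target)
    copied = sorted(os.listdir(target))
    print(copied)
```

Copying rather than moving leaves the raw folder intact, so the split can be repeated with different ratios if needed.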

### Credits

* The general use of the shutil module in Python was helpfully explained by Geeks for Geeks (ref. in readme)
* Stack Overflow provided great information on splitting data into 3 groups (ref. in readme)
* The general idea of scikit-learn's train_test_split function was helpfully explained by Geeks for Geeks (ref. in readme)