# **3 – Data Preparation**

## Objectives

* Clean and prepare image data for training
* Ensure consistency in image dimensions and file types
* Split data into training, validation, and test sets
* Prepare the folder structure needed for modelling

## Inputs

* inputs/dataset/raw/cherry-leaves/

## Outputs

* Cleaned images in appropriate folders for modelling

---

# Change working directory

Change the working directory to the project's root folder so that relative paths such as inputs/dataset/ resolve correctly

In [1]:
import os

project_dir = r"C:\Users\amyno\OneDrive\Documents\CherryLeafProject\milestone-project-mildew-detection-in-cherry-leaves"

os.chdir(project_dir)

print(f" Current working directory is now: {os.getcwd()}")

 Current working directory is now: C:\Users\amyno\OneDrive\Documents\CherryLeafProject\milestone-project-mildew-detection-in-cherry-leaves


In [2]:
current_dir = os.getcwd()
current_dir

'C:\\Users\\amyno\\OneDrive\\Documents\\CherryLeafProject\\milestone-project-mildew-detection-in-cherry-leaves'

---

# Identify and remove any non-image files

Whilst I expect a Kaggle dataset to be relatively uniform, any non-image files could cause errors during image processing later on. This step ensures that only files ending in .jpg, .jpeg, or .png are used.

Define the raw image path and the file extensions allowed

In [None]:
raw_path = os.path.join("inputs", "dataset", "raw", "cherry-leaves")

valid_extensions = [".jpg", ".jpeg", ".png"] # Code explained by stack overflow and ref. in readme (1)

Track non image files 

In [6]:
non_image_files = []

Loop through the class folders and check the extension of every file

In [None]:
for class_name in os.listdir(raw_path):
    class_folder = os.path.join(raw_path, class_name)

    for file in os.listdir(class_folder):
        file_path = os.path.join(class_folder, file)

        if os.path.splitext(file)[1].lower() not in valid_extensions:
            non_image_files.append(file_path)

non_image_files

[]

The code returned an empty list, meaning there are no non-image files to deal with

In [8]:
for file_path in non_image_files:
    os.remove(file_path)

print(f"{len(non_image_files)} non image files removed.")

0 non image files removed.


This confirms there were no non-image files, as none have been deleted
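The extension check above can also be wrapped in a small reusable function. The following is only a minimal sketch: the function name `find_non_image_files` and the throwaway demo folder are my own assumptions for illustration and are not part of the project code. It also flags any stray file sitting next to the class folders, which the loop above does not handle.

```python
import os
import tempfile

VALID_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_non_image_files(root_dir, valid_extensions=VALID_EXTENSIONS):
    """Return paths under root_dir/<class>/ whose extension is not allowed.

    Hypothetical helper for illustration only.
    """
    flagged = []
    for class_name in sorted(os.listdir(root_dir)):
        class_folder = os.path.join(root_dir, class_name)
        if not os.path.isdir(class_folder):
            # A stray file next to the class folders is flagged as well.
            flagged.append(class_folder)
            continue
        for file_name in sorted(os.listdir(class_folder)):
            # Compare the lower-cased extension so ".JPG" counts as ".jpg".
            extension = os.path.splitext(file_name)[1].lower()
            if extension not in valid_extensions:
                flagged.append(os.path.join(class_folder, file_name))
    return flagged


# Demonstration on a throwaway folder, not the real dataset.
with tempfile.TemporaryDirectory() as demo_root:
    demo_class = os.path.join(demo_root, "healthy")
    os.makedirs(demo_class)
    for name in ["leaf_1.JPG", "leaf_2.png", "notes.txt"]:
        open(os.path.join(demo_class, name), "w").close()

    demo_flagged = [os.path.basename(p) for p in find_non_image_files(demo_root)]
    print(demo_flagged)
```

Note that this only checks file names. A file with an image extension could still be corrupted, so it does not replace opening the images later on.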

---

# Check image dimensions

Check whether all images in the dataset are the same size, as the model expects inputs of a consistent shape. If they are not, they will need to be standardised first.

Import the Image module from Pillow (PIL)

In [9]:
from PIL import Image

Track the current sizes of all images in the dataset

In [None]:
image_sizes = []

for class_name in os.listdir(raw_path):
    class_folder = os.path.join(raw_path, class_name) # code inspired by python documentation and ref. in readme (2)

    for img_name in os.listdir(class_folder):
        img_path = os.path.join(class_folder, img_name)
        with Image.open(img_path) as img:
            image_sizes.append(img.size)

set(image_sizes)

{(256, 256)}

Every image has now been iterated over and its size recorded in the set above. As the only size is 256 by 256, all images are the same size: square, with 256 pixels on each side.
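Had the set contained more than one size, it would help to know how many images have each size before deciding how to standardise them. A minimal sketch of that count is shown here, using `collections.Counter`. The helper name `summarise_sizes` and the example sizes are hypothetical, since the real dataset only contains (256, 256).

```python
from collections import Counter


def summarise_sizes(image_sizes):
    """Count how often each (width, height) pair occurs.

    Hypothetical helper for illustration only.
    """
    size_counts = Counter(image_sizes)
    # The dataset is uniform only if exactly one distinct size was seen.
    is_uniform = len(size_counts) == 1
    return size_counts, is_uniform


# Made-up sizes for illustration; the real notebook found only (256, 256).
example_sizes = [(256, 256), (256, 256), (300, 200)]
size_counts, is_uniform = summarise_sizes(example_sizes)

for size, count in size_counts.most_common():
    print(f"{size[0]}x{size[1]}: {count} image(s)")
print("All images share one size:", is_uniform)
```

The same function could be called on the `image_sizes` list built in the cell above.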

---

# Splitting the data

To ensure the best evaluation possible, the dataset will be split into 3 different folders. They will be used to train the model, validate the model and then test the model. 

The splits between the groups will be:
* 70% training
* 15% validation
* 15% testing

Import shutil for copying files and train_test_split for dividing the dataset

In [12]:
import shutil
from sklearn.model_selection import train_test_split

Set the input and output folders and confirm the class names

In [13]:
raw_dir = os.path.join("inputs", "dataset", "raw", "cherry-leaves")
output_base_dir = os.path.join("inputs", "dataset") # code inspired by python documentation and ref. in readme (2)
class_names = os.listdir(raw_dir)

print("Class names:", class_names)

Class names: ['healthy', 'powdery_mildew']


Define the split ratios

In [14]:
train_ratio = 0.7 # code provided by geeks for geeks and ref. in readme (3)
val_ratio = 0.15
test_ratio = 0.15

Create target folders

In [15]:
for split in ["train", "val", "test"]:
    for class_name in class_names:
        os.makedirs(os.path.join(output_base_dir, split, class_name), exist_ok=True)

Loop through each class to get its list of images, split that list, and copy the images into the matching split folders

By looping through the classes first, the images are separated by class before being split into their respective groups. This ensures the split is 70%/15%/15% of each class rather than of the whole dataset, which could leave the classes unevenly represented and skew the model's performance

In [16]:
for class_name in class_names:
    class_path = os.path.join(raw_dir, class_name)
    images = os.listdir(class_path)

    train_imgs, temp_imgs = train_test_split(images, test_size=(1 - train_ratio), random_state=42)
    val_imgs, test_imgs = train_test_split(temp_imgs, test_size=0.5, random_state=42)

    for split, img_list in [("train", train_imgs), ("val", val_imgs), ("test", test_imgs)]:
        for img_name in img_list:
            src = os.path.join(class_path, img_name)
            dst = os.path.join(output_base_dir, split, class_name, img_name)
            shutil.copy2(src, dst)  # code provided by geeks for geeks and ref. in readme (3)

print("Dataset successfully split and copied into train/val/test folders.")


Dataset successfully split and copied into train/val/test folders.


The split is done in two stages:
* The first split divides the full image list into two groups of 70% and 30%, and the 70% group is used as the training data
* The second split divides the 30% group into two groups of 15%, used as the validation and test data
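The arithmetic behind the two stages can be sketched as follows. This is a rough illustration rather than the library's own code: it assumes the held-out part of each split is rounded up and the remainder goes to the other part, which is consistent with the counts reported in this notebook. The function name `two_stage_split_counts` is hypothetical, and 2104 is the per-class total implied by those counts (1472 + 316 + 316).

```python
import math


def two_stage_split_counts(n_images, train_ratio=0.7):
    """Estimate train/val/test counts for a 70/30 split followed by a 50/50 split.

    Assumption: the held-out part of each split is rounded up and the
    remainder goes to the other part.
    """
    # Stage 1: hold out roughly 30% of the images as a temporary group.
    n_temp = math.ceil(n_images * (1 - train_ratio))
    n_train = n_images - n_temp
    # Stage 2: halve the temporary group into validation and test.
    n_test = math.ceil(n_temp * 0.5)
    n_val = n_temp - n_test
    return n_train, n_val, n_test


# Per-class total implied by the counts printed in this notebook.
n_train, n_val, n_test = two_stage_split_counts(2104)
print(n_train, n_val, n_test)
```

Because of the rounding, the shares are not exactly 70/15/15: 1472 of 2104 images is about 69.96%, and 316 of 2104 is about 15.02%.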

Lastly, confirm the number of images in each class folder to verify that the dataset has been split correctly

In [17]:
for split in ["train", "val", "test"]:
    for class_name in ["healthy", "powdery_mildew"]:
        folder = os.path.join("inputs", "dataset", split, class_name)
        count = len(os.listdir(folder))
        print(f"{split}/{class_name}: {count} images")

train/healthy: 1472 images
train/powdery_mildew: 1472 images
val/healthy: 316 images
val/powdery_mildew: 316 images
test/healthy: 316 images
test/powdery_mildew: 316 images


This confirms the data is in the correct folders and has been split as expected
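Counting files shows that the folders are the right size, but not that each image appears in only one split. As an optional extra check, the sketch below looks for file names that occur in more than one split folder. The helper name `find_split_overlaps` is hypothetical, and the demonstration runs on a throwaway folder with one deliberately duplicated file rather than on the real dataset.

```python
import os
import tempfile
from itertools import combinations


def find_split_overlaps(base_dir, splits=("train", "val", "test")):
    """Return {(split_a, split_b): shared "class/file" names} for overlapping splits.

    Hypothetical helper for illustration only.
    """
    files_per_split = {}
    for split in splits:
        split_dir = os.path.join(base_dir, split)
        names = set()
        for class_name in os.listdir(split_dir):
            class_dir = os.path.join(split_dir, class_name)
            for file_name in os.listdir(class_dir):
                names.add(f"{class_name}/{file_name}")
        files_per_split[split] = names

    overlaps = {}
    # Compare every pair of splits and keep only the pairs that share files.
    for split_a, split_b in combinations(splits, 2):
        shared = files_per_split[split_a] & files_per_split[split_b]
        if shared:
            overlaps[(split_a, split_b)] = sorted(shared)
    return overlaps


# Demonstration on a throwaway folder with one deliberately duplicated file.
with tempfile.TemporaryDirectory() as demo_base:
    layout = {"train": ["a.jpg", "b.jpg"], "val": ["c.jpg"], "test": ["b.jpg"]}
    for split, file_names in layout.items():
        class_dir = os.path.join(demo_base, split, "healthy")
        os.makedirs(class_dir)
        for file_name in file_names:
            open(os.path.join(class_dir, file_name), "w").close()

    demo_overlaps = find_split_overlaps(demo_base)
    print(demo_overlaps)
```

On the real data, calling `find_split_overlaps(os.path.join("inputs", "dataset"))` should return an empty dictionary if the split folders only contain the files copied by the cell above.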

---

# Conclusions and next steps

## Conclusions

* The dataset was successfully split into training (70%), validation (15%), and test (15%) sets
* Both classes contain the same number of images in each split (1472 training, 316 validation, 316 test), maintaining balance across the dataset
* Folder structure has been created ready for model training and evaluation

## Next steps

* Begin data preparation for model training