# **Data collection Notebook**

## Contents and purpose

- Import packages
- Set up the directory and path structure
- Load raw data from Kaggle and save it to the repo
- Sift through the data, then process and save it
- Clean the data
- Create train, validation, and test sets

## Important files

- The Kaggle JSON file (kaggle.json) is a personal authentication token. If this repo is forked and reproduced, it must be replaced with your own file.

## Expected Results

- We will obtain the data needed for the subsequent notebooks:
    - a train set to train our models
    - a test set
    - a validation set
- Each set will contain healthy and diseased sample images.

## Why are we doing this

Collecting, cleaning, and splitting the data are standard preparation steps for machine learning. The later notebooks depend on the resulting train, validation, and test sets.


# Install/import the packages necessary for this notebook

- If you created your working environment from the requirements.txt file, you can skip the next step, as the requirements will already be satisfied. If not, you can install the necessary packages now.

In [None]:
! pip install -r ../requirements.txt

Now you can import the packages that will be needed in this notebook.

In [1]:
import os
import sys
import zipfile


## Set working directory and file path architecture for notebook
As the notebooks are stored in a subfolder of this repo, we need to adjust the working directory so that files can be accessed properly.

First we check our current working directory.

In [2]:
current_dir = os.getcwd()
current_dir

'e:\\Projects\\Code-I\\vscode-projects\\PP5-predictive_analysis\\jupyter_notebooks'

Now we can change the directory to the parent folder that contains the complete repo. We will also print our new working directory so we can check everything worked out as planned.

In [3]:
# Only change the directory if we are still inside the notebooks folder
current_dir = os.getcwd()
target_dir = os.path.abspath(os.path.join(current_dir, os.pardir))  # One level up

# If we are in 'jupyter_notebooks', move up to the repo root
if os.path.basename(current_dir) == 'jupyter_notebooks':
    os.chdir(target_dir)
    current_dir = os.getcwd()
    print(f"Working directory set to: {current_dir}")
else:
    print(f"Current working directory remains: {current_dir}")

Working directory set to: e:\Projects\Code-I\vscode-projects\PP5-predictive_analysis
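
The check above relies on the notebook folder being named `jupyter_notebooks`. As an alternative sketch, one could walk upward from the current directory until a marker file is found. The helper name `find_repo_root` and the choice of `requirements.txt` as the marker are assumptions made for illustration; they are not part of this repo's code.

```python
from pathlib import Path


def find_repo_root(start, marker="requirements.txt"):
    """Return the first directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' found at or above {start}")
```

With such a helper, `os.chdir(find_repo_root(os.getcwd()))` would work no matter how deeply the notebook is nested, as long as the marker file exists at the repo root.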


# Kaggle as a data source

Kaggle is a data science platform that offers a vast repository of publicly shared datasets across diverse domains such as healthcare, finance, sports, and more. These datasets are freely available for analysis, modeling, and learning, making Kaggle a popular resource for data scientists and machine learning practitioners.

This repo uses data from Kaggle, so the `kaggle` package is already part of the requirements file. If you want to install it separately, you can do so via pip:

In [8]:
pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Once Kaggle is installed, we need to point the Kaggle config directory to our current working directory. We also need to authenticate using our kaggle.json file, which can be obtained from the user settings of your Kaggle account.

In [4]:
# change Kaggle config directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict kaggle.json so that only the file owner can read and write it
! chmod 600 kaggle.json

The command "chmod" is either misspelled or
could not be found.
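
The error above appears because `chmod` is a Unix shell command and is not available in the Windows command prompt. A minimal sketch of a portable alternative uses Python's own `os.chmod`. The helper name `restrict_to_owner` is an assumption for illustration. On Windows, `os.chmod` can only toggle the read-only flag, so the sketch skips the call there.

```python
import os
import stat


def restrict_to_owner(path):
    """Set owner-only read/write (0o600) on POSIX systems.

    Returns True if the mode was changed, False if the call was skipped
    because the platform is Windows.
    """
    if os.name == "nt":
        return False
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Calling `restrict_to_owner("kaggle.json")` in place of the shell command would avoid the error on Windows while keeping the same permissions on Linux and macOS.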


Now we can obtain our dataset for this notebook.

In [8]:
# Set variables to define source and destination of our kaggle dataset
data_path = "codeinstitute/cherry-leaves"
data_folder = "inputs/datasets/raw"
# If our inputs folder does not exist yet, we create it here
os.makedirs(data_folder, exist_ok=True)
# Finally we download and save the dataset
! kaggle datasets download -d {data_path} -p {data_folder}

Downloading cherry-leaves.zip to inputs/datasets/raw


Now that we have our raw data, we will unzip it and remove the zip file. We will also rename the "powdery_mildew" folder to "diseased", which becomes the new label for that set.

In [9]:
# Unzip the dataset
zip_path = os.path.join(data_folder, 'cherry-leaves.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_folder)
# Remove the zip file
os.remove(zip_path)



In [10]:
# Rename the 'powdery_mildew' folder to 'diseased'
dataset_path = os.path.join(data_folder, 'cherry-leaves')

old_label = os.path.join(dataset_path, 'powdery_mildew')
new_label = os.path.join(dataset_path, 'diseased')

if os.path.exists(old_label):
    os.rename(old_label, new_label)
    print("Renamed 'powdery_mildew' → 'diseased'")
else:
    print("The folder 'powdery_mildew' does not exist.")

Renamed 'powdery_mildew' → 'diseased'


# Data processing

---

## Data cleaning

Check for unnecessary files and remove all excess files. A function to remove excess files can be found in PP5-predictive_analysis\src\data_processing.py.

In [11]:
# First we add the resource folder to our path so that we can load the relevant functions
sys.path.append('./src')
# Then we load our function from the resource file
from data_processing import remove_non_image_files

remove_non_image_files(data_dir='inputs/datasets/raw/cherry-leaves')

Folder 'diseased': Image files = 2104, Non-image files removed = 0
Folder 'healthy': Image files = 2104, Non-image files removed = 0
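
The implementation of `remove_non_image_files` lives in src/data_processing.py and is not shown in this notebook. The following is only a sketch of how a comparable cleanup could work. The function name `remove_non_image_files_sketch`, the extension list, and the returned summary are assumptions for illustration.

```python
import os

# Assumed set of extensions that count as images; the project's own
# function may use a different list.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def remove_non_image_files_sketch(data_dir):
    """Delete files without an image extension from each label folder.

    Returns a dict mapping label -> (images_kept, files_removed).
    """
    summary = {}
    for label in sorted(os.listdir(data_dir)):
        label_dir = os.path.join(data_dir, label)
        if not os.path.isdir(label_dir):
            continue
        kept = removed = 0
        for name in os.listdir(label_dir):
            file_path = os.path.join(label_dir, name)
            if not os.path.isfile(file_path):
                continue
            # Compare case-insensitively so 'IMG.JPG' counts as an image
            if name.lower().endswith(IMAGE_EXTENSIONS):
                kept += 1
            else:
                os.remove(file_path)
                removed += 1
        summary[label] = (kept, removed)
    return summary
```

Filtering by extension is cheap, but it only looks at file names. It does not confirm that a file actually contains image data.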


Now that only image files remain, we should check if all images are in working order or if we have some corrupted images in our data set.

In [12]:
from data_processing import remove_corrupt_images

corrupt_images = remove_corrupt_images("inputs/datasets/raw/cherry-leaves")

✅ Total corrupt images removed: 0
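
How `remove_corrupt_images` detects a corrupt file is not shown here. As a lightweight illustration that needs only the standard library, the sketch below checks whether each file starts with a known JPEG or PNG signature. The function names and the restriction to these two formats are assumptions. A header check catches empty or mislabelled files but not truncated images; a full decode with an imaging library such as Pillow would be needed for that.

```python
import os

# Leading "magic" bytes of the two formats assumed here
SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
)


def has_valid_image_header(file_path):
    """Return True if the file starts with a known JPEG or PNG signature."""
    with open(file_path, "rb") as handle:
        header = handle.read(8)
    return header.startswith(SIGNATURES)


def find_suspect_images(data_dir):
    """List files under data_dir whose header matches no known signature."""
    suspects = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            file_path = os.path.join(root, name)
            if not has_valid_image_header(file_path):
                suspects.append(file_path)
    return suspects
```

The sketch only reports suspect files. Deleting them is left as a separate step so that the list can be reviewed first.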


# Split data into train-, test-, and validation set

For the upcoming model training, we need a train set to train our model, a validation set to tune the training process, and a test set to evaluate our model's performance.

In [13]:
from data_processing import split_dataset, clear_splits

# First we clear the old splits if they already exist (e.g. if we run this cell again to change the ratios).
# Note that you need to reload the original dataset before running this cell again.
clear_splits(data_dir='inputs/datasets/raw/cherry-leaves')

# Then we split the dataset into train, validation, and test sets
split_dataset(data_dir="inputs/datasets/raw/cherry-leaves",
              train_ratio=0.7,
              validation_ratio=0.15,
              test_ratio=0.15
              )
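
The body of `split_dataset` is defined in src/data_processing.py and is not reproduced in this notebook. The sketch below shows one plausible way such a split could be done: shuffle each label's files, slice them by ratio, and move them into per-set folders. The function name `split_dataset_sketch`, the `seed` parameter, and the resulting folder layout are assumptions for illustration.

```python
import os
import random
import shutil


def split_dataset_sketch(data_dir, train_ratio=0.7, validation_ratio=0.15,
                         test_ratio=0.15, seed=42):
    """Move each label's files into train/validation/test subfolders.

    Assumed layout: data_dir/<label>/<file> before the call and
    data_dir/<set>/<label>/<file> afterwards.
    """
    if abs(train_ratio + validation_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("Ratios must sum to 1.0")
    set_names = ("train", "validation", "test")
    labels = [d for d in sorted(os.listdir(data_dir))
              if os.path.isdir(os.path.join(data_dir, d)) and d not in set_names]
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    for label in labels:
        label_dir = os.path.join(data_dir, label)
        files = sorted(os.listdir(label_dir))
        rng.shuffle(files)
        train_end = int(len(files) * train_ratio)
        validation_end = train_end + int(len(files) * validation_ratio)
        parts = {
            "train": files[:train_end],
            "validation": files[train_end:validation_end],
            "test": files[validation_end:],  # remainder goes to test
        }
        for set_name, names in parts.items():
            target_dir = os.path.join(data_dir, set_name, label)
            os.makedirs(target_dir, exist_ok=True)
            for name in names:
                shutil.move(os.path.join(label_dir, name),
                            os.path.join(target_dir, name))
        os.rmdir(label_dir)  # the label folder is empty after the move
```

Because the files are moved rather than copied, the original label folders no longer exist afterwards. A design like this would explain why the dataset has to be reloaded before splitting again.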

To get an overview of the set sizes and to check that the sets are ready for the next steps, we count the images in each set.

In [15]:
from data_processing import count_dataset_images


sets = ['train', 'validation', 'test']
labels = ['healthy', 'diseased']
base_path = 'inputs/datasets/raw/cherry-leaves'

count_dataset_images(base_path, sets, labels)

There are 1472 images in train/healthy
There are 1472 images in train/diseased
There are 315 images in validation/healthy
There are 315 images in validation/diseased
There are 317 images in test/healthy
There are 317 images in test/diseased

Total number of images: 4208


4208
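
The per-label counts are not exactly 70/15/15 because 2104 images cannot be divided evenly by these ratios. The numbers are consistent with rounding the train and validation sizes down and giving the remainder to the test set. This is an inference from the output, not a confirmed detail of `split_dataset`. The helper `split_sizes` below is hypothetical and only reproduces the arithmetic.

```python
def split_sizes(n_images, train_ratio=0.7, validation_ratio=0.15):
    """Return (train, validation, test) sizes; test takes the remainder."""
    n_train = int(n_images * train_ratio)            # 1472.8 -> 1472
    n_validation = int(n_images * validation_ratio)  # 315.6 -> 315
    n_test = n_images - n_train - n_validation
    return n_train, n_validation, n_test


print(split_sizes(2104))  # (1472, 315, 317), matching the per-label counts
```

Summed over both labels, this gives 2 × (1472 + 315 + 317) = 4208 images, which agrees with the reported total.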

## Summary

In this notebook, we performed the essential preprocessing steps to prepare our cherry leaf dataset for modeling:

- Removed non-image and corrupt files to ensure data integrity.
- Verified and cleaned the directory structure.
- Split the dataset into **training**, **validation**, and **test** sets with user-defined ratios. Our default is 70%, 15%, and 15%.

These steps ensure that our dataset is clean, balanced, and ready for model training and evaluation.

---

## Next Steps

Now that the dataset has been cleaned and split, the next steps focus on understanding the data and preparing it for model training. We will perform exploratory data analysis (EDA) and visualize the results:

- **Analyze class distribution** to check for potential imbalance between categories.
- **Visualize image samples** to assess data quality and variation within classes.
- **Inspect image dimensions and aspect ratios** to inform resizing or preprocessing decisions.

These steps will help guide decisions around model architecture, data augmentation, and normalization techniques.