# **Data Collection**

## Objectives

* Fetch cherry leaf image data from Kaggle and prepare it for use.

## Inputs

* Kaggle JSON file (kaggle.json): the authentication token.

## Outputs

* Dataset: input/datasets/cherry_leaf_dataset



---

# Imports

In [None]:
%pip install -r /workspaces/PP5-MildewDetection/requirements.txt

In [None]:
import os
from pathlib import Path
import shutil
import random

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Kaggle Installation and Configuration

First, we install the Kaggle package.

In [None]:
%pip install kaggle==1.5.12

The cell below changes the Kaggle configuration directory to the current working directory and sets permissions for the Kaggle authentication JSON.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
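
The `chmod` shell call above relies on a POSIX shell. A minimal pure-Python sketch of the same permission step, assuming kaggle.json sits in the working directory, might look like the following. The helper name `secure_kaggle_token` is hypothetical and is not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(token_path="kaggle.json"):
    """Restrict the token file to owner read/write (mode 600)."""
    token = Path(token_path)
    if not token.is_file():
        # Fail early with a clear message instead of a later auth error
        raise FileNotFoundError(f"Kaggle token not found: {token}")
    # Owner may read and write; group and others get no access
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `secure_kaggle_token()` after setting KAGGLE_CONFIG_DIR would then stand in for the shell command on POSIX systems.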

Here we set the Kaggle dataset path and download the dataset to the folder we specified in the 'Outputs' section.

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "input/datasets/cherry_leaf_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Finally, we unzip the downloaded file and then delete the zip file.

In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
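
After extraction, a quick sanity check can confirm that the class folders exist and contain files. The sketch below is one possible way to do this; the helper name `count_files_per_folder` is hypothetical, and it only counts files directly inside each subfolder.

```python
from pathlib import Path


def count_files_per_folder(root_dir):
    """Return {subfolder name: number of files directly inside it}."""
    root = Path(root_dir)
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }
```

For this notebook it could be called as `count_files_per_folder("input/datasets/cherry_leaf_dataset/cherry-leaves")`, assuming the archive unpacks into that folder.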

---

# Data Preparation

---

## Data Cleaning

### The function below removes every file whose extension is not in an allowed set of image types.

In [None]:

def clean_image_dataset(root_dir, extensions=None):
    """
    Remove any files in each subdirectory of root_dir whose suffix
    is not in the allowed extensions, and print a summary.
    """
    # Default to common image suffixes if none are provided
    allowed = {'.png', '.jpg', '.jpeg'} if extensions is None else set(ext.lower() for ext in extensions)
    root = Path(root_dir)

    for subfolder in root.iterdir():
        if not subfolder.is_dir():
            continue  # skip files at the top level

        kept, removed = 0, 0
        for file in subfolder.iterdir():
            # Only consider actual files
            if not file.is_file():
                continue

            if file.suffix.lower() in allowed:
                kept += 1
            else:
                file.unlink()   # delete non-image file
                removed += 1

        print(f"Subfolder '{subfolder.name}': kept {kept} images, removed {removed} non-images")


We need to find and remove any non-image files from our dataset, if they exist.

We assign the path to our data to a variable and pass it as the 'root_dir' argument of the clean_image_dataset function.

Because the default 'extensions' already cover common image types, we leave that parameter unchanged.

In [None]:
dataset_path = "input/datasets/cherry_leaf_dataset/cherry-leaves"
clean_image_dataset(dataset_path)
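
Because `clean_image_dataset` deletes files permanently, a read-only companion check can be useful. The sketch below lists the files that would not pass the extension filter without removing anything. The helper name `list_non_image_files` is hypothetical, and its default extensions are assumed to mirror the defaults used above.

```python
from pathlib import Path


def list_non_image_files(root_dir, extensions=('.png', '.jpg', '.jpeg')):
    """Return files in each subfolder whose suffix is not an allowed image type."""
    allowed = {ext.lower() for ext in extensions}
    root = Path(root_dir)
    return sorted(
        f
        for sub in root.iterdir() if sub.is_dir()
        for f in sub.iterdir()
        if f.is_file() and f.suffix.lower() not in allowed
    )
```

Run before cleaning, it previews what would be deleted; run afterwards, it should return an empty list.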

---

# Splitting Train, Validation and Test Sets

In [None]:

def split_dataset(data_dir, train_ratio, val_ratio, test_ratio):
    """
    Split each class‑folder under data_dir into train/validation/test
    according to the three ratios (which must add to 1.0).
    """
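    # Validate ratios sum to 1, with a small tolerance because exact
    # floating-point equality is unreliable for decimal fractions
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        print("Error: train_ratio + val_ratio + test_ratio must equal 1.0")
        return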

    # Find class folders skipping any existing split directories
    classes = [
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d))
           and d not in ('train', 'validation', 'test')
    ]

    # Create train/validation/test subfolders for each class
    for split in ('train', 'validation', 'test'):
        for cls in classes:
            os.makedirs(os.path.join(data_dir, split, cls), exist_ok=True)

    # Shuffle and move
    for cls in classes:
        cls_path = os.path.join(data_dir, cls)
        files = os.listdir(cls_path)
        random.shuffle(files)

        total = len(files)
        n_train = int(total * train_ratio)
        n_val   = int(total * val_ratio)
        # The test split takes the remainder so rounding leftovers are counted
        n_test  = total - n_train - n_val

        for i, fname in enumerate(files):
            src = os.path.join(cls_path, fname)

            if i < n_train:
                split = 'train'
            elif i < n_train + n_val:
                split = 'validation'
            else:
                # Everything else, including rounding leftovers, goes to 'test'
                split = 'test'

            dst = os.path.join(data_dir, split, cls, fname)
            shutil.move(src, dst)

        # Remove empty original folder
        os.rmdir(cls_path)

        # Feedback exact counts
        print(f"Class '{cls}': train={n_train}, validation={n_val}, test={n_test}")


Following a commonly used split:

* 70% of the data goes to training.
* 10% goes to validation.
* 20% goes to testing.

In [None]:
split_dataset(
    data_dir="input/datasets/cherry_leaf_dataset/cherry-leaves",
    train_ratio=0.7,
    val_ratio=0.1,
    test_ratio=0.2,
)
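
Because the train and validation counts are truncated with `int()`, any leftover files end up in the test split. The sketch below works through that arithmetic for a hypothetical class of 105 files; the helper name `split_counts` and the class size are illustrative only, not figures from the dataset.

```python
def split_counts(total, train_ratio, val_ratio):
    """Return (train, validation, test) counts; test takes the remainder."""
    n_train = int(total * train_ratio)  # truncates toward zero
    n_val = int(total * val_ratio)      # truncates toward zero
    n_test = total - n_train - n_val    # absorbs rounding leftovers
    return n_train, n_val, n_test


print(split_counts(105, 0.7, 0.1))  # (73, 10, 22)
```

Here 70% of 105 is 73.5 and 10% is 10.5, so both are rounded down and the test split receives 22 files rather than 21.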

---