## 01 - Data Collection

### Objective:

- Install and configure Kaggle API
- Download and unzip the "Cherry Leaves" dataset

In [20]:
%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.





### Step 1: Install Kaggle and Set Up Environment

In [21]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.





### Step 2: Set Kaggle Config Directory and File Permissions for kaggle.json

In [22]:
import os
import zipfile
import shutil
import random
from pathlib import Path

# Set working directory manually to the project root
os.chdir("C:/Users/Robert/mildew_cherry_detector")
print("You set a new current directory")

current_dir = os.getcwd()
current_dir


You set a new current directory


'C:\\Users\\Robert\\mildew_cherry_detector'
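
The hard-coded path above only works on one machine. A minimal sketch of a more portable alternative, assuming the project root is the nearest parent directory that contains `requirements.txt` (the first cell installs from `../requirements.txt`, which suggests this layout). The helper name `find_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start=None, marker="requirements.txt"):
    """Walk upward from `start` until a directory containing `marker` is found.

    Assumption: the project root is identified by the presence of `marker`.
    """
    current = Path(start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found in or above {current}")
```

With such a helper, the working directory could be set with `os.chdir(find_project_root())` instead of an absolute Windows path.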

In [24]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json  # POSIX only: Windows has no chmod command, so this fails there (harmless here)

'chmod' is not recognized as an internal or external command,
operable program or batch file.
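
As the output shows, the shell `chmod` command is not available on Windows. A possible cross-platform sketch uses Python's own `os`/`pathlib` calls and simply skips the step where POSIX permission bits do not apply. The function name `restrict_to_owner` is hypothetical, and treating every non-POSIX platform as "nothing to do" is an assumption.

```python
import os
import stat
from pathlib import Path


def restrict_to_owner(path):
    """Make `path` readable and writable by its owner only (POSIX mode 600).

    Returns True if the mode was set, False on platforms without POSIX
    permission bits (assumption: os.name != "posix", e.g. Windows).
    """
    path = Path(path)
    if os.name != "posix":
        # No owner/group/other bits to set here, so skip silently
        return False
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return True
```

In this notebook it would be called as `restrict_to_owner("kaggle.json")` after `KAGGLE_CONFIG_DIR` has been set.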


### Step 3: Download Dataset from Kaggle

In [25]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown


### Step 4: Unzip Dataset

In [26]:
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
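
The cell above deletes the archive right after extraction. A slightly more defensive sketch, assuming it is worth verifying the archive before removing it, runs `ZipFile.testzip()` first and deletes the zip only if the check and the extraction both succeed. The helper name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, dest_dir):
    """Extract `zip_path` into `dest_dir`, then delete the archive.

    The archive is removed only after the integrity check and the
    extraction have completed. Returns the number of archive members.
    """
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # first corrupt member name, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(dest_dir)
        n_members = len(zip_ref.namelist())
    zip_path.unlink()  # safe to delete: the archive is closed at this point
    return n_members
```

Here it could be called as `extract_and_remove(Path(DestinationFolder) / "cherry-leaves.zip", DestinationFolder)`.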

## Data Preparation

### Data cleaning
### Check and remove non-image files

In [27]:
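def remove_non_images(directory):
    """Delete every file in `directory` without an image extension; return the removed names."""
    removed_files = []
    valid_ext = [".jpg", ".jpeg", ".png"]
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        # Skip subdirectories: os.remove() raises an error on a directory
        if not os.path.isfile(file_path):
            continue
        ext = os.path.splitext(filename)[1].lower()
        if ext not in valid_ext:
            os.remove(file_path)
            removed_files.append(filename)
    return removed_files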

healthy_dir = "C:/Users/Robert/mildew_cherry_detector/inputs/cherry_leaves_dataset/cherry-leaves/healthy"
mildew_dir = "C:/Users/Robert/mildew_cherry_detector/inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew"

removed_healthy = remove_non_images(healthy_dir)
removed_mildew = remove_non_images(mildew_dir)

print("Data cleaning complete.")
print(f"Removed from healthy: {removed_healthy}")
print(f"Removed from mildew: {removed_mildew}")

Data cleaning complete.
Removed from healthy: []
Removed from mildew: []
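
The check above looks only at file extensions, so a truncated or misnamed file with a `.jpg` suffix would pass. As a hedged complement, the sketch below inspects the first bytes of each file for the standard JPEG and PNG signatures. The helper names `has_image_signature` and `find_suspect_files` are hypothetical, and a matching signature does not guarantee that the whole image decodes.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"       # every JPEG file starts with these bytes
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG signature


def has_image_signature(path):
    """Return True if the file starts with a JPEG or PNG signature."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    return header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)


def find_suspect_files(directory):
    """List names of files in `directory` whose header is not JPEG or PNG."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and not has_image_signature(p)
    )
```

Running `find_suspect_files(healthy_dir)` and `find_suspect_files(mildew_dir)` would report candidates for manual inspection without deleting anything.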


### Split into train, validation, and test sets

In [28]:
healthy_src = Path("C:/Users/Robert/mildew_cherry_detector/inputs/cherry_leaves_dataset/cherry-leaves/healthy")
mildew_src = Path("C:/Users/Robert/mildew_cherry_detector/inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew")
base_output = Path("C:/Users/Robert/mildew_cherry_detector/inputs/cherry_leaves_split")
classes = ["healthy", "powdery_mildew"]

train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15  # not used directly: the test set takes whatever remains after train and val

def split_and_copy(class_name, src_dir):
    files = [f for f in os.listdir(src_dir) if f.lower().endswith((".jpg", ".jpeg", ".png"))]
    if not files:
        print(f"No images found in {src_dir}. Skipping {class_name}.")
        return

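    # Sort, then shuffle with a fixed seed (42 is an arbitrary choice) so that
    # re-running the cell reproduces the same split. An unseeded shuffle would
    # copy files into different splits on a second run and leak data across them.
    files.sort()
    random.Random(42).shuffle(files)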
    n_total = len(files)
    n_train = int(n_total * train_ratio)
    n_val = int(n_total * val_ratio)

    train_files = files[:n_train]
    val_files = files[n_train:n_train + n_val]
    test_files = files[n_train + n_val:]

    for split_name, split_files in zip(["train", "val", "test"], [train_files, val_files, test_files]):
        split_dir = base_output / split_name / class_name
        split_dir.mkdir(parents=True, exist_ok=True)

        for file in split_files:
            src_file = src_dir / file
            dst_file = split_dir / file
            if not dst_file.exists():
                shutil.copy(src_file, dst_file)

    print(f"{class_name}: {n_total} images → {n_train} train, {n_val} val, {len(test_files)} test")

split_and_copy("healthy", healthy_src)
split_and_copy("powdery_mildew", mildew_src)


healthy: 2104 images → 1472 train, 315 val, 317 test
powdery_mildew: 2104 images → 1472 train, 315 val, 317 test
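
The counts printed above follow from integer truncation: the train and validation sizes are rounded down, and the test set receives the remainder. The sketch below reproduces that arithmetic and adds a simple check that no file name appears in more than one split. The helper names `split_counts` and `find_leaks` are hypothetical, and the directory layout `<base>/<split>/<class>` is taken from the cell above.

```python
from pathlib import Path


def split_counts(n_total, train_ratio=0.7, val_ratio=0.15):
    """Return (n_train, n_val, n_test) using the same truncation as the cell above."""
    n_train = int(n_total * train_ratio)
    n_val = int(n_total * val_ratio)
    n_test = n_total - n_train - n_val  # remainder goes to the test set
    return n_train, n_val, n_test


def find_leaks(base_output, class_name, splits=("train", "val", "test")):
    """Return sorted file names that occur in more than one split for a class."""
    first_seen = {}
    leaks = set()
    for split in splits:
        split_dir = Path(base_output) / split / class_name
        if not split_dir.is_dir():
            continue
        for p in split_dir.iterdir():
            if p.name in first_seen and first_seen[p.name] != split:
                leaks.add(p.name)
            first_seen.setdefault(p.name, split)
    return sorted(leaks)


print(split_counts(2104))  # (1472, 315, 317), matching the output above
```

An empty list from `find_leaks(base_output, "healthy")` would indicate that the three splits are disjoint for that class.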
