# **Data Collection**

## Objectives

- Download data from Kaggle and prepare it for processing

## Inputs

- kaggle.json - authentication token
- dataset - images

## Outputs

- Generated dataset: inputs/cherry-leaves/cherry-leaves
- Split dataset - train, test, validation

## Change working directory

By default, the working directory is "jupyter_notebooks", where the notebook is running. However, we need to change the working directory to its parent folder so that file references align with the broader project structure.

To do this, we first check the current working directory. Note that the output below displays only the last two folders of the path rather than the full system path. This is intentional, to avoid exposing the full local file path.

**Any time you revisit this notebook after logging out, or open a different notebook for the first time, you must repeat these steps to ensure the working directory is always correctly set.**

In [1]:
import os
from pathlib import Path # ensure file path consistency

# Get the current working directory
current_dir = Path.cwd()

# Extract the last two directory names
filtered_path = Path(*current_dir.parts[-2:])
print(f"📂 {filtered_path}")  # Example output: 📂 mildew_detector/jupyter_notebooks

📂 mildew_detector\jupyter_notebooks


Now we change the working directory from "jupyter_notebooks" to the parent directory.

In [2]:
# Change the working directory to its parent folder
os.chdir(os.path.dirname(os.getcwd()))

# Confirmation message
print("✅ You set a new current directory")

✅ You set a new current directory


Confirm the new current directory.

In [3]:
# Get the current working directory
current_dir = Path.cwd()

# Extract the last two directory names
filtered_path = Path(*current_dir.parts[-2:])
print(f"📂 {filtered_path}")  # Example output: 📂 Projects/mildew_detector

📂 Projects\mildew_detector
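
Because the directory change moves up one level every time it runs, re-running that cell by accident would leave the working directory above the project root. A minimal sketch of a guarded version is given here. The helper name `move_to_project_root` is hypothetical, and the sketch assumes the notebooks folder is always called "jupyter_notebooks".

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only when currently inside the notebooks folder."""
    current_dir = Path.cwd()
    if current_dir.name == notebook_dir_name:
        os.chdir(current_dir.parent)
    return Path.cwd()


# Safe to call repeatedly: a second call leaves the directory unchanged
project_root = move_to_project_root()
print(Path(*project_root.parts[-2:]))
```

With this guard, the cell can be run any number of times in a session without drifting out of the project folder.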


## Import Packages

In [4]:
# TODO: replace requirements.txt with a file curated from the packages actually used
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Install Kaggle

Now we need to gather our data. The images come from kaggle.com, so we first install the kaggle package to handle the download.

For this you need to have your Kaggle Token handy.

In [5]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Drag and drop your kaggle.json file (Kaggle Token) into the same directory as README.md.

The code below will check that kaggle.json appears in the directory by listing its contents. You should see a list of entries in this directory, including kaggle.json.

In [6]:
print(os.listdir())  # Should list `kaggle.json`

['.git', '.gitattributes', '.gitignore', '.python-version', '.venv', 'app.py', 'app_pages', 'jupyter_notebooks', 'kaggle.json', 'outputs', 'README.md', 'requirements.txt', 'setup.sh']
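
By default, the Kaggle command-line tool looks for kaggle.json in the `.kaggle` folder of your home directory. One way to point it at the project directory instead is the `KAGGLE_CONFIG_DIR` environment variable. The sketch below assumes the token sits in the current working directory; the helper name `use_local_kaggle_token` is hypothetical and not part of the kaggle package.

```python
import os
from pathlib import Path


def use_local_kaggle_token(config_dir="."):
    """Point the Kaggle CLI at a folder that contains kaggle.json."""
    config_dir = Path(config_dir).resolve()
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # The Kaggle CLI reads this variable to locate kaggle.json
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token


# Only configure the variable when the token is actually present
if Path("kaggle.json").is_file():
    use_local_kaggle_token()
```

Failing early with a clear message is easier to diagnose than an authentication error raised later by the download command.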


Now we get the path for the dataset and set the destination folder where the downloaded images will be stored.

This code will create new folders ("inputs" and "cherry-leaves") and download the dataset into them as a zip file.

In [7]:
# Define Kaggle dataset and destination folder using pathlib
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = Path("inputs") / "cherry-leaves"  # Ensures correct path handling across OS

# Download the Kaggle dataset into the specified folder
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs\cherry-leaves





Now we need to unzip the downloaded file and get hold of the images. 

The cell below will unzip the file and store the images inside a new directory within the "inputs" folder.

The code will also delete the zip file and your Kaggle Token for data protection purposes.

In [8]:
import zipfile
from pathlib import Path

# Define paths
zip_file = Path("inputs") / "cherry-leaves" / "cherry-leaves.zip"
extract_folder = Path("inputs") / "cherry-leaves"
kaggle_token = Path("kaggle.json")

# Unzip the file
with zipfile.ZipFile(zip_file, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"📂 Extracted files into: {extract_folder}")

# Delete the zip file after extraction
zip_file.unlink()
print(f"🗑️ Deleted: {zip_file}")

# Remove Kaggle token for security
if kaggle_token.exists():
    kaggle_token.unlink()
    print(f"🛡️ Kaggle Token removed: {kaggle_token}")

📂 Extracted files into: inputs\cherry-leaves
🗑️ Deleted: inputs\cherry-leaves\cherry-leaves.zip
🛡️ Kaggle Token removed: kaggle.json
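
Deleting the zip file straight after extraction means a corrupt download can only be repaired by downloading again. As a hedged alternative, the sketch below checks the archive with `ZipFile.testzip()` before extracting and deletes it only when the check passes. The function name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, extract_folder):
    """Extract a zip archive and delete it only if its integrity check passes."""
    zip_path = Path(zip_path)
    extract_folder = Path(extract_folder)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # Name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(extract_folder)
    zip_path.unlink()  # Reached only when the check and extraction succeeded
    return extract_folder
```

In this notebook it could be called with `Path("inputs") / "cherry-leaves" / "cherry-leaves.zip"` as the archive and `Path("inputs") / "cherry-leaves"` as the destination.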


## Data Preparation

Now that we have the images downloaded and stored in the right place, we need to check that the data is suitable for our project.

As we are concerned with using image data, we need to check that any non-image files are identified and removed.

The function below will do this.

In [9]:
from pathlib import Path

def remove_non_image_files(my_data_dir):
    """
    Removes non-image files from each immediate subfolder of the specified directory.

    Args:
        my_data_dir (str or Path): Path to the directory containing image folders.

    The function scans each folder within `my_data_dir`, identifies non-image files,
    deletes them, and prints a summary of the number of image and non-image files per folder.
    """

    # Define valid image file extensions
    image_extensions = {'.png', '.jpg', '.jpeg'}

    # Convert the directory path to a Path object for better handling
    my_data_dir = Path(my_data_dir)

    # Check if directory exists
    if not my_data_dir.exists():
        print(f"❌ Directory does not exist: {my_data_dir}")
        return

    # Iterate through each folder in the main directory
    for folder in my_data_dir.iterdir():
        if folder.is_dir():  # Ensure we only process directories
            image_count = 0
            non_image_count = 0

            # Iterate through files inside the folder
            for given_file in folder.iterdir():
                if not given_file.is_file():
                    continue  # Skip nested directories so they are not counted as images
                # Compare the lowercased extension; files without one count as non-images
                if given_file.suffix.lower() not in image_extensions:
                    given_file.unlink()  # Remove non-image file
                    non_image_count += 1
                else:
                    image_count += 1

            # Print a summary of processed files
            print(f"📂 Folder: {folder.name} - has 📄 {image_count} image files")
            print(f"📂 Folder: {folder.name} - has 📄 {non_image_count} non-image files")

Now we call the function...

In [10]:
remove_non_image_files(my_data_dir=r'inputs/cherry-leaves/cherry-leaves')

📂 Folder: healthy - has 📄 2104 image files
📂 Folder: healthy - has 📄 0 non-image files
📂 Folder: powdery_mildew - has 📄 2104 image files
📂 Folder: powdery_mildew - has 📄 0 non-image files


## Split the Data

Now that we know the data is all images, we can split it into the sets we will need to train and evaluate a model. These sets are:

- Train set
- Test set
- Validation set

The function below will create subfolders and split the data amongst them according to the arguments we define when calling the function.



In [11]:
from pathlib import Path
import shutil
import random

def split_train_validation_test_images(data_dir, train_ratio, val_ratio, test_ratio):
    """
    Splits images into train, validation, and test sets based on given ratios.

    Args:
        data_dir (str or Path): The directory containing subfolders of images, where each subfolder represents a class label.
        train_ratio (float): The proportion of images allocated to the training set.
        val_ratio (float): The proportion of images allocated to the validation set.
        test_ratio (float): The proportion of images allocated to the test set.

    Raises:
        ValueError: If the sum of train_ratio, val_ratio, and test_ratio is not equal to 1.0.
    """

    # Ensure the ratios sum to 1.0 (rounded for floating-point precision)
    if round(train_ratio + val_ratio + test_ratio, 5) != 1.0:
        raise ValueError("train_ratio + val_ratio + test_ratio must sum to 1.0")

    # Convert data_dir to a Path object if not already
    data_dir = Path(data_dir)

    # Get only class-label directories, skipping split folders left by a previous run
    labels = [label for label in data_dir.iterdir()
              if label.is_dir() and label.name not in ('train', 'validation', 'test')]

    # Create train, validation, and test directories
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            (data_dir / folder / label.name).mkdir(parents=True, exist_ok=True)

    for label in labels:
        files = [f for f in label.iterdir() if f.is_file()]  # List only files, ignoring subdirectories
        random.shuffle(files)  # Shuffle to ensure randomness in splits

        # Calculate split indices
        train_count = int(len(files) * train_ratio)
        val_count = int(len(files) * val_ratio)

        # Iterate and move files to corresponding directories
        for i, file_path in enumerate(files):
            if i < train_count:
                dst = data_dir / 'train' / label.name / file_path.name
            elif i < train_count + val_count:
                dst = data_dir / 'validation' / label.name / file_path.name
            else:
                dst = data_dir / 'test' / label.name / file_path.name

            shutil.move(str(file_path), str(dst))  # Move file

        # Check if the folder is empty before deletion
        if any(label.iterdir()):  
            print(f"⚠️ Not empty, skipping: {label}")
        else:
            label.rmdir()
            print(f"🗑️ Deleted empty folder: {label}")

    print("✅ Dataset split completed successfully!")

Call the function to split the data with the following ratios:

- Train = 70%
- Test = 20%
- Validation = 10%

In [12]:
split_train_validation_test_images(data_dir=r"inputs/cherry-leaves/cherry-leaves",
                                   train_ratio=0.7,
                                   val_ratio=0.1,
                                   test_ratio=0.2)

🗑️ Deleted empty folder: inputs\cherry-leaves\cherry-leaves\healthy
🗑️ Deleted empty folder: inputs\cherry-leaves\cherry-leaves\powdery_mildew
✅ Dataset split completed successfully!
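
The split function shuffles with the global random state, so each run assigns different images to each set. If a reproducible split is wanted, one option is to sort the files first and shuffle with a dedicated seeded generator. This is a sketch under that assumption, not the approach used above; the name `seeded_shuffle` and the seed value 42 are arbitrary.

```python
import random


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items that is identical for a given seed."""
    ordered = sorted(items)  # Fix the starting order, since directory listing order can vary
    random.Random(seed).shuffle(ordered)  # Private generator; global random state is untouched
    return ordered
```

Inside the split function, `files = seeded_shuffle(files)` could stand in for `random.shuffle(files)`.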


### Count number of images in each folder

In [13]:
from pathlib import Path

def count_images(folder_path):
    folder = Path(folder_path)
    return sum(1 for f in folder.iterdir() if f.is_file())  # Efficient file counting

base_dir = Path("inputs/cherry-leaves/cherry-leaves")  # Adjusted base path

print(f"Healthy Train: {count_images(base_dir / 'train' / 'healthy')}")
print(f"Powdery Mildew Train: {count_images(base_dir / 'train' / 'powdery_mildew')}")
print(f"Healthy Validation: {count_images(base_dir / 'validation' / 'healthy')}")
print(f"Powdery Mildew Validation: {count_images(base_dir / 'validation' / 'powdery_mildew')}")
print(f"Healthy Test: {count_images(base_dir / 'test' / 'healthy')}")
print(f"Powdery Mildew Test: {count_images(base_dir / 'test' / 'powdery_mildew')}")

Healthy Train: 1472
Powdery Mildew Train: 1472
Healthy Validation: 210
Powdery Mildew Validation: 210
Healthy Test: 422
Powdery Mildew Test: 422
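
These counts follow from how the split function computes its boundaries: the train and validation counts are truncated to whole numbers with `int()`, and the test set receives whatever remains. The short sketch below reproduces that arithmetic for one class of 2104 images; the helper name `split_counts` is hypothetical.

```python
def split_counts(total, train_ratio, val_ratio):
    """Mirror the split arithmetic: truncate train and validation, give the rest to test."""
    train_count = int(total * train_ratio)
    val_count = int(total * val_ratio)
    test_count = total - train_count - val_count
    return train_count, val_count, test_count


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set ends up slightly larger than an exact 20% share (422 rather than 420.8 images per class).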


## Summary

In this notebook we have:

- Installed the required packages
- Installed the kaggle package and added the authentication token
- Downloaded the image data from kaggle.com
- Extracted the image data from the zip file, then deleted the zip file and the Kaggle authentication token
- Checked the image data for any non-image files
- Split the images into train, test and validation sets

When you are ready, move on to the next notebook where we will look at Data Visualizations!
