# **Data Collection**

## Objectives

- Download data from Kaggle and prepare it for processing

## Inputs

- kaggle.json - authentication token
- dataset - images

## Outputs

- Generated dataset: inputs/cherry-leaves/cherry-leaves
- Split dataset - train, test, validation

## Change working directory

By default, the working directory is "jupyter_notebooks", where the notebook is running. However, we need to change the working directory to its parent folder so that file references align with the broader project structure.

To do this, we first check the current working directory. Note that the saved output below shows only the last two folders of the path rather than the full system path. This is intentional, so that the full local file path on my machine is not exposed.

**Any time you revisit this notebook after logging out, or open a different notebook for the first time, you must repeat these steps to ensure the working directory is always correctly set.**

In [None]:
import os

# Get the current directory
current_dir = os.getcwd()
current_dir
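
If you want the cell itself to display only the tail of the path, one option is to trim the path before showing it. This is a minimal sketch, not part of the original notebook; the helper name `last_two_parts` is hypothetical.

```python
from pathlib import Path


def last_two_parts(path):
    """Return only the final two components of a path as a string."""
    parts = Path(path).parts
    # Slicing is safe even when the path has fewer than two components
    return str(Path(*parts[-2:]))


# Show a shortened form of the current working directory
print(last_two_parts(Path.cwd()))
```

This keeps the parent folder and the notebook folder visible while hiding everything above them.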

Now we change the working directory from "jupyter_notebooks" to the parent directory.

In [None]:
# Change the working directory to its parent folder
os.chdir(os.path.dirname(os.getcwd()))

# Confirmation message
print("You set a new current directory")


Confirm the new current directory.

In [None]:
# Confirm that the directory has changed
current_dir = os.getcwd()
current_dir
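
Because these cells are re-run every time the notebook is reopened, running the change-directory cell twice in one session would move up two levels instead of one. A minimal sketch of a guard follows, assuming the notebook folder is named "jupyter_notebooks" as described above; the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only if we are still inside the notebook folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the directory we ended up in, whether or not it changed
    return Path.cwd()


print(move_to_project_root().name)
```

Calling the helper a second time leaves the working directory unchanged, so the cell is safe to re-run.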


## Install Packages

In [None]:
%pip install -r requirements.txt

## Install Kaggle

Next we gather the data. The images are hosted on kaggle.com, so we first install the kaggle package, which provides the command-line tool used for the download.

You will need your Kaggle Token (kaggle.json) for this step.

In [None]:
# install kaggle package
%pip install kaggle==1.5.12


Drag and drop your kaggle.json file (Kaggle Token) into the same directory as README.md.

The code below will check that kaggle.json appears in the directory by listing its contents. You should see a list of entries in this directory, including kaggle.json.

In [None]:
print(os.listdir())  # Should list `kaggle.json`
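
Listing the directory confirms that the file is present but not that it is usable. The sketch below goes one step further, assuming the token uses the standard Kaggle layout with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(token_path="kaggle.json"):
    """Return True if the token file exists and has the expected fields."""
    path = Path(token_path)
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False  # File exists but is not valid JSON
    # A usable token is a JSON object with both fields present
    return isinstance(data, dict) and {"username", "key"}.issubset(data)


print(f"Token looks usable: {check_kaggle_token()}")
```

The check only inspects the file's structure; it does not contact Kaggle to confirm that the credentials are valid.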

Now we define the Kaggle dataset path and the destination folder where the downloaded images will be stored.

The cell below downloads the dataset as a zip file into inputs/cherry-leaves, creating the "inputs" and "cherry-leaves" folders if they do not already exist.

In [None]:
from pathlib import Path

# Tell the Kaggle CLI to look for kaggle.json in the current directory
os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()

# Define Kaggle dataset and destination folder using pathlib
KaggleDatasetPath = "codeinstitute/cherry-leaves"
# Ensure correct path handling across OS
DestinationFolder = Path("inputs") / "cherry-leaves"

# Download the Kaggle dataset into the specified folder
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Now we need to unzip the downloaded file and get hold of the images. 

The cell below will unzip the file and store the images inside a new directory within the "inputs" folder.

The code will also delete the zip file and your Kaggle Token for data protection purposes.

In [None]:
import zipfile
from pathlib import Path

# Define the file paths
zip_file = Path("inputs") / "cherry-leaves" / "cherry-leaves.zip"
extract_folder = Path("inputs") / "cherry-leaves"
kaggle_token = Path("kaggle.json")

# Unzip the file and extract data
with zipfile.ZipFile(zip_file, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"Extracted files into: {extract_folder}")

# Delete the zip file after extraction
zip_file.unlink()
print(f"Deleted: {zip_file}")

# Remove Kaggle token for security purposes
if kaggle_token.exists():
    kaggle_token.unlink()
    print(f"Kaggle Token removed: {kaggle_token}")
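
The cell above raises an error if it is re-run after the zip file has been deleted. A minimal sketch of a re-run-safe variant follows; it also checks the archive for corrupt members before deleting it. The helper name `extract_and_remove` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, target_dir):
    """Extract an archive, delete it, and return the number of members."""
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if not zip_path.is_file():
        print(f"No archive found at {zip_path}; nothing to extract")
        return 0
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the first corrupt member name, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad_member}")
        archive.extractall(target_dir)
        member_count = len(archive.namelist())
    # Only reached when extraction succeeded
    zip_path.unlink()
    return member_count
```

It could be called as `extract_and_remove(zip_file, extract_folder)` using the paths defined in the cell above.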


## Data Preparation

Now we have the images downloaded and stored in the right place, we need to check that the data is suitable for our project.

As we are concerned with using image data, we need to check that any non-image files are identified and removed.

The function below will do this.

In [None]:
from pathlib import Path

def remove_non_image_files(my_data_dir):
    """
    Removes all non-image files from each folder directly inside
    the specified directory.

    Args:
        my_data_dir (str): Path to the directory containing image folders.

    The function scans each folder within `my_data_dir`, identifies non-image 
    files, deletes them, and prints a summary of the number of image and 
    non-image files per folder.
    """

    # Define valid image file extensions
    image_extensions = {'.png', '.jpg', '.jpeg'}

    # Convert directory path to Path object for better handling
    my_data_dir = Path(my_data_dir)

    # Check if directory exists
    if not my_data_dir.exists():
        print(f"Directory does not exist: {my_data_dir}")
        return

    # Iterate through each folder in the main directory
    for folder in my_data_dir.iterdir():
        if folder.is_dir():  # Ensure only directories are processed
            image_count, non_image_count = 0, 0  # Initialize counters

            # Iterate through files inside the folder
            for given_file in folder.iterdir():
                if not given_file.is_file():
                    continue  # Skip nested directories
                # Files with no extension also count as non-image files
                if given_file.suffix.lower() not in image_extensions:
                    given_file.unlink()  # Remove non-image file
                    non_image_count += 1
                else:
                    image_count += 1

            # Print summary of processed files
            print(f"Folder: {folder.name} - {image_count} image files")
            print(f"Folder: {folder.name} - {non_image_count} non-image files")


Now we call the function...

In [None]:
remove_non_image_files(my_data_dir=r'inputs/cherry-leaves/cherry-leaves')
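
Deleting files cannot be undone, so it can help to preview what would be removed first. The sketch below is a dry run under the same extension rule as the function above; the helper name `find_non_image_files` is hypothetical, and unlike the function above it searches nested folders as well.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_non_image_files(data_dir):
    """List files whose extension is not a known image extension."""
    data_dir = Path(data_dir)
    # rglob("*") walks every level below data_dir; nothing is deleted
    return sorted(
        path for path in data_dir.rglob("*")
        if path.is_file() and path.suffix.lower() not in IMAGE_EXTENSIONS
    )


for path in find_non_image_files("inputs/cherry-leaves/cherry-leaves"):
    print(path)
```

An empty result means every file found has a `.png`, `.jpg` or `.jpeg` extension. The check is based on the file name only, not on the file contents.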

## Split the Data

Now that we know the data contains only images, we can split it into the sets needed to train and evaluate a model. These sets are:

- Train set
- Test set
- Validation set

The function below will create subfolders and split the data amongst them according to the arguments we define when calling the function.



In [None]:
from pathlib import Path
import shutil
import random


def split_train_validation_test_images(
    data_dir, train_ratio, val_ratio, test_ratio
):
    """
    Splits images into train, validation, and test sets based on given ratios.

    Args:
        data_dir (str or Path): Directory containing subfolders of images,
            where each subfolder represents a class label.
        train_ratio (float): Proportion of images allocated to training.
        val_ratio (float): Proportion of images allocated to validation.
        test_ratio (float): Proportion of images allocated to testing.

    Raises:
        ValueError: If train_ratio, val_ratio, and test_ratio
        do not sum to 1.0.
    """

    # Ensure the ratios sum to 1.0 (rounded for floating-point precision)
    if round(train_ratio + val_ratio + test_ratio, 5) != 1.0:
        raise ValueError("ratios must sum to 1.0")

    # Convert data_dir to Path object if not already an object
    data_dir = Path(data_dir)

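    # Get only class-label directories, skipping any split folders
    # left behind by a previous run
    split_names = ['train', 'validation', 'test']
    labels = [
        label for label in data_dir.iterdir()
        if label.is_dir() and label.name not in split_names
    ]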

    # Create train, validation, and test directories
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            (data_dir / folder / label.name).mkdir(parents=True, exist_ok=True)

    for label in labels:
        # List only files and ignore subdirectories
        files = [f for f in label.iterdir() if f.is_file()]
        random.shuffle(files)  # Shuffle for randomness

        # Calculate split indices
        train_count = int(len(files) * train_ratio)
        val_count = int(len(files) * val_ratio)

        # Move files to corresponding directories
        for i, file_path in enumerate(files):
            dst = (
                data_dir / 'train' / label.name / file_path.name
                if i < train_count else
                data_dir / 'validation' / label.name / file_path.name
                if i < train_count + val_count else
                data_dir / 'test' / label.name / file_path.name
            )

            shutil.move(str(file_path), str(dst))  # Move file

        # Check if folder is empty before deletion
        if any(label.iterdir()):
            print(f"Not empty, skipping: {label}")
        else:
            label.rmdir()
            print(f"Deleted empty folder: {label}")

    print("Dataset split completed successfully!")
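
The function above shuffles with the global random state, so each run produces a different split. If a repeatable split is wanted, one option is a seeded shuffle. This is a sketch only: the helper name `shuffled_copy` and the seed value 42 are assumptions, not part of the original notebook.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items that is identical for a given seed."""
    # Sort first so the result does not depend on directory listing order
    result = sorted(items)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(result)
    return result


file_names = ["leaf_c.jpg", "leaf_a.jpg", "leaf_b.jpg"]
print(shuffled_copy(file_names) == shuffled_copy(reversed(file_names)))
```

The comparison holds because both inputs are sorted into the same order before the same seeded shuffle is applied.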


Call the function to split the data with the following ratios:

- Train = 70%
- Test = 20%
- Validation = 10%

In [None]:
split_train_validation_test_images(
    data_dir=r"inputs/cherry-leaves/cherry-leaves",
    train_ratio=0.7,
    val_ratio=0.1,
    test_ratio=0.2
)
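
The split sizes come from truncating with `int()`, so any remainder ends up in the test set. The sketch below reproduces that arithmetic for a hypothetical folder of 105 images; the helper name `split_counts` is an assumption used for illustration.

```python
def split_counts(n_files, train_ratio, val_ratio):
    """Mirror the counting logic of the split function above."""
    train_count = int(n_files * train_ratio)  # Truncates toward zero
    val_count = int(n_files * val_ratio)
    # The test set receives whatever is left over
    test_count = n_files - train_count - val_count
    return train_count, val_count, test_count


print(split_counts(105, 0.7, 0.1))  # → (73, 10, 22)
```

Here 70% of 105 is 73.5 and 10% is 10.5; both are rounded down, which leaves 22 images (about 21%) for the test set rather than exactly 20%.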

### Count number of images in each folder

In [None]:
from pathlib import Path


def count_images(folder_path):
    """
    Counts the number of image files in the specified folder.

    Args:
        folder_path (str or Path): Path to the folder containing images.

    Returns:
        int: Number of image files in the folder.
    """
    folder = Path(folder_path)
    return sum(1 for f in folder.iterdir() if f.is_file())


# Define base dataset directory
base_dir = Path("inputs/cherry-leaves/cherry-leaves")

# Print image counts for each dataset split and class
label_names = {"healthy": "Healthy", "powdery_mildew": "Powdery Mildew"}
for split in ("train", "validation", "test"):
    for folder_name, display_name in label_names.items():
        count = count_images(base_dir / split / folder_name)
        print(f"{display_name} {split.title()}: {count}")
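
To check that the counts match the requested ratios, the per-folder totals can be turned into shares of the whole dataset. This is a minimal sketch; the helper names `summarise_splits` and `split_shares` are hypothetical.

```python
from pathlib import Path


def summarise_splits(base_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the split folders found."""
    base_dir = Path(base_dir)
    summary = {}
    for split in splits:
        split_dir = base_dir / split
        if not split_dir.is_dir():
            continue  # Skip splits that have not been created
        summary[split] = {
            label.name: sum(1 for f in label.iterdir() if f.is_file())
            for label in sorted(split_dir.iterdir())
            if label.is_dir()
        }
    return summary


def split_shares(summary):
    """Return each split's share of all files in the summary."""
    totals = {name: sum(counts.values()) for name, counts in summary.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: total / grand_total for name, total in totals.items()}
```

For example, `split_shares(summarise_splits(base_dir))` should return values close to 0.7, 0.1 and 0.2 for the split performed above.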


## Summary

In this notebook we have:

- Installed the required packages
- Installed the Kaggle package and added the authentication token
- Downloaded the image data from kaggle.com
- Extracted the image data from the zip file, then deleted the zip file and the Kaggle authentication token
- Checked the image data for non-image files and removed any that were found
- Split the images into train, test and validation sets
- Counted how many images are in each folder

When you are ready, move on to the next notebook where we will look at Data Visualizations!
