# **Data Collection**


## Objectives

- Download data from Kaggle and prepare it for processing

## Inputs

- kaggle.json - authentication token
- dataset - images

## Outputs

- Generated dataset: inputs/cherry-leaves/cherry-leaves
- Split dataset - train, test, validation




---

# Change working directory

* The notebooks are assumed to live in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from its current folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves'
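The same parent-directory lookup can also be written with `pathlib`. The following is a minimal sketch, not part of the original notebook; the helper name `project_root` is hypothetical and the function only computes the path, leaving the actual `os.chdir()` call to the caller.

```python
from pathlib import Path


def project_root(notebook_dir):
    """Return the parent folder of the notebook directory as a string.

    `project_root` is a hypothetical helper name used for illustration.
    """
    return str(Path(notebook_dir).parent)


# Possible usage inside a notebook cell (left commented out so that
# running this block does not change the working directory):
# import os
# os.chdir(project_root(os.getcwd()))
```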

## Import packages

Install the packages necessary for running this project.

In [4]:
%pip install -r /workspaces/milestone-project-mildew-detection-in-cherry-leaves/requirements.txt


[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


---

## Install Kaggle

Install the kaggle package to download the image data from kaggle.com.

In [5]:
# install kaggle package
%pip install kaggle==1.5.12


[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


---

### Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

Drag and drop your kaggle.json file from your computer into the root directory then follow the steps below.

#### Check that kaggle.json appears in the root directory

In [6]:
import os
print(os.listdir())  # Should list `kaggle.json`

['.devcontainer', '.github', 'setup.sh', '.git', '.gitignore', '.slugignore', 'README.md', 'Procfile', 'requirements.txt', '.python-version', 'jupyter_notebooks', 'kaggle.json']


#### Set Kaggle configuration directory

In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
print(f"Kaggle config directory set to: {os.environ['KAGGLE_CONFIG_DIR']}")

Kaggle config directory set to: /workspaces/milestone-project-mildew-detection-in-cherry-leaves
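The heading above mentions setting permissions on the token, but the cells do not show that step. The sketch below is one possible way to do it, assuming a POSIX file system such as the Codespace used here; the helper name `secure_kaggle_token` is hypothetical. It restricts the file to owner read/write (mode 600), which is the permission the Kaggle CLI suggests when it warns that the key is readable by other users.

```python
import os
import stat


def secure_kaggle_token(token_path):
    """Restrict the token file to owner read/write and return its new mode.

    `secure_kaggle_token` is a hypothetical helper, not part of the
    original notebook. Assumes a POSIX file system.
    """
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No token found at {token_path}")

    # Equivalent to `chmod 600 kaggle.json` on the command line
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Return only the permission bits so the caller can verify them
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

In the notebook this could be called as `secure_kaggle_token(os.path.join(os.getcwd(), "kaggle.json"))` before the download cell.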


#### Get the dataset path and set the destination folder

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry-leaves
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:01<00:00, 53.3MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 41.8MB/s]


#### Unzip the downloaded file, delete the zip and the kaggle token

---

In [9]:
import zipfile
import os

# Extract ZIP file
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

# Remove the ZIP file
os.remove(DestinationFolder + '/cherry-leaves.zip')
print("Zip file deleted successfully.")

# Define Kaggle.json path for GitHub Codespace (root directory)
kaggle_json_path = os.path.join(os.getcwd(), "kaggle.json")  # Gets the current working directory

if os.path.exists(kaggle_json_path):
    os.remove(kaggle_json_path)
    print("Deleted kaggle.json successfully.")

print("Cleanup completed.")


Zip file deleted successfully.
Deleted kaggle.json successfully.
Cleanup completed.
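Because the archive is deleted immediately after extraction, it can be worth checking its integrity first. The following is a hedged sketch rather than the notebook's own method: the helper name `extract_and_remove` is hypothetical, and it uses `ZipFile.testzip()` from the standard library to look for corrupt members before extracting.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Verify, extract, then delete a zip archive.

    `extract_and_remove` is a hypothetical helper for illustration.
    Returns the sorted names of the top-level entries in `destination`.
    """
    zip_path = Path(zip_path)

    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the first corrupt member name, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        archive.extractall(destination)

    # Only delete the archive once extraction has succeeded
    zip_path.unlink()
    return sorted(entry.name for entry in Path(destination).iterdir())
```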


## Data Preparation

### Check and remove non-image files

In [10]:

from pathlib import Path

def remove_non_image_files(my_data_dir):
    """
    Removes all non-image files from the specified directory and its subfolders.

    Args:
        my_data_dir (str): Path to the directory containing image folders.

    The function scans each folder within `my_data_dir`, identifies non-image files, 
    deletes them, and prints a summary of the number of image and non-image files per folder.
    """
    
    # Define valid image file extensions
    image_extensions = ('.png', '.jpg', '.jpeg')
    
    # Convert the directory path to a Path object for better path handling
    my_data_dir = Path(my_data_dir)

    # Iterate through each folder in the main directory
    for folder in my_data_dir.iterdir():
        if folder.is_dir():  # Ensure we only process directories
            image_count = 0
            non_image_count = 0

            # Iterate through files inside the folder
            for given_file in folder.iterdir():
                if given_file.suffix.lower() not in image_extensions:
                    given_file.unlink()  # Remove non-image file
                    non_image_count += 1
                else:
                    image_count += 1

            # Print a summary of processed files
            print(f"Folder: {folder.name} - has {image_count} image files")
            print(f"Folder: {folder.name} - has {non_image_count} non-image files")

### Call the function to remove any non-image files

In [11]:
remove_non_image_files(my_data_dir=r'inputs/cherry-leaves/cherry-leaves')

Folder: healthy - has 2104 image files
Folder: healthy - has 0 non-image files
Folder: powdery_mildew - has 2104 image files
Folder: powdery_mildew - has 0 non-image files


## Split the data into train, validation and test sets

Define a function that creates directories for the train, validation and test images and then splits the data among them.

In [12]:
import os
import shutil
import random

def split_train_validation_test_images(data_dir, train_ratio, val_ratio, test_ratio):
    """
    Splits images into train, validation, and test sets based on given ratios.

    Args:
        data_dir (str): The directory containing subfolders of images, where each subfolder represents a class label.
        train_ratio (float): The proportion of images allocated to the training set.
        val_ratio (float): The proportion of images allocated to the validation set.
        test_ratio (float): The proportion of images allocated to the test set.

    Raises:
        ValueError: If the sum of train_ratio, val_ratio, and test_ratio is not equal to 1.0.
    """

    # Ensure the ratios sum to 1.0 (rounded for floating-point precision)
    if round(train_ratio + val_ratio + test_ratio, 5) != 1.0:
        raise ValueError("train_ratio + val_ratio + test_ratio must sum to 1.0")

    # Get only directories (class labels) inside the given dataset directory
    labels = [label for label in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, label))]

    # Create train, validation, and test directories, ensuring they exist before moving files
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(data_dir, folder, label), exist_ok=True)  # Create folder if it doesn't exist

    for label in labels:
        label_path = os.path.join(data_dir, label)

        # List all image files (ignoring subdirectories, if any)
        files = [f for f in os.listdir(label_path) if os.path.isfile(os.path.join(label_path, f))]

        # Shuffle the file list to ensure randomness in splits
        random.shuffle(files)

        # Calculate split indices based on the ratios
        train_count = int(len(files) * train_ratio)  # Number of images for training
        val_count = int(len(files) * val_ratio)  # Number of images for validation

        # Iterate through files and move them to corresponding directories
        for i, file_name in enumerate(files):
            src = os.path.join(label_path, file_name)  # Source file path

            if i < train_count:  # Move files to training folder
                dst = os.path.join(data_dir, 'train', label, file_name)
            elif i < train_count + val_count:  # Move files to validation folder
                dst = os.path.join(data_dir, 'validation', label, file_name)
            else:  # Move remaining files to test folder
                dst = os.path.join(data_dir, 'test', label, file_name)

            shutil.move(src, dst)  # Move file to the new location

        # Remove the now-empty label folder after all images have been moved
        os.rmdir(label_path)

    print("Dataset split completed successfully.")
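
The function above shuffles with the global `random` state, so each run produces a different split. If a repeatable split is wanted, one option is to sort the file names first (the order returned by `os.listdir()` is arbitrary) and shuffle with a seeded generator. This is a sketch under that assumption; the helper name `shuffled_copy` and the default seed are hypothetical.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a reproducibly shuffled copy of the file names.

    `shuffled_copy` and the default seed are illustrative assumptions,
    not part of the original notebook.
    """
    rng = random.Random(seed)       # Private generator, global state untouched
    shuffled = sorted(file_names)   # Fix the starting order before shuffling
    rng.shuffle(shuffled)
    return shuffled
```

Inside the split function, `random.shuffle(files)` could then be swapped for `files = shuffled_copy(files)`.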

### Call the function to split the data

Split the images with the following ratios:

- Train = 70%
- Test = 20%
- Validation = 10%

In [13]:
split_train_validation_test_images(data_dir=r"inputs/cherry-leaves/cherry-leaves",
                                   train_ratio=0.7,
                                   val_ratio=0.1,
                                   test_ratio=0.2)

Dataset split completed successfully.
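
To see what these ratios mean in practice, the count arithmetic from the function can be reproduced on its own. The sketch below mirrors the `int(...)` truncation used above and applies it to the 2104 images reported per class; the helper name `split_counts` is hypothetical.

```python
def split_counts(total, train_ratio, val_ratio):
    """Mirror the split arithmetic: truncate train and validation counts,
    and give whatever remains to the test set.

    `split_counts` is a hypothetical helper for illustration.
    """
    train_count = int(total * train_ratio)
    val_count = int(total * val_ratio)
    test_count = total - train_count - val_count
    return train_count, val_count, test_count


# 2104 images per class, as reported by the cleaning step
print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation counts are truncated, the test set absorbs the remainder and ends up slightly above 20% of each class.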


# Summary

In this notebook we have:

- Installed the required packages
- Installed Kaggle and configured the authentication token
- Downloaded the image data from kaggle.com
- Extracted the image data from the zip file, then deleted the zip file and the Kaggle authentication token
- Checked the image data for non-image files and removed any that were found
- Split the images into train, test and validation sets

When you are ready, move on to the next notebook where we will look at Data Visualizations!