# **Data Collection**

## Objectives

* Fetch data from Kaggle's cherry leaf dataset and prepare it for further analysis and model development.

## Inputs

* Kaggle API Key: The kaggle.json file containing the authentication token used to access the Kaggle dataset.
* Kaggle Dataset ID: codeinstitute/cherry-leaves, the ID of the cherry leaf dataset on Kaggle.

## Outputs

* Generated Dataset: The images of cherry leaves (both healthy and infected), saved in the inputs/cherry-leaves directory and split into train, validation, and test folders.

## Additional Comments

* No additional comments at this stage.



---

## Import packages

In [1]:
import os  # Library for interacting with the operating system
import shutil  # Used for high-level file operations like moving and copying files
import random  # To shuffle data for random splits between training, validation, and test sets
import zipfile  # For extracting files from zip archives

## Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-in-Cherry-Leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-in-Cherry-Leaves'
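
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder."""
    current_dir = os.getcwd()
    # Only climb one level if the current folder is the (assumed) notebook folder
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.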

## Download dataset

In [5]:
# Change Kaggle configuration directory to the current working directory and set permission for kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
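
The download in the next cell fails if `kaggle.json` is not in the configuration directory. A minimal sketch of a check to run first, assuming the token sits in the current working directory as configured above; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle CLI:

```python
import os
import stat


def check_kaggle_token(config_dir):
    """Return the token path, raising a clear error if the file is missing."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected a Kaggle API token at {token_path}")
    # Restrict access to owner read/write, the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

For example, `check_kaggle_token(os.getcwd())` could replace the shell `chmod` call and fail early with a readable message.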

In [6]:
# Set the Kaggle dataset ID and the destination folder, then download the archive
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs"
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs
 93%|███████████████████████████████████▏  | 51.0M/55.0M [00:05<00:00, 7.29MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:05<00:00, 10.0MB/s]


In [7]:
# Unzip the downloaded file and delete the zip file to save space
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
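
Later cells rely on the `inputs/cherry-leaves` path, so it can help to check which top-level folder the archive unpacks to before extracting and deleting it. A small sketch using only the standard `zipfile` module; `top_level_entries` is a hypothetical helper name:

```python
import zipfile


def top_level_entries(zip_path):
    """Return the set of top-level names inside a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Zip member names always use forward slashes as separators
        return {name.split("/")[0] for name in zip_ref.namelist()}
```

This would have to run before the archive is removed, for example `top_level_entries(DestinationFolder + '/cherry-leaves.zip')`.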

---

## Data Preparation

### Data Cleaning

* Remove non-image files

In [8]:
def clean_non_image_files(data_dir):
    """
    This function removes non-image files from the dataset directories.
    Only .png, .jpg, and .jpeg files are kept.
    """
    allowed_extensions = ('.png', '.jpg', '.jpeg')
    
    # Loop through the folders inside the data directory
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        
        if os.path.isdir(folder_path):
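            files_in_folder = os.listdir(folder_path)

            # Collect regular files whose extension is not in the allowed list
            # (subfolders are skipped, since os.remove() cannot delete directories)
            non_image_files = [
                file for file in files_in_folder
                if os.path.isfile(os.path.join(folder_path, file))
                and not file.lower().endswith(allowed_extensions)
            ]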
            
            for non_image_file in non_image_files:
                file_path = os.path.join(folder_path, non_image_file)
                os.remove(file_path)  # Remove the non-image file
                print(f"Removed: {file_path}")
    
    print("Data cleaning complete. Non-image files removed.")

# Specify the directory where the cherry leaf images are stored
clean_non_image_files('inputs/cherry-leaves')


Data cleaning complete. Non-image files removed.
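
Because the cleaning step deletes files, a read-only summary of what each folder contains can be useful first. A minimal sketch, assuming the same one-folder-per-class layout; `count_extensions` is a hypothetical helper and deletes nothing:

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Return {folder_name: Counter of lowercase file extensions} for each subfolder."""
    summary = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if os.path.isdir(folder_path):
            # Count only regular files, keyed by extension such as '.jpg'
            summary[folder] = Counter(
                os.path.splitext(name)[1].lower()
                for name in os.listdir(folder_path)
                if os.path.isfile(os.path.join(folder_path, name))
            )
    return summary
```

Any extension other than `.png`, `.jpg`, or `.jpeg` in the result marks files that `clean_non_image_files` would remove.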


* Split dataset

In [9]:

def organize_dataset(data_directory, train_split=0.7, validation_split=0.1, test_split=0.2):
    """
    Organizes the dataset into training, validation, and test sets.
    The function creates folders for each split and moves files accordingly.
    """
    
    # Ensure the ratios sum to 1 (within floating-point tolerance)
    if abs((train_split + validation_split + test_split) - 1.0) > 1e-6:
        raise ValueError("The splits must sum to 1.0")
    
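    # Identify class directories (healthy/infected), skipping split folders
    # left behind by a previous run so they are not treated as classes
    split_names = ['train', 'validation', 'test']
    categories = [
        category for category in os.listdir(data_directory)
        if os.path.isdir(os.path.join(data_directory, category)) and category not in split_names
    ]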
    
    # Create necessary folders for train, validation, and test splits
    for split in ['train', 'validation', 'test']:
        for category in categories:
            split_path = os.path.join(data_directory, split, category)
            os.makedirs(split_path, exist_ok=True)
    
    # Move images to their respective split directories
    for category in categories:
        category_path = os.path.join(data_directory, category)
        images = os.listdir(category_path)
        random.shuffle(images)  # Shuffle to ensure random distribution
        
        # Calculate the train and validation sizes; int() truncates,
        # so any leftover images end up in the test set
        train_size = int(len(images) * train_split)
        validation_size = int(len(images) * validation_split)
        
        # Move images to training set
        for image in images[:train_size]:
            src = os.path.join(category_path, image)
            dst = os.path.join(data_directory, 'train', category, image)
            shutil.move(src, dst)
        
        # Move images to validation set
        for image in images[train_size:train_size + validation_size]:
            src = os.path.join(category_path, image)
            dst = os.path.join(data_directory, 'validation', category, image)
            shutil.move(src, dst)
        
        # Move remaining images to test set
        for image in images[train_size + validation_size:]:
            src = os.path.join(category_path, image)
            dst = os.path.join(data_directory, 'test', category, image)
            shutil.move(src, dst)

    print("Dataset has been successfully split into train, validation, and test sets.")

# Execute the function with the cherry leaves dataset
organize_dataset(data_directory="inputs/cherry-leaves", train_split=0.7, validation_split=0.1, test_split=0.2)



Dataset has been successfully split into train, validation, and test sets.
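
The split sizes follow from integer truncation, and the shuffle is unseeded, so each run assigns different images to each set. A small sketch of both points, using a hypothetical count of 1005 images per class (not the real dataset size) and an arbitrary seed of 42; `split_sizes` is a hypothetical helper that mirrors the arithmetic in `organize_dataset`:

```python
import random


def split_sizes(n_images, train_split=0.7, validation_split=0.1):
    """Mirror the size arithmetic used in organize_dataset."""
    train_size = int(n_images * train_split)            # truncated, not rounded
    validation_size = int(n_images * validation_split)  # truncated, not rounded
    test_size = n_images - train_size - validation_size  # test takes the remainder
    return train_size, validation_size, test_size


print(split_sizes(1005))  # (703, 100, 202)

# Seeding the generator before shuffling makes the split repeatable
images = [f"img_{i}.jpg" for i in range(10)]  # hypothetical file names
random.seed(42)
random.shuffle(images)
```

Calling `random.seed(...)` before `organize_dataset` would have the same effect on its internal shuffle, provided the file listing order is the same between runs.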


---

## Conclusions
* The Kaggle dataset of healthy and powdery mildew-infected cherry leaf images was downloaded and organized.
* The class folders were checked for non-image files so that only .png, .jpg, and .jpeg files remain.
* The data was split into training, validation, and test sets at a 70/10/20 ratio, so the model can be trained, tuned, and evaluated on separate images.
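
One way to confirm the split ratio is to count the files in each split folder. A minimal sketch, assuming the `train`/`validation`/`test` layout created by `organize_dataset`; `count_split_files` is a hypothetical helper:

```python
import os


def count_split_files(data_directory, splits=("train", "validation", "test")):
    """Return {split: {category: number_of_files}} for the split folders that exist."""
    counts = {}
    for split in splits:
        split_path = os.path.join(data_directory, split)
        if not os.path.isdir(split_path):
            continue  # Skip splits that have not been created
        counts[split] = {
            category: len(os.listdir(os.path.join(split_path, category)))
            for category in sorted(os.listdir(split_path))
            if os.path.isdir(os.path.join(split_path, category))
        }
    return counts
```

For this notebook the call would be `count_split_files("inputs/cherry-leaves")`; the per-class counts should be close to 70%, 10%, and 20% of each class total.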