# **Data Collection**

## Objectives

* Fetch the cherry leaf dataset from Kaggle (or another specified source), containing images of healthy leaves and leaves with powdery mildew.
* Save the collected data as raw data for further processing and analysis.

## Inputs

* Kaggle API credentials (if the dataset is hosted on Kaggle).
* URL or file paths to the cherry leaf dataset.
* Access to storage location where raw data will be saved.

## Outputs

* Raw dataset files saved in a structured directory format.
* Metadata file summarizing the dataset information (e.g., number of images per class).
* Basic statistics about the dataset (e.g., image dimensions, file size).
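
The metadata output listed above could be produced along the following lines once the dataset has been extracted. This is a minimal sketch, assuming the raw data sits in one sub-folder per class; the helper names `summarize_dataset` and `write_metadata` and the output file name are hypothetical and not part of the original notebook.

```python
import json
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def summarize_dataset(data_dir):
    """Return per-class image counts and total size in bytes (hypothetical helper)."""
    summary = {}
    for label in sorted(os.listdir(data_dir)):
        label_dir = os.path.join(data_dir, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files at the top level
        images = [f for f in os.listdir(label_dir)
                  if f.lower().endswith(IMAGE_EXTENSIONS)]
        total_bytes = sum(os.path.getsize(os.path.join(label_dir, f))
                          for f in images)
        summary[label] = {"image_count": len(images),
                          "total_bytes": total_bytes}
    return summary


def write_metadata(data_dir, output_path):
    """Write the summary to a JSON file (assumed file name) and return it."""
    summary = summarize_dataset(data_dir)
    with open(output_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    return summary
```

For example, `write_metadata("inputs/cherry-leaves", "dataset_metadata.json")` would record the image count and total size for each class folder.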

## Additional Comments

* Ensure that all data handling follows NDA guidelines.
* Store the raw data in a secure and restricted-access environment to comply with privacy concerns.
* If the dataset size is large, consider storing it in a cloud-based service for scalability and accessibility.



---

# Import packages

In [13]:

%pip install -r /workspaces/CI_PP5/requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [14]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/CI_PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.chdir() sets the new current directory

In [15]:
os.chdir(os.path.dirname(os.getcwd()))  # move up to the parent of the current directory
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [16]:
current_dir = os.getcwd()
current_dir

'/workspaces/CI_PP5'

# Install Kaggle

In [17]:
# install kaggle package
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


---

Run the cell below **to change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**.

In [18]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
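
The `chmod 600` step reports only a shell error if `kaggle.json` is missing. The following is a minimal sketch of a pure-Python equivalent that checks for the token file first. The helper name `prepare_kaggle_credentials` is hypothetical, and it assumes the token file sits in the chosen configuration directory.

```python
import os
import stat


def prepare_kaggle_credentials(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `prepare_kaggle_credentials(os.getcwd())` would mirror the cell above while failing early with a clear message when the token is absent.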

Set the Kaggle dataset path and the destination folder, then download the dataset.

In [20]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 20.0MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 20.1MB/s]


Unzip the downloaded file, and delete the zip file.

In [21]:
import zipfile
import os

# Set your destination folder where the zip file is located
DestinationFolder = 'inputs/'
zip_file_name = 'cherry-leaves.zip'
zip_file_path = os.path.join(DestinationFolder, zip_file_name)

# Check if the zip file exists before attempting to extract
if os.path.exists(zip_file_path):
    # Extract the contents of the zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    print(f"Extraction complete: Contents extracted to {DestinationFolder}")
    
    # Optionally remove the zip file after extraction to save space
    os.remove(zip_file_path)
    print(f"Removed the zip file: {zip_file_name}")
else:
    print(f"Zip file not found: {zip_file_path}")

Extraction complete: Contents extracted to inputs/
Removed the zip file: cherry-leaves.zip


---

# Data Preparation

---

## Data cleaning

### Check and remove non-image files

In [24]:
def remove_non_image_file(my_data_dir):
    """
    Remove all non-image files from each subfolder of my_data_dir.

    A file counts as an image if its extension is .png, .jpg or .jpeg
    (case-insensitive). Files with any other extension are deleted, and the
    number of image and non-image files in each folder is printed.

    Parameters:
    my_data_dir: str - Path to the directory containing labeled folders
        with files to be checked and cleaned.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        print(folders)  # list of class folders found in my_data_dir
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)


In [25]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves')

['healthy', 'powdery_mildew']
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
['healthy', 'powdery_mildew']
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
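
The check above relies on file extensions only, so a corrupt or mislabeled file with a `.jpg` name would pass. As a complementary step, the following minimal sketch inspects the leading bytes of each file for the PNG and JPEG signatures. The helper names `looks_like_image` and `find_suspect_files` are hypothetical and not part of the original notebook; a signature match does not guarantee that the whole file decodes correctly.

```python
import os

# Leading bytes ("magic numbers") of the two formats the notebook accepts
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as fh:
        header = fh.read(8)
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)


def find_suspect_files(data_dir):
    """List files whose extension says 'image' but whose content does not."""
    suspects = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            if name.lower().endswith(IMAGE_EXTENSIONS) and not looks_like_image(path):
                suspects.append(path)
    return sorted(suspects)
```

Running `find_suspect_files('inputs/cherry-leaves')` only reports candidates; whether to delete them is left as a manual decision.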


## Split train validation test set

In [26]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Split images into train, validation and test sets based on the given ratios.

    The function checks that the ratios sum to 1.0, reads the class labels
    (sub-folder names) from my_data_dir, creates the train/validation/test
    folders, shuffles the files of each label and moves them according to the
    ratios. The original label folders are then removed. If a 'test' folder
    already exists, the data is assumed to be split and nothing is done.

    Parameters:
    my_data_dir: str - Path to the main directory containing labeled image folders (e.g., data/cherry_leaves).
    train_set_ratio: float - Proportion of images for the training set.
    validation_set_ratio: float - Proportion of images for the validation set.
    test_set_ratio: float - Proportion of images for the test set.
    """
    # compare with a tolerance: exact equality on floating-point sums is fragile
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels; my_data_dir should contain only the label folders
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # the dataset has already been split
        return

    # create train, validation and test folders, each with one sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                target_set = 'validation'
            else:
                # the remaining files go to the test set
                target_set = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target_set, label, file_name))

        # the label folder is empty now, so it can be removed
        os.rmdir(label_dir)


Conventionally, the data is split as follows:
* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives 20% of the data (ratio 0.20).
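
To make the ratios concrete, the sketch below reproduces the arithmetic of the splitting function defined earlier: the train and validation counts are truncated with `int()`, and every remaining file goes to the test set. The helper name `split_counts` is hypothetical; the figure of 2104 images per class comes from the cleaning output above.

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) file counts for one class folder."""
    train = int(n_files * train_ratio)            # truncated, not rounded
    validation = int(n_files * validation_ratio)  # truncated, not rounded
    test = n_files - train - validation           # remainder goes to test
    return train, validation, test


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set can end up slightly larger than its nominal 20% share (422 instead of 420.8 here).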

In [29]:
# Split the images in the inputs/cherry-leaves directory into training, validation and test sets.
# 70% of the images go to the training set, 10% to the validation set and 20% to the test set.
# Parameters
#    my_data_dir: Path to the dataset ("inputs/cherry-leaves").
#    train_set_ratio: Proportion of images for training (0.7).
#    validation_set_ratio: Proportion of images for validation (0.1).
#    test_set_ratio: Proportion of images for testing (0.2).
split_train_validation_test_images(my_data_dir="inputs/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
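
After the split it can be useful to confirm that each set received the expected number of files. The following is a minimal sketch, assuming the `train`, `validation` and `test` folder layout created by the function above; the helper name `count_split_files` is hypothetical.

```python
import os


def count_split_files(data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the split folders that exist."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split folder has not been created
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
            if os.path.isdir(os.path.join(split_dir, label))
        }
    return counts
```

For instance, `count_split_files("inputs/cherry-leaves")` returns a nested dictionary that can be printed or compared against the expected counts per class.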


---

# Congratulations

---