### Data Collection

**Objectives**
- Fetch data from Kaggle and prepare it for further processing

**Inputs**
- Kaggle JSON file (`kaggle.json`) containing the authentication token

**Outputs**
- Generate Dataset: inputs/cherry_leaves_dataset/cherry-leaves
- Data Preparation
    + Data cleaning: check for and remove non-image files
    + Split into train, validation, and test sets (0.7, 0.1, 0.2)

**Additional Comments | Insights | Conclusions**
- No comments

**Import Packages**

In [1]:
import numpy as np
import os
import zipfile
import shutil
import random
import joblib


**Change the working directory**

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'

In [3]:
os.chdir('/workspace/mildew-detection-in-cherry-leaves')
print("You set a new current directory")

You set a new current directory


In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves'

**Install Kaggle**

In [5]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Point the Kaggle configuration directory at the current working directory and restrict the permissions of the Kaggle authentication JSON file

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
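
The same step can be written in pure Python, which makes a missing token visible before the download is attempted. This is a minimal sketch, assuming `kaggle.json` sits in the working directory; `configure_kaggle` is a hypothetical helper name and is not part of the Kaggle package.

```python
import os
import stat


def configure_kaggle(config_dir="."):
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper for illustration, not part of the Kaggle package.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")

    os.environ["KAGGLE_CONFIG_DIR"] = os.path.abspath(config_dir)
    # Same effect as `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `configure_kaggle(os.getcwd())` would replace the two lines of the cell above.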

Set the Kaggle dataset path and the destination folder, then download the dataset

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 91%|██████████████████████████████████▌   | 50.0M/55.0M [00:01<00:00, 43.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 32.6MB/s]


Unzip the downloaded file, then delete the zip file

In [8]:
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
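
The later cells assume the archive unpacks into a `cherry-leaves` folder. A quick way to confirm that is to list the archive's top-level entries before the zip file is removed. The sketch below uses only the standard library; `top_level_entries` is a hypothetical helper name.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Zip member names use '/' as the separator regardless of platform
        return sorted({name.split("/")[0] for name in zip_ref.namelist() if name})
```

It would need to run before `os.remove`, for example as `top_level_entries(DestinationFolder + '/cherry-leaves.zip')`.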

_________________________________________________________________________________
### Data Preparation
_________________________________________________________________________________

**Data cleaning**

Check for and remove non-image files

In [9]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files next to the class folders

        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)


In [10]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
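
The check above looks only at file extensions, so a mislabelled file with a `.jpg` suffix would still count as an image. As an optional extra, the sketch below compares the first bytes of a file with the standard PNG and JPEG signatures. `looks_like_image` is a hypothetical helper, and it does not verify that the whole file decodes correctly.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as f:
        header = f.read(len(PNG_SIGNATURE))
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```

Files for which this returns `False` could be listed for manual inspection rather than deleted outright.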


**Split into train, validation, and test sets**

In [11]:
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # compare with a tolerance: exact float equality is fragile for sums of decimal fractions
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels: only sub-folders count, stray files are ignored
    labels = [entry for entry in os.listdir(my_data_dir)
              if os.path.isdir(os.path.join(my_data_dir, entry))]
    already_split = {'train', 'validation', 'test'}.issubset(labels)
    if already_split:
        print("Dataset already split.")
        return

    # create train, validation, and test folders with class label sub-folders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label), exist_ok=True)

    for label in labels:
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files):
            src_path = os.path.join(my_data_dir, label, file_name)

            if count < train_set_files_qty:
                dest_folder = 'train'
            elif count < train_set_files_qty + validation_set_files_qty:
                dest_folder = 'validation'
            else:
                dest_folder = 'test'

            shutil.move(src_path, os.path.join(my_data_dir, dest_folder, label, file_name))

        # remove the now-empty class folder (inside the loop, so every label is cleaned up)
        os.rmdir(os.path.join(my_data_dir, label))
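
`random.shuffle(files)` draws from the global random state, so each run produces a different split unless a seed is set. If a reproducible split is wanted, one option is to sort the file names first (the order returned by `os.listdir` is arbitrary) and shuffle with a dedicated seeded generator. This is a sketch only: `shuffled_files` is a hypothetical helper and the seed value 42 is arbitrary.

```python
import random


def shuffled_files(files, seed=42):
    """Return a reproducibly shuffled copy of files (the seed value is arbitrary)."""
    ordered = sorted(files)  # fix the starting order before shuffling
    random.Random(seed).shuffle(ordered)
    return ordered
```

Inside the split function, `files = shuffled_files(os.listdir(...))` would take the place of the `random.shuffle(files)` call.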

In [12]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
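
Because the function truncates the train and validation counts with `int()` and sends every remaining file to the test set, the expected size of each split can be worked out in advance. The sketch below repeats that arithmetic for the 2104 images per class reported by the cleaning step; `expected_split_sizes` is a hypothetical helper name.

```python
def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the split arithmetic: truncate train and validation, rest goes to test."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation
    return train, validation, test


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

The test share ends up slightly above 20% (422 of 2104) because it absorbs the rounding remainder from the other two sets.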