# **Data Collection**

### Objectives

* Fetch data from Kaggle and save as raw data. 
* Prepare the data for further processes.
* Split the data into training, validation, and test sets.

### Inputs

* Kaggle JSON file - the authentication token.

### Outputs

* Generate Dataset: inputs/cherry_leaves/cherry-leaves

### Additional Comments

* No additional comments



---

## **Change working directory**

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-mildew-detection-in-cherry-leaves'
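Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as the path above suggests (the helper name `project_root_from` and its `notebooks_folder` parameter are hypothetical, not part of the original notebook):

```python
import os


def project_root_from(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only if current_dir is the notebooks folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        # still inside the notebooks folder: step up to the project root
        return os.path.dirname(normalized)
    # already at the project root (or elsewhere): leave the path unchanged
    return current_dir


# safe to run repeatedly: the directory only changes on the first run
os.chdir(project_root_from(os.getcwd()))
```

With this guard the cell can be executed more than once without leaving the project folder.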

## **Import the requirements.txt packages**

In [4]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

%pip install -r /workspace/milestone-project-mildew-detection-in-cherry-leaves/requirements.txt
import numpy
import os

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.16/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


## **Install Kaggle. Download the Dataset**

Install Kaggle

In [5]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

# install kaggle package
%pip install kaggle==1.5.12

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.16/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


* Change the Kaggle configuration directory to the current working directory. 
* Set permissions for the Kaggle authentication JSON.

In [6]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
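The `chmod` message above indicates that `kaggle.json` was not in the working directory when the cell ran. A minimal sketch of the same step in pure Python, which checks for the token before changing its permissions (the helper name `secure_kaggle_token` is hypothetical, and the token is assumed to sit in the project root):

```python
import os
import stat


def secure_kaggle_token(config_dir, token_name="kaggle.json"):
    """Restrict the token to owner read/write; return False if it is missing."""
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        print(f"{token_name} not found in {config_dir}; add it before downloading.")
        return False
    # same effect as `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
token_ready = secure_kaggle_token(os.getcwd())
```

If `token_ready` is `False`, the download cell that follows is likely to fail authentication, so the token should be added to the project root first.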


Set the Kaggle dataset path and destination folder, then download the dataset.

In [9]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves"  
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:01<00:00, 44.2MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 30.6MB/s]


Unzip the downloaded file, and delete the zip file.

In [10]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
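The unzip-and-delete step can also be wrapped in a small function that validates the archive first, so a failed download does not surface later as a confusing error. This is a sketch only; the helper name `extract_and_remove` is hypothetical and not part of the original notebook:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return member names."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # delete the archive only after extraction has finished without error
    os.remove(zip_path)
    return names


# Usage with the notebook's variables (assumes the download cell has run):
# extract_and_remove(os.path.join(DestinationFolder, "cherry-leaves.zip"), DestinationFolder)
```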

---

## **Data Preparation**

### Data Cleaning


Check whether there are non-image files and remove them.

In [7]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

def remove_non_image_file(my_data_dir):
    """
    Removes non-image files from each sub-folder of a directory.
    Each sub-folder is expected to contain files only, not nested folders.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        print(files)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [8]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

remove_non_image_file(my_data_dir='inputs/cherry_leaves/cherry-leaves/')

['healthy', 'powdery_mildew']


IsADirectoryError: [Errno 21] Is a directory: 'inputs/cherry_leaves/cherry-leaves//test/healthy'
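The path in the error (`.../test/healthy`) suggests the cell was run after the data had already been split: the top-level folders were then `train`, `validation` and `test`, so the entries inside them were class folders rather than image files, and `os.remove` cannot delete a directory. A minimal sketch of a variant that works at any folder depth by using `os.walk` (the function name `remove_non_image_files_recursive` is hypothetical, not part of the original notebook):

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def remove_non_image_files_recursive(data_dir, image_extensions=IMAGE_EXTENSIONS):
    """Delete non-image files under data_dir at any depth.

    Returns a dict mapping each folder that held files to (kept, removed) counts.
    """
    summary = {}
    for folder, _subdirs, files in os.walk(data_dir):
        kept = removed = 0
        for name in files:  # os.walk lists files separately from sub-folders
            if name.lower().endswith(image_extensions):
                kept += 1
            else:
                os.remove(os.path.join(folder, name))
                removed += 1
        if files:
            summary[folder] = (kept, removed)
    return summary


# Usage (assumes the dataset has been downloaded and unzipped):
# remove_non_image_files_recursive("inputs/cherry_leaves/cherry-leaves")
```

Because only file entries are ever passed to `os.remove`, this version can be run both before and after the split.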

### Split the data into train, validation, and test sets

In [7]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Splits images in a directory into train, validation, and test sets based on the specified ratios.
    """
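    # compare with a tolerance, since floating-point sums of ratios may not equal 1.0 exactly
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return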

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders, each with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

A common convention is to allocate **70%** of the data to the **training set**, **10%** to the **validation set**, and the remaining **20%** to the **test set**.

In [8]:
# This code snippet was adapted from the Code Institute Malaria Detector Walkthrough Sample Project
# https://github.com/Code-Institute-Solutions/WalkthroughProject01/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb

split_train_validation_test_images(my_data_dir="inputs/cherry_leaves/cherry-leaves/",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
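To see how the ratios translate into file counts, the arithmetic used by the split function can be isolated. The train and validation counts are truncated with `int()`, and the test set receives whatever is left, so it can end up slightly larger than its nominal share. The helper `split_counts` and the figure of 105 files are purely illustrative and are not taken from the dataset:

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Mirror the split function's arithmetic for a single class folder."""
    train = int(n_files * train_ratio)            # truncated, as in the notebook
    validation = int(n_files * validation_ratio)  # truncated, as in the notebook
    test = n_files - train - validation           # test set takes the remainder
    return train, validation, test


# hypothetical class folder with 105 images
print(split_counts(105, 0.7, 0.1))
```

For 105 files this gives 73 training, 10 validation and 22 test images, so the test share is a little over 20% because of the truncation.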

---

# Conclusions and Next Steps

## Conclusions

* The necessary packages for the project were successfully imported.
* The image dataset was successfully downloaded and prepared for further analysis.
* The dataset was split into training, test and validation sets.

## Next Steps

* In this notebook the data was collected, prepared and split into Training, Validation and Test sets.
* The next step is Data Visualization. This step will satisfy the first business requirement: "differentiate a cherry leaf that is healthy from one that contains powdery mildew".