# **Data Collection**

## Objectives of this notebook

* Fetch data from Kaggle using the Kaggle API (authenticated with a JSON token) and prepare it for further processing


### Steps 
* Import packages
* Set working directory
* Fetch data from Kaggle and prepare it for further processing.
* Clean data
* Split data into different environments

## Inputs

* Kaggle JSON file - authentication token 

## Outputs

* Generate Dataset: inputs/cherryleaves_dataset/cherry-leaves

## Additional Comments

* These steps allow us to fetch the data, clean it, and divide it into the different environments in preparation for the machine learning activities.



----

# Import packages

In [None]:
! pip install -r /workspaces/Mildew_Detection_pjkt/requirements.txt

In [None]:
import numpy
import os

## Change working directory

**Change the working directory from its current folder to its parent folder**

* Note: I need to change the working directory when running the notebook in the editor, since the notebooks are in a subfolder.

In [None]:
current_dir = os.getcwd()
current_dir

**Set the project root (the parent of the notebooks folder) as the new current directory**

In [None]:
os.chdir('/workspaces/Mildew_Detection_pjkt')
print("You set a new current directory")

**Confirm the new current directory**

In [None]:
current_dir = os.getcwd()
current_dir
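
The cell above relies on a hard-coded absolute path. As a minimal sketch of a more portable alternative, the project root could be located by walking upwards from the current folder until a marker file is found. The helper name `find_project_root` and the choice of `requirements.txt` as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def find_project_root(start_dir, marker="requirements.txt"):
    """Walk upwards from start_dir until a folder containing `marker` is found.

    Hypothetical helper: returns the folder path, or None if no ancestor
    contains the marker file.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            return None
        current = parent


# Usage (commented out so the sketch has no side effects):
# root = find_project_root(os.getcwd())
# if root is not None:
#     os.chdir(root)
```

This keeps the notebook usable when the repository is cloned to a different location, at the cost of depending on the marker file being present.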

# Install Kaggle

In [None]:
! pip install kaggle==1.5.12

In [None]:
! pip install --upgrade pip

**Set the kaggle configuration directory to the current working directory**

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
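
Before calling the Kaggle CLI it can help to confirm that the token file is actually in place. The following is a minimal sketch, assuming the token is a JSON object with `username` and `key` fields (the layout of the file Kaggle provides for download). The helper name `check_kaggle_token` is hypothetical.

```python
import json
import os


def check_kaggle_token(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key'.

    Hypothetical helper: it only inspects the file locally and does not
    contact Kaggle, so it cannot tell whether the credentials are valid.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as token_file:
            token = json.load(token_file)
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))


# Usage (commented out so the sketch has no side effects):
# if not check_kaggle_token(os.getcwd()):
#     print("kaggle.json is missing or incomplete in the working directory")
```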

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherryleaves_dataset/cherry-leaves"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

**Unzip the downloaded file and delete the zip file**

In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
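
Re-running the cell above after the archive has been deleted raises `FileNotFoundError`. A minimal sketch of a re-run-safe variant is shown below; the helper name `extract_and_remove` is an assumption made for illustration.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Hypothetical helper: returns False (and does nothing) when the
    archive is not there, so the cell can be re-run safely.
    """
    if not os.path.isfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True


# Usage (commented out so the sketch has no side effects):
# extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```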

---

# Data Preparation

## Data Cleaning

**Check for and remove non-image files, as they will not be used for model training**

In [None]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder."""
    image_extension = ('.png', '.jpg', '.jpeg')

    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level

        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            file_location = os.path.join(folder_path, given_file)

            if os.path.isfile(file_location):
                if not given_file.lower().endswith(image_extension):
                    os.remove(file_location)  # remove non-image file
                    non_image_count += 1
                else:
                    image_count += 1

        print(f"Folder: {folder_path} - image files: {image_count}")
        print(f"Folder: {folder_path} - non-image files removed: {non_image_count}")
        

In [None]:
remove_non_image_file(my_data_dir='inputs/cherryleaves_dataset/cherry-leaves')
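
Because the cleaning step deletes files, a read-only check beforehand can show what would be removed. The following is a minimal sketch that tallies file extensions per class folder without touching anything; the helper name `count_extensions` is hypothetical.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Return {folder_name: Counter of lower-cased file extensions}.

    Hypothetical helper: read-only, nothing is deleted or moved.
    """
    counts = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files at the top level
        counts[folder] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder_path)
            if os.path.isfile(os.path.join(folder_path, name))
        )
    return counts


# Usage (commented out so the sketch has no side effects):
# for folder, extension_counts in count_extensions(
#         'inputs/cherryleaves_dataset/cherry-leaves').items():
#     print(folder, dict(extension_counts))
```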

---

## Split train validation test set

In [None]:
import math
import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: a sum of floats is not always exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

I will split the data using the following conventional ratios:
* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives 20% of the data (ratio 0.20).
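
The ratios translate into file counts through integer truncation: the function computes the train and validation quantities with `int(...)`, and every remaining file goes to the test set. A minimal sketch of that arithmetic, using a hypothetical helper `split_sizes` that is not part of the notebook:

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) file counts for one class folder.

    Hypothetical helper mirroring the arithmetic of the split function:
    train and validation are truncated with int(), test takes the remainder.
    """
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_sizes(100, 0.7, 0.1))  # (70, 10, 20)
```

Because of the truncation, the test set can end up slightly larger than its nominal ratio when the number of files is not a multiple of 10.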

In [None]:
split_train_validation_test_images(my_data_dir="inputs/cherryleaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
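
After the split, a quick count per set and label can confirm that the files landed where expected. The following is a minimal sketch, assuming the `train`, `validation` and `test` folder names created by the function above; the helper name `count_split_images` is hypothetical.

```python
import os


def count_split_images(my_data_dir, sets=("train", "validation", "test")):
    """Return {(set_name, label): number_of_files} for each existing set folder.

    Hypothetical helper: read-only, set folders that do not exist are skipped.
    """
    summary = {}
    for set_name in sets:
        set_path = os.path.join(my_data_dir, set_name)
        if not os.path.isdir(set_path):
            continue
        for label in sorted(os.listdir(set_path)):
            label_path = os.path.join(set_path, label)
            if os.path.isdir(label_path):
                summary[(set_name, label)] = len(os.listdir(label_path))
    return summary


# Usage (commented out so the sketch has no side effects):
# for (set_name, label), qty in count_split_images(
#         "inputs/cherryleaves_dataset/cherry-leaves").items():
#     print(f"{set_name} - {label}: {qty} files")
```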

---