# **Data Collection, Cleaning and Preparation**

## Objectives

* Fetch the data from Kaggle, clean it, and split it into training, validation, and test folders.

## Inputs

* Kaggle JSON file: the authentication token.

## Outputs

* Generate the dataset in inputs/cherry_dataset/cherry-leaves



---

# Change working directory

* The notebooks are in a subfolder, so when running a notebook in the editor we need to change the working directory from its current folder to the parent folder.
* We access the current directory with os.getcwd()

In [23]:
import os
import numpy as np

In [24]:
current_dir = os.getcwd()
current_dir

'/Users/rana/Documents/artificial_intelligence/PP5-mildew-detection-in-cherry-leaves'

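We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In this run the working directory was already the project root, so the next cell passes that path to os.chdir() as an absolute path. When the notebook starts in a subfolder, os.chdir(os.path.dirname(current_dir)) is the portable equivalent.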
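A minimal sketch of the two calls described above, assuming the notebook starts in a subfolder. It builds a throwaway folder tree so it can run anywhere; the folder names `project` and `jupyter_notebooks` are illustrative and not taken from this project.

```python
import os
import tempfile

# Build a throwaway tree: <tmp>/project/jupyter_notebooks (illustrative names)
root = os.path.realpath(tempfile.mkdtemp())
notebook_dir = os.path.join(root, "project", "jupyter_notebooks")
os.makedirs(notebook_dir)

original_dir = os.getcwd()
os.chdir(notebook_dir)  # pretend the notebook was started here

# os.path.dirname() returns the parent of the current directory,
# and os.chdir() makes that parent the new working directory
parent_dir = os.path.dirname(os.getcwd())
os.chdir(parent_dir)
new_dir = os.getcwd()
print(new_dir)

os.chdir(original_dir)  # restore the original working directory
```

Deriving the parent from os.getcwd() avoids hard-coding a machine-specific path, at the cost of moving up one level every time the cell is re-run.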

In [28]:
os.chdir('/Users/rana/Documents/artificial_intelligence/PP5-mildew-detection-in-cherry-leaves')
print("I have set a new current directory to the home folder")

I have set a new current directory to the home folder


Confirm the new current directory

In [29]:
current_dir = os.getcwd()
current_dir

'/Users/rana/Documents/artificial_intelligence/PP5-mildew-detection-in-cherry-leaves'

# Data Collection

## Installing Kaggle

In [18]:
! pip install kaggle==1.5.12



---

Download the Kaggle API token (kaggle.json) into the current working directory, then point Kaggle at it and restrict the file permissions

In [30]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
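The same setup can be sketched in pure Python, which also gives a clear error when the token is missing. The helper name `configure_kaggle_credentials` is hypothetical, and the demonstration writes a placeholder token to a temporary folder rather than using a real one. The permission check assumes a POSIX system (macOS or Linux).

```python
import os
import stat
import tempfile


def configure_kaggle_credentials(config_dir):
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    os.chmod(token_path, 0o600)  # same effect as `chmod 600 kaggle.json`
    return token_path


# Demonstration with a placeholder token in a temporary folder
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "kaggle.json"), "w") as f:
    f.write('{"username": "placeholder", "key": "placeholder"}')

token_path = configure_kaggle_credentials(demo_dir)
mode = stat.S_IMODE(os.stat(token_path).st_mode)
print(oct(mode))
```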

Set the Kaggle dataset path and download the dataset

In [21]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

cherry-leaves.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the data and delete the zip file

In [31]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
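The extract-then-delete step can also be wrapped in a small helper that refuses to run on a missing or corrupt archive. This is only a sketch: the function name `extract_and_remove` is hypothetical, and the demonstration uses a tiny archive built in a temporary folder rather than the real dataset.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)  # only reached if extraction succeeded


# Demonstration on a tiny archive built in a temporary folder
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("healthy/leaf_1.jpg", b"not a real image")

extract_and_remove(demo_zip, demo_dir)
extracted = os.path.join(demo_dir, "healthy", "leaf_1.jpg")
print(os.path.exists(extracted), os.path.exists(demo_zip))
```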

# Data Cleaning and Preparation

## Data Cleaning

Check the data for any non-image files and remove them if present

In [32]:
def remove_non_image_file(data_dir):
    """Delete files without an image extension and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(data_dir)
    for folder in folders:
        files = os.listdir(data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [34]:
remove_non_image_file(data_dir='inputs/cherry_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
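Because the cleaning function deletes files, a non-destructive variant can be useful for inspecting a folder first. The sketch below only counts files; the name `count_image_files` is hypothetical, and it is demonstrated on a small temporary folder rather than the real dataset.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_image_files(data_dir):
    """Return {folder: (image_count, non_image_count)} without deleting anything."""
    counts = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files next to the class folders
        images = non_images = 0
        for name in os.listdir(folder_path):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                images += 1
            else:
                non_images += 1
        counts[folder] = (images, non_images)
    return counts


# Demonstration: two image files and one text file in a single class folder
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "healthy"))
for name in ("a.JPG", "b.png", "notes.txt"):
    open(os.path.join(demo_dir, "healthy", name), "w").close()

counts = count_image_files(demo_dir)
print(counts)  # {'healthy': (2, 1)}
```

Note that both this sketch and the function above judge files by extension only; neither verifies that a file actually decodes as an image.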


## Data Preparation

---

Split the data into train, validation and test sets

In [35]:
import os
import shutil
import random


def split_train_valid_test_images(data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance, since a sum of float ratios may not equal 1.0 exactly
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels
    labels = os.listdir(data_dir)  # expects data_dir to contain only class folders
    # skip the split if a 'test' folder already exists (data was split earlier)
    if 'test' not in labels:
        # create train, validation and test folders with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(data_dir + '/' + label)
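The function above shuffles with `random.shuffle`, so two fresh runs on the same download will generally place different files in each split. If a reproducible split is wanted, one option (an assumption, not part of the original notebook) is to sort the file names and shuffle them with a seeded local generator. The helper name `shuffled_copy` and the seed value are illustrative.

```python
import random


def shuffled_copy(file_names, seed):
    """Return a shuffled copy of file_names that is identical for a given seed."""
    files = sorted(file_names)          # os.listdir() order is not guaranteed
    random.Random(seed).shuffle(files)  # local generator, global state untouched
    return files


names = [f"leaf_{i}.jpg" for i in range(10)]
first = shuffled_copy(names, seed=42)
second = shuffled_copy(reversed(names), seed=42)  # different input order
print(first == second)  # True
```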

The data is split as follows:
* train: 70%
* validation: 10%
* test: 20%
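As a worked example of these ratios, the sketch below reproduces the size arithmetic used by the split function for one class of 2104 images (the count reported by the cleaning step). The train and validation sizes are truncated with `int()`, and the test set receives whatever remains. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_files, train_ratio, validation_ratio):
    """Mirror the split function: truncate train/validation, remainder goes to test."""
    train_qty = int(n_files * train_ratio)
    validation_qty = int(n_files * validation_ratio)
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


sizes = split_sizes(2104, train_ratio=0.7, validation_ratio=0.1)
print(sizes)  # (1472, 210, 422)
```

Because of the truncation, the test set ends up slightly above an exact 20% share (422 rather than 420.8 files per class).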


In [36]:
split_train_valid_test_images(data_dir="inputs/cherry_dataset/cherry-leaves",
                              train_set_ratio=0.7,
                              validation_set_ratio=0.1,
                              test_set_ratio=0.2)
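The split call prints nothing, so a quick count per folder is one way to confirm the result. The sketch below assumes the `<split>/<label>` layout created by the function above; the helper name `count_split_files` is hypothetical, and the demonstration builds a small temporary tree instead of reading the real dataset.

```python
import os
import tempfile


def count_split_files(data_dir, splits=("train", "validation", "test")):
    """Return {(split, label): file_count} for a <split>/<label> folder layout."""
    counts = {}
    for split in splits:
        split_path = os.path.join(data_dir, split)
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            counts[(split, label)] = len(os.listdir(label_path))
    return counts


# Demonstration: 10 empty placeholder files split 7 / 1 / 2 for one label
demo_dir = tempfile.mkdtemp()
layout = {"train": 7, "validation": 1, "test": 2}
for split, qty in layout.items():
    folder = os.path.join(demo_dir, split, "healthy")
    os.makedirs(folder)
    for i in range(qty):
        open(os.path.join(folder, f"leaf_{i}.jpg"), "w").close()

counts = count_split_files(demo_dir)
for (split, label), qty in counts.items():
    print(f"{split}/{label}: {qty}")
```

On the real data, the same call would take `inputs/cherry_dataset/cherry-leaves` as `data_dir`.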

---

# Conclusions

The data has been successfully downloaded from Kaggle, cleaned, and split into train, validation, and test folders for further processing and development.