# **Data Collection**

## Objectives

* Install all the necessary packages
* Fetch raw data from Kaggle
* Clean and prepare the raw data

## Inputs

* Kaggle JSON file (authentication token)
* Raw data downloaded from Kaggle

## Outputs

* Store the clean, prepared data in the following directory: inputs/cherry-leaves-dataset/cherry-leaves

## Additional Comments

* No additional comments.



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder; therefore, when running the notebook in the editor, you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fazel.DESKTOP-2JBP228\\OneDrive\\Desktop\\PP5\\PP5_Mildew-Detection-In-Cherry-Leaves\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fazel.DESKTOP-2JBP228\\OneDrive\\Desktop\\PP5\\PP5_Mildew-Detection-In-Cherry-Leaves'
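
Running the change-directory cell a second time would move the working directory up one more level. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above); the helper name `project_root` is hypothetical:

```python
from pathlib import Path


def project_root(current: Path, notebooks_folder: str = "jupyter_notebooks") -> Path:
    """Return the parent folder only when `current` is the notebooks folder."""
    # Assumption: the notebooks folder sits directly under the project root.
    if current.name == notebooks_folder:
        return current.parent
    # Already at the project root (or somewhere else): leave the path unchanged.
    return current


# Usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(project_root(Path.cwd()))
```

With this guard, re-running the cell leaves the working directory where it is once the project root has been reached.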

# Installing necessary packages

In [4]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Kaggle

## Installing the Kaggle API package

In [5]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


---

## Fetch Data From Kaggle

Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON file.

In [16]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Windows users: run the command below (Linux users should comment it out)
! icacls "kaggle.json" /grant Everyone:(OI)(CI)F

# Linux users: uncomment the command below instead
#! chmod 600 kaggle.json

processed file: kaggle.json
Successfully processed 1 files; Failed processing 0 files
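
Before calling the Kaggle CLI, it can help to confirm that the token file is present and well formed. A minimal sketch, assuming `kaggle.json` holds a JSON object with non-empty `username` and `key` fields; the helper name `kaggle_token_is_valid` is hypothetical and not part of the Kaggle package:

```python
import json
from pathlib import Path


def kaggle_token_is_valid(token_path) -> bool:
    """Return True if the file exists and contains non-empty 'username' and 'key'."""
    path = Path(token_path)
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # the file is not valid JSON text
    # Assumption: these two fields are what the token file is expected to hold.
    return isinstance(token, dict) and all(token.get(k) for k in ("username", "key"))
```

In the notebook this could be called as `kaggle_token_is_valid(os.path.join(os.environ['KAGGLE_CONFIG_DIR'], 'kaggle.json'))` before attempting the download.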


Set the Kaggle dataset path and the destination folder, then download the dataset.

In [18]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves-dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

403 - Forbidden


Unzip the downloaded file, and delete the zip file.

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
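
If the download fails (for example with the `403 - Forbidden` response shown above), the archive does not exist and the unzip cell stops with a bare `FileNotFoundError`. A minimal sketch of the same unzip-and-delete step with an explicit check and a clearer message; the helper name `unzip_and_remove` is hypothetical:

```python
import os
import zipfile


def unzip_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return the member names."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"{zip_path} not found; did the download succeed?")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Remove the archive only after extraction finished without raising.
    os.remove(zip_path)
    return names
```

In the notebook it could be called as `unzip_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.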

---

# Data Preparation

## Data Cleaning

Define a function that removes every file that is not an image.

In [13]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        # counters for image and non-image files
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):  # check the file extension
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove the non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

- Call the function to remove the non-image data

In [14]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves-dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
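
Because the cleaning function deletes files permanently, a dry run that only lists what would be removed can be a useful first step. A minimal sketch, assuming the same folder-per-class layout and the same three extensions; the helper name `list_non_image_files` is hypothetical:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_image_files(my_data_dir):
    """Return the paths of files whose extension is not an image extension."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for file_name in sorted(os.listdir(folder_path)):
            if not file_name.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(folder_path, file_name))
    return non_images
```

An empty list means the removal step would not delete anything.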


# Split Data Into Train, Validation, and Test Sets

Define a function that splits the data into new train, validation, and test directories and removes the old class directories.

In [20]:
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a small tolerance, since float sums can miss 1.0 by rounding error
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation, and test folders with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
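
The ratio check at the top of the function adds three floats. Binary floating point cannot represent 0.7, 0.2, or 0.1 exactly, so such a sum can miss 1.0 by a tiny amount: `0.7 + 0.2 + 0.1` evaluates to `0.9999999999999999`. A tolerance-based comparison is therefore the safer way to express this check. A minimal sketch using `math.isclose`; the helper name `ratios_sum_to_one` and the tolerance value are assumptions:

```python
import math


def ratios_sum_to_one(train_set_ratio, validation_set_ratio, test_set_ratio):
    """Return True if the three ratios add up to 1.0 within a small tolerance."""
    total = train_set_ratio + validation_set_ratio + test_set_ratio
    # abs_tol is an illustrative choice; any value far below 0.01 works here.
    return math.isclose(total, 1.0, abs_tol=1e-9)


print(0.7 + 0.2 + 0.1 == 1.0)            # False
print(ratios_sum_to_one(0.7, 0.2, 0.1))  # True
```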

Conventionally,
* The training set takes a 0.70 ratio of the data.
* The validation set takes a 0.10 ratio of the data.
* The test set takes a 0.20 ratio of the data.
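
The split function truncates the train and validation counts with `int()` and sends every remaining file to the test set, so the ratios translate into per-class counts as follows. This is a sketch using the 2104 images per class reported by the cleaning step; `split_sizes` is a hypothetical helper that mirrors the arithmetic in `split_train_validation_test_images`:

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) file counts for one class folder."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    # Whatever is left after train and validation goes to the test set.
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set absorbs the rounding remainder and can be slightly larger than its nominal ratio suggests.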

Call the function to split data

In [23]:
split_train_validation_test_images(my_data_dir="inputs/cherry-leaves-dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
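
The split cell prints nothing, so a quick count per split and label is one way to confirm the result. A minimal sketch, assuming the `train`, `validation`, and `test` folders created by the function above; the helper name `count_split_files` is hypothetical:

```python
import os


def count_split_files(my_data_dir, splits=("train", "validation", "test")):
    """Return {(split, label): number_of_files} for every existing split folder."""
    counts = {}
    for split in splits:
        split_path = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_path):
            continue  # the split has not been created yet
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if os.path.isdir(label_path):
                counts[(split, label)] = len(os.listdir(label_path))
    return counts
```

Calling `count_split_files("inputs/cherry-leaves-dataset/cherry-leaves")` after the split returns one entry per split and label, which can be compared with the expected ratios.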

---

# Push files to Repo

In [26]:

! git add .
! git commit -m "data-collection"



# Conclusions

- The data has been successfully downloaded from Kaggle
- Data has been cleaned
- Data has been split into train, validation, and test sets

In the next step, we are going to visualize the data.