# **Initial Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file

## Outputs
* Generate the dataset in: 'inputs/datasets/cherry-leaves'



---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves'

In [None]:
%pip install -r /workspaces/milestone-project-mildew-detection-in-cherry-leaves/requirements.txt

# Install Kaggle

* We install the kaggle package so we can download the dataset from the Kaggle website

In [None]:
%pip install kaggle

---

* This code sets the environment variable KAGGLE_CONFIG_DIR to the current working directory, so the Kaggle CLI looks for kaggle.json there. The second line restricts the permissions on kaggle.json so that only the file's owner can read or write it

In [10]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
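
The permission step can also be done from Python instead of the shell. The sketch below is a minimal illustration, assuming a POSIX file system; `restrict_to_owner` is a hypothetical helper name and is not part of the kaggle package.

```python
import os
import stat


def restrict_to_owner(path):
    """Allow the owner to read and write `path` and remove all access
    for group and others (the same effect as `chmod 600`)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return only the permission bits so the caller can check the result
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `restrict_to_owner("kaggle.json")` after setting `os.environ['KAGGLE_CONFIG_DIR']` would mirror the cell above; on a POSIX system the returned value should equal `0o600`.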

---

In [9]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets/"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/datasets
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:01<00:00, 58.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 44.7MB/s]


* This code unzips the downloaded dataset, deletes the .zip archive once extraction succeeds, and removes the kaggle.json file, which is also added to .gitignore

In [11]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

9885___FREC_Pwd.M 0276.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/ef0039cb-e261-4019-86a1-db6287829885___FREC_Pwd.M 0276_flipLR.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/ef4c52e5-d7d8-4996-b445-42478df2c353___FREC_Pwd.M 0584.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/ef4c52e5-d7d8-4996-b445-42478df2c353___FREC_Pwd.M 0584_flipLR.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/ef6ede1b-32db-433e-afc6-c39cb318f58e___FREC_Pwd.M 0506.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/ef6ede1b-32db-433e-afc6-c39cb318f58e___FREC_Pwd.M 0506_flipLR.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/efdc7774-273a-488d-8153-fb61908e3cfb___FREC_Pwd.M 4467.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/efdc7774-273a-488d-8153-fb61908e3cfb___FREC_Pwd.M 4467_flipLR.JPG  
  inflating: inputs/datasets/cherry-leaves/powdery_mildew/f0318151-9783-43f4-a1f1-d8ec05e565df___FREC_Pwd.M 03
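
The unzip-then-delete step above relies on shell tools. A rough Python equivalent using only the standard library is sketched below; `extract_and_remove_archives` is a hypothetical helper, and it assumes every `.zip` file in the folder should be extracted in place.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip file found in `folder` into `folder`,
    then delete the archive. Returns the names of the archives handled."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        # Reached only if extraction did not raise, like `&&` in the shell version
        archive.unlink()
        extracted.append(archive.name)
    return extracted
```

For this notebook the call would be `extract_and_remove_archives("inputs/datasets")`. Unlike the shell cell, this sketch leaves kaggle.json untouched.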

# Data Preparation

This section removes any files from the dataset that aren't images (and therefore aren't relevant) and separates the remaining images into train, validation and test subfolders.

In [12]:
"""
This function was written by the Code Institute team for the malaria detector project.
It is an efficient way of checking for image extensions, and I couldn't think of an
alternative way to write a function for the same purpose.
"""

def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        # print(files)
        i = []
        j = []
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non image file
                i.append(1)
            else:
                j.append(1)
                pass
        print(f"Folder: {folder} - has image file", len(j))
        print(f"Folder: {folder} - has non-image file", len(i))

In [15]:
remove_non_image_file(my_data_dir='inputs/datasets/cherry-leaves/')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


In [16]:
import os
import shutil
import random
import joblib

"""
This function was written by the Code Institute team for the malaria detector project.
It is an efficient way of separating the images into different folders with a given
ratio parameter, and I couldn't think of an alternative way to write a function
for the same purpose.
"""

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    if train_set_ratio + validation_set_ratio + test_set_ratio != 1.0:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
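
The guard at the top of the function compares a sum of floats with `!=`. Binary floating point cannot represent most decimal fractions exactly (for example, `0.1 + 0.2 == 0.3` is `False` in Python), so a valid set of ratios can in principle sum to a value a hair away from 1.0, depending on the values and their order. A tolerant comparison is one way to make the check less fragile. The helper below is a hypothetical sketch, not part of the original function.

```python
import math


def ratios_sum_to_one(*ratios, tolerance=1e-9):
    """Return True when the ratios add up to 1.0 within a small tolerance."""
    return math.isclose(sum(ratios), 1.0, abs_tol=tolerance)
```

The first `if` in the function could then read `if not ratios_sum_to_one(train_set_ratio, validation_set_ratio, test_set_ratio):`, which accepts the same ratios regardless of the order in which they are added.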

In [17]:
split_train_validation_test_images(my_data_dir="inputs/datasets/cherry-leaves/",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
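
The split cell prints nothing, so it can be worth confirming that the files ended up where expected. The sketch below counts the files in each split and label folder; `count_split_images` is a hypothetical helper, and it assumes the `train`, `validation` and `test` folders created by the function above.

```python
import os


def count_split_images(my_data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: number_of_files}} for the split folders."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(my_data_dir, split)
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts
```

Given the 2104 images per class reported earlier and ratios of 0.7, 0.1 and 0.2, the function's arithmetic (`int(2104 * 0.7)` and `int(2104 * 0.1)`, with the remainder going to test) would put 1472 images in train, 210 in validation and 422 in test for each label, so `count_split_images("inputs/datasets/cherry-leaves")` should report those numbers if the split ran as intended.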

---

# Conclusions

* The dataset has now been downloaded and saved into its own inputs folder, split into individual train, validation and test sets
* The next step is data analysis with the train set, then visualizing that data to answer business requirement 1.