# **Data Gathering**

## Objectives

* Download Data from Kaggle and prep the data for cleaning and testing
* Split the Data into Train, Test and Validation sets
* Remove Bad data, if any

## Inputs

* Kaggle JSON file: the authentication key
* Kaggle API - Used to download the data

## Outputs

* Train, test and validation sets in inputs/datasets/cherry_leaves
 



---

## Setting up the Environment

### Install the Requirements

In [3]:
! pip install -r requirements.txt



### Import Libraries

In [4]:
import numpy
import os

# Change working directory

* The notebooks are assumed to live in a subfolder, so the working directory must be changed when a notebook is run in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [5]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Mildew-Detection/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [10]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [11]:
current_dir = os.getcwd()
current_dir

'/workspaces/Mildew-Detection'
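
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; `project_root_from` is a hypothetical helper, not part of this notebook:

```python
from pathlib import Path

def project_root_from(cwd, notebooks_dir_name="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    cwd = Path(cwd)
    if cwd.name == notebooks_dir_name:
        return cwd.parent
    # Already at (or outside) the project root: leave it unchanged
    return cwd
```

It could be used as `os.chdir(project_root_from(os.getcwd()))`, which is safe to run more than once.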

## Install Kaggle

In [19]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.6/83.6 kB 1.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.16-py3-none-any.whl size=110685 sha256=1a2d34c1c71ecb02833fcb7da9da52296d427dac13b0b7b9ae4d352374d29f91
  Stored in directory: /home/codeany/.cache/pip/wheels/5a/ab/50/e224f599a07faf6d398a8600796012da271b7e5e7f2a3ab2b8
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.5.16


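Set the Kaggle configuration directory to the working directory and restrict the permissions on the kaggle.json file. The kaggle.json file must already be in this directory; otherwise this step and the download both fail, as the outputs below show.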

In [20]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
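
The `chmod` error above means kaggle.json was not in the working directory when the cell ran. A minimal sketch of a pre-flight check, assuming the token is expected in the folder that `KAGGLE_CONFIG_DIR` points to; `kaggle_credentials_ready` is a hypothetical helper:

```python
import os
import stat

def kaggle_credentials_ready(config_dir):
    """Return True if kaggle.json exists in config_dir, after restricting it to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; add it before downloading")
        return False
    # Owner read/write only, the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Calling `kaggle_credentials_ready(os.getcwd())` before the download cell would surface the missing file early instead of as a traceback from the Kaggle CLI.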


## Download Data from Kaggle

Set the Kaggle dataset and the download directory

In [16]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/codeany/.pyenv/versions/3.8.12/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/codeany/.pyenv/versions/3.8.12/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/codeany/.pyenv/versions/3.8.12/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 181, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspaces/Mildew-Detection. Or use the environment method.


In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
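
If the download failed, the unzip cell raises because the archive does not exist. A minimal sketch that extracts and deletes the archive only when it is present; `extract_and_remove` is a hypothetical helper and the paths passed to it are whatever the download step produced:

```python
import os
import zipfile

def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns False, without raising, when the archive is missing.
    """
    if not os.path.isfile(zip_path):
        print(f"{zip_path} not found; nothing extracted")
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True
```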

## Data Preparation

### Data Cleaning

Check for non-image files and remove any that are found

In [20]:
def remove_non_img(my_data_dir):
    '''
    Walk my_data_dir, delete every file that is not an image,
    and report the counts for each folder
    '''
    image_extensions = ('.png', '.jpeg', '.jpg')
    # os.walk traverses the directory tree one folder at a time
    for root, dirs, files in os.walk(my_data_dir):
        removed = 0  # non-image files deleted in this folder
        kept = 0     # image files left in place in this folder
        for file in files:
            if not file.lower().endswith(image_extensions):
                file_location = os.path.join(root, file)
                os.remove(file_location)
                removed += 1
            else:
                kept += 1
        print(f"Folder: {root} - non-image files removed: {removed}")
        print(f"Folder: {root} - image files kept: {kept}")

I'm using os.walk because it is a generator: it yields one folder at a time as a (root, dirs, files) tuple instead of building a list of the whole directory tree up front.

Because each tuple already separates files from sub-folders, there is also no need to call os.path.isdir() on every entry.
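
Since the cleaning function deletes files, it can help to see what `os.walk` would find first. A minimal sketch of a dry run that only counts, assuming the same three image extensions; `count_non_images` is a hypothetical helper:

```python
import os

def count_non_images(data_dir, image_extensions=(".png", ".jpeg", ".jpg")):
    """Return {folder: (image_count, other_count)} without deleting anything."""
    counts = {}
    # Each iteration yields one folder: its path, its sub-folders and its files
    for root, dirs, files in os.walk(data_dir):
        images = sum(1 for name in files if name.lower().endswith(image_extensions))
        counts[root] = (images, len(files) - images)
    return counts
```

Any folder whose second count is non-zero contains files that the removal step would delete.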

In [21]:
remove_non_img(my_data_dir='inputs/cherry_leaves')

## Splitting data into its datasets

A 70-10-20 split is a common choice for a dataset of this size.

Using 70% for training is a good starting point, and the 10% for validation provides enough data to tune the hyperparameters without a major risk of overfitting. Lastly, a 20% test set gives a good estimate of the model's performance.
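
The set sizes follow from simple arithmetic. A minimal sketch that mirrors how the split function in this notebook sizes each set (train and validation counts are truncated with `int`, and the test set receives whatever is left); `split_counts` is a hypothetical helper and the file counts are illustrative only:

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) sizes for n_files."""
    train = int(n_files * train_ratio)            # truncated towards zero
    validation = int(n_files * validation_ratio)  # truncated towards zero
    test = n_files - train - validation           # remainder goes to the test set
    return train, validation, test

print(split_counts(1000, 0.7, 0.1))  # → (700, 100, 200)
print(split_counts(15, 0.7, 0.1))    # → (10, 1, 4)
```

With small folders the truncation pushes the leftover files into the test set, so the test share can end up slightly above 20%.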

In [22]:
import math
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Split the data set into train, validation and test groups
    by the given ratios (e.g. 0.7, 0.1, 0.2)
    """
    # compare with a tolerance: float sums can differ from 1.0 by rounding error
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels
    labels = os.listdir(my_data_dir)  # expected to contain only the class folders
    if 'test' in labels:
        # the split folders already exist, so there is nothing to do
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            # Move files to the appropriate set directories
            # enumerate supplies the running index without a separate counter variable
            for count, file_name in enumerate(files):
                if count < train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count < (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

            os.rmdir(my_data_dir + '/' + label)
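
The function above shuffles with the global random state, so each run produces a different split. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project; `shuffled_copy` and the seed value are hypothetical:

```python
import random

def shuffled_copy(file_names, seed=42):
    """Return file_names in a shuffled order that is identical on every run."""
    rng = random.Random(seed)     # private generator; the global state is untouched
    ordered = sorted(file_names)  # os.listdir order is arbitrary, so sort first
    rng.shuffle(ordered)
    return ordered
```

Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` would make the train, validation and test membership repeatable.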

In [16]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/cherry_leaves_dataset/cherry-leaves'
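
The error above comes from `os.listdir` being called on a folder that does not exist. A minimal sketch of a check that could run before the split, which also keeps only sub-folders as class labels; `list_class_folders` is a hypothetical helper:

```python
import os

def list_class_folders(data_dir):
    """Return the sorted names of the sub-folders of data_dir."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"Dataset folder not found: {data_dir}")
    # Ignore stray files so that only folders are treated as class labels
    return sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    )
```

Printing the result before splitting confirms both that the path is right and that the labels are the ones expected.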

---

# Push files to GitHub

git add .

git commit -m "Add and prepare cherry leaves dataset"

git push

## Next Step

[02-Visualisation](jupyter_notebooks/02-Visualisation.ipynb)