# Data Collection

## Objectives

- Import packages and set the working directory
- Retrieve the dataset from Kaggle
- Clean the data and split it into train, validation and test sets

## Inputs

- Kaggle JSON file (`kaggle.json`) to authenticate the user and enable download of the dataset

## Outputs

- Dataset generated in the `inputs/cherry_leaves_dataset` folder

## Additional Comments / Conclusions

- These steps are required to set up the data correctly before training the ML model

## Import packages

In [1]:
! pip install -r /workspace/CherryPicker/requirements.txt

Collecting matplotlib==3.4.0 (from -r /workspace/CherryPicker/requirements.txt (line 3))
  Downloading matplotlib-3.4.0.tar.gz (37.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.1/37.1 MB 106.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [28 lines of output]
      !!

              ********************************************************************************
              Please remove any references to `setuptools.command.test` in all supported versions of the affected package.

              This deprecation is overdue, please update your project and remove deprecated
              calls to avoid build errors in the future.
  

In [2]:
import os
import numpy

## Set the Working Directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/CherryPicker/jupyter_notebooks'

In [4]:
os.chdir('/workspace/CherryPicker')
print('You set a new working directory')

You set a new working directory


In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/CherryPicker'
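
The cells above hard-code `/workspace/CherryPicker`. As a minimal sketch, assuming the notebook always lives one level below the project root, the root can be derived from the notebook folder instead. The helper name `project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root(notebook_dir):
    """Return the parent of the notebook folder.

    Assumption: the notebook folder sits directly inside the project root.
    """
    return Path(notebook_dir).parent


# Usage in the notebook (left commented out so the sketch has no side effects):
# import os
# os.chdir(project_root(os.getcwd()))
```

This keeps the notebook portable if the workspace is cloned to a different location.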

## Install Kaggle


In [6]:
! pip install kaggle



Change Kaggle configuration directory to the current working directory

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
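
The configuration step above assumes `kaggle.json` is already in the working directory. A minimal sketch of the same step in pure Python, with an explicit check that the token exists, could look like this. The function name `prepare_kaggle_token` is hypothetical.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir):
    """Check for kaggle.json in config_dir, restrict its permissions and
    point KAGGLE_CONFIG_DIR at that folder."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Same effect as `chmod 600 kaggle.json`: owner read/write only
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

Calling `prepare_kaggle_token(os.getcwd())` would then fail early with a clear message if the token has not been uploaded yet.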

Download Kaggle dataset

In [9]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:02<00:00, 30.5MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 23.1MB/s]


Unzip the downloaded file, save its contents and delete the zip archive

In [10]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
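
The unzip-and-delete step can also be wrapped in a small helper that first checks the download is a valid archive. This is only a sketch; the name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    # is_zipfile also returns False when the file does not exist
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    # remove the archive only after extraction has succeeded
    os.remove(zip_path)
```

In this notebook it would be called as `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.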

## Data Cleaning & Preparation

### Remove non-image files

In [11]:
def remove_non_image_file(my_data_dir):
    """
    This is a function to check the dataset
    for files that are not images and delete
    all such files
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # delete anything that is not a recognised image file
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [12]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')


Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0

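Because `remove_non_image_file` deletes files immediately, it can be useful to inspect the folder contents first. The following is a sketch of a read-only check, assuming the same one-sub-folder-per-class layout; the name `count_extensions` is hypothetical.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Return {folder: Counter of lower-cased file extensions}.

    Nothing is deleted; this only reports what each class folder contains.
    """
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore loose files next to the class folders
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder_path)
        )
    return summary
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result marks files that the cleaning function would remove.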

### Split data into Train, Validation and Test sets

In [13]:
import os
import shutil
import random
import joblib


def split_images(my_data_dir, train_set_ratio, validation_set_ratio,
                 test_set_ratio):
    """
    This function creates the train, validation
    and test sets and splits the data into them
    The parameters are as follows:
    my_data_dir = the path to the input directory
    where the images are kept
    train_set_ratio = ratio of images included in the train set
    validation_set_ratio = ratio of images included in the validation set
    test_set_ratio = ratio of images included in the test set
    """
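    # compare with a tolerance, since float sums such as 0.7 + 0.2 + 0.1
    # do not always equal exactly 1.0
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio
           - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + "
              "test_set_ratio should sum to 1.0")
        return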

    # get the class labels (one sub-folder per class)
    labels = os.listdir(my_data_dir)
    # skip the split if it has already been done
    if 'test' not in labels:
        # create train, validation and test folders with class sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' +
                                file_name, my_data_dir +
                                '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' +
                                label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/'
                                + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
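
`split_images` calls `random.shuffle` without a seed, so each run produces a different split. If a reproducible split is wanted, one option is a seeded generator, sketched below. The helper `shuffled_copy` and the default seed of 42 are assumptions for illustration, not part of the original function.

```python
import random


def shuffled_copy(files, seed=42):
    """Return a shuffled copy of files; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    # sort first so the result does not depend on os.listdir ordering
    ordered = sorted(files)
    rng.shuffle(ordered)
    return ordered
```

Inside `split_images`, `files = shuffled_copy(os.listdir(...))` would replace the `random.shuffle(files)` call.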

The dataset is split in the conventional manner:
- 70% train set
- 10% validation set
- 20% test set

In [14]:
split_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
             train_set_ratio=0.7,
             validation_set_ratio=0.1,
             test_set_ratio=0.2
             )
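
To make the effect of these ratios concrete, the sketch below mirrors the counting logic of `split_images` for one class folder. The helper name `expected_split_sizes` is hypothetical; the figure of 2104 images per class comes from the cleaning output earlier in the notebook.

```python
def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of split_images for one class folder."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    # whatever is left after train and validation goes to the test set
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Since `int()` truncates, the test set absorbs the rounding remainder, so each class should end up with 1472 train, 210 validation and 422 test images.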