# **Data Collection**

---

## Objectives

* Fetch data from Kaggle and prepare it for visualisation.
* Clean and split data into Train/Test/Validation sets.

## Inputs

* Kaggle JSON file (`kaggle.json`) containing the authentication token.

## Outputs

* Data split into Train/Test/Validation directories and sorted by label.

## Additional Comments

* These steps collect the data cleanly and divide it into subsets for later use in the ML pipeline.



---

# Change working directory

We need to change the working directory from its current folder to its parent folder.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/portfolio-five/jupyter_notebooks'

We want to make the parent of the current directory our new working directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You have successfully set a new working directory")

You have successfully set a new working directory


And finally, confirm our new working directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/portfolio-five'
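
As an aside, the parent-directory lookup used above can also be sketched with `pathlib`. This is an alternative illustration rather than the notebook's own code, and `parent_directory` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def parent_directory(path):
    """Return the parent folder of a POSIX-style path (hypothetical helper)."""
    return str(PurePosixPath(path).parent)


# Example using the notebook folder shown in the first output above.
notebook_dir = "/workspaces/portfolio-five/jupyter_notebooks"
project_root = parent_directory(notebook_dir)
print(project_root)
```

The result could then be passed to `os.chdir`, exactly as the cell above does with `os.path.dirname`.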

---

# Install Kaggle

First we will install the kaggle package.

In [4]:
! pip install kaggle



Then, we change the Kaggle configuration directory to the current working directory, and set permissions for the Kaggle authentication JSON.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
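
To confirm the permission change took effect, a check along the following lines may help. It is a minimal sketch that assumes a POSIX file system. `has_owner_only_permissions` is a hypothetical helper, and the demonstration uses a throwaway file instead of a real token.

```python
import os
import stat
import tempfile


def has_owner_only_permissions(path):
    """Return True if the file mode is exactly 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


# Demonstrate on a temporary stand-in file, not a real Kaggle token.
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "kaggle.json")
    with open(token_path, "w") as handle:
        handle.write("{}")
    os.chmod(token_path, 0o600)  # Python equivalent of `chmod 600`
    token_is_private = has_owner_only_permissions(token_path)

print(token_is_private)
```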

Define the dataset we want to use, and download it to our destination folder.

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves/"
DestinationFolder = "inputs/cherryleaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherryleaves_dataset
 89%|█████████████████████████████████▊    | 49.0M/55.0M [00:02<00:00, 29.6MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 24.5MB/s]


Then unzip the downloaded archive and delete the zip file when complete.

In [7]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
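
The extract-then-delete step can also be wrapped in a small function and tried out safely on a temporary archive. The sketch below makes that assumption. `extract_and_delete` and the file names inside the demo archive are hypothetical and are not part of the dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_delete(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(destination)
    zip_path.unlink()  # remove the zip only after extraction succeeded


# Build a tiny throwaway archive to demonstrate the behaviour.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive_path = tmp_dir / "example.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("healthy/leaf.jpg", b"placeholder bytes")

    extract_and_delete(archive_path, tmp_dir)
    extracted_ok = (tmp_dir / "healthy" / "leaf.jpg").is_file()
    archive_removed = not archive_path.exists()

print(extracted_ok, archive_removed)
```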

---

# Data Cleaning

We can check for non-image files in the extracted data and remove them.

In [8]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [9]:
remove_non_image_file(my_data_dir='inputs/cherryleaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
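
Because the cleaning function deletes files, it can be useful to run a non-destructive count first. The following is a minimal sketch of such a dry run. `count_image_files` is a hypothetical helper, and the demo folder and file names are invented for illustration.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_image_files(my_data_dir):
    """Return {folder: (image_count, non_image_count)} without deleting anything."""
    counts = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        files = os.listdir(folder_path)
        images = sum(f.lower().endswith(IMAGE_EXTENSIONS) for f in files)
        counts[folder] = (images, len(files) - images)
    return counts


# Demonstrate on a temporary folder with two images and one text file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "healthy"))
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        open(os.path.join(tmp, "healthy", name), "w").close()
    demo_counts = count_image_files(tmp)

print(demo_counts)
```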


And then split our data into Train/Validation/Test directories ready for use in our ML pipeline.

In [10]:
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

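    # compare with a tolerance, since a sum of floats may differ from 1.0 by a rounding error
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return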

    # get the class labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        # the data has already been split, so do nothing
        pass
    else:
        # create train, validation and test folders, each with a sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

* Training data set ratio is 0.70 of our total data.
* Test data set ratio is 0.20 of our total data.
* Validation data set ratio is 0.10 of our total data.
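
To make these ratios concrete, the sketch below works through the arithmetic for one label, assuming the 2104 images per label reported by the cleaning step. `split_sizes` is a hypothetical helper that mirrors the truncation with `int()` used in the split function, with the test set receiving whatever remains.

```python
def split_sizes(total, train_ratio, validation_ratio):
    """Return (train, validation, test) file counts for one label."""
    train = int(total * train_ratio)            # truncated, as in the split function
    validation = int(total * validation_ratio)  # truncated, as in the split function
    test = total - train - validation           # the test set takes the remainder
    return train, validation, test


sizes = split_sizes(2104, 0.7, 0.1)
print(sizes)  # (1472, 210, 422)
```

Because both counts are rounded down, the test set ends up slightly larger than an exact 20% share (422 of 2104 images).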

In [11]:
split_train_validation_test_images(my_data_dir="inputs/cherryleaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
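
After the split, the number of files in each subset can be checked with a short count. This is a minimal sketch that assumes the `train`/`validation`/`test` layout created above. `count_split_files` is a hypothetical helper, demonstrated here on a small temporary folder rather than the real dataset.

```python
import os
import tempfile


def count_split_files(my_data_dir):
    """Return {(split, label): file_count} for the train/validation/test folders."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = os.path.join(my_data_dir, split)
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            counts[(split, label)] = len(os.listdir(label_dir))
    return counts


# Build a tiny stand-in layout: 2 train, 1 validation and 1 test file.
with tempfile.TemporaryDirectory() as tmp:
    layout = {'train': 2, 'validation': 1, 'test': 1}
    for split, quantity in layout.items():
        label_dir = os.path.join(tmp, split, 'healthy')
        os.makedirs(label_dir)
        for index in range(quantity):
            open(os.path.join(label_dir, f"leaf_{index}.jpg"), "w").close()
    demo_split_counts = count_split_files(tmp)

print(demo_split_counts)
```

On the real data, the same function could be pointed at `inputs/cherryleaves_dataset/cherry-leaves` once the split cell has run.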

---

# Push files to repository.

If necessary, uncomment and use the code below to push files to the repo.

In [None]:
# import os
# try:
#     # create here your folder
#     # os.makedirs(name='')
# except Exception as e:
#     print(e)

---

# Data is now cleaned and split. Please continue to the next notebook.