# **DATA COLLECTION**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generate dataset: inputs/cherry-leaves




---

# Import Packages

In [1]:
%pip install --upgrade pip
%pip install numpy==1.19.5

import os
import numpy 

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running them in the editor.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector-cherryleaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector-cherryleaves'
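
The directory change above can also be written so that re-running the cell does not climb further up the tree. A minimal sketch, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `move_to_project_root` is hypothetical:

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Switch to the parent folder only while still inside the notebook folder."""
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling the helper a second time leaves the working directory unchanged, because the current folder is then no longer named `jupyter_notebooks`.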

# Install Kaggle

In [5]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Run the cell below to point the Kaggle configuration directory at the project folder and to restrict the permissions of the Kaggle authentication JSON file.

---

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = '/workspace/mildew-detector-cherryleaves'
! chmod 600 /workspace/mildew-detector-cherryleaves/kaggle.json

chmod: cannot access '/workspace/mildew-detector-cherryleaves/kaggle.json': No such file or directory
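
The `chmod` error above means `kaggle.json` was not in the project folder when the cell ran. A minimal sketch of a guard that checks for the token before setting its permissions; the helper name `prepare_kaggle_token` is hypothetical, and the folder is the one used in the cell above:

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"{token_path} not found: upload your Kaggle token to this folder first."
        )
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return token_path


# Example call, using the folder from the cell above:
# prepare_kaggle_token('/workspace/mildew-detector-cherryleaves')
```

Failing early with a clear message makes the missing token obvious before the download cell runs.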


In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "/workspace/mildew-detector-cherryleaves/inputs"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
cherry-leaves.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the downloaded file, then delete the zip file.

In [8]:
!unzip '/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/*.zip' -d '/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves' && rm '/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/*.zip'


unzip:  cannot find or open /workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/*.zip, /workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/*.zip.zip or /workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/*.zip.ZIP.

No zipfiles found.
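
The error above arises because the download cell saved the archive to `inputs`, while the unzip command looks for it in `inputs/cherry-leaves`. A minimal sketch of the same unzip-then-delete step with Python's standard `zipfile` module; the helper name `extract_and_remove` is hypothetical, and the archive name `cherry-leaves.zip` is taken from the download log:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
    os.remove(zip_path)


# Example call, assuming the archive sits where the download cell saved it:
# extract_and_remove(
#     '/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves.zip',
#     '/workspace/mildew-detector-cherryleaves/inputs')
```

The folder layout inside the archive is not shown in this notebook, so it is worth checking the extracted folders before running the cleaning step.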


---

# Data Preparation

---

## Data Cleaning

Check and remove non-image files

In [9]:
def remove_non_image_file(my_data_dir):
    """Delete non-image files from each sub-folder and report the counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = os.path.join(my_data_dir, folder, given_file)
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [10]:
remove_non_image_file(my_data_dir='/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves')

IsADirectoryError: [Errno 21] Is a directory: '/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves/train/healthy'
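
The `IsADirectoryError` above shows that `inputs/cherry-leaves` already contained `train/healthy`-style sub-folders, one level deeper than the function expects, so `os.remove` was called on a directory. A minimal sketch that walks the tree with `os.walk` and therefore works at any nesting depth; the name `remove_non_image_files_recursive` is hypothetical:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def remove_non_image_files_recursive(my_data_dir):
    """Delete non-image files under my_data_dir; return {folder: (images, removed)}."""
    summary = {}
    for folder, _subfolders, files in os.walk(my_data_dir):
        images, removed = 0, 0
        for file_name in files:
            if file_name.lower().endswith(IMAGE_EXTENSIONS):
                images += 1
            else:
                os.remove(os.path.join(folder, file_name))
                removed += 1
        if files:  # only report folders that actually held files
            summary[folder] = (images, removed)
    return summary
```

Because `os.walk` lists files and sub-folders separately, directories are never passed to `os.remove`.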

---

## Split train, validation and test sets

In [11]:
%pip install joblib

import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: exact float equality can fail for valid ratios
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, test folders with classes labels sub-folder
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

Note: you may need to restart the kernel to use updated packages.
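
Because the function truncates the train and validation counts with `int()`, any remainder goes to the test set. A small sketch of that arithmetic; the helper name `split_counts` and the 1,005-file figure are illustrative only and do not describe the actual dataset:

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts as the split function computes them."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty  # everything left over
    return train_qty, validation_qty, test_qty


print(split_counts(1005, 0.7, 0.1))  # (703, 100, 202)
```

The test share can therefore be slightly larger than its nominal ratio when the file count does not divide evenly.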


---

* The training set receives 70% of the data.
* The validation set receives 10% of the data.
* The test set receives 20% of the data.


In [12]:
split_train_validation_test_images(my_data_dir="/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
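
To confirm the split, one option is to count the files in each `train`, `validation` and `test` label folder. A minimal sketch, assuming the `<split>/<label>` layout created by the function above; the helper name `count_split_files` is hypothetical:

```python
import os


def count_split_files(my_data_dir):
    """Return {(split, label): number of files} for each split folder that exists."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split has not been created
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                counts[(split, label)] = len(os.listdir(label_dir))
    return counts


# Example call, using the dataset folder from the cell above:
# count_split_files('/workspace/mildew-detector-cherryleaves/inputs/cherry-leaves')
```

An empty result means none of the three split folders exists, which suggests the split did not run.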

---


# Conclusion

---

The dataset is diverse and of high quality, with no issues encountered during retrieval.
For data preprocessing, non-image files were removed and the dataset was divided into three subsets:
* 70% for training
* 10% for validation
* 20% for testing