# **Data Collection**

## Objectives

- Fetch data from Kaggle and prepare it for further processing.

## Inputs

- Kaggle JSON file - the authentication token.

## Outputs

- Generate Dataset: "inputs/cherry_leaves_dataset/cherry-leaves"

## Additional Comments

- No additional comments.

---

## Import Packages

In [1]:
import numpy as np
import os

## Change working directory

In [2]:
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")
os.chdir('/workspace/pp5_cherry_leaves_mildew_detection')
print("New current directory set:")
current_dir = os.getcwd()
print(current_dir)

Current directory: /workspace/pp5_cherry_leaves_mildew_detection/jupyter_notebooks
New current directory set:
/workspace/pp5_cherry_leaves_mildew_detection
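
- The cell above hard-codes the workspace path. As a minimal sketch, the project root could instead be located by walking up from the current directory. The helper name `find_project_root` and the use of the `jupyter_notebooks` folder as a marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="jupyter_notebooks"):
    """Return the closest directory at or above `start` that contains `marker`.

    `marker` is a hypothetical choice: the folder that holds the notebooks.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory at or above {start} contains '{marker}'")
```

- Usage would be `os.chdir(find_project_root(os.getcwd()))`, which should work whether the notebook starts in the project root or in `jupyter_notebooks`.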


## Install Kaggle

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.6-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.6-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73027 sha256=50b08d007069cd098ad079b28645de18e61efc7a3e2b6bfaa4ce4c4dd5fe4ad8
  Stored in directory: /home/gitpod/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b3

### Set Kaggle Configuration Directory

In [5]:
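import os
# Point the Kaggle CLI at the project root, where kaggle.json is stored
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json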
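
- Before downloading, it can help to confirm that the token file is present and well formed. The sketch below assumes `kaggle.json` holds the `username` and `key` fields that Kaggle issues with an API token. The helper name `check_kaggle_token` is hypothetical.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir):
    """Validate kaggle.json in `config_dir` and restrict its permissions."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"{token_path} not found")

    with token_path.open() as token_file:
        token = json.load(token_file)

    # Assumption: a valid token contains both of these fields
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Owner read/write only, the Python equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

- Calling `check_kaggle_token(os.getcwd())` raises an error early if the token is missing, instead of failing later in the download step.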

### Set the Kaggle Dataset Path and Download destination

- Get the dataset path from the Kaggle URL: it is the part that follows `kaggle.com/datasets/`.

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"

- Download the Dataset

In [7]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 91%|██████████████████████████████████▌   | 50.0M/55.0M [00:02<00:00, 33.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 25.4MB/s]


- Unzip the downloaded file and delete the zip file to save space.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
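
- Re-running the cell above after the archive has been deleted raises `FileNotFoundError`. A minimal sketch of a re-runnable variant follows. The helper name `extract_and_remove` is an assumption for illustration.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.

    Returns False without doing anything if the archive is not there,
    so the cell can be run more than once.
    """
    if not os.path.isfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True
```

- With the variables defined earlier, the call would be `extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.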

---

# Data Preparation

## Data cleaning 

### Remove Non-Image Files

In [12]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each label folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        removed_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                removed_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - Image files count: {image_count}, Non-image files removed: {removed_count}")

remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - Image files count: 2104, Non-image files removed: 0
Folder: powdery_mildew - Image files count: 2104, Non-image files removed: 0
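
- Because the cleaning step deletes files, a dry run that only lists what would be removed may be useful first. This is a sketch under the same extension rule as above. The helper name `list_non_image_files` is hypothetical.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_image_files(data_dir):
    """Map each label folder to the non-image files it contains, deleting nothing."""
    non_images = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files next to the label folders
        non_images[folder] = [
            name for name in sorted(os.listdir(folder_path))
            if not name.lower().endswith(IMAGE_EXTENSIONS)
        ]
    return non_images
```

- An empty list for every label means the cleaning step has nothing to remove.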


## Split Data into Train, Validation and Test Sets

In [13]:
import math
import random
import shutil

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance: exact float equality can fail for ratios that sum to 1.0 on paper
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("Ratios for train, validation, and test sets must sum to 1.0")
        return

    labels = os.listdir(my_data_dir)  # class folders, e.g. healthy and powdery_mildew
    if 'test' in labels:
        return  # the data has already been split

    # Create train/validation/test folders, each with one subfolder per label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        # The first files go to train, the next to validation, the remainder to test
        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                subset = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                subset = 'validation'
            else:
                subset = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, subset, label, file_name))

        os.rmdir(label_dir)  # the original label folder is now empty
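
- The shuffle in the function above is unseeded, so each fresh run produces a different split. If a reproducible split is wanted, one option is a seeded generator, sketched below. The helper name `shuffled_copy` and the seed value are assumptions, not part of the original notebook.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that is repeatable for a given seed."""
    rng = random.Random(seed)   # private generator, leaves the global random state alone
    files = sorted(file_names)  # os.listdir order is arbitrary, so sort first
    rng.shuffle(files)
    return files
```

- Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` should then assign the same files to each set on every run.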


### Set the Data split ratios

In [15]:
split_train_validation_test_images(
    my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves',
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2
)
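
- The set sizes follow from the truncation in the split function: with 2104 images per label, `int(2104 * 0.7)` and `int(2104 * 0.1)` round down, and the test set receives the remainder. The sketch below reproduces that arithmetic and counts the files on disk for comparison. Both helper names are hypothetical.

```python
import os


def expected_split_sizes(n_files, train_ratio, validation_ratio):
    """Mirror the split function: truncate train and validation, test gets the rest."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    return train, validation, n_files - train - validation


def count_files_per_split(data_dir):
    """Return {(subset, label): file count} for the train/validation/test folders."""
    counts = {}
    for subset in ('train', 'validation', 'test'):
        subset_dir = os.path.join(data_dir, subset)
        for label in sorted(os.listdir(subset_dir)):
            counts[(subset, label)] = len(os.listdir(os.path.join(subset_dir, label)))
    return counts


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

- Running `count_files_per_split('inputs/cherry_leaves_dataset/cherry-leaves')` after the split should report these three sizes for each label.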

---

# Next Steps:

* Prepare, analyze, and visualize the cherry leaf images to meet the business needs.

---