# **Data Collection**

## Objectives

* Fetch cherry leaf image data from Kaggle and prepare it for use.

## Inputs

* Kaggle JSON file - authentication token. 

## Outputs

* Dataset: input/datasets/cherry_leaf_dataset



---

# Imports

In [None]:
%pip install -r /workspaces/PP5-MildewDetection/requirements.txt

In [None]:
import os
import random
import sys
from pathlib import Path

import numpy as np

# Run from the project root so relative paths and the src package resolve
proj_root = Path.cwd()
if proj_root.name == "jupyter_notebooks":
    proj_root = proj_root.parent
os.chdir(proj_root)
sys.path.insert(0, str(proj_root))

from src.data_utils import clean_image_dataset, split_dataset, fetch_kaggle_dataset

### Define Seed for Reproducibility Across Runs

In [None]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
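
As a quick illustration of why we fix the seed, the sketch below re-seeds Python's `random` module and checks that the same values come back. The helper name `draw_sample` is hypothetical and used only for this demonstration; NumPy's global generator behaves the same way after `np.random.seed(SEED)`.

```python
import random


def draw_sample(seed):
    """Seed the generator, then draw five values without replacement."""
    random.seed(seed)
    return random.sample(range(100), 5)


first = draw_sample(42)
second = draw_sample(42)

# Re-seeding with the same value reproduces the same draw
print(first == second)  # True
```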

# Change working directory

The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory must be changed from that folder to its parent, the project root.
* We access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

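We want to make the parent of the current directory the new current directory, but only if we are still inside the notebooks subfolder. The imports cell above already moves to the project root, so changing directory unconditionally here would step one level above the project and break the relative paths used below.
* os.path.basename() gets the name of the current folder
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
    print("You set a new current directory")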

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Kaggle Installation and Configuration

The following function call completes several actions:

1. It installs the Kaggle package.
2. It sets the Kaggle configuration directory to the current working directory and sets permissions on the Kaggle authentication JSON.
3. It downloads the dataset at the given Kaggle path to the folder specified in the 'Outputs' section.
4. It unzips the downloaded file, then deletes the zip file.

In [None]:
fetch_kaggle_dataset(
    kaggle_path="codeinstitute/cherry-leaves",
    dest_folder="input/datasets/cherry_leaf_dataset"
)
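
For readers who want to see what these steps can look like in code, here is a minimal sketch. It is not the project's `fetch_kaggle_dataset`. The names `build_download_command`, `extract_and_remove_zips` and `fetch_dataset_sketch` are hypothetical, and we assume the `kaggle` package is already installed and that `kaggle.json` sits in the configuration directory.

```python
import os
import subprocess
import zipfile
from pathlib import Path


def build_download_command(kaggle_path, dest_folder):
    """Return the Kaggle CLI command that downloads a dataset as a zip file."""
    return ["kaggle", "datasets", "download", "-d", kaggle_path, "-p", str(dest_folder)]


def extract_and_remove_zips(dest_folder):
    """Unzip every archive found in dest_folder, then delete the archive."""
    for zip_path in sorted(Path(dest_folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest_folder)
        zip_path.unlink()


def fetch_dataset_sketch(kaggle_path, dest_folder, config_dir="."):
    """Hypothetical stand-in for the steps listed above (assumption, not project code)."""
    config_dir = Path(config_dir).resolve()
    # The Kaggle CLI reads kaggle.json from KAGGLE_CONFIG_DIR when it is set
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Restrict the token file to owner read/write
    os.chmod(config_dir / "kaggle.json", 0o600)
    Path(dest_folder).mkdir(parents=True, exist_ok=True)
    # Download the zipped dataset, failing loudly if the CLI returns an error
    subprocess.run(build_download_command(kaggle_path, dest_folder), check=True)
    extract_and_remove_zips(dest_folder)
```

Keeping the command builder and the unzip step as separate helpers lets us check them without touching the network.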

---

# Data Preparation

---

## Data Cleaning

### The method below lets us find and remove files based on their extension type.

We need to find and remove any non-image files from our dataset, if they exist.

We create a variable holding the path to our data and pass it as the 'root_dir' parameter of the clean_image_dataset method.

As we are looking for image files, the 'extensions' parameter can be left at its default.

In [None]:
dataset_path = "input/datasets/cherry_leaf_dataset/cherry-leaves"
clean_image_dataset(dataset_path)
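
To make the idea concrete, here is a minimal sketch of extension-based cleaning. It is not the project's `clean_image_dataset`. The function name `remove_non_images` and the default extension tuple are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the project's default may differ
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def remove_non_images(root_dir, extensions=IMAGE_EXTENSIONS):
    """Delete files under root_dir whose suffix is not in extensions.

    Returns the list of removed paths so the caller can report them.
    """
    removed = []
    for path in sorted(Path(root_dir).rglob("*")):
        # Compare suffixes case-insensitively so "LEAF.JPG" is kept
        if path.is_file() and path.suffix.lower() not in extensions:
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_non_images(dataset_path)` would walk every class subfolder and return the files it deleted.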

---

# Splitting Train, Validation and Test Sets

In keeping with convention, we will allocate:

* 70% of the data to Training.
* 10% to Validation.
* 20% to Testing.

In [None]:
split_dataset(
    data_dir="input/datasets/cherry_leaf_dataset/cherry-leaves",
    train_ratio=0.7,
    val_ratio=0.1,
    test_ratio=0.2,
)
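
The ratio arithmetic can be sketched on a plain list of file names. This is an illustration only, not the project's `split_dataset`. The function name `split_items` and its `seed` parameter are hypothetical.

```python
import random


def split_items(items, train_ratio=0.7, val_ratio=0.1, test_ratio=0.2, seed=42):
    """Shuffle items with a fixed seed and cut them into train/val/test lists."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains goes to the test set, so no item is dropped
    test = shuffled[n_train + n_val:]
    return train, val, test


files = [f"leaf_{i}.jpg" for i in range(100)]
train, val, test = split_items(files)
print(len(train), len(val), len(test))  # 70 10 20
```

In practice the split would be applied per class folder so that each label keeps the same proportions.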

---