# **DATA COLLECTION**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and check for non-image files
* Split the data into Train, Test and Validation sets
* Save the split sets under inputs/cherry_dataset/cherry-leaves

## Inputs

* kaggle.json for the authentication token

## Outputs

Generate dataset folders for the sets:
* Train sets:
    - inputs/cherry_dataset/cherry-leaves/train/healthy
    - inputs/cherry_dataset/cherry-leaves/train/powdery_mildew
* Test sets:
    - inputs/cherry_dataset/cherry-leaves/test/healthy
    - inputs/cherry_dataset/cherry-leaves/test/powdery_mildew
* Validation sets:
    - inputs/cherry_dataset/cherry-leaves/validation/healthy
    - inputs/cherry_dataset/cherry-leaves/validation/powdery_mildew

## Additional Comments

* This covers the second and third phases of the CRISP-DM workflow, which are data understanding and data preparation


---

# Fetch data from Kaggle


First, download an authentication token (a **JSON file**) from Kaggle to your machine.

The process is as follows:

1. From the site header, click on your user profile picture, then on **“Account”** from the dropdown menu. This will take you to your account settings.
2. Scroll down to the section of the page called **API**.
3. Click **Expire API Token** to remove previous tokens.
4. To create a new token, click the “**Create New API Token**” button. This generates a fresh authentication token and downloads a kaggle.json file to your machine.

In case of any difficulty, go to the "**Authentication**" section at this link.

* The file should now be saved locally on your machine. Make sure it is named kaggle.json.
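
Before moving on, it can help to confirm that the token file is well-formed. The sketch below assumes the usual layout of kaggle.json, a JSON object with `username` and `key` fields. The helper name `is_valid_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import json
from pathlib import Path


def is_valid_kaggle_token(token_path="kaggle.json"):
    """Return True if the file exists, parses as JSON and has non-empty
    'username' and 'key' fields (assumed layout of a Kaggle API token)."""
    path = Path(token_path)
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    if not isinstance(token, dict):
        return False
    return bool(token.get("username")) and bool(token.get("key"))
```

Calling `is_valid_kaggle_token("kaggle.json")` from the folder that holds the token returns `False` when the file is missing, misnamed or incomplete.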



# Import packages


In [2]:
pip install -r /workspace/mildew-detector/requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [6]:
import numpy

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [7]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [8]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [9]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector'
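
Re-running the `os.chdir` cell moves up one more level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard follows, assuming the notebooks live in a folder named `jupyter_notebooks` as in the output above. The helper name `move_to_project_root` is hypothetical.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only when the current directory
    is the notebooks folder, so re-running the cell is harmless."""
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

The function returns the working directory after the call, so the result can be displayed directly in the cell output.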

# Install Kaggle

Install Kaggle package

In [7]:
pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105786 sha256=568340236a5c487cdc6650fdcbe4ff9cdcadee3d9b9c360d216da28faa008a7c
  Stored in directory: /home/gitpod/.cache/pip/wheels/a5/6f/7b/837915771e94e181fa3052822926444e34f725ca38e70be77e
Successfull

Run the cell below **to change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**.



In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
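
The `! chmod` line relies on a Unix shell. A rough Python equivalent is sketched below, assuming a POSIX system (on Windows, `os.chmod` only toggles the read-only flag). The helper name `secure_kaggle_token` is hypothetical.

```python
import os
import stat


def secure_kaggle_token(config_dir=None, token_name="kaggle.json"):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict the token
    file to owner read/write (same effect as `chmod 600`)."""
    config_dir = config_dir or os.getcwd()
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"{token_path} not found")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # 0o600
    return token_path
```

Raising an error when the token is missing surfaces the problem here rather than later, when the download command fails.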


Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves).

![Kaggle](static/images/kaggle_dataset.png)

Set the Kaggle Dataset and Download it.

In [3]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:02<00:00, 25.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 21.3MB/s]


---

Unzip the downloaded file, and delete the zip file.

In [11]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
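
If the download was interrupted, the archive may be missing or corrupt. The sketch below checks the archive before extracting it and deletes it only afterwards. `extract_and_remove` is a hypothetical helper, and the commented call assumes the `DestinationFolder` variable defined above.

```python
import os
import zipfile


def extract_and_remove(zip_path, target_dir):
    """Extract zip_path into target_dir, then delete the archive.
    Returns the list of member names that were extracted."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or not a valid zip file")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(target_dir)
    os.remove(zip_path)  # only reached if extraction succeeded
    return names


# Example call (paths as used in this notebook):
# extract_and_remove(os.path.join(DestinationFolder, "cherry-leaves.zip"), DestinationFolder)
```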


# **DATA PREPARATION**

---

### Data Cleaning

##### Check for and remove non-image files

The function below removes any file whose extension is not png, jpg or jpeg.

In [14]:
def remove_non_image_file(my_data_dir):
    """Remove every file whose extension is not .png, .jpg or .jpeg
    and print the image and non-image counts for each folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)

    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        # Iterate over every file in each folder of the dataset
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [15]:
remove_non_image_file(my_data_dir='inputs/cherry_dataset/cherry-leaves')


Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


- There are two image classes, 'healthy' and 'powdery_mildew' (infected), each with 2104 image files.
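
The check above trusts the file extension, so a file named like an image but holding other content would pass. As a complementary check, the sketch below inspects the first bytes of a file for the standard PNG and JPEG signatures. `looks_like_image` is a hypothetical helper, and it does not validate the rest of the file.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG header
JPEG_SIGNATURE = b"\xff\xd8\xff"      # JPEG start-of-image marker


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as handle:
        header = handle.read(8)
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```

A stricter alternative would be to open each file with an imaging library, at the cost of an extra dependency and a slower pass over the dataset.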

#### **Split train, validation and test sets**

Conventionally, the data is split as follows:

- The training set takes 70% (ratio 0.70) of the data.
- The validation set takes 10% (ratio 0.10) of the data.
- The test set takes 20% (ratio 0.20) of the data.
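
With 2104 images per class, these ratios translate into concrete counts. The sketch below assumes the train and validation counts are truncated to whole numbers and the test set takes the remainder. This rounding rule is an assumption, not something stated in the notebook, and `split_counts` is a hypothetical helper.

```python
def split_counts(total, train_ratio=0.7, validation_ratio=0.1):
    """Return (train, validation, test) counts for `total` files."""
    train = int(total * train_ratio)
    validation = int(total * validation_ratio)
    test = total - train - validation  # remainder goes to the test set
    return train, validation, test


print(split_counts(2104))  # (1472, 210, 422)
```

Giving the remainder to the test set ensures every file is assigned to exactly one set, even when the ratios do not divide the total evenly.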

In [17]:
import os
import shutil
import random
from pathlib import Path

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    '''
    This function splits images from the input directory into training, validation, and test sets,
    and saves them into corresponding directories in the output directory.

    Parameters:
        my_data_dir (str): The directory path containing the input images.
        train_set_ratio (float): The ratio of images to be allocated for training (e.g. 0.7).
        validation_set_ratio (float): The ratio of images to be allocated for validation (e.g. 0.1).
        test_set_ratio (float): The ratio of images to be allocated for testing (e.g. 0.2).

    '''

    if train_set_ratio + validation_set_ratio + test_set_ratio != 1.0:
        # Check that the sum of all the ratios is 1
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels
    labels = os.listdir(my_data_dir)  # should contain only the class folder names
    if 'test' not in labels:
        # If 'test' already exists, all the folders have been created;
        # otherwise create train, validation and test folders with a sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)
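
The cell above stops after creating the folders, so the step that moves the images is not shown. One possible way to finish the split is sketched below. It is an assumption about the missing logic, not the notebook's own code: files in each class folder are shuffled, the first share goes to `train`, the next to `validation`, the rest to `test`, and the emptied class folder is removed. The function name `split_images_sketch` and the `seed` parameter are hypothetical. The ratio check uses `math.isclose`, which tolerates floating-point rounding.

```python
import math
import os
import random
import shutil


def split_images_sketch(data_dir, train_ratio, validation_ratio, test_ratio, seed=None):
    """Move files from data_dir/<label>/ into data_dir/<set>/<label>/."""
    if not math.isclose(train_ratio + validation_ratio + test_ratio, 1.0):
        raise ValueError("The three ratios should sum to 1.0")

    sets = ["train", "validation", "test"]
    # Class labels are the sub-folders that are not already split folders
    labels = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name)) and name not in sets
    )
    rng = random.Random(seed)

    for label in labels:
        label_dir = os.path.join(data_dir, label)
        files = sorted(os.listdir(label_dir))
        rng.shuffle(files)

        train_end = int(len(files) * train_ratio)
        validation_end = train_end + int(len(files) * validation_ratio)

        for index, file_name in enumerate(files):
            if index < train_end:
                target_set = "train"
            elif index < validation_end:
                target_set = "validation"
            else:
                target_set = "test"
            target_dir = os.path.join(data_dir, target_set, label)
            os.makedirs(target_dir, exist_ok=True)
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(target_dir, file_name))

        os.rmdir(label_dir)  # class folder is empty after the move
```

Passing a fixed `seed` makes the shuffle repeatable, which helps when the split has to be reproduced on another machine.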



---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)
