# Data Collection

---

## Fetch data from Kaggle 

## Objectives
The main objectives of this notebook are as follows:

1. Download data from Kaggle and save it as raw data for further processing.
2. Clean the downloaded data and split it into training, validation and test sets.
3. Save the data into the "inputs/cherry_leaves_dataset" directory.

## Inputs
The notebook requires the following input:

- Kaggle JSON file (`kaggle.json`): contains the authentication key for accessing the Kaggle dataset.

## Data Description
The dataset for this project consists of 4208 images, which are evenly split between two categories:
1. Images of healthy cherry leaves.
2. Images of cherry leaves infected with powdery mildew.

## Outputs
The expected outcomes of this notebook are as follows:

1. Non-image files will be removed from both the healthy and infected groups.
2. The dataset will be divided into three sets:
   - Training set
   - Testing set
   - Validation set
   
The notebook saves the processed data into the "inputs/cherry_leaves_dataset" directory for further use in the Cherry Leaves Detection model.


### Import packages and libraries

In [2]:
%pip install -r /workspaces/PP5-mildew-detection-in-cherry-leaves/requirements.txt

Collecting typing-extensions~=3.7.4 (from tensorflow-cpu==2.6.0->-r /workspaces/PP5-mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.7.1
    Uninstalling typing_extensions-4.7.1:
      Successfully uninstalled typing_extensions-4.7.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
astroid 2.15.6 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
async-lru 2.0.3 requires typing-extensions>=4.0.0; python_version < "3.11", but you have typing-extensions 3.7.4.3 which is incompatible.
mypy 1.4.1 requires typing-extensions>=4.1.0, but you have typing-extensions 3.7.4.3 whi

In [3]:
import numpy
import os

### Change the working directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-mildew-detection-in-cherry-leaves/jupyter_notebooks'

In [5]:
os.chdir('/workspaces/PP5-mildew-detection-in-cherry-leaves/')
print("You set a new current directory")

You set a new current directory


##### Confirm the new current directory

In [6]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-mildew-detection-in-cherry-leaves'

### Install Kaggle

In [8]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-p

##### Set Kaggle configuration directory to the current working directory and set permissions for Kaggle authentication JSON

In [9]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
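
The permission change can also be made from Python instead of a shell escape, which lets the cell fail early if the token is missing. This is a minimal sketch, assuming `kaggle.json` sits in the working directory; the helper name `restrict_token_permissions` is hypothetical and not part of the Kaggle package.

```python
import os
import stat


def restrict_token_permissions(token_path="kaggle.json"):
    """Give the owner read/write access to the token and nobody else (mode 600)."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can check them
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Calling `restrict_token_permissions()` after setting `KAGGLE_CONFIG_DIR` should have the same effect as `chmod 600 kaggle.json` on a POSIX system.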

### Download the dataset

In [10]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:01<00:00, 46.2MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 44.2MB/s]


##### Unzip and delete the zip file

In [11]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
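
The unzip-and-delete step can be wrapped in a small function that first checks the archive is valid, so a failed download does not end in a confusing extraction error. This is a sketch only; `extract_and_remove` is a hypothetical helper name, not something the notebook defines.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    # Only remove the archive once extraction has finished without error
    os.remove(zip_path)
    return extracted
```

In this notebook it would be called as `extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.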

### Data cleaning and preparation

#### Check for non-image files and remove them

In [12]:
def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg image and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove the non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [14]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
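
Because the cleaning step deletes files, it can be useful to look at which extensions are present before removing anything. The sketch below is a read-only variant; `count_extensions` is a hypothetical helper name and it assumes the same one-folder-per-class layout as above.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Return {folder: Counter of lower-cased file extensions} without deleting anything."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result marks files that `remove_non_image_file` would delete.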


### Split data into train, test and validation set

In [15]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Move each class folder's images into train/validation/test sub-folders."""

    # Compare with a tolerance: a sum of float ratios may not be exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Class labels are the sub-folder names
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # The dataset has already been split; nothing to do
        return

    # Create train, validation and test folders, each with one sub-folder per class
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target = 'validation'
            else:
                # Whatever is left after train and validation goes to test
                target = 'test'

            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target, label, file_name))

        # The original class folder is empty now, so remove it
        os.rmdir(label_dir)

#### Split the dataset into train, validation and test sets (70%-10%-20%)

In [16]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
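
The split function truncates the train and validation counts with `int()` and sends the remaining files to the test set, so the resulting sizes are not exactly 70%, 10% and 20%. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper used only for illustration, and 2104 is the per-class image count reported by the cleaning step.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts using the same truncation as the split function."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation  # the test set receives the remainder
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

With two classes of 2104 images each, this gives 2944 training, 420 validation and 844 test images in total.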

#### Add, commit and push to GitHub

In [18]:
! git add . 

In [24]:
! git config pull.rebase false

In [25]:
! git pull

hint: Waiting for your editor to close the file... error: cannot run editor: No such file or directory
error: unable to start editor 'editor'
Not committing merge; use 'git commit' to complete the merge.


In [26]:
! git commit -m "download clean, prepare and split the dataset"

[main 656c22b] download clean, prepare and split the dataset


In [27]:
! git push 

Enumerating objects: 4239, done.
Counting objects: 100% (4236/4236), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4229/4229), done.
Writing objects: 100% (4231/4231), 54.10 MiB | 14.62 MiB/s, done.
Total 4231 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 2 local objects.
To https://github.com/23052015/PP5-mildew-detection-in-cherry-leaves
   7ac132f..656c22b  main -> main


#### Conclusions
- The dataset has been downloaded, cleaned and prepared for further processing.
- Three folders (train, validation and test) have been created, each containing one sub-folder of images per class.
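
The folder structure described above can be checked by counting the files in each set and class. This is a minimal sketch, assuming the `train`, `validation` and `test` folders sit directly under the dataset directory; `count_split_images` is a hypothetical helper name.

```python
import os


def count_split_images(my_data_dir, sets=("train", "validation", "test")):
    """Return {(set_name, label): number of files} for the split dataset."""
    counts = {}
    for set_name in sets:
        set_dir = os.path.join(my_data_dir, set_name)
        for label in sorted(os.listdir(set_dir)):
            counts[(set_name, label)] = len(os.listdir(os.path.join(set_dir, label)))
    return counts
```

For this notebook the call would be `count_split_images('inputs/cherry_leaves_dataset/cherry-leaves')`.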
