# **Data Collection Notebook**

### Objectives

* Download the dataset from Kaggle
* Clean the downloaded dataset
* Save the data downloaded from Kaggle in the dataset directory, inputs/datasets/raw

### Inputs

* Kaggle JSON file - authentication token

### Outputs

* Generate Dataset: inputs/datasets/raw/cherry-leaves

### Additional Comments

* The client provided the data under an NDA (non-disclosure agreement), therefore the data will only be shared with professionals that are officially involved in the project.


---

## Import packages

In [1]:
import numpy
import os

## Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection/jupyter_notebooks'

Set the parent of the current directory as the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

A new current directory has been set


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection'
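
The same parent-directory lookup can also be written with pathlib. This is a minimal sketch, assuming POSIX-style paths like the ones shown above; `parent_of` is a hypothetical helper name and is not part of the original notebook.

```python
from pathlib import PurePosixPath


def parent_of(directory):
    """Return the parent of a POSIX-style directory path as a string."""
    return str(PurePosixPath(directory).parent)


# The notebook folder from the earlier output resolves to the project folder.
print(parent_of("/workspace/mildew-detection/jupyter_notebooks"))
```

Unlike os.chdir(), this only computes the path; it does not change the working directory.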

## Fetch data from Kaggle

Install kaggle to fetch data

In [5]:
# install kaggle package
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting python-slugify
  Downloading python_slugify-6.1.2-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=0ab0bdba86909378884e0d255d9bbad09a9297ecd593a4b46eeb54269fadd433
  Sto

---

Set the Kaggle configuration directory to the current working directory, and restrict the permissions of the kaggle.json authentication file so that the token is recognized in the session

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
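
The permission change can also be done from Python instead of the shell. The sketch below assumes a POSIX system and runs on a throwaway file rather than the real token; `restrict_to_owner` is a hypothetical helper name.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Give read/write permission to the file owner only (same as chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so they can be checked.
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary stand-in for kaggle.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as token_file:
        token_file.write("{}")
    mode = restrict_to_owner(token_path)
    print(oct(mode))
```

Mode 600 means that only the file owner can read or write the token.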

* Define the dataset path - The path is the text after [https://kaggle.com/datasets/](https://kaggle.com/datasets/)
* Define the destination folder

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets/raw"
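
If only the full dataset URL is at hand, the dataset path can be derived from it. This is a minimal sketch, assuming the URL follows the pattern described above; `dataset_path_from_url` is a hypothetical helper, not part of the Kaggle package.

```python
from urllib.parse import urlparse


def dataset_path_from_url(url):
    """Return the 'owner/dataset' part of a Kaggle dataset URL."""
    path = urlparse(url).path.strip("/")
    prefix = "datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Everything after 'datasets/' is the dataset path.
    return path[len(prefix):]


print(dataset_path_from_url("https://kaggle.com/datasets/codeinstitute/cherry-leaves"))
```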

Download dataset from Kaggle

In [8]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/datasets/raw
 75%|█████████████████████████████          | 41.0M/55.0M [00:00<00:00, 209MB/s]
100%|███████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 232MB/s]


Unzip and delete the downloaded file

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
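
The unzip-and-delete step can be wrapped in a function that checks the archive before extracting it. The sketch below is one possible variant and runs on a throwaway archive, not on the real download; `extract_and_remove` is a hypothetical helper name.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract a zip archive into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns None when every member passes its CRC check.
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad_member}")
        zip_ref.extractall(destination)
    os.remove(zip_path)


# Small self-contained demonstration with a temporary archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(archive, "w") as zip_ref:
        zip_ref.writestr("healthy/leaf.txt", "placeholder")
    extract_and_remove(archive, tmp_dir)
    extracted = os.path.exists(os.path.join(tmp_dir, "healthy", "leaf.txt"))
    archive_removed = not os.path.exists(archive)
    print(extracted, archive_removed)
```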

Delete the kaggle.json file

In [10]:
! rm kaggle.json

---

# Data Preparation

---

## Data Cleaning

In [11]:
import tensorflow as tf

Examine and delete all non-image files

In [12]:
def remove_invalid_image_file(my_data_dir):
    '''
    Iterate through the class folders in the given directory
    and delete every file that does not feature the string "JFIF"
    in its header, i.e. non-image or badly-encoded image files.

        Parameters:
            my_data_dir (string): parent directory of the class folders
    '''
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        files = os.listdir(folder_path)

        total_non_image_files = 0
        total_image_files = 0
        for given_file in files:
            file_location = os.path.join(folder_path, given_file)
            # the context manager closes the file even if reading fails
            with open(file_location, "rb") as fobj:
                is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

            if not is_jfif:
                os.remove(file_location)
                total_non_image_files += 1
            else:
                total_image_files += 1
        print(f"Folder: {folder} - has total image file of ",total_image_files)
        print(f"Folder: {folder} - total non-image file deleted is ",total_non_image_files)

In [13]:
remove_invalid_image_file('inputs/datasets/raw/cherry-leaves')

Folder: healthy - has total image file of  2104
Folder: healthy - total non-image file deleted is  0
Folder: powdery_mildew - has total image file of  2104
Folder: powdery_mildew - total non-image file deleted is  0
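
The header check itself can be tried in isolation on synthetic files. This is a minimal sketch that does not need TensorFlow; `has_jfif_marker` is a hypothetical helper. It reads exactly the first 10 bytes, which is stricter than peek(10), since peek may return more bytes than requested.

```python
import os
import tempfile


def has_jfif_marker(file_path, header_size=10):
    """Return True if b'JFIF' appears in the first header_size bytes."""
    with open(file_path, "rb") as file_obj:
        return b"JFIF" in file_obj.read(header_size)


# One file with a JFIF-style header and one plain text file.
with tempfile.TemporaryDirectory() as tmp_dir:
    jpeg_path = os.path.join(tmp_dir, "leaf.jpg")
    text_path = os.path.join(tmp_dir, "notes.txt")
    with open(jpeg_path, "wb") as jpeg_file:
        jpeg_file.write(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01")
    with open(text_path, "wb") as text_file:
        text_file.write(b"not an image")
    results = (has_jfif_marker(jpeg_path), has_jfif_marker(text_path))
    print(results)  # (True, False)
```

Note that this check only recognizes JPEG files written with a JFIF header; other valid image formats would be treated as non-image files.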


## Split into train, validation and test sets

In [14]:
labels = os.listdir("inputs/datasets/raw/cherry-leaves")
labels

['healthy', 'powdery_mildew']

In [15]:
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  '''Source code from 
  [code institute - data preparation](https://github.com/GyanShashwat1611/WalkthroughProject01)
  
  Split the dataset into train, validation and test sets and move
  the files into a new directory for each set. For example, training
  images are moved to the train directory, which has one sub-folder
  per class label (healthy and powdery_mildew).
  '''
  # compare with a tolerance, since float sums can carry rounding error
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # get the class labels; the directory should contain only the class folders
  labels = os.listdir(my_data_dir)
  if 'test' in labels:
    # the split folders already exist, so there is nothing to do
    return

  # create train, validation and test folders with a sub-folder per class label
  for folder in ['train', 'validation', 'test']:
    for label in labels:
      os.makedirs(os.path.join(my_data_dir, folder, label))

  for label in labels:
    files = os.listdir(os.path.join(my_data_dir, label))
    random.seed(110)
    random.shuffle(files)

    train_set_files_qty = int(len(files) * train_set_ratio)
    validation_set_files_qty = int(len(files) * validation_set_ratio)

    for count, file_name in enumerate(files, start=1):
      if count <= train_set_files_qty:
        target_set = 'train'
      elif count <= train_set_files_qty + validation_set_files_qty:
        target_set = 'validation'
      else:
        # the remaining files go to the test set
        target_set = 'test'

      shutil.move(os.path.join(my_data_dir, label, file_name),
                  os.path.join(my_data_dir, target_set, label, file_name))

    # the class folder is now empty and can be removed
    os.rmdir(os.path.join(my_data_dir, label))

The dataset is split into:
* training data - 0.7 ratio of the original dataset
* validation data - 0.1 ratio of the original dataset
* test data - 0.2 ratio of the original dataset
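
As a worked example of these ratios, the sketch below computes the expected number of files per set for one class of 2104 images, the count reported by the cleaning step. It mirrors the integer truncation used in the split function, with the test set taking the remainder; `split_counts` is a hypothetical helper name.

```python
def split_counts(total_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts; the test set takes the remainder."""
    train_qty = int(total_files * train_ratio)
    validation_qty = int(total_files * validation_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation counts are truncated, the test set ends up slightly larger than an exact 0.2 share (422 files rather than 420.8).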

In [16]:
split_train_validation_test_images(my_data_dir="inputs/datasets/raw/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
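
After the split, the number of files in each folder can be checked against the ratios. This is a minimal sketch, assuming the train/validation/test layout created by the split function; `count_split_files` is a hypothetical helper, demonstrated here on a small temporary layout.

```python
import os
import tempfile


def count_split_files(data_dir):
    """Return {split: {label: file_count}} for a train/validation/test layout."""
    counts = {}
    for split in ("train", "validation", "test"):
        split_path = os.path.join(data_dir, split)
        if not os.path.isdir(split_path):
            continue  # skip splits that have not been created
        counts[split] = {
            label: len(os.listdir(os.path.join(split_path, label)))
            for label in sorted(os.listdir(split_path))
        }
    return counts


# Build a tiny stand-in layout with one class and count its files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for split, qty in (("train", 3), ("validation", 1), ("test", 2)):
        label_dir = os.path.join(tmp_dir, split, "healthy")
        os.makedirs(label_dir)
        for index in range(qty):
            open(os.path.join(label_dir, f"leaf_{index}.jpg"), "w").close()
    demo_counts = count_split_files(tmp_dir)
    print(demo_counts)
```

In the notebook, the same function could be called on inputs/datasets/raw/cherry-leaves once the split has run.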

# Next step

* Data Visualization