# **Data Collection Notebook**

### Objectives

* Download the cherry leaves dataset from Kaggle and clean it
* Save the downloaded data in the dataset directory, inputs/cherry_leaves_data

### Inputs

* Kaggle JSON file - authentication token

### Outputs

* Dataset stored at inputs/cherry_leaves_data

### Insights | Conclusions
* The dataset consists of image data.
* The dataset is balanced: each of the two classes contains 2104 images.
* The dataset was clean: the data cleaning function found no non-image files.


---

## Import packages

In [1]:
import numpy
import os

## Change working directory

We will change the working directory from its current folder to its parent folder. First, os.getcwd() returns the current working directory.



In [2]:
working_dir = os.getcwd()
working_dir

'/workspace/showMeMildew/jupiter_notebooks'

Here we make the parent of the current directory the new working directory.
* os.path.dirname() returns the parent directory, which we pass as the argument to os.chdir()



In [3]:
os.chdir(os.path.dirname(working_dir))

Confirm that the working directory has changed.

In [4]:
working_dir = os.getcwd()
working_dir

'/workspace/showMeMildew'
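
Running cell In [3] a second time moves up one more level, which would leave the project folder. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupiter_notebooks` (as in the path printed above); `project_root` is a hypothetical helper, not part of the original notebook.

```python
import os

def project_root(current_dir, notebooks_dirname="jupiter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_dirname:
        return os.path.dirname(normalized)
    # Already outside the notebooks folder: stay where we are.
    return normalized

# Safe to run repeatedly: a second call leaves the directory unchanged.
os.chdir(project_root(os.getcwd()))
```

The folder name is passed as a parameter so the same guard can be reused if the notebooks folder is renamed.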

## Fetch data from kaggle

Install the kaggle package.

In [5]:

! pip install kaggle


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


We will set the Kaggle configuration directory to the current working directory and restrict the permissions of the kaggle.json authentication file so that the token is recognized in this session.

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
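
The same two steps can also be done in pure Python, which makes a missing token fail with a clear message. This is a sketch only: `prepare_kaggle_token` is a hypothetical helper, and it assumes kaggle.json has been placed in the directory passed in.

```python
import os
import stat

def prepare_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected the Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600`: read and write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would then stand in for the cell above.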

* Define the dataset path. The path is the text that follows [https://kaggle.com/datasets/](https://kaggle.com/datasets/) in the dataset URL
* Define the destination folder

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_data"
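
As an illustration of the rule above, the dataset path can be derived from the full dataset URL. This is a minimal sketch: `dataset_path_from_url` is a hypothetical helper, and it assumes a plain URL with no query string.

```python
def dataset_path_from_url(url):
    """Return the part of a Kaggle dataset URL that follows '/datasets/'."""
    marker = "/datasets/"
    if marker not in url:
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Keep everything after the marker and drop any trailing slash.
    return url.split(marker, 1)[1].strip("/")

print(dataset_path_from_url("https://kaggle.com/datasets/codeinstitute/cherry-leaves"))
```

For the URL shown, this prints `codeinstitute/cherry-leaves`, the value assigned to `KaggleDatasetPath`.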

Download the dataset from Kaggle

In [8]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Downloading cherry-leaves.zip to inputs/cherry_leaves_data
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:01<00:00, 41.1MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 30.1MB/s]


Unzip the downloaded file, then delete the zip archive

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
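
If the download was interrupted, the archive may be missing or incomplete. The following sketch wraps the same two steps in a function that checks the archive first; `extract_and_remove` is a hypothetical helper, not part of the original notebook.

```python
import os
import zipfile

def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    # is_zipfile returns False for a missing file as well as for a corrupt one.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    # Only reached when extraction succeeded.
    os.remove(zip_path)
```

With the variables defined earlier, the call would be `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.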

---

## Data Preparation

---

## Data Cleaning

In [10]:
import tensorflow as tf

2023-04-03 17:01:10.158999: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-04-03 17:01:10.159039: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


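Find and delete any non-image files. A file is treated as an image only if the bytes "JFIF" appear near the start of the file, so the check recognizes JFIF-encoded JPEG files.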

In [11]:
def data_cleaning(my_data_dir):
    '''
    Assess every file in each class folder of the directory
    provided in the parameter and delete any file that does
    not carry a JFIF (JPEG) marker near the start of the file.
    '''
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        files = os.listdir(folder_path)

        total_non_image_files = 0
        total_image_files = 0
        for given_file in files:
            file_location = os.path.join(folder_path, given_file)
            # the context manager closes the file even if reading fails
            with open(file_location, "rb") as fobj:
                is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

            if not is_jfif:
                os.remove(file_location)
                total_non_image_files += 1
            else:
                total_image_files += 1
        print(f"Folder: {folder} - has total image file of ",total_image_files)
        print(f"Folder: {folder} - total non-image file deleted is ",total_non_image_files)

In [12]:
data_cleaning('inputs/cherry_leaves_data/cherry-leaves/')

Folder: healthy - has total image file of  2104
Folder: healthy - total non-image file deleted is  0
Folder: powdery_mildew - has total image file of  2104
Folder: powdery_mildew - total non-image file deleted is  0
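
Because `data_cleaning` deletes files, a dry run that only lists suspicious files can be useful before running it. The sketch below is an alternative check, not the notebook's method: it compares the first bytes of each file against the standard JPEG and PNG signatures and removes nothing. `looks_like_image` and `find_non_images` are hypothetical helpers, and the sketch assumes the pre-split layout of one sub-folder per class.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"      # start-of-image marker shared by JFIF and Exif JPEGs
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG signature

def looks_like_image(header):
    """Return True if the leading bytes match a JPEG or PNG signature."""
    return header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)

def find_non_images(data_dir):
    """List files under data_dir/<label>/ whose header is not JPEG or PNG. Deletes nothing."""
    suspects = []
    for label in sorted(os.listdir(data_dir)):
        label_dir = os.path.join(data_dir, label)
        if not os.path.isdir(label_dir):
            continue
        for name in sorted(os.listdir(label_dir)):
            path = os.path.join(label_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as fobj:
                if not looks_like_image(fobj.read(8)):
                    suspects.append(path)
    return suspects
```

An empty list from `find_non_images('inputs/cherry_leaves_data/cherry-leaves/')` would agree with the zero deletions reported above.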


## Split data set into train, validation and test sets

In [13]:
labels = os.listdir("inputs/cherry_leaves_data/cherry-leaves/")
labels

['healthy', 'powdery_mildew']

In [14]:
import shutil
import random

def split_dataset(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  '''Source code from
  [Code Institute - data preparation](https://github.com/GyanShashwat1611/WalkthroughProject01)

  Divides the dataset into train, validation and test sets using the ratios
  provided in the parameters. Each split is placed in its own folder, which
  contains one sub-folder per class (healthy and powdery mildew).
  '''
  # compare with a tolerance because float ratios may not sum to exactly 1.0
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

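  # get the class labels
  labels = os.listdir(my_data_dir) # expected to contain only the class folder names
  if 'test' in labels:
    # the dataset has already been split, so there is nothing to do
    pass
  else: 
    # create train, validation and test folders, each with a sub-folder per class label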
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.seed(110)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
  

We will split the dataset using the following ratios:

* 0.7 for the train dataset
* 0.1 for the validation dataset
* 0.2 for the test dataset


In [15]:
split_dataset(my_data_dir = "inputs/cherry_leaves_data/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )
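
To see how many files each split should receive, the arithmetic inside `split_dataset` can be reproduced on its own. This is a sketch: `split_sizes` is a hypothetical helper that mirrors the truncation with `int()` and gives the remaining files to the test set, using the per-class count of 2104 reported by the cleaning step.

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Mirror split_dataset: truncate train and validation, give the rest to test."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty

print(split_sizes(2104, 0.7, 0.1))
```

For each class this gives 1472 train, 210 validation and 422 test images. Because the remainder goes to the test set, its share is slightly above 0.2.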

---

## Next step

* Data Visualization