# **Data Collection - Fetch Dataset from Kaggle**

## Objectives
* Fetch data from Kaggle and save it as raw data, ready for further processing.

## Inputs
* Kaggle JSON file - the authentication token 

## Outputs
* Generate Dataset: inputs/mildew_dataset/cherry-leaves 

## Additional Comments
* No additional comments



## Import packages

In [1]:
! pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt

workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 47))
  Downloading seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting six==1.15.0 (from -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 48))
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting streamlit==0.85.0 (from -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 49))
  Downloading streamlit-0.85.0-py2.py3-none-any.whl (7.9 MB)
Collecting tensorboard==2.11.0 (from -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 50))
  Downloading tensorboard-2.11.0-py3-none-any.whl (6.0 MB)

In [2]:
import numpy
import os

## Change Working Directory

* Change the working directory from the current folder to its parent folder
* Get the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

**To make the parent of the current directory the new current directory**
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


**To confirm the new current directory**

In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves'
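Re-running the `os.chdir` cell above moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as the first output above shows); the helper name `project_root_from` is hypothetical and not part of the original notebook:

```python
import os

def project_root_from(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    cwd = os.path.normpath(cwd)
    if os.path.basename(cwd) == notebooks_dirname:
        return os.path.dirname(cwd)
    # Already at (or outside) the project root: leave the path unchanged
    return cwd

# Usage: os.chdir(project_root_from(os.getcwd()))
```

With this guard the cell can be run any number of times without leaving the project root.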

# Install Kaggle

In [6]:
# install kaggle package
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Run the cell below to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication JSON file.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
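The same configuration can be done in pure Python, which also gives a clear error when the token is missing. This is a sketch under the assumption that `kaggle.json` sits in the chosen configuration directory; the function name `configure_kaggle_token` is hypothetical:

```python
import os
import stat

def configure_kaggle_token(config_dir, token_name="kaggle.json"):
    """Point the Kaggle client at config_dir and restrict the token to its owner."""
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent of `chmod 600`: read/write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path

# Usage: configure_kaggle_token(os.getcwd())
```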

* Get the dataset path from the Kaggle URL. When viewing the dataset on Kaggle, note the part of the URL after https://www.kaggle.com/ (in some cases after kaggle.com/datasets/). Copy that value into KaggleDatasetPath.

* Set the destination folder
* Set the Kaggle Dataset and Download it
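The dataset path described in the first step can also be derived from the URL programmatically. A minimal sketch, assuming the URL has the form `https://www.kaggle.com/<owner>/<dataset>` or `https://www.kaggle.com/datasets/<owner>/<dataset>`; the helper `kaggle_dataset_path` is hypothetical:

```python
from urllib.parse import urlparse

def kaggle_dataset_path(url):
    """Return the 'owner/dataset' slug from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Some URLs carry a leading 'datasets' segment; drop it
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"No owner/dataset slug found in {url!r}")
    return "/".join(parts[:2])

# Usage: KaggleDatasetPath = kaggle_dataset_path(dataset_url)
```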

In [19]:
# Set the destination folder to download Kaggle Dataset
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/mildew_dataset
 87%|██████████████████████████████████     | 48.0M/55.0M [00:00<00:00, 175MB/s]
100%|███████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 143MB/s]


* Unzip the downloaded file, and delete the zip file.

In [20]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
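Because the cell above deletes the archive, running it a second time raises `FileNotFoundError`. A sketch of a re-runnable variant, assuming it is acceptable to skip extraction silently when the archive is already gone; `extract_and_remove` is a hypothetical helper name:

```python
import os
import zipfile

def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    if not os.path.isfile(zip_path):
        # Nothing to do, e.g. the cell was already run once
        return []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return names

# Usage: extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```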

# Data Preparation

### Check and remove non-image files

In [21]:
# Function to remove non-image files
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0      # files kept
        non_image_count = 0  # files removed
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [22]:
# Remove non-image files
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


### Split train, validation, and test sets

In [23]:
import os
import shutil
import random
import joblib

my_data_dir="inputs/mildew_dataset/cherry-leaves"

# Function to split train, validation, and test images
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
  # compare with a tolerance: a floating-point sum of ratios may not be exactly 1.0
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # get the class labels
  labels = os.listdir(my_data_dir) # each sub-folder name is a class label
  if 'test' in labels:
    pass
  else: 
    # create train, validation, and test folders with class label sub-folders
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)

In [24]:
# Split train, validation, and test images
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )
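The split function truncates the train and validation counts with `int()` and sends every remaining file to the test set. The sketch below mirrors that arithmetic for one label so the resulting folder sizes can be checked; `expected_split_sizes` is a hypothetical helper, and 2104 is the per-folder image count reported earlier in this notebook:

```python
def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of split_train_validation_test_images."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    # Whatever is left after train and validation goes to the test set
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty

print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set ends up with 422 images per label, slightly more than 20% of 2104.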

## Conclusions and Next Steps

## Conclusions
* The dataset has been successfully added to the inputs folder as mildew_dataset
* mildew_dataset has been successfully split into train, validation, and test sets

## Next Steps
* Answer business requirement 1:
    * The client is interested in conducting a study to visually differentiate a cherry leaf that is healthy from one that contains powdery mildew

---

# Push files to Repo

## Push generated/new files from this Session to GitHub repository

### .gitignore

In [25]:
!cat .gitignore

core.Microsoft*
core.mongo*
core.python*
env.py
__pycache__/
*.py[cod]
node_modules/
.github/
cloudinary_python.txt
kaggle.json
inputs/mildew_dataset/cherry-leaves/test
inputs/mildew_dataset/cherry-leaves/train
inputs/mildew_dataset/cherry-leaves/validation

### Git status

In [26]:
!git status

-4267-ade3-b457a91fbe46___FREC_Pwd.M 5088_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/525e0e62-573a-4a9c-9955-bd2f4ae39bad___FREC_Pwd.M 4531.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/52ecce7d-bcd4-4d36-b455-b0948eb02371___FREC_Pwd.M 4724.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/56a9ffe5-ea20-41de-b40f-c6970f54fab4___FREC_Pwd.M 4837_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/56b0482b-80d1-4cd8-82c9-478b8d9e296e___FREC_Pwd.M 0594.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/58d5d331-0ee1-4db8-bbab-3e0e8b7b5844___FREC_Pwd.M 0537.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_mildew/596a3002-0685-4b75-a782-1ed69092f23c___FREC_Pwd.M 4654_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/validation/powdery_milde

### Git add

In [25]:
!git add .

### Git commit

In [27]:
!git commit -am "Add data collection"

[main 9abb323] Add data collection
 1 file changed, 3 insertions(+), 3 deletions(-)


### Git push

In [28]:
! git push

Enumerating objects: 17, done.
Counting objects: 100% (17/17), done.
Delta compression using up to 4 threads
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 2.02 KiB | 2.02 MiB/s, done.
Total 11 (delta 9), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (9/9), completed with 6 local objects.
To https://github.com/HumaIlyas/mildew-detection-in-cherry-leaves
   460a5db..9abb323  main -> main
