# **Data Collection - Fetch Dataset from Kaggle**

## Objectives
* Fetch the dataset from Kaggle and save it as raw data, ready for further processing.

## Inputs
* Kaggle JSON file - the authentication token 

## Outputs
* Generated dataset: inputs/mildew_dataset/cherry-leaves

## Additional Comments
* The client provided the data under an NDA (non-disclosure agreement), so it should only be shared with professionals who are officially involved in the project.



## Import packages

In [5]:
! pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt

Collecting typing-extensions~=3.7.4 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting wrapt~=1.12.1 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Downloading wrapt-1.12.1.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting gast==0.4.0 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting tensorboard~=2.6 (from tensorflow-cpu==2.6.0->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 10))
  Downloading tensorboard-2.13.0-py3-none-any.whl (5.6 MB)
     5.6/5.6 MB 18.3 MB/s eta 0:00:00
Collecting tensorflow-estimator~=2.6 (from tensorflow-cpu==2

In [6]:
import numpy
import os

## Change Working Directory

* Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

**To make the parent of the current directory the new current directory**
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


**To confirm the new current directory**

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves'
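
Re-running the `os.chdir` cell moves up one more level each time it is executed. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as the path printed earlier suggests). The helper name `project_root_from` is hypothetical and not part of the notebook:

```python
import os


def project_root_from(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only if current_dir is the notebooks folder.

    Hypothetical helper: any other path is returned unchanged, so calling it
    twice does not climb above the project root.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return normalized


# Usage: os.chdir(project_root_from(os.getcwd()))
```

With this guard, the working directory changes only on the first run from inside the notebooks folder.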

# Install Kaggle

In [4]:
# install kaggle package
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.13.tar.gz (63 kB)
     63.3/63.3 kB 1.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
     77.1/77.1 kB 4.7 MB/s eta 0:00:00
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     78.2/78.2 kB 12.9 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.13-py3-none-any.whl size=77717 sha256=7873d8ec409b993e8

Run the cell below to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication JSON file.

In [12]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
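
The same setup can also be done without a shell command. A minimal sketch, assuming `kaggle.json` sits in the directory passed in. The helper name `configure_kaggle` is hypothetical; `KAGGLE_CONFIG_DIR` is the environment variable set in the cell above, and the mode applied matches `chmod 600`:

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Same effect as `chmod 600 kaggle.json`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Usage: configure_kaggle(os.getcwd())
```

Failing early when the token is missing gives a clearer error than a later authentication failure from the download command.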

* Get the dataset path from the Kaggle URL. When viewing the dataset on Kaggle, copy the part of the URL that follows https://www.kaggle.com/ (or, in some cases, https://www.kaggle.com/datasets/) and assign it to KaggleDatasetPath.

* Set the destination folder
* Set the Kaggle Dataset and Download it

In [13]:
# Set the destination folder to download Kaggle Dataset
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/mildew_dataset
 84%|████████████████████████████████▌      | 46.0M/55.0M [00:00<00:00, 168MB/s]
100%|███████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 144MB/s]
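
The rule for deriving `KaggleDatasetPath` from the dataset URL can be sketched in a few lines. This is an illustration only: the helper name `kaggle_dataset_path` is hypothetical, and it assumes the URL has the `owner/dataset` pair directly after the domain or after a `datasets/` segment, as described above.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Extract '<owner>/<dataset>' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Drop the optional leading 'datasets' segment
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find an owner/dataset pair in {url!r}")
    return "/".join(parts[:2])


print(kaggle_dataset_path("https://www.kaggle.com/datasets/codeinstitute/cherry-leaves"))
# prints: codeinstitute/cherry-leaves
```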


* Unzip the downloaded file, and delete the zip file.

In [14]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
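
A slightly more defensive variant of the unzip step checks the archive before deleting it. A minimal sketch, assuming the archive may be corrupt after an interrupted download; the helper name `extract_and_remove` is hypothetical:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns None when every member passes its CRC check
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    # Only reached if the check and the extraction both succeeded
    os.remove(zip_path)
    return extracted


# Usage: extract_and_remove("inputs/mildew_dataset/cherry-leaves.zip", "inputs/mildew_dataset")
```

Because the archive is removed only after a successful extraction, a failed run leaves the zip file in place for another attempt.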

# Data Preparation

### Check and remove non-image files

In [15]:
# Function to remove non-image files
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [16]:
# Remove non-image files
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
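
Since the clean-up deletes files, it can help to inspect the folders first. A minimal dry-run sketch that only counts file extensions per class folder and deletes nothing; the helper name `count_extensions` is hypothetical:

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Count files per lower-cased extension in each class folder (no deletion)."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary


# Usage: count_extensions("inputs/mildew_dataset/cherry-leaves")
```

Any extension other than `.png`, `.jpg`, or `.jpeg` in the result marks files that the removal function would delete.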


### Split train, validation, and test sets

In [17]:
import os
import shutil
import random

my_data_dir = "inputs/mildew_dataset/cherry-leaves"


# Function to split train, validation, and test images
def split_train_validation_test_images(my_data_dir, train_set_ratio,
                                       validation_set_ratio, test_set_ratio):
    # compare with a small tolerance, because a sum of floats is not
    # guaranteed to equal 1.0 exactly
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels (the sub-folder names)
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # the dataset has already been split, so there is nothing to do
        return

    # create train, validation, and test folders with a sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target = 'validation'
            else:
                # every remaining file goes to the test set
                target = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target, label, file_name))

        # remove the now-empty class folder
        os.rmdir(label_dir)
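
The ratio check at the top of the function relies on a sum of floats, and such sums are not always exact. A minimal sketch of a tolerance-based check using `math.isclose`; the helper name `ratios_sum_to_one` and the tolerance `1e-9` are assumptions chosen for illustration:

```python
import math


def ratios_sum_to_one(train_set_ratio, validation_set_ratio, test_set_ratio):
    """Return True when the three ratios add up to 1.0 within a small tolerance."""
    total = train_set_ratio + validation_set_ratio + test_set_ratio
    return math.isclose(total, 1.0, abs_tol=1e-9)


print(0.1 + 0.2 == 0.3)                  # prints: False
print(ratios_sum_to_one(0.7, 0.1, 0.2))  # prints: True
print(ratios_sum_to_one(0.7, 0.2, 0.2))  # prints: False
```

The first line shows why exact equality is fragile: `0.1 + 0.2` is not stored as exactly `0.3`.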

In [18]:
# Split train, validation, and test images
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
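
The expected size of each set follows from the arithmetic inside the split function: the train and validation counts are truncated with `int()`, and the test set receives the remainder. A small sketch of that calculation for the 2104 images per class reported earlier; the helper name `split_sizes` is hypothetical:

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Mirror the split function's arithmetic for one class folder."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    # whatever is left after train and validation goes to the test set
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))
# prints: (1472, 210, 422)
```

Because of the truncation, the test set can end up slightly larger than its nominal ratio: here 422 files rather than 420.8.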

## Conclusions and Next Steps
### Conclusions
* The mildew_dataset has been successfully added to the inputs folder
* The mildew_dataset has been successfully split into train, validation, and test sets

### Next Steps
* Answer business requirement 1:
    * The client is interested in conducting a study to visually differentiate a cherry leaf that is healthy from one that contains powdery mildew

# Push files to Repo
## Push generated/new files from this Session to GitHub repository

### .gitignore

In [19]:
!cat .gitignore

core.Microsoft*
core.mongo*
core.python*
env.py
__pycache__/
*.py[cod]
node_modules/
.github/
cloudinary_python.txt
kaggle.json
inputs/mildew_dataset/cherry-leaves/test
inputs/mildew_dataset/cherry-leaves/train
inputs/mildew_dataset/cherry-leaves/validation
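
Because the data is under an NDA and `kaggle.json` holds a credential, it can be worth confirming that these entries are listed before committing. A minimal sketch that only looks for literal lines in the file and does not evaluate full gitignore pattern rules; the helper name `missing_gitignore_entries` is hypothetical:

```python
def missing_gitignore_entries(gitignore_text, required_entries):
    """Return the required entries that do not appear as a line in .gitignore."""
    listed = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    }
    return [entry for entry in required_entries if entry not in listed]


sample = "env.py\nkaggle.json\ninputs/mildew_dataset/cherry-leaves/test\n"
print(missing_gitignore_entries(sample, ["kaggle.json", "inputs/mildew_dataset/cherry-leaves/train"]))
# prints: ['inputs/mildew_dataset/cherry-leaves/train']
```

In the notebook this could be called with the contents of `.gitignore` read from disk; an empty list means every required entry is present.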

### Git status

In [20]:
!git status

ry_mildew/0d20bb6d-798a-4c82-8aea-8d64bd1a086b___FREC_Pwd.M 4761_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/112567fd-5046-4328-80f1-f33e01f76cbf___FREC_Pwd.M 5038_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/11486ca3-39fe-4ff3-8474-de8d579ededf___FREC_Pwd.M 0359_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/119ee0ba-5aec-455d-9ce7-cfc7dae1b39d___FREC_Pwd.M 4542.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/13cd1180-f191-4603-9c1a-02ebec65f2c1___FREC_Pwd.M 4561_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/167d6c02-d49c-4571-b9b8-6d77be09dfb4___FREC_Pwd.M 0256_flipLR.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/17079025-3a59-46d4-83f2-ccff1c099bee___FREC_Pwd.M 0513.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/powdery_mil

### Git add

In [21]:
!git add .

### Git commit

In [22]:
!git commit -am "Add data collection"

a03f-3c6e-4977-96b7-6460433168b8___FREC_Pwd.M 5137_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/0ceb54ca-c9c1-48fb-9b8c-eb8afaefc378___FREC_Pwd.M 4698_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/0d20bb6d-798a-4c82-8aea-8d64bd1a086b___FREC_Pwd.M 4761_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/112567fd-5046-4328-80f1-f33e01f76cbf___FREC_Pwd.M 5038_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/11486ca3-39fe-4ff3-8474-de8d579ededf___FREC_Pwd.M 0359_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/119ee0ba-5aec-455d-9ce7-cfc7dae1b39d___FREC_Pwd.M 4542.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/13cd1180-f191-4603-9c1a-02ebec65f2c1___FREC_Pwd.M 4561_flipLR.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/powdery_mildew/167d6c02-d

### Git Push

In [23]:
! git push

Enumerating objects: 19, done.
Counting objects: 100% (19/19), done.
Delta compression using up to 4 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (10/10), 13.12 KiB | 2.62 MiB/s, done.
Total 10 (delta 4), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To https://github.com/HumaIlyas/mildew-detection-in-cherry-leaves
   6c305ff..019013f  main -> main
