# **Data Collection - Fetch Dataset from Kaggle**

## Objectives
* Fetch data from Kaggle and save as raw data to prepare it for further processes.

## Inputs
* Kaggle JSON file - the authentication token 

## Outputs
* Generate Dataset: inputs/mildew_dataset/cherry-leaves 

## Additional Comments
* The client provided the data under an NDA (non-disclosure agreement), so the data should only be shared with professionals who are officially involved in the project.



## Import packages

In [1]:
! pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt

ent already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/codeany/.pyenv/versions/3.8.12/lib/python3.8/site-packages (from matplotlib==3.3.1->-r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt (line 3)) (3.1.0)


In [2]:
import numpy
import os

## Change Working Directory

* Change the working directory from the current folder to its parent folder.
* Get the current directory with os.getcwd().

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

**To make the parent of the current directory the new current directory**
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


**To confirm the new current directory**

In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves'
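The parent-folder lookup used above can also be expressed with `pathlib`. The following is only a minimal sketch; the helper name `parent_of` is hypothetical and not part of the notebook, and it assumes POSIX-style paths like the ones shown in the outputs.

```python
from pathlib import PurePosixPath


def parent_of(path):
    """Return the parent directory of a POSIX-style path as a string.

    Hypothetical helper, shown only to illustrate what os.path.dirname() does.
    """
    return str(PurePosixPath(path).parent)


# Notebook folder reported by os.getcwd() in the cell above
notebook_dir = "/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks"
print(parent_of(notebook_dir))
```

The result of `parent_of(notebook_dir)` is the value that would be passed to `os.chdir()`.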

# Install Kaggle

In [6]:
# install kaggle package
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Run the cell below to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication JSON file.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
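The same configuration can be done in pure Python, which also allows a check that the token file is actually present before anything else runs. This is a sketch under assumptions: the function name `configure_kaggle_token` is hypothetical, and the permission call assumes a POSIX file system.

```python
import os
import stat


def configure_kaggle_token(config_dir, token_name="kaggle.json"):
    """Point the Kaggle client at config_dir and restrict the token to its owner.

    Hypothetical helper; raises FileNotFoundError if the token is missing.
    """
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent of `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `configure_kaggle_token(os.getcwd())` would mirror the cell above, with an explicit error if `kaggle.json` has not been uploaded yet.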

* Get the dataset path from the Kaggle URL. When viewing the dataset on Kaggle, copy the part of the URL that follows https://www.kaggle.com/ (in some cases kaggle.com/datasets/) and assign it to KaggleDatasetPath.

* Set the destination folder
* Set the Kaggle Dataset and Download it

In [8]:
# Set the Kaggle dataset path and the destination folder, then download the dataset
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/mildew_dataset
 75%|████████████████████████████▎         | 41.0M/55.0M [00:00<00:00, 52.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 66.2MB/s]
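The rule for deriving `KaggleDatasetPath` from the dataset URL can be sketched as a small function. The helper name `kaggle_dataset_path` is hypothetical, and the sketch assumes the URL has the shape described above: an owner and a dataset name, optionally preceded by a `datasets` segment.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Extract '<owner>/<dataset>' from a Kaggle dataset URL (hypothetical helper)."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Some URLs carry a leading 'datasets' segment; drop it if present
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find owner/dataset in: {url}")
    return "/".join(parts[:2])


print(kaggle_dataset_path("https://www.kaggle.com/datasets/codeinstitute/cherry-leaves"))
```

Both URL forms resolve to the same `codeinstitute/cherry-leaves` value used in the download cell.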


* Unzip the downloaded file, and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
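The unzip-and-delete step can be wrapped in a function that first checks the archive is valid, so a failed or partial download is reported before anything is removed. This is a minimal sketch; `unzip_and_remove` is a hypothetical name, not part of the notebook.

```python
import os
import zipfile


def unzip_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, and return member names.

    Hypothetical helper; raises ValueError if zip_path is missing or not a zip file.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Only reached if extraction succeeded
    os.remove(zip_path)
    return names
```

In this notebook the call would be `unzip_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.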

# Data Preparation

### Check and remove non-image files

In [10]:
# Function to remove non-image files
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0      # files with an image extension
        non_image_count = 0  # files removed because they are not images
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [11]:
# Remove non-image files
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0


### Split train, validation, and test sets

In [12]:
import os
import shutil
import random
import joblib

my_data_dir="inputs/mildew_dataset/cherry-leaves"

# Function to split train, validation, and test images
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
  # Compare with a tolerance: a floating-point sum can miss 1.0 by a rounding error
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # gets classes labels
  labels = os.listdir(my_data_dir) # To get only the folder name
  if 'test' in labels:
    pass
  else: 
    # create train, validation, and test folders with class label sub-folders
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)

In [13]:
# Split train, validation, and test images
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )
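The split function truncates the train and validation counts with `int()` and sends every remaining file to the test set. The arithmetic can be sketched on its own; `split_counts` is a hypothetical helper, and the total of 2104 files per class is taken from the output of the non-image check.

```python
def split_counts(total, train_ratio, validation_ratio):
    """Return (train, validation, test) counts using the same rounding as the split function."""
    train = int(total * train_ratio)            # truncated, as in the notebook
    validation = int(total * validation_ratio)  # truncated, as in the notebook
    test = total - train - validation           # the test set receives the remainder
    return train, validation, test


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

With these ratios the test set ends up slightly larger than 20% of each class, because it absorbs the files dropped by the two truncations.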

# Conclusions and Next Steps
## Conclusions
* The mildew_dataset has been successfully added to the inputs folder
* The mildew_dataset has been successfully split into train, validation, and test sets
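One way to confirm the split is to count the files in each set and class folder. This is a sketch only; `count_split_files` is a hypothetical helper and it assumes the `train`, `validation`, and `test` folders created by the split function already exist.

```python
import os


def count_split_files(data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for a dataset that is already split."""
    summary = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        summary[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return summary
```

For example, `count_split_files("inputs/mildew_dataset/cherry-leaves")` returns a nested dictionary whose per-class counts should add up to the number of images found before the split.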

## Next Steps
* Answer business requirement 1:
    * The client is interested in conducting a study to visually differentiate a cherry leaf that is healthy from one that contains powdery mildew

# Push files to Repo
## Push generated/new files from this Session to GitHub repository

### .gitignore

In [14]:
!cat .gitignore

core.Microsoft*
core.mongo*
core.python*
env.py
__pycache__/
*.py[cod]
node_modules/
.github/
cloudinary_python.txt
kaggle.json
inputs/mildew_dataset/cherry-leaves/train
inputs/mildew_dataset/cherry-leaves/validation
inputs/mildew_dataset/cherry-leaves/test

### Git status

In [15]:
!git status

in/healthy/93c3cace-b855-4282-917b-8dcd30e8c1d7___JR_HL 9702_flipTB.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/94070a24-1b16-4696-b051-b01d4ca0b59a___JR_HL 4022_flipTB.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/97df0152-e34f-4b34-bb2e-b2d9aadb19ee___JR_HL 4209.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/991f9446-4cb6-41a8-b6ef-c2ee9e500c7b___JR_HL 9735_180deg.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/9b15b47c-53e0-4873-b51c-d2b6e316d56c___JR_HL 4180_flipTB.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/9d9363f1-b9cc-4c44-8225-aa6734763c3c___JR_HL 9579_flipTB.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/9f57292a-576b-4ae8-a71a-cdaf42bcd6b7___JR_HL 9861_180deg.JPG
	deleted:    inputs/mildew_dataset/cherry-leaves/train/healthy/9fcc3c88-5083-43a8-a65e-3ac55558ca1d___JR_HL 9633_180deg.JPG
	de

### Git add

In [16]:
!git add .

### Git commit

In [17]:
!git commit -am "Add data collection"

te mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/92cd65f0-9866-4985-b706-733e8c53ed83___JR_HL 4007_180deg.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/9345e6d2-0aeb-4e09-a8ba-d5c64bda1fa5___JR_HL 4034_180deg.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/9362fc4a-ac87-489d-8572-f15335af1387___JR_HL 9875_flipTB.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/93c3cace-b855-4282-917b-8dcd30e8c1d7___JR_HL 9702_flipTB.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/94070a24-1b16-4696-b051-b01d4ca0b59a___JR_HL 4022_flipTB.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/97df0152-e34f-4b34-bb2e-b2d9aadb19ee___JR_HL 4209.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/991f9446-4cb6-41a8-b6ef-c2ee9e500c7b___JR_HL 9735_180deg.JPG
 delete mode 100644 inputs/mildew_dataset/cherry-leaves/train/healthy/9b15b47c-53e0-4873-b51c-d

### Git Push

In [18]:
! git push

Enumerating objects: 22, done.
Counting objects: 100% (22/22), done.
Delta compression using up to 4 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (12/12), 3.65 MiB | 6.51 MiB/s, done.
Total 12 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5), completed with 5 local objects.
To https://github.com/HumaIlyas/mildew-detection-in-cherry-leaves
   3116d69..643d75d  main -> main
