# **Data Collection**

## Objectives

* Fetch the dataset from Kaggle, save it as raw data, and prepare it for the later processing steps.

## Inputs

* Kaggle JSON file (kaggle.json) - the token required for Kaggle authentication.

## Outputs

* Generate the dataset: inputs/cherry-leaves_dataset

## Additional Comments

* No additional comments here. 



---

# Import packages

In [1]:
%pip install -r /workspace/pp5-mildew-detection-in-cherry-leaves/requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

# Change working directory

* Notebooks are stored in a subfolder, so when you run the notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
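
As a quick check of how `os.path.dirname()` behaves, the sketch below applies it to a path string without changing any directory. The sample path is the one printed above; the trailing-slash variant is included only as an illustration of a common pitfall.

```python
import os

# Path printed by os.getcwd() in the cell above
notebook_dir = "/workspace/pp5-mildew-detection-in-cherry-leaves/jupyter_notebooks"

# dirname() is a pure string operation: it drops the last path component
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)

# Caveat: with a trailing slash, only the slash is dropped,
# so the result is the same directory rather than its parent
same_dir = os.path.dirname(notebook_dir + "/")
print(same_dir == notebook_dir)
```

Since `os.getcwd()` returns a path without a trailing slash, the caveat does not affect the cells in this notebook.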

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-mildew-detection-in-cherry-leaves'

---

# Install Kaggle

In [15]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Point the Kaggle configuration directory at the current working directory and restrict the permissions on the Kaggle authentication JSON (kaggle.json) so that only its owner can read and write it.

In [16]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
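
The same setup can also be done in pure Python, which has the advantage of failing early when the token is missing. This is a minimal sketch: `configure_kaggle_token` is a hypothetical helper, not part of the Kaggle package, and it assumes the token file is named `kaggle.json` and sits directly in the given directory.

```python
import os
import stat


def configure_kaggle_token(config_dir):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper for illustration; assumes the token is config_dir/kaggle.json.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")

    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Same effect as `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `configure_kaggle_token(os.getcwd())` would mirror what the cell above does.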

* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves).
* Set your destination folder.

Set the Kaggle dataset path and download it into the destination folder.

In [17]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry-leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:02<00:00, 38.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 27.4MB/s]


Unzip the downloaded file and delete the zip file.

In [18]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
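
To confirm that the extraction produced the expected folders, a short inspection step can help. The sketch below is an illustration only: `count_files_per_folder` is a hypothetical helper, and since the layout of the extracted archive is not shown in this notebook, the function simply walks whatever is under the given folder.

```python
import os


def count_files_per_folder(root_dir):
    """Return {relative folder path: number of files directly inside it}.

    Hypothetical helper for illustration; makes no assumption about folder names.
    """
    counts = {}
    for current_dir, _subdirs, files in os.walk(root_dir):
        relative = os.path.relpath(current_dir, root_dir)
        counts[relative] = len(files)
    return counts
```

For example, `count_files_per_folder(DestinationFolder)` would return one entry per folder, with the destination folder itself listed as `"."`.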

---

# Data Preparation

## Data cleaning

### Check and remove non-image files

In [None]:
def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg image from each subfolder of my_data_dir."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = os.path.join(my_data_dir, folder, given_file)
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - image files: {image_count}")
        print(f"Folder: {folder} - non-image files removed: {non_image_count}")
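
Because `remove_non_image_file` deletes files, it can be useful to preview what would be removed first. The following dry-run sketch lists non-image files without deleting anything. `find_non_image_files` is a hypothetical helper; it assumes the same extensions as the function above and that the data directory contains one subfolder per group of images.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')  # same extensions as the function above


def find_non_image_files(my_data_dir):
    """Return the paths of files that are not images, without deleting them.

    Hypothetical dry-run helper for illustration only.
    """
    non_image_files = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for given_file in sorted(os.listdir(folder_path)):
            if not given_file.lower().endswith(IMAGE_EXTENSIONS):
                non_image_files.append(os.path.join(folder_path, given_file))
    return non_image_files
```

Pass it the same directory you intend to give to `remove_non_image_file`; an empty list means there is nothing to clean up.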