# **Data Collection**

## Objectives

- Fetch data from Kaggle, save it as raw data and prepare it for further processing.

## Inputs

- Kaggle JSON file (kaggle.json): the authentication token required for Kaggle authentication.

## Outputs

- Generate the Dataset: inputs/cherry_leaves_dataset.

## Additional Comments

- No additional comments here.



---

# Import packages

In [35]:
! pip install -r /workspace/mildew-detector-pp5/requirements.txt



In [36]:
import numpy
import os
import random

# Change the working directory

In [37]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector-pp5'

In [38]:

os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [39]:
current_dir = os.getcwd()
current_dir

'/workspace'
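
Moving one level up only lands in the project root when the notebook starts in a subfolder of the project. As a hedged alternative, the sketch below walks up from the current directory until it finds a marker file. The helper name `find_project_root` and the choice of `requirements.txt` as the marker are assumptions made for illustration, not part of the original notebook.

```python
import os


def find_project_root(start_dir, marker="requirements.txt"):
    """Return the closest directory at or above start_dir that contains marker.

    Hypothetical helper: the marker file name is an assumption.
    Returns None if no such directory exists.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


project_root = find_project_root(os.getcwd())
if project_root is not None:
    os.chdir(project_root)
print("Working directory:", os.getcwd())
```

If no marker file is found, the working directory is left unchanged.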

# Install Kaggle

In [40]:
! pip install kaggle



Run the cell below to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication file (kaggle.json).

In [41]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
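
The `chmod` call fails when `kaggle.json` is not in the working directory. Below is a minimal sketch of the same step in Python that checks for the file first. The helper name `configure_kaggle` is hypothetical, and the sketch assumes the token is expected in the directory passed in.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper. Returns True if kaggle.json was found, else False.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; upload it before downloading.")
        return False

    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent of `chmod 600`: read and write access for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


configure_kaggle(os.getcwd())
```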



* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves).
* Set your destination folder.

Set the Kaggle dataset path and download it.

In [42]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace. Or use the environment method.
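
The error message names two ways to authenticate: the `kaggle.json` file or the "environment method", for which the Kaggle client reads the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The sketch below only reports which of the two is available before the download is attempted. The helper name `kaggle_auth_method` is hypothetical and the function does not contact Kaggle.

```python
import os


def kaggle_auth_method(config_dir):
    """Return 'file', 'environment' or None depending on the available credentials.

    Hypothetical helper; it inspects the local setup only.
    """
    if os.path.isfile(os.path.join(config_dir, "kaggle.json")):
        return "file"
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return "environment"
    return None


method = kaggle_auth_method(os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd()))
if method is None:
    print("No Kaggle credentials found; the download cell will fail.")
else:
    print(f"Kaggle credentials available via: {method}")
```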


Unzip the downloaded file, and delete the zip file.

In [44]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/cherry_leaves_dataset/cherry-leaves.zip'
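
The unzip step raises `FileNotFoundError` whenever the download did not produce the archive. A minimal sketch that checks for the archive before extracting is shown below. The helper name `unzip_and_remove` is hypothetical.

```python
import os
import zipfile


def unzip_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Hypothetical helper. Returns True on success, False if the archive is missing.
    """
    if not os.path.isfile(zip_path):
        print(f"{zip_path} not found; check that the download step succeeded.")
        return False

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)

    os.remove(zip_path)
    return True
```

In this notebook it could be called as `unzip_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.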

---

# Data Preparation

---

## Data cleaning

### Check files, remove all non-image files

In [29]:
import os


def remove_non_img_data(my_data_dir):
    """
    Remove non-image files from the given directory and all of its subdirectories.

    Deletes every file whose extension is not listed in image_ext and prints
    the count of image and non-image files for each folder that contains files.
    """
    image_ext = ('.png', '.jpg', '.jpeg')

    # os.walk descends into nested folders, so the function works on the raw
    # label folders as well as on an already split train/validation/test tree
    for folder, _, files in os.walk(my_data_dir):
        if not files:
            continue

        image_count = 0
        non_image_count = 0

        for given_file in files:
            if given_file.lower().endswith(image_ext):
                image_count += 1
            else:
                os.remove(os.path.join(folder, given_file))
                non_image_count += 1

        print(f"Folder: {folder} - has image file/s", image_count)
        print(f"Folder: {folder} - has non-image file/s", non_image_count)

In [31]:
remove_non_img_data(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')
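
Before deleting anything, it can help to see which file types are actually present. The sketch below counts files by extension across a folder and its subfolders without modifying them. The helper name `count_file_extensions` is hypothetical.

```python
import os
from collections import Counter


def count_file_extensions(data_dir):
    """Count files by lowercase extension in data_dir and all of its subdirectories.

    Hypothetical helper; files without an extension are counted under ''.
    """
    counts = Counter()
    for _, _, files in os.walk(data_dir):
        for file_name in files:
            extension = os.path.splitext(file_name)[1].lower()
            counts[extension] += 1
    return counts
```

Calling `count_file_extensions('inputs/cherry_leaves_dataset/cherry-leaves')` returns a `Counter`, so any extension other than `.png`, `.jpg` or `.jpeg` marks files that the cleaning step would remove.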

# Split Train, Validation and Test Sets


In [32]:
import os
import shutil
import random
import joblib

In [33]:
def split_train_validation_test_images(
        my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Split the images of each label folder into train, validation and test sets.

    Files are shuffled and moved into my_data_dir/<set>/<label>. Nothing is
    done if my_data_dir already contains a 'test' folder.
    """
    # Compare with a tolerance: a sum of floats can miss 1.0 by a rounding error
    total_ratio = train_set_ratio + validation_set_ratio + test_set_ratio
    if abs(total_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = os.listdir(my_data_dir)

    # The data has already been split, so there is nothing to do
    if 'test' in labels:
        return

    # Create one folder per set and label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            # The first share goes to train, the next to validation, the rest to test
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                target_set = 'validation'
            else:
                target_set = 'test'

            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target_set, label, file_name))

        # The label folder is empty once all of its files have been moved
        os.rmdir(label_dir)

The data is split into the following ratios:
* Train set: 0.70.
* Validation set: 0.10.
* Test set: 0.20.
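
To make the ratios concrete, the sketch below reproduces the arithmetic that `split_train_validation_test_images` applies to each label folder: the train and validation counts are truncated with `int()`, and the test set receives the remaining files. The helper name `split_counts` and the file count of 100 are assumptions for illustration, and `math.isclose` is used here as one tolerant way to check that the ratios sum to 1.0.

```python
import math


def split_counts(n_files, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Return (train, validation, test) file counts for one label folder.

    Hypothetical helper mirroring the truncation used by the split function.
    """
    total = train_set_ratio + validation_set_ratio + test_set_ratio
    if not math.isclose(total, 1.0):
        raise ValueError("The three ratios should sum to 1.0")

    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    # The test set receives whatever is left after truncation
    test = n_files - train - validation
    return train, validation, test


# 100 is a hypothetical number of files in one label folder
print(split_counts(100, 0.7, 0.1, 0.2))
```

For 100 files this gives 70, 10 and 20 files. For small folders the truncation matters: with 7 files the counts are 4, 0 and 3, so the validation set stays empty.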

In [34]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
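
The split cell prints nothing, so a quick count per set and label is one way to confirm the result. The sketch below assumes the `<set>/<label>` folder layout created by the split function. The helper name `count_images_per_set` is hypothetical.

```python
import os


def count_images_per_set(data_dir):
    """Return {set_name: {label: number_of_files}} for a split dataset folder.

    Hypothetical helper; it assumes the layout data_dir/<set>/<label>/<files>.
    """
    summary = {}
    for set_name in sorted(os.listdir(data_dir)):
        set_dir = os.path.join(data_dir, set_name)
        if not os.path.isdir(set_dir):
            continue
        summary[set_name] = {}
        for label in sorted(os.listdir(set_dir)):
            label_dir = os.path.join(set_dir, label)
            if os.path.isdir(label_dir):
                summary[set_name][label] = len(os.listdir(label_dir))
    return summary
```

Calling `count_images_per_set('inputs/cherry_leaves_dataset/cherry-leaves')` after the split should show roughly 70%, 10% and 20% of each label's files under `train`, `validation` and `test`.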

---

# Conclusions and Next Steps 
* The image data has been collected, cleaned and split into train, validation and test sets.
* Next step: Data Visualization.