# **Data Collection**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs

* Kaggle JSON file - authentication token. 

## Outputs

* Generate Dataset: inputs/cherryleaves_database/cherry-leaves.

## Additional Comments

* No comments. 



---

# Import packages & change the working directory

In [1]:
import numpy
import os


current_dir = os.getcwd()
current_dir

'/workspace/my-fifth-project/jupyter_notebooks'

In [2]:
os.chdir('/workspace/my-fifth-project')
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/my-fifth-project'
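
The hard-coded path above only works inside this particular workspace. A minimal sketch of a more portable alternative, assuming the notebook always sits one level below the project root (the helper name `project_root_from` is hypothetical and not part of the original notebook):

```python
import os


def project_root_from(notebook_dir):
    """Return the parent directory of the given notebook directory."""
    return os.path.dirname(os.path.abspath(notebook_dir))


# Possible use in the notebook, instead of a hard-coded path:
# os.chdir(project_root_from(os.getcwd()))
```

This keeps the cell valid if the project folder is renamed or cloned elsewhere, as long as the folder layout stays the same.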

# Install Kaggle

In [4]:
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.12.tar.gz (79 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting bleach (from kaggle)
  Downloading bleach-6.1.0-py3-none-any.whl.metadata (30 kB)
Collecting webencodings (from bleach->kaggle)
  Downloading webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading bleach-6.1.0-py3-none-any.whl (162 kB)

* Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON:

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
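
The cell above assumes `kaggle.json` is already in the working directory. A minimal sketch of the same step in plain Python, which fails early if the token is missing; the helper name `prepare_kaggle_token` is hypothetical, and `os.chmod` with owner read/write bits is used here as the equivalent of `chmod 600` on POSIX systems:

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to owner read/write."""
    token_path = os.path.join(config_dir, 'kaggle.json')
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    os.environ['KAGGLE_CONFIG_DIR'] = config_dir
    # Owner read/write only, the same permission bits as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Possible use in the notebook:
# prepare_kaggle_token(os.getcwd())
```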

* Set the Kaggle dataset path and the destination folder, then download the dataset.

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherryleaves_database"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherryleaves_database
 93%|███████████████████████████████████▏  | 51.0M/55.0M [00:02<00:00, 28.6MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 27.3MB/s]


* Unzip the downloaded file, then delete the zip file.

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
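
If the download is interrupted, the cell above can fail partway through extraction. A minimal sketch that checks the archive before extracting and only deletes it afterwards; the helper name `extract_and_remove` is hypothetical, and the integrity check via `ZipFile.testzip()` is an added assumption rather than something the original notebook does:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
    # Only reached when extraction succeeded, so a failed run keeps the archive
    os.remove(zip_path)


# Possible use in the notebook:
# extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```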

---

# Data Preparation

### Data Cleaning

* Check for and remove non-image files.

In [11]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension from each class folder and print the counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)

        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

* Run remove_non_image_file() and review the per-folder summary.

In [16]:
remove_non_image_file(my_data_dir='inputs/cherryleaves_database/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
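
Because the cleaning function deletes files in place, it can be useful to preview what it would remove. A minimal read-only sketch, assuming the same three image extensions; the helper name `count_non_image_files` is hypothetical:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_non_image_files(my_data_dir):
    """Return {folder: (image_count, non_image_count)} without deleting anything."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        names = os.listdir(folder_path)
        images = sum(name.lower().endswith(IMAGE_EXTENSIONS) for name in names)
        summary[folder] = (images, len(names) - images)
    return summary


# Possible use in the notebook:
# count_non_image_files('inputs/cherryleaves_database/cherry-leaves')
```

Note that this check, like the original, looks only at file extensions; it does not verify that a file actually contains a readable image.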


### Split into train, validation, and test sets

In [17]:
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Splits images into train, validation, and test sets and moves them to respective directories.

    Args:
    - my_data_dir: Path to the directory containing the original image folders.
    - train_set_ratio: Ratio of images to be allocated to the train set.
    - validation_set_ratio: Ratio of images to be allocated to the validation set.
    - test_set_ratio: Ratio of images to be allocated to the test set.
    """

    # Check that the ratios sum to 1.0, within a small tolerance for float rounding
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get class labels from the original data directory
    labels = os.listdir(my_data_dir)

    # Create train, validation, test folders with class labels sub-folder
    if 'test' not in labels:
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        # Move images to respective sets
        for label in labels:
            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # Move file to train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)
                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # Move file to validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)
                else:
                    # Move file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)
                count += 1

            # Remove original label directory after moving files
            os.rmdir(my_data_dir + '/' + label)
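
The shuffle in the function above is unseeded, so each fresh run produces a different split. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project; the helper name `shuffled_copy` and the `seed` parameter are hypothetical additions:

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that is stable for a given seed."""
    # Sort first: os.listdir() returns names in arbitrary order
    files = sorted(file_names)
    # A dedicated Random instance avoids touching the global random state
    random.Random(seed).shuffle(files)
    return files


# Possible use inside the split function, instead of random.shuffle(files):
# files = shuffled_copy(os.listdir(my_data_dir + '/' + label))
```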

Data distribution ratio: 
* Train set 0.7
* Validation set 0.1
* Test set 0.2

In [18]:
split_train_validation_test_images(
    my_data_dir="inputs/cherryleaves_database/cherry-leaves",
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2,
)

Summary of the new data sets.

In [19]:
def count_images_in_datasets(data_dir):
    """
    Counts the number of images in each dataset (train, validation, test) and for each class (healthy, powdery_mildew).
    """
    image_counts = {}

    # Loop through train, validation, and test directories
    for subset in ['train', 'validation', 'test']:
        subset_dir = os.path.join(data_dir, subset)
        image_counts[subset] = {}

        # Loop through class labels
        for label in os.listdir(subset_dir):
            label_dir = os.path.join(subset_dir, label)

            # Count the files in this class folder
            image_counts[subset][label] = len(os.listdir(label_dir))

    return image_counts

# Path to the directory containing the train, validation, and test datasets
data_directory = "inputs/cherryleaves_database/cherry-leaves"

# Count images in datasets
image_counts = count_images_in_datasets(data_directory)

# Display image counts
for subset, counts in image_counts.items():
    print(f"{subset.capitalize()} dataset:")
    for label, count in counts.items():
        print(f"  - {label}: {count} images.")

Train dataset:
  - healthy: 1472 images.
  - powdery_mildew: 1472 images.
Validation dataset:
  - healthy: 210 images.
  - powdery_mildew: 210 images.
Test dataset:
  - healthy: 422 images.
  - powdery_mildew: 422 images.
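
The counts above follow from how the split function rounds: the train and validation sizes are truncated with `int()`, and whatever remains goes to the test set. A minimal sketch of that arithmetic for one class of 2104 images (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(total, train_ratio, validation_ratio):
    """Reproduce the per-class set sizes used by the split function."""
    train = int(total * train_ratio)            # truncated, not rounded
    validation = int(total * validation_ratio)  # truncated, not rounded
    test = total - train - validation           # remainder goes to the test set
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

As a result, the test set is slightly larger than a strict 20% share (422 instead of 420.8), because it absorbs the fractions dropped from the other two sets.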


---