# **Data Collection**: Mildew Detection Project

---

## Objectives

* Fetch the Cherry Leaf image dataset from Kaggle ([Link](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves)) and save it as raw data for further processing.

* Clean the data by checking it and removing all non-image files.

* Find the average image size of the cherry leaf data.

* Split the dataset into Train, Validation, and Test subsets, in line with the suggested subset ratio.

* Resize the dataset images.

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generated dataset: **inputs/cherry_leaves_dataset/cherry-leaves**
* Within this dataset, the images are resized and placed in their respective Train, Validation, and Test subsets according to the ratio.
* Average image shape pickle file (`image_shape.pkl`) in outputs.

## Additional Comments

* No comments



---

## 1.0 - Import packages

---

In [1]:
%pip install -r /workspace/Mildew-Detection-CL/requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import numpy as np
import tensorflow as tf
import joblib

## 2.0 - Change working directory

---

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-CL/jupyter_notebooks'

In [4]:
os.chdir('/workspace/Mildew-Detection-CL')
print("You set a new current directory")

You set a new current directory


In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-CL'
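
The hard-coded path above only works inside this workspace. As a minimal sketch, assuming the notebook always runs from a `jupyter_notebooks` subfolder of the project root, the root can be derived from the current directory instead. The helper name `project_root` and its `notebook_folder` parameter are hypothetical, not part of the project.

```python
import os


def project_root(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent of current_dir when current_dir is the notebook folder.

    Hypothetical helper: the folder name is an assumption taken from the
    path printed earlier in this notebook.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already outside the notebook folder, so leave the directory unchanged
    return normalized


# Usage (commented out so the sketch has no side effects):
# os.chdir(project_root(os.getcwd()))
```

Because the function returns the input unchanged when it is not inside the notebook folder, re-running the cell does not climb further up the directory tree.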

## 3.0 - Install Kaggle & Download the Dataset 

---

In [6]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Run the cell below to **change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Set the Kaggle dataset path and the destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:02<00:00, 37.5MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 26.3MB/s]


Unzip the downloaded file, and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
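
Since the zip file is deleted right after extraction, inspecting the archive first can catch a bad download early. The following is a rough sketch using only the standard library; `summarize_zip` is a hypothetical helper, not part of the project.

```python
import posixpath
import zipfile
from collections import Counter


def summarize_zip(zip_path):
    """Count the files in the archive, grouped by their parent folder."""
    counts = Counter()
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Names inside a zip archive always use forward slashes
            counts[posixpath.dirname(info.filename)] += 1
    return dict(counts)


# Usage (hypothetical): summarize_zip(DestinationFolder + '/cherry-leaves.zip')
```

Run it before the extraction cell, while the zip file still exists.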

## 4.0 - Data Cleaning and Preparation

---

### 4.1 - Check and Remove non-image files

In [10]:
def remove_non_image_files(directory):
    """Delete every file in each class folder that lacks an image extension."""
    image_extensions = ('.png', '.jpg', '.jpeg')
    
    for folder in os.listdir(directory):
        folder_path = os.path.join(directory, folder)
        
        # Skip stray files at the top level; only class folders are scanned
        if not os.path.isdir(folder_path):
            continue
        
        image_count = 0
        non_image_count = 0
        
        for file_name in os.listdir(folder_path):
            file_path = os.path.join(folder_path, file_name)
            
            # Case-insensitive extension check
            if not file_name.lower().endswith(image_extensions):
                os.remove(file_path)
                non_image_count += 1
            else:
                image_count += 1
        
        print(f"Folder: {folder} - Image files: {image_count}")
        print(f"Folder: {folder} - Non-image files removed: {non_image_count}")


In [11]:
remove_non_image_files('inputs/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - Image files: 2104
Folder: healthy - Non-image files removed: 0
Folder: powdery_mildew - Image files: 2104
Folder: powdery_mildew - Non-image files removed: 0


### 4.2 - Average Image Shape

Calculate the average image size from the dataset.

In [12]:
data_dir = 'inputs/cherry_leaves_dataset/cherry-leaves'

size1 = [] 
size2 = []

dataset_folders = ['healthy', 'powdery_mildew']

def read_image_tf(img_path):
    img = tf.io.read_file(img_path)
    img = tf.image.decode_image(img)
    return img


for label in dataset_folders:
    label_dir = os.path.join(data_dir, label)
    
    if os.path.isdir(label_dir):
        for image_filename in os.listdir(label_dir):
            img_path = os.path.join(label_dir, image_filename)
            img = read_image_tf(img_path)
            s1, s2 = img.shape[:2]
            size1.append(s1)
            size2.append(s2)

size1_mean = int(np.mean(size1))
size2_mean = int(np.mean(size2))

# f-string so the computed means are interpolated into the message
print(f"The average image shape is the following: The width average is {size2_mean} pixels and the height average is {size1_mean} pixels")

file_path = "/workspace/Mildew-Detection-CL/outputs/" 

os.makedirs(file_path, exist_ok=True)

image_shape = (size1_mean, size2_mean, 3)
joblib.dump(value=image_shape, filename=f"{file_path}/image_shape.pkl")

2024-08-17 21:43:10.872042: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


The average image shape is the following: The width average is {size2_mean} pixels and the height average is {size1_mean} pixels


['/workspace/Mildew-Detection-CL/outputs//image_shape.pkl']

### 4.3 - Split data into sets of 'Train', 'Validation', 'Test'

In [13]:
import math
import os
import shutil
import random

def split_dataset(data_directory, train_ratio, validation_ratio, test_ratio):

    # Compare with a tolerance: floating-point sums of ratios are not always exactly 1.0
    if not math.isclose(train_ratio + validation_ratio + test_ratio, 1.0):
        print("Error: train_ratio + validation_ratio + test_ratio should total 1.0")
        return

    labels = [label for label in os.listdir(data_directory) if os.path.isdir(os.path.join(data_directory, label))]
    
    for subset in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(data_directory, subset, label), exist_ok=True)

    for label in labels:
        label_path = os.path.join(data_directory, label)
        all_files = os.listdir(label_path)
        random.shuffle(all_files)

        num_train_files = int(len(all_files) * train_ratio)
        num_validation_files = int(len(all_files) * validation_ratio)

        for index, file_name in enumerate(all_files):
            src_path = os.path.join(label_path, file_name)
            if index < num_train_files:
                dest_path = os.path.join(data_directory, 'train', label, file_name)
            elif index < num_train_files + num_validation_files:
                dest_path = os.path.join(data_directory, 'validation', label, file_name)
            else:
                dest_path = os.path.join(data_directory, 'test', label, file_name)
            
            shutil.move(src_path, dest_path)

        if not os.listdir(label_path):
            os.rmdir(label_path)

    print("Dataset has been successfully divided into training, validation, and test sets.")
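
`split_dataset` shuffles with the global random state, so each run can produce a different split. If a reproducible split is wanted, one option is a seeded local generator, sketched below. The helper name `shuffled_copy` and the seed value 42 are arbitrary assumptions, not part of the project.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a shuffled copy of file_names that is the same on every run."""
    rng = random.Random(seed)     # local generator; the global random state is untouched
    ordered = sorted(file_names)  # os.listdir order is arbitrary, so sort first
    rng.shuffle(ordered)
    return ordered


# Usage (hypothetical): all_files = shuffled_copy(os.listdir(label_path))
```

Sorting before shuffling matters because the same seed applied to differently ordered input would give a different result.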


By convention, the data is divided as follows:

* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives 20% of the data (ratio 0.20).
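
Because `split_dataset` truncates with `int()` and gives the remainder to the test set, the actual counts differ slightly from the exact ratios. The sketch below reproduces that arithmetic for the 2104 images per class reported in section 4.1; `subset_counts` is a hypothetical helper that mirrors the function's logic.

```python
def subset_counts(num_files, train_ratio, validation_ratio):
    """Mirror the int() truncation in split_dataset; the test set gets the remainder."""
    num_train = int(num_files * train_ratio)
    num_validation = int(num_files * validation_ratio)
    num_test = num_files - num_train - num_validation
    return num_train, num_validation, num_test


per_class = subset_counts(2104, 0.7, 0.1)
print(per_class)                        # (1472, 210, 422)
print(tuple(2 * n for n in per_class))  # (2944, 420, 844) across both classes
```

The test subset therefore ends up with slightly more than 20% of each class, because it absorbs the files dropped by truncation.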

In [14]:
split_dataset(data_directory="inputs/cherry_leaves_dataset/cherry-leaves",
              train_ratio=0.7,
              validation_ratio=0.1,
              test_ratio=0.2)

Dataset has been successfully divided into training, validation, and test sets.
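
The success message alone does not show how many files landed in each subset. As a sketch, assuming the `train`, `validation`, and `test` folder layout created by `split_dataset`, the counts can be read back from disk; `count_split_files` is a hypothetical helper.

```python
import os


def count_split_files(data_directory, subsets=("train", "validation", "test")):
    """Return {subset: {label: file count}} for the subset folders that exist."""
    counts = {}
    for subset in subsets:
        subset_path = os.path.join(data_directory, subset)
        if not os.path.isdir(subset_path):
            continue
        counts[subset] = {
            label: len(os.listdir(os.path.join(subset_path, label)))
            for label in sorted(os.listdir(subset_path))
            if os.path.isdir(os.path.join(subset_path, label))
        }
    return counts


# Usage (hypothetical): count_split_files("inputs/cherry_leaves_dataset/cherry-leaves")
```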


### 4.4 - Resize Dataset Images

The final step in organizing the data is to resize every image in the dataset to 100x100 pixels, so that all images share the same dimensions for later processing.

In [15]:
def image_resize(data_dir, new_size=(100, 100)):
    total_files_processed = 0

    for root, dirs, files in os.walk(data_dir):
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg')):
                file_path = os.path.join(root, file)
                
                img = tf.io.read_file(file_path)
                img = tf.image.decode_image(img)

                img_resized = tf.image.resize(img, new_size)

                img_resized = tf.cast(img_resized, tf.uint8)

                img_resized_np = img_resized.numpy()
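                # scale=False keeps the resized pixel values as-is; the default
                # (scale=True) rescales each image's intensities to span 0-255
                tf.keras.preprocessing.image.save_img(file_path, img_resized_np, scale=False)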

                total_files_processed += 1

    print(f"Processed {total_files_processed} files in {data_dir}")
    print(f"All images resized to {new_size[0]}px x {new_size[1]}px")

Apply the resize function to each subset (Train, Validation, and Test) so that every image has the desired dimensions.

In [16]:
data_dir = "inputs/cherry_leaves_dataset/cherry-leaves"

print("Processing Train dataset...")
image_resize(os.path.join(data_dir, 'train'))

print("Processing Validation dataset...")
image_resize(os.path.join(data_dir, 'validation'))

print("Processing Test dataset...")
image_resize(os.path.join(data_dir, 'test'))

Processing Train dataset...
Processed 2944 files in inputs/cherry_leaves_dataset/cherry-leaves/train
All images resized to 100px x 100px
Processing Validation dataset...
Processed 420 files in inputs/cherry_leaves_dataset/cherry-leaves/validation
All images resized to 100px x 100px
Processing Test dataset...
Processed 844 files in inputs/cherry_leaves_dataset/cherry-leaves/test
All images resized to 100px x 100px


## 5.0 - Conclusion

---

Within this notebook, the following has been achieved:

* Downloaded and prepared the Cherry Leaf dataset from Kaggle.

* Cleaned the dataset by removing non-image files.

* Calculated the average image dimensions and saved this information for future use.

* Split the dataset into training, validation, and test subsets with specified ratios.

* Resized all images to 100x100 pixels to ensure consistency and support future processing.