# PP5 - Powdery Mildew Detection in Cherry Leaves

## Notebook 1 - Data Collection

**Objectives**

* Install the necessary libraries and packages
* Set the working directory
* Fetch the data from Kaggle
* Prepare the data for further processing
* Remove non-image files from the dataset
* Rename images to a human-readable form
* Split the data into train, validation, and test sets

**Inputs**

* Kaggle JSON file - the authentication token.
* Dataset: [Kaggle](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves/data)

**Outputs**

```
.
└── input/
    └── cherry-leaves/
        ├── test/
        │   ├── healthy
        │   └── mildew
        ├── train/
        │   ├── healthy
        │   └── mildew
        └── validation/
            ├── healthy
            └── mildew
```

---

## Preparation

**Install packages**

In [None]:
%pip install -r ../requirements.txt

**Change working directory**

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir('/workspace/ml-mildew-detection-in-cherry-leaves')
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir
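
The hard-coded path above only works in the original workspace. A minimal sketch of a more portable alternative follows. It assumes the project root is the nearest parent directory containing `requirements.txt`; the helper name `find_project_root` and the choice of marker file are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    # Check the starting directory first, then each of its parents in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

It could then be used as `os.chdir(find_project_root(os.getcwd()))` in place of the absolute path.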

---

## Working with data from Kaggle

**Install Kaggle**

In [None]:
%pip install kaggle

**Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
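
The shell command above fails with a terse message if `kaggle.json` is missing. As a hedged alternative, the same permission change can be done in Python with an explicit existence check. The helper name `secure_kaggle_token` is hypothetical, and the sketch assumes a POSIX file system.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json to owner read/write, the equivalent of chmod 600."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected a Kaggle token at {token}")
    # S_IRUSR | S_IWUSR == 0o600: owner may read and write, nobody else
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `secure_kaggle_token(os.getcwd())` after setting `KAGGLE_CONFIG_DIR` would mirror the cell above.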

**Set the Kaggle dataset path and download it**

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "input/"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

**Unzip the file and delete the zip file**

In [None]:
import zipfile

# Build the archive path once so extraction and removal use the same location
zip_path = os.path.join(DestinationFolder, 'cherry-leaves.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
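
If the download was interrupted, the archive may be corrupt. The sketch below wraps the unzip-and-delete step in a function that checks the archive before extracting. The function name `extract_and_remove` is an assumption for illustration; `ZipFile.testzip()` reads every member, so it adds some time on a large archive.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        archive.extractall(destination)
    # Only reached when extraction succeeded
    zip_path.unlink()
```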

In [None]:
!ls input/
!ls input/*

---

## Prepare the Data

**Remove non-image files from directories**

In [None]:
def remove_non_image_file(current_dir):
    """
    Delete every file in input/cherry-leaves/<folder>/ that lacks an image
    extension, then report the image and non-image counts per folder.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    data_dir = os.path.join(current_dir, 'input', 'cherry-leaves')
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # Not an image: delete it and count the removal
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - image files: {image_count}")
        print(f"Folder: {folder} - non-image files removed: {non_image_count}")

In [None]:
remove_non_image_file(current_dir)
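
Because the clean-up deletes files permanently, a dry run that only lists what would be removed can be useful first. This is a minimal sketch; `find_non_image_files` and `IMAGE_EXTENSIONS` are hypothetical names, and it assumes the same three extensions as the function above.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_non_image_files(dataset_dir):
    """Return {folder name: sorted non-image file names} without deleting anything."""
    non_images = {}
    for folder in sorted(Path(dataset_dir).iterdir()):
        if folder.is_dir():
            non_images[folder.name] = sorted(
                f.name
                for f in folder.iterdir()
                if f.is_file() and f.suffix.lower() not in IMAGE_EXTENSIONS
            )
    return non_images
```

For example, `find_non_image_files('input/cherry-leaves')` would return one entry per class folder, with an empty list where nothing would be deleted.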

**Rename the `powdery_mildew` folder to `mildew`**

In [None]:
!mv input/cherry-leaves/powdery_mildew input/cherry-leaves/mildew

In [None]:
os.listdir('input/cherry-leaves')

**Rename Images**

In [None]:
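def rename_images_in_folder(folder_path, prefix):
    """
    Description:
    Renames every file in a folder to <prefix><4-digit index><original extension>,
    for example a prefix of 'healthy' gives names starting with healthy0001.

    Parameters:
    folder_path (str): The path to the folder containing the image files.
    prefix (str): The prefix to be added to the renamed image files.

    Returns:
    None
    """
    # Sort the listing so the numbering is deterministic across runs
    image_files = sorted(f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f)))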
    
    for i, filename in enumerate(image_files, start=1):
        file_ext = os.path.splitext(filename)[1]
        new_filename = f"{prefix}{i:04}{file_ext}"
        
        os.rename(os.path.join(folder_path, filename), os.path.join(folder_path, new_filename))
        
        print(f"Renamed: {filename} -> {new_filename}")

folders_to_process = ['input/cherry-leaves/healthy', 'input/cherry-leaves/mildew']
prefixes = ['healthy', 'mildew']

for folder, prefix in zip(folders_to_process, prefixes):
    full_path = os.path.join(current_dir, folder)
    if os.path.exists(full_path):
        rename_images_in_folder(full_path, prefix)
    else:
        print(f"Folder not found: {full_path}")
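
Renaming in a single pass can overwrite a file when a target name such as `healthy0001.jpg` already exists in the folder, for instance if the cell is run a second time. One way to avoid this, sketched here under that assumption, is to rename in two passes through unique temporary names. The function name `rename_images_two_pass` and the `tmp_` prefix are illustrative choices.

```python
import os
import uuid


def rename_images_two_pass(folder_path, prefix):
    """Rename files to <prefix><4-digit index><ext> without overwriting any file."""
    filenames = sorted(
        f for f in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, f))
    )

    # Pass 1: move every file to a unique temporary name
    temp_names = []
    for filename in filenames:
        ext = os.path.splitext(filename)[1]
        temp_name = f"tmp_{uuid.uuid4().hex}{ext}"
        os.rename(os.path.join(folder_path, filename),
                  os.path.join(folder_path, temp_name))
        temp_names.append(temp_name)

    # Pass 2: assign the final numbered names; no target can exist any more
    new_names = []
    for i, temp_name in enumerate(temp_names, start=1):
        ext = os.path.splitext(temp_name)[1]
        new_name = f"{prefix}{i:04}{ext}"
        os.rename(os.path.join(folder_path, temp_name),
                  os.path.join(folder_path, new_name))
        new_names.append(new_name)
    return new_names
```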


**Split train/validation/test set**

In [None]:
import shutil
import random

In [None]:
import math


def split_train_validation_test_images(dataset_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    '''
    Description:
    Splits the dataset into train, validation, and test sets

    Parameters:
    dataset_dir: directory containing one sub-folder of images per class label
    train_set_ratio: ratio of images included in the train set
    validation_set_ratio: ratio of images included in the validation set
    test_set_ratio: ratio of images included in the test set

    Returns:
    None

    '''
    # Compare with a tolerance: exact float equality can reject valid ratios
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

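    # get the class labels (one sub-folder per class)
    labels = os.listdir(dataset_dir)
    if 'test' in labels:
        # the split folders already exist, so there is nothing to do
        pass
    else:
        # create train, validation, and test folders, each with one sub-folder per class label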
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=dataset_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(dataset_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(dataset_dir + '/' + label + '/' + file_name,
                                dataset_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(dataset_dir + '/' + label + '/' + file_name,
                                dataset_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(dataset_dir + '/' + label + '/' + file_name,
                                dataset_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(dataset_dir + '/' + label)

In [None]:
split_train_validation_test_images(dataset_dir="input/cherry-leaves/",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
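
The split sizes follow from simple arithmetic: the train and validation counts are truncated with `int()`, and the test set receives whatever remains. The sketch below reproduces that arithmetic and shows a tolerance-based check that the ratios sum to 1, which is safer for floats than exact equality (`0.1 + 0.2 == 0.3` is `False` in Python). The helper names and the folder size of 100 images are hypothetical.

```python
import math


def ratios_sum_to_one(*ratios):
    """Return True when the ratios sum to 1.0 within floating-point tolerance."""
    return math.isclose(sum(ratios), 1.0)


def split_counts(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts; the test set takes the remainder."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation
    return train, validation, test


# A hypothetical class folder of 100 images with the ratios used above
print(split_counts(100, 0.7, 0.1))  # (70, 10, 20)
```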



In [None]:
os.listdir('input/cherry-leaves/')
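
Listing the top-level folders confirms that the split ran, but not how many images landed in each set. A small sketch for counting files per split and label follows. The function name `count_split_files` is an assumption, and it expects the `train/validation/test` layout created above.

```python
import os


def count_split_files(dataset_dir, splits=("train", "validation", "test")):
    """Return {(split, label): number of files} for each split and class label."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(dataset_dir, split)
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            counts[(split, label)] = len(os.listdir(label_dir))
    return counts
```

Calling `count_split_files('input/cherry-leaves')` would return one count per split and label, which can be compared against the chosen ratios.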