# **Cherry Leaves Data Preparation and Exploration**

## Objectives

The objective of this notebook is to prepare the cherry leaves image dataset from Kaggle (https://www.kaggle.com/datasets/codeinstitute/cherry-leaves) for further analysis and modeling. This covers downloading the data, cleaning it, and preprocessing it so that it is usable in the later stages of the project.

## Inputs

Cherry leaves image dataset: A collection of images of cherry leaves, labelled as either healthy or infected with powdery mildew.

## Outputs

* Processed dataset: A cleaned version of the cherry leaves image dataset, split into train, validation, and test sets and ready for analysis and modeling.
* Preprocessing code: Python functions used for data cleaning and preprocessing.
* Saved dataset files: The processed images stored in a folder structure that subsequent notebooks or scripts can load directly.

## Additional Comments

In this notebook, data preparation consists of removing non-image files and splitting the images into train, validation, and test sets. Further steps such as image resizing, normalization, and noise removal may be applied later to improve data quality. The prepared data feeds the visualization, feature engineering, and model training stages of the project.



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

Change the working directory from the current folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/kuro/Desktop/PP5 Project/pp5-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/kuro/Desktop/PP5 Project/pp5-mildew-detection-in-cherry-leaves'
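
Re-running the cell that calls os.chdir() moves one more level up each time, which can leave the notebook outside the project folder. The following is a minimal sketch of an idempotent variant. The function name move_to_project_root and the default folder name are assumptions for illustration, with the folder name taken from the path shown in the output above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only when currently inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Only step up when we are still inside the notebooks subfolder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling move_to_project_root() a second time leaves the working directory unchanged, so the cell is safe to re-run.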

# Installing Kaggle and importing the data

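Install the Kaggle package, authenticate with a kaggle.json token stored in the project folder, then download and unzip the dataset.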

In [4]:
# Install Kaggle
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.15.tar.gz (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.9/77.9 kB 2.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting certifi (from kaggle)
  Using cached certifi-2023.5.7-py3-none-any.whl (156 kB)
Collecting requests (from kaggle)
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting tqdm (from kaggle)
  Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting python-slugify (from kaggle)
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting urllib3 (from kaggle)
  Using cached urllib3-2.0.3-py3-none-any.whl (123 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Collecting charset-normalizer<4,>=2 (from requests->kaggle)
  Downloading charset_normalizer-3.2.0-cp39-cp39-macosx_11_0_arm64.whl (124 kB)

---

In [11]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
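
The cell above assumes kaggle.json is already present in the working directory. As a hedged sketch, the check below confirms the token file exists, that it contains the username and key fields a Kaggle API token normally carries, and applies the same permissions as chmod 600. The helper name check_kaggle_token is hypothetical and not part of the Kaggle package.

```python
import json
import os
import stat


def check_kaggle_token(config_dir):
    """Validate kaggle.json in config_dir and restrict it to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # A Kaggle API token is expected to hold these two fields
    missing = [field for field in ("username", "key") if field not in token]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")

    # Same effect as `chmod 600 kaggle.json` on POSIX systems
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

It could be called as check_kaggle_token(os.getcwd()) before running the download command.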

In [105]:
KaggleDataSetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves-dataset"

! kaggle datasets download -d {KaggleDataSetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry-leaves-dataset
 93%|███████████████████████████████████▏  | 51.0M/55.0M [00:02<00:00, 42.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 26.8MB/s]


In [106]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
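
Before the archive is deleted, it can be useful to see how many files each folder inside it contains. The sketch below does this without extracting anything; the function name count_files_per_folder is an assumption for illustration, and it would need to run before the os.remove() call above.

```python
import zipfile
from collections import Counter


def count_files_per_folder(zip_path):
    """Count archive members per parent folder without extracting them."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            # Zip member names always use forward slashes
            folder = info.filename.rsplit("/", 1)[0] if "/" in info.filename else ""
            counts[folder] += 1
    return dict(counts)
```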

# Preparing the data 

In [107]:
def remove_non_image_files(my_data_dir):
    image_extensions = ('.png', '.jpg', '.jpeg')
    folders = ['healthy', 'powdery_mildew']
    
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        
        if os.path.isdir(folder_path):
            files = os.listdir(folder_path)
            num_image_files = 0
            num_non_image_files = 0
            
            for file in files:
                file_path = os.path.join(folder_path, file)
                
                if os.path.isfile(file_path):
                    _, extension = os.path.splitext(file)
                    if extension.lower() not in image_extensions:
                        os.remove(file_path)  # Remove non-image file
                        num_non_image_files += 1
                    else:
                        num_image_files += 1
            
            print(f"Folder: {folder} - has image files: {num_image_files}")
            print(f"Folder: {folder} - had non-image files: {num_non_image_files}")

In [108]:
remove_non_image_files(my_data_dir='inputs/cherry-leaves-dataset/cherry-leaves')

Folder: healthy - has image files: 2104
Folder: healthy - had non-image files: 0
Folder: powdery_mildew - has image files: 2104
Folder: powdery_mildew - had non-image files: 0
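
The function above filters by file extension only, so a corrupt or mislabelled file with a .jpg name would still pass. As an optional extra check, and only a sketch, the helper below inspects the first bytes of a file for the standard PNG and JPEG signatures. The name looks_like_image is hypothetical, and this does not validate the full image content.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as handle:
        header = handle.read(8)  # the PNG signature is 8 bytes long
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```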


In [109]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance: exact equality on floating-point sums is fragile
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = os.listdir(my_data_dir)
    if 'cherry-leaves' in labels:
        labels = os.listdir(os.path.join(my_data_dir, 'cherry-leaves'))

    if 'test' in labels:
        return

    # Create necessary folders
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'train', 'healthy'))
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'train', 'powdery_mildew'))
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'validation', 'healthy'))
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'validation', 'powdery_mildew'))
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'test', 'healthy'))
    os.makedirs(os.path.join(my_data_dir, 'cherry-leaves', 'test', 'powdery_mildew'))

    for label in labels:
        files = os.listdir(os.path.join(my_data_dir, 'cherry-leaves', label))
        random.shuffle(files)

        total_files = len(files)
        train_set_size = int(total_files * train_set_ratio)
        validation_set_size = int(total_files * validation_set_ratio)
        test_set_size = total_files - train_set_size - validation_set_size

        train_files = files[:train_set_size]
        validation_files = files[train_set_size:train_set_size + validation_set_size]
        test_files = files[train_set_size + validation_set_size:]

        # Move files to the appropriate folders based on ratios
        for file_name in train_files:
            src_path = os.path.join(my_data_dir, 'cherry-leaves', label, file_name)
            dest_path = os.path.join(my_data_dir, 'cherry-leaves', 'train', label, file_name)
            shutil.move(src_path, dest_path)

        for file_name in validation_files:
            src_path = os.path.join(my_data_dir, 'cherry-leaves', label, file_name)
            dest_path = os.path.join(my_data_dir, 'cherry-leaves', 'validation', label, file_name)
            shutil.move(src_path, dest_path)

        for file_name in test_files:
            src_path = os.path.join(my_data_dir, 'cherry-leaves', label, file_name)
            dest_path = os.path.join(my_data_dir, 'cherry-leaves', 'test', label, file_name)
            shutil.move(src_path, dest_path)

        # Remove the label folder once files are moved
        os.rmdir(os.path.join(my_data_dir, 'cherry-leaves', label))
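
The split function shuffles with random.shuffle() and no seed, so each fresh run assigns different images to each set. If a reproducible split is wanted, one option is sketched below. The helper name shuffled_copy and the seed value 42 are assumptions, not part of the original notebook.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a reproducibly shuffled copy of file_names."""
    # Sort first because os.listdir() does not guarantee any order
    ordered = sorted(file_names)
    # A dedicated Random instance avoids changing the global random state
    random.Random(seed).shuffle(ordered)
    return ordered
```

Replacing the random.shuffle(files) call with files = shuffled_copy(files) would give the same split on every run for the same set of files.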

- The training set receives 70% of the images in each class.
- The validation set receives 10% of the images in each class.
- The test set receives the remaining 20% of the images in each class.
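
To make the ratios concrete, the arithmetic used inside the split function can be reproduced on its own. The sketch below mirrors that logic for one class of 2104 images, the count reported earlier; the function name split_sizes is an assumption for illustration.

```python
def split_sizes(total_files, train_ratio, validation_ratio):
    """Mirror the size calculation used in split_train_validation_test_images."""
    train = int(total_files * train_ratio)            # rounded down
    validation = int(total_files * validation_ratio)  # rounded down
    test = total_files - train - validation           # test takes the remainder
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation sizes are rounded down, the test set absorbs the leftover images and ends up slightly above 20%.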

In [110]:
my_data_dir = 'inputs/cherry-leaves-dataset/'

train_set_ratio = 0.7
validation_set_ratio = 0.1
test_set_ratio = 0.2

split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio)
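
The split function prints nothing on success, so a quick count per folder helps confirm the result. The following is a minimal sketch assuming the train/validation/test folder layout created above; the function name count_split_images is hypothetical.

```python
import os


def count_split_images(dataset_dir, splits=("train", "validation", "test")):
    """Return a {(split, label): file_count} mapping for the split folders."""
    counts = {}
    for split in splits:
        split_path = os.path.join(dataset_dir, split)
        if not os.path.isdir(split_path):
            continue  # the split has not been created yet
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if os.path.isdir(label_path):
                counts[(split, label)] = len(os.listdir(label_path))
    return counts
```

For this notebook it could be called as count_split_images('inputs/cherry-leaves-dataset/cherry-leaves').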

# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid Python until a folder is created
except Exception as e:
    print(e)
