In [None]:
# Modelling and Evaluation

## Objectives

- Fulfill Business Requirement 2:
  - Develop a model to classify a cherry leaf as healthy or affected by powdery mildew.

## Inputs

- Image shape embeddings file (`outputs/v1/image_shape.pkl`, pickle format)
- Directory structure:
  ```bash
  ├── inputs 
  │       └── cherry-leaves
  │           ├── test
  │           │   ├── healthy
  │           │   └── powdery_mildew
  │           ├── train
  │           │   ├── healthy
  │           │   └── powdery_mildew
  │           └── validation
  │               ├── healthy
  │               └── powdery_mildew
  ```

## Outputs

- Data distribution visualizations for training, validation, and testing:
  - Bar chart for label distribution
  - Pie chart for dataset split
- Image augmentation samples per dataset split.
- Class indices for labeling predictions.
- Model summary and training configurations.
- Model training results and saved model file.
- Learning curves for model performance:
  - Model A - separate plots for accuracy and loss
  - Model B - complete training history visualization
  - Model C - comprehensive training history using Plotly
- Model evaluation metrics:
  - Accuracy score
  - ROC curve
  - Classification report for each model variation
- Confusion matrix display.
- Evaluation results saved as a pickle file.
- Prediction test on a random image.

## Comments | Insights | Conclusions

- Multiple visualizations are provided to support in-depth data analysis.
- The CNN model architecture aims to achieve high accuracy with minimal overfitting.
- Documentation on hyperparameter tuning is available in README.md and a detailed PDF report.

## Import Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from matplotlib.image import imread

---

## Set Working Directory

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspace/CI_PP5')
print("Working directory set to /workspace/CI_PP5")

In [None]:
work_dir = os.getcwd()
work_dir

## Set Input Directories

Define paths for train, validation, and test directories.

In [None]:
my_data_dir = 'inputs/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'
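
The paths above are built by string concatenation and are not checked, so a missing split folder would only surface later as a less obvious error. A minimal sketch of an early check follows. The helper name `missing_split_dirs` is hypothetical and not part of the original notebook; the split names are taken from the directory structure listed under Inputs.

```python
import os


def missing_split_dirs(base_dir, splits=("train", "validation", "test")):
    """Return the names in `splits` that are not sub-directories of `base_dir`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [s for s in splits if not os.path.isdir(os.path.join(base_dir, s))]


# Possible usage with the notebook's data directory:
# missing = missing_split_dirs('inputs/cherry-leaves')
# if missing:
#     raise FileNotFoundError(f"Missing split folders: {missing}")
```

Returning the list of missing names, rather than raising inside the helper, leaves the caller free to decide whether a missing split is fatal.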

## Set Output Directory

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Older version exists; consider creating a new version.')
else:
    os.makedirs(name=file_path)
    print(f"Output directory created: {file_path}")
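
The cell above decides whether to create the folder by listing the parent directories. An alternative sketch, assuming the same `outputs/<version>` layout, relies on `os.makedirs(..., exist_ok=True)` so the cell can be re-run safely. The helper name `ensure_output_dir` is hypothetical.

```python
import os


def ensure_output_dir(version, root="outputs"):
    """Create `root/version` if needed and report whether it already existed.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = os.path.join(root, version)
    already_existed = os.path.isdir(path)
    # exist_ok=True makes the call idempotent: no error if the folder is there.
    os.makedirs(path, exist_ok=True)
    return path, already_existed


# Possible usage:
# file_path, existed = ensure_output_dir('v1')
# if existed:
#     print('Older version exists; consider creating a new version.')
```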

## Define Image Labels

In [None]:
labels = os.listdir(train_path)
print('Image labels are:', labels)
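
`os.listdir` returns entries in arbitrary order and includes any stray files (for example `.DS_Store` on macOS). A minimal sketch that keeps only sub-directories and sorts them, so the label order is reproducible across machines, is shown below. The helper name `list_labels` is hypothetical.

```python
import os


def list_labels(split_dir):
    """Return the sorted names of the sub-directories of `split_dir`.

    Hypothetical helper for illustration; files such as `.DS_Store` are skipped.
    """
    return sorted(
        entry for entry in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, entry))
    )


# Possible usage:
# labels = list_labels('inputs/cherry-leaves/train')
```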

## Load Image Shape

In [None]:
import joblib
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape
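
Because the loaded value later determines the model's input size, a quick sanity check can catch a stale or corrupted pickle early. The sketch below assumes the stored value is a `(height, width, channels)` sequence. That layout is an assumption, not something this notebook states, and the helper name `is_valid_image_shape` is hypothetical.

```python
from numbers import Integral


def is_valid_image_shape(shape):
    """Check that `shape` looks like (height, width, channels).

    Assumption: three positive integers, with 1, 3 or 4 channels.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    if len(shape) != 3:
        return False
    if not all(isinstance(dim, Integral) and dim > 0 for dim in shape):
        return False
    return shape[2] in (1, 3, 4)


# Possible usage:
# assert is_valid_image_shape(image_shape), f"Unexpected shape: {image_shape}"
```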

---

## Analyze Image Distribution

### Count Images in Each Set and Label Category

In [None]:
import plotly.express as px

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the DataFrame once after the loop.
rows = []
for folder in ['train', 'test', 'validation']:
    for label in labels:
        # Count the image files once and reuse the value below
        count = len(os.listdir(os.path.join(my_data_dir, folder, label)))
        rows.append({'Set': folder, 'Label': label, 'Count': count})
        print(f"* {folder} - {label}: {count} images")

df_freq = pd.DataFrame(rows)

print("\n")