# **Modeling and Evaluation Notebook**

## Objectives

* Answer business requirement 2:
  * The client is interested in predicting whether a cherry leaf is healthy or contains powdery mildew.

## Inputs

* inputs/cherry-leaves_dataset/cherry-leaves/train
* inputs/cherry-leaves_dataset/cherry-leaves/test
* inputs/cherry-leaves_dataset/cherry-leaves/validation
* image shape embeddings (pickle file)

## Outputs

* Images distribution plot in train, validation, and test set
  * Label distribution in a bar chart
  * Dataset distribution in a pie chart
* Image augmentation
* Class indices to map prediction inference to labels
* TensorFlow convolutional network machine learning model
* Hyperparameter optimisation pipeline
* Model trained on best hyperparameters as generated by Keras Tuner
* Save trained model
* Learning curve plot for model performance
* Model evaluation saved to a pickle file
* Calculate classification report (Model A)
* Calculate classification report with macro avg and weighted avg (Model B)
* Prediction on the random image file

## Comments/Insights/Conclusions

* The same data was plotted in several versions to accommodate possible client requests for further data understanding, and saved in outputs V1 - V5.
* The CNN was built to maximize accuracy while minimizing loss and training time; various models and hyperparameters were tried.
* The CNN was kept as small as possible without compromising accuracy, while avoiding overfitting and underfitting.




---

## Import packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.image import imread

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory
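
As a minimal sketch of what `os.path.dirname()` contributes here, the snippet below applies it to the notebook path shown in the earlier output. It only manipulates the path string and does not change the working directory; the variable names are illustrative.

```python
import os

# Path reported by os.getcwd() before the directory change
notebook_dir = '/workspaces/milestone-project-mildew-detection-in-cherry-leaves/jupyter_notebooks'

# os.path.dirname() strips the last path component, giving the parent folder
project_root = os.path.dirname(notebook_dir)
print(project_root)
```

Passing this parent path to `os.chdir()` is what moves the notebook from the `jupyter_notebooks` subfolder to the project root.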


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-mildew-detection-in-cherry-leaves'

## Set input directories

Set train, validation and test paths

In [5]:
my_data_dir = 'inputs/cherry-leaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

## Set output directory

In [6]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(current_dir) and version in os.listdir(current_dir + '/outputs'):
  print('Old version is already available create a new version.')
  pass
else:
  os.makedirs(name=file_path)

Old version is already available create a new version.
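
An alternative to the manual existence check, sketched here against a temporary folder so it can be run safely, is to let `os.makedirs` handle it through `exist_ok=True`. The helper name `ensure_output_dir` is hypothetical and not part of the notebook.

```python
import os
import tempfile

def ensure_output_dir(base_dir, version):
    """Create <base_dir>/outputs/<version> if missing.

    Returns the path and whether the folder already existed.
    """
    path = os.path.join(base_dir, 'outputs', version)
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path, existed

with tempfile.TemporaryDirectory() as tmp:
    first_path, existed_first = ensure_output_dir(tmp, 'v1')
    second_path, existed_second = ensure_output_dir(tmp, 'v1')
    print(existed_first, existed_second)
```

The second call finds the folder created by the first one, which mirrors the "already available" branch above.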


## Set labels

In [7]:
labels = os.listdir(train_path)

print(
    f"Project Labels: {labels}"
    )

Project Labels: ['healthy', 'powdery_mildew']


## Set image shape

In [8]:
## Import saved image shape embedding
import joblib
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

(256, 256, 3)

---

# Number of images in train, test and validation data

Count number of images per label & set

In [19]:
rows = []
for folder in ['train', 'validation', 'test']:
  for label in labels:
    # Count the image files in each set/label folder
    n_images = len(os.listdir(my_data_dir + '/' + folder + '/' + label))
    rows.append({'Set': folder, 'Label': label, 'Frequency': n_images})
    print(f"* {folder} - {label}: {n_images} images")

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_freq = pd.DataFrame(rows)
print(df_freq)

* train - healthy: 1472 images
* train - powdery_mildew: 1472 images
* validation - healthy: 210 images
* validation - powdery_mildew: 210 images
* test - healthy: 422 images
* test - powdery_mildew: 422 images
          Set           Label  Frequency
0       train         healthy       1472
1       train  powdery_mildew       1472
2  validation         healthy        210
3  validation  powdery_mildew        210
4        test         healthy        422
5        test  powdery_mildew        422


## Bar chart for label distribution

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
ax = sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')

# Add labels to the bars
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height, f'{height:.0f}', ha='center', va='bottom')  # whole-number counts

plt.legend(loc='upper right')
plt.title('Cherry leaves Label distribution')
plt.savefig(f'{file_path}/labels_distribution.png', bbox_inches='tight', dpi=150)
plt.show()

## Pie chart for dataset distribution

In [None]:
plt.figure(figsize=(8,5))
set_labels = df_freq['Set'].unique()
colors = sns.color_palette('pastel')[:len(set_labels)]
explode = [0.1] * len(set_labels)

set_frequencies = []
for set_label in set_labels:
    set_frequencies.append(df_freq[df_freq['Set'] == set_label]['Frequency'].sum())

plt.pie(set_frequencies, labels=set_labels, colors=colors, explode=explode, autopct='%.0f%%')
plt.title('Cherry leaves dataset distribution')
plt.savefig(f'{file_path}/sets_distribution_pie.png',
            bbox_inches='tight', dpi=150)
plt.show()
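
The percentages the pie chart displays can be checked by hand from the image counts printed earlier. The sketch below recomputes them in plain Python; the dictionary layout is only an illustrative choice.

```python
# Image counts per set and label, copied from the frequency output above
counts = {
    'train': {'healthy': 1472, 'powdery_mildew': 1472},
    'validation': {'healthy': 210, 'powdery_mildew': 210},
    'test': {'healthy': 422, 'powdery_mildew': 422},
}

# Total images per set, overall total, and each set's share in whole percent
set_totals = {name: sum(per_label.values()) for name, per_label in counts.items()}
grand_total = sum(set_totals.values())
set_share = {name: round(100 * total / grand_total) for name, total in set_totals.items()}
print(set_totals, grand_total, set_share)
```

This gives roughly a 70/10/20 split between train, validation and test.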

----

# Image data augmentation

Our dataset contains a limited number of images, so we augment the training images to improve the model's performance.

## Import ImageDataGenerator from TensorFlow

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Initialize ImageDataGenerator

In [None]:
augmented_image_data = ImageDataGenerator(rotation_range=20,
                                   width_shift_range=0.10, 
                                   height_shift_range=0.10,
                                   shear_range=0.1,
                                   zoom_range=0.1,
                                   horizontal_flip=True,
                                   vertical_flip=True,
                                   fill_mode='nearest',
                                   rescale=1./255
                              )
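
To make two of these settings concrete, the sketch below applies the equivalent of `rescale=1./255` and `horizontal_flip=True` to a tiny 2x3 single-channel image held in plain Python lists. It only illustrates the arithmetic: `ImageDataGenerator` applies its transformations randomly per image and on full arrays, and the helper names here are hypothetical.

```python
def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by factor, mapping 0..255 to 0..1."""
    return [[pixel * factor for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

# Made-up pixel values for illustration
tiny_image = [[0, 51, 255],
              [102, 153, 204]]

flipped = horizontal_flip(tiny_image)
scaled = rescale(flipped)
print(flipped)
print([[round(p, 1) for p in row] for row in scaled])
```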

## Augment training image dataset

In [None]:
batch_size = 20 # Set batch size
train_set = augmented_image_data.flow_from_directory(train_path,
                                              target_size=image_shape[:2],
                                              color_mode='rgb',
                                              batch_size=batch_size,
                                              class_mode='binary',
                                              shuffle=True
                                              )

train_set.class_indices

## Augment validation image dataset

In [None]:
validation_set = ImageDataGenerator(rescale=1./255).flow_from_directory(val_path,
                                                          target_size=image_shape[:2],
                                                          color_mode='rgb',
                                                          batch_size=batch_size,
                                                          class_mode='binary',
                                                          shuffle=False
                                                          )

validation_set.class_indices

## Augment test image dataset

In [None]:
test_set = ImageDataGenerator(rescale=1./255).flow_from_directory(test_path,
                                                    target_size=image_shape[:2],
                                                    color_mode='rgb',
                                                    batch_size=batch_size,
                                                    class_mode='binary',
                                                    shuffle=False
                                                    )

test_set.class_indices

## Plot augmented training image

In [None]:

for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(train_set)
    print(img.shape)   #  (20,256,256,3): one batch of batch_size images
    plt.imshow(img[0])
    plt.show()

## Plot augmented validation image

In [None]:
for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(validation_set)
    print(img.shape)   #  (20,256,256,3): one batch of batch_size images
    plt.imshow(img[0])
    plt.show()

## Plot augmented test image

In [None]:
for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(test_set)
    print(img.shape)   #  (20,256,256,3): one batch of batch_size images
    plt.imshow(img[0])
    plt.show()

## Save class_indices

In [None]:
joblib.dump(value=train_set.class_indices ,
            filename=f"{file_path}/class_indices.pkl")

---

# Model creation

## Install and import the Keras Tuner for hyperparameter optimisation

In [None]:
!pip install keras-tuner

In [None]:
import tensorflow as tf
from tensorflow import keras
import keras_tuner as kt

## Import model packages

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D

## Construct function to create hypermodel for hypertuning

In [None]:
def create_tf_model(hp):
    model = Sequential()

    # Only the first layer needs input_shape; later layers infer their input
    model.add(Conv2D(filters=32, kernel_size=(3,3), input_shape=image_shape, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())
    hp_units = hp.Int('units', min_value=128, max_value=256, step=32)
    model.add(Dense(units=hp_units, activation = 'relu'))

    model.add(Dropout(0.5))
    model.add(Dense(1, activation = 'sigmoid'))
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-3, 1e-4])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss='binary_crossentropy',
                metrics=['accuracy'])
    
    return model
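
The size of the search space defined in this function can be enumerated directly. The sketch below lists the values that `hp.Int('units', ...)` and `hp.Choice('learning_rate', ...)` can take, assuming `max_value` is inclusive as documented by Keras Tuner, and counts the combinations. It does not call Keras Tuner itself.

```python
from itertools import product

# Values hp.Int('units', min_value=128, max_value=256, step=32) can take
unit_choices = list(range(128, 256 + 1, 32))
# Values offered by hp.Choice('learning_rate', values=[1e-3, 1e-4])
learning_rates = [1e-3, 1e-4]

# Every (units, learning_rate) pair the tuner could try
search_space = list(product(unit_choices, learning_rates))
print(unit_choices)
print(len(search_space))
```

With only ten combinations, the search stays small.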

## Instantiate the tuner and perform hypertuning

In [None]:
tuner = kt.Hyperband(create_tf_model,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory=file_path,
                     project_name='hypertuning')

## Early Stopping

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss',patience=3)
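
The effect of `patience=3` can be illustrated with a simplified, pure-Python model of the patience counter. This is not the Keras implementation, which has further options such as `min_delta` and `restore_best_weights`, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

example_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]  # made-up values
stop_epoch = early_stopping_epoch(example_losses)
print(stop_epoch)
```

Here the best loss is reached at epoch 2. The three following epochs do not improve on it, so training stops at epoch 5 and the later value of 0.50 is never reached.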

## Run the hyperparameter search

In [None]:
tuner.search(train_set,
          epochs=25,
          steps_per_epoch = len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop],
          verbose=1
          )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

## Create model using defined best hyperparameters

In [None]:
model = tuner.hypermodel.build(best_hps)

## Model Summary

In [None]:
model.summary()
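
The layer output sizes reported by the summary can be cross-checked by hand. The sketch below assumes the Keras defaults used in the model definition (`padding='valid'` and stride 1 for `Conv2D`, and a 2x2 pool with matching stride for `MaxPooling2D`); the helper name is hypothetical.

```python
def conv_then_pool(size, kernel=3, pool=2):
    """Spatial size after a 'valid' 3x3 convolution followed by 2x2 max pooling."""
    after_conv = size - kernel + 1   # stride 1, no padding
    return after_conv // pool        # pooling floors odd sizes

size = 256                 # height and width from image_shape
sizes = []
for _ in range(3):         # three Conv2D + MaxPooling2D blocks
    size = conv_then_pool(size)
    sizes.append(size)

flatten_units = size * size * 64   # the last Conv2D block has 64 filters
print(sizes, flatten_units)
```

Under these assumptions, the `Flatten` layer feeds 57,600 values into the tuned `Dense` layer. That layer therefore holds most of the model's parameters.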

## Re-instantiate the hypermodel and train it

In [None]:
model = tuner.hypermodel.build(best_hps)

# Retrain the model; the classes are balanced (1472 training images each),
# so no class weights are needed
model.fit(train_set,
          epochs=25,
          steps_per_epoch = len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop],
          verbose=1
          )
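
The `steps_per_epoch` expression can be worked through with the training counts reported earlier. The short sketch below only does the integer arithmetic and uses illustrative variable names.

```python
train_images = 1472 + 1472   # healthy + powdery_mildew images in the train set
batch_size = 20

steps_per_epoch = train_images // batch_size   # full batches per epoch
leftover_images = train_images % batch_size    # images beyond the last full batch
print(steps_per_epoch, leftover_images)
```

Each epoch therefore runs 147 full batches, and 4 images fall outside them.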

## Save the model

In [None]:
model.save('outputs/v1/powdery_mildew_detection_model.h5')


----

# Model Performance

----

## Model learning curve

In [None]:
losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.savefig(f'{file_path}/model_training_losses.png', bbox_inches='tight', dpi=150)
plt.show()

print("\n")
losses[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.savefig(f'{file_path}/model_training_acc.png', bbox_inches='tight', dpi=150)
plt.show()

# Model Evaluation

## Load saved model

In [None]:
from tensorflow.keras.models import load_model
model = load_model('outputs/v1/powdery_mildew_detection_model.h5')

## Evaluate model on test set

In [None]:
evaluation = model.evaluate(test_set)
print("Model accuracy: {:.2f}%".format(evaluation[1] * 100))
print("Model Loss: ",evaluation[0])

## Save evaluation pickle

In [None]:
joblib.dump(value=evaluation ,
            filename=f"outputs/v1/evaluation.pkl")

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

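# Predict labels: threshold the sigmoid output at 0.5 and flatten to 1-D
pred = model.predict(test_set)
y_pred = (pred > 0.5).astype(int).ravel()

# Define target names in class-index order (healthy = 0, powdery_mildew = 1)
target_names = ['Healthy', 'Powdery Mildew']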

# Compute confusion matrix
cm = confusion_matrix(test_set.classes, y_pred)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(7, 6))

# Plot the heatmap
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names,
            ax=ax)

# Set labels and title
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')

# Save the figure
plt.savefig(f'{file_path}/confusion_matrix.png', bbox_inches='tight', dpi=150)
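
For reference, the layout scikit-learn uses for a binary confusion matrix (rows are actual classes, columns are predicted classes, both ordered by class index) can be reproduced in a few lines of plain Python. The labels and probabilities below are made up and the helper name is hypothetical.

```python
def binary_confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] with rows = actual and columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, prob in zip(y_true, y_prob):
        predicted = int(prob > threshold)
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels and sigmoid outputs, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.8, 0.9, 0.3, 0.7]
cm_example = binary_confusion_matrix(y_true, y_prob)
print(cm_example)
```

Row and column 0 correspond to class index 0 and row and column 1 to class index 1, so the tick labels on the heatmap need to follow the same order.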

## Classification Report - A

In [None]:
from sklearn.metrics import classification_report

print('Classification Report:\n----------------------\n')
print(classification_report(test_set.classes, y_pred, 
      target_names=target_names))


## Classification Report - B

In [None]:
classification_rep = classification_report(test_set.classes, y_pred, target_names=target_names, output_dict=True)

# Print the keys in the classification_rep dictionary
print(classification_rep.keys())

# Define the target class labels
target_classes = ['accuracy', 'Healthy', 'Powdery Mildew', 'macro avg', 'weighted avg']

# Extract the required metrics
metric_labels = ['precision', 'recall', 'f1-score']

# Create the data matrix
data = np.zeros((len(target_classes), len(metric_labels)))
for i, class_label in enumerate(target_classes):
    if class_label == 'accuracy':
        data[i, :] = round(classification_rep[class_label], 3)  # Assign the accuracy value directly to all metrics
    else:
        for j, metric_label in enumerate(metric_labels):
            data[i, j] = round(classification_rep[class_label][metric_label], 3)  # Round the values to 3 decimal places

# Plot the heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(data, annot=True, fmt='.3f', cmap="Blues", cbar=True, linewidths=1)  # Use fmt='.3f' to display values with 3 decimal places and cbar=True to show the color bar
plt.xticks(np.arange(len(metric_labels)) + 0.5, metric_labels, rotation=0)
plt.yticks(np.arange(len(target_classes)) + 0.5, target_classes, rotation=0)  # Set the class labels on the y-axis
plt.xlabel('Metric')
plt.ylabel('Class')
plt.title('Classification Report')
plt.savefig(f'{file_path}/clf_report.png', bbox_inches='tight', dpi=150)
plt.show()
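
The difference between the `macro avg` and `weighted avg` rows is easiest to see on an imbalanced toy example. The sketch below uses made-up confusion counts rather than results from this model, and a hypothetical helper name. On the balanced test set used here (422 images per class) the two averages coincide.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up confusion counts for an imbalanced toy example
tn, fp, fn, tp = 90, 10, 5, 15

# Class 1 is "positive"; for class 0 the roles of the counts swap
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

support0, support1 = tn + fp, fn + tp          # actual samples per class
macro_f1 = (f1_0 + f1_1) / 2                   # plain mean over classes
weighted_f1 = (f1_0 * support0 + f1_1 * support1) / (support0 + support1)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The weighted average is pulled towards the larger class, while the macro average treats both classes equally.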

## Test prediction using a random image

In [None]:
import random
from tensorflow.keras.preprocessing import image

label = labels[0] # select healthy or infected

# Get the list of image file names for the specified label category
image_files = os.listdir(test_path + '/' + label)

# Select a random image file
random_image_file = random.choice(image_files)

# Load and resize the random image
pil_image = image.load_img(test_path + '/'+ label + '/'+ random_image_file,
                          target_size=image_shape[:2], color_mode='rgb')  # (height, width) only

print(f'Image shape: {pil_image.size}, Image mode: {pil_image.mode}')
pil_image

## Convert image to array and prepare for prediction

In [None]:
my_image = image.img_to_array(pil_image)
my_image = np.expand_dims(my_image, axis=0)/255
print(my_image.shape)

## Predict class probabilities

In [None]:
pred_proba = model.predict(my_image)[0,0]

target_map = {v: k for k, v in train_set.class_indices.items()}
pred_class = target_map[int(pred_proba > 0.5)]  # convert the boolean to a 0/1 class index

if pred_class == target_map[0]:
    pred_proba = 1 - pred_proba

pred_percentage = round(pred_proba * 100, 2)

print(f"Label: {pred_class}")
print(f"Percentage: {pred_percentage}%")
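
The interpretation step can be isolated in a small function that needs no model. The sketch below assumes the class indices are `{'healthy': 0, 'powdery_mildew': 1}`, which follows from `flow_from_directory` sorting the label folders alphanumerically; the function name is hypothetical.

```python
def interpret_sigmoid(prob, class_indices):
    """Map a sigmoid output to (label, confidence in that label as a percentage)."""
    index_to_label = {index: label for label, index in class_indices.items()}
    predicted_index = int(prob > 0.5)
    # The sigmoid output is the probability of class 1; class 0 gets the complement
    confidence = prob if predicted_index == 1 else 1 - prob
    return index_to_label[predicted_index], round(confidence * 100, 2)

# Assumed class indices for the two label folders
class_indices = {'healthy': 0, 'powdery_mildew': 1}
low_result = interpret_sigmoid(0.03, class_indices)
high_result = interpret_sigmoid(0.91, class_indices)
print(low_result)
print(high_result)
```

A raw output close to 0 is thus reported as a high-confidence `healthy` prediction rather than a low-confidence `powdery_mildew` one.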

---

# Conclusion

* Counted the number of images per label and set, and displayed the label and dataset distributions in bar and pie charts respectively.
* Augmented the training images to improve the model's performance.
* Plotted sample images from each set and saved the class indices to a pickle file.
* Installed Keras Tuner for hyperparameter optimisation.
* Ran a hyperparameter search to find the best hyperparameters and retrained the hypermodel with these hyperparameters before saving the model to outputs.
* Plotted the learning curves to better understand the model's loss and accuracy.
* Evaluated the model on the test set and saved the evaluation to a pickle file.
* Generated a confusion matrix for the model's predictions.
* Created two classification reports to further show the model performance.
* Lastly, tested the model on a random image and reported its predicted class and probability.