# **Modeling and Evaluation Notebook**

## Objectives

* Answer business requirement 2:
  * The client is interested in predicting if a cherry leaf is healthy or contains powdery mildew.

## Inputs

* inputs/cherry-leaves_dataset/cherry-leaves/train
* inputs/cherry-leaves_dataset/cherry-leaves/test
* inputs/cherry-leaves_dataset/cherry-leaves/validation
* image shape embeddings (pickle file)

## Outputs

* Image distribution plot in train, validation, and test set
  * Label distribution in a bar chart
  * Dataset distribution in a pie chart
* Image augmentation
* Class indices to map prediction output to labels
* TensorFlow convolutional network machine learning model
* Hyperparameter optimisation pipeline
* Model trained on best hyperparameters as generated by Keras Tuner
* Save trained model
* Learning curve plot for model performance
* Model evaluation saved as a pickle file
* Create a confusion matrix report
* Calculate classification report (Model A)
* Calculate classification report focusing on precision, recall, f1-score and accuracy (Model B)
* Prediction on a random image file

## Comments/Insights/Conclusions

* The same data was plotted in different versions to accommodate possible client requests for further data understanding, and saved in outputs V1 - V9.
* The CNN was built to maximize accuracy while minimizing loss and training time; various models and hyperparameters were tried.
* The CNN was kept as small as possible without compromising accuracy, while avoiding overfitting and underfitting.




---

## Import packages

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.image import imread

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Set input directories

Set train, validation and test paths

In [None]:
my_data_dir = 'inputs/cherry-leaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

## Set output directory

In [None]:
version = 'v2'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(current_dir) and version in os.listdir(current_dir + '/outputs'):
  print('This version already exists. Create a new version.')
else:
  os.makedirs(name=file_path)

## Set labels

In [None]:
labels = os.listdir(train_path)

print(
    f"Project Labels: {labels}"
    )

## Set image shape

In [None]:
import joblib
version = 'v2'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

---

# Number of images in train, test and validation data

Count number of images per label & set

In [None]:
rows = []
for folder in ['train', 'validation', 'test']:
  for label in labels:
    # Count the image files in each set/label folder
    n_images = len(os.listdir(my_data_dir + '/' + folder + '/' + label))
    rows.append({'Set': folder, 'Label': label, 'Frequency': n_images})
    print(f"* {folder} - {label}: {n_images} images")

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_freq = pd.DataFrame(rows)

print(df_freq)
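
The counting logic can be checked in isolation. The sketch below builds a throwaway directory tree in a temporary folder and counts the files per set and label. The helper name `count_images`, the folder layout and the file counts are all made up for the example and are not the real dataset figures.

```python
import os
import tempfile

def count_images(data_dir, sets, labels):
    """Return one row per (set, label) with the number of files found."""
    rows = []
    for folder in sets:
        for label in labels:
            n_files = len(os.listdir(os.path.join(data_dir, folder, label)))
            rows.append({'Set': folder, 'Label': label, 'Frequency': n_files})
    return rows

# Hypothetical layout: 2 sets x 2 labels, with invented file counts.
with tempfile.TemporaryDirectory() as tmp:
    layout = {('train', 'healthy'): 3, ('train', 'powdery_mildew'): 2,
              ('test', 'healthy'): 1, ('test', 'powdery_mildew'): 1}
    for (folder, label), n in layout.items():
        target = os.path.join(tmp, folder, label)
        os.makedirs(target)
        for i in range(n):
            # Create empty placeholder files standing in for images
            open(os.path.join(target, f'img_{i}.jpg'), 'w').close()

    rows = count_images(tmp, ['train', 'test'], ['healthy', 'powdery_mildew'])

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get one row per set and label.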

## Bar chart for label distribution

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
ax = sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height, f'{height}', ha='center', va='bottom')

plt.legend(loc='upper right')
plt.title('Cherry leaves Label distribution')
plt.savefig(f'{file_path}/labels_distribution.png', bbox_inches='tight', dpi=150)
plt.show()

## Pie chart for dataset distribution

In [None]:
plt.figure(figsize=(8,5))
set_labels = df_freq['Set'].unique()
colors = sns.color_palette('pastel')[:len(set_labels)]
explode = [0.1] * len(set_labels)

set_frequencies = []
for set_label in set_labels:
    set_frequencies.append(df_freq[df_freq['Set'] == set_label]['Frequency'].sum())

plt.pie(set_frequencies, labels=set_labels, colors=colors, explode=explode, autopct='%.0f%%')
plt.title('Cherry leaves dataset distribution')
plt.savefig(f'{file_path}/sets_distribution_pie.png',
            bbox_inches='tight', dpi=150)
plt.show()

----

# Image data augmentation

Our dataset contains a limited number of images, so we augment the training images to improve the model's performance.
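
To make the idea concrete, here is a minimal sketch of what two common augmentations, horizontal and vertical flips, do to pixel data. It uses a tiny nested list as a stand-in for an image and plain Python rather than `ImageDataGenerator`, so the helper names `flip_horizontal` and `flip_vertical` are illustrative assumptions, not part of Keras.

```python
# Illustrative only: a 2x3 "image" stored as rows of pixel values.
tiny_image = [
    [1, 2, 3],
    [4, 5, 6],
]

def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the order of the rows (top-to-bottom mirror)."""
    return [list(row) for row in img[::-1]]

print(flip_horizontal(tiny_image))  # [[3, 2, 1], [6, 5, 4]]
print(flip_vertical(tiny_image))    # [[4, 5, 6], [1, 2, 3]]
```

Each transformed copy keeps the label of the original image, so the model sees more variation without any new photographs being collected.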

## Import ImageDataGenerator from TensorFlow

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Initialize ImageDataGenerator

In [None]:
augmented_image_data = ImageDataGenerator(rotation_range=20,
                                   width_shift_range=0.10, 
                                   height_shift_range=0.10,
                                   shear_range=0.1,
                                   zoom_range=0.1,
                                   horizontal_flip=True,
                                   vertical_flip=True,
                                   fill_mode='nearest',
                                   rescale=1./255
                              )

## Augment training image dataset & set batch size

In [None]:
batch_size = 20
train_set = augmented_image_data.flow_from_directory(train_path,
                                              target_size=image_shape[:2],
                                              color_mode='rgb',
                                              batch_size=batch_size,
                                              class_mode='binary',
                                              shuffle=True
                                              )

train_set.class_indices

## Rescale validation image dataset

In [None]:
validation_set = ImageDataGenerator(rescale=1./255).flow_from_directory(val_path,
                                                          target_size=image_shape[:2],
                                                          color_mode='rgb',
                                                          batch_size=batch_size,
                                                          class_mode='binary',
                                                          shuffle=False
                                                          )

validation_set.class_indices

## Rescale test image dataset

In [None]:
test_set = ImageDataGenerator(rescale=1./255).flow_from_directory(test_path,
                                                    target_size=image_shape[:2],
                                                    color_mode='rgb',
                                                    batch_size=batch_size,
                                                    class_mode='binary',
                                                    shuffle=False
                                                    )

test_set.class_indices

## Plot augmented training image

In [None]:

for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(train_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()

## Plot validation image

In [None]:
for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(validation_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()

## Plot test image

In [None]:
for _ in range(3):
    plt.figure(figsize=(3, 3))
    img, label = next(test_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()

## Save class_indices

In [None]:
joblib.dump(value=train_set.class_indices ,
            filename=f"{file_path}/class_indices.pkl")

---

# Model creation

## Install and import the Keras Tuner for hyperparameter optimisation

In [None]:
!pip install keras-tuner

In [None]:
import tensorflow as tf
from tensorflow import keras
import keras_tuner as kt

## Import model packages

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras import regularizers

## Construct function to create hypermodel for hypertuning

In [None]:
def create_tf_model(hp):
    model = Sequential()

    # input_shape is only needed on the first layer of a Sequential model
    model.add(Conv2D(filters=32, kernel_size=(3,3), input_shape=image_shape, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())
    hp_units = hp.Int('units', min_value=128, max_value=256, step=32)
    model.add(Dense(units=hp_units, activation = 'relu'))

    model.add(Dropout(0.5))    
    model.add(Dense(1, activation='sigmoid'))
    
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-3, 1e-4])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss='binary_crossentropy',
                metrics=['accuracy'])
    
    return model
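
The search space defined by `hp.Int` and `hp.Choice` in the function above is small enough to enumerate by hand. The sketch below lists it with plain Python, relying on `max_value` being inclusive in Keras Tuner's `hp.Int`. The variable names are illustrative.

```python
from itertools import product

# Mirrors hp.Int('units', min_value=128, max_value=256, step=32), upper bound inclusive.
unit_options = list(range(128, 256 + 1, 32))
# Mirrors hp.Choice('learning_rate', values=[1e-3, 1e-4]).
learning_rate_options = [1e-3, 1e-4]

# Every (units, learning_rate) pair the tuner can sample from.
search_space = list(product(unit_options, learning_rate_options))

print(unit_options)       # [128, 160, 192, 224, 256]
print(len(search_space))  # 10
```

With only ten combinations, the tuner can cover most or all of the space; a wider search would need more values or additional hyperparameters.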

## Hyperparameter Tuning with Hyperband Algorithm

In [None]:
tuner = kt.Hyperband(create_tf_model,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory=file_path,
                     project_name='hypertuning')

## Early Stopping

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss',patience=3)
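
As a rough illustration of what `patience=3` means, the pure-Python sketch below stops once the monitored loss has failed to improve for three consecutive epochs. It is a simplification, not Keras's implementation (it ignores options such as `min_delta` and `restore_best_weights`), and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best value and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented validation losses: best value at epoch 3, then no improvement.
example_losses = [0.50, 0.30, 0.20, 0.25, 0.22, 0.21, 0.19]
print(early_stopping_epoch(example_losses))  # 6
```

In this example the loss at epoch 7 would have been a new best, which shows the trade-off: a small patience saves training time but can stop before a late improvement.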

## Run the hyperparameter search

In [None]:
tuner.search(train_set,
          epochs=25,
          steps_per_epoch = len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop],
          verbose=1
          )

best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

## Create model using defined best hyperparameters

In [None]:
model = tuner.hypermodel.build(best_hps)

## Model Summary

In [None]:
model.summary()

## Re-instantiate the hypermodel with best hyperparameters and retrain it

In [None]:
model = tuner.hypermodel.build(best_hps)

model.fit(train_set,
          epochs=25,
          steps_per_epoch = len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop],
          verbose=1
          )
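
The `steps_per_epoch` expression uses integer division, so it counts only full batches. The sketch below shows the difference between rounding down and rounding up, using a hypothetical image count that is not the real size of the training set. The function names are illustrative.

```python
import math

def steps_floor(n_images, batch_size):
    """Number of full batches, as in len(train_set.classes) // batch_size."""
    return n_images // batch_size

def steps_ceil(n_images, batch_size):
    """Number of batches needed to cover every image once."""
    return math.ceil(n_images / batch_size)

n_images = 1005  # hypothetical count, chosen to leave a remainder
batch_size = 20

print(steps_floor(n_images, batch_size))  # 50
print(steps_ceil(n_images, batch_size))   # 51
print(n_images % batch_size)              # 5
```

Rounding down, as the notebook does, keeps every step at the full batch size; rounding up would add one smaller final batch per epoch.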

## Save the model

In [None]:
model.save('outputs/v2/powdery_mildew_detection_model.h5')


----

# Model Performance

----

## Model learning curve

In [None]:
losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.savefig(f'{file_path}/model_training_losses.png', bbox_inches='tight', dpi=150)
plt.show()

print("\n")
losses[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.savefig(f'{file_path}/model_training_acc.png', bbox_inches='tight', dpi=150)
plt.show()

# Model Evaluation

## Load saved model

In [None]:
from tensorflow.keras.models import load_model
model = load_model('outputs/v2/powdery_mildew_detection_model.h5')

## Evaluate model on test set

In [None]:
evaluation = model.evaluate(test_set)
print("Model accuracy: {:.2f}%".format(evaluation[1] * 100))
print("Model Loss: ",evaluation[0])

## Save evaluation pickle

In [None]:
joblib.dump(value=evaluation ,
            filename=f"outputs/v2/evaluation.pkl")

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

pred = model.predict(test_set)
y_pred = np.concatenate(np.round(pred).astype(int))

cm = confusion_matrix(test_set.classes, y_pred)

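# flow_from_directory assigns class indices alphabetically by folder name,
# so index 0 is 'healthy' and index 1 is 'powdery_mildew'
target_names = ['Healthy', 'Powdery Mildew']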

fig, ax = plt.subplots(figsize=(7, 6))

sns.heatmap(cm, annot=True, fmt='g', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names,
            ax=ax)

ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')

plt.savefig(f'{file_path}/confusion_matrix.png', bbox_inches='tight', dpi=150)
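
For reference, the sketch below builds the same kind of 2x2 matrix by hand, following scikit-learn's convention that rows are actual classes and columns are predicted classes, ordered by class index. The labels and probabilities are invented, the helper name is illustrative, and index 0 is assumed to be the healthy class.

```python
def binary_confusion_matrix(y_true, y_prob, threshold=0.5):
    """Rows = actual class (0, 1); columns = predicted class (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        matrix[actual][predicted] += 1
    return matrix

# Invented example: 0 = healthy, 1 = powdery mildew (assumed mapping).
y_true_example = [0, 0, 0, 1, 1, 1]
y_prob_example = [0.1, 0.2, 0.7, 0.9, 0.8, 0.4]

cm_example = binary_confusion_matrix(y_true_example, y_prob_example)
print(cm_example)  # [[2, 1], [1, 2]]
```

Reading the result: the top-right cell counts healthy leaves flagged as infected, and the bottom-left cell counts infected leaves that were missed.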

## Show wrong class image

In [None]:
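# Class index 1 is powdery_mildew and 0 is healthy (alphabetical folder order),
# so this selects powdery mildew images that the model predicted as healthy
wrong_indices = np.where((test_set.classes == 1) & (y_pred == 0))[0]

wrong_images = [test_set.filenames[i] for i in wrong_indices]

print("Powdery Mildew images wrongly predicted as Healthy:")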
for image in wrong_images:
    print(image)

Display wrong class image from the confusion matrix

In [None]:
from PIL import Image

image_path = "inputs/cherry-leaves_dataset/cherry-leaves/test/powdery_mildew/4c756b73-5e7d-40ec-9b36-1866c49f2e43___FREC_Pwd.M 5156_flipLR.JPG"

image = Image.open(image_path)
image.show()

## Classification Report - A

In [None]:
from sklearn.metrics import classification_report

print('Classification Report:\n----------------------\n')
print(classification_report(test_set.classes, y_pred, 
      target_names=target_names))


## Classification Report - B

In [None]:
classification_rep = classification_report(test_set.classes, y_pred, target_names=target_names, output_dict=True)

print(classification_rep.keys())

target_classes = ['accuracy', 'Healthy', 'Powdery Mildew', 'macro avg', 'weighted avg']

metric_labels = ['precision', 'recall', 'f1-score']

data = np.zeros((len(target_classes), len(metric_labels)))
for i, class_label in enumerate(target_classes):
    if class_label == 'accuracy':
        data[i, :] = round(classification_rep[class_label], 3)
    else:
        for j, metric_label in enumerate(metric_labels):
            data[i, j] = round(classification_rep[class_label][metric_label], 3)

plt.figure(figsize=(8, 5))
sns.heatmap(data, annot=True, fmt='.3f', cmap="Blues", cbar=True, linewidths=1)
plt.xticks(np.arange(len(metric_labels)) + 0.5, metric_labels, rotation=0)
plt.yticks(np.arange(len(target_classes)) + 0.5, target_classes, rotation=0)
plt.xlabel('Metric')
plt.ylabel('Class')
plt.title('Classification Report')
plt.savefig(f'{file_path}/clf_report.png', bbox_inches='tight', dpi=150)
plt.show()
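
The per-class figures in the report follow directly from confusion-matrix counts. Here is a minimal sketch with invented counts, treating one class as the positive class; the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.8 0.8
```

Precision answers "of the leaves flagged as this class, how many really were?", while recall answers "of the leaves that really were this class, how many were found?".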

## Test prediction using a random image

In [None]:
import random
from tensorflow.keras.preprocessing import image

label = labels[0] # select 0 for healthy or 1 for infected

image_files = os.listdir(test_path + '/' + label)

random_image_file = random.choice(image_files)

pil_image = image.load_img(test_path + '/'+ label + '/'+ random_image_file,
                          target_size=image_shape[:2], color_mode='rgb')

print(f'Image shape: {pil_image.size}, Image mode: {pil_image.mode}')
pil_image

## Convert image to array and prepare for prediction

In [None]:
my_image = image.img_to_array(pil_image)
my_image = np.expand_dims(my_image, axis=0)/255
print(my_image.shape)

## Predict class probabilities

In [None]:
pred_proba = model.predict(my_image)[0,0]

target_map = {v: k for k, v in train_set.class_indices.items()}
pred_class = target_map[int(pred_proba > 0.5)]

if pred_class == target_map[0]:
    pred_proba = 1 - pred_proba

pred_percentage = round(pred_proba * 100, 2)

print(f"Label: {pred_class}")
print(f"Percentage: {pred_percentage}%")
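
The mapping from a sigmoid output to a label and a confidence can be isolated in a small helper. The sketch below assumes `class_indices` is `{'healthy': 0, 'powdery_mildew': 1}`; the function name `interpret_sigmoid` and the probability values are illustrative.

```python
def interpret_sigmoid(prob, class_indices, threshold=0.5):
    """Map a sigmoid output to (label, confidence in that label as a percentage)."""
    index_to_label = {index: label for label, index in class_indices.items()}
    predicted_index = int(prob > threshold)
    # The sigmoid output is the probability of class 1; for class 0 use its complement.
    confidence = prob if predicted_index == 1 else 1 - prob
    return index_to_label[predicted_index], round(confidence * 100, 2)

# Assumed mapping, matching alphabetical folder order.
example_indices = {'healthy': 0, 'powdery_mildew': 1}

print(interpret_sigmoid(0.97, example_indices))  # ('powdery_mildew', 97.0)
print(interpret_sigmoid(0.25, example_indices))  # ('healthy', 75.0)
```

Keeping this logic in one function makes it easier to reuse the same threshold and label mapping wherever predictions are reported.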

---

# Conclusion

* We analyzed the dataset and visualized the label distribution, providing insights into the dataset's composition.
* To improve the model's performance, we applied data augmentation techniques to expand our training dataset.
* Using Keras Tuner, we conducted a hyperparameter search to find the best combination of hyperparameters for our model.
* The model was trained using the best hyperparameters, and the resulting model was saved for future use.
* During the training process, we monitored the loss and accuracy, which allowed us to track the model's learning curve. The final model achieved an accuracy of 99.88% and a loss of 0.0041 on the test set.
* We evaluated the model using a confusion matrix, which provided insights into the model's performance across different classes.
* Additionally, we generated two classification reports, which provided detailed metrics such as precision, recall, and F1-score for each class.
* Based on the evaluation results, our model demonstrates excellent performance in detecting powdery mildew on cherry leaves, with a high accuracy of 99.88% on the test set. This meets the performance requirement specified in the business case.

By incorporating these evaluation metrics, we can confidently state that our model achieves the desired performance level and can be utilized effectively for the task of powdery mildew detection.