# **Modelling and Evaluation**

---

## Objectives

* Answer Business Requirement 2:
    * Develop a machine learning model for automating image categorization, leveraging CNN architecture for efficient and scalable classification.

* Answer Business Requirement 3:
    * Assess model performance using accuracy, precision, recall, F1-score, and a confusion matrix to ensure high accuracy and effectiveness in predictions.

## Inputs

* inputs/cifar10_dataset_small/train
* inputs/cifar10_dataset_small/validation
* inputs/cifar10_dataset_small/test
* Image shape embedding (`outputs/v1/image_shape.pkl`)

## Outputs

* Class distribution plots for training, validation, and test sets.
* Image augmentation
* Development and training of the machine learning model
* Learning curve plot for model performance.
* Model evaluation saved as a pickle file.

---

## Import packages and libraries

In [1]:

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from matplotlib.image import imread
import pickle


## Change and Set directories

We need to change the working directory from the current folder to its parent folder.


In [None]:
current_dir = os.getcwd()
print('Current folder: ' + current_dir)
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('New folder: ' + current_dir)

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

### Input directories and paths

In [None]:
dataset_root_dir = 'inputs/cifar10_dataset_small'
train_path = dataset_root_dir + '/train'
validation_path = dataset_root_dir + '/validation'
test_path = dataset_root_dir + '/test'
train_path

### Set output directory

In [None]:
version = 'v1.5'
file_path = f'outputs/{version}'

if os.path.isdir(file_path):
    print(f'Version {version} is already available.')
else:
    os.makedirs(name=file_path)
    print(f'New directory for version {version} has been created')
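
A reusable variant of the check above can lean on `exist_ok=True`, which makes directory creation safe to repeat. This is only a sketch: the helper name `ensure_output_dir` and its `base_dir` parameter are assumptions for illustration and are not part of the notebook.

```python
import os


def ensure_output_dir(version, base_dir="outputs"):
    """Create base_dir/version if it is missing and return (path, created)."""
    path = os.path.join(base_dir, version)
    created = not os.path.isdir(path)
    # exist_ok=True means repeated notebook runs do not raise an error
    os.makedirs(path, exist_ok=True)
    return path, created
```

Calling `ensure_output_dir('v1.5')` twice would return the same path both times, with `created` set to `True` only on the first call.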

### Set label names

In [None]:
labels = os.listdir(train_path)
labels.sort()
print("Class names:", labels)

### Set image shape

In [None]:
# Use a separate name so the output version set above is not overwritten
image_shape_version = 'v1'
image_shape = joblib.load(filename=f"outputs/{image_shape_version}/image_shape.pkl")
image_shape

---

## Image Distribution in Train, Test and Validation Data

---

Let's recap the class distribution plot from the previous notebook:

In [None]:
def count_images_in_path(path):
    """
    Counts the number of images in each class folder within the given path.

    Args:
        path (str): The directory path containing subfolders for each class.

    Returns:
        dict: A dictionary where keys are class labels (subfolder names) and values are the number of images in each class.
    
    """
    class_counts = {}
    for label in labels:
        label_path = os.path.join(path, label)
        class_counts[label] = len(os.listdir(label_path))
    return class_counts

# Count images in datasets
train_counts = count_images_in_path(train_path)
validation_counts = count_images_in_path(validation_path)
test_counts = count_images_in_path(test_path)

# Convert to DataFrame for plotting
train_df = pd.DataFrame(list(train_counts.items()), columns=['Class', 'Train'])
validation_df = pd.DataFrame(list(validation_counts.items()), columns=['Class', 'Validation'])
test_df = pd.DataFrame(list(test_counts.items()), columns=['Class', 'Test'])

# Merge dataframes for visualization
df = pd.merge(train_df, validation_df, on='Class')
df = pd.merge(df, test_df, on='Class')

df.set_index('Class').plot(kind='bar', figsize=(12, 6))
plt.ylabel('Number of Images')
plt.title('Number of Images per Class in Train, Validation, and Test Sets')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

---

## Image Data Augmentation

---

### Initialize image data generator

In [9]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Set this to False to skip augmentation for the first model training
use_augmentation = True

if use_augmentation:
    # Use image augmentation
    augmented_image_data = ImageDataGenerator(rotation_range=20,
                                              width_shift_range=0.10,
                                              height_shift_range=0.10,
                                              shear_range=0.1,
                                              zoom_range=0.1,
                                              horizontal_flip=True,
                                              vertical_flip=True,
                                              fill_mode='nearest',
                                              rescale=1./255)
else:
    # Only normalize the images, no augmentation
    augmented_image_data = ImageDataGenerator(rescale=1./255)
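
To make two of these settings concrete, the sketch below applies `rescale=1./255` and a horizontal flip to a tiny hand-written image. It uses plain Python lists rather than Keras, and the helper names and pixel values are illustrative assumptions. `ImageDataGenerator` applies such transforms randomly per batch, together with the rotation, shift, shear and zoom settings.

```python
def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by `factor`, mapping 0-255 values into 0-1."""
    return [[pixel * factor for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]


# Hypothetical 2x3 grayscale image with 8-bit pixel values
tiny_image = [[0, 128, 255],
              [255, 128, 0]]

flipped = horizontal_flip(tiny_image)
normalized = rescale(flipped)
print(flipped)
print(normalized)
```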

### Augment the training set and rescale the validation and test sets

In [None]:
batch_size = 20

# Prepare the training set
train_set = augmented_image_data.flow_from_directory(train_path,
                                                     target_size=image_shape[:2],
                                                     color_mode='rgb',
                                                     batch_size=batch_size,
                                                     class_mode='categorical',
                                                     shuffle=True
                                                     )

# Validation and test sets are only rescaled, never augmented
validation_set = ImageDataGenerator(rescale=1./255).flow_from_directory(validation_path,
                                                                        target_size=image_shape[:2],
                                                                        color_mode='rgb',
                                                                        batch_size=batch_size,
                                                                        class_mode='categorical',
                                                                        shuffle=False
                                                                        )

test_set = ImageDataGenerator(rescale=1./255).flow_from_directory(test_path,
                                                                  target_size=image_shape[:2],
                                                                  color_mode='rgb',
                                                                  batch_size=batch_size,
                                                                  class_mode='categorical',
                                                                  shuffle=False
                                                                  )

### Plot augmented training image

In [None]:
# Training set
for _ in range(3):
    # The built-in next() is portable across Keras versions
    img, label = next(train_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()

### Plot rescaled validation and test images

In [None]:
# Validation set
for _ in range(3):
    img, label = next(validation_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()

# Test set
for _ in range(3):
    img, label = next(test_set)
    print(img.shape)
    plt.imshow(img[0])
    plt.show()


### Save class indices

In [None]:
joblib.dump(value=train_set.class_indices,
            filename=f"{file_path}/class_indices.pkl")

---

## Model Creation

---

### ML model

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D, BatchNormalization

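def create_tf_model():
    """Build and compile a small CNN: three Conv2D/MaxPooling2D blocks followed by a dense classifier head."""
    model = Sequential()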

    model.add(Conv2D(filters=32, kernel_size=(3, 3),
              input_shape=image_shape, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3, 3),
              activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3, 3),
              activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())
    model.add(Dense(128, activation='relu'))

    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model


### Summary

In [None]:

create_tf_model().summary()
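
The parameter count reported by `summary()` can be cross-checked by hand. The sketch below assumes a 32×32×3 input (the usual CIFAR-10 image shape; the notebook loads the real value from `image_shape.pkl`) and the Keras defaults of `'valid'` padding with stride 1 for `Conv2D`. The helper names are illustrative. Under those assumptions the total comes to 90,506, the figure quoted in the conclusions.

```python
def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units


# Assumed input shape for CIFAR-10 images (square, 3 colour channels)
size, in_channels = 32, 3

total = 0
for filters in (32, 64, 64):
    total += conv2d_params(3, in_channels, filters)
    # A 3x3 'valid' convolution trims 2 pixels; 2x2 max pooling halves (floor)
    size = (size - 2) // 2
    in_channels = filters

flattened = size * size * in_channels
total += dense_params(flattened, 128)
total += dense_params(128, 10)

print(f"Flattened features: {flattened}")
print(f"Trainable parameters: {total:,}")
print(f"Approximate float32 size: {total * 4:,} bytes")
```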


### Early Stopping

In [16]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=3)
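
`EarlyStopping(monitor='val_loss', patience=3)` halts training once the validation loss has failed to improve for three consecutive epochs. The loop below is a simplified pure-Python sketch of that rule, not the Keras implementation: it ignores options such as `min_delta` and `restore_best_weights`, the function name is hypothetical, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best loss and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Invented validation losses: the best value (0.80) occurs at epoch 3
stop_epoch = early_stopping_epoch([1.00, 0.90, 0.80, 0.85, 0.82, 0.81, 0.70])
print(stop_epoch)
```

In this toy run, epochs 4 to 6 fail to beat the best loss from epoch 3, so training would stop after epoch 6 and never reach the lower loss at epoch 7. That is the trade-off a small `patience` value carries.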

### Fit Model

In [None]:
model = create_tf_model()
model.fit(train_set,
          epochs=25,
          steps_per_epoch=len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop],
          verbose=1
          )
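
The floor division in `steps_per_epoch` yields one step fewer than the number of batches in the generator whenever the image count is not a multiple of `batch_size`. A small sketch of the difference between floor and ceiling division follows. The image count of 5,010 is a made-up figure for illustration, not the size of this dataset, and the function name is an assumption.

```python
import math


def steps_for_epoch(n_images, batch_size, include_partial_batch=False):
    """Number of batches drawn per epoch under floor or ceiling division."""
    if include_partial_batch:
        return math.ceil(n_images / batch_size)
    return n_images // batch_size


floor_steps = steps_for_epoch(5010, 20)
ceil_steps = steps_for_epoch(5010, 20, include_partial_batch=True)
print(floor_steps, ceil_steps)
```

When the image count divides evenly by the batch size, both variants give the same result.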

### Save Model

In [18]:
model.save(f'{file_path}/snapsort_model.h5')

---

## Model Performance

---

### Model Learning Curve

In [None]:

losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss', 'val_loss']].plot(style='.-')
plt.title("Loss")
plt.savefig(f'{file_path}/model_training_losses.png',
            bbox_inches='tight', dpi=150)
plt.show()

print("\n")
losses[['accuracy', 'val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.savefig(f'{file_path}/model_training_acc.png',
            bbox_inches='tight', dpi=150)
plt.show()

### Model Evaluation

#### Load model

In [20]:
from tensorflow.keras.models import load_model
model = load_model(f'{file_path}/snapsort_model.h5')

#### Evaluate on the test set

In [None]:
evaluation = model.evaluate(test_set)
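
Business Requirement 3 also names precision, recall, F1-score and a confusion matrix, which `model.evaluate` does not report. A minimal sketch of those metrics follows, written in plain Python on toy labels. In the notebook, the true labels could plausibly come from `test_set.classes` and the predictions from `np.argmax(model.predict(test_set), axis=1)`, which line up because the test generator uses `shuffle=False`. That wiring is an assumption and is not executed here.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_metrics(matrix):
    """Precision, recall and F1 for each class of a square confusion matrix."""
    n = len(matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, actually other
        fn = sum(matrix[c]) - tp                       # actually c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Toy labels for three classes (stand-ins for real test-set labels)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
report = per_class_metrics(cm)
for class_index, scores in enumerate(report):
    print(class_index, {name: round(value, 2) for name, value in scores.items()})
```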

#### Save Evaluation

In [None]:
joblib.dump(value=evaluation,
            filename=f'{file_path}/evaluation.pkl')
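
With `metrics=['accuracy']`, `model.evaluate` returns a list `[loss, accuracy]`, so the saved file holds two unlabeled numbers. One option, sketched below with the standard-library `pickle` module rather than `joblib`, is to store them under explicit keys. The file name `evaluation_named.pkl` is hypothetical, and the values are the v1.5 test results reported in the conclusions.

```python
import os
import pickle
import tempfile

# Stand-in for model.evaluate(test_set), which returns [loss, accuracy]
evaluation = [1.5921, 0.4090]

named_evaluation = dict(zip(["loss", "accuracy"], evaluation))

# Written to a temporary folder here; the notebook would use file_path instead
output_dir = tempfile.mkdtemp()
output_file = os.path.join(output_dir, "evaluation_named.pkl")
with open(output_file, "wb") as handle:
    pickle.dump(named_evaluation, handle)

with open(output_file, "rb") as handle:
    restored = pickle.load(handle)

print(f"Test loss: {restored['loss']:.4f}, "
      f"test accuracy: {restored['accuracy']:.2%}")
```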

### Predict On New Data

#### Load a sample image as PIL

In [None]:
from tensorflow.keras.preprocessing import image

pointer = 66
label = labels[0]

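# Sort the file list so the same pointer always selects the same image
image_files = sorted(os.listdir(os.path.join(test_path, label)))
pil_image = image.load_img(os.path.join(test_path, label, image_files[pointer]),
                           target_size=image_shape[:2], color_mode='rgb')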
print(f'Image shape: {pil_image.size}, Image mode: {pil_image.mode}')
pil_image

#### Convert image to array

In [None]:
my_image = image.img_to_array(pil_image)
my_image = np.expand_dims(my_image, axis=0)/255
print(my_image.shape)

#### Predict class for the image

In [None]:
# Predict probabilities
pred_proba = model.predict(my_image)[0]

# Map indices to class names
target_map = {v: k for k, v in train_set.class_indices.items()}

# Get the index of the class with the highest probability
predicted_class_index = np.argmax(pred_proba)
pred_class = target_map[predicted_class_index]

print("Predicted Probabilities:", pred_proba)
print("Predicted Class:", pred_class)

fig, axs = plt.subplots(2, 1, figsize=(7, 6), gridspec_kw={'height_ratios': [3, 1]})

# Display the input image
axs[0].imshow(pil_image)
axs[0].set_title('Input Image')
axs[0].axis('off')

# Plot the prediction probabilities
axs[1].bar(range(len(labels)), pred_proba, color='skyblue')
axs[1].set_title('Prediction Probabilities')
axs[1].set_xlabel('Classes')
axs[1].set_ylabel('Probability')

# Show all class labels
axs[1].set_xticks(range(len(labels)))
axs[1].set_xticklabels(labels, rotation=90)

# Add the probability value next to the bar for the predicted class
axs[1].text(predicted_class_index, pred_proba[predicted_class_index] + 0.01, 
            f'{pred_proba[predicted_class_index]:.2f}', ha='center')

plt.tight_layout()
plt.show()
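
The index-to-name mapping used above can be extended to rank several candidate classes instead of only the top one. The sketch below is plain Python: `top_k_predictions` is a hypothetical helper, and the three-class `class_indices` and probabilities are made-up stand-ins for `train_set.class_indices` and `pred_proba`.

```python
def top_k_predictions(probabilities, class_indices, k=3):
    """Return the k most probable (class_name, probability) pairs."""
    index_to_name = {index: name for name, index in class_indices.items()}
    ranked = sorted(enumerate(probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [(index_to_name[index], prob) for index, prob in ranked[:k]]


# Made-up stand-ins for train_set.class_indices and pred_proba
class_indices = {"airplane": 0, "automobile": 1, "bird": 2}
pred_proba = [0.2, 0.7, 0.1]

top_two = top_k_predictions(pred_proba, class_indices, k=2)
print(top_two)
```

Looking at the runner-up classes can be informative when overall accuracy is low, since it shows which classes the model tends to confuse.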


---

## Push Files To Repo

Add to gitignore:

View changed files:

In [None]:
!git status

Add, commit and push your files to the repo (all or single files):

In [None]:
!git add .

!git commit -m "feat: Train and evaluate v1.5 model" -m "apply augmented images and decrease accuracy of model"

!git push

---

## Conclusions

---

**Model v1:**

**Model Performance on Test Set:**
- **Final Test Accuracy:** Approximately X.XX% *(Insert actual test accuracy here)*
- **Final Test Loss:** X.XXXX *(Insert actual test loss here)*

**Summary:**
- **Training Accuracy (after 18 epochs):** Approximately 73.54%
- **Validation Accuracy (after 18 epochs):** Approximately 56.00%
- **Test Accuracy:** Approximately X.XX% 

**Training Time:**
- **Time per Epoch:** Approximately 5 seconds
- **Total Time for 18 Epochs:** 90 seconds (or about 1.5 minutes)

**Model Architecture:**
- **Conv2D Layers:** Filters: 32 → 64 → 64

**Model Summary:**
- **Trainable Parameters:** 90,506
- **Model Size:** 
  - 90,506 params × 4 bytes/param = 362,024 bytes ≈ 0.36 MB

**Model v1.5:** (With augmented images)

**Model Performance on Test Set:**
- **Final Test Accuracy:** Approximately 40.90%
- **Final Test Loss:** 1.5921

**Summary:**
- **Training Accuracy (after 15 epochs):** Approximately 39.60%
- **Validation Accuracy (after 15 epochs):** Approximately 39.40%
- **Test Accuracy:** Approximately 40.90%

**Training Time:**
- **Time per Epoch:** Approximately 5 seconds
- **Total Time for 15 Epochs:** 75 seconds (or about 1.25 minutes)

**Model Architecture:**
- **Conv2D Layers:** Filters: 32 → 64 → 64

**Model Summary:**
- **Trainable Parameters:** 90,506
- **Model Size:** 
  - 90,506 params × 4 bytes/param = 362,024 bytes ≈ 0.36 MB