---
# Model T - Transfer Learning, Data Augmentation, Fine-Tuning, Adam
- 224 x 224 x 3 Image size.  
- **256 Batch size**.  
- Horizontal Flip, Zoom, Rotation, Contrast and Brightness **Data augmentation**.  
- Adaptive Moment Estimation **(Adam) optimizer**.  
- **0.001 Initial Learning rate**.   
- 7 x 7 x 512 Tensor before flatten.  
- **40 Epochs** to train the classifier. 
- **40 Epochs** to train the model. 

---
#### Imports and Setup

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
print(f'TensorFlow version: {tf.__version__}')
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(3)
import matplotlib.pyplot as plt
import pickle
import numpy as np
from tensorflow.keras.utils import image_dataset_from_directory
from tensorflow import keras
from tensorflow.keras import callbacks, layers, optimizers, models
from tensorflow.keras import regularizers
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
from sklearn.preprocessing import label_binarize
from itertools import cycle
from PIL import Image

---
#### Group Datasets

In [None]:
train_dirs = ['../data/train1_resized_224', '../data/train3_resized_224', '../data/train4_resized_224', '../data/train5_resized_224']
validation_dir = '../data/train2_resized_224'
test_dir = '../data/test_resized_224'

- ((2221985 + 2221986) % 5) + 1 = 2  
- **Validation set: train2**.  
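
The fold arithmetic above can be re-checked with a few lines of Python. This is only a sketch; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Re-derive the validation fold index from the two numbers used above.
first_number = 2221985
second_number = 2221986

# The modulo maps the sum to 0..4; adding 1 shifts it to the folder range 1..5.
validation_fold = ((first_number + second_number) % 5) + 1

all_folds = [1, 2, 3, 4, 5]
train_folds = [fold for fold in all_folds if fold != validation_fold]

print(f"validation fold: train{validation_fold}")
print(f"training folds: {train_folds}")
```

The remaining folds match the four directories listed in `train_dirs`.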

---
####  Count Images in Categories

In [None]:
def count_images_in_categories(directory):
    categories = os.listdir(directory)
    category_counts = {}
    for category in categories:
        category_counts[category] = len(os.listdir(os.path.join(directory, category)))
    return category_counts

train_counts_each_dir = [count_images_in_categories(train_dir) for train_dir in train_dirs]
validation_counts = count_images_in_categories(validation_dir)
test_counts = count_images_in_categories(test_dir)
train_counts = {category: sum([count.get(category, 0) for count in train_counts_each_dir]) for category in train_counts_each_dir[0]}

def plot_statistics(dataset_name, category_counts, color):
    categories = list(category_counts.keys())
    counts = list(category_counts.values())
    num_categories = len(categories)
    plt.figure(figsize=(8, 6))
    bars = plt.barh(range(num_categories), counts, color=color, alpha=1)

    for bar, count in zip(bars, counts):
        plt.text(bar.get_width() - 5, bar.get_y() + bar.get_height()/2, str(count), va='center', ha='right', color='white', fontweight='bold')

    plt.ylabel('Categories')
    plt.xlabel('Number of Images')
    plt.yticks(range(num_categories), categories)
    plt.title(f'Distribution of Images in {dataset_name} Dataset')
    plt.tight_layout()
    plt.show()

plot_statistics('Training Set', train_counts, 'blue')
plot_statistics('Validation Set', validation_counts, 'purple')
plot_statistics('Test Set', test_counts, 'red')

- We count the images of each category in each folder and plot the distribution.  
- We see that there are **minor deviations** in the number of samples of each category in the train dataset and slightly **larger ones in the validation dataset**.  

---
#### Create Datasets

In [None]:
IMG_SIZE = 224
BATCH_SIZE = 256
NUM_CLASSES = len(train_counts)

train_datasets = [image_dataset_from_directory(directory, image_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE) for directory in train_dirs]

train_dataset = train_datasets[0]
for dataset in train_datasets[1:]:
    train_dataset = train_dataset.concatenate(dataset)

train_dataset = train_dataset.shuffle(buffer_size=1000).prefetch(buffer_size=tf.data.AUTOTUNE)
validation_dataset = image_dataset_from_directory(validation_dir, image_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE).prefetch(buffer_size=tf.data.AUTOTUNE)
test_dataset = image_dataset_from_directory(test_dir, image_size=(IMG_SIZE, IMG_SIZE), batch_size=BATCH_SIZE).prefetch(buffer_size=tf.data.AUTOTUNE)

class_names = train_datasets[0].class_names

- We define an image size of 224 x 224 x 3 and a batch size of 256, and keep a list with the class names.  
- We create the train dataset by concatenating the datasets of the four training folders, **shuffle** the order of its batches before each epoch and **prefetch** them to memory.  
- We also prefetch the validation and test datasets, but add no extra shuffle step because their order does not affect the evaluation.

---
#### Dataset Analysis

In [None]:
for data_batch, labels_batch in train_dataset.take(1):
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)

- We have batches of 256 images, 224 by 224 pixels with 3 channels (RGB).  
- We also have batches of 256 labels, one for each image. 

---
#### Data Augmentation

In [None]:
data_augmentation = keras.Sequential(
    [
        # We tried more techniques but the model didn't perform well
        keras.layers.RandomFlip("horizontal"),
        # keras.layers.RandomTranslation(0.1, 0.1),
        keras.layers.RandomContrast(0.1),
        keras.layers.RandomBrightness(0.1),
        keras.layers.RandomRotation(0.05),
        keras.layers.RandomZoom(0.1),
    ]
)

> We define a data augmentation pipeline with horizontal flip, contrast, brightness, rotation and zoom (random translation was tried but is left commented out).
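
To get a feeling for the rotation factor, the sketch below converts it to an angle. It assumes the Keras convention that `RandomRotation` interprets its factor as a fraction of a full turn (2π); the constant and variable names are illustrative.

```python
import math

ROTATION_FACTOR = 0.05  # value passed to RandomRotation in the cell above

# Assumption: the factor is a fraction of a full turn, so the sampled angle
# lies in [-factor * 360, +factor * 360] degrees.
max_angle_degrees = ROTATION_FACTOR * 360
max_angle_radians = ROTATION_FACTOR * 2 * math.pi

print(f"rotation range: +/-{max_angle_degrees:.0f} degrees "
      f"(+/-{max_angle_radians:.3f} rad)")
```

Under that assumption, a factor of 0.05 allows rotations of up to about 18 degrees in either direction.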

---
#### Augmented Dataset Visualization

In [None]:
for images, labels in train_dataset.take(1):
    # Augment the batch once instead of once per displayed image
    augmented_images = data_augmentation(images)
    plt.figure(figsize=(128, 128))
    for i in range(len(images)):
        ax = plt.subplot(16, 16, i + 1)
        plt.imshow(augmented_images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i].numpy()])
        plt.axis("off")
    # Adjust the layout once, after all subplots have been drawn
    plt.tight_layout()
    plt.show()

- We plot one batch of augmented images from the train dataset along with their respective labels.  
- We see that the images are of different categories and are **224 x 224** pixels in size, although the resizing has **not added any detail**.

---
#### Loading the VGG16 Model

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16
conv_base = VGG16(weights='imagenet', include_top=False)
conv_base.trainable = False

> We load the VGG16 model with the imagenet weights and without the top layer.  
> Setting trainable to False empties the list of trainable weights of the layer or model.  

---
#### Dense Network Architecture

In [None]:
inputs = keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = data_augmentation(inputs)
x = keras.applications.vgg16.preprocess_input(x)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(256, activation="relu", kernel_regularizer=regularizers.L1L2(0.0001, 0.001))(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax", kernel_regularizer=regularizers.L1L2(0.0001, 0.001))(x)
model = models.Model(inputs=inputs, outputs=outputs)
model.summary()

- Input size of 224 x 224 x 3.  
- Data augmentation (Horizontal Flip, Zoom (±10%), Rotation (±5% of a full turn, about ±18°), Contrast (±10%), Brightness (±10%)).  
- Preprocess images to fit VGG16 convolutional layers.
- VGG16 convolutional layers.
- Flatten the 7 x 7 x 512 tensor into a vector of 25088 values.  
- Dropout of 0.5 on the flattened features, before the dense layers.  
- A dense layer with 256 units followed by a dense output layer with 10 units, one for each category.  
- L1 regularization of 0.0001 and L2 regularization of 0.001 on both dense layers.  
- ReLU activation function on the 256-unit dense layer.  
- Softmax activation function on the output layer for multiclass classification.  
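
The shapes and sizes listed above can be checked with a short calculation. This is a sketch that assumes the standard VGG16 layout (five 2x2 max-pooling stages, 512 filters in the last block) and the 10 categories mentioned above; the constant names are illustrative.

```python
IMG_SIZE = 224
NUM_POOLING_STAGES = 5      # assumption: VGG16 has five 2x2 max-pooling layers
LAST_BLOCK_FILTERS = 512
DENSE_UNITS = 256
NUM_CLASSES = 10

# Each pooling stage halves the spatial resolution: 224 / 2**5 = 7.
feature_map_side = IMG_SIZE // (2 ** NUM_POOLING_STAGES)
flattened_size = feature_map_side * feature_map_side * LAST_BLOCK_FILTERS

# A dense layer has (inputs * units) weights plus one bias per unit.
hidden_params = flattened_size * DENSE_UNITS + DENSE_UNITS
output_params = DENSE_UNITS * NUM_CLASSES + NUM_CLASSES

print(f"conv output: {feature_map_side} x {feature_map_side} x {LAST_BLOCK_FILTERS}")
print(f"flattened size: {flattened_size}")
print(f"classifier parameters: {hidden_params + output_params:,}")
```

Almost all classifier parameters sit in the first dense layer, which is one reason dropout and regularization are applied there.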

---
#### Model Compilation

In [None]:
initial_learning_rate = 0.001
optimizer = optimizers.Adam(learning_rate=initial_learning_rate)
loss_function = keras.losses.SparseCategoricalCrossentropy()

lr_scheduler = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True, verbose=1)
save_best_model = callbacks.ModelCheckpoint(filepath='../models/04_model_t_data_augm_fine_tune_adam_1st_model.h5', save_best_only=True, monitor='val_loss', verbose=1)

callbacks = [lr_scheduler, early_stopping, save_best_model]

model.compile(optimizer=optimizer,
              loss=loss_function,
              metrics=['accuracy'])

- Adam as the optimizer for this model with an initial learning rate of 0.001.  
- Sparse categorical cross entropy as the loss function.  
- Learning rate scheduler that multiplies the learning rate by 0.1 when the validation loss plateaus (patience of 3).  
- Early stopping based on validation loss (training stops when the validation loss doesn't improve for 6 consecutive epochs, i.e. patience of 6), restoring the best weights.  
- Checkpoints to save the best model after each epoch based on validation loss.
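
To illustrate how the two patience values interact, here is a simplified simulation of the plateau logic in plain Python. It is not the Keras implementation: it shares one "best" value between both mechanisms and ignores options such as `min_delta`, `cooldown` and `min_lr`. The validation-loss sequence is made up for the example.

```python
import math

def simulate_callbacks(val_losses, lr=0.001, factor=0.1,
                       lr_patience=3, stop_patience=6):
    """Simplified plateau logic: returns (best_epoch, per-epoch log)."""
    best = float("inf")
    best_epoch = 0
    lr_wait = 0      # epochs without improvement since the last LR reduction
    stop_wait = 0    # epochs without improvement since the best epoch
    log = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor          # reduce the learning rate on plateau
                lr_wait = 0
        log.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break                     # early stopping
    return best_epoch, log

# Hypothetical validation losses: improvement stops after epoch 3.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.76, 0.77]
best_epoch, log = simulate_callbacks(losses)

for epoch, loss, lr in log:
    print(f"epoch {epoch}: val_loss={loss:.2f}, lr={lr:.0e}")
print(f"best epoch: {best_epoch}, stopped after epoch {log[-1][0]}")
```

In this toy run the learning rate is reduced twice (after epochs 6 and 9) before early stopping ends training, and the weights from epoch 3 would be the ones restored.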

---
#### Model Training
- Training the classifier for up to 40 epochs (early stopping may end training sooner). 

In [None]:
history = model.fit(train_dataset,
                    validation_data=validation_dataset,
                    epochs=40,
                    callbacks=callbacks)

---
#### Save Model History

In [None]:
with open("../history/03_model_t_data_augm_fine_tune_adam_1st_model.pkl", "wb") as file:
    pickle.dump(history.history, file)

---
#### Model Evaluation

In [None]:
val_loss, val_acc = model.evaluate(validation_dataset)
print(f'Model Validation Loss: {val_loss:.4f}')  # the loss is not a percentage
print(f'Model Validation Accuracy: {val_acc:.2%}')

---
#### Model Training Visualization

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.tight_layout()
plt.show()

- We see that the model begins overfitting after xx epochs.  
- The validation accuracy stops improving significantly after the xxth epoch while the training accuracy keeps improving.  
- The validation loss stops improving significantly after the xxth epoch while the training loss keeps improving. 
- The validation loss improves slightly when the learning rate changes to 0.00001 on the xxth epoch
- However, the best model is saved on the xxth epoch.

---
#### Transfer Learning - Fine Tuning

In [None]:
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False

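- We unfreeze the last 4 layers of the convolutional base (the three convolutional layers of block 5 and its pooling layer, which has no weights) and keep all earlier layers frozen.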
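
The effect of the slicing can be seen without loading the network. The sketch below uses a plain Python stand-in class (`FakeLayer`, hypothetical) instead of real Keras layers, and assumes the usual VGG16 layer names for `include_top=False`; the exact name of the input layer varies between versions.

```python
from dataclasses import dataclass

@dataclass
class FakeLayer:
    """Stand-in for a Keras layer: only a name and a trainable flag."""
    name: str
    trainable: bool = False

# Assumed layer names of VGG16 without the top classifier.
VGG16_LAYER_NAMES = [
    "input",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]

fake_layers = [FakeLayer(name) for name in VGG16_LAYER_NAMES]

# Same pattern as the cell above: enable everything, then re-freeze
# every layer except the last four.
for layer in fake_layers:
    layer.trainable = True
for layer in fake_layers[:-4]:
    layer.trainable = False

unfrozen = [layer.name for layer in fake_layers if layer.trainable]
print(unfrozen)
```

With these names, only the layers of block 5 remain trainable.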

---
#### Model Compilation

In [None]:
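# `callbacks` now names the list built earlier, so use `keras.callbacks` for the class
save_best_model = keras.callbacks.ModelCheckpoint(filepath='../models/04_model_t_data_augm_fine_tune_adam_2nd_model.h5', save_best_only=True, monitor='val_loss', verbose=1)
# Rebuild the list so the fine-tuning run saves to the new checkpoint path
callbacks = [lr_scheduler, early_stopping, save_best_model]

model.compile(optimizer=optimizer,
              loss=loss_function,
              metrics=['accuracy'])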

- We compile the model again with the same properties so that the change in trainable layers takes effect. 
- We redefine the checkpoint so that the best fine-tuned model is saved after each epoch based on validation loss.

---
#### Model Training

In [None]:
history = model.fit(train_dataset,
                    validation_data=validation_dataset,
                    epochs=40,
                    callbacks=callbacks)

---
#### Save Model History

In [None]:
with open("../history/model_t_data_augm_fine_tune_adam_2nd_train.pkl", "wb") as file:
    pickle.dump(history.history, file)

---
#### Model Evaluation

In [None]:
val_loss, val_acc = model.evaluate(validation_dataset)
print(f'Model Validation Loss: {val_loss:.4f}')  # the loss is not a percentage
print(f'Model Validation Accuracy: {val_acc:.2%}')

---
#### Model Training Visualization

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.tight_layout()
plt.show()

- We see that the model begins overfitting after xx epochs.  
- The validation accuracy stops improving significantly after the xxth epoch while the training accuracy keeps improving.  
- The validation loss stops improving significantly after the xxth epoch while the training loss keeps improving. 
- The validation loss improves slightly when the learning rate changes to 0.00001 on the xxth epoch
- However, the best model is saved on the xxth epoch.  

---
#### Model Testing

In [None]:
test_labels = []
test_predictions = []
test_probabilities = []

for images, labels in test_dataset:
    test_labels.extend(labels.numpy())
    predictions = model.predict(images)
    test_predictions.extend(np.argmax(predictions, axis=-1))
    test_probabilities.extend(predictions)

test_labels = np.array(test_labels)
test_predictions = np.array(test_predictions)
test_probabilities = np.array(test_probabilities)

---
#### Confusion Matrix

In [None]:
cm = confusion_matrix(test_labels, test_predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues, xticks_rotation=90)
plt.show()

- Looking at the confusion matrix, we see that the model has some trouble distinguishing between some categories.  
- The model has a hard time distinguishing between the categories 003_cat and 005_dog.  
- The model also has a hard time distinguishing between some other categories but the error is not as significant.  
- The model has an acceptable performance on the categories 001_automobile, 006_frog, 008_ship and 009_truck.
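
As a reading aid for the matrix, the sketch below shows how per-class recall and the most frequent confusion can be derived from it. The 3-class matrix and its labels are made up for illustration; rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted).
toy_labels = ["cat", "dog", "ship"]
toy_cm = [
    [70, 25, 5],
    [30, 65, 5],
    [2, 3, 95],
]

# Per-class recall: diagonal entry divided by the row total.
toy_recall = {
    toy_labels[i]: toy_cm[i][i] / sum(toy_cm[i])
    for i in range(len(toy_labels))
}

# The largest off-diagonal cell is the most frequent confusion.
confusions = [
    (toy_cm[i][j], toy_labels[i], toy_labels[j])
    for i in range(len(toy_labels))
    for j in range(len(toy_labels))
    if i != j
]
count, true_label, predicted_label = max(confusions)

print(toy_recall)
print(f"most common error: {true_label} predicted as {predicted_label} ({count} times)")
```

The same two steps can be applied to the `cm` array computed above to rank the category pairs by how often they are confused.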

---
#### ROC Curve Analysis

In [None]:
test_labels_bin = label_binarize(test_labels, classes=range(NUM_CLASSES))

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(NUM_CLASSES):
    fpr[i], tpr[i], _ = roc_curve(test_labels_bin[:, i], test_probabilities[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

plt.figure(figsize=(10, 8))
colors = cycle(['aqua', 'darkorange', 'cornflowerblue', 'blue', 'green', 'red', 'purple', 'brown', 'pink', 'grey'])
for i, color in zip(range(NUM_CLASSES), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label=f'Class {class_names[i]} (AUC = {roc_auc[i]:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

- We see that the model has a good performance on the ROC curve for most categories.  
- The categories 003_cat, 002_bird, 005_dog and 004_deer have the worst AUC (Area Under Curve) performance.
- A perfect AUC of 1.0 would mean that, for that class, the model scores every positive sample higher than every negative one, so some threshold separates them without error.
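
One way to interpret the AUC of a single one-vs-rest curve is as the fraction of (positive, negative) pairs that the model ranks correctly. The sketch below computes it that way in plain Python; the scores are invented, and `pairwise_auc` is an illustrative helper, not the scikit-learn routine used above.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    correct = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))

# Hypothetical predicted probabilities for one class.
positives = [0.9, 0.8, 0.4]   # samples that belong to the class
negatives = [0.7, 0.3, 0.2]   # samples from all other classes

auc_value = pairwise_auc(positives, negatives)
perfect_auc = pairwise_auc([0.9, 0.8], [0.3, 0.1])

print(f"AUC = {auc_value:.3f}, perfectly separated AUC = {perfect_auc:.1f}")
```

Here one positive sample (0.4) is scored below one negative sample (0.7), so 8 of the 9 pairs are ordered correctly.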

---
#### Performance Metrics
- **Accuracy** is the proportion of correctly predicted instances out of the total instances.  
- **Precision** is the ratio of true positive predictions to the total predicted positives. Macro precision calculates this for each class independently and then averages them.  
- **Weighted precision** calculates the precision for each class, then averages them, weighted by the number of true instances for each class.  
- **Recall** is the ratio of true positive predictions to the total actual positives. Macro recall calculates this for each class independently and then averages them.  
- **Weighted recall** calculates the recall for each class, then averages them, weighted by the number of true instances for each class.  
- The **F1-score** is the harmonic mean of precision and recall. Macro F1-score calculates this for each class independently and then averages them.  
- **Weighted F1-score** calculates the F1-score for each class, then averages them, weighted by the number of true instances for each class.

In [None]:
acc = accuracy_score(y_true =  test_labels, y_pred = test_predictions)
print(f'Accuracy : {np.round(acc*100,2)}%')
precision = precision_score(y_true =  test_labels, y_pred = test_predictions, average='macro')
print(f'Precision - Macro: {np.round(precision*100,2)}%')
recall = recall_score(y_true =  test_labels, y_pred = test_predictions, average='macro')
print(f'Recall - Macro: {np.round(recall*100,2)}%')
f1 = f1_score(y_true =  test_labels, y_pred = test_predictions, average='macro')
print(f'F1-score - Macro: {np.round(f1*100,2)}%')
precision = precision_score(y_true =  test_labels, y_pred = test_predictions, average='weighted')
print(f'Precision - Weighted: {np.round(precision*100,2)}%')
recall = recall_score(y_true =  test_labels, y_pred = test_predictions, average='weighted')
print(f'Recall - Weighted: {np.round(recall*100,2)}%')
f1 = f1_score(y_true =  test_labels, y_pred = test_predictions, average='weighted')
print(f'F1-score - Weighted: {np.round(f1*100,2)}%')

- **Since the dataset is balanced, the MACRO average is a good metric to evaluate the model.**
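
The difference between the two averages only shows up when class sizes differ. The sketch below illustrates this for recall with two tiny, made-up label lists; the helper functions are illustrative and independent of the scikit-learn calls above.

```python
def per_class_recall(y_true, y_pred):
    """Return {class: (recall, support)} computed from two label lists."""
    stats = {}
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        stats[cls] = (hits / support, support)
    return stats

def macro_and_weighted(stats):
    """Unweighted mean vs. support-weighted mean of the per-class values."""
    total = sum(support for _, support in stats.values())
    macro = sum(value for value, _ in stats.values()) / len(stats)
    weighted = sum(value * support for value, support in stats.values()) / total
    return macro, weighted

# Hypothetical imbalanced labels: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
macro, weighted = macro_and_weighted(per_class_recall(y_true, y_pred))
print(f"imbalanced: macro={macro:.3f}, weighted={weighted:.3f}")

# Hypothetical balanced labels: both averages coincide.
y_true_bal = [0, 0, 1, 1]
y_pred_bal = [0, 0, 1, 0]
macro_bal, weighted_bal = macro_and_weighted(per_class_recall(y_true_bal, y_pred_bal))
print(f"balanced: macro={macro_bal:.3f}, weighted={weighted_bal:.3f}")
```

With equal class sizes the weights are identical, so the macro and weighted averages agree.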

# Conclusion
- We made a model composed of the convolutional base of the VGG16 and a classifier.
- We have trained the classifier with data augmentation (Horizontal Flip, Zoom, Rotation, Contrast and Brightness), using the Sparse Categorical Cross-Entropy loss function and the Adam optimizer.  
- We then unfroze the last 4 layers of the VGG16 and retrained the model with data augmentation.
- We experimented with various classifiers, but decided to try a simpler classifier in this notebook.  
    - Different learning rates were tested; we settled for the Reduce Learning Rate on Plateau callback.  
    - Various batch sizes were explored; this size was optimal for this architecture.  
    - Multiple optimizers were evaluated; Adam was chosen.
    - Several regularization values were tried; these values worked best.
    - Different dropout rates were assessed; this rate provided the best results.
    - Various epoch counts were tested; 40 epochs were optimal.
    - Different numbers of layers were evaluated; this architecture was chosen.
- Dropout was applied after the flatten layer.
- The model showed some difficulty distinguishing between certain categories, particularly cats and dogs.
- Overfitting was observed after xx epochs, but the best model was saved at the xxth epoch.
- We evaluated the model using a confusion matrix to analyze its performance on each category.
- We evaluated the model using ROC curves for a deeper performance analysis.
- The model achieved an accuracy of xx% on the test set.
- Performance on the test set was good, with:
    - Macro F1-score: xx%
    - Weighted F1-score: xx%
    - Macro precision: xx%
    - Weighted precision: xx%
    - Macro recall: xx%
    - Weighted recall: xx%