### Train Image Classifier (Notebook)

This notebook trains a patch-based image classifier on data downloaded from CoralNet. Before
running it, download the images and annotations, then convert them to patches with the
Annotations to Patches notebook or script.

#### Imports

In [None]:
import os
import glob

import warnings
warnings.filterwarnings("ignore")

import math
import numpy as np
import pandas as pd
from skimage import io
import matplotlib.pyplot as plt

import tensorflow
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers, losses, metrics
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping, TensorBoard

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

from imgaug import augmenters as iaa

#### Setting Data Paths

Here we'll set the path to the directory containing the data for the source we want to use.
In this case, we're using the data downloaded from source 3420.

In [None]:
# Root and Source directory
ROOT = "../CoralNet_Data/"
SOURCE_DIR = ROOT + "3420/"

# Patch dataframe
patches_df = pd.read_csv(SOURCE_DIR + "patches.csv")

# We'll also create folders in this source to hold results of the model
MODEL_DIR = SOURCE_DIR + "model/"
WEIGHTS_DIR = MODEL_DIR + "weights/"
LOGS_DIR = MODEL_DIR + "logs/"

# Make the directories
os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(WEIGHTS_DIR, exist_ok=True) 
os.makedirs(LOGS_DIR, exist_ok=True)

#### Prepare Data

Here we'll prepare the data for training. The patches dataframe records, for each patch, the
image it came from and its label. We'll split the source images into training (65%),
validation (17.5%), and testing (17.5%) sets, and assign each patch to the set of its image.

In [None]:
image_names = patches_df['Image_Name'].unique()

# Split the Images into training, validation, and test sets.
# We split based on the image names, so that we don't have the same image in multiple sets.
training_images, testing_images = train_test_split(image_names, test_size=0.35, random_state=42)
validation_images, testing_images = train_test_split(testing_images, test_size=0.5, random_state=42)

# Create training, validation, and test dataframes
train = patches_df[patches_df['Image_Name'].isin(training_images)]
valid = patches_df[patches_df['Image_Name'].isin(validation_images)]
test = patches_df[patches_df['Image_Name'].isin(testing_images)]

# The number of class categories
num_classes = len(patches_df['Label'].unique())
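
Because the split is done on image names, no image should contribute patches to more than one
set. The sketch below is one way to verify that; `check_split` is a hypothetical helper, and
the toy dataframe only stands in for `patches_df`.

```python
import pandas as pd


def check_split(train_df, valid_df, test_df, column="Image_Name"):
    """Return the fraction of rows in each split.

    Raises ValueError if any value of `column` appears in more than one split.
    """
    image_sets = {
        "train": set(train_df[column]),
        "valid": set(valid_df[column]),
        "test": set(test_df[column]),
    }
    names = list(image_sets)
    # Compare every pair of splits once
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = image_sets[first] & image_sets[second]
            if shared:
                raise ValueError(
                    f"{len(shared)} image(s) appear in both {first} and {second}"
                )
    total = len(train_df) + len(valid_df) + len(test_df)
    return {
        "train": len(train_df) / total,
        "valid": len(valid_df) / total,
        "test": len(test_df) / total,
    }


# Toy data: four patches taken from three images
toy = pd.DataFrame({
    "Image_Name": ["a.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "Label": ["x", "y", "x", "y"],
})
fractions = check_split(
    toy[toy["Image_Name"] == "a.jpg"],
    toy[toy["Image_Name"] == "b.jpg"],
    toy[toy["Image_Name"] == "c.jpg"],
)
print(fractions)
```

In the notebook, the equivalent call would be `check_split(train, valid, test)`.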

#### Data Exploration

As a sanity check, we can see how many patches are in each set, and the class distribution.

In [None]:
plt.figure(figsize=(20,10))

plt.subplot(1,3,1)
plt.title("Train: " + str(len(train)))
ax = train['Label'].value_counts().plot(kind='bar')

plt.subplot(1,3,2)
plt.title("Valid: " + str(len(valid)))
ax = valid['Label'].value_counts().plot(kind='bar')

plt.subplot(1,3,3)
plt.title("Test: " + str(len(test)))
ax = test['Label'].value_counts().plot(kind='bar')
plt.savefig(LOGS_DIR + "DatasetSplit.png")
plt.show()

#### Data Augmentation

In this cell, we'll use the imgaug library to create two augmentation pipelines. The training
pipeline resizes, flips, rotates, and occasionally rescales each patch; the validation and
testing pipeline only resizes. We'll also create data generators that read the patches listed
in a dataframe and apply the pipeline on-the-fly while training. Pixel normalization is left
to the model itself, which is built with `include_preprocessing=True`.

Further down, we'll set some model training parameters, such as the number of epochs, batch size,
and learning rate.

In [None]:
# Dropout rate for the classification head (a form of regularization)
dropout_rate = 0.80

# For the training set
augs_for_train = iaa.Sequential([   
                          iaa.Resize(224, interpolation = 'linear'),
                          iaa.Fliplr(0.5),
                          iaa.Flipud(0.5),
                          iaa.Rot90([1, 2, 3, 4], True),
                          iaa.Sometimes(.3, iaa.Affine(scale = (.95, 1.05))),
                       ])

# For the validation and testing sets
augs_for_valid = iaa.Sequential([iaa.Resize(224, interpolation = 'linear')])
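
To show what the flips and 90-degree rotations in the training pipeline do to a patch, here is
a NumPy-only sketch. It is not imgaug and does not reproduce the resize or the `Affine` scaling
step; `random_flip_rotate` is a hypothetical helper written for illustration.

```python
import numpy as np


def random_flip_rotate(image, rng):
    """Randomly flip an (H, W, C) image and rotate it by a multiple of 90 degrees."""
    # Horizontal flip with probability 0.5
    if rng.random() < 0.5:
        image = np.fliplr(image)
    # Vertical flip with probability 0.5
    if rng.random() < 0.5:
        image = np.flipud(image)
    # Rotate by k * 90 degrees, with k drawn from {1, 2, 3, 4}
    k = int(rng.integers(1, 5))
    return np.rot90(image, k=k, axes=(0, 1))


rng = np.random.default_rng(42)
patch = np.arange(4 * 4 * 3).reshape(4, 4, 3)
augmented = random_flip_rotate(patch, rng)

# A square patch keeps its shape, and its pixel values are only rearranged
print(augmented.shape)
```

These transforms are label-preserving for patches where orientation carries no meaning, which
is the usual assumption for top-down benthic imagery.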

In [None]:
# Number of epochs to train for
num_epochs = 15

# Batch size is dependent on the amount of memory available on your machine
batch_size = 32

# Number of batches per epoch, rounded up so that all images are used
steps_per_epoch_train = math.ceil(len(train) / batch_size)
steps_per_epoch_valid = math.ceil(len(valid) / batch_size)

# Learning rate 
lr = .0001

# Training images are augmented on-the-fly (normalization happens inside the model)
train_augmentor = ImageDataGenerator(preprocessing_function = augs_for_train.augment_image)

# Reading from dataframe
train_generator = train_augmentor.flow_from_dataframe(dataframe = train, 
                                                      directory = None,
                                                      x_col = 'Name',
                                                      y_col = 'Label', 
                                                      target_size = (224, 224), 
                                                      color_mode = "rgb",  
                                                      class_mode = 'categorical', 
                                                      batch_size = batch_size,
                                                      shuffle = True, 
                                                      seed = 42)
                                                     
# Validation images are only resized, no augmentation
validate_augmentor = ImageDataGenerator(preprocessing_function = augs_for_valid.augment_image)

# Reading from dataframe                             
validation_generator = validate_augmentor.flow_from_dataframe(dataframe = valid,
                                                              directory = None, 
                                                              x_col = 'Name',
                                                              y_col = 'Label', 
                                                              target_size = (224, 224), 
                                                              color_mode = "rgb",  
                                                              class_mode = 'categorical', 
                                                              batch_size = batch_size, 
                                                              shuffle = True, 
                                                              seed = 42)
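
The number of steps per epoch should be a whole number of batches. As a small worked example
of the arithmetic, using a hypothetical helper `batches_per_epoch` and made-up dataset sizes:

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


# 1000 patches with a batch size of 32: 31 full batches plus one batch of 8
print(batches_per_epoch(1000, 32))

# Floor division would silently skip the final partial batch
print(1000 // 32)
```

Rounding up means the last batch may be smaller than `batch_size`, but no patch is skipped.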

#### Model Creation

In this cell we'll create the model. We'll use the ConvNeXt architecture, a convolutional
neural network that uses depthwise convolutions (grouped convolutions with one group per
channel) to keep the number of parameters down. We'll start from weights pretrained on the
ImageNet dataset and freeze the weights of the convolutional layers, so that only the
fully-connected layer at the end of the network is trained.

In [None]:
# Now we define the convolutional portion of the model
convnet = tensorflow.keras.applications.convnext.ConvNeXtTiny(
        model_name='convnext_tiny',
        include_top=False,
        include_preprocessing=True,
        weights='imagenet',
        input_shape=(224, 224, 3),
        pooling='max',
        classes=num_classes,
        classifier_activation='softmax',
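)

# Freeze the pretrained encoder so that only the classification head is trained
convnet.trainable = False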

# Here we create the entire model, with the convnet previously defined
# as the encoder. Our entire model is simple, consisting of the convnet,
# a dropout layer for regularization, and a fully-connected layer with
# softmax activation for classification.
model = Sequential([
        convnet,
        Dropout(dropout_rate),
        Dense(num_classes),
        Activation('softmax')
])

#### Visualizing the Model

If you want, output the model summary to get an idea of the model architecture. This can be
useful as the number of parameters in the model can be quite large, and that may affect the
training time, and the amount of memory required to train the model.

In [None]:
model.summary()

#### Defining Callbacks

Here we define the callbacks that will be used during training. The first callback multiplies
the learning rate by 0.65 (a 35% reduction) if the validation loss has not improved for 5
epochs. The second callback saves the model weights after every epoch, recording the epoch
number, training accuracy, and validation accuracy in the filename. The third callback stops
training if the validation loss has not improved for 10 epochs, and restores the best weights
seen so far. The fourth callback writes logs to a TensorBoard log directory, which can be used
to visualize the training progress.

In [None]:
callbacks = [
                ReduceLROnPlateau(monitor='val_loss', factor=.65, patience=5, verbose=1),

                ModelCheckpoint(filepath=WEIGHTS_DIR + 'model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5',
                                 monitor='val_loss', save_weights_only=True, save_best_only=False, verbose=1),

                EarlyStopping(monitor="val_loss", min_delta=0, patience=10, verbose=0,  mode="auto", baseline=None,
                             restore_best_weights=True, start_from_epoch=0),

                TensorBoard(log_dir=LOGS_DIR, histogram_freq=0, write_graph=True, write_images=True, update_freq='epoch',
                            profile_batch=2, embeddings_freq=0, embeddings_metadata=None),
            ]
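
To make the learning-rate callback concrete, here is a simplified pure-Python sketch of the
plateau rule. It is not the Keras implementation: `simulate_plateau` is a hypothetical helper
that ignores the `min_delta`, `cooldown`, and `min_lr` options, and the loss values are made up.

```python
def simulate_plateau(val_losses, lr, factor=0.65, patience=5):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has
    failed to improve for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        schedule.append(lr)
    return schedule


# Loss improves for two epochs, then stalls for five
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
schedule = simulate_plateau(losses, lr=1e-4)
print(schedule)
```

With these values the rate stays at 1e-4 through the sixth epoch and drops to 6.5e-5 after the
seventh, the fifth epoch in a row without improvement.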

#### Compiling the Model

Here we configure the model for training with the compile function. Compiling does not change
the model's architecture, but it must be done before training. We use the Adam optimizer and
the categorical cross-entropy loss function. We also define the metrics that we want to track
during training, in this case accuracy, precision, and recall.

Any adjustments to the model training parameters will require you to recompile the model.

In [None]:
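# precision_m and recall_m are not defined in this notebook, so the
# built-in Keras Precision and Recall metrics are used here instead
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(learning_rate=lr),
              metrics=['acc', metrics.Precision(name='precision'), metrics.Recall(name='recall')])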

#### Calculating Class Weights

If you want to calculate the class weights for the training set, you can do so here. The class
weights are used to weight the loss function during training. This is useful if the dataset is
imbalanced, as it will help the model learn the minority classes better. If you do not want to
use class weights, set the weighted variable to False.

In [None]:
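# Class names, ordered by the integer index Keras assigned to each label
short_codes = sorted(train_generator.class_indices, key=train_generator.class_indices.get)

# Set to False to give every class the same weight
weighted = True

if weighted:
    # compute_class_weights is not defined in this notebook; it is assumed to be a
    # helper returning a {label: weight} dict, and must be defined or imported first
    class_weight = compute_class_weights(train)
else:
    class_weight = {c: 1.0 for c in short_codes}

# Reformat for model.fit(), which expects integer class indices as keys
class_weight = {short_codes.index(k): v for (k, v) in class_weight.items()}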
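
The `compute_class_weights` helper is not shown in this notebook, so its exact formula is
unknown here. As a hypothetical stand-in, the sketch below uses inverse-frequency ("balanced")
weights, `n_samples / (n_classes * class_count)`; the function name and the toy labels are
assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each label to n_samples / (n_classes * count_of_label)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy, imbalanced label list: 6 coral, 3 algae, 1 sand
labels = ["coral"] * 6 + ["algae"] * 3 + ["sand"] * 1
weights = inverse_frequency_weights(labels)
print(weights)

# Keras expects integer class indices as keys; flow_from_dataframe
# assigns those indices in sorted label order
class_names = sorted(weights)
indexed_weights = {class_names.index(k): v for k, v in weights.items()}
print(indexed_weights)
```

Rare classes receive weights above 1 and common classes below 1, so each class contributes
roughly equally to the loss. In the notebook, the input would be `train['Label']`.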

#### Training the Model

In this cell we train the model. We use the fit function to train the model. We pass in the
training and validation generators that we created earlier, as well as the number of epochs
to train for. We also pass in the callbacks that we defined earlier, and the class weights
that we calculated earlier. The fit function will return a history object that we can use
to plot the training and validation loss and accuracy.

In [None]:
history = model.fit(train_generator,
                    steps_per_epoch=steps_per_epoch_train,
                    epochs=num_epochs,
                    validation_data=validation_generator,
                    validation_steps=steps_per_epoch_valid,
                    callbacks=callbacks,
                    verbose=1,
                    class_weight=class_weight)

#### Selecting the Best Weights

After training, we'll select the best weights to use for testing. Each saved filename records
the epoch, training accuracy, and validation accuracy, so we list the files, pick the index of
the one with the highest validation accuracy, and load those weights into the model.

In [None]:
# Get the list of weights, sorted by modification time
weights = sorted(glob.glob(WEIGHTS_DIR + "*.h5"), key=os.path.getmtime)
for i, w in enumerate(weights):
    print(w, i)

In [None]:
# Select the index with the best metrics
# (3 is only an example; choose the index from the list printed above)
best_weights = weights[3]
print("Best Weights: ", best_weights)

# Load into the model
model.load_weights(best_weights)
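
Instead of picking the index by eye, the validation accuracy can be parsed from each filename.
The sketch below assumes the `model-{epoch}-{acc}-{val_acc}.h5` naming used by the checkpoint
callback; `best_by_val_acc` is a hypothetical helper and the example paths are made up.

```python
import os
import re

# Matches names such as model-003-0.812345-0.790123.h5
PATTERN = re.compile(r"model-(\d+)-(\d+\.\d+)-(\d+\.\d+)\.h5$")


def best_by_val_acc(paths):
    """Return the path whose filename encodes the highest validation accuracy."""
    scored = []
    for path in paths:
        match = PATTERN.search(os.path.basename(path))
        if match:
            # Group 3 is the validation accuracy
            scored.append((float(match.group(3)), path))
    if not scored:
        raise ValueError("No weight files matched the expected naming pattern")
    return max(scored)[1]


example_paths = [
    "weights/model-001-0.700000-0.650000.h5",
    "weights/model-002-0.850000-0.810000.h5",
    "weights/model-003-0.900000-0.790000.h5",
]
print(best_by_val_acc(example_paths))
```

In the notebook, `best_weights = best_by_val_acc(weights)` would replace the hard-coded index.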

#### Testing the Model

Now we create a generator for the test set, and use the model to predict on the test set. We
will print the classification report and plot the confusion matrix.

In [None]:
test_generator = validate_augmentor.flow_from_dataframe(dataframe=test,
                                                        x_col = 'Name',
                                                        y_col = 'Label', 
                                                        target_size = (224, 224), 
                                                        color_mode = "rgb",  
                                                        class_mode = 'categorical', 
                                                        batch_size = batch_size, 
                                                        shuffle = False, 
                                                        seed = 42)
# Grab the ground-truth
test_y = test_generator.classes

In [None]:
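# Number of batches needed to cover the whole test set once
steps_per_epoch_test = math.ceil(len(test) / batch_size)

# Use the model to predict on all of the test set
# (predict_generator is deprecated; predict accepts generators directly)
predictions = model.predict(test_generator, steps=steps_per_epoch_test)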

# Collapse the probability distribution to the most likely category
predict_classes = np.argmax(predictions, axis = 1)

In [None]:
print("# of images:", len(predict_classes))

# Create the confusion matrix between the ground-truth and predicted
cm = confusion_matrix(y_true=test_y, y_pred=predict_classes)

# Create a display for the confusion matrix, providing the labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=short_codes)

# Calculate the overall accuracy
overall_accuracy = accuracy_score(y_true=test_y,
                                  y_pred=predict_classes)

# Calculate the accuracy per class category, store in a dict
class_accuracy = cm.diagonal()/cm.sum(axis=1)
class_accuracy = dict(zip(short_codes, class_accuracy))

# Write the accuracy per class category to a .csv file
df = pd.DataFrame(list(zip(class_accuracy.keys(),
                           class_accuracy.values())),
                  columns=['Class', 'Accuracy'])

df.to_csv(LOGS_DIR + "ClassAccuracy.csv")

# Plot the results
fig, ax = plt.subplots(figsize=(30, 30))
plt.title("Overall Accuracy: " + str(overall_accuracy))
disp.plot(ax=ax)
plt.savefig(LOGS_DIR + "ConfusionMatrix.png")
print("Class Accuracy: ", df)
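
The per-class accuracy above divides each diagonal entry of the confusion matrix by its row
sum, which fails for a class with no test samples. A minimal sketch of the same calculation
with that case guarded; `per_class_accuracy` is a hypothetical helper and the matrix is made up.

```python
import numpy as np


def per_class_accuracy(cm):
    """Diagonal divided by row sums; NaN for classes with no true samples."""
    cm = np.asarray(cm, dtype=float)
    totals = cm.sum(axis=1)
    return np.divide(
        cm.diagonal(),
        totals,
        out=np.full(len(cm), np.nan),
        where=totals > 0,
    )


# Rows are true classes, columns are predicted classes.
# The third class has no samples in this toy test set.
toy_cm = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 0],
])
toy_class_accuracy = per_class_accuracy(toy_cm)
toy_overall_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_class_accuracy, toy_overall_accuracy)
```

Since each row holds the true samples of one class, this per-class accuracy is the same
quantity as per-class recall.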

#### Confidence Threshold

Here we define a confidence threshold: the minimum difference between the probabilities of
the most probable class and the second most probable class. If the difference is less than
the threshold, the model is considered unsure of the prediction. We can use this to filter out
the predictions that the model is unsure of, and keep only the ones it is sure of. This can be
useful if we want to use the predictions from a model that is not very accurate overall.

In [None]:
# Higher values represent more sure/confident predictions
# .1 unsure -> .5 pretty sure -> .9 very sure

# Creating a graph of the threshold values and the accuracy
# useful for determining how sure the model is when making predictions
threshold_values = np.arange(0.0, 1.0, 0.05)
class_ACC = []

# Looping through the threshold values and calculating the accuracy
for threshold in threshold_values:
    # Creating a list to store the sure index
    sure_index = []
    # Looping through all predictions and collecting the sure ones
    for i in range(0, len(predictions)):
        # If the difference between the most probable class and the second most
        # probable class is greater than the threshold, add it to the sure index
        top_two = sorted(predictions[i])[-2:]
        if top_two[1] - top_two[0] > threshold:
            sure_index.append(i)

    #  Calculating the accuracy for the threshold value
    sure_test_y = np.take(test_y, sure_index, axis = 0)
    sure_pred_y = np.take(predict_classes, sure_index)

    class_ACC.append(accuracy_score(sure_test_y, sure_pred_y)) 

# Plotting the results
plt.figure(figsize=(10, 5))
plt.plot(threshold_values, class_ACC)
plt.xlabel('Threshold Values')
plt.xlim([0, 1])
plt.xticks(ticks = np.arange(0, 1.05, 0.1))
plt.ylabel('Classification Accuracy')
plt.title('Identifying the ideal threshold value')
plt.savefig(LOGS_DIR + "AccuracyThreshold.png")
plt.show()
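
The same margin can be computed for all predictions at once with NumPy. This is a sketch, not
a drop-in replacement for the loop above: the helper names are hypothetical, the probabilities
are made up, and thresholds with no surviving predictions return NaN rather than an accuracy.

```python
import numpy as np


def top2_margin(probabilities):
    """Difference between the largest and second-largest probability per row."""
    ordered = np.sort(probabilities, axis=1)
    return ordered[:, -1] - ordered[:, -2]


def accuracy_above_threshold(probabilities, y_true, threshold):
    """Accuracy over predictions whose margin exceeds `threshold`, and how many were kept."""
    keep = top2_margin(probabilities) > threshold
    if not keep.any():
        return float("nan"), 0
    predicted = probabilities.argmax(axis=1)
    accuracy = float((predicted[keep] == y_true[keep]).mean())
    return accuracy, int(keep.sum())


toy_probabilities = np.array([
    [0.90, 0.05, 0.05],   # confident and correct
    [0.50, 0.45, 0.05],   # unsure and wrong
    [0.40, 0.35, 0.25],   # unsure but correct
    [0.10, 0.80, 0.10],   # confident and correct
])
toy_truth = np.array([0, 1, 0, 1])

for threshold in (0.0, 0.5, 0.9):
    print(threshold, accuracy_above_threshold(toy_probabilities, toy_truth, threshold))
```

Returning the number of kept predictions alongside the accuracy shows the trade-off: a higher
threshold usually raises accuracy but leaves fewer patches classified.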

#### Saving the Model

Finally, we save the model and the weights. We save the model as a .h5 file.

In [None]:
model.save(WEIGHTS_DIR + "Best_Model_and_Weights.h5")