# Deep learning
In this tutorial we will train convolutional neural network (CNN) classification models in Python. The following code requires tensorflow, pandas, and numpy installations. If you do not yet have tensorflow installed, follow the instructions here: https://www.tensorflow.org/install/pip. 

If you do not yet have pandas installed, follow the instructions here: https://pandas.pydata.org/docs/getting_started/install.html.

If you do not yet have numpy installed, follow the instructions here: https://numpy.org/install/.

## Directory Setup
To load the training and testing images, the image directory needs a specific structure: the root directory (e.g. the training directory) should contain one subdirectory per class (e.g. per species), and each subdirectory should contain the images belonging to that class. See the example below:

In [None]:
Training_Set/
    ├── Species_1/
    │   ├── Sp1_image1.jpg
    │   ├── Sp1_image2.jpg
    │   └── ...
    ├── Species_2/
    │   ├── Sp2_image1.jpg
    │   ├── Sp2_image2.jpg
    │   └── ...
    └── ...

The training and testing image sets need to be stored in separate directories, but must have the same structure.
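
Before training, it can help to confirm that both directories follow this layout. The sketch below is one possible check using only the standard library: it counts images per class subdirectory and compares the class names found in two directories. The helper names, the `Testing_Set` folder name, and the list of image extensions are assumptions made for illustration, and the demo runs on a temporary directory of empty placeholder files rather than real images.

```python
import os
import tempfile

# Assumed image file types; adjust to match your data.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images_per_class(root_dir):
    """Return {class_name: image_count} for each subdirectory of root_dir."""
    counts = {}
    for entry in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


def same_classes(train_dir, test_dir):
    """True if both directories contain the same class subdirectories."""
    return set(count_images_per_class(train_dir)) == set(count_images_per_class(test_dir))


# Demo on a throwaway directory tree (placeholder files, not real images).
with tempfile.TemporaryDirectory() as tmp:
    for split in ("Training_Set", "Testing_Set"):
        for species in ("Species_1", "Species_2"):
            os.makedirs(os.path.join(tmp, split, species))
            open(os.path.join(tmp, split, species, "image1.jpg"), "w").close()
    demo_counts = count_images_per_class(os.path.join(tmp, "Training_Set"))
    demo_match = same_classes(os.path.join(tmp, "Training_Set"),
                              os.path.join(tmp, "Testing_Set"))

print(demo_counts)
print(demo_match)
```

To check your own data, call `count_images_per_class` and `same_classes` on your real training and testing paths.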

## Model Training
### ResNet-50
For our first model, we will use the ResNet-50 architecture as the base to train a transfer-learned classification model. First, we must import the necessary modules.

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Model
import tensorflow as tf
import numpy as np
import pandas as pd
import random
import os

Here we are initializing our image sizes, batch sizes, and image directories. All images will be resized to the specified image size (224⨉224 in this case). Setting the image size smaller will decrease the amount of information given to the model, but will make the model train more quickly. The opposite is true for larger images.

Batch size determines how many images will be propagated through the model at a time. Large batch sizes are more memory intensive, and can take longer to train. You might need to change the batch size based on your hardware specifications.
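
As a rough illustration of what batch size controls, the number of batches processed in one epoch is the number of training images divided by the batch size, rounded up. The image count below is hypothetical.

```python
import math


def steps_per_epoch(n_images, batch_size):
    """Number of batches needed to pass over every image once."""
    return math.ceil(n_images / batch_size)


n_images = 1000  # hypothetical training set size

for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(n_images, batch_size))
```

A smaller batch size means more, smaller steps per epoch; a larger one means fewer steps, each needing more memory.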

Set `mytrainingdirectory` and `mytestingdirectory` to the paths where your training and testing images are stored, respectively. Also set `savedirectory` (passed to `os.chdir`) to the directory where you'd like the model to be saved.

In [None]:
img_height, img_width = (224,224)
batch_size = 128

train_data_dir = mytrainingdirectory
test_data_dir = mytestingdirectory

os.chdir(savedirectory)

`ImageDataGenerator` will generate batches of images that are normalized using a preprocessing function and augmented to the user's specifications. Image augmentations can increase the effective size and diversity of the training set, which can lead to enhanced performance and a reduction in overfitting. To get the most benefit from image augmentation, the augmentations should be carefully chosen to match changes that might be expected in real data. For example, adding a vertical flip augmentation might not make sense if your model is unlikely to encounter any images where the subject is upside down. Some common augmentations include:
- Flipping
- Rotation
- Translation
- Scaling
- Shearing
- Gaussian noise
- Adjustments to brightness/contrast

Because the test image set is meant to simulate real-life images unseen by the model, augmentations are not usually applied to it.

In [None]:
train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input,
                                   shear_range = 0.2,
                                   horizontal_flip = True)

test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)
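
To make the geometry of one augmentation concrete, the sketch below mirrors a tiny array along its width axis, which is the effect a horizontal flip has on an image stored as (height, width, channels). It illustrates the transformation only and is not how `ImageDataGenerator` is implemented; the array values are made up.

```python
import numpy as np

# A tiny 2x3 "image" with one channel; values stand in for pixel intensities.
image = np.array([[[1], [2], [3]],
                  [[4], [5], [6]]])

# Reverse the width axis of a (height, width, channels) array.
flipped = image[:, ::-1, :]

print(image[:, :, 0])
print(flipped[:, :, 0])
```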

Within `ImageDataGenerator`, we will use `flow_from_directory`. This reads batches directly from our image directory, as opposed to `flow`, which reads images that have already been loaded into the Python environment. This makes `flow_from_directory` more memory efficient.

Setting the seed within `flow_from_directory` makes the batches reproducible, and setting `shuffle = True` for the training data can prevent the model from overfitting to sequences within the data (e.g. species 1 images appear first, then species 2, etc.).

In [None]:
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = True,
    class_mode = 'categorical')


test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = False,
    class_mode = 'categorical')

This code sets the seeds to make our model training reproducible. The training will work without this code, but the final model will be slightly different each time.

In [None]:
seed_value= 321

os.environ['PYTHONHASHSEED']=str(seed_value)

random.seed(seed_value)

np.random.seed(seed_value)

tf.random.set_seed(seed_value)
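
The effect of seeding can be checked on a small scale. The sketch below reseeds Python's `random` module and NumPy and confirms that the same seed reproduces the same draws. It leaves out TensorFlow so that it runs without a TensorFlow installation, and the helper name `draw` is hypothetical.

```python
import random

import numpy as np


def draw(seed_value):
    """Seed both generators, then draw one Python float and three NumPy floats."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    return random.random(), np.random.rand(3)


first_py, first_np = draw(321)
second_py, second_np = draw(321)

# Same seed, same draws.
print(first_py == second_py)
print(np.array_equal(first_np, second_np))
```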

Here we are defining our model architecture. As a base, we are using ResNet50 pretrained on ImageNet, making this a transfer learning CNN. After this we add a global average pooling layer, one dense layer with ReLU activation, dropout layers before and after the dense layer, and a softmax classification layer (see the "**Custom CNN**" section for more information on these layers).

We must also set the base layers as non-trainable so their weights are not modified.

`EarlyStopping` is used to automatically stop the training procedure when the model performance plateaus on the validation image set. This helps ensure the model is not trained for too short or too long. In this code, the model will stop training if the validation loss is not improved for 10 epochs. Early stopping can be set to monitor validation accuracy instead by setting `monitor='val_loss'` to `monitor='val_accuracy'`.
`ModelCheckpoint` is used in tandem with `EarlyStopping` to save the model using the weights from the best epoch.

The model is then compiled and ready to be fit to the data.

In [None]:
base_model = ResNet50(include_top = False, weights = 'imagenet')
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.2)(x) 
x = Dense(1024, activation = 'relu')(x)
x = Dropout(0.2)(x) 
predictions = Dense(train_generator.num_classes, activation = 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

for layer in base_model.layers:
    layer.trainable = False

early_stopping = EarlyStopping(monitor='val_loss', patience=10)
checkpoint = ModelCheckpoint('best_weights_ResNet50.h5', monitor='val_loss', save_best_only=True)

model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
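
The patience rule described above can be sketched in plain Python. This is a simplified illustration of the logic, not the Keras implementation: it treats any decrease in validation loss as an improvement and stops once `patience` consecutive epochs pass without one. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch training stops at, epoch with the lowest validation loss).

    Epochs are numbered from 1. If the patience limit is never reached,
    training runs through every epoch in val_losses.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses), best_epoch


# Hypothetical validation losses for six epochs.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]

print(early_stopping_epoch(val_losses, patience=3))
print(early_stopping_epoch(val_losses, patience=10))
```

With a patience of 3, training stops at epoch 5 and never reaches the lower loss at epoch 6, which is why a patience that is too small can end training prematurely.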

This code trains the model using data from `train_generator` for a maximum of 100 epochs. If the early stopping condition is met before 100 epochs, training halts. Note that `EarlyStopping` as configured here does not roll the in-memory model back to its best epoch (that would require `restore_best_weights=True`); instead, `ModelCheckpoint` saves the model from the epoch with the lowest validation loss to disk. The model is also evaluated on the test data after every epoch, which allows you to monitor its progression.

In [None]:
history = model.fit(train_generator,
          epochs = 100,
          validation_data = test_generator,
          callbacks = [early_stopping, checkpoint])

After the model is trained, we can access its training history to plot its accuracy and loss progression across epochs. However, this is only possible in the Python session the model was trained in, as the training history is not preserved when the model is saved to disk.

The following code requires matplotlib. If you do not yet have matplotlib installed, follow the instructions here: https://matplotlib.org/stable/users/installing/index.html

In [None]:
import matplotlib.pyplot as plt

# Plot accuracy
plt.figure(figsize=(8, 6))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('ResNet-50 Accuracy', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper left', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()

# Plot loss
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('ResNet-50 Loss', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Loss', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper right', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()
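
Since the training history is lost when the session ends, one option is to write it to a JSON file yourself. The sketch below assumes a dictionary shaped like `history.history` (metric names mapped to lists of per-epoch values); the values, helper names, and file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical stand-in for `history.history` after three epochs.
history_dict = {
    "loss": [1.20, 0.85, 0.70],
    "accuracy": [0.55, 0.68, 0.74],
    "val_loss": [1.10, 0.90, 0.88],
    "val_accuracy": [0.58, 0.66, 0.69],
}


def save_history(history_dict, path):
    """Write the per-epoch metrics to a JSON file."""
    # Convert values to plain floats so they are JSON serializable.
    serializable = {key: [float(v) for v in values]
                    for key, values in history_dict.items()}
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)


def load_history(path):
    """Read the per-epoch metrics back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Demo in a temporary directory; use a permanent path for real training runs.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history_ResNet50.json")  # hypothetical file name
    save_history(history_dict, path)
    restored = load_history(path)

print(restored["val_loss"])
```

The restored dictionary can be plotted with the same code as above by replacing `history.history` with `restored`.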

Finally, we can load the model saved by `ModelCheckpoint` to create a prediction probability matrix and export it as a .csv file.

In [None]:
best_model = tf.keras.models.load_model('best_weights_ResNet50.h5')

np.random.seed(seed_value)
tf.random.set_seed(seed_value)

preds = best_model.predict(test_generator)
preddf = pd.DataFrame(preds)
preddf.to_csv("ResNet-Predictions.csv", index = False)

### VGG16
For our second model, we will use the VGG16 architecture as our base instead of the ResNet-50 architecture. The training procedure is extremely similar, with only a few small changes. Because of this, we will highlight the changes to the ResNet-50 code first, then provide the entire script to train the model.

First, we must load in the VGG16 modules.

In [None]:
from tensorflow.keras.applications.vgg16 import preprocess_input, VGG16

The only other change to the code is to the `base_model`, where we use `VGG16` instead of `ResNet50`.

In [None]:
base_model = VGG16(include_top = False, weights = 'imagenet')

The entire script for the VGG16 model is as follows:

In [None]:
from tensorflow.keras.applications.vgg16 import preprocess_input, VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Model
import tensorflow as tf
import pandas as pd
import numpy as np
import random
import os

#
# Load in images
#

img_height, img_width = (224,224)
batch_size = 128

train_data_dir = mytrainingdirectory
test_data_dir = mytestingdirectory

os.chdir(savedirectory)

train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input,
                                   shear_range = 0.2,
                                   horizontal_flip = True)

test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = True,
    class_mode = 'categorical')


test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = False,
    class_mode = 'categorical')

#
# Set seeds
#

seed_value= 321

os.environ['PYTHONHASHSEED']=str(seed_value)

random.seed(seed_value)

np.random.seed(seed_value)

tf.random.set_seed(seed_value)

#
# Define & fit model
#

base_model = VGG16(include_top = False, weights = 'imagenet')
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.2)(x) 
x = Dense(1024, activation = 'relu')(x)
x = Dropout(0.2)(x) 
predictions = Dense(train_generator.num_classes, activation = 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

for layer in base_model.layers:
    layer.trainable = False

early_stopping = EarlyStopping(monitor='val_loss', patience=10)
checkpoint = ModelCheckpoint('best_weights_VGG16.h5', monitor='val_loss', save_best_only=True)

model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

history = model.fit(train_generator,
          epochs = 100,
          validation_data = test_generator,
          callbacks = [early_stopping, checkpoint])

#
# Plot training progression
#

import matplotlib.pyplot as plt

# Plot accuracy
plt.figure(figsize=(8, 6))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('VGG16 Accuracy', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper left', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()

# Plot loss
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('VGG16 Loss', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Loss', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper right', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()

#
# Make and save predictions
#

best_model = tf.keras.models.load_model('best_weights_VGG16.h5')

np.random.seed(seed_value)
tf.random.set_seed(seed_value)

preds = best_model.predict(test_generator)
preddf = pd.DataFrame(preds)
preddf.to_csv("VGG16-Predictions.csv", index = False)

### Custom CNN
For our final model, we build a CNN with a custom architecture. The code differs from the transfer learning models in several ways. We will first highlight only the code that differs from the previous two models, then provide the entire script to train the model.

First, we must load in the necessary modules.

In [None]:
from tensorflow.keras.layers import Conv2D, Dense, MaxPooling2D, GlobalAveragePooling2D
from tensorflow.keras.models import Sequential

In the previous models, we used the preprocessing functions that come with `ResNet50` and `VGG16`. When creating a custom CNN, you have the freedom to use your own preprocessing function (a common one being `rescale=1./255`), or use premade functions like the ones we used with `ResNet50` and `VGG16`. For the example dataset, we found that the ResNet-50 preprocessing function worked the best, so it is what we used. We also found image transformations decreased the model's performance, so we did not apply them and used one data generator for both the training and validation image sets.

In [None]:
datagen = ImageDataGenerator(preprocessing_function = preprocess_input)

The largest change to the code comes when we are defining the model architecture. Rather than loading pretrained convolutional layers as our base model, we must define these layers ourselves. A brief description of each layer we used is provided below:

`Sequential` is a container class that allows you to build models by stacking multiple layers on top of each other in a sequential manner. It simplifies the process of creating and managing deep learning models.

`Conv2D` is a convolutional layer for 2D image data. Convolutional layers are typically the first layers of a CNN and perform its main operation, convolution, which extracts features from the input data. These layers slide multiple filters (i.e. "kernels") over the input to capture spatial patterns, and are commonly used in image recognition tasks.

`MaxPooling2D` is a downsampling operation that reduces the spatial dimensions (width and height) of the input tensor while preserving the most important features. It divides the input into non-overlapping rectangular regions and outputs the maximum value within each region.

`GlobalAveragePooling2D` computes the average value of each channel across the full spatial dimensions of the input feature maps. It collapses the spatial dimensions into a fixed-length vector with one value per channel, summarizing the spatial information and allowing for an efficient global representation of the input. It is commonly used to transition from convolutional layers to fully connected layers.

`Dense` is a fully connected layer in TensorFlow where each neuron is connected to every neuron in the previous layer. It performs a linear operation on the input data followed by an activation function. The number of neurons in the dense layer determines the dimensionality of the layer's output.

We use the *Rectified Linear Unit (ReLU)* activation function in our convolutional and fully connected layers, as this introduces non-linearity into the model, allowing the model to learn and approximate complex relationships between input data and output predictions.

The final dense layer uses *softmax activation*, which converts its input into a probability distribution for the learned classes.
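
The two activation functions can be written out in a few lines of NumPy. This is a minimal sketch of the standard definitions, applied to made-up scores; it is not the TensorFlow implementation.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: negative values become 0, others pass through."""
    return np.maximum(0, x)


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


scores = np.array([2.0, -1.0, 0.5])  # hypothetical raw scores for three classes

activated = relu(scores)
probabilities = softmax(scores)

print(activated)
print(probabilities, probabilities.sum())
```

The largest raw score receives the largest probability, so the predicted class is unchanged by softmax; only the scale of the outputs changes.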

Some other common layers that were not included in this model include:

`BatchNormalization` normalizes the inputs of a neural network layer by adjusting and scaling them to improve training stability and performance. It helps to address the internal covariate shift problem and accelerates the training process by reducing the dependence on initialization and learning rate tuning.

`Dropout` layers are used for regularization and help prevent overfitting. They randomly drop a certain percentage of the neurons during training (i.e. set the neuron output values to 0), forcing the network to learn more robust features.

`Flatten` transforms a multi-dimensional tensor into a one-dimensional tensor. It "flattens" the input by reshaping the tensor to have a shape of (batch_size, total_number_of_elements). It is an alternative to `GlobalAveragePooling2D` for transitioning from convolutional layers to fully connected layers.
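
The shape changes produced by the pooling and flattening layers can be seen on a small example. The NumPy sketch below works on a single 4x4 feature map with one channel (no batch dimension) and is an illustration of the operations, not the Keras layers themselves. The `max_pool_2x2` helper is hypothetical and assumes the height and width are even.

```python
import numpy as np

# One 4x4 feature map with a single channel: (height, width, channels).
feature_map = np.arange(16, dtype=float).reshape(4, 4, 1)


def max_pool_2x2(x):
    """Maximum over non-overlapping 2x2 regions; assumes even height and width."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


pooled = max_pool_2x2(feature_map)           # shape (2, 2, 1)
global_average = feature_map.mean(axis=(0, 1))  # shape (1,), one value per channel
flattened = feature_map.reshape(-1)          # shape (16,), every element kept

print(pooled[:, :, 0])
print(global_average)
print(flattened.shape)
```

Global average pooling keeps one number per channel, while flattening keeps every element, so a dense layer placed after `Flatten` generally has many more weights.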

In [None]:
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    GlobalAveragePooling2D(),
    Dense(512, activation='relu'),
    Dense(train_generator.num_classes, activation='softmax')
])

Now we can print a summary of the model's architecture.

In [None]:
model.summary()
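
The parameter counts reported by `model.summary()` can be reproduced by hand. A convolutional layer with bias has (kernel height × kernel width × input channels + 1) × filters parameters, and a dense layer with bias has (inputs + 1) × units; the pooling layers have none. The sketch below applies these formulas to the architecture above, assuming a hypothetical 10 classes.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units


num_classes = 10  # hypothetical; in the tutorial this is train_generator.num_classes

layer_params = {
    "conv2d_1": conv2d_params(3, 3, 3, 32),    # RGB input, 32 filters
    "conv2d_2": conv2d_params(3, 3, 32, 32),
    "conv2d_3": conv2d_params(3, 3, 32, 64),
    "conv2d_4": conv2d_params(3, 3, 64, 64),
    "dense_1": dense_params(64, 512),          # 64 channels after global average pooling
    "dense_out": dense_params(512, num_classes),
}

total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(name, count)
print("total", total_params)
```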

The remaining code is the same as the previous models. The entire script for the custom CNN is as follows:

In [None]:
from tensorflow.keras.layers import Conv2D, Dense, MaxPooling2D, GlobalAveragePooling2D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import random
import os

#
# Load in images
#

img_height, img_width = (224,224)
batch_size = 128

train_data_dir = mytrainingdirectory
test_data_dir = mytestingdirectory

save_dir = mysavedirectory

train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = True,
    class_mode = 'categorical')


test_generator = train_datagen.flow_from_directory(
    test_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = False,
    class_mode = 'categorical')

os.chdir(save_dir)

#
# Set seeds
#

seed_value= 321

os.environ['PYTHONHASHSEED']=str(seed_value)

random.seed(seed_value)

np.random.seed(seed_value)

tf.random.set_seed(seed_value)

#
# Define & fit model
#

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    GlobalAveragePooling2D(),
    Dense(512, activation='relu'),
    Dense(train_generator.num_classes, activation='softmax')
])

early_stopping = EarlyStopping(monitor='val_loss', patience=10)
checkpoint = ModelCheckpoint('best_weights_CNN.h5', monitor='val_loss', save_best_only=True)
    
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

history = model.fit(train_generator,
          epochs = 100,
          validation_data = test_generator,
          callbacks = [early_stopping, checkpoint])

#
# Plot training progression
#

# Plot accuracy
plt.figure(figsize=(8, 6))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Custom CNN Accuracy', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper left', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()

#Plot loss
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Custom CNN Loss', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Loss', fontsize=14)
plt.legend(['Train', 'Validation'], loc='upper right', fontsize=12)
plt.tick_params(axis='both', labelsize=12)
plt.show()

#
# Make and save predictions
#

best_model = tf.keras.models.load_model('best_weights_CNN.h5')

np.random.seed(seed_value)
tf.random.set_seed(seed_value)

preds = best_model.predict(test_generator)
preddf = pd.DataFrame(preds)
preddf.to_csv("CNN-Predictions.csv", index = False)

## Evaluation
After your models have been trained and predictions have been made on a testing image set, we can measure each model's performance. The following code requires scikit-learn. If you do not yet have scikit-learn installed, follow the instructions here: https://scikit-learn.org/stable/install.html.

In [None]:
from sklearn.metrics import classification_report, roc_curve, precision_recall_curve, auc, confusion_matrix

First we will simply measure the accuracy and loss of the model.

In [None]:
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

valid_loss, valid_acc = best_model.evaluate(test_generator, verbose = 1)

Next, we will measure F1 score, precision, and recall.

In [None]:
# If you have not created the `preds` object, do so using:
# np.random.seed(seed_value)
# tf.random.set_seed(seed_value)
# preds = best_model.predict(test_generator, verbose = 1)

predicted_classes = np.argmax(preds, axis=1)

true_classes = test_generator.classes

report = classification_report(true_classes, 
                               predicted_classes, 
                               target_names=list(test_generator.class_indices.keys()),
                               output_dict=True)

macro_precision = report['macro avg']['precision']
macro_recall = report['macro avg']['recall']
macro_f1_score = report['macro avg']['f1-score']
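
To show what the macro averages mean, the sketch below computes per-class precision, recall, and F1 by hand with NumPy and then takes their unweighted mean across classes. The labels are made up, and the hypothetical `macro_metrics` helper assumes every class appears at least once in both the true and predicted labels, so no division by zero occurs.

```python
import numpy as np


def macro_metrics(true_labels, predicted_labels, n_classes):
    """Return (macro precision, macro recall, macro F1) as plain floats."""
    precisions, recalls, f1_scores = [], [], []
    for c in range(n_classes):
        tp = np.sum((predicted_labels == c) & (true_labels == c))
        fp = np.sum((predicted_labels == c) & (true_labels != c))
        fn = np.sum((predicted_labels != c) & (true_labels == c))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / (precision + recall))
    # Macro averaging weights every class equally, regardless of its size.
    return (float(np.mean(precisions)),
            float(np.mean(recalls)),
            float(np.mean(f1_scores)))


# Hypothetical labels for six images across three classes.
true_demo = np.array([0, 0, 1, 1, 2, 2])
predicted_demo = np.array([0, 1, 1, 1, 2, 0])

demo_precision, demo_recall, demo_f1 = macro_metrics(true_demo, predicted_demo, 3)
print(demo_precision, demo_recall, demo_f1)
```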

Now we can measure top x accuracy (x = 3 in this example).

In [None]:
top_x = 3

predicted_indices = tf.argsort(preds, axis=1)[:, -top_x:]

top3_accuracy = np.mean(np.any(np.equal(predicted_indices, true_classes[:, np.newaxis]), axis=1))
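
The same calculation can be traced on a small, made-up probability matrix. In this sketch a prediction counts as correct if the true class is among the `k` classes with the highest predicted probability; the helper name and values are hypothetical.

```python
import numpy as np


def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of rows whose true class is among the k highest-probability classes."""
    # argsort is ascending, so the last k columns hold the k most likely classes.
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = np.any(top_k == true_labels[:, np.newaxis], axis=1)
    return float(np.mean(hits))


# Four hypothetical images, three classes.
probabilities = np.array([[0.1, 0.2, 0.7],
                          [0.5, 0.3, 0.2],
                          [0.2, 0.5, 0.3],
                          [0.6, 0.3, 0.1]])
true_labels = np.array([2, 1, 0, 2])

print(top_k_accuracy(probabilities, true_labels, k=1))
print(top_k_accuracy(probabilities, true_labels, k=2))
print(top_k_accuracy(probabilities, true_labels, k=3))
```

Top-1 accuracy is ordinary accuracy, and accuracy can only stay the same or rise as `k` increases, reaching 1.0 when `k` equals the number of classes.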

Finally, we can generate a confusion matrix to visualize the distribution of classifications among our taxa.

In [None]:
conf_matrix = confusion_matrix(true_classes, predicted_classes)
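
In the matrix returned by scikit-learn, rows correspond to true classes and columns to predicted classes. The NumPy sketch below builds a matrix with the same layout from made-up labels and divides each row by its total, which gives the fraction of each true class assigned to each predicted class. The helper name is hypothetical.

```python
import numpy as np


def build_confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix


# Hypothetical labels for six images across three classes.
true_demo = np.array([0, 0, 1, 1, 2, 2])
predicted_demo = np.array([0, 1, 1, 1, 2, 0])

demo_matrix = build_confusion_matrix(true_demo, predicted_demo, 3)

# Row-normalize so each row sums to 1; the diagonal is then per-class recall.
row_normalized = demo_matrix / demo_matrix.sum(axis=1, keepdims=True)

print(demo_matrix)
print(np.diag(row_normalized))
```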

### Additional metrics
We can also make one-vs-rest ROC curves for each class.

In [None]:
class_names = list(test_generator.class_indices.keys())

for class_name in class_names:
    class_index = test_generator.class_indices[class_name]
    class_predictions = preds[:, class_index]
    class_true_labels = (true_classes == class_index).astype(int)

    fpr, tpr, _ = roc_curve(class_true_labels, class_predictions)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(6, 4))
    plt.plot(fpr, tpr, label='ROC Curve (AUC = {:.2f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve - Class {}'.format(class_name))
    plt.legend(loc='lower right')
    plt.show()
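
Each point on a one-vs-rest ROC curve comes from a single probability threshold. The sketch below computes one such point by hand: the chosen class is treated as "positive", every other class as "negative", and a sample is predicted positive when its probability for that class is at or above the threshold. The probabilities, labels, and helper name are hypothetical.

```python
import numpy as np


def roc_point(binary_labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted_positive = scores >= threshold
    actual_positive = binary_labels == 1
    tp = np.sum(predicted_positive & actual_positive)
    fp = np.sum(predicted_positive & ~actual_positive)
    fn = np.sum(~predicted_positive & actual_positive)
    tn = np.sum(~predicted_positive & ~actual_positive)
    return float(fp / (fp + tn)), float(tp / (tp + fn))


# Hypothetical probabilities for five images and two classes.
preds_demo = np.array([[0.90, 0.10],
                       [0.80, 0.20],
                       [0.35, 0.65],
                       [0.60, 0.40],
                       [0.20, 0.80]])
true_demo = np.array([0, 0, 0, 1, 1])

demo_class_index = 0
demo_scores = preds_demo[:, demo_class_index]
demo_binary_labels = (true_demo == demo_class_index).astype(int)

demo_fpr, demo_tpr = roc_point(demo_binary_labels, demo_scores, threshold=0.5)
print(demo_fpr, demo_tpr)
```

Sweeping the threshold from high to low traces out the full curve, which is what `roc_curve` does across all distinct score values.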

We can also do the same for PR curves.

In [None]:
class_names = list(test_generator.class_indices.keys())

for class_name in class_names:
    class_index = test_generator.class_indices[class_name]
    class_predictions = preds[:, class_index]
    class_true_labels = (true_classes == class_index).astype(int)

    precision, recall, _ = precision_recall_curve(class_true_labels, class_predictions)
    pr_auc = auc(recall, precision)

    plt.figure(figsize=(6, 4))
    plt.plot(recall, precision, label='PR Curve (AUC = {:.2f})'.format(pr_auc))
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve - Class {}'.format(class_name))
    plt.legend(loc='lower left')
    plt.show()