# Multimodal fusion model training
This vignette demonstrates how to train a multimodal fusion classification model using images and DNA metabarcoding data.

We recommend running this code on a machine with a GPU and CUDA installed. If such a machine is not available to you, the code will still run on a CPU, but training will be significantly slower.

### Files and directory structure

Before we begin, it is important that your files are arranged in the correct directory structure. If you cloned our GitHub repository, the files will already be arranged correctly. If you have not cloned the repository, or are applying this vignette to your own data, we have visualized the basic structure below.

In [None]:
CV-eDNA-Hybrid/
├── configs/
│   └── exp_order_concat.yaml
├── Data/
│   └── Model_Data/
│       ├── Images/
│       │   ├── img01.jpg
│       │   └── img02.jpg (etc.)
│       └── assemblages/
│           ├── train.csv
│           └── valid.csv
└── Model_Scripts/
    ├── tf_loader_concat.py
    └── util_order.py
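
Before running anything, it can help to confirm that this layout exists on disk. The following is a minimal sketch using only the standard library. The helper name `missing_paths` and the `EXPECTED` list are illustrative and not part of the repository, and the root path is a placeholder to adjust.

```python
from pathlib import Path

# Relative paths taken from the directory tree above
EXPECTED = [
    "configs",
    "Data/Model_Data/Images",
    "Data/Model_Data/assemblages/train.csv",
    "Data/Model_Data/assemblages/valid.csv",
    "Model_Scripts/tf_loader_concat.py",
    "Model_Scripts/util_order.py",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Placeholder path: adjust to where the repository lives on your machine
missing = missing_paths("C:/Your/Path/CV-eDNA-Hybrid")
print("Missing:", missing if missing else "nothing, layout looks complete")
```

If you are applying the vignette to your own data, edit `EXPECTED` to match your folder and file names.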

### Setup

Now that our directory is set up, we can begin with the code. First, we must load the required libraries and modules. We will be using TensorFlow Keras for training. The Anaconda environment used to run this code is specified in the environment.yml file on our GitHub repository.

In [None]:
from tensorflow.keras.layers import Dense, BatchNormalization, GlobalAveragePooling2D, Dropout, Activation
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import concatenate
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import pandas as pd
import os
import json
import argparse
import yaml

We will also be using some custom modules that must be loaded. The code for these can be found in the "Model_Scripts" directory in the GitHub repository. Make sure to set your working directory to the location of "Model_Scripts" to ensure the code for this vignette runs correctly.

In [None]:
# Set working directory. Adjust the path to match the location on your machine
os.chdir("C:/Your/Path/CV-eDNA-Hybrid/Model_Scripts")

from util_order import init_seed
from tf_loader_concat import CTDataset

Many of our environment, dataset, and model parameters are defined in a YAML config file. The config files can be found in the "configs" directory of the GitHub repository. For this code to work properly, you may need to update the `data_root` and `annotate_root` paths to match their locations on your machine. `data_root` is the parent folder where all of your model data is stored (e.g. "C:/Your/Path/CV-eDNA-Hybrid/Data/Model_Data"). `annotate_root` should be a subdirectory of `data_root` that contains the train.csv and valid.csv files. These files list, for each specimen, the path to its image, its ground truth class, and its DNA-based assemblage data.

The code below loads the config file and initializes some elements from it.

In [None]:
parser = argparse.ArgumentParser(description='Train deep learning model.')
parser.add_argument('--config', help='Path to config file', default='../configs/exp_order_fusion.yaml')
parser.add_argument('--seed', help='Seed index', type=int, default = 0)
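# parse_known_args ignores the extra arguments that Jupyter passes to the kernel,
# which would make parse_args() fail inside a notebook
args, _ = parser.parse_known_args()

# Load config
print(f'Using config "{args.config}"')
with open(args.config, 'r') as f:
    cfg = yaml.safe_load(f)

# Unpack settings from the config
# cfg["seed"] holds a list of seeds; --seed selects one of them by index
cfg["seed"] = cfg["seed"][args.seed]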
seed = cfg["seed"]
batch_size = cfg["batch_size"]
ncol = cfg["num_col"]
num_class = cfg["num_classes"]
experiment = cfg["experiment_name"]

# Path to the image annotations
anno_path = os.path.join(
    cfg["data_root"],
    cfg["annotate_root"],
    'train.csv'
)
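
The code in this vignette reads a fixed set of keys from the config. As a sketch of what such a file needs to provide, the dictionary below mirrors those keys. Every value is a made-up placeholder rather than a setting from our experiments, `check_config` is a hypothetical helper, and the data loader may read additional keys that are not listed here.

```python
# Keys that the code in this vignette reads from the config
REQUIRED_KEYS = [
    "seed", "batch_size", "num_col", "num_classes",
    "experiment_name", "data_root", "annotate_root", "learning_rate",
]

# Placeholder values for illustration only
example_cfg = {
    "seed": [11, 22, 33],       # list of seeds; --seed picks one by index
    "batch_size": 32,
    "num_col": 10,              # number of assemblage columns per specimen
    "num_classes": 5,
    "experiment_name": "my_experiment",
    "data_root": "C:/Your/Path/CV-eDNA-Hybrid/Data/Model_Data",
    "annotate_root": "assemblages",
    "learning_rate": 0.001,
}


def check_config(cfg, seed_index=0):
    """Raise if a required key is missing; return the seed selected by index."""
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"Config is missing keys: {missing}")
    if not 0 <= seed_index < len(cfg["seed"]):
        raise IndexError(f"Seed index {seed_index} is out of range")
    return cfg["seed"][seed_index]


print(check_config(example_cfg, seed_index=1))
```

Running a check like this right after loading the YAML file surfaces a missing key immediately instead of partway through training.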

Our dataset is imbalanced, as some classes are more abundant than others. To account for this, we compute class weights which are later used to adjust the loss function during the training step. This should not be confused with the "weighted masks" discussed in our paper.

In [None]:
# Reading in the annotations and setting class weights
meta = pd.read_csv(anno_path)
classes = meta["longlab"].values
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(classes), y=classes)
class_weights = dict(enumerate(class_weights))
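
To show what the "balanced" setting computes, here is a small NumPy sketch on toy labels (the labels are invented for illustration). Each class receives the weight `n_samples / (n_classes * class_count)`, so rarer classes get larger weights.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight for each class: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return classes, weights


# Toy labels: class "a" is three times as common as class "b"
y = np.array(["a", "a", "a", "b"])
classes, weights = balanced_class_weights(y)

# Same conversion as above: integer index -> weight
weight_by_index = dict(enumerate(weights))
print(classes, weight_by_index)
```

Note that `dict(enumerate(...))` assigns indices in the sorted order returned by `np.unique`, so this mapping relies on the data loader encoding class labels as integers in that same order.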

Next, we initialize our training and validation data loaders, which pair images with their corresponding DNA-based assemblage data and ground truth class names. The full code for the data loader can be found in the tf_loader_concat.py file in the Model_Scripts folder.

In [None]:
# Setting the seed
init_seed(seed)

# Initialize the datasets
train_loader = CTDataset(cfg, split='train')
valid_loader = CTDataset(cfg, split='valid')

# Create TensorFlow datasets
train_data = train_loader.create_tf_dataset()
valid_data = valid_loader.create_tf_dataset()

### Model architecture

Now that we have completed our setup, we are ready to define our model architecture. First, we define a shallow neural network that the assemblage data will pass through.

In [None]:
# Define simple ANN for tabular data
inputs = Input(shape = (ncol,))
annx = Dense(128)(inputs)
annx = BatchNormalization()(annx)
annx = Activation('relu')(annx)
annx = Dropout(0.3)(annx)
ann = Model(inputs, annx)

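Next, we define our CNN architecture. We will be using the ResNet-50 architecture with pretrained ImageNet weights. The top classification layer is excluded and replaced with global average pooling, and all ResNet layers are frozen so that only the newly added layers are trained.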

In [None]:
# Define ResNet for image data
base_model = ResNet50(include_top = False, weights = 'imagenet')
x = base_model.output
x = GlobalAveragePooling2D()(x)
resnet = Model(inputs = base_model.input, outputs = x)

for layer in base_model.layers:
    layer.trainable = False

Next, we concatenate the outputs of the shallow neural network (`ann`) and the CNN (`resnet`). The concatenated layer is then passed through another shallow neural network before reaching the classification layer.

In [None]:
# Concatenating the ANN output with the ResNet output
concat = concatenate([ann.output, resnet.output])

# Inputting the concatenated layer to another ANN for final classification
combined = Dense(128)(concat)
combined = BatchNormalization()(combined)
combined = Activation('relu')(combined)
combined = Dropout(0.3)(combined)
combined = Dense(num_class, activation = "softmax")(combined)
model = Model(inputs = [ann.input, resnet.input], outputs = combined)
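
To make the fusion step concrete, the sketch below reproduces only its shape arithmetic with NumPy. It is not the model itself. It uses the 2048-dimensional feature vector that ResNet-50 yields after global average pooling, and a batch size of 4 chosen arbitrarily.

```python
import numpy as np

batch = 4
ann_features = np.zeros((batch, 128))      # output of the Dense(128) assemblage branch
resnet_features = np.zeros((batch, 2048))  # ResNet-50 features after global average pooling

# Like Keras `concatenate`, join the two feature vectors along the last axis
fused = np.concatenate([ann_features, resnet_features], axis=-1)
print(fused.shape)  # (4, 2176)
```

Each specimen is therefore represented by 2176 features before the final dense layers, of which 128 come from the assemblage data.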

### Model parameters and training

Now that we've defined the model architecture, we only need to set a few more parameters before training the model. Here, we set the learning rate, optimizer, and number of epochs. We also create a model checkpoint that saves the model whenever an epoch produces a new best validation loss.

In [None]:
# Setting parameters
learning_rate = cfg['learning_rate']
optimizer = Adam(learning_rate=learning_rate)
epochs = 150

cp_loss = ModelCheckpoint(f'{experiment}_loss.h5', monitor='val_loss', save_best_only=True)
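
The plain-Python sketch below illustrates the rule that `save_best_only=True` applies when monitoring `val_loss`: the model is written only when the validation loss is strictly lower than every earlier value. The function name and loss values are invented for illustration.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a best-only checkpoint on val_loss would save."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strict improvement; a tie does not trigger a save
            best = loss
            saved.append(epoch)
    return saved


print(epochs_that_save([0.9, 0.7, 0.8, 0.65, 0.65]))  # [1, 2, 4]
```

Because the checkpoint file name contains no epoch placeholder, each save overwrites the previous one, so the file on disk always holds the best model seen so far.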

Finally, we compile and train the model.

In [None]:
model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])

# Model fitting
history = model.fit(train_data,
                    epochs = epochs, 
                    verbose = 1,
                    validation_data = valid_data,
                    callbacks = [cp_loss],
                    class_weight = class_weights)
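
As a rough illustration of what the `class_weight` argument does, the NumPy sketch below scales each sample's categorical cross-entropy by the weight of its true class. The weights, labels, and predictions are invented, and the sketch stops at the per-sample losses without reproducing how Keras reduces them over a batch.

```python
import numpy as np

# Hypothetical class weights: class 1 is the rarer class
class_weights = {0: 0.5, 1: 2.0}

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot ground truth
y_pred = np.array([[0.8, 0.2], [0.4, 0.6]])  # softmax outputs

# Categorical cross-entropy for each sample
per_sample_loss = -np.sum(y_true * np.log(y_pred), axis=1)

# Look up the weight of each sample's true class and scale its loss
sample_weights = np.array([class_weights[i] for i in y_true.argmax(axis=1)])
weighted_loss = sample_weights * per_sample_loss
print(per_sample_loss, weighted_loss)
```

Errors on the rare class contribute more to the total loss, which pushes the model to stop favoring the abundant classes.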

The model can be evaluated in the same way as any other TensorFlow Keras model. To see how we evaluated our model, refer to order_concat_eval.py in Model_Scripts.