<a href="https://colab.research.google.com/github/MELAI-1/WORKSHOPS-AND-SCIENTIFIC-OUTREACH/blob/main/I-X%20AI%20in%20Science-Imperial/Tutorial_Phylodynamics_ParamEst.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial for Phylodynamics Parameter Estimation**
Based on the method developed in Perez M.F. and Gascuel O. PhyloCNN: Improving tree representation and neural network architecture for deep learning from trees in phylodynamics and diversification studies. https://www.biorxiv.org/content/10.1101/2024.12.13.628187v1

## **1. Introduction**
This tutorial shows how to train a CNN model that estimates phylodynamics parameters from trees of viruses under the Birth-Death with Superspreaders (BDSS) epidemiological (phylodynamics) model.

<img src="https://drive.google.com/uc?export=view&id=1FxkO0Qisu6m1_Znc76MMbd6ZVUSPWuAl" width="500" height="300">


## **2. Libraries and Data Loading**
We import the required Python libraries, then load the phylogenetic trees simulated under BDSS together with their corresponding parameters.


In [None]:
#First you need to download the data.
!gdown --id 1GHLYw3EezrtrMkJDBXY8FNZ4FjyV3Vnn

Downloading...
From (original): https://drive.google.com/uc?id=1GHLYw3EezrtrMkJDBXY8FNZ4FjyV3Vnn
From (redirected): https://drive.google.com/uc?id=1GHLYw3EezrtrMkJDBXY8FNZ4FjyV3Vnn&confirm=t&uuid=7c2c20aa-6f07-468a-98be-b5eb966e9ff6
To: /content/PhyloDyn.zip
100% 70.1M/70.1M [00:02<00:00, 28.5MB/s]


In [None]:
#Unzip simulations
!unzip "/content/PhyloDyn.zip"

Archive:  /content/PhyloDyn.zip
   creating: PhyloDyn/
  inflating: __MACOSX/._PhyloDyn     
  inflating: PhyloDyn/.DS_Store 2    
  inflating: __MACOSX/PhyloDyn/._.DS_Store 2  
  inflating: PhyloDyn/.DS_Store      
  inflating: __MACOSX/PhyloDyn/._.DS_Store  
  inflating: PhyloDyn/Encoded_Zurich.csv  
  inflating: __MACOSX/PhyloDyn/._Encoded_Zurich.csv  
  inflating: PhyloDyn/BDSS_large_100K.csv  
  inflating: __MACOSX/PhyloDyn/._BDSS_large_100K.csv  
  inflating: PhyloDyn/Encoded_Zurich.npy  
  inflating: __MACOSX/PhyloDyn/._Encoded_Zurich.npy  
   creating: PhyloDyn/testset/
  inflating: __MACOSX/PhyloDyn/._testset  
  inflating: PhyloDyn/Encoded_trees_BDSS.npy  
  inflating: __MACOSX/PhyloDyn/._Encoded_trees_BDSS.npy  
  inflating: PhyloDyn/Encoded_trees_BDEI.npy  
  inflating: __MACOSX/PhyloDyn/._Encoded_trees_BDEI.npy  
  inflating: PhyloDyn/Encoded_trees_BD.npy  
  inflating: __MACOSX/PhyloDyn/._Encoded_trees_BD.npy  
  inflating: PhyloDyn/testset/.DS_Store  
  inflating: __MACO

In [None]:
import pandas as pd
import tensorflow as tf
import keras
import numpy as np

from keras.models import Sequential, Model
from keras.layers import Conv2D, GlobalAveragePooling2D, BatchNormalization
from keras.layers import Dense, Dropout, Activation, Flatten

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# 1) Load tree encodings for the BDSS model. We will load tree
# encodings for training (1,000 trees) and for testing (testset - 100 trees).
# The encodings are separated in two channels (one for internal nodes and another
# for the leaves of the tree). The trees have a maximum of 500 tips (leaves).
encoding_BDSS  = np.load('/content/PhyloDyn/Encoded_trees_BDSS.npy')
encoding_test_BDSS = np.load('/content/PhyloDyn/testset/Encoded_trees_BDSS.npy')

# 2) Load parameter values that will serve as labels for each simulation.
param_train = pd.read_csv('/content/PhyloDyn/BDSS_large_100K.csv', sep=',')
param_test = pd.read_csv('/content/PhyloDyn/testset/BDSS_large_10000.csv', sep=',')


In [None]:
# Print the shapes of the training and test tree encodings, and of the corresponding parameter tables.
print("Shape of the BDSS training tree encodings:", encoding_BDSS.shape)
print("Shape of the BDSS test tree encodings:", encoding_test_BDSS.shape)
print("Shape of the training parameters:", param_train.shape)
print("Shape of the test parameters:", param_test.shape)



## 3. Data Preprocessing
We will process the input to be properly formatted before feeding it to the neural network. This will involve the following steps:

### Label Assignment
We create a label array **Y** for the training and test set, with the true values for each parameter:
- target_1 = "R_nought"
- target_2 = "infectious_period"
- target_3 = "x_transmission"
- target_4 = "fraction_1"

In [None]:
#Choice of the parameters to predict
target_1 = "R_nought"
target_2 = "infectious_period"
target_3 = "x_transmission"
target_4 = "fraction_1"

Y = pd.DataFrame(param_train[[target_1, target_2, target_3, target_4]])
Y_test = pd.DataFrame(param_test[[target_1, target_2, target_3, target_4]])

# We will then split the training simulations into a training set (70%) and a validation set (30%).
valid_frac = 0.3
train_size_frac = 0.7
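
The `valid_frac` value is passed to Keras as `validation_split` when the model is fitted. Keras takes the validation samples from the end of the input arrays, before any shuffling, so the split behaves like the slicing sketched below (up to rounding of the split index). This is a minimal NumPy illustration on placeholder arrays; the helper name `split_last_fraction` and the array contents are assumptions made for the example and are not part of the notebook.

```python
import numpy as np

def split_last_fraction(x, y, valid_frac):
    """Hold out the last `valid_frac` of the samples, without shuffling."""
    n_valid = int(round(len(x) * valid_frac))
    n_train = len(x) - n_valid
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])

# Placeholder data: 10 "trees" with the (500, 19, 2) encoding and 4 labels each
x = np.zeros((10, 500, 19, 2))
y = np.arange(40, dtype=float).reshape(10, 4)

(x_train, y_train), (x_valid, y_valid) = split_last_fraction(x, y, valid_frac=0.3)
print(x_train.shape, x_valid.shape)
```

If the simulations were stored in a meaningful order (for example sorted by a parameter value), a shuffled split such as `train_test_split` from scikit-learn would probably be the safer choice.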

## 4. Building & Training the CNN (2-Generation Context)

### Model Definition
We will use the same CNN from the previous Notebook, which processes input of shape `(500, 19, 2)`:
- 500 = number of leaves or nodes
- 19 = number of features
- 2 = channels (leaves, nodes)

This architecture was inspired by the fact that internal nodes and leaves contribute differently to the tree likelihood calculation for multi-type birth-death models (MTBD, which includes BD, BDEI and BDSS; see Equation 8 in [Zhukova et al., 2023](https://academic.oup.com/sysbio/article/72/6/1387/7273092)).
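
To make the separate treatment of the two channels concrete, the sketch below reproduces in plain NumPy what a first convolution with 32 filters, a `(1, 19)` kernel, `groups=2` and no bias computes on one input of shape `(500, 19, 2)`. The random weights and the names `w_group0` and `w_group1` are hypothetical placeholders, and the example only illustrates the arithmetic; it is not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# One placeholder encoded tree: 500 rows, 19 features, 2 channels
tree = rng.normal(size=(500, 19, 2))

# Hypothetical weights: with filters=32 and groups=2, each group owns
# 16 filters and only sees its own input channel
w_group0 = rng.normal(size=(19, 16))  # applied to channel 0
w_group1 = rng.normal(size=(19, 16))  # applied to channel 1

# A (1, 19) kernel covers all 19 features of a single row, so sliding it
# along the 500 rows amounts to a row-wise dot product
out_group0 = tree[:, :, 0] @ w_group0  # shape (500, 16)
out_group1 = tree[:, :, 1] @ w_group1  # shape (500, 16)

# Concatenate the group outputs on the channel axis and apply ReLU
features = np.maximum(np.concatenate([out_group0, out_group1], axis=1), 0.0)
features = features[:, np.newaxis, :]  # shape (500, 1, 32)
print(features.shape)
```

The first 16 output maps depend only on channel 0 and the last 16 only on channel 1, so information from the two channels is not mixed until the following layers.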

<img src="https://drive.google.com/uc?export=view&id=1FvkaeBLF42DuYYgePIj3NhKetzK3Abj6" width="1000" height="500">

<img src="https://drive.google.com/uc?export=view&id=1Fzol42i8u8hvSC6DEMDM3ScsoyeW4TTx" width="500" height="340">



## 4 Build the Neural Network Model <p id="build"> </p>

#### Exercise 5: Changing the output layer to predict parameters instead of performing model classification.

**Question 5 (add code below and copy it on the assessment form):** Please add code to the cell below, where you have a #FILL HERE written, to change the output layer to perform parameter estimation (regression) on the 4 parameters of the BDSS model.

In [None]:
# Creation of the Network Model: model definition
def build_model():
    # Initialize the Sequential model
    model = Sequential()

    # First convolutional layer:
    # - Filters: 32
    # - Kernel size: (1, 19), covering all 19 features of one row and sliding along the first dimension (the 500 leaves/nodes)
    # - Input shape: (500, 19, 2) where 500 is the number of tree leaves/nodes, 19 is the feature size, and 2 is the number of channels (leaves and nodes)
    # - Activation function: ReLU (Rectified Linear Unit)
    # - Groups: 2 to apply separate convolutions for the two channels (leaves and nodes)
    model.add(Conv2D(filters=32, use_bias=False, kernel_size=(1, 19), input_shape=(500, 19, 2), activation='relu', groups=2))

    # Apply batch normalization to stabilize and speed up the training process
    model.add(BatchNormalization())

    # Second convolutional layer:
    # - Filters: 32
    # - Kernel size: (1, 1) to process each leaf/node position independently, combining its 32 feature maps
    # - Activation function: ReLU
    model.add(Conv2D(filters=32, use_bias=False, kernel_size=(1, 1), activation='relu'))

    # Apply batch normalization again
    model.add(BatchNormalization())

    # Third convolutional layer:
    # - Filters: 32
    # - Kernel size: (1, 1) for further feature processing
    # - Activation function: ReLU
    model.add(Conv2D(filters=32, use_bias=False, kernel_size=(1, 1), activation='relu'))

    # Apply batch normalization for the final time before pooling
    model.add(BatchNormalization())

    # Average each of the 32 feature maps over all positions, giving a 1D vector of length 32
    # that is passed to the fully connected (dense) layers
    model.add(GlobalAveragePooling2D())

    # Fully connected (FFNN) part:
    # Dense layers with decreasing number of units, all using ReLU activation:
    model.add(Dense(64, activation='relu'))   # First dense layer with 64 units
    model.add(Dense(32, activation='relu'))   # Second dense layer with 32 units
    model.add(Dense(16, activation='relu'))   # Third dense layer with 16 units
    model.add(Dense(8, activation='relu'))    # Fourth dense layer with 8 units

    # Output layer:
    #This was the code for adding the output layer in the previous notebook. Change this to predict 4 continuous parameters.
    #model.add(Dense(3, activation='softmax'))
    #FILL HERE with the code to perform parameter estimation (regression).
    model.add(Dense(4, activation='linear'))

    # Show the summary of the model structure (number of layers, shapes of outputs, etc.)
    model.summary()

    # Return the constructed model
    return model

Now we compile and fit the model. Note that, compared to the previous Notebook, we changed the loss function (from `keras.losses.categorical_crossentropy` to `losses.mean_absolute_percentage_error`) and the metric (from accuracy to MAE).

#### Exercise 6: Loss function and metrics

**Question 6 (write the answer on the assessment form):** a) Why do we have to change the loss function and the metrics for parameter estimation? Compare what was being done with the previous loss function with the calculations on the new loss function to build your answer.

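Categorical cross-entropy compares a predicted probability distribution over classes with the true class label, and accuracy counts the fraction of correctly predicted classes. Neither is meaningful for continuous targets. Here the network outputs four real-valued parameters, so the loss has to measure the distance between the predicted and the true values. The mean absolute percentage error (MAPE) averages the absolute error divided by the true value, which puts parameters on different scales (such as R_nought and fraction_1) on a comparable footing. The mean absolute error (MAE), used as the metric, reports the average absolute difference between predictions and true values without rescaling.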
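
The difference between the two regression quantities can be checked on a small made-up example. The sketch below follows the usual definitions of MAPE and MAE; Keras additionally clips the denominator of MAPE with a small epsilon, which is mirrored here by the assumed `eps` argument. The target values are invented for illustration and do not come from the simulations.

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps))

def mae(y_true, y_pred):
    """Mean absolute error, averaged over all values."""
    return np.mean(np.abs(y_true - y_pred))

# Made-up targets for two parameters on very different scales
y_true = np.array([[2.0, 0.10],
                   [4.0, 0.20]])
# Predictions that are 10% too high for every value
y_pred = y_true * 1.1

mape_value = mape(y_true, y_pred)  # same relative error for both columns
mae_value = mae(y_true, y_pred)    # dominated by the larger-scale column
print(mape_value, mae_value)
```

Both columns contribute equally to MAPE because every prediction is off by the same relative amount, whereas the MAE is driven mostly by the column with the larger values.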

In [None]:
from keras import losses

# Initialize the model using the build_model function that was previously defined
estimator = build_model()

# Compile the model:
# - Loss function: mean_absolute_percentage_error (MAPE) measures the average absolute error between predicted and true parameter values, relative to the true values.
# - Optimizer: 'Adam' is used to minimize the loss function efficiently
# - Metrics: the mean absolute error (MAE) is used to track the model's performance during training
estimator.compile(loss=losses.mean_absolute_percentage_error, optimizer = 'Adam', metrics=['mae'])

# Early stopping callback to prevent overfitting:
# - monitor: monitor the validation loss during training
# - patience: stop training if the validation loss doesn't improve for 100 consecutive epochs
# - mode: 'min' indicates that an improvement is a decrease of the validation loss
# - restore_best_weights: restore the weights from the best epoch (the one with the lowest validation loss)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=100, mode='min', restore_best_weights=True)

# Custom callback to display training progress:
# - Print a dot for every epoch (or newline every 100 epochs) to indicate progress in training
class PrintD(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0:  # Print a newline every 100 epochs
            print('')
        print('.', end='')  # Print a dot to indicate progress during each epoch

# Set the maximum number of epochs (iterations over the entire dataset)
EPOCHS = 1000

# Train the model using the `fit` method:
# - encoding_BDSS: The encoded training trees (inputs)
# - Y: The target values (outputs)
# - verbose: set to 1 to print progress during training
# - epochs: The number of times to iterate over the entire dataset
# - validation_split: the fraction of data to use for validation (used to monitor validation loss)
# - batch_size: the number of samples per gradient update
# - callbacks: list of callbacks to be used during training (early stopping and progress display)
history = estimator.fit(encoding_BDSS, Y, verbose=1, epochs=EPOCHS, validation_split=valid_frac, batch_size=32, callbacks=[early_stop, PrintD()])
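
The stopping rule applied by the early stopping callback can be sketched in a few lines of plain Python. This is a simplified illustration of the patience logic only: it ignores options such as `min_delta`, and the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without a new
    minimum; the best epoch is the one whose weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New minimum: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training runs through the last epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stops after epoch 2
val_losses = [50.0, 42.0, 40.0, 41.0, 43.0, 40.5, 44.0]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With `patience=100`, as in the cell above, training continues for up to 100 epochs after the best validation loss before stopping, so the loss curves typically extend well past the epoch whose weights are kept.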

#### Exercise 7: Plotting the accuracy through training

**Question 7a (add code to the assessment form):** Please add code to the cell below to plot the accuracy levels throughout the training epoch for both the training and the validation set.

In [None]:

# Add code here to plot ac
import matplotlib.pyplot as plt

#Plot the training and the validation loss
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model performance')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

In [None]:
# Add code here to plot Mean Absolute Error
import matplotlib.pyplot as plt

#Plot the training and the validation mae
plt.figure(figsize=(10, 5))
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Model performance')
plt.ylabel('Mean Absolute Error')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

**Question 7b (write the answer on the assessment form):** a) Based on the plot obtained in the previous question, please cite which of the sets (training or validation) presented a higher accuracy and explain whether this is an expected result. b) Analyse the plot and discuss what can be said about overfitting based on the results.

a) The training set shows a consistently lower Mean Absolute Error (MAE) than the validation set, so the model performs better on the training set (lower loss and lower MAE).

This result is expected: the model is optimized to minimize the error on the training set, so it usually performs better on the data it was trained on.

b) Analysis of overfitting

A large gap between training and validation performance can signal overfitting. Analyzing the plots:

- **MAE plot:** The training MAE decreases steadily, indicating that the model is learning and fitting the training data well. The validation MAE decreases at first but then shows more variability and does not decrease as consistently as the training MAE. This suggests that the model may not generalize well to unseen data.
- **Loss plot:** As in the MAE plot, the training loss decreases significantly, while the validation loss fluctuates and stays at higher values. The model fits the training data closely but struggles to perform equally well on the validation data.

**Conclusion on overfitting:** The disparity in performance between the training and validation sets, particularly the increasing validation error despite decreasing training error, suggests that the model may be overfitting.
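
The visual reading of the curves can be backed by two numbers: the epoch with the lowest validation loss, and the gap between validation and training loss at that epoch and at the end of training. The sketch below is one possible way to compute them. The helper `summarize_history` and the curve values are made up for illustration; in the notebook one would pass `history.history['loss']` and `history.history['val_loss']` instead.

```python
import numpy as np

def summarize_history(train_loss, val_loss):
    """Return the best validation epoch and the validation-minus-training gap there and at the end."""
    train_loss = np.asarray(train_loss, dtype=float)
    val_loss = np.asarray(val_loss, dtype=float)
    best_epoch = int(np.argmin(val_loss))
    return {
        "best_epoch": best_epoch,
        "gap_at_best": float(val_loss[best_epoch] - train_loss[best_epoch]),
        "gap_at_end": float(val_loss[-1] - train_loss[-1]),
    }

# Made-up curves: training loss keeps falling while validation loss turns upward
train_loss = [60.0, 45.0, 38.0, 33.0, 30.0, 28.0]
val_loss = [62.0, 48.0, 42.0, 41.0, 43.0, 46.0]

summary = summarize_history(train_loss, val_loss)
print(summary)
```

A gap that keeps growing after the best epoch, as in these invented curves, is the pattern usually associated with overfitting.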

### Evaluate the trained model
We evaluate our trained model on the test set, which was not seen by the network during training. We will look at three error metrics: the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE) and the Relative Mean Error (RME). We will focus on the RME to simplify the discussion.
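
Written out for a single parameter, the three metrics computed by the `get_mae_rmse` function of this notebook are given below. The symbols are notation introduced here for illustration: $y_i$ is the true value for test tree $i$, $\hat{y}_i$ the predicted value, and $n$ the number of test trees.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\qquad
\mathrm{RME} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{y_i}
$$

MAE and RMSE are expressed in the units of the parameter, so they cannot be compared directly between parameters on different scales. The RME is dimensionless, which is why it is used for the comparison between parameters.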

In [None]:
#Plot test vs predicted
# predict values for the test set
predicted_test = pd.DataFrame(estimator.predict(encoding_test_BDSS))
predicted_test.columns = Y_test.columns # rename correctly the columns
predicted_test.index = Y_test.index # rename indexes for correspondence

# Parameter names, used to subset columns and to name the outputs
elts = list(Y_test.columns)

# For each parameter, collect the error (predicted - target), the target and the prediction
df = pd.concat(
    [pd.DataFrame({'predicted_minus_target_' + elt: predicted_test[elt] - Y_test[elt],
                   'target_' + elt: Y_test[elt],
                   'predicted_' + elt: predicted_test[elt]})
     for elt in elts],
    axis=1)

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 336ms/step


In [None]:
errors_index = elts

errors_columns = ['MAE', 'RMSE', 'RME']
errors = pd.DataFrame(index=errors_index, columns=errors_columns)

def get_mae_rmse(name_var):
    """Return the MAE, RMSE and RME of one parameter on the test set."""
    predicted_vals = df['predicted_' + name_var]
    target_vals = df['target_' + name_var]
    diffs_abs = abs(target_vals - predicted_vals)
    # Relative error: absolute error divided by the true value
    diffs_rel = diffs_abs/target_vals
    diffs_abs_squared = diffs_abs**2
    mae = np.sum(diffs_abs)/len(diffs_abs)
    rmse = np.sqrt(np.sum(diffs_abs_squared)/len(diffs_abs_squared))
    rme = np.sum(diffs_rel)/len(diffs_rel)
    return mae, rmse, rme


#errors.loc['R_nought'] = np.array(get_mae_rmse('R_nought'))
for elt in errors_index:
    errors.loc[elt] = np.array(get_mae_rmse(elt))

print(errors)

                        MAE      RMSE       RME
R_nought           1.020733  1.253998  0.347006
infectious_period  1.148103  1.454142  0.357976
x_transmission     6.424325   6.73738       1.0
fraction_1         0.040811   0.04852  0.361647


**Question 8 (write the answer on the assessment form):** Based on the obtained table showing the error values for each parameter, which of the 4 parameters is the hardest to estimate? Please explain your answer, comparing the values of RME.

To determine which of the four parameters is the hardest to estimate, we compare the RME (Relative Mean Error) values. The parameter with the highest RME has the largest error relative to its true value, so it is the hardest to estimate accurately. From the table above:

- x_transmission: 1.0
- fraction_1: 0.361647
- infectious_period: 0.357976
- R_nought: 0.347006

x_transmission has by far the highest RME: on average, its prediction is off by 100% of the true value. The other three parameters have similar RME values, between about 0.35 and 0.36, so they are estimated with comparable relative accuracy, with R_nought slightly better than infectious_period and fraction_1.

We can conclude that, based on the RME values, x_transmission is the hardest parameter to estimate with this model. Its estimates are the least reliable of the four and should be interpreted with the most caution.