# DeepRank-Core basic Protein-Protein Interface training

## Setup

### Data

The example data used in this tutorial are available on Zenodo at [this record address](https://zenodo.org/record/7997586). To download the data, visit the link and click the "Download" button. Unzip the downloaded file and save the contents as a folder named `data/` in the same directory as this notebook. The folder name and location are optional but recommended, as these are the name and location we will use to refer to the folder throughout the tutorial.

The dataset contains only 100 data points, which is not enough to develop an impactful predictive model; it is intended for demonstration purposes only.
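
Once the archive is unpacked, a quick sanity check can confirm that the processed files are where the rest of the tutorial expects them. The sketch below is a minimal helper and not part of DeepRank-Core: the function name `find_processed_files` is hypothetical, and the layout `data/ppi/processed/<level>/*.hdf5` is assumed from the paths defined later in this notebook.

```python
from pathlib import Path


def find_processed_files(root="data", level="residue"):
    """Return the sorted HDF5 files under <root>/ppi/processed/<level>/.

    Hypothetical helper for this tutorial only; raises if nothing is found.
    """
    folder = Path(root) / "ppi" / "processed" / level
    files = sorted(folder.glob("*.hdf5"))
    if not files:
        raise FileNotFoundError(f"No .hdf5 files found in {folder}")
    return files
```

Calling `find_processed_files()` from the notebook directory should return a non-empty list if the data were saved as described above.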

### Software

1. Follow the [updated instructions](https://github.com/DeepRank/deeprank-core#installation) in the README.md on the main branch to install the `deeprankcore` package.
2. To test the environment in which `deeprankcore` has been installed, first clone the [deeprank-core repository](https://github.com/DeepRank/deeprank-core). Navigate into it and, after activating the environment and installing [pytest](https://anaconda.org/anaconda/pytest), run `pytest tests`. All tests should pass. We recommend installing `deeprankcore` and all its dependencies into a [conda](https://docs.conda.io/en/latest/) environment.
3. Additionally, for this tutorial you need to install [scikit-learn](https://scikit-learn.org/stable/install.html) and [plotly](https://anaconda.org/plotly/plotly).

## Introduction

<img style="margin-left: 1.5rem" align="right" src="images/training_ppi.png" width="400">

This tutorial will demonstrate the use of DeepRank-Core for training Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs) using Protein-Protein Interface (PPI) data for classification and regression predictive tasks.

This tutorial assumes that the PPI data of interest have already been generated and saved into [HDF5 files](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), with the data structure that DeepRank-Core expects. Such data are already available in the `data/ppi/processed/` folder; for more details about how they were generated, please refer to the tutorial notebook [data_generation_ppi.ipynb](https://github.com/DeepRank/deeprank-core/blob/main/tutorials/data_generation_ppi.ipynb). It describes in detail how such data are generated starting from the PDB files and gives instructions for generating data at different resolutions, i.e. residue- and atomic-level.

This tutorial also assumes a basic knowledge of the [PyTorch](https://pytorch.org/) framework, on top of which the machine learning pipeline of DeepRank-Core has been developed.

### Use case

<img style="margin-right: 1.5rem" align="left" src="images/pmhc_pdb_example.png" width="200"/>

The example dataset that we provide contains PDB files representing [Major Histocompatibility Complex (MHC) protein](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) + peptide (pMHC) complexes, which play a key role in T-cell immunity. We are interested in predicting the Binding Affinity (BA) of the complexes, which can be used to determine the most suitable mutated tumor peptides as vaccine candidates.

The PDB models used in this tutorial have been generated with [PANDORA](https://github.com/X-lab-3D/PANDORA), an anchor-restrained modeling pipeline for generating peptide-MHC structures, while the target data, i.e. the BA values for these pMHC complexes, have been retrieved from [MHCFlurry 2.0](https://data.mendeley.com/datasets/zx3kjzc3yx).

On the left an example of a pMHC structure is shown, rendered using [ProteinViewer](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer). The MHC protein is displayed in green, while the peptide is in orange.

## Utilities

### Libraries

Let's import the libraries needed for this tutorial:

In [1]:
import logging
import glob
import os
import h5py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_curve,
    auc,
    precision_score,
    recall_score,
    accuracy_score,
    f1_score)
import plotly.express as px
import torch
import numpy as np
np.seterr(divide='ignore', invalid='ignore')
logging.basicConfig(level=logging.INFO)
from deeprankcore.dataset import GraphDataset, GridDataset
from deeprankcore.trainer import Trainer
from deeprankcore.neuralnets.gnn.naive_gnn import NaiveNetwork
from deeprankcore.neuralnets.cnn.model3d import CnnClassification
from deeprankcore.utils.exporters import HDF5OutputExporter
import warnings
warnings.filterwarnings('ignore')

### Paths and sets

Let's define the paths for reading the processed data:

In [2]:
level = "residue"
processed_data_path = os.path.join("data", "ppi", "processed", level)
input_data_path = glob.glob(os.path.join(processed_data_path, '*.hdf5'))
output_path = os.path.join("data", "ppi") # for saving predictions results

The level refers to the molecular resolution of the data, either residue- or atomic-level. In residue-level graphs, each node represents one amino acid residue, while in atomic-level graphs each node represents one atom within the amino acid residues. In this tutorial we use residue-level data, but the same code can be applied to atomic-level data.

Let's define a Pandas dataframe containing data points' IDs and the binary target values:

In [3]:
df_dict = {}
df_dict['entry'] = []
df_dict['target'] = []
for fname in input_data_path:
    with h5py.File(fname, 'r') as hdf5:
        for mol in hdf5.keys():
            target_value = float(hdf5[mol]["target_values"]["binary"][()])
            df_dict['entry'].append(mol)
            df_dict['target'].append(target_value)

df = pd.DataFrame(data=df_dict)
df.head()

|   | entry | target |
|---|---|---|
| 0 | residue-ppi:M-P:BA-102611 | 1.0 |
| 1 | residue-ppi:M-P:BA-102669 | 0.0 |
| 2 | residue-ppi:M-P:BA-102719 | 0.0 |
| 3 | residue-ppi:M-P:BA-114468 | 0.0 |
| 4 | residue-ppi:M-P:BA-115138 | 0.0 |


As explained in [data_generation_ppi.ipynb](https://github.com/DeepRank/deeprank-core/blob/main/tutorials/data_generation_ppi.ipynb), each data point has two targets: "BA" and "binary". The first is the continuous Binding Affinity value of the complex; the second is its binarized form, where 0 (BA > 500 nM) denotes a non-binding complex and 1 (BA <= 500 nM) a binding one.
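
The relation between the two targets can be written down in a couple of lines. This is only an illustration of the 500 nM rule stated above; the function name `binarize_ba` is hypothetical, and the BA values in the example are made up.

```python
def binarize_ba(ba_nm, threshold_nm=500.0):
    """Map a binding affinity in nM to the binary label described above.

    1 = binding (BA <= threshold), 0 = non-binding (BA > threshold).
    """
    return 1 if ba_nm <= threshold_nm else 0


# Made-up BA values, for illustration only.
labels = [binarize_ba(ba) for ba in (12.5, 500.0, 501.0, 20000.0)]
print(labels)  # [1, 1, 0, 0]
```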

The dataframe `df` is used only to split the data points into training, validation and test sets according to the "binary" target, using target stratification to keep the proportion of 0s and 1s approximately constant across the sets. The training set is used to update the network's weights and the validation set to monitor the loss during training (for example for early stopping), while the test set is held out as an independent set and used later to evaluate the model.

In [4]:
df_train, df_test = train_test_split(df, test_size=0.1, stratify=df.target, random_state=42)
df_train, df_valid = train_test_split(df_train, test_size=0.2, stratify=df_train.target, random_state=42)

print(f'Data statistics:\n')
print(f'Total samples: {len(df)}\n')
print(f'Training set: {len(df_train)} samples, {round(100*len(df_train)/len(df))}%')
print(f'\t- Class 0: {len(df_train[df_train.target == 0])} samples, {round(100*len(df_train[df_train.target == 0])/len(df_train))}%')
print(f'\t- Class 1: {len(df_train[df_train.target == 1])} samples, {round(100*len(df_train[df_train.target == 1])/len(df_train))}%')
print(f'Validation set: {len(df_valid)} samples, {round(100*len(df_valid)/len(df))}%')
print(f'\t- Class 0: {len(df_valid[df_valid.target == 0])} samples, {round(100*len(df_valid[df_valid.target == 0])/len(df_valid))}%')
print(f'\t- Class 1: {len(df_valid[df_valid.target == 1])} samples, {round(100*len(df_valid[df_valid.target == 1])/len(df_valid))}%')
print(f'Testing set: {len(df_test)} samples, {round(100*len(df_test)/len(df))}%')
print(f'\t- Class 0: {len(df_test[df_test.target == 0])} samples, {round(100*len(df_test[df_test.target == 0])/len(df_test))}%')
print(f'\t- Class 1: {len(df_test[df_test.target == 1])} samples, {round(100*len(df_test[df_test.target == 1])/len(df_test))}%')

Data statistics:

Total samples: 100

Training set: 72 samples, 72%
	- Class 0: 36 samples, 50%
	- Class 1: 36 samples, 50%
Validation set: 18 samples, 18%
	- Class 0: 9 samples, 50%
	- Class 1: 9 samples, 50%
Testing set: 10 samples, 10%
	- Class 0: 5 samples, 50%
	- Class 1: 5 samples, 50%
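
To make explicit what the `stratify` argument achieves, here is a plain-Python sketch that splits each class separately so that both subsets keep the class proportions. It is an illustration of the idea only, not scikit-learn's implementation; the function name `stratified_split` and the toy entries are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(entries, targets, test_fraction, seed=42):
    """Split entries so that each class contributes the same fraction to the test set."""
    by_class = defaultdict(list)
    for entry, target in zip(entries, targets):
        by_class[target].append(entry)

    rng = random.Random(seed)
    train, test = [], []
    for target in sorted(by_class):
        members = by_class[target][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend((entry, target) for entry in members[:n_test])
        train.extend((entry, target) for entry in members[n_test:])
    return train, test


# Toy data: 50 entries of class 0 and 50 of class 1, like the tutorial dataset.
entries = [f"entry-{i}" for i in range(100)]
targets = [i % 2 for i in range(100)]
train, test = stratified_split(entries, targets, test_fraction=0.1)
print(len(train), len(test))    # 90 10
print(sum(t for _, t in test))  # 5 (half of the test set is class 1)
```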


## Classification example

Let's train a GNN and a CNN for a classification task, which consists of predicting the "binary" target values.

### GNN

#### GraphDataset

For training GNNs the user can create `GraphDataset` instances. This class inherits from the `DeeprankDataset` class, which in turn inherits from the `Dataset` [PyTorch Geometric class](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/data/dataset.html), a base class for creating graph datasets.

A few notes about `GraphDataset` parameters:
- By default, all features contained in the HDF5 files are used, but the user can specify `node_features` and `edge_features` in `GraphDataset` if not all of them are needed. See the [docs](https://deeprankcore.readthedocs.io/en/latest/features.html) for more details about all the possible pre-implemented features. 
- For the `GraphDataset` class only, it is possible to define a dictionary indicating which transformations to apply to the features, where the transformations are lambda functions and/or standardization. If `standardize` is `True`, standardization is applied after the transformation, if the latter is present. Standardization applies the following formula to each feature value: ${x' = \frac{x - \mu}{\sigma}}$, where ${\mu}$ is the mean and ${\sigma}$ the standard deviation of that feature, so that the values are centered around the mean with unit standard deviation. In the example below we apply a cube-root transformation (`np.cbrt`) and then standardization to all the features. It is also possible to use specific feature names as keys to apply transformation and/or standardization to a few features only.
- Since we are applying standardization, we need to use the means and standard deviations of the training features to scale the validation and test sets. The `train` and `dataset_train` parameters serve this purpose. When `train` is set to `False`, a `dataset_train` of the same class must be provided, and it is used to scale the validation/test set according to its feature values. You need to pass `features_transform` to the training dataset only: elsewhere it is ignored and only the one of `dataset_train` is considered.
- For regression, `task` can be set to `regress` and `target` to `BA`, which is a continuous variable and thus suitable for regression.
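
The reason the training statistics must be reused for the other sets can be seen in a small plain-Python sketch. It is not DeepRank-Core code: the helper names are hypothetical, the feature values are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the library may use a different estimator.

```python
import statistics


def fit_standardizer(train_values, transform=None):
    """Compute mean and standard deviation on the (optionally transformed) training values."""
    if transform is not None:
        train_values = [transform(v) for v in train_values]
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)  # assumption: population standard deviation
    return mu, sigma


def apply_standardizer(values, mu, sigma, transform=None):
    """Transform first, then scale with the training mean and standard deviation."""
    if transform is not None:
        values = [transform(v) for v in values]
    return [(v - mu) / sigma for v in values]


def cube_root(x):
    """Real cube root that also works for negative inputs."""
    return x ** (1 / 3) if x >= 0 else -((-x) ** (1 / 3))


# Made-up values of a single feature.
train_feature = [1.0, 8.0, 27.0, 64.0]
val_feature = [8.0, 125.0]

mu, sigma = fit_standardizer(train_feature, transform=cube_root)
train_scaled = apply_standardizer(train_feature, mu, sigma, transform=cube_root)
val_scaled = apply_standardizer(val_feature, mu, sigma, transform=cube_root)
```

After scaling, the training values have zero mean and unit standard deviation, while the validation values are placed on the same scale without influencing it.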

In [25]:
target = "binary"
task = "classif"
features_transform = {'all': {'transform': lambda x: np.cbrt(x), 'standardize': True}}

print('Loading training data...')
dataset_train = GraphDataset(
    hdf5_path = input_data_path,
    subset = list(df_train.entry), # selects only data points with ids in df_train.entry
    target = target,
    task = task,
    features_transform = features_transform
)
print('\nLoading validation data...')
dataset_val = GraphDataset(
    hdf5_path = input_data_path,
    subset = list(df_valid.entry), # selects only data points with ids in df_valid.entry
    target = target,
    task = task,
    train = False,
    dataset_train = dataset_train
)
print('\nLoading test data...')
dataset_test = GraphDataset(
    hdf5_path = input_data_path,
    subset = list(df_test.entry), # selects only data points with ids in df_test.entry
    target = target,
    task = task,
    train = False,
    dataset_train = dataset_train
)

INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]


Loading training data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 403.02it/s, entry_name=proc-28836.hdf5]


INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]



Loading validation data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 728.01it/s, entry_name=proc-28836.hdf5]

features_transform will remain the same as the one used in training phase.
INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]




Loading test data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 805.93it/s, entry_name=proc-28836.hdf5]

features_transform will remain the same as the one used in training phase.





#### Trainer

The class `Trainer` implements training, validation and testing of PyTorch-based neural networks. 

A few notes about `Trainer` parameters:
- `neuralnet` can be any neural network class that inherits from `torch.nn.Module`, and it shouldn't be specific to regression or classification in terms of output shape. The `Trainer` class takes care of formatting the output shape according to the task. In this tutorial we will use `NaiveNetwork` (in `deeprankcore.neuralnets.gnn.naive_gnn`), whose architecture is shown below:
  ```python
  import torch
  from torch.nn import Linear, Module, ReLU, Sequential
  from torch_scatter import scatter_mean, scatter_sum


  class NaiveConvolutionalLayer(Module):

      def __init__(self, count_node_features, count_edge_features):
          super().__init__()
          message_size = 32
          edge_input_size = 2 * count_node_features + count_edge_features
          self._edge_mlp = Sequential(Linear(edge_input_size, message_size), ReLU())
          node_input_size = count_node_features + message_size
          self._node_mlp = Sequential(Linear(node_input_size, count_node_features), ReLU())

      def forward(self, node_features, edge_node_indices, edge_features):
          # generate messages over edges
          node0_indices, node1_indices = edge_node_indices
          node0_features = node_features[node0_indices]
          node1_features = node_features[node1_indices]
          message_input = torch.cat([node0_features, node1_features, edge_features], dim=1)
          messages_per_neighbour = self._edge_mlp(message_input)
          # aggregate messages
          out = torch.zeros(node_features.shape[0], messages_per_neighbour.shape[1]).to(node_features.device)
          message_sums_per_node = scatter_sum(messages_per_neighbour, node0_indices, dim=0, out=out)
          # update nodes
          node_input = torch.cat([node_features, message_sums_per_node], dim=1)
          node_output = self._node_mlp(node_input)
          return node_output


  class NaiveNetwork(Module):

      def __init__(self, input_shape: int, output_shape: int, input_shape_edge: int):
          """
          Args:
              input_shape (int): Number of node input features.
              output_shape (int): Number of output values per graph.
              input_shape_edge (int): Number of edge input features.
          """
          super().__init__()
          self._external1 = NaiveConvolutionalLayer(input_shape, input_shape_edge)
          self._external2 = NaiveConvolutionalLayer(input_shape, input_shape_edge)
          hidden_size = 128
          self._graph_mlp = Sequential(Linear(input_shape, hidden_size), ReLU(), Linear(hidden_size, output_shape))

      def forward(self, data):
          # two rounds of message passing, then average the node features of each graph
          external_updated1_node_features = self._external1(data.x, data.edge_index, data.edge_attr)
          external_updated2_node_features = self._external2(external_updated1_node_features, data.edge_index, data.edge_attr)
          means_per_graph_external = scatter_mean(external_updated2_node_features, data.batch, dim=0)
          graph_input = means_per_graph_external
          z = self._graph_mlp(graph_input)
          return z
  ```
- `class_weights` is used in classification tasks only, to assign class weights based on the training dataset content. Here the dataset is balanced (50% 0 and 50% 1), so class weights are not needed. It defaults to `False`.
- `cuda` and `ngpu` indicate whether to use CUDA and how many GPUs. By default, CUDA is not used and `ngpu` is 0.
- The user can specify a deeprankcore exporter or a custom one in the `output_exporters` parameter, together with the path where the results are saved. Exporters store the prediction information collected during training and testing. We will see later how to read the results saved by `HDF5OutputExporter`.
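
The aggregation steps in `NaiveConvolutionalLayer` and `NaiveNetwork` rely on `scatter_sum` and `scatter_mean`. The plain-Python sketch below mimics what these operations do along the first dimension, using lists of vectors. It illustrates the idea only and is not the `torch_scatter` implementation; the `_py` function names are hypothetical and the numbers are made up.

```python
def scatter_sum_py(src, index, size):
    """Sum the rows of `src` that share the same value in `index`.

    src:   list of equal-length vectors (e.g. one message per edge)
    index: list with one target position per row of `src`
    size:  number of output rows (e.g. number of nodes)
    """
    width = len(src[0])
    out = [[0.0] * width for _ in range(size)]
    for row, target in zip(src, index):
        for j, value in enumerate(row):
            out[target][j] += value
    return out


def scatter_mean_py(src, index, size):
    """Average the rows of `src` that share the same value in `index`."""
    sums = scatter_sum_py(src, index, size)
    counts = [0] * size
    for target in index:
        counts[target] += 1
    return [
        [value / counts[i] if counts[i] else 0.0 for value in row]
        for i, row in enumerate(sums)
    ]


# Three edge messages: two are aggregated at node 0, one at node 2.
messages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
node0_indices = [0, 0, 2]
print(scatter_sum_py(messages, node0_indices, size=3))
# [[4.0, 6.0], [0.0, 0.0], [5.0, 6.0]]
print(scatter_mean_py(messages, node0_indices, size=3))
# [[2.0, 3.0], [0.0, 0.0], [5.0, 6.0]]
```

In the network, the sum collects edge messages per node, and the mean (indexed by `data.batch`) collapses all nodes of a graph into a single vector per graph.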

##### Training

In [26]:
trainer = Trainer(
    neuralnet = NaiveNetwork,
    dataset_train = dataset_train,
    dataset_val = dataset_val,
    dataset_test = dataset_test,
    output_exporters = [HDF5OutputExporter(os.path.join(output_path, "gnn_classif"))]
)

INFO:deeprankcore.trainer:Device set to cpu.
INFO:deeprankcore.trainer:No loss function provided, the default loss function for classif tasks is used: <class 'torch.nn.modules.loss.CrossEntropyLoss'>


The default optimizer is `torch.optim.Adam`. It is possible to specify the optimizer's parameters or to use another PyTorch optimizer class:

In [27]:
optimizer = torch.optim.SGD
lr = 1e-3
weight_decay = 0.001

trainer.configure_optimizers(optimizer, lr, weight_decay)
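
To give a feeling for what `lr` and `weight_decay` do, the sketch below applies one update of plain SGD (no momentum) to a single scalar weight, with the weight decay added to the gradient as an L2 penalty. It is a simplified illustration, not PyTorch code; the function name `sgd_step` and the numbers are hypothetical.

```python
def sgd_step(weight, grad, lr=1e-3, weight_decay=0.001):
    """One plain SGD update with L2 weight decay folded into the gradient."""
    return weight - lr * (grad + weight_decay * weight)


# Made-up weight and gradient, for illustration only.
w = sgd_step(weight=1.0, grad=0.5)
```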

The default loss function for classification is `torch.nn.CrossEntropyLoss`, while for regression it is `torch.nn.MSELoss`. It is also possible to set other PyTorch loss functions by using the `Trainer.set_lossfunction` method.
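
As a reminder of what the cross-entropy loss measures, here is a plain-Python sketch for a single sample, computed from raw scores (logits). It is an illustration rather than the PyTorch implementation, and the function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]


print(round(cross_entropy([0.0, 0.0], target=1), 4))  # 0.6931, i.e. ln(2)
print(cross_entropy([2.0, -1.0], target=0) < cross_entropy([2.0, -1.0], target=1))  # True
```

For a two-class problem, an uninformative 50/50 prediction therefore gives a loss of about 0.693, which is a handy baseline when reading training logs.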

Then we can train our model, using the `train()` method of the `Trainer` class.

A few notes about `train()` method parameters:
- `earlystop_patience`, `earlystop_maxgap` and `min_epoch` control the early stopping logic. `earlystop_patience` indicates the number of epochs after which the training ends if the validation loss has not improved. `earlystop_maxgap` indicates the maximum difference allowed between validation and training loss, and `min_epoch` is the minimum epoch to be reached before looking at maxgap.
- `validate` is set to `True` to perform validation on an independent dataset, which we called `dataset_val` a few cells above.
- `num_workers` can be set for indicating how many subprocesses to use for data loading. The default is 0 and it means that the data will be loaded in the main process.
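
The interplay of the three early-stopping parameters can be sketched in plain Python. The rules below are a simplified reading of the description above and an assumption about the exact semantics, not DeepRank-Core's actual code; the function name `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(train_losses, val_losses, patience=None, maxgap=None, min_epoch=10):
    """Return the epoch at which training would stop, or None if it runs to the end.

    Assumed rules (a simplification, not DeepRank-Core's exact code):
    - stop when the validation loss has not improved for `patience` epochs;
    - from `min_epoch` onwards, stop when the validation loss exceeds the
      training loss by more than `maxgap`.
    """
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch, (train_loss, val_loss) in enumerate(zip(train_losses, val_losses)):
        if val_loss < best_val:
            best_val = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        if patience is not None and epochs_without_improvement >= patience:
            return epoch
        if maxgap is not None and epoch >= min_epoch and val_loss - train_loss > maxgap:
            return epoch
    return None


# Made-up loss curves: training keeps improving, validation stalls after epoch 1.
train_losses = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98]
print(early_stop_epoch(train_losses, val_losses, patience=3))              # 4
print(early_stop_epoch(train_losses, val_losses, maxgap=0.3, min_epoch=2))  # 2
```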

In [28]:
epochs = 20
batch_size = 8
earlystop_patience = 5
earlystop_maxgap = 0.1
min_epoch = 10

trainer.train(
    nepoch = epochs,
    batch_size = batch_size,
    earlystop_patience = earlystop_patience,
    earlystop_maxgap = earlystop_maxgap,
    min_epoch = min_epoch,
    validate = True,
    filename = os.path.join(output_path, "gnn_classif", "model.pth.tar"))

epoch = trainer.epoch_saved_model
print(f"Model saved at epoch {epoch}")
pytorch_total_params = sum(p.numel() for p in trainer.model.parameters())
print(f'Total # of parameters: {pytorch_total_params}')
pytorch_trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
print(f'Total # of trainable parameters: {pytorch_trainable_params}')

INFO:deeprankcore.trainer:Training set loaded

INFO:deeprankcore.trainer:Validation set loaded

INFO:deeprankcore.trainer:Epoch 0:
INFO:deeprankcore.trainer:training loss 0.8239355815781487 | time 0.4840049743652344
INFO:deeprankcore.trainer:validation loss 0.750345379114151 | time 0.13220000267028809
INFO:deeprankcore.trainer:Epoch 1:
INFO:deeprankcore.trainer:training loss 4.725361492898729 | time 1.5999512672424316
INFO:deeprankcore.trainer:validation loss 1.005086514684889 | time 0.1488018035888672
INFO:deeprankcore.trainer:Best model saved at epoch # 1.
INFO:deeprankcore.trainer:Epoch 2:
INFO:deeprankcore.trainer:training loss 0.835904081662496 | time 1.2459933757781982
INFO:deeprankcore.trainer:validation loss 0.6674932440121969 | time 0.15125608444213867
INFO:deeprankcore.trainer:Best model saved at epoch # 2.
INFO:deeprankcore.trainer:Validation loss decreased (1.005087 --> 0.667493).
INFO:deeprankcore.trainer:Epoch 3:
INFO:deeprankcore.trainer:training loss 0.7002725137604607 

Model saved at epoch 5
Total # of parameters: 12230
Total # of trainable parameters: 12230


##### Testing

And we can test the trained model on our `dataset_test`:

In [29]:
trainer.test()

INFO:deeprankcore.trainer:Loading independent testing dataset...
INFO:deeprankcore.trainer:Testing set loaded

INFO:deeprankcore.trainer:testing loss 0.7380596995353699 | time 0.1068108081817627


##### Results visualization

Finally, we can inspect the results saved by `HDF5OutputExporter`, which can be found in `data/ppi/gnn_classif/` folder in the form of an HDF5 file, `output_exporter.hdf5`. Note that the folder contains the saved pre-trained model as well. 

`output_exporter.hdf5` contains [HDF5 Groups](https://docs.h5py.org/en/stable/high/group.html), one for each phase that was run (training and testing if both are run, only one of them otherwise). The training phase includes the validation results as well. This HDF5 file can be read as a Pandas DataFrame:

In [30]:
output_train = pd.read_hdf(os.path.join(output_path, "gnn_classif", "output_exporter.hdf5"), key="training")
output_test = pd.read_hdf(os.path.join(output_path, "gnn_classif", "output_exporter.hdf5"), key="testing")
output_train.head()

|   | phase | epoch | entry | output | target | loss |
|---|---|---|---|---|---|---|
| 0 | training | 0.0 | residue-ppi:M-P:BA-116429 | [0.7257537841796875, 0.2742462158203125] | 1.0 | 0.823936 |
| 1 | training | 0.0 | residue-ppi:M-P:BA-116810 | [0.49615606665611267, 0.5038439631462097] | 0.0 | 0.823936 |
| 2 | training | 0.0 | residue-ppi:M-P:BA-135488 | [0.6636247038841248, 0.33637526631355286] | 1.0 | 0.823936 |
| 3 | training | 0.0 | residue-ppi:M-P:BA-153607 | [0.8759755492210388, 0.12402446568012238] | 0.0 | 0.823936 |
| 4 | training | 0.0 | residue-ppi:M-P:BA-119169 | [0.6754884719848633, 0.3245115578174591] | 0.0 | 0.823936 |


The dataframes contain `phase`, `epoch`, `entry`, `output`, `target`, and `loss` columns, and can be easily used to visualize the results.

Let's plot for example the loss across the epochs for the training and the validation sets:

In [31]:
fig = px.line(
    output_train,
    x='epoch',
    y='loss',
    color='phase',
    markers=True)

fig.add_vline(x=trainer.epoch_saved_model, line_width=3, line_dash="dash", line_color="green")

fig.update_layout(
    xaxis_title='Epoch #',
    yaxis_title='Loss',
    title='Loss vs epochs - GNN training',
    width=700, height=400,
)

Now let's print a few metrics of interest for classification tasks: the Area Under the ROC Curve (AUC) and, for a threshold of 0.5, the precision, recall, accuracy and F1 score.

In [32]:
threshold = 0.5
df = pd.concat([output_train, output_test])
# keep only the rows of the epoch at which the best model was saved
df_plot = df[df.epoch == trainer.epoch_saved_model]

for phase in ['training', 'validation', 'testing']:
    df_plot_phase = df_plot[df_plot.phase == phase]
    y_true = df_plot_phase.target
    # second column of the output = predicted probability of class 1
    y_score = np.array(df_plot_phase.output.values.tolist())[:, 1]

    print(f'\nMetrics for {phase}:')
    fpr_roc, tpr_roc, thr_roc = roc_curve(y_true, y_score)
    auc_score = auc(fpr_roc, tpr_roc)
    print(f'AUC: {round(auc_score, 1)}')
    print(f'Considering a threshold of {threshold}')
    y_pred = (y_score > threshold)*1
    print(f'- Precision: {round(precision_score(y_true, y_pred), 1)}')
    print(f'- Recall: {round(recall_score(y_true, y_pred), 1)}')
    print(f'- Accuracy: {round(accuracy_score(y_true, y_pred), 1)}')
    print(f'- F1: {round(f1_score(y_true, y_pred), 1)}')


Metrics for training:
AUC: 0.6
Considering a threshold of 0.5
- Precision: 0.6
- Recall: 0.6
- Accuracy: 0.6
- F1: 0.6

Metrics for validation:
AUC: 0.8
Considering a threshold of 0.5
- Precision: 1.0
- Recall: 0.4
- Accuracy: 0.7
- F1: 0.6

Metrics for testing:
AUC: 0.4
Considering a threshold of 0.5
- Precision: 0.0
- Recall: 0.0
- Accuracy: 0.5
- F1: 0.0
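
For readers who want to see how these numbers arise, the thresholded metrics and the AUC can be computed by hand. The plain-Python sketch below is an illustration only and not a replacement for scikit-learn: the function names are hypothetical, the scores are made up, and the AUC uses the pairwise-ranking formulation, which is equivalent to the area under the ROC curve.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, accuracy and F1 for scores thresholded at `threshold`."""
    y_pred = [1 if s > threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # undefined ratios (empty denominators) are reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores for illustration; these are not tutorial results.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(binary_metrics(y_true, y_score))
print(pairwise_auc(y_true, y_score))
```

When no sample is predicted as positive, precision is undefined; the sketch returns 0.0 in that case, which explains how precision, recall and F1 can all be 0.0 while accuracy stays at 0.5 on a balanced set.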


These metrics are of course not satisfactory in terms of performance, but we used only 100 data points, which is not enough to train a neural network. Now let's repeat the exercise using grids instead of graphs and CNNs instead of GNNs.

### CNN

#### GridDataset

For training CNNs the user can create `GridDataset` instances.

A few notes about `GridDataset` parameters:
- By default, all features contained in the HDF5 files are used, but the user can specify `features` in `GridDataset` if not all of them are needed. Since grid features are derived from node and edge features mapped from graphs to grids, the easiest way to see which features are available is to look at the HDF5 file, as explained in detail in `data_generation_ppi.ipynb`, section "Other tools".
- As for graphs, if we want to perform regression, `task` can be assigned to `regress` and the `target` to `BA`.

In [14]:
target = "binary"
task = "classif"

print('Loading training data...')
dataset_train = GridDataset(
    hdf5_path = input_data_path,
    subset = list(df_train.entry), # selects only data points with ids in df_train.entry
    target = target,
    task = task
)
print('\nLoading validation data...')
dataset_val = GridDataset(
    hdf5_path = input_data_path,
    subset = list(df_valid.entry), # selects only data points with ids in df_valid.entry
    target = target,
    task = task
)
print('\nLoading test data...')
dataset_test = GridDataset(
    hdf5_path = input_data_path,
    subset = list(df_test.entry), # selects only data points with ids in df_test.entry
    target = target,
    task = task
)

INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]


Loading training data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 449.68it/s, entry_name=proc-28836.hdf5]

INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]




Loading validation data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 222.46it/s, entry_name=proc-28836.hdf5]

INFO:deeprankcore.dataset:
Checking dataset Integrity...
INFO:deeprankcore.dataset:Target classes set up to: [0, 1]




Loading test data...
   ['data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 10/10 [00:00<00:00, 894.69it/s, entry_name=proc-28836.hdf5]


#### Trainer

As for graphs, the class `Trainer` is used for training, validation and testing of the PyTorch-based CNN. 

- Also in this case, `neuralnet` can be any neural network class that inherits from `torch.nn.Module`, and it shouldn't be specific to regression or classification in terms of output shape. Here we use `CnnClassification` (in `deeprankcore.neuralnets.cnn.model3d`), whose architecture is shown below:
  ```python
  import torch
  import torch.nn.functional as F
  from torch.autograd import Variable


  class CnnClassification(torch.nn.Module):

      def __init__(self, num_features, box_shape):
          super().__init__()

          self.convlayer_000 = torch.nn.Conv3d(num_features, 4, kernel_size=2)
          self.convlayer_001 = torch.nn.MaxPool3d((2, 2, 2))
          self.convlayer_002 = torch.nn.Conv3d(4, 5, kernel_size=2)
          self.convlayer_003 = torch.nn.MaxPool3d((2, 2, 2))

          # flattened size of the convolutional output, needed by the first linear layer
          size = self._get_conv_output(num_features, box_shape)

          self.fclayer_000 = torch.nn.Linear(size, 84)
          self.fclayer_001 = torch.nn.Linear(84, 2)

      def _get_conv_output(self, num_features, shape):
          # pass a random input through the convolutional part to infer its output size
          inp = Variable(torch.rand(1, num_features, *shape))
          out = self._forward_features(inp)
          return out.data.view(1, -1).size(1)

      def _forward_features(self, x):
          x = F.relu(self.convlayer_000(x))
          x = self.convlayer_001(x)
          x = F.relu(self.convlayer_002(x))
          x = self.convlayer_003(x)
          return x

      def forward(self, data):
          x = self._forward_features(data.x)
          x = x.view(x.size(0), -1)
          x = F.relu(self.fclayer_000(x))
          x = self.fclayer_001(x)
          return x
  ```
- The rest of the `Trainer` parameters can be used as explained already for graphs.
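
The size computed by `_get_conv_output` can also be worked out by hand, which helps when reasoning about how the grid shape affects the number of parameters. The sketch below applies the standard output-size arithmetic for unpadded convolutions with kernel size 2 and for 2x2x2 max pooling. The function name `conv_output_size` is hypothetical, and the box shapes in the example are made up rather than taken from the tutorial data.

```python
def conv_output_size(box_shape, out_channels=5):
    """Flattened feature size after Conv3d(k=2) -> MaxPool3d(2) -> Conv3d(k=2) -> MaxPool3d(2)."""
    size = out_channels
    for dim in box_shape:
        dim = dim - 1             # Conv3d, kernel_size=2, stride 1, no padding
        dim = (dim - 2) // 2 + 1  # MaxPool3d, kernel 2, stride 2
        dim = dim - 1             # second Conv3d
        dim = (dim - 2) // 2 + 1  # second MaxPool3d
        size *= dim
    return size


# Hypothetical box shapes, for illustration only.
print(conv_output_size((20, 20, 20)))  # 320
print(conv_output_size((10, 10, 10)))  # 5
```

The first linear layer then has `size * 84 + 84` parameters, so most of the CNN's parameters usually sit in that layer.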

##### Training

In [20]:
optimizer = torch.optim.SGD
lr = 1e-3
weight_decay = 0.001
epochs = 20
batch_size = 8
earlystop_patience = 5
earlystop_maxgap = 0.1
min_epoch = 10

trainer = Trainer(
    neuralnet = CnnClassification,
    dataset_train = dataset_train,
    dataset_val = dataset_val,
    dataset_test = dataset_test,
    output_exporters = [HDF5OutputExporter(os.path.join(output_path, "cnn_classif"))]
)

trainer.configure_optimizers(optimizer, lr, weight_decay)

trainer.train(
    nepoch = epochs,
    batch_size = batch_size,
    earlystop_patience = earlystop_patience,
    earlystop_maxgap = earlystop_maxgap,
    min_epoch = min_epoch,
    validate = True,
    filename = os.path.join(output_path, "cnn_classif", "model.pth.tar"))

epoch = trainer.epoch_saved_model
print(f"Model saved at epoch {epoch}")
pytorch_total_params = sum(p.numel() for p in trainer.model.parameters())
print(f'Total # of parameters: {pytorch_total_params}')
pytorch_trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
print(f'Total # of trainable parameters: {pytorch_trainable_params}')

INFO:deeprankcore.trainer:Device set to cpu.
INFO:deeprankcore.trainer:No loss function provided, the default loss function for classif tasks is used: <class 'torch.nn.modules.loss.CrossEntropyLoss'>
INFO:deeprankcore.trainer:Training set loaded

INFO:deeprankcore.trainer:Validation set loaded

INFO:deeprankcore.trainer:Epoch 0:
INFO:deeprankcore.trainer:training loss 0.697644485367669 | time 4.762144088745117
INFO:deeprankcore.trainer:validation loss 0.6870552036497328 | time 1.185007095336914
INFO:deeprankcore.trainer:Epoch 1:
INFO:deeprankcore.trainer:training loss 0.7181484434339735 | time 5.222409009933472
INFO:deeprankcore.trainer:validation loss 0.696343461672465 | time 1.183293104171753
INFO:deeprankcore.trainer:Best model saved at epoch # 1.
INFO:deeprankcore.trainer:Epoch 2:
INFO:deeprankcore.trainer:training loss 0.6928930679957072 | time 5.244858980178833
INFO:deeprankcore.trainer:validation loss 0.7079185048739115 | time 1.1831212043762207
INFO:deeprankcore.trainer:Validat

Model saved at epoch 3
Total # of parameters: 122599
Total # of trainable parameters: 122599
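
The `earlystop_patience`, `earlystop_maxgap`, and `min_epoch` arguments passed to `trainer.train` together decide when training stops. The sketch below shows how such rules commonly interact. It is a simplified, hypothetical re-implementation for illustration, not DeepRank-Core's own code: the function name `early_stop_epoch` and the exact comparison rules are assumptions.

```python
def early_stop_epoch(train_losses, val_losses, patience, maxgap, min_epoch):
    """Return the epoch at which training would stop, or None if it never does.

    Assumed rules (hypothetical, not taken from DeepRank-Core):
    - never stop before `min_epoch`;
    - stop when the validation loss has not improved for `patience` epochs;
    - stop when the validation loss exceeds the training loss by more than `maxgap`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, (train, val) in enumerate(zip(train_losses, val_losses)):
        if val < best:
            best = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epoch < min_epoch:
            continue
        if epochs_without_improvement >= patience or (val - train) > maxgap:
            return epoch
    return None

# Made-up loss curves: training keeps improving, validation starts to overfit.
train = [0.70, 0.68, 0.66, 0.64, 0.62, 0.60, 0.58, 0.56]
val = [0.69, 0.68, 0.69, 0.70, 0.70, 0.71, 0.72, 0.73]

stop_by_patience = early_stop_epoch(train, val, patience=3, maxgap=0.1, min_epoch=2)
stop_by_gap = early_stop_epoch(train, val, patience=10, maxgap=0.1, min_epoch=2)
print(stop_by_patience, stop_by_gap)  # 4 5
```

With a short patience the run stops as soon as the validation loss stalls. With a long patience it is the widening gap between validation and training loss that ends it.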


##### Testing

We can now test the trained model on `dataset_test`:

In [21]:
trainer.test()

INFO:deeprankcore.trainer:Loading independent testing dataset...
INFO:deeprankcore.trainer:Testing set loaded

INFO:deeprankcore.trainer:testing loss 0.6965798735618591 | time 0.6703002452850342


##### Results visualization

As for the GNN, we can now inspect the results saved by `HDF5OutputExporter`. They are stored in the `data/ppi/cnn_classif/` folder as an HDF5 file, `output_exporter.hdf5`, together with the saved trained model.

In [22]:
output_train = pd.read_hdf(os.path.join(output_path, "cnn_classif", "output_exporter.hdf5"), key="training")
output_test = pd.read_hdf(os.path.join(output_path, "cnn_classif", "output_exporter.hdf5"), key="testing")
output_train.head()

| | phase | epoch | entry | output | target | loss |
|---|---|---|---|---|---|---|
| 0 | training | 0.0 | residue-ppi:M-P:BA-107135 | [0.45900216698646545, 0.5409978032112122] | 0.0 | 0.697644 |
| 1 | training | 0.0 | residue-ppi:M-P:BA-128016 | [0.4739755690097809, 0.5260244607925415] | 0.0 | 0.697644 |
| 2 | training | 0.0 | residue-ppi:M-P:BA-120135 | [0.4273616671562195, 0.5726382732391357] | 1.0 | 0.697644 |
| 3 | training | 0.0 | residue-ppi:M-P:BA-119409 | [0.46763601899147034, 0.5323639512062073] | 0.0 | 0.697644 |
| 4 | training | 0.0 | residue-ppi:M-P:BA-153952 | [0.48576146364212036, 0.5142385363578796] | 0.0 | 0.697644 |
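
In the rows above, `output` holds the two class probabilities predicted for each entry, while `loss` repeats the same value, 0.697644, which matches the epoch 0 training loss in the log. As a rough sanity check, and assuming the exported outputs are softmax probabilities, the per-entry cross-entropy is the negative log of the probability assigned to the true class. A model that always predicts 0.5 for both classes scores ln 2 ≈ 0.693, so a loss close to that value indicates near-chance predictions. A minimal sketch using only the five rows shown:

```python
import math

# (class probabilities, target) copied from the five rows shown above
rows = [
    ([0.45900216698646545, 0.5409978032112122], 0),
    ([0.4739755690097809, 0.5260244607925415], 0),
    ([0.4273616671562195, 0.5726382732391357], 1),
    ([0.46763601899147034, 0.5323639512062073], 0),
    ([0.48576146364212036, 0.5142385363578796], 0),
]

def cross_entropy(probs, target):
    """Negative log-probability of the true class (assumes `probs` sums to 1)."""
    return -math.log(probs[target])

losses = [cross_entropy(probs, target) for probs, target in rows]
chance_level = math.log(2)  # loss of a model that always outputs [0.5, 0.5]

for (probs, target), loss in zip(rows, losses):
    print(f"target={target} p(true class)={probs[target]:.3f} loss={loss:.3f}")
print(f"chance-level loss: {chance_level:.3f}")
```

Only the entry whose true class receives a probability above 0.5 ends up below the chance-level loss. Note that the way the library aggregates the loss per epoch may differ from this per-entry calculation.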


As before, let's plot the loss across epochs for the training and validation sets:

In [23]:
fig = px.line(
    output_train,
    x='epoch',
    y='loss',
    color='phase',
    markers=True)

fig.add_vline(x=trainer.epoch_saved_model, line_width=3, line_dash="dash", line_color="green")

fig.update_layout(
    xaxis_title='Epoch #',
    yaxis_title='Loss',
    title='Loss vs epochs - CNN training',
    width=700, height=400,
)

We can also compute some metrics of interest for classification tasks:

In [24]:
threshold = 0.5
df = pd.concat([output_train, output_test])
df_plot = df[df.epoch == trainer.epoch_saved_model]  # keep only the epoch of the saved model

for phase in ['training', 'validation', 'testing']:  # `phase` avoids shadowing the built-in `set`
    df_plot_phase = df_plot[df_plot.phase == phase]
    y_true = df_plot_phase.target
    y_score = np.array(df_plot_phase.output.values.tolist())[:, 1]

    print(f'\nMetrics for {phase}:')
    fpr_roc, tpr_roc, thr_roc = roc_curve(y_true, y_score)
    auc_score = auc(fpr_roc, tpr_roc)
    print(f'AUC: {round(auc_score, 1)}')
    print(f'Considering a threshold of {threshold}')
    y_pred = (y_score > threshold)*1
    print(f'- Precision: {round(precision_score(y_true, y_pred), 1)}')
    print(f'- Recall: {round(recall_score(y_true, y_pred), 1)}')
    print(f'- Accuracy: {round(accuracy_score(y_true, y_pred), 1)}')
    print(f'- F1: {round(f1_score(y_true, y_pred), 1)}')


Metrics for training:
AUC: 0.5
Considering a threshold of 0.5
- Precision: 0.6
- Recall: 0.2
- Accuracy: 0.5
- F1: 0.4

Metrics for validation:
AUC: 0.6
Considering a threshold of 0.5
- Precision: 0.5
- Recall: 1.0
- Accuracy: 0.5
- F1: 0.7

Metrics for testing:
AUC: 0.4
Considering a threshold of 0.5
- Precision: 0.5
- Recall: 1.0
- Accuracy: 0.5
- F1: 0.7
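
A recall of 1.0 combined with precision and accuracy of 0.5 is what a classifier produces when it labels every entry as positive on a balanced set. The validation and testing figures above are therefore consistent with a model that has not yet learned to separate the two classes, although rounding to one decimal leaves room for other explanations. The sketch below recomputes the four threshold-based metrics by hand. The labels and scores are made up for illustration and are not taken from the tutorial's output files, and the helper name `threshold_metrics` is hypothetical.

```python
def threshold_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, accuracy and F1 for binary labels at a score threshold."""
    y_pred = [1 if score > threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Balanced, made-up labels; every score is just above the 0.5 threshold,
# so every entry is predicted as positive.
y_true = [0, 1, 0, 1, 0, 1, 0, 1]
y_score = [0.51, 0.54, 0.53, 0.57, 0.52, 0.55, 0.56, 0.58]

metrics = threshold_metrics(y_true, y_score)
for name, value in metrics.items():
    print(f"- {name}: {round(value, 1)}")
```

Because these metrics depend on the chosen threshold, the threshold-free AUC is a useful complement when the scores cluster around 0.5.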


The results appear to be less favorable for the GNN than for the CNN, but the dataset used here is too small to provide conclusive and reliable insights. Depending on your application, regression or classification, with GNNs and/or CNNs, may each be a valuable option. Feel free to choose the approach that best fits your problem!