In [None]:
#---#| default_exp model.model_interface

# Model Interface

## Description

This notebook defines the basic interface used to interact with the deep learning models. Its 'public' functions are intended to stay unchanged throughout the project, while the inner workings of the interface can be changed (i.e. the programming concept of polymorphism). For example, models are always loaded with the `load()` function, and the details of loading can be changed by inheriting from the interface and overriding the functions that `load()` calls. More details are given below.
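
The idea of a stable public function that delegates to overridable helpers can be sketched in plain Python. This is only an illustration of the pattern: `BaseInterface`, `UpperCaseInterface`, `_read_source()` and `_parse()` are hypothetical names and are not part of `peptdeep`.

```python
class BaseInterface:
    """Hypothetical stand-in for an interface with a stable public API."""

    def load(self, source: str) -> dict:
        # Public entry point: callers always use this, in every subclass.
        raw = self._read_source(source)
        return self._parse(raw)

    def _read_source(self, source: str) -> str:
        # Default hook: treat the argument itself as the payload.
        return source

    def _parse(self, raw: str) -> dict:
        # Default hook: wrap the payload without interpreting it.
        return {"weights": raw}


class UpperCaseInterface(BaseInterface):
    """Changes one loading detail by overriding a single hook."""

    def _parse(self, raw: str) -> dict:
        return {"weights": raw.upper()}


# The calling code is identical for both classes.
base_result = BaseInterface().load("abc")
custom_result = UpperCaseInterface().load("abc")
print(base_result, custom_result)
```

Code that calls `load()` does not need to change when a subclass replaces one of the hooks, which is the property the interface in this notebook relies on.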


## Imports

In [None]:
import pandas as pd  # used explicitly below; do not rely on the star import
from peptdeep.model.model_interface import *

In [None]:
m = ModelInterface()
m.fixed_sequence_len = 10
df = pd.DataFrame({
    'sequence':['ABCD']*5+['CDEFGHIJ']*5,
    'mods': 'Oxidation@M;Oxidation@M',
    'mod_sites': '2;2'
})
df['nAA'] = df.sequence.str.len()
m._pad_zeros_if_fixed_len(df)
assert m._get_features_from_batch_df(df).size()==(10,12)
assert m._get_mod_features(df).size() == (10,12,109)
assert (m._get_mod_features(df)[:,2,3]==2).all() # two Oxidation on one site
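
The shape `(10, 12)` asserted above is consistent with each sequence being padded to `fixed_sequence_len` plus one extra position per terminus. The sketch below reproduces only that shape in plain Python. It assumes ASCII codes as indices and zero padding at both ends; the helper `encode_ascii_padded()` is hypothetical, and the exact index scheme and padding layout used by `ModelInterface` are not shown in this notebook.

```python
from typing import List


def encode_ascii_padded(sequence: str, fixed_len: int) -> List[int]:
    """Encode a sequence as ASCII codes with zero padding (illustrative only)."""
    if len(sequence) > fixed_len:
        raise ValueError("sequence is longer than fixed_len")
    codes = [ord(aa) for aa in sequence]
    # Assumption: one zero per terminus, plus zeros up to the fixed length.
    return [0] + codes + [0] * (fixed_len - len(sequence)) + [0]


sequences = ["ABCD"] * 5 + ["CDEFGHIJ"] * 5
rows = [encode_ascii_padded(seq, fixed_len=10) for seq in sequences]
shape = (len(rows), len(rows[0]))
print(shape)
```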

## Interface Class
The `ModelInterface` below is intended to provide a standardized way to handle deep learning models. It does not contain the PyTorch-based models themselves, but provides methods to `load()`, `save()`, `build()`, `train()` and `predict()` models. These methods are intended to stay unchanged.
To adapt the interface to a new use case, we inherit from the interface in a new class and re-implement the relevant method `_get_features_from_batch_df()`. Sometimes we also need to re-implement `_get_targets_from_batch_df()` and `_prepare_predict_data_df()`.

The interface then uses these re-implemented methods in its training and prediction procedures. The implementation below automatically empties the GPU cache at the end of `train()` and `predict()` to save GPU memory.

For example, to design a new model that predicts a peptide property such as RT, we need to:

- Design the PyTorch model (`class RTPrediction(torch.nn.Module):...`).
- Design the sub-class that inherits from `ModelInterface` (`class RTPredictionModel(ModelInterface):...`).
- In the `__init__` method, define `self.target_column_to_train = "detect_value"` and `self.target_column_to_predict = "predict_value"`. Also define `self._min_pred_value = some_value`.
- Re-implement `def _get_features_from_batch_df(self, batch_df): return self._get_aa_indice_features(batch_df)` (the default) to predict a property from the unmodified sequence. For modified sequences, use `def _get_features_from_batch_df(self, batch_df): return self._get_aa_mod_features(batch_df)`.

- Finally, run the model in a Python script or a notebook:
```
model = RTPredictionModel()
model.build(model_class=RTPrediction)
df = ... # the training data
model.train(df)
pred_df = model.predict(df)
```

Check out `peptdeep.model.generic_property_prediction` for details. `peptdeep.model.rt.AlphaRTModel` and `peptdeep.model.ccs.AlphaCCSModel` follow a similar pattern. The MS2 prediction model is more complicated because the output for a peptide is not a scalar value; see `peptdeep.model.ms2.pDeepModel`.

## Testing the APIs

This section builds a model for peptide classification (e.g. detectability).

First, design the `torch.nn.Module` (Transformer model)

In [None]:
import torch  # used explicitly below; do not rely on the star import
import peptdeep.model.building_block as building_block

In [None]:
class Test_Bert(torch.nn.Module):
    def __init__(self,
        nlayers = 3,
        input_dim = 128, # number of ASCII codes (0-127)
        hidden_dim = 256,
        dropout = 0.1
    ):
        """
        Model based on a Transformer architecture built on
        Hugging Face's BertEncoder class.
        """
        super().__init__()

        self.dropout = torch.nn.Dropout(dropout)

        self.input_nn = torch.nn.Sequential(
            torch.nn.Embedding(input_dim, hidden_dim),
            building_block.PositionalEncoding(hidden_dim)
        )
        
        self.hidden_nn = building_block.Hidden_HFace_Transformer(
            hidden_dim, nlayers=nlayers, dropout=dropout
        )

        self.output_nn = torch.nn.Sequential(
            building_block.SeqAttentionSum(hidden_dim),
            torch.nn.PReLU(),
            self.dropout,
            torch.nn.Linear(hidden_dim, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        x = self.dropout(self.input_nn(x))

        x = self.hidden_nn(x)
        x = self.dropout(x[0])

        return self.output_nn(x).squeeze(1)

Second, implement the `ModelInterface` APIs

In [None]:
class Test_Model(ModelInterface):
    def __init__(self, 
        dropout=0.1,
        model_class:torch.nn.Module=Test_Bert, #model class defined above
        device:str='gpu',
        **kwargs,
    ):
        super().__init__(device=device)
        self.build(
            model_class,
            dropout=dropout,
            **kwargs
        )
        self.loss_func = torch.nn.BCELoss() # loss for binary classification
        self.target_column_to_predict = 'predicted_prob'
        self.target_column_to_train = 'detected_prob'

Finally, test the model

In [None]:
df = pd.DataFrame({
    'sequence':['ABCD']*5+['CDEFGHIJ']*5,
})
df['detected_prob'] = 1.0

model = Test_Model()
model.train(df, epoch=2)
model.predict(df)
assert 'predicted_prob' in df.columns
df

Unnamed: 0,sequence,detected_prob,nAA,predicted_prob
0,ABCD,1.0,4,0.876801
1,ABCD,1.0,4,0.876801
2,ABCD,1.0,4,0.876801
3,ABCD,1.0,4,0.876801
4,ABCD,1.0,4,0.876801
5,CDEFGHIJ,1.0,8,0.860851
6,CDEFGHIJ,1.0,8,0.860851
7,CDEFGHIJ,1.0,8,0.860851
8,CDEFGHIJ,1.0,8,0.860851
9,CDEFGHIJ,1.0,8,0.860851
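
To make the role of `torch.nn.BCELoss()` in `Test_Model` concrete, the sketch below computes the mean binary cross-entropy by hand for the predictions printed above, using targets of 1.0 as in `detected_prob`. It is a plain-Python illustration of the formula, not the code path used by `train()`, and the helper name `binary_cross_entropy()` is made up for this example.

```python
import math


def binary_cross_entropy(preds, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all pairs."""
    total = 0.0
    for p, y in zip(preds, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(preds)


# Predicted probabilities copied from the table above.
preds = [0.876801] * 5 + [0.860851] * 5
targets = [1.0] * 10  # every peptide is labelled as detected

loss = binary_cross_entropy(preds, targets)
print(round(loss, 4))
```

Because every target is 1.0, only the `log(p)` term contributes, so the loss shrinks as the predicted probabilities move towards 1. Unlike this sketch, `torch.nn.BCELoss` clamps the log terms, so it does not fail when a prediction is exactly 0 or 1.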


In [None]:
model.training_groupby_nAA = False
model.train(df, epoch=2)
model.predict(df)

Unnamed: 0,sequence,detected_prob,nAA,predicted_prob
0,ABCD,1.0,4,0.946883
1,ABCD,1.0,4,0.946883
2,ABCD,1.0,4,0.946883
3,ABCD,1.0,4,0.946883
4,ABCD,1.0,4,0.946883
5,CDEFGHIJ,1.0,8,0.934433
6,CDEFGHIJ,1.0,8,0.934433
7,CDEFGHIJ,1.0,8,0.934433
8,CDEFGHIJ,1.0,8,0.934433
9,CDEFGHIJ,1.0,8,0.934433
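
The flag name `training_groupby_nAA` suggests that, when enabled, training batches are formed from peptides of equal length, and that disabling it allows mixed-length batches. The sketch below illustrates that difference under this assumption; both helper functions are hypothetical and do not reproduce the batching code inside `ModelInterface`.

```python
from collections import defaultdict


def batches_grouped_by_length(sequences, batch_size):
    """Form batches in which all sequences share the same length."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    batches = []
    for length in sorted(groups):
        same_len = groups[length]
        for start in range(0, len(same_len), batch_size):
            batches.append(same_len[start:start + batch_size])
    return batches


def batches_mixed(sequences, batch_size):
    """Form batches in input order, regardless of sequence length."""
    return [
        sequences[start:start + batch_size]
        for start in range(0, len(sequences), batch_size)
    ]


seqs = ["ABCD", "CDEFGHIJ"] * 5
grouped = batches_grouped_by_length(seqs, batch_size=4)
mixed = batches_mixed(seqs, batch_size=4)
print(len(grouped), len(mixed))
```

Equal-length batches can be stacked into a tensor directly, whereas mixed-length batches need padding to a common length first.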


### Test `build_from_py_codes()`

In [None]:
from peptdeep.model.ms2 import pDeepModel
from peptdeep.pretrained_models import MODEL_ZIP_FILE_PATH

In [None]:
ms2_model = pDeepModel()
ms2_model.build_from_py_codes(
    MODEL_ZIP_FILE_PATH, 'generic/ms2.pth.model.py', 
    include_model_params_yaml=True
)

ms2_model.model

Model(
  (dropout): Dropout(p=0.1, inplace=False)
  (input_nn): Input_26AA_Mod_PositionalEncoding(
    (mod_nn): Mod_Embedding_FixFirstK(
      (nn): Linear(in_features=103, out_features=2, bias=False)
    )
    (aa_emb): Embedding(27, 240, padding_idx=0)
    (pos_encoder): PositionalEncoding()
  )
  (meta_nn): Meta_Embedding(
    (nn): Linear(in_features=9, out_features=7, bias=True)
  )
  (hidden_nn): Hidden_HFace_Transformer(
    (bert): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=256, out_features=256, bias=True)
     

In [None]:
#| hide

from peptdeep.pretrained_models import ModelManager

In [None]:
#| hide
model_mgr = ModelManager()
model_mgr.ms2_model.set_device('mps')
model_mgr.ms2_model.set_device('gpu')
model_mgr.ms2_model.set_device('cuda')
model_mgr.ms2_model.set_device('cpu')