# Tutorial: building new models from scratch

**even without experience in deep learning**

In [1]:
%reload_ext autoreload
%autoreload 2

In order to predict or classify novel properties of peptides, the user simply needs to provide peptides with corresponding properties (e.g. 'binding_affinity'). 

We provide several generic `ModelInterface` and `Model` classes in the `peptdeep.model.generic_property_prediction` module so that users can easily build models for regression and classification problems. Examples are shown below:

### Imports

In [2]:
from peptdeep.model.generic_property_prediction import (
    ModelInterface_for_Generic_AASeq_BinaryClassification,
    ModelInterface_for_Generic_AASeq_Regression,
    ModelInterface_for_Generic_ModAASeq_BinaryClassification,
    ModelInterface_for_Generic_ModAASeq_Regression,
)
from peptdeep.model.generic_property_prediction import (
    Model_for_Generic_AASeq_BinaryClassification_LSTM,
    Model_for_Generic_AASeq_BinaryClassification_Transformer,
    Model_for_Generic_AASeq_Regression_LSTM,
    Model_for_Generic_AASeq_Regression_Transformer,
    Model_for_Generic_ModAASeq_BinaryClassification_LSTM,
    Model_for_Generic_ModAASeq_BinaryClassification_Transformer,
    Model_for_Generic_ModAASeq_Regression_LSTM,
    Model_for_Generic_ModAASeq_Regression_Transformer,
)

#### Define example Table/DataFrame

In [3]:
from peptdeep.model.rt import IRT_PEPTIDE_DF

In [4]:
def create_example_input_dataframe_normalized_irt():
    irt_df = IRT_PEPTIDE_DF.copy()
    # min-max normalization: rescale iRT to the range 0 to 1
    irt_df['normalized_irt'] = (
        (irt_df.irt - irt_df.irt.min())
        / (irt_df.irt.max() - irt_df.irt.min())
    )
    return irt_df

create_example_input_dataframe_normalized_irt()

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775
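
The `normalized_irt` column is a min-max rescaling of `irt`. The values in the table are consistent with a minimum of -24.92 and a maximum of 100.0. That maximum is not among the ten rows shown, so `IRT_MAX = 100.0` below is an assumption inferred from the numbers, not something stated in this tutorial. A minimal, dependency-free sketch of the same arithmetic:

```python
# Min-max normalization of iRT values, mirroring the formula in the cell above.
# IRT_MAX is an assumption inferred from the table; it is not one of the rows shown.
IRT_MIN = -24.92
IRT_MAX = 100.0


def min_max_normalize(value, lo=IRT_MIN, hi=IRT_MAX):
    """Rescale `value` linearly so that `lo` maps to 0 and `hi` maps to 1."""
    return (value - lo) / (hi - lo)


# (irt, normalized_irt) pairs copied from the table above
table_rows = [
    (-24.92, 0.0),
    (0.0, 0.199488),
    (12.39, 0.298671),
    (19.79, 0.357909),
    (28.71, 0.429315),
    (33.38, 0.466699),
    (42.26, 0.537784),
    (54.62, 0.636728),
    (70.52, 0.764009),
    (87.23, 0.897775),
]

for irt, tabulated in table_rows:
    computed = min_max_normalize(irt)
    print(f"{irt:8.2f} -> {computed:.6f} (table: {tabulated})")
```

Because the scaling depends on the minimum and maximum of the training data, the same two constants would have to be reused to map predictions back to the iRT scale.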


### Steps to build a model from scratch

In the following examples, we only need 7 steps to build a model.

1. Prepare a training dataframe with a `sequence` column (plus `mods` and `mod_sites` columns if the model should also take modifications into account) and a column holding the target values to train on.
2. Select a `ModelInterface` class that matches the prediction problem (classification or regression, for plain or modified sequences), and pass a `Model` class when initializing it.
3. Tell the `ModelInterface` object which column of the training dataframe stores the target values (`target_column_to_train`) and which column the predictions should be written to (`target_column_to_predict`).
4. `model.train()` for training.
5. `model.predict()` for prediction.

> Save and load models:
6. `model.save("/model_folder/model.pth")` to save the model.
7. Use the same `ModelInterface` and `Model` classes, and call `model.load("/model_folder/model.pth")` to load the model for transfer learning and prediction.
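
Step 1 is the easiest one to get wrong, so it can help to check the column names before training. The helper below is a hypothetical sketch (it is not part of `peptdeep`); it only encodes the column requirements listed in step 1 and works on any collection of column names.

```python
def missing_training_columns(columns, target_column, uses_modifications=False):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper, not part of peptdeep: it only mirrors step 1 above.
    """
    required = ["sequence", target_column]
    if uses_modifications:
        # ModAASeq models additionally read the modification columns
        required += ["mods", "mod_sites"]
    return [name for name in required if name not in columns]


example_columns = [
    "sequence", "pep_name", "irt", "mods", "mod_sites", "nAA", "normalized_irt",
]

# All required columns are present, so nothing is reported as missing
print(missing_training_columns(example_columns, "normalized_irt", uses_modifications=True))

# A table with only `sequence` and `irt` lacks the target and modification columns
print(missing_training_columns(["sequence", "irt"], "normalized_irt", uses_modifications=True))
```

With a pandas DataFrame, `list(df.columns)` can be passed as the first argument.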

#### Building a simple RT model based on `Model_for_Generic_AASeq_Regression_LSTM`

In [5]:
example_df = create_example_input_dataframe_normalized_irt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_LSTM
)
# column that holds the target values used for training
model.target_column_to_train = 'normalized_irt'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,predicted_normalized_irt
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,0.0
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,0.203671
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,0.312852
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,0.365846
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,0.43476
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,0.465173
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0.564576
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0.678894
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0.893195
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,1.061624


#### Building a simple sequence-only RT model based on `Model_for_Generic_AASeq_Regression_Transformer`

In [6]:
example_df = create_example_input_dataframe_normalized_irt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_Transformer
)
# column that holds the target values used for training
model.target_column_to_train = 'normalized_irt'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,predicted_normalized_irt
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,0.0
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,0.0
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,0.140912
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,0.142185
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,0.210857
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,0.2772
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0.088854
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0.399164
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0.596815
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0.701862
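
To compare the two regression runs above with a single number, the mean absolute error (MAE) between `normalized_irt` and `predicted_normalized_irt` can be computed. The sketch below copies the values from the two output tables. The predictions depend on random initialization and will differ between runs, and the error is measured on the training peptides themselves, so it describes the fit and not how well the model generalizes.

```python
def mean_absolute_error(targets, predictions):
    """Average of |target - prediction| over all rows."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


# `normalized_irt` column (identical in both tables)
normalized_irt = [
    0.0, 0.199488, 0.298671, 0.357909, 0.429315,
    0.466699, 0.537784, 0.636728, 0.764009, 0.897775,
]
# `predicted_normalized_irt` from the LSTM run shown above
lstm_predictions = [
    0.0, 0.203671, 0.312852, 0.365846, 0.43476,
    0.465173, 0.564576, 0.678894, 0.893195, 1.061624,
]
# `predicted_normalized_irt` from the Transformer run shown above
transformer_predictions = [
    0.0, 0.0, 0.140912, 0.142185, 0.210857,
    0.2772, 0.088854, 0.399164, 0.596815, 0.701862,
]

mae_lstm = mean_absolute_error(normalized_irt, lstm_predictions)
mae_transformer = mean_absolute_error(normalized_irt, transformer_predictions)
print(f"LSTM MAE:        {mae_lstm:.4f}")
print(f"Transformer MAE: {mae_transformer:.4f}")
```

For these particular runs the LSTM error is roughly 0.04 and the Transformer error roughly 0.20 on the normalized scale.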


## Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs

#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_LSTM`

In [7]:
example_df = create_example_input_dataframe_normalized_irt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_LSTM
)
# column that holds the target values used for training
model.target_column_to_train = 'normalized_irt'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,predicted_normalized_irt
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,0.0
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,0.368827
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,0.295824
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,0.337074
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,0.527409
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,0.506031
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0.629531
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0.708878
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0.79857
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0.856519


#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_Transformer`

In [8]:
example_df = create_example_input_dataframe_normalized_irt()
# add a phosphorylation to the serine at position 4 of GAGSSEPVTGLDAK
example_df.loc[1, 'mods'] = 'Phospho@S'
example_df.loc[1, 'mod_sites'] = '4'

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_Transformer
)
# column that holds the target values used for training
model.target_column_to_train = 'normalized_irt'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,predicted_normalized_irt
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,0.088521
1,GAGSSEPVTGLDAK,RT-pep b,0.0,Phospho@S,4.0,14,0.199488,0.57192
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,0.285101
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,0.367173
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,0.615492
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,0.589607
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0.539454
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0.587029
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0.880274
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0.811531
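
In the example above, `Phospho@S` with `mod_sites = '4'` points at the fourth residue of `GAGSSEPVTGLDAK`, which is a serine. A small consistency check can catch mismatched sites before training. The helper below is a hypothetical sketch, not a `peptdeep` function. It assumes that multiple entries are separated by `;`, that residue positions are 1-based, and that `0` and `-1` denote the peptide termini; these conventions are assumptions and should be checked against the library documentation.

```python
def mod_sites_match_sequence(sequence, mods, mod_sites):
    """Check that each residue-specific modification points at the expected amino acid.

    Hypothetical helper. Assumes ';'-separated entries, 1-based residue
    positions, and 0 / -1 for the N- and C-terminus.
    """
    if not mods:
        return True  # unmodified peptide, nothing to check
    mod_list = mods.split(";")
    site_list = [int(site) for site in mod_sites.split(";")]
    if len(mod_list) != len(site_list):
        return False
    for mod, site in zip(mod_list, site_list):
        if site in (0, -1):
            continue  # terminal modifications are not checked in this sketch
        if not 1 <= site <= len(sequence):
            return False
        target = mod.split("@")[-1]  # e.g. 'S' in 'Phospho@S'
        if len(target) == 1 and sequence[site - 1] != target:
            return False
    return True


print(mod_sites_match_sequence("GAGSSEPVTGLDAK", "Phospho@S", "4"))  # position 4 is S
print(mod_sites_match_sequence("GAGSSEPVTGLDAK", "Phospho@S", "3"))  # position 3 is G
```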


## Binary classification models for a given amino acid sequence

In [9]:
# a simple classification dataset
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-based and inclusive, so rows 0 to 5 get label 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df
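
Note that `.loc[:5]` is label-based and includes the end label, so six peptides (rows 0 to 5) receive the label 1. For the ten peptides displayed earlier, this positional labeling is the same as labeling every peptide whose `normalized_irt` is below 0.5. The dependency-free sketch below checks that equivalence on the values copied from the table; it is an illustration, not part of the tutorial code.

```python
# `normalized_irt` values of the ten peptides shown in the table above
normalized_irt = [
    0.0, 0.199488, 0.298671, 0.357909, 0.429315,
    0.466699, 0.537784, 0.636728, 0.764009, 0.897775,
]

# Labeling by row position, equivalent to `.loc[:5]` (end label included)
labels_by_position = [1 if index <= 5 else 0 for index in range(len(normalized_irt))]

# Labeling by value: 1 if the peptide elutes in the first half of the normalized scale
labels_by_threshold = [1 if value < 0.5 else 0 for value in normalized_irt]

print(labels_by_position)
print(labels_by_position == labels_by_threshold)
```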

#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_LSTM`

In [10]:
example_df = create_example_input_dataframe_classification_rt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM
)
# column that holds the target values used for training
model.target_column_to_train = 'is_in_first_half_of_column'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,is_in_first_half_of_column,predicted_will_be_in_first_half_of_column
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,1,0.991829
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,1,0.990733
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,1,0.991083
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,1,0.9916
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,1,0.992202
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,1,0.990124
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0,0.351366
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0,0.359982
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0,0.352756
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0,0.351209
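
The predicted column holds scores between 0 and 1 rather than hard class labels. Assuming these scores can be read as probabilities of belonging to class 1, a threshold turns them into labels. The sketch below uses the values from the table above with a hypothetical helper; the threshold of 0.5 is a common default, not something prescribed by this tutorial.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score equals the true label."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


# `is_in_first_half_of_column` and the LSTM scores copied from the table above
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
lstm_scores = [
    0.991829, 0.990733, 0.991083, 0.9916, 0.992202,
    0.990124, 0.351366, 0.359982, 0.352756, 0.351209,
]

print(accuracy_at_threshold(labels, lstm_scores))                 # default threshold 0.5
print(accuracy_at_threshold(labels, lstm_scores, threshold=0.3))  # a threshold that is too low
```

At 0.5 all ten training peptides are classified correctly. With a threshold of 0.3 the four negative peptides (scores around 0.35) would be assigned to class 1, which shows why the threshold should be chosen with the score distribution in mind.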


#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_Transformer`

In [11]:
example_df = create_example_input_dataframe_classification_rt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer
)
# column that holds the target values used for training
model.target_column_to_train = 'is_in_first_half_of_column'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,is_in_first_half_of_column,predicted_will_be_in_first_half_of_column
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,1,0.997586
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,1,0.997438
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,1,0.996627
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,1,0.997642
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,1,0.996989
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,1,0.996926
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0,0.004032
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0,0.004321
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0,0.004137
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0,0.003938
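
Both classifiers above separate the two classes, but the Transformer scores are much closer to 0 and 1. One common way to quantify this is the binary cross-entropy (log loss), sketched below on the scores copied from the two output tables. This is an illustrative metric computed by hand; whether it matches the loss used internally during training is not stated in this tutorial.

```python
import math


def binary_cross_entropy(labels, scores, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predicted scores."""
    total = 0.0
    for label, score in zip(labels, scores):
        score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(labels)


labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# scores from the LSTM run shown earlier
lstm_scores = [
    0.991829, 0.990733, 0.991083, 0.9916, 0.992202,
    0.990124, 0.351366, 0.359982, 0.352756, 0.351209,
]
# scores from the Transformer run shown above
transformer_scores = [
    0.997586, 0.997438, 0.996627, 0.997642, 0.996989,
    0.996926, 0.004032, 0.004321, 0.004137, 0.003938,
]

loss_lstm = binary_cross_entropy(labels, lstm_scores)
loss_transformer = binary_cross_entropy(labels, transformer_scores)
print(f"LSTM log loss:        {loss_lstm:.4f}")
print(f"Transformer log loss: {loss_transformer:.4f}")
```

As with the regression examples, these numbers are computed on the training peptides and vary from run to run.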


## Binary classification models for a given amino acid sequence and site-specific PTMs

In [12]:
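# same helper as in the previous section, repeated so this section runs on its own
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-based and inclusive, so rows 0 to 5 get label 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df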

#### A modified-sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_LSTM`

In [13]:
example_df = create_example_input_dataframe_classification_rt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM
)
# column that holds the target values used for training
model.target_column_to_train = 'is_in_first_half_of_column'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,is_in_first_half_of_column,predicted_will_be_in_first_half_of_column
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,1,0.99312
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,1,0.9906
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,1,0.992972
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,1,0.992984
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,1,0.992323
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,1,0.988538
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0,0.370841
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0,0.368691
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0,0.378124
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0,0.367393


#### A modified-sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_Transformer`

In [14]:
example_df = create_example_input_dataframe_classification_rt()

# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer
)
# column that holds the target values used for training
model.target_column_to_train = 'is_in_first_half_of_column'
# column that the predictions will be written to
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

Unnamed: 0,sequence,pep_name,irt,mods,mod_sites,nAA,normalized_irt,is_in_first_half_of_column,predicted_will_be_in_first_half_of_column
0,LGGNEQVTR,RT-pep a,-24.92,,,9,0.0,1,0.997545
1,GAGSSEPVTGLDAK,RT-pep b,0.0,,,14,0.199488,1,0.996575
2,VEATFGVDESNAK,RT-pep c,12.39,,,13,0.298671,1,0.995498
3,YILAGVENSK,RT-pep d,19.79,,,10,0.357909,1,0.997241
4,TPVISGGPYEYR,RT-pep e,28.71,,,12,0.429315,1,0.996784
5,TPVITGAPYEYR,RT-pep f,33.38,,,12,0.466699,1,0.995732
6,DGLDAASYYAPVR,RT-pep g,42.26,,,13,0.537784,0,0.004
7,ADVTPADFSEWSK,RT-pep h,54.62,,,13,0.636728,0,0.005084
8,GTFIIDPGGVIR,RT-pep i,70.52,,,12,0.764009,0,0.004195
9,GTFIIDPAAVIR,RT-pep k,87.23,,,12,0.897775,0,0.003547
