# Protein Pretraining and Property Prediction

In recent years, advances in sequencing technology have greatly increased the size of protein sequence databases. However, labeled protein sequences remain expensive to obtain because labeling requires biological experiments, and with so few labeled samples a model is prone to overfitting. Borrowing ideas from natural language processing (NLP), we can pre-train a model on large numbers of unlabeled sequences with self-supervised learning. The pre-trained model captures useful biological information from proteins, which can then be transferred to downstream labeled tasks so that training on those tasks converges faster and more stably. This tutorial follows the TAPE paper and provides model implementations of Transformer, LSTM, and ResNet.
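
To make the self-supervised idea concrete, here is a minimal sketch of masked-token corruption, one common self-supervised objective borrowed from NLP. It is an illustration only and is not taken from this repository: the function name `mask_sequence`, the `<mask>` token, and the 15% masking rate are all assumptions.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder token, not from the repository


def mask_sequence(sequence, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, targets) for masked-token prediction.

    targets[i] holds the original residue where position i was masked and
    None elsewhere, so a loss would only be computed on masked positions.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for residue in sequence:
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets.append(residue)
        else:
            corrupted.append(residue)
            targets.append(None)
    return corrupted, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
tokens, targets = mask_sequence(sequence, mask_prob=0.15, seed=0)
print(tokens)
print(targets)
```

No labels are needed here: the training signal comes from the sequence itself, which is why large unlabeled databases can be used.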


In [1]:
import os
import sys
os.chdir('../apps/pretrained_protein/tape')
sys.path.append('../../../')
sys.path.append('./')

## Loading Related Tools

In [2]:
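import paddle
import paddle.fluid as fluid  # used below for fluid.Program(); imported explicitly in case `utils` does not re-export it
from utils import *

# paddle.enable_static()  # uncomment when using paddle >= 2.0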

is_distributed = False
use_cuda = False
thread_num = 8  # number of threads when training on CPU

# Set up the execution-related parameters according to the training mode.
exe_params = default_exe_params(is_distributed=is_distributed, use_cuda=use_cuda, thread_num=thread_num)
exe = exe_params['exe']
trainer_num = exe_params['trainer_num']
trainer_id = exe_params['trainer_id']
dist_strategy = exe_params['dist_strategy'] 
places = exe_params['places']

## Model Configuration Settings

The network is set up according to `model_config`.
- Task-related configurations
    - "task": The type of training task. Candidate task types:
        - "pretrain": Self-supervised pretraining task, for dataset `TAPE`.
        - "classification": Classification task, for dataset `Remote Homology`.
        - "regression": Regression task, for datasets `Fluorescence` and `Stability`.
        - "seq_classification": Sequence classification task, for dataset `Secondary Structure`.
    - "class_num": The number of classes for tasks `classification` and `seq_classification`.
    - "label_name": The label name in the dataset.
- Network-related configurations
    - "model_type": The network type. Each network requires its own hyper-parameters. The following networks are supported:
        - "transformer"
            - "hidden_size"
            - "layer_num"
            - "head_num"
        - "lstm"
            - "hidden_size"
            - "layer_num"
        - "resnet"
            - "hidden_size"
            - "layer_num"
            - "filter_size"
- Other configurations (see the code for more details)
    - "dropout_rate"
    - "weight_decay"
    
The following is a demo `model_config` for the `Secondary Structure` task.

In [3]:
model_config = \
{
    "model_name": "secondary_structure",

    "task": "seq_classification",
    "class_num": 3,
    "label_name": "labels3",

    "model_type": "lstm",
    "hidden_size": 512,
    "layer_num": 3,

    "comment": "The following hyper-parameters are optional.",
    "dropout_rate": 0.1,
    "weight_decay": 0.01
}
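
Because each task and network type expects different keys, it can help to check a config before building the model. The helper below is a hypothetical sketch and not part of the repository: `check_model_config` and its constants are names introduced here, and the required keys are simply those listed in the configuration overview above.

```python
# Required network hyper-parameters per model type, as listed in the overview.
REQUIRED_NETWORK_KEYS = {
    "transformer": ["hidden_size", "layer_num", "head_num"],
    "lstm": ["hidden_size", "layer_num"],
    "resnet": ["hidden_size", "layer_num", "filter_size"],
}
KNOWN_TASKS = {"pretrain", "classification", "regression", "seq_classification"}
# Tasks that additionally need "class_num".
CLASSIFICATION_TASKS = {"classification", "seq_classification"}


def check_model_config(config):
    """Hypothetical helper: return a list of problems found in a model_config dict."""
    problems = []
    task = config.get("task")
    if task not in KNOWN_TASKS:
        problems.append("unknown task: %r" % task)
    if task in CLASSIFICATION_TASKS and "class_num" not in config:
        problems.append("task %r requires 'class_num'" % task)
    model_type = config.get("model_type")
    if model_type not in REQUIRED_NETWORK_KEYS:
        problems.append("unknown model_type: %r" % model_type)
    else:
        for key in REQUIRED_NETWORK_KEYS[model_type]:
            if key not in config:
                problems.append("model_type %r requires %r" % (model_type, key))
    return problems


demo_config = {
    "task": "seq_classification",
    "class_num": 3,
    "label_name": "labels3",
    "model_type": "lstm",
    "hidden_size": 512,
    "layer_num": 3,
}
print(check_model_config(demo_config))  # an empty list means no problems were found
```

Switching `"model_type"` to `"transformer"` without adding `"head_num"` would make the helper report the missing key, which is the kind of mistake that is otherwise only caught when the network is built.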

## Define Model

In [4]:
from tape_model import TAPEModel # More details of the network structure are shown in tape_model.py.
from data_gen import setup_data_loader

model = TAPEModel(model_config=model_config)

lr = 0.0001 # learning rate
batch_size = 32 # batch size
train_data = './demos/secondary_structure_toy_data'

train_program = fluid.Program()
train_startup = fluid.Program()
with fluid.program_guard(train_program, train_startup):
    with fluid.unique_name.guard():
        model.forward(False)
        model.cal_loss()

        # set up the optimizer
        optimizer = default_optimizer(lr=lr, warmup_steps=0, max_grad_norm=0.1)
        setup_optimizer(optimizer, model, use_cuda, is_distributed)
        optimizer.minimize(model.loss)
        
        # set up the data loader
        train_data_loader = setup_data_loader(
                model,
                model_config,
                train_data,
                trainer_id,
                trainer_num,
                places,
                batch_size)
        exe.run(train_startup)


## Training Model
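
The training loop in this section clears a metric at the start of each epoch, updates it after every batch, and prints it. The real metric comes from `utils` and is not shown in this tutorial. As a rough sketch of how a running accuracy for a sequence classification task could be accumulated, assuming predictions and labels arrive as padded per-residue lists, consider the hypothetical `RunningAccuracy` class below. Only the `clear` and `show` method names mirror the calls in the loop; everything else is an assumption.

```python
class RunningAccuracy:
    """Hypothetical running accuracy over residues, ignoring padded positions."""

    def __init__(self):
        self.clear()

    def clear(self):
        # Reset the counters, e.g. at the start of every epoch.
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels, lengths):
        # predictions, labels: one list of per-residue class ids per sequence
        # lengths: the true (unpadded) length of each sequence
        for pred, label, length in zip(predictions, labels, lengths):
            for p, y in zip(pred[:length], label[:length]):
                self.correct += int(p == y)
                self.total += 1

    def accuracy(self):
        return self.correct / self.total if self.total else 0.0

    def show(self):
        print("\tExample: %d" % self.total)
        print("\tAccuracy: %f" % self.accuracy())


metric = RunningAccuracy()
metric.update(
    predictions=[[0, 1, 2, 0], [1, 1, 0, 0]],
    labels=[[0, 1, 1, 0], [1, 0, 0, 0]],
    lengths=[4, 2],  # the second sequence has two padded positions
)
metric.show()
```

Because the counters accumulate across batches, the printed accuracy is an average over everything seen so far in the epoch rather than a per-batch value.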

In [5]:
task = model_config['task']
train_metric = get_metric(task)
train_fetch_list = model.get_fetch_list()

for epoch_id in range(2):
    print('Epoch %d' % epoch_id)
    train_metric.clear()
    for data in train_data_loader():
        results = exe.run(
                program=train_program,
                feed=data,
                fetch_list=train_fetch_list,
                return_numpy=False)
        update_metric(task, train_metric, results) # update the metric
        train_metric.show()


Epoch 0
	Example: 10042
	Accuracy: 0.382892
	Example: 21055
	Accuracy: 0.423652
	Example: 32094
	Accuracy: 0.424877
	Example: 40606
	Accuracy: 0.427819
	Example: 49157
	Accuracy: 0.427243
	Example: 59397
	Accuracy: 0.430257
	Example: 70316
	Accuracy: 0.432121
	Example: 78011
	Accuracy: 0.429516
	Example: 86317
	Accuracy: 0.428861
	Example: 94610
	Accuracy: 0.429257
	Example: 102436
	Accuracy: 0.430298
	Example: 110738
	Accuracy: 0.433176
	Example: 119579
	Accuracy: 0.434190
	Example: 127880
	Accuracy: 0.434454
	Example: 136755
	Accuracy: 0.435012
	Example: 144800
	Accuracy: 0.435055
Epoch 1
	Example: 10042
	Accuracy: 0.450607
	Example: 21055
	Accuracy: 0.459416
	Example: 32094
	Accuracy: 0.451767
	Example: 40606
	Accuracy: 0.451140
	Example: 49157
	Accuracy: 0.449132
	Example: 59397
	Accuracy: 0.449871
	Example: 70316
	Accuracy: 0.450680
	Example: 78011
	Accuracy: 0.447398
	Example: 86317
	Accuracy: 0.446204
	Example: 94610
	Accuracy: 0.446073
	Example: 102436
	Accuracy: 0.446230
	Exam