# TransformerBeta: An Introduction

TransformerBeta is a generative model based on the Transformer architecture, developed to generate complementary binders for linear peptide epitopes of length 8 in an antiparallel beta strand conformation. The model is trained on a curated dataset of length 8 antiparallel beta strand pairs from the AF2 Beta Strand Database.

[Figure: TransformerBeta.png]

This Google Colaboratory notebook serves as an accessible user interface for TransformerBeta. It allows users to effortlessly generate, predict, and evaluate length 8 antiparallel beta interactions using the TransformerBeta model.

For detailed insight into TransformerBeta, we suggest referring to our research paper (yet to be released). 

To install TransformerBeta locally for your projects, visit: [TransformerBeta on Github](https://github.com/HZ3519/TransformerBeta). 

For individual usage of the AF2 Beta Strand Database, check out: [AF2 Beta Strand Database on Huggingface](https://huggingface.co/datasets/hz3519/AF2_Beta_Strand_Database/tree/main). 

For accessing the model weights and corresponding training, validation, and test sets mentioned in the paper, refer to: [TransformerBeta on Huggingface](https://huggingface.co/hz3519/TransformerBeta).

## Outline of the Notebook
1. Section 1: Setup Guidance
2. Section 2: Model Loading
3. Section 3: Generation of Peptide Sequences. Four methods are available: iterative sampling, random sampling, greedy prediction, and evaluation of beta conformation probability. See the cells in Section 3 for details.
4. Section 4: Analysis of Iterative Sampling Results (only if you chose iterative sampling in Section 3)

## Notebook Usage Instructions
Please follow the instructions for each section to run the notebook successfully.

## Licensing of TransformerBeta

TransformerBeta, inclusive of the model weights, training set, validation set, and test set, along with the AF2 Beta Strand Database, is distributed under the terms of the [MIT License](https://opensource.org/licenses/MIT).

**Kindly ensure that you adhere to the conditions stipulated by this license when using these files.**

# Section 1: Setup Guidance

Please **strictly follow** the instructions provided in each cell, working through each **Step** in order.

In [None]:
#@title Setting up Google Drive Connection
#@markdown To run TransformerBeta, a connection with your Google Drive is essential. This allows the program to save and access the necessary files.

#@markdown **Step 1**: Execute this cell by pressing `Ctrl+Enter` or by clicking the **Play** button to the left.

# This chunk will mount Google Drive to allow the storage and retrieval of files
from google.colab import drive
import os, sys
drive.mount('/content/gdrive') # This path is where all output will be stored.

In [None]:
#@title Installing Dependencies and TransformerBeta package (this cell takes approx 4mins)
#@markdown Ensure your runtime environment is set to CPU. 

#@markdown **Step 2**: Navigate to `Runtime --> Change runtime type` in the menu above and set the Hardware Accelerator to 'None'.

#@markdown **Step 3**: Execute this cell to install the necessary dependencies by pressing `Ctrl+Enter` or by clicking the **Play** button to the left. 

# Remove any existing TransformerBeta_project checkout; ignore_errors=True makes this a no-op if it is absent
import shutil
shutil.rmtree('/content/TransformerBeta_project', ignore_errors=True)

# Clone the repository. Access tokens must never be hard-coded in a shared notebook:
# if the repository is still private, supply credentials at runtime instead.
!git clone https://github.com/HZ3519/TransformerBeta_project.git

!pip install d2l==0.17.5 --no-deps
!pip install -r ./TransformerBeta_project/requirements.txt
!pip install -e ./TransformerBeta_project

import sys
if '/content/TransformerBeta_project/' not in sys.path:
    sys.path.append('/content/TransformerBeta_project/')

In [None]:
#@title Restarting Runtime and reconnecting Google Drive
#@markdown Once the installation of the necessary packages is completed, a restart of the runtime is required for the changes to take effect.
#@markdown **Step 4**: Navigate to `Runtime --> Restart runtime` in the menu above to restart.

#@markdown After the restart, Google Drive needs to be reconnected.
#@markdown **Step 5**: Execute this cell by pressing `Ctrl+Enter` or by clicking the **Play** button to the left to reconnect Google Drive.

# reconnect to Google drive

from google.colab import drive
drive.mount('/content/gdrive') 

# Section 2: Model Loading

In [None]:
#@markdown The default model is "model M retrain", the current best-performing model from the paper. Execute this cell by pressing `Ctrl+Enter` or by clicking the **Play** button to the left.

import torch
import json
from collections import OrderedDict
from TransformerBeta import *
import matplotlib.pyplot as plt
import torch.nn as nn
import numpy as np
import os
from google.colab import files

model_name = "model_M_retrain" #@param {type:"string"}

# Remove any previous copy of the model directory.
# shutil is imported here because the runtime restart cleared earlier imports.
import shutil
shutil.rmtree('/content/TransformerBeta_models', ignore_errors=True)

# Clone the model directory from Hugging Face.
# Access tokens must never be hard-coded in a shared notebook:
# if the repository is still private, authenticate at runtime instead.
!git lfs install
!git clone https://huggingface.co/hz3519/TransformerBeta_models

model_dir = "TransformerBeta_models/{}".format(model_name)

# Load the config
with open("{}/config.json".format(model_dir), "r") as f:
    config = json.load(f)

# Create instances of your encoder and decoder
encoder_standard = TransformerEncoder(
    config["vocab_size"], config["key_size"], config["query_size"], config["value_size"], config["num_hiddens"], 
    config["norm_shape"], config["ffn_num_input"], config["ffn_num_hiddens"], config["num_heads"],
    config["num_layers"], config["dropout"])
decoder_standard = TransformerDecoder(
    config["vocab_size"], config["key_size"], config["query_size"], config["value_size"], config["num_hiddens"], 
    config["norm_shape"], config["ffn_num_input"], config["ffn_num_hiddens"], config["num_heads"],
    config["num_layers"], config["dropout"], shared_embedding=encoder_standard.embedding)

# Create an instance of your model
model_standard = EncoderDecoder(encoder_standard, decoder_standard)
model_standard_total_params = sum(p.numel() for p in model_standard.parameters())
model_standard_total_trainable_params = sum(p.numel() for p in model_standard.parameters() if p.requires_grad)

# Load the model's state_dict
state_dict = torch.load("{}/model_weights.pth".format(model_dir), map_location='cpu')

# If the weights were saved from a model wrapped in nn.DataParallel, every key
# carries a 'module.' prefix; strip it so the keys match the unwrapped model
if list(state_dict.keys())[0].startswith('module.'):
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = k[len('module.'):]  # remove the 'module.' prefix (7 characters)
        new_state_dict[name] = v
    state_dict = new_state_dict
    
model_standard.load_state_dict(state_dict)

model_use = model_standard 
prediction_length = 8
device = d2l.try_gpu()

# Folder on Google Drive where peptide candidates and analysis tables are saved
output_dir = "gdrive/MyDrive/model_prediction_{}".format(model_name)
os.makedirs(output_dir, exist_ok=True)

print('Transformer model loaded: total number of parameters: {}'.format(model_standard_total_params))
print('Transformer model loaded: total number of trainable parameters: {}'.format(model_standard_total_trainable_params))
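The prefix handling in the cell above can be tried in isolation. The following is a minimal sketch, not the package's own code: `strip_module_prefix` is a hypothetical helper name, and plain integers stand in for the tensors a real checkpoint would hold.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it.

    Hypothetical helper for illustration; keys without the prefix are kept unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a checkpoint saved from a DataParallel-wrapped model.
wrapped = OrderedDict([("module.encoder.weight", 1), ("module.decoder.bias", 2)])
plain = strip_module_prefix(wrapped)
print(list(plain.keys()))
```

Checking each key individually, rather than only the first one, keeps the helper safe to call on a checkpoint that is already unwrapped.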

# Section 3: Generation of Peptide Sequences

In [None]:
#@title Iterative Sampling

#@markdown Iterative sampling draws **num_candidates** unique (non-repeating) random complementary peptides for a given target sequence.

#@markdown Please specify the length 8 target sequence from N-terminal to C-terminal.
Target = 'QPRTFLLK' #@param {type:"string"}
#@markdown Specify the number of candidates to sample.
num_candidates = 50 #@param {type:"integer"}
#@markdown Specify whether to download the file of sampled candidates.
DOWNLOAD = False #@param {type:"boolean"}
#@markdown After running the cell, the results can be accessed by ticking DOWNLOAD, from your Google Drive, or from the left sidebar of this Google Colab notebook at the path `gdrive/MyDrive/model_prediction_{model_name}/{Target}_{num_candidates}candidates.txt`.

#@markdown **Note:** Please sample at most 100 candidates at a time in Google Colab. To sample more, install the project locally and run the code on a machine with enough memory. Sampling 100 candidates takes ~11 s.

max_iter = 20
peptide_candidates = sample_candidates(model_use, Target, num_candidates, amino_dict, prediction_length + 2, device, max_iter=max_iter)
# Add a column with each peptide reversed (N- to C-terminal order) for synthesis, since the designed strand is antiparallel
synthesis_pep = np.array([string[::-1] for string in peptide_candidates[:, 0]])
peptide_candidates = np.concatenate((peptide_candidates, synthesis_pep.reshape(-1, 1)), axis=1)

print('The first 10 examples candidates are:')
print('designed complementary peptide, probability, synthesis complementary peptide') 
print(peptide_candidates[:10])

with open('{}/{}_{}candidates.txt'.format(output_dir, Target, num_candidates), 'w') as f:
	for i in range(len(peptide_candidates)):
		f.write(peptide_candidates[i][0] + '\t' + str(peptide_candidates[i][1]) + '\t' + str(peptide_candidates[i][2]) + '\n')

if DOWNLOAD:
	files.download('{}/{}_{}candidates.txt'.format(output_dir, Target, num_candidates))
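The file written by the cell above has three tab-separated columns: the designed peptide, its probability, and the reversed peptide for synthesis. The sketch below reproduces that layout on an in-memory buffer so it can be run anywhere. The helper name `to_synthesis_order`, the second peptide, and both probability values are made up for illustration and are not model output.

```python
import csv
import io


def to_synthesis_order(designed):
    """Reverse a designed (C- to N-terminal) peptide into N- to C-terminal order."""
    return designed[::-1]


# Hypothetical candidates: (designed peptide, probability). Values are illustrative only.
candidates = [("LQYDIIFL", 0.012), ("VKVTVEYR", 0.009)]
rows = [(pep, prob, to_synthesis_order(pep)) for pep, prob in candidates]

# Write the rows as tab-separated text, mirroring the candidates file layout.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

# Read the rows back the same way the analysis cell does: strip, then split on tabs.
buffer.seek(0)
loaded = [line.strip().split("\t") for line in buffer]
print(loaded[0])
```

Note that every field comes back as a string after the round trip, so probabilities need an explicit `float(...)` conversion before any numeric comparison.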

In [1]:
#@title Random Sampling

#@markdown Given a target sequence, random sampling draws 1 random complementary peptide by sampling an amino acid at each decoding position.

#@markdown please specify the length 8 target sequence from N-terminal to C-terminal
Target = 'QPRTFLLK' #@param {type:"string"}

peptide_candidates = sample_single_candidate(model_use, Target, amino_dict, prediction_length + 2, device)
print(peptide_candidates)

In [None]:
#@title Greedy Prediction

#@markdown Given a target sequence, greedy prediction predicts 1 complementary peptide by taking the amino acid with the highest probability at each decoding position.

#@markdown please specify the length 8 target sequence from N-terminal to C-terminal
Target = 'QPRTFLLK' #@param {type:"string"}

dec_comple_peptide_pred, dec_prob, dec_attention_weight_seq = predict_greedy_single(model_use, Target, amino_dict, prediction_length + 2, device, save_attention_weights=True, print_info=True)
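The difference between random sampling and greedy prediction can be shown on toy data. This sketch is not the implementation of `sample_single_candidate` or `predict_greedy_single`: it assumes a 20-letter standard amino acid alphabet and uses fixed, randomly generated per-position distributions, whereas a real autoregressive decoder recomputes each distribution from the residues already chosen.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet; the model vocabulary may differ


def greedy_pick(probs):
    """Return the index of the most probable residue."""
    return max(range(len(probs)), key=lambda i: probs[i])


def random_pick(probs, rng):
    """Sample an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def decode(step_probs, pick):
    """Build a peptide by applying `pick` to each position's distribution."""
    return "".join(AMINO_ACIDS[pick(p)] for p in step_probs)


# Toy distributions for 8 decoding positions (illustrative, not model output).
rng = random.Random(0)
step_probs = []
for _ in range(8):
    weights = [rng.random() for _ in AMINO_ACIDS]
    total = sum(weights)
    step_probs.append([w / total for w in weights])

greedy = decode(step_probs, greedy_pick)
sampled = decode(step_probs, lambda p: random_pick(p, rng))
print("greedy :", greedy)
print("sampled:", sampled)
```

Greedy decoding returns the same peptide on every call, while sampling can return a different peptide each time, which is why iterative sampling is able to collect many distinct candidates.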

In [None]:
#@title Evaluation of beta conformation probability

#@markdown Given a target sequence and a complementary peptide, the following function evaluates the antiparallel beta conformation probability of the peptide complex.

#@markdown please specify the length 8 target sequence from N-terminal to C-terminal
Target = 'QPRTFLLK' #@param {type:"string"}
#@markdown please specify the length 8 complementary peptide from C-terminal to N-terminal (face to face orientation)
complementary_peptide = 'LQYDIIFL' #@param {type:"string"}

dec_prob, dec_attention_weight_seq = evaluate_single(model_use, Target, complementary_peptide, amino_dict, prediction_length + 2, device, save_attention_weights=True, print_info=True)
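All cells in this section expect sequences of exactly 8 residues. A small check before calling the model can catch typing mistakes early. The sketch below is a hypothetical helper, not part of the TransformerBeta package, and it assumes that only the 20 standard one-letter amino acid codes are valid; the actual vocabulary in `amino_dict` may differ.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard residues


def validate_peptide(sequence, length=8):
    """Return the cleaned, upper-case sequence or raise ValueError if it is unusable."""
    sequence = sequence.strip().upper()
    if len(sequence) != length:
        raise ValueError(
            "expected {} residues, got {}: {!r}".format(length, len(sequence), sequence)
        )
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unsupported residue codes: {}".format(", ".join(invalid)))
    return sequence


print(validate_peptide(" qprtfllk "))
```

The same function can be applied to both `Target` and `complementary_peptide`, since both are length 8.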

# Section 4: Analysis of Iterative Sampling Results (if applicable)

In [None]:
#@title Load the sampled results

#@markdown please specify the length 8 target sequence from N-terminal to C-terminal
Target = 'QPRTFLLK' #@param {type:"string"}
#@markdown Specify the number of candidates sampled.
num_candidates = 50 #@param {type:"integer"}
#@markdown specify the number of top candidates to analyze
num_analysis = 20 #@param {type:"integer"}
#@markdown specify whether to use training data as a reference
use_train = False #@param {type:"boolean"}

#@markdown If use_train is False, the output table contains only the properties of the analyzed candidates. If use_train is True, it also contains each property's percentile within the training data. For example, a hydrophobicity percentile of 90 means that the candidate is more hydrophobic than 90% of the training complementary peptides.

#@markdown If use_train is False, the running time is very short. If use_train is True, analyzing 20 candidates takes ~6 min, and the running time increases linearly with the number of analyzed candidates.

# read peptide candidates from a txt file in a model prediction folder
with open('{}/{}_{}candidates.txt'.format(output_dir, Target, num_candidates), 'r') as f:
    peptide_candidates = []
    for line in f:
        peptide_candidates.append(line.strip().split('\t'))
peptide_candidates_all = np.array(peptide_candidates)

# load training data
if use_train:
	train_dict = np.load('{}/train_data.npy'.format(model_dir), allow_pickle=True)
	train_dict = train_dict.tolist()
	train_list = []
	for target, value_dict in train_dict.items():
		for comp, count in value_dict.items():
			train_list.append([target, comp, count])
	train_array = np.array(train_list)
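The loop above flattens a nested dictionary of the form `{target: {complementary_peptide: count}}` into one row per pair. The sketch below shows the same transformation on a toy dictionary; the sequences and counts are placeholders rather than entries from the training set, and `flatten_pairs` is a hypothetical helper name.

```python
def flatten_pairs(nested):
    """Turn {target: {complement: count}} into a list of (target, complement, count)."""
    return [
        (target, complement, count)
        for target, complements in nested.items()
        for complement, count in complements.items()
    ]


# Toy data with placeholder sequences; not taken from the training set.
toy_train_dict = {
    "AAAAAAAA": {"VVVVVVVV": 3, "IIIIIIII": 1},
    "GGGGGGGG": {"TTTTTTTT": 2},
}

rows = flatten_pairs(toy_train_dict)
for row in rows:
    print(row)
```

Keeping the rows as tuples preserves the counts as integers. Converting a mixed list of strings and integers with `np.array`, as the cell above does, casts every field to a string, which is fine as long as only the sequence columns are used afterwards.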

In [None]:
#@title Analyze the sampled results

#@markdown specify whether to download the output table for the sampled candidates
DOWNLOAD = False #@param {type:"boolean"}

#@markdown After running the cell, the results can be accessed by ticking DOWNLOAD, from your Google Drive, or from the left sidebar of this Google Colab notebook at the path `gdrive/MyDrive/model_prediction_{model_name}/output_analysis_{Target}_{num_candidates}_{num_analysis}.xlsx`.

# compute the cluster labels
cluster_labels_all = compute_clustering_labels(peptide_candidates_all[:, 0], amino_dict)
peptide_candidates_analysis = peptide_candidates_all[:num_analysis]
cluster_labels_analysis = cluster_labels_all[:num_analysis]

# Generate the output table for the analyzed candidates
output_file_name = '{}/output_analysis_{}_{}_{}.xlsx'.format(output_dir, Target, num_candidates, num_analysis)
peptide_candidates = peptide_candidates_analysis[:, 0]
peptide_candidates_prob = peptide_candidates_analysis[:, 1]
if use_train:
	reference_list = train_array[:, 1]
	target_reference_list = train_array[:, 0]
else:
	reference_list = None
	target_reference_list = None
generate_output_table(peptide_candidates, peptide_candidates_prob, reference_list, output_file=output_file_name, cluster_labels=cluster_labels_analysis, target=Target, target_reference_list=target_reference_list)

if DOWNLOAD:
	files.download(output_file_name)
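When `use_train` is True, the table reports percentiles relative to the training complementary peptides. One way to read such a percentile is as the share of reference values that lie strictly below the candidate's value. The sketch below follows that reading; it is an assumption about the definition, and `generate_output_table` may treat ties or interpolation differently. The reference values are made up.

```python
def percentile_rank(value, reference):
    """Percentage of reference values that are strictly lower than `value`."""
    if not reference:
        raise ValueError("reference must not be empty")
    below = sum(1 for ref in reference if ref < value)
    return 100.0 * below / len(reference)


# Made-up hydrophobicity values standing in for the training peptides.
reference = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
print(percentile_rank(3.2, reference))
```

Here 9 of the 10 reference values are below 3.2, so the candidate sits at the 90th percentile, matching the example given for hydrophobicity.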

#### The headers of the output table are defined below:

1. **Rank:** The ordinal rank of the peptide candidate, ordered by evaluated probability. A higher rank corresponds to a higher probability that the peptide candidate forms an antiparallel beta conformation.

2. **Cluster Label:** The label indicating which cluster the peptide candidate belongs to. It is determined by the K-Medoids algorithm, with the number of clusters set to 2 by default. This label helps users identify peptide candidates that differ from each other.

3. **Target Sequence:** The target sequence which the complementary peptide is designed for.

4. **Designed Complementary Peptide:** The complementary peptide sequence designed by the model. It is designed in a face-to-face orientation with the target sequence, which means that the designed complementary peptide is written from C-terminus to N-terminus.

5. **Synthesis Complementary Peptide:** The reverse sequence of the designed complementary peptide. It is written from N-terminus to C-terminus, which is the order used for peptide synthesis in the lab.

6. **Probability:** The probability score of each peptide candidate as evaluated by the model. A higher score indicates a higher probability that the peptide candidate forms an antiparallel beta conformation.

7. **Novelty Score:** The minimum Hamming distance between the designed peptide and all training complementary peptides. A novelty score of 0 indicates that the designed peptide is identical to one of the training complementary peptides. A novelty score >= 1 indicates that the designed peptide differs from all training complementary peptides. Higher scores suggest higher novelty.

8. **Target Novelty Score:** The Hamming distance between the target sequence and the most similar training target sequence of the designed peptide. A target novelty score of 0 indicates that this training target sequence is identical to the target sequence. A target novelty score >= 1 indicates that it differs from the target sequence. Higher scores suggest higher novelty.

9. **CamSol Solubility Score:** A measure of the peptide's solubility, as calculated by the CamSol method. Higher scores suggest higher solubility. We currently do not have access to the CamSol method, so this column is not available.

10. **Net Charge:** The total charge of the peptide, calculated by summing the charge of all amino acids in a peptide.

11. **Net Charge Percentile:** The percentile rank of the peptide's net charge compared to the training complementary peptides.

12. **Hydrophobicity:** The measure of the peptide's hydrophobicity, calculated by averaging the hydrophobicity (using the Kyte-Doolittle scale) of all amino acids in a peptide.

13. **Hydrophobicity Percentile:** The percentile rank of the peptide's hydrophobicity compared to the training complementary peptides.

14. **Molecular Weight:** The average molecular weight of the peptide, calculated by averaging the molecular weight of all amino acids in a peptide.

15. **Molecular Weight Percentile:** The percentile rank of the peptide's molecular weight compared to the training complementary peptides.

16. **Isoelectric Point:** The pH at which the peptide carries no net electrical charge, estimated using the `Bio.SeqUtils.IsoelectricPoint` module of the Biopython package.

17. **Isoelectric Point Percentile:** The percentile rank of the peptide's isoelectric point compared to the training complementary peptides.

18. **Aromaticity:** The percentage of aromatic amino acids in the peptide, calculated by the relative frequency of aromatic amino acids (F, W, Y) in a peptide.

19. **Aromaticity Percentile:** The percentile rank of the peptide's aromaticity compared to the training complementary peptides.
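
Several of the columns above follow directly from their definitions and can be recomputed by hand. The sketch below covers the novelty score, mean Kyte-Doolittle hydrophobicity, aromaticity, and net charge. It is an illustration rather than the code behind `generate_output_table`: the function names are hypothetical, the training list is a toy example, and the charge table assumes +1 for K and R, -1 for D and E, and 0 for every other residue (including histidine), which may not match the package's convention.

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed convention; others count as 0


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def novelty_score(peptide, training_peptides):
    """Minimum Hamming distance from `peptide` to any training peptide."""
    return min(hamming(peptide, ref) for ref in training_peptides)


def mean_hydrophobicity(peptide):
    """Average Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def aromaticity(peptide):
    """Relative frequency of F, W, and Y."""
    return sum(aa in AROMATIC for aa in peptide) / len(peptide)


def net_charge(peptide):
    """Sum of per-residue charges under the assumed convention above."""
    return sum(CHARGE.get(aa, 0) for aa in peptide)


# Toy training set; the second entry is a made-up variant for illustration.
training = ["LQYDIIFL", "LQYDIVFL"]
candidate = "LQYDIIFV"
print("novelty:", novelty_score(candidate, training))
print("hydrophobicity:", round(mean_hydrophobicity("LQYDIIFL"), 4))
print("aromaticity:", aromaticity("LQYDIIFL"))
print("net charge:", net_charge("LQYDIIFL"))
```

For a training set of realistic size, the novelty score requires one Hamming comparison per training peptide for every candidate, which is consistent with the longer running time noted when `use_train` is True.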