In [2]:
#################################
#Run this cell if using Google COLAB
#################################

#clone the repository and the data to run the notebook on Google Colab
!git clone https://github.com/GfellerLab/TCRpred
!pip install wget
!pip install pytorch_lightning==1.6.0
%cd TCRpred

# TCRpred

TCRpred is a sequence-based TCR-pMHC interaction predictor. TCR binding predictions are currently possible for 146 pMHCs. For 43 pMHCs, robust predictions were achieved in internal cross-validation, while models with fewer than 50 training TCRs have low confidence.

A list of all TCRpred models and additional information is available at ./pretrained_models/info_models.csv  

In [2]:
import pandas as pd
import wget
import os
df_info = pd.read_csv("./pretrained_models/info_models.csv")
df_info.head(10)

Unnamed: 0,TCRpred_model_name,Peptide,Origin,MHC_class,MHC,Host_species,Number_training_abTCR,AUC_5fold
0,H2-IAb_DIYKGVYQFKSV,DIYKGVYQFKSV,Lymphocytic choriomeningitis mammarenavirus,MHCII,H2-IAb,MusMusculus,3650,0.953809
1,A0201_GILGFVFTL,GILGFVFTL,Influenza A virus,MHCI,HLA-A*02:01,HomoSapiens,2079,0.943283
2,H2-Kb_SSYRRPVGI,SSYRRPVGI,Influenza A virus,MHCI,H2-Kb,MusMusculus,1158,0.938677
3,H2-Db_SSLENFRAYV,SSLENFRAYV,Influenza A virus,MHCI,H2-Db,MusMusculus,798,0.90634
4,A0201_LLWNGPMAV,LLWNGPMAV,Yellow fever virus,MHCI,HLA-A*02:01,HomoSapiens,644,0.928752
5,A0201_NLVPMVATV,NLVPMVATV,Human betaherpesvirus 5,MHCI,HLA-A*02:01,HomoSapiens,581,0.765138
6,B0801_RAKFKQLL,RAKFKQLL,,MHCI,HLA-B*08:01,HomoSapiens,556,0.951152
7,H2-Db_HGIRNASFI,HGIRNASFI,Murid betaherpesvirus 1,MHCI,H2-Db,MusMusculus,529,0.927418
8,H2-Kb_ASNENMETM,ASNENMETM,Influenza A virus,MHCI,H2-Kb,MusMusculus,515,0.904144
9,A0201_GLCTLVAML,GLCTLVAML,Human gammaherpesvirus 4,MHCI,HLA-A*02:01,HomoSapiens,483,0.956309
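
Because models trained on fewer than 50 TCRs have low confidence, it can help to filter this table before choosing a model. The following is a minimal sketch that assumes the column names shown above. The helper name `confident_models` and the use of 50 training TCRs as a hard cutoff are illustrative choices, not part of TCRpred, and the third row of the toy table is hypothetical.

```python
import pandas as pd

def confident_models(info, min_training_tcrs=50):
    """Keep models trained on at least `min_training_tcrs` TCRs, best AUC first."""
    kept = info[info["Number_training_abTCR"] >= min_training_tcrs]
    return kept.sort_values("AUC_5fold", ascending=False).reset_index(drop=True)

# Toy table with the same columns as info_models.csv.
# The first two rows are copied from the output above; "hypothetical_small_model"
# is a made-up row used only to show the filter removing a low-confidence model.
toy_info = pd.DataFrame({
    "TCRpred_model_name": ["A0201_GILGFVFTL", "A0201_GLCTLVAML", "hypothetical_small_model"],
    "Number_training_abTCR": [2079, 483, 12],
    "AUC_5fold": [0.943283, 0.956309, 0.60],
})

print(confident_models(toy_info)["TCRpred_model_name"].tolist())
```

In the notebook, `confident_models(df_info)` would apply the same filter to the full table.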


For each pMHC in the previous list, we can predict which TCRs are more likely to bind it.
TCRpred takes paired TCRs as input (V and J genes and CDR3 sequences of the alpha and beta chains), like those in the file ./test/test.csv.

In [2]:
input_tcrs = pd.read_csv("./test/test.csv")
input_tcrs

Unnamed: 0,cdr3_TRA,cdr3_TRB,TRAV,TRAJ,TRBV,TRBJ
0,CARGSNYNVLYF,CASRGQSQNTLYF,TRAV14D,TRAJ21,TRBV13-3,TRBJ2-4
1,CAMSAIMNRDDKIIF,CASRPNPGQGSYEQYF,TRAV17,TRAJ9,TRBV30,TRBJ2-7
2,CAVQRGGQKLLF,CASSPPQRLQETQYF,TRAV19,TRAJ10,TRBV19,TRBJ1-2
3,CAASIVWGSNFGNEKLTF,CASRTGDGQPQHF,TRAV14/DV4,TRAJ26,TRBV10-3,TRBJ2-1
4,CAAGSYNFNKFYF,CASSLSGGRTEAFF,TRAV12-3,TRAJ10,TRBV9,TRBJ2-7
...,...,...,...,...,...,...
998,CAGRDYGGATNKLIF,CSVRLVSKNIQYF,TRAV17,TRAJ9,TRBV12-3,TRBJ2-2
999,CAVRDKGTGGFKTIF,CASSLTGTGAQEQYF,TRAV14/DV4,TRAJ17,TRBV6-6,TRBJ2-3
1000,CAGTDRGSTLGRLYF,CASSQVKVSSYNEQFF,TRAV14/DV4,TRAJ3,TRBV7-9,TRBJ2-7
1001,CAATYGTNAGKSTF,CASSQNGVGGEQYF,TRAV8-2,TRAJ39,TRBV5-5,TRBJ2-1


To predict which TCRs can bind a specific pMHC, we run TCRpred with three arguments:
1. --input: the file containing the list of TCRs.
2. --model: the TCRpred model of the pMHC of interest.
3. --out: the file where the results will be stored.

For example, to determine which TCRs in ./test/test.csv can recognize the HLA-A\*02:01 GILGFVFTL epitope (TCRpred model = A0201_GILGFVFTL) and save the results in ./test/out_A0201_GILGFVFTL.csv, run:

In [9]:
!python TCRpred.py --model A0201_GILGFVFTL --input ./test/test.csv --out ./test/out_A0201_GILGFVFTL.csv
#Please note the leading "!", which allows running shell commands in the notebook

2023-05-13 12:46:05.793158: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-13 12:46:05.822907: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
#########################################################################
###### TCRpred: a sequence-based predictor of TCR-pMHC interactions  ####
#########################################################################
TCRpred model A0201_GILGFVFTL 
Computing binding predictions for {0} ./test/test.csv
Testing DataLoader 0: 100%|████████████████| 1003/1003 [00:05<00:00

Done! The results are sorted from the most probable binder to the least, and we can now proceed with their analysis. We provide both the raw TCRpred score and the %rank. In the paper, we observed that TCRs with %rank < 0.1 are likely to be true binders.

In [10]:
results_GIL = pd.read_csv("./test/out_A0201_GILGFVFTL.csv", comment = '#')
results_GIL

Unnamed: 0,cdr3_TRA,cdr3_TRB,TRAV,TRAJ,TRBV,TRBJ,score,perc_rank
0,CAGGGSQGNLIF,CASSIRSSYEQYF,TRAV27,TRAJ42,TRBV19,TRBJ2-6,3.85309,0.00040
1,CAGQYGGSQGNLIF,CASSIRSTDTQYF,TRAV12,TRAJ42,TRBV19,TRBJ2-3,3.57812,0.00164
2,CAFINGSSNTGKLIF,CATSSFLAVSYEQYF,TRAV1-2,TRAJ29,TRBV3-1,TRBJ2-2,0.50054,26.01601
3,CAASIDGRNNDMRF,CASSPFTGPPYEQYF,TRAV14/DV4,TRAJ45,TRBV19,TRBJ2-7,0.36692,31.19044
4,CAYRSPWGMGGSQGNLIF,CATSFMVQETQYF,TRAV13-1,TRAJ6,TRBV19,TRBJ1-4,0.04156,45.24820
...,...,...,...,...,...,...,...,...
998,CAEISFFSGGYNKLIF,CSALAGGLNTQYF,TRAV21,TRAJ49,TRBV9,TRBJ2-1,-5.66217,100.00000
999,CAVTDGAGSYQLTF,CASSPSGITGELFF,TRAV2,TRAJ42,TRBV5-8,TRBJ1-1,-5.68870,100.00000
1000,CAVRWDTGNQFYF,CASSQTGRYQETQYF,TRAV8-4,TRAJ26,TRBV9,TRBJ1-5,-5.77388,100.00000
1001,CAVSPTGRRALTF,CATSPGQNTGELFF,TRAV2,TRAJ52,TRBV10-3,TRBJ1-1,-6.11638,100.00000
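
The %rank < 0.1 guideline can be applied directly to the results table. The following is a minimal sketch that assumes the `perc_rank` column shown above. The helper name `likely_binders` is hypothetical, and the toy rows reuse three entries from the output.

```python
import pandas as pd

def likely_binders(results, max_perc_rank=0.1):
    """Return the rows whose %rank is strictly below the chosen threshold."""
    return results[results["perc_rank"] < max_perc_rank]

# Three rows taken from the table above, reduced to the columns needed here
toy_results = pd.DataFrame({
    "cdr3_TRB": ["CASSIRSSYEQYF", "CASSIRSTDTQYF", "CATSSFLAVSYEQYF"],
    "score": [3.85309, 3.57812, 0.50054],
    "perc_rank": [0.00040, 0.00164, 26.01601],
})

print(likely_binders(toy_results))
```

In the notebook, `likely_binders(results_GIL)` would select the same kind of rows from the full output.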


Not all the 146 pretrained models are stored on GitHub. To make predictions for another pMHC we first need to download the corresponding TCRpred model.  
For example, to study the HLA-A\*02:01 YLQPRTFLL epitope from SARS-CoV-2 (TCRpred model name A0201_YLQPRTFLL), we can download the model by running:

In [3]:
model_name="A0201_YLQPRTFLL"
if os.path.exists("./pretrained_models/model_{0}.ckpt".format(model_name)):
    print("TCRpred model already downloaded")
else:
    url = "https://zenodo.org/record/7930623/files/model_"+model_name+".ckpt"
    print("Downloading TCRpred model for {0}".format(model_name))
    filename = wget.download(url, out = './pretrained_models')

Downloading TCRpred model for A0201_YLQPRTFLL
100% [........................................................................] 31808047 / 31808047
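
When several models are needed, the URL and local path can be derived from the model name in one place. The following is a minimal sketch that reuses the Zenodo URL pattern and the `model_<name>.ckpt` naming from the cell above. The helper names `model_location` and `missing_models` are hypothetical, and nothing is downloaded here.

```python
import os

# Base URL taken from the download cell above
ZENODO_BASE = "https://zenodo.org/record/7930623/files"

def model_location(model_name, model_dir="./pretrained_models"):
    """Return (download URL, local checkpoint path) for a TCRpred model name."""
    filename = "model_{0}.ckpt".format(model_name)
    return ZENODO_BASE + "/" + filename, os.path.join(model_dir, filename)

def missing_models(model_names, model_dir="./pretrained_models"):
    """List the models whose checkpoint file is not yet present locally."""
    return [name for name in model_names
            if not os.path.exists(model_location(name, model_dir)[1])]

url, path = model_location("A0201_YLQPRTFLL")
print(url)
print(path)
```

Each name returned by `missing_models` could then be fetched with `wget.download(url, out='./pretrained_models')`, as in the cell above.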

Now we can make predictions for the new pMHC.

In [7]:
!python TCRpred.py --model A0201_YLQPRTFLL --input ./test/test.csv --out ./test/out_A0201_YLQPRTFLL.csv

2023-05-13 12:38:38.848013: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-13 12:38:38.877175: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
#########################################################################
###### TCRpred: a sequence-based predictor of TCR-pMHC interactions  ####
#########################################################################
TCRpred model A0201_YLQPRTFLL 
Computing binding predictions for {0} ./test/test.csv
Testing DataLoader 0: 100%|████████████████| 1003/1003 [00:05<00:00

In [8]:
results_YLQ = pd.read_csv("./test/out_A0201_YLQPRTFLL.csv", comment = '#')
results_YLQ

Unnamed: 0,cdr3_TRA,cdr3_TRB,TRAV,TRAJ,TRBV,TRBJ,score,perc_rank
0,CAANRDDKIIF,CSVEFKSRAGELFF,TRAV12-1,TRAJ53,TRBV11-3,TRBJ2-7,2.84108,0.39592
1,CAENLGENQFYF,CATIDTNTGELFF,TRAV13-2,TRAJ47,TRBV4-1,TRBJ1-5,2.46276,0.88953
2,CVVSLRDNYGQNFVF,CASSDTDTGELFF,TRAV19,TRAJ42,TRBV7-8,TRBJ2-3,1.71331,4.08557
3,CAARGFQKLVF,CSVDRTNEKLFF,TRAV21,TRAJ29,TRBV27,TRBJ2-5,1.14473,11.29627
4,CAVYEDDKIIF,CASSLGTDGNEQFF,TRAV1-2,TRAJ23,TRBV6-5,TRBJ2-2,0.78418,19.61315
...,...,...,...,...,...,...,...,...
998,CIVRVGGSSNTGKLIF,CASSQGWGAEGNTIYF,TRAV19,TRAJ45,TRBV19,TRBJ2-7,-3.60105,99.99661
999,CAASTPSGGGADGLTF,CASSQGDQHTDTQYF,TRAV6,TRAJ26,TRBV24-1,TRBJ2-1,-3.69587,99.99787
1000,CAVRIYNAGNNRKLIW,CASSPIDGYGYTF,TRAV17,TRAJ16,TRBV28,TRBJ1-1,-3.75208,99.99839
1001,CAVGGYGGSQGNLIF,CASSLWRGLSAGNTIYF,TRAV26-2,TRAJ47,TRBV27,TRBJ2-3,-3.77857,99.99859
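
The two runs above differ only in the --model and --out arguments, so the command can be assembled programmatically when scoring the same TCRs against several pMHCs. The following is a minimal sketch that uses only the three flags shown in this notebook. The helper name `build_command` and the `out_<model>.csv` naming are assumptions that mirror the examples above.

```python
import shlex

def build_command(model_name, input_csv, out_dir="./test"):
    """Assemble the TCRpred command line for one model as an argument list."""
    out_csv = "{0}/out_{1}.csv".format(out_dir, model_name)
    return ["python", "TCRpred.py",
            "--model", model_name,
            "--input", input_csv,
            "--out", out_csv]

# Print the commands for the two models used in this notebook
for name in ["A0201_GILGFVFTL", "A0201_YLQPRTFLL"]:
    print(shlex.join(build_command(name, "./test/test.csv")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside the TCRpred directory, provided the corresponding model has already been downloaded.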


If you want to test your own set of TCRs, you can upload the file to Google Colab and pass its path after --input. Please refer to ./test/test.csv for the correct input format.
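
Before running TCRpred on a new file, a quick check of the column names can catch formatting problems early. The following is a minimal sketch. It assumes that the six columns shown in ./test/test.csv are the ones TCRpred expects, which is inferred from the example file rather than from the tool's documentation, and the helper name `missing_columns` is hypothetical.

```python
import pandas as pd

# Column names as they appear in ./test/test.csv (assumed to be required)
EXPECTED_COLUMNS = ["cdr3_TRA", "cdr3_TRB", "TRAV", "TRAJ", "TRBV", "TRBJ"]

def missing_columns(tcrs):
    """Return the expected columns that are absent from the table."""
    return [col for col in EXPECTED_COLUMNS if col not in tcrs.columns]

# One complete row copied from test.csv, and the same row without the J genes
complete = pd.DataFrame([{
    "cdr3_TRA": "CAVQRGGQKLLF", "cdr3_TRB": "CASSPPQRLQETQYF",
    "TRAV": "TRAV19", "TRAJ": "TRAJ10", "TRBV": "TRBV19", "TRBJ": "TRBJ1-2",
}])
incomplete = complete.drop(columns=["TRAJ", "TRBJ"])

print(missing_columns(complete))
print(missing_columns(incomplete))
```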