<h1> Epitope Prediction </h1>

This tutorial illustrates how to use epytope to predict T cell receptor (TCR) binding to epitopes and how to analyze the results. epytope offers a long list of prediction methods and is designed so that extending it with your favorite method is easy.

This tutorial will entail:
- Simple TCR-epitope prediction from a repertoire of TCRs and selected epitopes
- Manipulation of the results
- Consensus prediction with multiple prediction methods
- Integration of a new prediction method


<h2> Chapter 1: The data structures </h2>
<br/>
We start by importing the required packages.

In [1]:
import sys
sys.path.append("../..")

In [2]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

from epytope.Core import Peptide, Allele, TCREpitope, ImmuneReceptorChain, ImmuneReceptor

from epytope.IO import IRDatasetAdapterFactory
from epytope.TCRSpecificityPrediction import TCRSpecificityPredictorFactory

Let's start off with something simple: defining TCR epitopes (consisting of a peptide and, optionally, an HLA allele) and TCRs. You will find all basic classes under `epytope.Core`. For more information on HLA alleles and peptides, see the tutorial `EpitopePrediction.ipynb`.

In [3]:
peptide = Peptide("SYFPEITHI")
allele = Allele("HLA-A*02:01")
epitope_1 = TCREpitope(peptide=peptide, allele=allele)
epitope_1

TCR EPITOPE:
 PEPTIDE SYFPEITHI
 bound by ALLELE: HLA-A*02:01

In [4]:
epitope_2 = TCREpitope(peptide="EAAGIGILTV", allele=None)
epitope_2

TCR EPITOPE:
 PEPTIDE EAAGIGILTV

Next, we define a TCR using the general class `ImmuneReceptor`, which can also store B cell receptors (BCRs). Each TCR consists of at least one `ImmuneReceptorChain`, typically an α- or β-chain for TCR-epitope prediction, since γδ TCRs are usually not considered by the predictors. Additionally, you can provide the cell type (e.g. "CD4 T cell") and the host organism.

Each chain can contain the following information: chain type (one of ["TRA", "TRB", "TRG", "TRD", "IGK", "IGH", "IGL"]), V-, (D-), and J-gene, and the complementarity-determining regions 1-3 (CDR1-3). Note that all information is optional except the CDR3 sequence.

In [5]:
alpha_1 = ImmuneReceptorChain(chain_type="TRA", 
                              v_gene="TRAV26-1", 
                              j_gene="TRAJ43",
                              cdr3="CIVRAPGRADMRF")
beta_1 = ImmuneReceptorChain(chain_type="TRB", 
                             v_gene="TRBV13", 
                             j_gene="TRBJ1-5",
                             cdr3="CASSYLPGQGDHYSNQPQHF")
tcr_1 = ImmuneReceptor([alpha_1, beta_1], cell_type="CD4 T cell", organism=None)

beta_2 = ImmuneReceptorChain(chain_type="TRB", 
                             v_gene="TRBV5-8", 
                             j_gene="TRBJ1-1",
                             cdr3="CASSTIRQGSNTFF")
tcr_2 = ImmuneReceptor([beta_2], cell_type="T cell")

tcrs = [tcr_1, tcr_2]
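
To make the "everything is optional except the CDR3" rule concrete, here is a minimal stand-alone sketch using a plain dataclass. `ChainSketch` is a hypothetical name and not part of epytope; it only mirrors the fields listed above, and the checks that `ImmuneReceptorChain` actually performs may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Chain types listed in this tutorial
CHAIN_TYPES = ("TRA", "TRB", "TRG", "TRD", "IGK", "IGH", "IGL")


@dataclass(frozen=True)
class ChainSketch:
    """Hypothetical stand-in for an immune receptor chain (not the epytope class)."""
    cdr3: str                          # the only mandatory field
    chain_type: Optional[str] = None
    v_gene: Optional[str] = None
    d_gene: Optional[str] = None
    j_gene: Optional[str] = None
    cdr1: Optional[str] = None
    cdr2: Optional[str] = None

    def __post_init__(self):
        # Reject an empty CDR3 and chain types outside the list above
        if not self.cdr3:
            raise ValueError("a CDR3 sequence is required")
        if self.chain_type is not None and self.chain_type not in CHAIN_TYPES:
            raise ValueError(f"unknown chain type: {self.chain_type!r}")


beta = ChainSketch(cdr3="CASSTIRQGSNTFF", chain_type="TRB",
                   v_gene="TRBV5-8", j_gene="TRBJ1-1")
minimal = ChainSketch(cdr3="CIVRAPGRADMRF")  # the CDR3 alone is enough
```

Failing early on a missing CDR3 keeps incomplete records out of a repertoire before any predictor is called.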

Of course, we don't have to specify a full TCR repertoire by hand. We implemented interfaces to several IR formats (AIRR, Scirpy), databases (IEDB, VDJdb, McPas-TCR), and general data frames.

In [6]:
path_data = "../../scrubs/vdjdb.tsv"

In [7]:
for name, version in IRDatasetAdapterFactory.available_methods().items():
    print(name, ",".join(version))

vdjdb 2023.06.01
iedb 1.0.0
mcpas-tcr 1.0.0
scirpy 0.10.1
airr scirpy:0.10.1
dataframe 0.0.0.1


In [8]:
tcr_repertoire = IRDatasetAdapterFactory("vdjdb")
tcr_repertoire.from_path(path_data)
tcr_repertoire.receptors = tcr_repertoire.receptors[:20]

In [12]:
tcr_repertoire.to_pandas().to_csv("../Data/examples/test_tcrs.csv")

In [9]:
for tcr in tcr_repertoire.receptors[:3]:
    print(tcr)
    print()

ALPHA-BETA T CELL RECEPTOR
in HomoSapiens
 IMMUNE RECEPTOR CHAIN: TRA
 CDR3: CIVRAPGRADMRF
 V_gene: TRAV26-1*01
 J_gene: TRAJ43*01
 IMMUNE RECEPTOR CHAIN: TRB
 CDR3: CASSYLPGQGDHYSNQPQHF
 V_gene: TRBV13*01
 J_gene: TRBJ1-5*01

ALPHA-BETA T CELL RECEPTOR
in HomoSapiens
 IMMUNE RECEPTOR CHAIN: TRA
 CDR3: CAVPSGAGSYQLTF
 V_gene: TRAV20*01
 J_gene: TRAJ28*01
 IMMUNE RECEPTOR CHAIN: TRB
 CDR3: CASSFEPGQGFYSNQPQHF
 V_gene: TRBV13*01
 J_gene: TRBJ1-5*01

ALPHA-BETA T CELL RECEPTOR
in HomoSapiens
 IMMUNE RECEPTOR CHAIN: TRA
 CDR3: CAVKASGSRLT
 V_gene: TRAV2*01
 IMMUNE RECEPTOR CHAIN: TRB
 CDR3: CASSYEPGQVSHYSNQPQHF
 V_gene: TRBV13*01
 J_gene: TRBJ1-5*01



We can also convert the resulting repertoire back into a data frame, a common format that is easier to read.

In [10]:
df_tmp = tcr_repertoire.to_pandas()
df_tmp.head()

Unnamed: 0,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,
3,alpha-beta T cell,HomoSapiens,TRB,CASSALASLNEQFF,TRBV14*01,,TRBJ2-1*01,TRA,CAYRPPGTYKYIF,TRAV38-2/DV8*01,TRAJ40*01
4,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01
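
In the table above, the β-chain fields appear under the `VDJ_` prefix and the α-chain fields under the `VJ_` prefix. The following sketch shows one way such a flattening could work, assuming the prefix is chosen by whether the chain rearranges a D segment (TRB, TRD, IGH) or not (TRA, TRG, IGK, IGL). `receptor_to_row` is a hypothetical helper and not epytope's `to_pandas`.

```python
# Chains that rearrange V, D and J segments; the others only use V and J.
VDJ_CHAINS = {"TRB", "TRD", "IGH"}
VJ_CHAINS = {"TRA", "TRG", "IGK", "IGL"}


def receptor_to_row(chains):
    """Flatten a list of chain dicts into one row with VDJ_/VJ_ prefixed keys.

    Hypothetical helper for illustration only.
    """
    row = {}
    for chain in chains:
        chain_type = chain["chain_type"]
        if chain_type in VDJ_CHAINS:
            prefix = "VDJ"
        elif chain_type in VJ_CHAINS:
            prefix = "VJ"
        else:
            raise ValueError(f"unknown chain type: {chain_type!r}")
        for key, value in chain.items():
            row[f"{prefix}_{key}"] = value
    return row


# Third receptor of the repertoire shown above (its alpha chain has no J gene)
row = receptor_to_row([
    {"chain_type": "TRA", "cdr3": "CAVKASGSRLT",
     "v_gene": "TRAV2*01", "j_gene": None},
    {"chain_type": "TRB", "cdr3": "CASSYEPGQVSHYSNQPQHF",
     "v_gene": "TRBV13*01", "j_gene": "TRBJ1-5*01"},
])
print(row["VDJ_cdr3"], row["VJ_cdr3"])
```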


<h2> Chapter 2: The predictors </h2>
<br/>
epytope has a single entry point to the different prediction methods, namely `TCRSpecificityPredictorFactory`. It handles the initialization of the different methods and also collects newly implemented prediction methods, provided they are implemented properly. `TCRSpecificityPredictorFactory` can also list the prediction methods that epytope supports:

In [11]:
for name, version in TCRSpecificityPredictorFactory.available_methods().items():
    print(name, ",".join(version))

imrex 
titan 1.0.0
tcellmatch 
stapler 
ergo-ii 
pmtnet 
epitcr 
atm-tcr 
attntap 
teim 
bertrand 
ergo-i 
teinet 
panpep 
dlptcr 
tulip-tcr 


We will run `imrex` on the TCR repertoire. The predictors may have dependencies that clash with each other and with `epytope`. Therefore, epytope lets you specify an environment in which to run a predictor, which can differ from the epytope environment, by providing one of the following arguments:
- `conda`: specify an anaconda environment in which the predictor is installed
- `interpreter`: directly specify the python interpreter, which should be used for the predictor
- `cmd_prefix`: any shell command that is executed before the predictor is run (e.g. `cmd_prefix=conda activate env_imrex` is equivalent to `conda=env_imrex`)
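
As a rough mental model of how these three options relate, the sketch below assembles a shell command from them. This is an illustration only: `build_command` and the script name `run_predictor.py` are hypothetical, and the way epytope actually constructs its commands is not shown in this tutorial and may differ.

```python
def build_command(script, conda=None, interpreter=None, cmd_prefix=None):
    """Assemble a shell command string from the three environment options.

    Hypothetical helper; it only mirrors the description in the list above.
    """
    parts = []
    if cmd_prefix:
        # arbitrary shell command executed before the predictor
        parts.append(cmd_prefix)
    if conda:
        # shorthand for activating a conda environment first
        parts.append(f"conda activate {conda}")
    # fall back to whatever `python` resolves to in the active environment
    python = interpreter or "python"
    parts.append(f"{python} {script}")
    return " && ".join(parts)


via_conda = build_command("run_predictor.py", conda="env_imrex")
via_prefix = build_command("run_predictor.py",
                           cmd_prefix="conda activate env_imrex")
print(via_conda)
```

In this sketch, `conda="env_imrex"` and `cmd_prefix="conda activate env_imrex"` produce the same command, which matches the equivalence stated in the list.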

In [12]:
predictor = TCRSpecificityPredictorFactory("imrex")
results_imrex = predictor.predict(tcr_repertoire, [epitope_1] * len(tcr_repertoire.receptors), 
                                  conda="tcr_predictors",
                                  pairwise=False)
results_imrex.head(3)

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,Epitope,Epitope,Method
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,Peptide,MHC,imrex
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,SYFPEITHI,HLA-A*02:01,0.638894
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,SYFPEITHI,HLA-A*02:01,0.558767
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,SYFPEITHI,HLA-A*02:01,0.774822


In the previous run, we set `pairwise=False`. In this mode, each TCR of the repertoire is predicted against the epitope at the corresponding position in the list of epitopes. We can also predict all pairs of TCRs and epitopes by choosing `pairwise=True`.

In [13]:
results_imrex = predictor.predict(tcr_repertoire, [epitope_1, epitope_2], 
                            conda="tcr_predictors",
                            pairwise=True)
results_imrex.head(3)

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,"SYFPEITHI, HLA-A*02:01",EAAGIGILTV
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,imrex,imrex
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,0.638894,0.652137
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,0.558767,0.532529
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,0.774822,0.48082
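
The difference between the two modes can be sketched with plain Python lists. `pair_inputs` is a hypothetical helper, not an epytope function; it assumes that `pairwise=False` requires one epitope per TCR (as in the call above, where the epitope list was repeated to the length of the repertoire) and that duplicate epitopes are dropped in pairwise mode.

```python
from itertools import product

tcr_cdr3s = ["CASSYLPGQGDHYSNQPQHF", "CASSFEPGQGFYSNQPQHF", "CASSYEPGQVSHYSNQPQHF"]
peptides = ["SYFPEITHI", "EAAGIGILTV"]


def pair_inputs(tcrs, epitopes, pairwise):
    """Return the (TCR, epitope) combinations that would be scored."""
    if pairwise:
        # every TCR against every distinct epitope (order preserved)
        return list(product(tcrs, dict.fromkeys(epitopes)))
    if len(tcrs) != len(epitopes):
        raise ValueError("pairwise=False needs exactly one epitope per TCR")
    # position i of the TCR list is matched with position i of the epitope list
    return list(zip(tcrs, epitopes))


all_pairs = pair_inputs(tcr_cdr3s, peptides, pairwise=True)
by_position = pair_inputs(tcr_cdr3s, ["SYFPEITHI"] * len(tcr_cdr3s), pairwise=False)
print(len(all_pairs), len(by_position))
```

With three TCRs and two epitopes, the pairwise mode yields six combinations, while the positional mode yields one combination per TCR.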


The following predictors are installable via `pip` and can therefore be found automatically by the python interpreter:
- ImRex
- TITAN
- TCellMatch
- STAPLER

For the remaining predictors, you need to provide the path to the installation via the `repository` argument.

In [14]:
predictor = TCRSpecificityPredictorFactory("pMTnet")
results_pmtnet = predictor.predict(tcr_repertoire, [epitope_1], 
                                  conda="epytope_tf20",
                                  repository="../../external/pMTnet",
                                  pairwise=True)
results_pmtnet.head(3)

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,"SYFPEITHI, HLA-A*02:01"
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,pmtnet
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,0.045
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,0.017
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,0.085


<h2> Chapter 3: The Results</h2>
<br/>
All predictors return a data-table-like object that stores the TCR information, the epitopes, and the predicted scores.

The result format is based on `pandas.DataFrame`. Therefore, all standard operations such as exporting (`results.to_csv(<path>)`) or plotting can be performed.
Additionally, we can merge the results of different predictors for comparison or consensus prediction.

To combine prediction results, call `merge_results` on one result object and pass the remaining result objects as a list. The method returns a merged result object of the same type. Note that you should only merge results that all stem from `pairwise=True` or all from `pairwise=False`; the two formats differ and should not be mixed. Duplicated TCRs are removed during merging.

In [15]:
results = [results_imrex, results_pmtnet]
df = results[0].merge_results(results[1:])
df.head(3)

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,EAAGIGILTV,"SYFPEITHI, HLA-A*02:01","SYFPEITHI, HLA-A*02:01"
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,imrex,imrex,pmtnet
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,0.652137,0.638894,0.045
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,0.532529,0.558767,0.017
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,0.48082,0.774822,0.085
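
Once scores from several methods sit side by side, a simple consensus can be computed. The sketch below averages per-method ranks instead of raw scores, because the predictors report values on different scales. It is a stand-alone illustration in plain Python using the three `SYFPEITHI, HLA-A*02:01` rows shown above; `ranks` and `mean_rank_consensus` are hypothetical helpers, not part of epytope. It also assumes that a higher score means stronger predicted binding for every method, which should be checked against each predictor's documentation before use.

```python
def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); ties get arbitrary order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def mean_rank_consensus(scores_by_method):
    """Average the per-method ranks of each TCR."""
    per_method = [ranks(scores) for scores in scores_by_method.values()]
    n_methods = len(per_method)
    return [sum(tcr_ranks) / n_methods for tcr_ranks in zip(*per_method)]


# Scores for the first three TCRs against SYFPEITHI, copied from the merged table
scores = {
    "imrex": [0.638894, 0.558767, 0.774822],
    "pmtnet": [0.045, 0.017, 0.085],
}
consensus = mean_rank_consensus(scores)
print(consensus)
```

For these three rows both methods order the TCRs identically, so the consensus rank of each TCR equals its rank in either method.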


<h2> Chapter 4: Model Alternatives</h2>
<br/>
The following predictors provide multiple models, which can be selected via keyword arguments in the `predict` call:

- tcellmatch: `model=iedb_BILSTM_CONCAT_LEARN_1_1_1_1_1_s_bilstm_cv0` (see model folder)
- ergo-i: `model` from `['ae_vdjdb1', 'lstm_vdjdb1', 'ae_mcpas1', 'lstm_mcpas1']`
- ergo-ii: `dataset` from `['vdjdb', 'mcpas']`
- imrex: `model` from `['2020-07-24_19-18-39_trbmhcidown-shuffle-padded-b32-lre4-reg001/2020-07-24_19-18-39_trbmhcidown-shuffle-padded-b32-lre4-reg001.h5', '2020-07-30_11-30-27_trbmhci-shuffle-padded-b32-lre4-reg001/2020-07-30_11-30-27_trbmhci-shuffle-padded-b32-lre4-reg001.h5']`
- teinet: `model` from `['large_dset', 'teinet_data']`
- epitcr: `model` from `['rdforestWithMHCModel', 'rdforestWithoutMHCModel', 'rdforestWithoutMHCNonOverlapingModel']`
- dlptcr: `model_type` from `['AB', 'A', 'B']` corresponding to alpha-beta, alpha, and beta chains used as input
- attntap: `model` from `['cv_model_0_mcpas_0', 'cv_model_0_vdjdb_0']`


In [16]:
predictor = TCRSpecificityPredictorFactory("imrex")
results_imrex = predictor.predict(tcr_repertoire, [epitope_1] * len(tcr_repertoire.receptors), 
                                  model='2020-07-30_11-30-27_trbmhci-shuffle-padded-b32-lre4-reg001/2020-07-30_11-30-27_trbmhci-shuffle-padded-b32-lre4-reg001.h5',
                                  conda="tcr_predictors",
                                  pairwise=False)
results_imrex.head(3)

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,Epitope,Epitope,Method
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,Peptide,MHC,imrex
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,SYFPEITHI,HLA-A*02:01,0.906885
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,SYFPEITHI,HLA-A*02:01,0.753795
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,SYFPEITHI,HLA-A*02:01,0.567215


<h2> Chapter 5: Adding a new predictor </h2>
<br/>

epytope has a plugin system that makes it easy to extend. To include a new TCR specificity prediction method, inherit from `epytope.Core.ATCRSpecificityPrediction` and implement its interface.

For methods that call a pip-installable package, we recommend using `epytope.TCRSpecificityPrediction.ML.ACmdTCRSpecificityPrediction`.

For methods based on a GitHub repository that cannot be installed as a package, we used `epytope.TCRSpecificityPrediction.External.ARepoTCRSpecificityPrediction`.

In [17]:
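from epytope.TCRSpecificityPrediction import ACmdTCRSpecificityPrediction
# IRDataset is used in predict() below; this import path is an assumption
from epytope.IO.IRDatasetAdapter import IRDataset
import numpy as np
import pandas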

class RandomTCRSpecificityPrediction(ACmdTCRSpecificityPrediction):
    __name = "random"
    __version = "1.0.0"
    __tcr_length = (0, 40)  # Admissible CDR3 length
    __epitope_length = (0, 40)  # Admissible epitope length
    __organism = "H"  # Allowed organism of the TCRs (H=Human, M=Mouse, e.g. HM)

    @property
    def version(self):
        return self.__version

    @property
    def name(self):
        return self.__name

    @property
    def tcr_length(self):
        return self.__tcr_length

    @property
    def epitope_length(self):
        return self.__epitope_length
    
    @property
    def organism(self):
        return self.__organism
    
    def predict(self, tcrs, epitopes, pairwise=True, interpreter=None, conda=None, cmd_prefix=None, **kwargs):
        # Make multiple input types available
        if isinstance(epitopes, TCREpitope):
            epitopes = [epitopes]
        if isinstance(tcrs, ImmuneReceptor):
            tcrs = [tcrs]
        if isinstance(tcrs, list):
            tcrs = IRDataset(tcrs)
        if pairwise:
            epitopes = list(set(epitopes))
            
        # test whether the input is correct
        self.input_check(tcrs, epitopes, pairwise, **kwargs)
        
        # insert your own prediction here
        results = tcrs.to_pandas()
        if pairwise:
            results = self.combine_tcrs_epitopes_pairwise(results, epitopes)
        else:
            results = self.combine_tcrs_epitopes_list(results, epitopes)
        results = results.drop_duplicates(["VDJ_cdr3", "Epitope"])
        results["Score"] = np.random.rand(len(results))
        
        # Create a TCR specificity result. Alternatively, you can build a pandas.DataFrame
        # yourself that follows the epytope TCR prediction format.
        # joining_list: columns on which TCRs and results are joined;
        # no duplicates are allowed with respect to these columns.
        joining_list = ["VDJ_cdr3", "Epitope"]
        results = results[joining_list + ['Score']]
        df_result = self.transform_output(results, tcrs, epitopes, pairwise, joining_list)
        return df_result

Now let's use our new predictor.

In [18]:
results_random = TCRSpecificityPredictorFactory("random").predict(tcr_repertoire, [epitope_1], pairwise=True)
results_random.head()

Unnamed: 0_level_0,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,TCR,"SYFPEITHI, HLA-A*02:01"
Unnamed: 0_level_1,celltype,organism,VDJ_chain_type,VDJ_cdr3,VDJ_v_gene,VDJ_d_gene,VDJ_j_gene,VJ_chain_type,VJ_cdr3,VJ_v_gene,VJ_j_gene,random
0,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,0.570602
1,alpha-beta T cell,HomoSapiens,TRB,CASSFEPGQGFYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVPSGAGSYQLTF,TRAV20*01,TRAJ28*01,0.780039
2,alpha-beta T cell,HomoSapiens,TRB,CASSYEPGQVSHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CAVKASGSRLT,TRAV2*01,,0.149553
3,alpha-beta T cell,HomoSapiens,TRB,CASSALASLNEQFF,TRBV14*01,,TRBJ2-1*01,TRA,CAYRPPGTYKYIF,TRAV38-2/DV8*01,TRAJ40*01,0.883517
4,alpha-beta T cell,HomoSapiens,TRB,CASSYLPGQGDHYSNQPQHF,TRBV13*01,,TRBJ1-5*01,TRA,CIVRAPGRADMRF,TRAV26-1*01,TRAJ43*01,0.570602


The predictor is now fully integrated and can be used in any context defined by epytope.
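
Defining the class is enough for the factory to find it because subclasses register themselves when they are created. The sketch below reproduces that idea in a few lines with `__init_subclass__`. It is a simplified stand-in: `PredictorBase`, `predictor_factory`, and `RandomSketch` are hypothetical names, and epytope's actual registration mechanism is not shown in this tutorial and may be implemented differently.

```python
class PredictorBase:
    """Hypothetical base class that records every named subclass."""
    registry = {}   # maps lower-cased name -> {version: class}
    name = None
    version = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass definition; abstract helpers without a name are skipped
        if cls.name is not None:
            PredictorBase.registry.setdefault(cls.name.lower(), {})[cls.version] = cls


def predictor_factory(name, version=None):
    """Look up a registered predictor by name and return an instance."""
    versions = PredictorBase.registry[name.lower()]
    if version is None:
        version = max(versions)  # plain string comparison; enough for a sketch
    return versions[version]()


class RandomSketch(PredictorBase):
    name = "random"
    version = "1.0.0"


predictor = predictor_factory("random")
print(type(predictor).__name__, predictor.version)
```

The registry is filled as a side effect of the `class` statement, so no separate registration call is needed after defining a new predictor.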