# How to add a new method to this benchmark.

In this notebook, we will explain how to test your own method on this benchmark.
You can either:
- implement a prediction function that follows the interface of the benchmark,
- or implement your method in ePytope-TCR and use the wrapper function provided here.

## 01. Prediction function

In [1]:
import numpy as np

def prediction_func(df_input, **kwargs):
    """
    This function serves as a wrapper between your model and the test.
    :param df_input: pandas.DataFrame with columns:
        - CDR3_alpha, CDR3_beta: CDR3 amino acid region of the alpha and beta chains starting with 'C' and ending with 'F' or 'W' (e.g. 'CASSMRSAVEQYF')
        - V_alpha, V_beta, J_alpha, J_beta: Categorical V,J-gene annotation (e.g. TRBJ2-7*01) 
        - Epitope: amino acid sequence of the epitope (e.g. 'GILGFVFTL')
        - MHC: Categorical annotation of the MHC type 
        - clone_id: identifier ID of this clonotype without any predictive meaning
        - dataset: name of the dataset without any predictive meaning
        - Label: 1 for binder, 0 for non-binder, DO NOT USE THIS LABEL IN ANY PART OF YOUR MODEL
       Note: while the DataFrame provides all these inputs, feel free to use only parts of them for your model 
    :param kwargs: Any additional parameters (e.g. different model choices) can be passed as keyword arguments here at your choosing, which can be passed through the config (see below)

    :return: pandas.DataFrame of columns:
        - the same as df_input
        - Score: numeric prediction score, where higher values indicate a higher probability of binding for this row (TCR-Epitope pair)
    """
    # Let's print the input to the function to get an example later
    print(df_input.columns)
    print(df_input.head(5))
    print(kwargs)

    # The output should follow the same format as the input but additionally contains the prediction score
    df_output = df_input.copy()

    # Here you need to plug in your model including preprocessing (e.g. reconstruct full TCR sequence, prune leading 'C' from CDR3, rename the columns, ...)
    # For the dummy model, we simply assign a random score without any relationship to the input
    np.random.seed(0)  # Seed your model for reproducibility
    df_output["Score"] = np.random.rand(len(df_output))

    return df_output
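
Before handing a prediction function to the benchmark, it can help to check its output against the contract described in the docstring. The helper below is only a sketch: `check_prediction_output` is a hypothetical name and not part of `tcr_benchmark`, and it assumes that the output keeps every input column and row and adds a numeric `Score` column.

```python
import numpy as np
import pandas as pd


def check_prediction_output(df_input, df_output):
    """Hypothetical helper (not part of tcr_benchmark): verify the output contract."""
    # All input columns must still be present in the output
    missing = [c for c in df_input.columns if c not in df_output.columns]
    if missing:
        raise ValueError(f"Output is missing input columns: {missing}")
    if "Score" not in df_output.columns:
        raise ValueError("Output needs a 'Score' column")
    # Assumption: one prediction per input row (TCR-epitope pair)
    if len(df_output) != len(df_input):
        raise ValueError("Output must contain one row per input row")
    if not pd.api.types.is_numeric_dtype(df_output["Score"]):
        raise ValueError("'Score' must be numeric")
    if df_output["Score"].isna().any():
        raise ValueError("'Score' must not contain missing values")
    return True


# Toy input with a subset of the benchmark columns
df_toy = pd.DataFrame({
    "CDR3_beta": ["CASSMRSAVEQYF", "CASSIRSLAEQYF"],
    "Epitope": ["GILGFVFTL", "GILGFVFTL"],
    "Label": [1, 1],
})
df_scored = df_toy.copy()
df_scored["Score"] = np.random.default_rng(0).random(len(df_scored))
print(check_prediction_output(df_toy, df_scored))
```

Raising an exception instead of returning `False` keeps the reason for a failed check visible when the helper is called inside a longer pipeline.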

In [2]:
# This name is needed to compile the results table later
predictor_name = 'my_dummy_predictor'

Let's test this function on one of the available datasets:
- viral
- mutation
- mutation-mouse

In [3]:
from tcr_benchmark.eval.allTests import NAME_2_TEST
from tcr_benchmark.pp.datasets import download_datasets

In [4]:
dataset = 'viral'
download_datasets(dataset)
test = NAME_2_TEST[dataset](None)

100%|██████████| 3/3 [00:00<00:00, 7367.04it/s]


In [5]:
config = {'dummyModelChoice': 'Use the super model'}
results_dataset = test.run_tests(prediction_func, predictor_name, config)

Index(['CDR3_alpha', 'CDR3_beta', 'V_alpha', 'V_beta', 'J_alpha', 'J_beta',
       'Epitope', 'clone_id', 'MHC', 'dataset', 'Label'],
      dtype='object')
          CDR3_alpha       CDR3_beta     V_alpha      V_beta    J_alpha  \
0    CAVNAPTGTASKLTF   CASSMRSAVEQYF  TRAV8-1*01   TRBV19*01  TRAJ44*01   
1       CAVEGSQGNLIF   CASSMRSAVEQYF    TRAV2*01   TRBV19*01  TRAJ42*01   
2    CAGWPGSSNTGKLIF   CASSIRSLAEQYF   TRAV25*01   TRBV19*01  TRAJ37*01   
3  CAVRDAILTGGGNKLTF  CASRRQGITETQYF    TRAV3*01   TRBV27*01  TRAJ10*01   
4         CATEDNDMRF     CASSFSDTQYF   TRAV17*01  TRBV5-4*01  TRAJ43*01   

       J_beta    Epitope  clone_id          MHC dataset  Label  
0  TRBJ2-7*01  GILGFVFTL      25.0  HLA-A*02:01   viral      1  
1  TRBJ2-7*01  GILGFVFTL      27.0  HLA-A*02:01   viral      1  
2  TRBJ2-7*01  GILGFVFTL      34.0  HLA-A*02:01   viral      1  
3  TRBJ2-5*01  GILGFVFTL      35.0  HLA-A*02:01   viral      1  
4  TRBJ2-3*01  GILGFVFTL      39.0  HLA-A*02:01   viral      0  
{'d

Here, we see the input data passed to our `prediction_func`: first the columns of the DataFrame, then its first five rows, and finally the configuration that was passed to the function as keyword arguments.
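
With this input format in mind, the preprocessing mentioned in the comments of `prediction_func` could look roughly like the sketch below. It is an illustration only: `toy_score` is a made-up placeholder rather than a real model, and the keyword `strip_anchor` is a hypothetical config option that the benchmark itself does not define.

```python
import pandas as pd


def toy_score(cdr3, epitope):
    """Placeholder model (hypothetical): fraction of CDR3 residues that also occur in the epitope."""
    if not cdr3:
        return 0.0
    return sum(aa in epitope for aa in cdr3) / len(cdr3)


def prediction_func_sketch(df_input, strip_anchor=True, **kwargs):
    df_output = df_input.copy()
    cdr3 = df_output["CDR3_beta"].fillna("")
    if strip_anchor:
        # Prune the leading 'C' for models that expect CDR3s without it
        cdr3 = cdr3.str.replace(r"^C", "", regex=True)
    # The Label column is deliberately never read
    df_output["Score"] = [toy_score(c, e) for c, e in zip(cdr3, df_output["Epitope"])]
    return df_output


# Two beta-chain CDR3s and the epitope from the printed example rows
df_example = pd.DataFrame({
    "CDR3_beta": ["CASSMRSAVEQYF", "CASSFSDTQYF"],
    "Epitope": ["GILGFVFTL", "GILGFVFTL"],
    "Label": [1, 0],
})
print(prediction_func_sketch(df_example, strip_anchor=True))
```

Assuming the config dictionary reaches the function as keyword arguments, as the printed `kwargs` suggests, such an option would be set through an entry like `{'strip_anchor': False}`.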

In [6]:
results_dataset.head()

Unnamed: 0,Method,Dataset,Group,Support,Metric_Type,Metric,Value
0,my_dummy_predictor,Viral,full_data,638,MPS,AverageRank,7.565831
1,my_dummy_predictor,Viral,full_data,638,MPS,R@1,0.051724
2,my_dummy_predictor,Viral,full_data,638,MPS,R@3,0.189655
3,my_dummy_predictor,Viral,full_data,638,MPS,R@5,0.354232
4,my_dummy_predictor,Viral,full_data,638,MPS,R@8,0.570533


In [7]:
results_dataset[(results_dataset['Metric']=='AUC')
                & (results_dataset['Group']=='full_data')    
            ]

Unnamed: 0,Method,Dataset,Group,Support,Metric_Type,Metric,Value
75,my_dummy_predictor,Viral,full_data,8932,TTP,AUC,0.493067


As we used random values for prediction, the AUC score is close to 0.5 as expected.
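
This behaviour is easy to reproduce in isolation. The sketch below computes the AUC with the rank-based (Mann-Whitney) formulation on synthetic labels. It does not use the benchmark's own metric code, and the label distribution is an arbitrary assumption.

```python
import numpy as np


def auc_from_scores(labels, scores):
    """AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # ranks start at 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # U counts the (positive, negative) pairs in which the positive scores higher
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)            # synthetic binder / non-binder labels
random_scores = rng.random(len(labels))             # scores unrelated to the labels
separating_scores = labels + 0.5 * rng.random(len(labels))  # every binder outscores every non-binder

auc_random = auc_from_scores(labels, random_scores)
auc_perfect = auc_from_scores(labels, separating_scores)
print(auc_random, auc_perfect)
```

Scores that carry no information land near 0.5, while scores that rank every binder above every non-binder reach 1.0.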

## 02. ePytope-TCR
A more detailed description of ePytope-TCR can be found in this tutorial: https://github.com/SchubertLab/epytope/blob/main/epytope/tutorials/TCRPrediction.ipynb

We will use the predictor class from this tutorial.

In [8]:
from epytope.TCRSpecificityPrediction import ACmdTCRSpecificityPrediction
from epytope.Core import Peptide, Allele, TCREpitope, ImmuneReceptorChain, ImmuneReceptor
import numpy as np
import pandas

class RandomTCRSpecificityPrediction(ACmdTCRSpecificityPrediction):
    __name = "random"
    __version = "1.0.0"
    __tcr_length = (0, 40)  # Admissible CDR3 length
    __epitope_length = (0, 40)  # Admissible epitope length
    __organism = "H"  # Allowed organisms of the TCRs (H=Human, M=Mouse, e.g. "HM" for both)

    @property
    def version(self):
        return self.__version

    @property
    def name(self):
        return self.__name

    @property
    def tcr_length(self):
        return self.__tcr_length

    @property
    def epitope_length(self):
        return self.__epitope_length
    
    @property
    def organism(self):
        return self.__organism
    
    def predict(self, tcrs, epitopes, pairwise=True, interpreter=None, conda=None, cmd_prefix=None, **kwargs):
        # For pairwise prediction, each unique epitope is only needed once
        if pairwise:
            epitopes = list(set(epitopes))
            
        # test whether the input is correct
        self.input_check(tcrs, epitopes, pairwise, **kwargs)
        
        # insert your own prediction here
        # We suggest this workflow:
        results = tcrs.to_pandas()  # convert the dataframe to pandas

        if pairwise:
            results = self.combine_tcrs_epitopes_pairwise(results, epitopes)  # Prebuilt function for pairwise prediction (formats the dataframe above to test each epitope against each TCR)
        else:
            results = self.combine_tcrs_epitopes_list(results, epitopes)  # Prebuilt function for prediction between a list of TCRs and a corresponding list of epitopes
        results = results.drop_duplicates(["VDJ_cdr3", "Epitope"])  # remove duplicates to avoid unnecessary computation

        # Potentially rename any column names that require a different name for your predictor
        results["Score"] = np.random.rand(len(results))  # assign a score to each row; here we use random predictions unrelated to the input
        
        # create a TCRspecificity result
        # alternatively you can create a pandas.DataFrame yourself, that follows the epytope-TCR-prediction format
        # list of columns on which tcrs and results should be joined. Note: no duplicates are allowed based on these columns. This typically boils down to the columns needed by your prediction model
        joining_list = ["VDJ_cdr3", "Epitope"]  
        results = results[joining_list + ['Score']]
        df_result = self.transform_output(results, tcrs, epitopes, pairwise, joining_list)  # this creates the right output format
        return df_result

We can use the wrapper provided by the benchmark suite to obtain the prediction function.

In [9]:
from tcr_benchmark.study.ePytopeWrapper import wrapp_predictor
predictor_name2 = 'random'
prediction_func2 = wrapp_predictor(predictor_name2)

In [10]:
test2 = NAME_2_TEST[dataset](None)
results_dataset2 = test2.run_tests(prediction_func2, predictor_name2, {})
results_dataset2.head(5)



Unnamed: 0,Method,Dataset,Group,Support,Metric_Type,Metric,Value
0,random,Viral,full_data,638,MPS,AverageRank,7.606583
1,random,Viral,full_data,638,MPS,R@1,0.068966
2,random,Viral,full_data,638,MPS,R@3,0.210031
3,random,Viral,full_data,638,MPS,R@5,0.346395
4,random,Viral,full_data,638,MPS,R@8,0.573668


In [11]:
results_dataset2[(results_dataset2['Metric']=='AUC')
                & (results_dataset2['Group']=='full_data')    
            ]

Unnamed: 0,Method,Dataset,Group,Support,Metric_Type,Metric,Value
75,random,Viral,full_data,8932,TTP,AUC,0.49194
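
Since every result table carries the predictor name in its `Method` column, tables from several predictors can be stacked and compared side by side. The following sketch rebuilds two tiny tables from the values printed above instead of calling the benchmark, so that it runs on its own. With real results, `results_dataset` and `results_dataset2` would presumably be concatenated in the same way.

```python
import pandas as pd

columns = ["Method", "Dataset", "Group", "Support", "Metric_Type", "Metric", "Value"]

# Two rows per predictor, copied from the outputs shown in this notebook
res_a = pd.DataFrame([
    ["my_dummy_predictor", "Viral", "full_data", 638, "MPS", "R@1", 0.051724],
    ["my_dummy_predictor", "Viral", "full_data", 8932, "TTP", "AUC", 0.493067],
], columns=columns)
res_b = pd.DataFrame([
    ["random", "Viral", "full_data", 638, "MPS", "R@1", 0.068966],
    ["random", "Viral", "full_data", 8932, "TTP", "AUC", 0.49194],
], columns=columns)

# Stack the tables, then spread the methods into columns for a side-by-side view
combined = pd.concat([res_a, res_b], ignore_index=True)
comparison = (
    combined[combined["Group"] == "full_data"]
    .pivot(index="Metric", columns="Method", values="Value")
)
print(comparison)
```

The pivot requires each (`Metric`, `Method`) pair to be unique, which is why the table is filtered to a single `Group` first.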
