# Article: [Predicting TCR-Epitope Binding Specificity Using Deep Metric Learning and Multimodal Learning](https://www.mdpi.com/2073-4425/12/4/572)

## **Objectives**

*1. Develop a Computational Model: The paper aims to create a convolutional neural network model that utilizes deep metric learning and multimodal learning techniques to predict interactions between T cell receptors (TCRs) and Major Histocompatibility Complex class I-peptide complexes (pMHC).*

*2. Simultaneous TCR-Epitope Binding Prediction: The paper seeks to perform two critical tasks in TCR-epitope binding prediction: identifying the TCRs that bind a given epitope from a TCR repertoire and identifying the binding epitope of a given TCR from a list of candidate epitopes. The goal is to achieve accurate predictions for both tasks simultaneously.*

*3. Gain Insights into Binding Specificity: The paper aims to provide insights into the factors that determine TCR-epitope binding specificity, including the identification of key amino acid sequence patterns and positions within the TCR that are important for binding specificity. Additionally, the paper challenges the assumption that physical proximity to epitopes is the sole determinant of TCR-epitope specificity.*
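
Once TCRs and epitopes are embedded in a shared space, both tasks in objective 2 reduce to comparing distances. The sketch below is illustrative only: the toy 2-D vectors, the Euclidean distance, and the helper names `rank_tcrs_for_epitope` and `best_epitope_for_tcr` are assumptions, not the paper's implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_tcrs_for_epitope(epitope_vec, tcr_vecs):
    """Task 1: order TCR names from closest to farthest from the epitope."""
    return sorted(tcr_vecs, key=lambda name: euclidean(tcr_vecs[name], epitope_vec))

def best_epitope_for_tcr(tcr_vec, epitope_vecs):
    """Task 2: pick the candidate epitope closest to the TCR."""
    return min(epitope_vecs, key=lambda name: euclidean(epitope_vecs[name], tcr_vec))

# Toy embeddings, made up for illustration.
tcr_vecs = {"tcr_a": [0.0, 0.1], "tcr_b": [2.0, 2.0], "tcr_c": [0.5, 0.0]}
epitope_vecs = {"epi_x": [0.0, 0.0], "epi_y": [2.0, 2.1]}

ranking = rank_tcrs_for_epitope(epitope_vecs["epi_x"], tcr_vecs)
match = best_epitope_for_tcr(tcr_vecs["tcr_b"], epitope_vecs)
print(ranking, match)
```

Under this reading a smaller distance means a more likely binder, so the same embedding serves both directions of the search.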

## Packages

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch


## One-hot encoding

In [2]:
positives = pd.read_csv("./positive.csv")
negatives = pd.read_csv("./negative.csv")

# Collect every distinct residue letter that appears in the positive pairs, in sorted order.
amino_acids = sorted({acid for a_sequence in positives.stack() for acid in a_sequence})

amino_acid_label_encoder = LabelEncoder()
amino_acid_label_encoder.fit(amino_acids)

all_amino_acids = amino_acid_label_encoder.transform(amino_acids)

def feature_map(p_sequence):
    # One (sequence_length, alphabet_size) one-hot matrix per sequence; lengths are not padded here.
    return [tf.one_hot(amino_acid_label_encoder.transform(list(x)), len(all_amino_acids)) for x in p_sequence]

data_cdr3 = feature_map(positives["cdr3"])
data_epitope = feature_map(positives["antigen.epitope"])



## Data Representation

<center>
    <img src="gene.jpg" alt="Figure 1">
</center>

### 2.2. CDR3B and Epitope Sequence Representation:



**Data Representation Goals:**

  1. Convert amino acid sequences from string format to a numeric representation.
  2. Develop a numerical procedure utilizing Atchley representation to capture physical and biochemical properties.
  3. Create matrices with specified dimensions through padding to accommodate varying sequence lengths.
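
The three goals above can be sketched as one small function: look up a numeric vector for each residue, stack the vectors as rows, and zero-pad to a fixed number of rows. The factor table below holds placeholder numbers, not the published Atchley factors, and `encode_padded`, `PLACEHOLDER_FACTORS`, and `max_len` are hypothetical names. A real table would supply five factors for each of the 20 amino acids.

```python
# Placeholder per-residue vectors (NOT real Atchley factors), five values each.
PLACEHOLDER_FACTORS = {
    "C": [0.1, 0.2, 0.3, 0.4, 0.5],
    "A": [1.0, 0.0, -1.0, 0.0, 1.0],
    "S": [-0.5, 0.5, -0.5, 0.5, -0.5],
}

def encode_padded(sequence, factors, max_len):
    """Return a max_len x n_factors matrix (list of rows) for one sequence."""
    n_factors = len(next(iter(factors.values())))
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    # One row of factors per residue, in sequence order.
    rows = [list(factors[residue]) for residue in sequence]
    # Zero rows fill the remaining positions so every matrix has the same shape.
    rows += [[0.0] * n_factors for _ in range(max_len - len(sequence))]
    return rows

matrix = encode_padded("CASS", PLACEHOLDER_FACTORS, max_len=6)
for row in matrix:
    print(row)
```

Padding with zero rows keeps every sequence at the same shape, which is what a convolutional input layer expects.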

### Sentence Construction with the Atchley Representation in both CDR3B and the Epitope
Fixed: Instead of constructing the sentences manually, I now build them with the `BertTokenizer`, using the vocabulary in `amino_acid_vocab.txt`.
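
To make the tokenization step concrete, the sketch below imitates in plain Python what a BERT-style tokenizer does with a space-separated residue string: add start and end tokens, map tokens to ids, pad to `max_length`, and build an attention mask. The vocabulary and its ids are invented for illustration, and `encode` is a hypothetical helper; the real ids come from the vocabulary file.

```python
# Hypothetical vocabulary; real ids depend on the vocabulary file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "C": 4, "A": 5, "S": 6}

def encode(sentence, vocab, max_length):
    """Return (input_ids, attention_mask), both of length max_length."""
    # Reserve two positions for [CLS] and [SEP]; truncate the residues if needed.
    tokens = ["[CLS]"] + sentence.split()[: max_length - 2] + ["[SEP]"]
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    # Padding positions get the [PAD] id and a mask value of 0.
    return ids + [vocab["[PAD]"]] * n_pad, mask + [0] * n_pad

ids, mask = encode("C A S S", VOCAB, max_length=8)
print(ids)   # [2, 4, 5, 6, 6, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```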

In [3]:
def convert_to_space_separated_string(sequence):
    # "CASS" -> "C A S S", so that the tokenizer treats each residue as one token.
    return ' '.join(sequence)

tokenizer = BertTokenizer(vocab_file="./amino_acid_vocab.txt")

def construct_sentences(dataframe):
    cdr3_sentences = dataframe["cdr3"]
    epitope_sentences = dataframe["antigen.epitope"]
    return cdr3_sentences, epitope_sentences

def pad_sentences(sentences, max_length):
    input_ids = []
    attention_masks = []

    for sentence in sentences:
        encoded_dict = tokenizer.encode_plus(
            sentence,
            add_special_tokens=True,
            max_length=max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            return_attention_mask=True
        )
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    return torch.stack(input_ids), torch.stack(attention_masks)

max_length = 32

positives = pd.read_csv("./positive.csv")

for column in positives.columns:
    positives[column] = positives[column].apply(convert_to_space_separated_string)

cdr3_sentences, epitope_sentences = construct_sentences(positives)

cdr3_input_ids, cdr3_attention_masks = pad_sentences(cdr3_sentences, max_length)
epitope_input_ids, epitope_attention_masks = pad_sentences(epitope_sentences, max_length)

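# Each tensor has shape (N, 1, max_length); concatenating on dim 1 gives (N, 2, max_length),
# where index 0 holds the input ids and index 1 the attention mask.
cdr3_combined = torch.cat((cdr3_input_ids, cdr3_attention_masks), dim=1)
epitope_combined = torch.cat((epitope_input_ids, epitope_attention_masks), dim=1)

# The same random_state on both splits keeps each CDR3 row aligned with its epitope row.
cdr3_train_data, cdr3_test_data = train_test_split(cdr3_combined, test_size=0.2, random_state=42)
epitope_train_data, epitope_test_data = train_test_split(epitope_combined, test_size=0.2, random_state=42)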
print(cdr3_input_ids)
print(epitope_input_ids)

tensor([[[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        ...,

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]]])
tensor([[[ 2,  8, 10,  ...,  0,  0,  0]],

        [[ 2,  8, 10,  ...,  0,  0,  0]],

        [[ 2,  8, 10,  ...,  0,  0,  0]],

        ...,

        [[ 2, 10, 10,  ...,  0,  0,  0]],

        [[ 2, 10, 10,  ...,  0,  0,  0]],

        [[ 2, 10, 10,  ...,  0,  0,  0]]])


### Procedure for Encoding CDR3B and Epitope Amino Acid Sequences as Numerical Matrices
#### Taken from Matt's code.

In [14]:
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_json_file("./bert_config.json")
model = BertForMaskedLM(config=config)

device = "cuda" if torch.cuda.is_available() else "cpu"

cdr3_train_data = cdr3_train_data.to(device)
cdr3_test_data = cdr3_test_data.to(device)
epitope_train_data = epitope_train_data.to(device)
epitope_test_data = epitope_test_data.to(device)

model.to(device)

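# Sanity check on two samples: index 0 holds the input ids, index 1 the attention mask.
model(input_ids=cdr3_train_data[:2, 0], attention_mask=cdr3_train_data[:2, 1])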

MaskedLMOutput(loss=None, logits=tensor([[[ 0.0000,  0.1290, -0.1712,  ..., -0.1345, -0.0106, -0.0524],
         [ 0.0000, -0.0574,  0.1205,  ..., -0.1356, -0.1956,  0.0307],
         [ 0.0000,  0.0640, -0.0882,  ..., -0.1235,  0.0558, -0.1000],
         ...,
         [ 0.0000,  0.1143,  0.0363,  ..., -0.1049, -0.1265,  0.0897],
         [ 0.0000,  0.1627,  0.0230,  ...,  0.0143, -0.0678,  0.0448],
         [ 0.0000,  0.1746, -0.0648,  ..., -0.0725,  0.0789,  0.0503]],

        [[ 0.0000,  0.0524, -0.2114,  ..., -0.1261, -0.0148, -0.0675],
         [ 0.0000, -0.0600,  0.1611,  ..., -0.1093, -0.1943,  0.0145],
         [ 0.0000, -0.0087, -0.0788,  ..., -0.1019,  0.0794, -0.1122],
         ...,
         [ 0.0000,  0.0559,  0.0750,  ..., -0.1248, -0.2235,  0.0278],
         [ 0.0000,  0.1855, -0.0010,  ..., -0.0070, -0.0348,  0.0621],
         [ 0.0000,  0.1914, -0.0860,  ..., -0.1114,  0.0447, -0.0004]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [17]:
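def train_model(model, train_data, epochs=1, lr=1e-5):
    # Create the optimizer inside the function so it is built from the model that is passed in.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    input_ids, attention_mask = train_data[:, 0], train_data[:, 1]
    # Use the input ids as labels and ignore padded positions (-100) in the loss.
    # No tokens are masked here, so this only checks that the loss and the update step run.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # scalar, because labels were supplied
        loss.backward()
        optimizer.step()
        print(loss.item())

train_model(model, cdr3_train_data)

Without `labels`, `outputs[0]` is the logits tensor rather than a scalar loss, so `loss.backward()` fails with `RuntimeError: grad can be implicitly created only for scalar outputs`.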
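
A masked-language-model objective usually hides a random subset of real tokens and scores the model only at those positions. The sketch below is framework-free and rests on several assumptions: the `[MASK]` id of 4, the special-token ids, the 15% masking rate, and the helper name `mask_tokens` are all illustrative. The label value `-100` is the one that PyTorch's cross-entropy loss ignores by default.

```python
import random

def mask_tokens(input_ids, special_ids, mask_id, rate=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    # Only real residues are eligible; [CLS], [SEP] and [PAD] are never masked.
    candidates = [i for i, tok in enumerate(input_ids) if tok not in special_ids]
    n_mask = max(1, round(rate * len(candidates))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    # Replace the chosen positions with the mask id in the model input.
    masked = [mask_id if i in chosen else tok for i, tok in enumerate(input_ids)]
    # Keep the original id as the target at masked positions, -100 everywhere else.
    labels = [tok if i in chosen else -100 for i, tok in enumerate(input_ids)]
    return masked, labels

ids = [2, 5, 6, 7, 8, 9, 3, 0, 0]  # made-up ids: 2 = start, 3 = end, 0 = padding
masked, labels = mask_tokens(ids, special_ids={0, 2, 3}, mask_id=4)
print(masked)
print(labels)
```

Lists like `masked` and `labels` would be converted to tensors and passed to the model as `input_ids` and `labels`.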