# Article: [Predicting TCR-Epitope Binding Specificity Using Deep Metric Learning and Multimodal Learning](https://www.mdpi.com/2073-4425/12/4/572)

## **Objectives**

#### *1. Develop a Computational Model: The paper aims to create a convolutional neural network model that utilizes deep metric learning and multimodal learning techniques to predict interactions between T cell receptors (TCRs) and Major Histocompatibility Complex class I-peptide complexes (pMHC).*

#### *2. Simultaneous TCR-Epitope Binding Prediction: The paper seeks to perform two critical tasks in TCR-epitope binding prediction: identifying the TCRs that bind a given epitope from a TCR repertoire and identifying the binding epitope of a given TCR from a list of candidate epitopes. The goal is to achieve accurate predictions for both tasks simultaneously.*

#### *3. Gain Insights into Binding Specificity: The paper aims to provide insights into the factors that determine TCR-epitope binding specificity, including the identification of key amino acid sequence patterns and positions within the TCR that are important for binding specificity. Additionally, the paper challenges the assumption that physical proximity to epitopes is the sole determinant of TCR-epitope specificity.*
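
The two prediction tasks in objective 2 can be pictured as nearest-neighbour lookups. This is only a minimal sketch, assuming (the notebook does not confirm this) that TCRs and epitopes are embedded in one shared space where a smaller distance suggests more likely binding. The names `rank_by_distance`, `tcr_embeddings` and `epitope_embeddings`, and all coordinates, are invented for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_by_distance(query, candidates):
    """Return candidate names sorted from nearest to farthest from the query."""
    return sorted(candidates, key=lambda name: euclidean(query, candidates[name]))

# Hypothetical 2-D embeddings, invented purely for illustration.
tcr_embeddings = {"tcr_a": [0.1, 0.9], "tcr_b": [0.8, 0.2], "tcr_c": [0.2, 0.7]}
epitope_embeddings = {"epitope_x": [0.0, 1.0], "epitope_y": [1.0, 0.0]}

# Task 1: rank the TCR repertoire for a given epitope.
tcrs_for_x = rank_by_distance(epitope_embeddings["epitope_x"], tcr_embeddings)

# Task 2: rank the candidate epitopes for a given TCR.
epitopes_for_b = rank_by_distance(tcr_embeddings["tcr_b"], epitope_embeddings)

print(tcrs_for_x)
print(epitopes_for_b)
```

Both tasks reuse the same ranking function; only the roles of query and candidate set are swapped.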

## Packages

In [18]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch

## One-hot encoding

In [19]:
positives = pd.read_csv("./positive.csv")
negatives = pd.read_csv("./negative.csv")

# Collect the alphabet: every distinct character found in any cell of positives, sorted.
amino_acids = sorted(set("".join(positives.stack())))

amino_acid_label_encoder = LabelEncoder()
amino_acid_label_encoder.fit(amino_acids)

all_amino_acids = amino_acid_label_encoder.transform(amino_acids)

def feature_map(p_sequence):
    # One-hot encode each sequence into a (sequence_length, alphabet_size) tensor.
    return [tf.one_hot(amino_acid_label_encoder.transform(list(x)), len(all_amino_acids)) for x in p_sequence]

data_cdr3 = feature_map(positives["cdr3"])
data_epitope = feature_map(positives["antigen.epitope"])

## Data Representation

<center>
    <img src="gene.jpg" alt="Figure 1">
</center>

### 2.2. CDR3B and Epitope Sequence Representation:



**Data Representation Goals:**

  1. Convert amino acid sequences from string format to a numeric representation.
  2. Develop a numerical procedure utilizing Atchley representation to capture physical and biochemical properties.
  3. Create matrices with specified dimensions through padding to accommodate varying sequence lengths.
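
The three goals above can be sketched as one small function: look up a feature vector per residue, then zero-pad the rows to a fixed height. This is an illustrative sketch, not the notebook's procedure. `encode_and_pad` and `toy_table` are hypothetical names, and the numbers in `toy_table` are placeholders, NOT real Atchley factors (the real representation has five factors per amino acid).

```python
def encode_and_pad(sequence, feature_table, max_len):
    """Map each residue to its feature vector, then zero-pad to max_len rows."""
    n_features = len(next(iter(feature_table.values())))
    # Truncate sequences longer than max_len, then look up one row per residue.
    rows = [list(feature_table[residue]) for residue in sequence[:max_len]]
    # Append all-zero rows until the matrix has exactly max_len rows.
    rows += [[0.0] * n_features for _ in range(max_len - len(rows))]
    return rows

# PLACEHOLDER feature values for illustration only; not real Atchley factors.
toy_table = {"C": [1.0, 0.0], "A": [0.0, 1.0], "S": [0.5, 0.5]}

matrix = encode_and_pad("CAS", toy_table, max_len=5)
for row in matrix:
    print(row)
```

Every sequence then yields a matrix of shape `(max_len, n_features)`, whatever its original length.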

### Sentence Construction with the Atchley Representation in both CDR3B and the Epitope
Fixed: Instead of constructing the sentences manually, I now construct them with the `BertTokenizer`, using the vocabulary in `amino_acid_vocab.txt`.
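
To make the tokenizer's role concrete, here is a rough pure-Python imitation of BERT-style encoding: wrap the residues in `[CLS]` and `[SEP]`, map tokens to ids, and right-pad. The `vocab` dictionary and the `encode` helper are hypothetical. The real ids come from `amino_acid_vocab.txt`, whose contents are not shown here, so the ids below are assumptions.

```python
# Hypothetical vocabulary; the real one lives in amino_acid_vocab.txt.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "C": 4, "A": 5, "S": 6}

def encode(sequence, vocab, max_length):
    """Imitate BERT-style encoding: [CLS] + residues + [SEP], truncated and right-padded."""
    tokens = " ".join(sequence).split()      # "CAS" -> ["C", "A", "S"]
    tokens = tokens[: max_length - 2]        # leave room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    ids += [vocab["[SEP]"]]
    # Real tokens get mask 1, padding positions get mask 0.
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [vocab["[PAD]"]] * (max_length - len(ids))
    return ids, attention_mask

ids, mask = encode("CAS", vocab, max_length=8)
print(ids)
print(mask)
```

The attention mask lets a downstream model ignore padded positions, which is why it is returned alongside the ids.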

### Procedure for Encoding CDR3B and Epitope Amino Acid Sequences as Numerical Matrices

In [20]:
def convert_to_space_separated_string(sequence):
    # Put a space between residues ("CASS" -> "C A S S") so each amino acid becomes its own token.
    return ' '.join(sequence)

tokenizer = BertTokenizer(vocab_file="./amino_acid_vocab.txt")

def construct_sentences(dataframe):
    cdr3_sentences = dataframe["cdr3"]
    epitope_sentences = dataframe["antigen.epitope"]
    return cdr3_sentences, epitope_sentences

def pad_sentences(sentences, max_length):
    input_ids = []
    attention_masks = []

    for sentence in sentences:
        encoded_dict = tokenizer.encode_plus(
            sentence,
            add_special_tokens=True,
            max_length=max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            return_attention_mask=True
        )
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    # Each encoded tensor has shape (1, max_length), so stacking yields (N, 1, max_length).
    return torch.stack(input_ids), torch.stack(attention_masks)

max_length = 32

positives = pd.read_csv("./positive.csv")

for column in positives.columns:
    positives[column] = positives[column].apply(convert_to_space_separated_string)

cdr3_sentences, epitope_sentences = construct_sentences(positives)

cdr3_input_ids, cdr3_attention_masks = pad_sentences(cdr3_sentences, max_length)
epitope_input_ids, epitope_attention_masks = pad_sentences(epitope_sentences, max_length)

cdr3_combined = torch.cat((cdr3_input_ids, cdr3_attention_masks), dim=1)
epitope_combined = torch.cat((epitope_input_ids, epitope_attention_masks), dim=1)

# Both splits use the same random_state, so CDR3 and epitope rows stay paired after shuffling.
cdr3_train_data, cdr3_test_data = train_test_split(cdr3_combined, test_size=0.2, random_state=42)
epitope_train_data, epitope_test_data = train_test_split(epitope_combined, test_size=0.2, random_state=42)
print(cdr3_input_ids)
print(epitope_input_ids)

tensor([[[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        ...,

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]],

        [[2, 5, 6,  ..., 0, 0, 0]]])
tensor([[[ 2,  8, 10,  ...,  0,  0,  0]],

        [[ 2,  8, 10,  ...,  0,  0,  0]],

        [[ 2,  8, 10,  ...,  0,  0,  0]],

        ...,

        [[ 2, 10, 10,  ...,  0,  0,  0]],

        [[ 2, 10, 10,  ...,  0,  0,  0]],

        [[ 2, 10, 10,  ...,  0,  0,  0]]])


### Initialization of Training 

In [21]:
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_json_file("./bert_config.json")

model = BertForMaskedLM(config=config) # change this line and flag what we're using

device = "cuda" if torch.cuda.is_available() else "cpu"

cdr3_train_data = cdr3_train_data.to(device)
cdr3_test_data = cdr3_test_data.to(device)
epitope_train_data = epitope_train_data.to(device)
epitope_test_data = epitope_test_data.to(device)

model.to(device)

Generate config GenerationConfig {
  "pad_token_id": 0
}



BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(25, 32, padding_idx=0)
      (position_embeddings): Embedding(128, 32)
      (token_type_embeddings): Embedding(2, 32)
      (LayerNorm): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=32, out_features=32, bias=True)
              (key): Linear(in_features=32, out_features=32, bias=True)
              (value): Linear(in_features=32, out_features=32, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=32, out_features=32, bias=True)
              (LayerNorm): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
      

### Triplet Loss Function:
Article: [PyTorch Metric Learning](https://kevinmusgrave.github.io/pytorch-metric-learning/#:~:text=This%20customized%20triplet%20loss%20has,than%200.3%20will%20be%20discarded.)

In [22]:
from pytorch_metric_learning.distances import CosineSimilarity
from pytorch_metric_learning.reducers import ThresholdReducer
from pytorch_metric_learning.regularizers import LpRegularizer
from pytorch_metric_learning import losses

In [23]:
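# Customized triplet loss from the PyTorch Metric Learning overview: cosine similarity
# instead of Euclidean distance, triplet losses above 0.3 discarded, embeddings L2-regularized.
loss_func = losses.TripletMarginLoss(
    distance=CosineSimilarity(),
    reducer=ThresholdReducer(high=0.3),
    embedding_regularizer=LpRegularizer(),
)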
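
To show what this loss configuration computes, the following pure-Python sketch mimics its core idea: a hinge on cosine similarities, followed by averaging only the losses below a threshold. It is an approximation for intuition, not the library's implementation. The helper names, the `margin=0.05` value and the toy vectors are assumptions made for illustration, and the embedding regularizer is left out.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_loss(anchor, positive, negative, margin=0.05):
    """Penalize triplets where the negative is not at least `margin` less similar than the positive."""
    violation = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive) + margin
    return max(0.0, violation)

def threshold_mean(losses, high=0.3):
    """Average only the losses below `high`; return 0.0 if none remain."""
    kept = [loss for loss in losses if loss < high]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical 2-D embeddings, invented for illustration.
anchor = [1.0, 0.0]
easy = triplet_loss(anchor, positive=[1.0, 0.0], negative=[0.0, 1.0])        # well separated
borderline = triplet_loss(anchor, positive=[1.0, 0.0], negative=[1.0, 0.0])  # equally similar
hard = triplet_loss(anchor, positive=[0.0, 1.0], negative=[1.0, 0.0])        # roles reversed

batch_loss = threshold_mean([easy, borderline, hard])
print(easy, borderline, hard, batch_loss)
```

In this toy batch the hard triplet exceeds the 0.3 threshold and is dropped, so only the easy and borderline triplets contribute to the mean.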

#### Analyze the embedding between the CDR3 and the Epitope.

#### 1. BertForPreTraining


In [27]:
import torch
import pandas as pd
from transformers import BertConfig, BertTokenizer, BertForMaskedLM, LineByLineTextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Define your model configuration

model_config = BertConfig(
    vocab_size=30522,                # Common BERT vocab size; the bert_config.json model printed earlier uses 25
    hidden_size=768,                # Standard BERT hidden size
    num_hidden_layers=12,           # Standard BERT number of hidden layers
    num_attention_heads=12,         # Standard BERT number of attention heads
    intermediate_size=3072,         # Standard BERT intermediate size
    max_position_embeddings=512,    # Maximum position embeddings in BERT-base
)


# Create a custom BERT model
model = BertForMaskedLM(config=model_config)

# Load and preprocess your text data from a CSV file
data = pd.read_csv("./positive.csv")
text_data = data["antigen.epitope"].tolist()

# Join the text data into a single string, separated by newlines
train_data = "\n".join(text_data)


# Tokenize the text data (input_ids and labels are not passed to the Trainer below)
input_ids = tokenizer.encode(train_data, add_special_tokens=True, return_tensors="pt")
labels = input_ids.clone()

# Use the "amino_acid_vocab.txt" file as a placeholder for your training data
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./amino_acid_vocab.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./bert-pretraining",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()


Generate config GenerationConfig {
  "pad_token_id": 0
}

Creating features from dataset file at ./amino_acid_vocab.txt
Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 12736
  Num Epochs = 100
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2
  Number of trainable parameters = 109,514,298
1

{'train_runtime': 2.194, 'train_samples_per_second': 11.395, 'train_steps_per_second': 0.912, 'train_loss': 0.6165385246276855, 'epoch': 1.0}





TrainOutput(global_step=2, training_loss=0.6165385246276855, metrics={'train_runtime': 2.194, 'train_samples_per_second': 11.395, 'train_steps_per_second': 0.912, 'train_loss': 0.6165385246276855, 'epoch': 1.0})

#### 2. BertForMaskedLM


#### 3. BertForNextSentencePrediction