# Antibody Design - test

These are the rules to complete this test:

- In this test, you will be asked to perform some analyses on antibody sequences using Python, and return the results as a Jupyter notebook in a Git repository.
- The test should not take more than a couple of hours to complete. You can use any tools on the internet, but you can't receive any external help.
- You should describe the protocol you used to obtain the results whenever it involves anything outside the Jupyter notebook itself (e.g. databases, web tools, etc.).
- The results of the test will be discussed in the 2nd interview.

The sequence of the antibody we will analyse is the following:



>VH
QVQLKESGPGLVAPSQSLSITCTVSGFPLTAYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLQTDDTARYYCARDPYGSKPMDYWGQGTSVTVSS
>VL
DIVMSQSPSSLVVSVGEKVTMSCKSSQSLLYSSNQKNFLAWYQQKPGQSPKLLIYWASTRESGVPDRFTGSGSGTDFTLTISSVKAEDLAVYYCQQYFRYRTFGGGTKLEIKRA

__Q1__ Install the package ANARCI and use it to retrieve the following information:
- germlines
- CDR sequences
- SHMs (somatic hypermutations)

In [246]:
# Creating csv for VH, restricted to heavy chains
!ANARCI -i QVQLKESGPGLVAPSQSLSITCTVSGFPLTAYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLQTDDTARYYCARDPYGSKPMDYWGQGTSVTVSS --assign_germline --outfile VH --csv --scheme imgt --restrict H

# Creating csv for VL; this is a kappa chain, so restrict to light chains (kappa or lambda) rather than lambda only
!ANARCI -i DIVMSQSPSSLVVSVGEKVTMSCKSSQSLLYSSNQKNFLAWYQQKPGQSPKLLIYWASTRESGVPDRFTGSGSGTDFTLTISSVKAEDLAVYYCQQYFRYRTFGGGTKLEIKRA --assign_germline --outfile VL --csv --scheme imgt --restrict light

In [170]:
import pandas as pd
# Loading the CSVs saved by ANARCI, which appends the chain type (_H, _KL) to the --outfile name
df_VH = pd.read_csv("VH_H.csv")
df_VL = pd.read_csv("VL_KL.csv")

In [171]:
# Keep only the IMGT-numbered residue columns (the first 13 columns hold metadata)
aligned_VH = df_VH.iloc[:,13:]
aligned_VL = df_VL.iloc[:,13:]
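
The positional slices used in the following cells only work when the numbered columns contain no insertion positions (such as `111A`). Selecting columns by their IMGT label is more defensive. The following is a minimal sketch, assuming the ANARCI CSV labels its numbering columns `1`, `2`, ... with insertions written like `111A`; `imgt_region` is a hypothetical helper, not part of ANARCI, and the toy data is invented for illustration.

```python
import re

def imgt_region(numbered, start, end, keep_gaps=False):
    """Join the residues whose IMGT position lies in [start, end].

    `numbered` maps column labels such as "27" or "111A" to one-letter
    residues or "-" (for example one CSV row converted to a dict). The
    mapping must preserve column order, as dicts do in Python 3.7+.
    """
    residues = []
    for label, residue in numbered.items():
        match = re.fullmatch(r"(\d+)([A-Z]?)", str(label))
        if match is None:
            continue  # skip metadata columns such as "v_gene"
        if start <= int(match.group(1)) <= end:
            if keep_gaps or residue != "-":
                residues.append(residue)
    return "".join(residues)

# Toy row covering part of IMGT 104-118, with one insertion column (111A)
toy_labels = ["v_gene", "104", "105", "106", "111", "111A", "112", "117", "118"]
toy_values = ["IGHV-toy", "C", "A", "R", "-", "G", "Y", "Y", "W"]
toy_row = dict(zip(toy_labels, toy_values))
toy_cdr3 = imgt_region(toy_row, 105, 117)
```

With the real data, a call such as `imgt_region(df_VH.iloc[0].to_dict(), 105, 117)` would be expected to return the ungapped heavy-chain CDR3.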

In [172]:
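# Providing overview of important information from the VH chain
# The 0-based column slices below map to IMGT positions 27-38 (CDR1), 56-65 (CDR2) and 105-117 (CDR3);
# they are only valid when the alignment has no insertion columns
VH_info = dict(
    v_gene=df_VH.v_gene.values[0],
    j_gene=df_VH.j_gene.values[0],
    CDR1="".join(aligned_VH.iloc[:,26:38].values.tolist()[0]),
    CDR2="".join(aligned_VH.iloc[:,55:65].values.tolist()[0]),
    CDR3="".join(aligned_VH.iloc[:,104:117].values.tolist()[0])
)
VH_info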

{'v_gene': 'IGHV2-6-7*01',
 'j_gene': 'IGHJ4*01',
 'CDR1': 'GFPL----TAYG',
 'CDR2': 'IWGD---GNT',
 'CDR3': 'ARDPYG-SKPMDY'}

In [174]:
# Providing overview of important information from the VL chain (same IMGT column slices as for the VH chain)
VL_info = dict(
    v_gene=df_VL.v_gene.values[0],
    j_gene=df_VL.j_gene.values[0],
    CDR1="".join(aligned_VL.iloc[:,26:38].values.tolist()[0]),
    CDR2="".join(aligned_VL.iloc[:,55:65].values.tolist()[0]),
    CDR3="".join(aligned_VL.iloc[:,104:117].values.tolist()[0])
)
VL_info

{'v_gene': 'IGKV8-30*01',
 'j_gene': 'IGKJ1*01',
 'CDR1': 'QSLLYSSNQKNF',
 'CDR2': 'WA-------S',
 'CDR3': 'QQYF-----RYRT'}
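
Q1 also asks for SHMs, which the cells above do not report. One way to approximate them is to compare the aligned V region with the assigned germline, residue by residue. The sketch below assumes the germline amino-acid sequence has been obtained separately in the same IMGT alignment (for example from IMGT/GENE-DB); `list_shms` is a hypothetical helper and the two strings are toy data, not the real germline. Note that an amino-acid comparison cannot see silent nucleotide mutations.

```python
def list_shms(query, germline, last_position=104):
    """List differences between an IMGT-aligned query and its germline V gene.

    Both strings must share the same alignment, with index i holding IMGT
    position i + 1 (no insertion columns). Positions after `last_position`
    are ignored, because CDR3 is shaped by V(D)J junctional diversity and
    is not encoded by the V gene alone.
    Returns tuples of (IMGT position, germline residue, query residue).
    """
    shms = []
    for index, (q, g) in enumerate(zip(query, germline)):
        position = index + 1
        if position > last_position:
            break
        # Gap columns are skipped: they reflect length differences, not substitutions
        if q != g and q != "-" and g != "-":
            shms.append((position, g, q))
    return shms

# Toy example: two substitutions, at IMGT positions 5 and 14
toy_query = "QVQLKESGP-GLVA"
toy_germline = "QVQLQESGP-GLVK"
toy_shms = list_shms(toy_query, toy_germline)
```

The same function could be applied to the joined `aligned_VH` and `aligned_VL` rows once the matching germline alignments are available.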

__Q2__ Write a function that returns all the occurrences of the motifs NG, DG, NS, and DS that overlap at least partially with any of the CDRs.

In [175]:
def occurences(sequence):
    """
    Count the NG, DG, NS and DS motifs that overlap, at least partially, with any CDR.

    Assumes `sequence` is an IMGT-aligned string (gaps as "-") without insertion
    columns, so that string index i holds IMGT position i + 1.
    """
    motifs = ["NG", "DG", "NS", "DS"]  # motifs to look up
    motif_counts = {motif: 0 for motif in motifs}  # one counter per motif
    # Each slice adds one flanking residue on either side of the CDR (IMGT 27-38,
    # 56-65, 105-117) so that motifs straddling a CDR boundary are counted too
    cdr_slices = [slice(25, 39), slice(54, 66), slice(103, 118)]
    for cdr_slice in cdr_slices:
        # Remove alignment gaps: residues separated by a gap are adjacent in the real chain
        region = sequence[cdr_slice].replace("-", "")
        for i in range(len(region) - 1):
            motif = region[i:i + 2]
            if motif in motif_counts:
                motif_counts[motif] += 1
    return motif_counts

In [176]:
occurences("".join(aligned_VL.values.tolist()[0]))

{'NG': 0, 'DG': 0, 'NS': 0, 'DS': 0}

In [247]:
occurences("".join(aligned_VH.values.tolist()[0]))

{'NG': 0, 'DG': 1, 'NS': 0, 'DS': 0}

One motif overlapping a CDR was found: DG in the CDR2 of the heavy chain (IWGD---GNT, i.e. IWGDGNT once the alignment gaps are removed). No such motif was found in the light chain.
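
Q2 asks for the occurrences themselves, so a variant that reports where each motif sits may be more useful than counts alone. The sketch below is one possible way to do this, under the same assumption of an IMGT-aligned string without insertion columns; `motif_hits` is a hypothetical helper and the sequence is toy data built for illustration.

```python
def motif_hits(aligned, motifs=("NG", "DG", "NS", "DS")):
    """Return (motif, CDR name, IMGT position of the first residue) per hit.

    `aligned` is an IMGT-aligned string with gaps as "-" and no insertion
    columns, so index i holds IMGT position i + 1.
    """
    cdrs = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}
    # Drop gaps but remember the IMGT position of every remaining residue
    residues = [(i + 1, aa) for i, aa in enumerate(aligned) if aa != "-"]
    hits = []
    for (pos_a, aa_a), (pos_b, aa_b) in zip(residues, residues[1:]):
        motif = aa_a + aa_b
        if motif not in motifs:
            continue
        for name, (start, end) in cdrs.items():
            # Partial overlap: at least one of the two residues lies inside the CDR
            if start <= pos_a <= end or start <= pos_b <= end:
                hits.append((motif, name, pos_a))
    return hits

# Toy alignment: NS straddling the end of CDR1, and DG split by a gap in CDR2
toy = list("A" * 128)
toy[37], toy[38] = "N", "S"      # IMGT 38 and 39
toy[58] = "D"                    # IMGT 59
toy[59:62] = list("---")         # IMGT 60-62 are gaps
toy[62] = "G"                    # IMGT 63
toy_hits = motif_hits("".join(toy))
```

Reporting the position makes it easier to decide, for each hit, whether the residue is likely to contact the antigen.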

__Q3__ These sites are potential liabilities in the antibody. Using online tools and your own intuition, describe which of these occurrences might be problematic, which residues you would suggest mutating to eliminate such liabilities while minimizing the risk of affecting the antibody binding affinity, and why.

If these motifs occur in the CDRs, they can affect binding. Because the CDRs are the regions that interact with the antigen, changing a motif there is likely to reduce affinity. I would therefore suggest mutating residues in the framework regions, to minimize the risk of affecting the antibody binding affinity.

__Q4__ Using the online tool ABodyBuilder, build a 3D structure for this antibody. On the structure, using PyMOL or any Python library of your choice (e.g. Bio.PDB), identify the minimal distance between any atom of any residue in the CDR3 of the heavy and light chains. You will need to report the residue in the CDR3 of the heavy chain and the atom, and the residue in the CDR3 of the light chain and the atom.

In [239]:
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=True)
structure_id = "ABDesign_case_rank1_imgt_scheme"
filename = "ABDesign_case_rank1_imgt_scheme.pdb"
model = parser.get_structure(structure_id, filename)[0]
chainH = model["H"]
chainL = model["L"]

In [245]:
# Distance between the CA atom of heavy-chain residue 114 and the CA atom of light-chain residue 107
# (a single hand-picked pair, not a search over all CDR3 atoms)
chainH[114]["CA"] - chainL[107]["CA"]

7.928552
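
Q4 asks for the minimum over all atom pairs, whereas the cell above measures one CA-CA pair. An exhaustive search can be sketched as follows. It is written against the Bio.PDB object interface (iterating a chain yields residues, iterating a residue yields atoms, `get_id()` returns a `(hetero flag, number, insertion code)` tuple, atoms expose `.coord`), and it assumes the model is IMGT-numbered so that CDR3 spans positions 105 to 117 in both chains. Both function names are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math

def residues_in_range(chain, start=105, end=117):
    """Residues numbered start..end (IMGT CDR3 by default), insertions included."""
    return [res for res in chain if start <= res.get_id()[1] <= end]

def closest_atom_pair(residues_a, residues_b):
    """Return (distance, residue_a, atom_a, residue_b, atom_b) for the closest atoms."""
    best = None
    for res_a in residues_a:
        for atom_a in res_a:
            for res_b in residues_b:
                for atom_b in res_b:
                    distance = math.dist(atom_a.coord, atom_b.coord)
                    if best is None or distance < best[0]:
                        best = (distance, res_a, atom_a, res_b, atom_b)
    return best

# Possible usage with the chains loaded above (not executed here):
# d, res_h, atom_h, res_l, atom_l = closest_atom_pair(
#     residues_in_range(chainH), residues_in_range(chainL))
# print(res_h.get_resname(), res_h.get_id()[1], atom_h.get_name(),
#       res_l.get_resname(), res_l.get_id()[1], atom_l.get_name(), round(d, 3))
```

The brute-force loop is quadratic in the number of atoms, which is negligible for two CDR3 loops; a neighbour search would only matter for much larger selections.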

__Q5__ A library was constructed on the antibody by introducing variability in the CDR3 of the heavy chain, and the binding affinity was measured for each antibody in the library. The affinities are reported as -log10, so higher values indicate stronger binders. Can you provide the code to train an XGBoost model on the train_data, using the techniques you deem necessary to ensure the model's robustness, and use it to predict the binding affinity of the antibodies in the list _test_?
Report the model performance in terms of RMSE, the predicted values for the test antibodies, and any comments on the model.

In [None]:
import numpy as np
import random
import string

# Define the number of data points
n_samples = 1000

# Define possible amino acids
amino_acids = list('ARNDCEQGHILKMFPSTWYV')

# Generate random CDR3 sequences of length 15-19 (np.random.randint excludes the upper bound)
def generate_CDR3(length):
    return ''.join(random.choice(amino_acids) for _ in range(length))

CDR3_sequences = [generate_CDR3(np.random.randint(15, 20)) for _ in range(n_samples)]

# Generate random scores between 6 and 10
scores = np.random.uniform(6, 10, n_samples)

# Create dataset
train_data = list(zip(CDR3_sequences, scores))

test_data = ['KYWRDGSTWWELIYGYG',
 'RQSFKFCCTFE', 
 'LLWRYFFRKFSCD',
 'STLTFSDNIGSYPCWD',
 'QRLQEHTAGRG']
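
No answer to Q5 is given above. One possible approach is sketched here: one-hot encode each CDR3 with zero padding to a common length, estimate the RMSE by k-fold cross-validation, then refit on all training data and predict the test sequences. This is a sketch under stated assumptions, not the expected solution: the helper names and the hyperparameter values are illustrative, and it assumes `xgboost`, `scikit-learn` and `numpy` are installed.

```python
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"

def one_hot_encode(sequence, max_length):
    """Flatten a sequence into max_length * 20 binary features, zero-padded on the right."""
    features = [0] * (max_length * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence[:max_length]):
        features[position * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1
    return features

def make_model(seed=0):
    """Small, regularised regressor; the values are illustrative, not tuned."""
    from xgboost import XGBRegressor  # imported lazily so the encoder works without it
    return XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                        subsample=0.8, random_state=seed)

def cross_validated_rmse(X, y, n_splits=5, seed=0):
    """Mean and standard deviation of the RMSE over shuffled k-fold splits."""
    import numpy as np
    from sklearn.model_selection import KFold, cross_val_score
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    neg_rmse = cross_val_score(make_model(seed), np.asarray(X), np.asarray(y),
                               cv=folds, scoring="neg_root_mean_squared_error")
    return -neg_rmse.mean(), neg_rmse.std()

example_features = one_hot_encode("AR", max_length=3)

# Possible usage with the data defined above (not executed here):
# sequences, scores = zip(*train_data)
# max_length = max(len(s) for s in list(sequences) + test_data)  # cover train and test lengths
# X = [one_hot_encode(s, max_length) for s in sequences]
# print(cross_validated_rmse(X, scores))
# model = make_model().fit(np.asarray(X), np.asarray(scores))
# print(model.predict(np.asarray([one_hot_encode(s, max_length) for s in test_data])))
```

Two comments on the model follow from how the data is generated. The scores are drawn independently of the sequences, so a cross-validated RMSE close to the standard deviation of the scores (about 1.15 for a uniform distribution on 6 to 10) would be expected, with predictions near the mean. In addition, several test sequences are shorter than any training sequence (11 or 13 residues against 15 to 19), so their predictions are extrapolations.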