# Predicting antibody-antigen interactions with Transformer-based machine learning
### Part 1: Data Cleaning & Dataset Generation
This workbook contains all of the code for the data cleaning portion of the project.


### Step 0: Imports & Installations
This project uses the Biopython package to read, parse, and extract information from FASTA sequence files.

In [1]:
## Code to install biopython package; uncomment lines below to install the package if not already installed
# import sys
# !conda install --yes --prefix {sys.prefix} biopython

In [1]:
import pandas as pd # pandas package
import numpy as np # numpy package
import re # for regex
import os # os
import urllib.request # urllib package for making GET requests
from Bio import SeqIO # biopython package for parsing and reading FASTA files

### Step 1: Read and import data
We will be making use of the dataset from the Coronavirus Antibody Database by the Oxford Protein Informatics Group in this project

In [35]:
#Read and import data
covabdab_filepath = "data/CoV-AbDab_201222.csv"
covabdab_df = pd.read_csv(covabdab_filepath)
covabdab_df.head()

Unnamed: 0,Name,Ab or Nb,Binds to,Doesn't Bind to,Neutralising Vs,Not Neutralising Vs,Protein + Epitope,Origin,VHorVHH,VL,...,Light J Gene,CDRH3,CDRL3,Structures,ABB Homology Model (if no structure),Sources,Date Added,Last Updated,Update Description,Notes/Following Up?
0,029-2E1,Ab,SARS-CoV2_WT,,,,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EVQLVESGGGLIQPGGSLRLSCAASGLTVSSNYMNWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCQASQDISNYLNWYQQKPGKAPKL...,...,IGKJ5 (Human),ARFRWGDV,QQYDNLPIT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
1,029-1A4,Ab,SARS-CoV2_WT,,,,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EVQLVQSGGGLVQPGGSLRLSCLASGLTFSSYEFNWIRQAPGKGLE...,QAGLTQPPSVSKGLRQTATLTCTGNSNDVGRQGAAWLQQHQGHPPK...,...,IGLJ1 (Human),VTDLPGDLEFDF,SSWDSSRGGYV,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
2,029-1C1,Ab,SARS-CoV2_WT,,,,S; RBD,B-cells; SARS-CoV2_WT Human Patient,QVQLVQSGAEVKRPGASVKVLCMASGYSFTNYGINWVRQAPGQGLE...,QSVLTQPPSVAAAPGQKVTISWSGSSSNIGNNYVSWYQQVPGTAPK...,...,IGLJ2 (Human),ARGSVLREADY,GTWDSSLNSLVV,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
3,029-1C3,Ab,SARS-CoV2_WT,,,,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EVQLVQSGAEVKKPRESLKISCKGSGYNFTSYWIGWVRQMPGKGLE...,EIVLTQSPGTLSLSPGERATLSCRASQGFSSNSLAWYQQRPGQAPR...,...,IGKJ1 (Human),SRPPYCSGTSCLDF,QQSHRSSMYT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
4,029-1D1,Ab,SARS-CoV2_WT,,,,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EVQLVESGGGLVQPGGSLRLSCAASRFTFANYWMSWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQAISNWLAWYQQKPEKAPKS...,...,IGKJ4 (Human),ARGGPSRGTEVRPDDVFDM,QQYSSYPLT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete


We note that several kinds of data are present:
- The protein sequences for the heavy and light chains of the antibody
- The structures that were referenced, and their sources
- Information about interactions (binds to/doesn't bind to)
- Information about neutralization (neutralizing vs/not neutralizing vs)

Unfortunately, apart from the referenced structures, there is no information on the virus protein sequences.

### Step 2: Obtain unique virus values
- We note that each set of heavy and light chains may interact with or neutralize multiple virus types as well as their variants.
- Here we obtain all the possible unique values for the virus types for mapping later

In [4]:
# Function to obtain the unique values from a column with delimited values
def obtain_unique_values(df, columnName, delimiter):
    result = []
    
    columnUniqueValues = df[columnName].unique()
    for value in columnUniqueValues:
        result = result + str(value).split(delimiter)
    
    result = pd.Series(result)
    result = result.unique()
    return result

In [7]:
# Obtain the overall set of unique virus names across all four columns
overallResult = obtain_unique_values(covabdab_df, "Binds to", ";")
# np.append returns a new array, so the result must be assigned back
overallResult = np.append(overallResult, obtain_unique_values(covabdab_df, "Doesn't Bind to", ";"))
overallResult = np.append(overallResult, obtain_unique_values(covabdab_df, "Neutralising Vs", ";"))
overallResult = np.append(overallResult, obtain_unique_values(covabdab_df, "Not Neutralising Vs", ";"))
overallResult = pd.Series(overallResult).unique()
overallResult


array(['SARS-CoV2_WT', 'SARS-CoV2_Beta', 'SARS-CoV2_Gamma',
       'SARS-CoV2_ Beta', 'SARS-CoV2_Gamma (weak)', 'SARS-CoV2_Delta',
       'SARS-CoV2_Alpha', 'SARS-CoV2_Omicron-BA1',
       'SARS-CoV2_Omicron-BA2', 'SARS-CoV2_Kappa', 'SARS-CoV1',
       'MERS-CoV', 'HKU1', 'OC43', 'SARS-CoV2_Epsilon', 'SARS-CoV2_Eta',
       'SARS-CoV2_Zeta', 'SARS-CoV2_Iota', 'SARS-CoV2_Lambda',
       'SARS-CoV2_Mu', 'SARS-CoV2_Omicron-BA3', 'SARS-CoV2_Omicron-BA4',
       'SARS-CoV2_Omicron-BA5', 'SARS-CoV2_Omicron-BA2.12.1',
       'SARS-CoV2_Omicron-BA2.75', 'SARS-CoV2_Omicron-BA2.38',
       'SARS-CoV2_Omicron-BA2.76', '229E', 'SARS-COV2_Omicron-BA2.12.1',
       'SARS-CoV2_Omicron-BA4/BA5', 'SARS-CoV2_Omicron-BA1 (weak)',
       'SARS-CoV2_Omicron-BA2 (weak)', 'SARS-CoV2-Delta', 'Pangolin-GD',
       'RaTG13', 'SARS-CoV2_Omicron-BA4/5', 'SARS_CoV2_Delta', 'RaTG013',
       'Pangolin-GXP2V', 'A021', 'Civet007-2004', 'CS24', 'Frankfurt-1',
       'LYRa11', 'Rs4231', 'SHC014', 'WIV1', 'HKU3', 'Rs408

In [8]:
pd.Series(overallResult).to_csv("virus_names.csv", index=False, header=False)
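
One detail worth keeping in mind when combining the unique values from several columns: `np.append` returns a new array and leaves its first argument unchanged, so its result has to be assigned. The following is a minimal sketch with made-up arrays (the values are illustrative and not read from the dataset):

```python
import numpy as np

# Illustrative label arrays, standing in for two of the delimited columns
binds_to = np.array(["SARS-CoV2_WT", "SARS-CoV1"])
neutralising = np.array(["SARS-CoV2_WT", "MERS-CoV"])

# Calling np.append without assignment discards the combined array
np.append(binds_to, neutralising)
discarded_size = binds_to.size  # still 2: the original array is unchanged

# Assigning the return value keeps the appended entries
combined = np.append(binds_to, neutralising)
unique_names = np.unique(combined)  # sorted and de-duplicated
```

`np.unique` sorts its output, whereas `pd.Series(...).unique()` keeps the order of first appearance; either is suitable for building a lookup list.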

### Step 3: Download structure data referenced in CoV-AbDab
We note that the records within the CoV-AbDab dataset do not contain the virus protein sequences. However, many of them do contain the IDs of the referenced structures.

These structures mostly contain:
- Heavy and Light chains of one or more antibodies
- The spike protein of the virus itself

We will:
- Download the FASTA sequences from the RCSB Protein Data Bank (RCSB PDB) using the structure IDs
- Parse the FASTA sequences using the Biopython library (https://www.biostars.org/p/710/)
- Extract the spike protein sequence from each structure file to create a record for our training dataset
- Save these outputs into a file: `fastaComplexes.csv`

In [9]:
# Filter out data that does not reference the structure files
covabdab_df_with_structures = covabdab_df[covabdab_df['Structures'] != 'ND']

In [10]:
# This function downloads the FASTA file from RCSB given its ID, or opens the FASTA file if it is already downloaded, and returns the sequence data
def get_fasta_from_id(pdb_id):
    # Normalize id name
    pdb_id = pdb_id.upper()
    fasta_file_path = "fasta/"+pdb_id+".fasta"

    fasta_sequences = []

    # If the FASTA file does not exist locally, download it from RCSB
    if not os.path.exists(fasta_file_path):
        url = "https://www.rcsb.org/fasta/entry/"+pdb_id+"/display"
        try:
            urllib.request.urlretrieve(url, fasta_file_path)
        except Exception:
            print("Error occured while obtaining FASTA file for id: "+pdb_id)
            return fasta_sequences

    #if file exists then open it directly
    for seq_record in SeqIO.parse(fasta_file_path, "fasta"):
        seq_prelim_virus_type, seq_prelim_protein_type = get_prelim_virus_type(seq_record.description)
        fasta_sequences.append({
            "id": seq_record.id,
            "prelim_virus": seq_prelim_virus_type,
            "prelim_protein": seq_prelim_protein_type,
            "description": seq_record.description,
            "sequence": str(seq_record.seq)
        })
    
    return fasta_sequences
    

In [11]:
# This function assigns the preliminary virus type and protein type (Ab or Spike) based on the description of a FASTA sequence
def get_prelim_virus_type(description):
    # Compare against an upper-cased copy so that matching is case-insensitive
    normalized_description = description.upper()
    prelim_virus_type = ""
    prelim_protein_type = ""
    if "CORONAVIRUS 2" in normalized_description:
        prelim_virus_type = "SARS-CoV2"
    elif "SARS CORONAVIRUS" in normalized_description:
        prelim_virus_type = "SARS-CoV"
    elif "MIDDLE EAST" in normalized_description:
        prelim_virus_type = "MERS-CoV"
    
    # Antibody chains override the virus type with "Antibody"
    if "SPIKE GLYCOPROTEIN" in normalized_description:
        prelim_protein_type = "Spike Glycoprotein"
    elif "SPIKE PROTEIN" in normalized_description:
        prelim_protein_type = "Spike Protein"
    elif "VL" in normalized_description or "LIGHT" in normalized_description:
        prelim_protein_type = "Light Chain"
        prelim_virus_type = "Antibody"
    elif "VH" in normalized_description or "HEAVY" in normalized_description:
        prelim_protein_type = "Heavy Chain"
        prelim_virus_type = "Antibody"
    elif "HOMO" in normalized_description:
        prelim_protein_type = "Heavy/Light Chain"
        prelim_virus_type = "Antibody"
    elif "NANOBODY" in normalized_description:
        prelim_protein_type = "Nanobody"
        prelim_virus_type = "Antibody"
        
    return prelim_virus_type, prelim_protein_type
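
The keyword matching above works on the whole description string. Since the descriptions in the RCSB FASTA files are pipe-delimited (entity ID, chains, molecule name, organism), it can also help to split them into fields first. The following is a minimal sketch of that idea; `parse_rcsb_header` is a hypothetical helper, not part of the original workflow, and the example header uses a made-up ID:

```python
def parse_rcsb_header(description):
    """Split a pipe-delimited RCSB FASTA description into named fields."""
    fields = [field.strip() for field in description.split("|")]
    # Pad with empty strings so short or malformed headers still unpack
    fields += [""] * (4 - len(fields))
    entity_id, chains, molecule, organism = fields[:4]
    return {
        "entity_id": entity_id,
        "chains": chains,
        "molecule": molecule,
        "organism": organism,
    }


# Hypothetical header in the same layout as the descriptions in the dataset
example_header = (
    "1ABC_1|Chains A, B|Spike glycoprotein|"
    "Severe acute respiratory syndrome coronavirus 2 (2697049)"
)
parsed = parse_rcsb_header(example_header)
```

Matching keywords against `parsed["molecule"]` and `parsed["organism"]` separately would avoid, for example, a short token such as `VL` being found by chance in an unrelated field.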

In [12]:
combined_dataset = []

# Iterate through the rows and construct new dataset
for index, row in covabdab_df_with_structures.iterrows():
    vhh_chain = str(row['VHorVHH'])
    vl_chain = str(row['VL'])

    # Use regex to extract all the RCSB PDB IDs listed for this record
    structure_ids = re.findall(r"[0-9A-Z]{4,}", str(row['Structures']))
    for structure_id in structure_ids:
        fasta_sequences = get_fasta_from_id(structure_id)

        # Iterate through all the fasta sequences in the set
        for fasta_sequence in fasta_sequences:
            #Skip if this is one of our own vl or vhh chains
            if vhh_chain in fasta_sequence['sequence']:
                vhh_chain = fasta_sequence['sequence']
                continue
            elif vl_chain in fasta_sequence['sequence']:
                vl_chain = fasta_sequence['sequence']
                continue
            
            prelim_virus_type, prelim_protein_type = get_prelim_virus_type(fasta_sequence['description'])
            if prelim_virus_type == "Antibody":
                continue

            combined_dataset.append({
                "name": row['Name'],
                "structure_id": structure_id,
                "vh": vhh_chain,
                "vl": vl_chain,
                "virus": fasta_sequence['sequence'],
                "description": fasta_sequence['description'],
                "binds_to": row['Binds to'],
                "doesnt_bind_to": row['Doesn\'t Bind to'],
                "neutralising_vs": row['Neutralising Vs'],
                "not_neutralising_vs": row['Not Neutralising Vs'],
            })
        

Error occured while obtaining FASTA file for id: 7N01
Error occured while obtaining FASTA file for id: 7NY5
Error occured while obtaining FASTA file for id: 7KGL
Error occured while obtaining FASTA file for id: 7B11


In [13]:
# Create dataframe from it
fasta_complexes_df = pd.DataFrame.from_records(combined_dataset)
fasta_complexes_df.head()

Unnamed: 0,name,structure_id,vh,vl,virus,description,binds_to,doesnt_bind_to,neutralising_vs,not_neutralising_vs
0,1-2C7,7X2M,QVQLQESGGGLVQPGGSLRLSCAASGDTLDLYAIGWFRQTPGEERE...,,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,7X2M_1|Chain A[auth E]|Spike protein S1|Severe...,SARS-CoV2_WT,,SARS-CoV2_WT,
1,P2B4,8DXS,EVQLVESGGGLVQPGRSLRLSCAASGFNFDDYAMHWARQVPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSIHSFLSWYQQKPGKAPKL...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,"8DXS_1|Chains A, B, C|Spike glycoprotein|Sever...",SARS-CoV2_WT,,SARS-CoV2_WT,
2,P1D9,8DWA,VQLVQSGAEVKKPGSSVRVSCKASGGTFGDDSITWVRQAPGQGLEW...,DIQMTQSPSSLSASVGDTVTITCRAGQTINTFLNWYQQKPGKAPKL...,LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCY...,8DWA_1|Chain A|Spike protein S1|Severe acute r...,SARS-CoV2_WT,,SARS-CoV2_WT,
3,D29,8DW9,QLVESGAEVKKPGSSVKVSCKASGDTFYTFDISWVRQAPGQGLEWM...,EIVLTQSPGTLSLSPGERATLSCRASQSVGNIYLAWYQQKPGQAPR...,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,"8DW9_1|Chains A, B|Spike protein S1|Severe acu...",SARS-CoV2_WT,,SARS-CoV2_WT,
4,Ab159,7X8Y,QVQLVESGGGLVKPGGSLRLSCAASGFTFNNAWMSWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSITNYLNWYQQKPGKAPKF...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,7X8Y_1|Chain A|Spike glycoprotein|Severe acute...,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,SARS-CoV2_Delta;SARS-CoV2_Omicron-BA1;SARS-CoV...


In [24]:
fasta_complexes_df.to_csv("fastaComplexes.csv",index=False)

A problem we encounter here is that although we have the virus sequences, we do not know which variants they belong to.

To resolve this, we follow the workflow used in the Jupytope paper: the sequences are submitted to AudacityInstant, a GISAID application that finds the closest lineage of each sequence and hence its variant.

As AudacityInstant only accepts nucleotide sequences, we first need to find nucleotide sequences that correspond to these protein sequences, using tBLASTn, an application from the National Center for Biotechnology Information (NCBI).

tBLASTn takes an amino acid sequence, which is how our protein sequences are encoded, and searches a nucleotide database for sequences whose translation matches it.
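
To illustrate the idea behind this search, the sketch below translates a nucleotide sequence in all six reading frames (three offsets on each strand) and checks whether a protein fragment appears in any of them. This is only a toy illustration under simplifying assumptions: the real tBLASTn scores local alignments and tolerates mismatches, whereas this sketch looks for an exact substring, and the function names and the toy sequence are made up for the example.

```python
# Standard genetic code, written in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(nucleotides):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(nucleotides):
    """Return the translations of the three frames on each strand."""
    forward = nucleotides.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]


def protein_found(protein, nucleotides):
    """Exact-match stand-in for the alignment step of a real search."""
    return any(protein in frame for frame in six_frame_translations(nucleotides))


# Toy sequence: two padding bases followed by codons for M-F-V-F-L-V
toy_nucleotides = "GGATGTTTGTTTTTCTTGTT"
frames = six_frame_translations(toy_nucleotides)
```

Here the fragment `MFVFLV` is found in the third forward frame, which is why a search of this kind has to consider every frame rather than just the first.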

### Step 4: Prepare data for tBLASTn
Removal of duplicate values
- Upon further investigation of the viral sequences, we note that many of the virus spike protein sequences are shared across multiple complexes.
- To reduce the amount of searches required for tBLASTn, we extract out these unique virus protein sequences
- We will keep the description of the first instance of each duplicated virus sequence

Preparation of data for tBLASTn
- We will perform tBLASTn on the sequences in batches of 10 because of NCBI's CPU usage limits; space constraints also prevent us from performing a local BLAST search using the BLAST+ binaries
- Spike protein sequences are split into groups of 10 and saved into individual FASTA files for upload and processing on the NCBI website

In [14]:
fasta_complexes_df = pd.read_csv("fastaComplexes.csv")
fasta_complexes_df.head()

Unnamed: 0,name,structure_id,vh,vl,virus,description,binds_to,doesnt_bind_to,neutralising_vs,not_neutralising_vs
0,1-2C7,7X2M,QVQLQESGGGLVQPGGSLRLSCAASGDTLDLYAIGWFRQTPGEERE...,,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,7X2M_1|Chain A[auth E]|Spike protein S1|Severe...,SARS-CoV2_WT,,SARS-CoV2_WT,
1,P2B4,8DXS,EVQLVESGGGLVQPGRSLRLSCAASGFNFDDYAMHWARQVPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSIHSFLSWYQQKPGKAPKL...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,"8DXS_1|Chains A, B, C|Spike glycoprotein|Sever...",SARS-CoV2_WT,,SARS-CoV2_WT,
2,P1D9,8DWA,VQLVQSGAEVKKPGSSVRVSCKASGGTFGDDSITWVRQAPGQGLEW...,DIQMTQSPSSLSASVGDTVTITCRAGQTINTFLNWYQQKPGKAPKL...,LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCY...,8DWA_1|Chain A|Spike protein S1|Severe acute r...,SARS-CoV2_WT,,SARS-CoV2_WT,
3,D29,8DW9,QLVESGAEVKKPGSSVKVSCKASGDTFYTFDISWVRQAPGQGLEWM...,EIVLTQSPGTLSLSPGERATLSCRASQSVGNIYLAWYQQKPGQAPR...,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,"8DW9_1|Chains A, B|Spike protein S1|Severe acu...",SARS-CoV2_WT,,SARS-CoV2_WT,
4,Ab159,7X8Y,QVQLVESGGGLVKPGGSLRLSCAASGFTFNNAWMSWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSITNYLNWYQQKPGKAPKF...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,7X8Y_1|Chain A|Spike glycoprotein|Severe acute...,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,SARS-CoV2_Delta;SARS-CoV2_Omicron-BA1;SARS-CoV...


In [19]:
# Obtain unique values for the virus protein sequences
unique_virus_protein_sequences = fasta_complexes_df.drop_duplicates('virus')
print(unique_virus_protein_sequences.shape)
unique_virus_protein_sequences.to_csv("fastaComplexesUNIQUE.csv",index=False)

(335, 10)


In [20]:
# Filter out sequences that are COVID-19 (contains 'coronavirus 2') only
covid19_protein_sequences = unique_virus_protein_sequences[unique_virus_protein_sequences['description'].str.contains('coronavirus 2',case=False)]
print(covid19_protein_sequences.shape)
covid19_protein_sequences.to_csv("fastaComplexesUNIQUE_covid19.csv",index=False)

(273, 10)


In [22]:
# Generate large FASTA file containing all sequences
with open("covid19_sequences.fasta","w") as fastaFile:
    for index, row in covid19_protein_sequences.iterrows():
        fastaFile.write(">" + row['description'] +"\n" + row['virus'] + "\n")

In [41]:
# Generate batch files of 10 sequences each
fastaFile = None
for counter, (index, row) in enumerate(covid19_protein_sequences.iterrows()):
    # Start a new batch file every 10 sequences; "w" mode avoids duplicating records on a re-run
    if counter%10 == 0:
        if fastaFile is not None:
            fastaFile.close()
        fastaFile = open("batches/covid19_sequences("+str(counter)+").fasta","w")
    fastaFile.write(">" + row['description'] +"\n" + row['virus'] + "\n")
if fastaFile is not None:
    fastaFile.close()

These batches were put through the NCBI tBLASTn application searching under the `Betacoronavirus` database using the default parameters except for the `Max target sequences` which was adjusted to 10.

The top result, or the top 10 results, were downloaded for every protein sequence that was put through tBLASTn. They are saved in the folder `tblastn-results` using the naming convention `<PROTEIN SEQ ID>.txt`.

### Step 5: tBLASTn Results & preparation for lineage classification in GISAID's AudacityInstant
- After performing tBLASTn on the sequences, we import all of the results into a pandas DataFrame
- tBLASTn's results are sorted by E-value, the number of hits with an equal or better score that would be expected by chance in a database of that size; a lower value therefore indicates a more significant match
- We will take the top result from each protein sequence's tBLASTn results
- These top results will be used to form a new spreadsheet for GISAID's AudacityInstant to determine the lineage/variant of the virus spike proteins

In [24]:
#Read all fasta files and form an array of dictionaries
tblastn_sequences = []

for nucleotide_fasta in os.listdir('tblastn-results'):
    for nucleotide_sequence in SeqIO.parse('tblastn-results/'+nucleotide_fasta,"fasta"):
        tblastn_sequences.append({
            "id":nucleotide_fasta.replace(".txt",""),
            "id_tblastn": nucleotide_sequence.id,
            "sequence": str(nucleotide_sequence.seq),
            "nucleotide_desc": nucleotide_sequence.description
        })
        break # because we take the top result only

# Create a pandas dataframe from it
tblastn_df = pd.DataFrame.from_records(tblastn_sequences)
tblastn_df

Unnamed: 0,id,id_tblastn,sequence,nucleotide_desc
0,6WPS_1,OA965538.1,AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...,OA965538.1 Severe acute respiratory syndrome c...
1,6XDG_1,OL454750.1,TACGTTGAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCT...,OL454750.1 Severe acute respiratory syndrome c...
2,6XE1_3,OW680892.1,AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...,OW680892.1 Severe acute respiratory syndrome c...
3,6XKP_1,OL454750.1,TACGTTGAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCT...,OL454750.1 Severe acute respiratory syndrome c...
4,6YLA_1,OL444812.1,GTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAA...,OL444812.1 Severe acute respiratory syndrome c...
...,...,...,...,...
255,8DXS_1,OY602403.3,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,OY602403.3 Severe acute respiratory syndrome c...
256,8DZH_1,ON334068.1,TAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCA...,ON334068.1 Severe acute respiratory syndrome c...
257,8ERQ_3,ON334068.1,TAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCA...,ON334068.1 Severe acute respiratory syndrome c...
258,8GX9_1,OL444814.1,GTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAA...,OL444814.1 Severe acute respiratory syndrome c...


In [48]:
# import previous unique fasta files
fasta_sequences_covid19_df = pd.read_csv("fastaComplexesUNIQUE_covid19.csv")
fasta_sequences_covid19_df['virus_id'] =  fasta_sequences_covid19_df['description'].str[:6]

In [50]:
# Perform an outer join between the two dataframes
merged_tblastn_sequences_df = pd.merge(fasta_sequences_covid19_df,tblastn_df,how="outer",left_on="virus_id",right_on="id")
merged_tblastn_sequences_df

merged_tblastn_sequences_df.to_csv("translated_sequences.csv",index=False)

We note that many of the top results for the nucleotide sequences are duplicates; hence we will drop the duplicates so that GISAID has a smaller set of records to work with.
- Note that AudacityInstant only accepts sequences of between 250 and 40,000 bases
- We will remove records that have fewer than 250 bases
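
If the upper limit also needs to be enforced, both bounds can be checked in one step. The following is a minimal sketch on a toy DataFrame, assuming the 250 and 40,000 base limits quoted above; the rows and the constant names are illustrative only:

```python
import pandas as pd

# Limits quoted above for AudacityInstant (assumed to be inclusive)
MIN_BASES, MAX_BASES = 250, 40_000

# Toy stand-in for the tBLASTn results table
toy_df = pd.DataFrame({
    "id_tblastn": ["short", "ok", "too_long"],
    "sequence": ["A" * 100, "ACGT" * 100, "A" * 50_000],
})

# Series.between is inclusive at both ends by default
lengths = toy_df["sequence"].str.len()
within_limits = toy_df[lengths.between(MIN_BASES, MAX_BASES)]
```

Only the 400-base row survives here; the same mask could be applied to `tblastn_df_unique` in place of the single lower-bound comparison.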

In [25]:
tblastn_df_unique = tblastn_df.drop_duplicates('id_tblastn')
print(tblastn_df_unique.shape)
tblastn_df_unique = tblastn_df_unique[tblastn_df_unique['sequence'].str.len() >= 250]
print(tblastn_df_unique.shape)
tblastn_df_unique.to_csv("translated_sequences_unique.csv",index=False)

(80, 4)
(78, 4)


In [60]:
#Add them all into a new fasta file
nucleotide_fasta = open("nucleotide_sequences.fasta","w")
for index, row in tblastn_df_unique.iterrows():
    nucleotide_fasta.write(">" + row['nucleotide_desc'] +"\n" + row['sequence'] + "\n")
nucleotide_fasta.close()

Since the AudacityInstant application runs on a web platform without an official API, the nucleotide sequences were submitted through the platform and the results were entered manually into the spreadsheet.

### Step 6: Matching variants from nucleotide sequences to protein sequences
Now that we have identified the variants of several nucleotide sequences, we can map them back to the protein sequences they were retrieved for.

Each lineage result is joined to its original protein sequences through the tBLASTn nucleotide ID, which was linked to the PDB ID in the previous step.
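
Because several protein sequences can share the same top nucleotide hit, this mapping is a many-to-one join: each lineage result fans out to every protein that pointed at it. The following is a minimal sketch on toy data; the IDs are made up, and the `validate` argument is an optional safeguard rather than something the workflow here relies on:

```python
import pandas as pd

# Toy protein table: two entities share the same nucleotide hit
proteins = pd.DataFrame({
    "pdb_id": ["AAAA_1", "BBBB_1", "CCCC_1"],
    "id_tblastn": ["NUC1", "NUC1", "NUC2"],
})

# Toy lineage table: one row per unique nucleotide hit
lineages = pd.DataFrame({
    "id_tblastn": ["NUC1", "NUC2"],
    "lineage": ["B.1.351", "BA.1"],
})

# validate raises an error if a nucleotide ID appears twice in `lineages`
mapped = pd.merge(
    proteins, lineages, how="left", on="id_tblastn", validate="many_to_one"
)
```

Every protein row keeps its place and picks up the lineage of its nucleotide hit, so the result has as many rows as the protein table.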

In [12]:
# Retrieve translated mapped protein sequences from earlier
translated_virus_sequences = pd.read_csv("translated_sequences.csv")
print(translated_virus_sequences.shape)
translated_virus_sequences.head()

(273, 15)


Unnamed: 0,name,structure_id,vh,vl,virus,description,binds_to,doesnt_bind_to,neutralising_vs,not_neutralising_vs,virus_id,id,id_tblastn,sequence,nucleotide_desc
0,1-2C7,7X2M,QVQLQESGGGLVQPGGSLRLSCAASGDTLDLYAIGWFRQTPGEERE...,,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,7X2M_1|Chain A[auth E]|Spike protein S1|Severe...,SARS-CoV2_WT,,SARS-CoV2_WT,,7X2M_1,7X2M_1,OP964904.1,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,OP964904.1 Severe acute respiratory syndrome c...
1,P2B4,8DXS,EVQLVESGGGLVQPGRSLRLSCAASGFNFDDYAMHWARQVPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSIHSFLSWYQQKPGKAPKL...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,"8DXS_1|Chains A, B, C|Spike glycoprotein|Sever...",SARS-CoV2_WT,,SARS-CoV2_WT,,8DXS_1,8DXS_1,OY602403.3,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,OY602403.3 Severe acute respiratory syndrome c...
2,P1D9,8DWA,VQLVQSGAEVKKPGSSVRVSCKASGGTFGDDSITWVRQAPGQGLEW...,DIQMTQSPSSLSASVGDTVTITCRAGQTINTFLNWYQQKPGKAPKL...,LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCY...,8DWA_1|Chain A|Spike protein S1|Severe acute r...,SARS-CoV2_WT,,SARS-CoV2_WT,,8DWA_1,8DWA_1,MZ242023.1,AAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCT...,MZ242023.1 Severe acute respiratory syndrome c...
3,D29,8DW9,QLVESGAEVKKPGSSVKVSCKASGDTFYTFDISWVRQAPGQGLEWM...,EIVLTQSPGTLSLSPGERATLSCRASQSVGNIYLAWYQQKPGQAPR...,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,"8DW9_1|Chains A, B|Spike protein S1|Severe acu...",SARS-CoV2_WT,,SARS-CoV2_WT,,8DW9_1,8DW9_1,OL444814.1,GTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAA...,OL444814.1 Severe acute respiratory syndrome c...
4,Ab159,7X8Y,QVQLVESGGGLVKPGGSLRLSCAASGFTFNNAWMSWVRQAPGKGLE...,DIQMTQSPSSLSASVGDRVTITCRASQSITNYLNWYQQKPGKAPKF...,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,7X8Y_1|Chain A|Spike glycoprotein|Severe acute...,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,,SARS-CoV2_WT;SARS-CoV2_Alpha;SARS-CoV2_Beta;SA...,SARS-CoV2_Delta;SARS-CoV2_Omicron-BA1;SARS-CoV...,7X8Y_1,7X8Y_1,OA965538.1,AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...,OA965538.1 Severe acute respiratory syndrome c...


In [13]:
# Import results after GISAID
nucleotide_variants_df = pd.read_csv("translated_sequences_variant.csv")
print(nucleotide_variants_df.shape)
nucleotide_variants_df.head()

(78, 8)


Unnamed: 0,id,id_tblastn,sequence,nucleotide_desc,lineage,who_name,comment,mapped_virus_name
0,6WPS_1,OA965538.1,AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...,OA965538.1 Severe acute respiratory syndrome c...,B.39,,,
1,6XDG_1,OL454750.1,TACGTTGAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCT...,OL454750.1 Severe acute respiratory syndrome c...,B.1.2,,,
2,6XE1_3,OW680892.1,AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...,OW680892.1 Severe acute respiratory syndrome c...,B.1.1,,,
3,6YLA_1,OL444812.1,GTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAA...,OL444812.1 Severe acute respiratory syndrome c...,B,,,
4,6ZDG_1,OL444814.1,GTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAA...,OL444814.1 Severe acute respiratory syndrome c...,B.1.177.12,,,


In [15]:
# Join the two dataframes together based on their tBLASTn ID
merged_lineages_df = pd.merge(translated_virus_sequences,nucleotide_variants_df,how="outer",on="id_tblastn")
merged_lineages_df = merged_lineages_df[['id_y','id_tblastn','virus','sequence_y','lineage','who_name','mapped_virus_name']]
merged_lineages_df = merged_lineages_df.rename(columns={'id_y':'pdb_id','id_tblastn':'nucleotide_id','virus':'protein_sequence','sequence_y':'nucleotide_sequence'})
merged_lineages_df.head()

Unnamed: 0,pdb_id,nucleotide_id,protein_sequence,nucleotide_sequence,lineage,who_name,mapped_virus_name
0,7X2M_1,OP964904.1,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
1,7X2M_1,OP964904.1,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEV...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
2,6ZDH_1,OY602403.3,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,B.3,,
3,6ZDH_1,OY602403.3,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,B.3,,
4,6ZDH_1,OY602403.3,QCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFF...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,B.3,,


In [16]:
# Drop rows that have no WHO name
merged_lineages_df_with_who_name = merged_lineages_df.dropna(subset = ['who_name'])
merged_lineages_df_with_who_name.head()

Unnamed: 0,pdb_id,nucleotide_id,protein_sequence,nucleotide_sequence,lineage,who_name,mapped_virus_name
0,7X2M_1,OP964904.1,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
1,7X2M_1,OP964904.1,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEV...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
115,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1;SARS-CoV2_Omicron-BA1
116,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1;SARS-CoV2_Omicron-BA1
124,7QTI_5,ON334068.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,TAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCA...,BA.1.18,Omicron,SARS-CoV2_OmicronBA1;SARS-CoV2_Omicron-BA1


In [17]:
# Explode the dataframe based on the mapped name
# Work on an explicit copy so that the column assignment does not raise SettingWithCopyWarning
merged_lineages_df_with_who_name = merged_lineages_df_with_who_name.copy()
merged_lineages_df_with_who_name['mapped_virus_name'] = merged_lineages_df_with_who_name['mapped_virus_name'].str.split(';')
merged_lineages_df_with_who_name = merged_lineages_df_with_who_name.explode('mapped_virus_name')
merged_lineages_df_with_who_name.head()


Unnamed: 0,pdb_id,nucleotide_id,protein_sequence,nucleotide_sequence,lineage,who_name,mapped_virus_name
0,7X2M_1,OP964904.1,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
1,7X2M_1,OP964904.1,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEV...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
115,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1
115,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_Omicron-BA1
116,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1


In [18]:
merged_lineages_df_with_who_name.to_csv('protein_variant_mapping.csv',index=False)

### Step 7: Generate training dataset
We map these protein sequences to the antibody records by variant name to generate a training dataset. We also import the SARS-CoV-2 reference strain to represent SARS-CoV2_WT and add it to our dataset.

In [10]:
# Import the spike protein variant mapping
spike_protein_variant_mapping = pd.read_csv("protein_variant_mapping.csv")
print("BEFORE ADDING WT SHAPE: " + str(spike_protein_variant_mapping.shape))

# Import the SARS-CoV-2 reference strain for SARS-CoV2_WT and add it into the dataframe
wildtype_protein_sequence = ""
for seq_record in SeqIO.parse("covid19_ref_strain.fasta", "fasta"):
    wildtype_protein_sequence = str(seq_record.seq)
    break
spike_protein_variant_mapping.loc[len(spike_protein_variant_mapping.index)] = ["","",wildtype_protein_sequence,"","","","SARS-CoV2_WT"]
print("AFTER ADDING WT SHAPE: " + str(spike_protein_variant_mapping.shape))
print(spike_protein_variant_mapping.iloc[len(spike_protein_variant_mapping.index)-1])
spike_protein_variant_mapping.head()


BEFORE ADDING WT SHAPE: (133, 7)
AFTER ADDING WT SHAPE: (134, 7)
pdb_id                                                                  
nucleotide_id                                                           
protein_sequence       MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...
nucleotide_sequence                                                     
lineage                                                                 
who_name                                                                
mapped_virus_name                                           SARS-CoV2_WT
Name: 133, dtype: object


Unnamed: 0,pdb_id,nucleotide_id,protein_sequence,nucleotide_sequence,lineage,who_name,mapped_virus_name
0,7X2M_1,OP964904.1,TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFK...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
1,7X2M_1,OP964904.1,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTRFPNITNLCPFGEV...,GTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACT...,B.1.351,Beta,SARS-CoV2_Beta
2,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1
3,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_Omicron-BA1
4,7WS4_1,OW023282.1,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...,BA.1,Omicron,SARS-CoV2_OmicronBA1


In [11]:
# Import the original CoV-AbDab dataset
covabdab_filepath = "data/CoV-AbDab_201222.csv"
covabdab_df = pd.read_csv(covabdab_filepath)
covabdab_df['VHorVHH'] = covabdab_df['VHorVHH'].replace('ND',np.nan,regex=False)
covabdab_df['VL'] = covabdab_df['VL'].replace('ND',np.nan,regex=False)
covabdab_df['Not Neutralising Vs'] = covabdab_df['Not Neutralising Vs'].replace('ND',np.nan)
covabdab_df['Neutralising Vs'] = covabdab_df['Neutralising Vs'].replace('ND',np.nan)
covabdab_df = covabdab_df.dropna(subset=['Neutralising Vs', 'Not Neutralising Vs'], how='all')
covabdab_df = covabdab_df.dropna(subset=['VHorVHH', 'VL'], how='all')

covabdab_df.head()


Unnamed: 0,Name,Ab or Nb,Binds to,Doesn't Bind to,Neutralising Vs,Not Neutralising Vs,Protein + Epitope,Origin,VHorVHH,VL,...,Light J Gene,CDRH3,CDRL3,Structures,ABB Homology Model (if no structure),Sources,Date Added,Last Updated,Update Description,Notes/Following Up?
11,036-2C2,Ab,SARS-CoV2_WT,,,SARS-CoV2_WT,S; RBD,B-cells; SARS-CoV2_WT Human Patient,QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLE...,DIQMTQSPSSLSASVGDRVTIICRASQYINSFLNWYQQKPGQAPKL...,...,IGKJ2 (Human),ARGALRFLEWPILAY,QQSYSTPPYT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
12,036-1B1,Ab,SARS-CoV2_WT,,,SARS-CoV2_WT,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EGQLVESGGRLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLE...,EIVLTQSPPTLSVSPGETATLSCRASLSLDRNLAWYQQKPGQAPRL...,...,IGKJ3 (Human),VKDGNSDYGGNPI,QNYNKWPPLFT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
14,036-1E2,Ab,SARS-CoV2_WT,,,SARS-CoV2_WT,S; RBD,B-cells; SARS-CoV2_WT Human Patient,QVQLVESGGGVVQPGRSLRLSCAAPGLSFRNYGMHWVRQAPGKGLE...,SYVLTQPPSVSVSPGQTARITCGGNNIGGKSVHWYQQKPGQAPVLI...,...,IGLJ2 (Human),ARDRSGKDVLTGYPMFPAGMDV,QVWDVSSDHVV,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
16,036-1A10,Ab,SARS-CoV2_WT,,,SARS-CoV2_WT,S; RBD,B-cells; SARS-CoV2_WT Human Patient,QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLE...,DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPG...,...,IGKJ2 (Human),AREGYSYGTSWLGSSYYYYMDV,MQALQTPVRT,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete
22,036-1F9,Ab,SARS-CoV2_WT,,,SARS-CoV2_WT,S; RBD,B-cells; SARS-CoV2_WT Human Patient,EVQVVESGGGLVQPGGSLRLSCAASGFPLSSYDMHWVRQRTGKGLE...,QSALTQPASVSGSPGQSITISCTGTSSDVGGFHYVSWYQQHPGKAP...,...,IGLJ3 (Human),ARAQGYCGGGSCYSAYAFAI,SSYTSSSTWV,ND,,"Sanjeev Kumar et al., 2022 (https://www.scienc...","Dec 20, 2022","Dec 20, 2022",,Complete


In the CoV-AbDab dataset, the `Neutralising Vs` and `Not Neutralising Vs` columns sometimes use different names for the same variant, as seen in the unique virus values obtained earlier. We standardize the names as follows:
- All labels with the prefix "SARS_CoV2_" --> "SARS-CoV2_"
- "SARS-CoV2_WTSARS-CoV1" --> "SARS-CoV2_WT;SARS-CoV1"
- "SARS-CoV2_BetaSARS-CoV2_Gamma" --> "SARS-CoV2_Beta;SARS-CoV2_Gamma"
- "SRAS-CoV2_Beta" --> "SARS-CoV2_Beta"

In [12]:
def standardize_names(df, column):
    df[column] = df[column].str.replace('SARS_CoV2_','SARS-CoV2_',regex=False)
    df[column] = df[column].str.replace('SARS-CoV2_WTSARS-CoV1','SARS-CoV2_WT;SARS-CoV1',regex=False)
    df[column] = df[column].str.replace('SARS-CoV2_BetaSARS-CoV2_Gamma','SARS-CoV2_Beta;SARS-CoV2_Gamma',regex=False)
    df[column] = df[column].str.replace('SRAS-CoV2_Beta','SARS-CoV2_Beta',regex=False)
    return df

In [13]:
covabdab_df_normalized = covabdab_df.copy()
covabdab_df_normalized = standardize_names(covabdab_df_normalized, "Binds to")
covabdab_df_normalized = standardize_names(covabdab_df_normalized, "Doesn't Bind to")
covabdab_df_normalized = standardize_names(covabdab_df_normalized, "Neutralising Vs")
covabdab_df_normalized = standardize_names(covabdab_df_normalized, "Not Neutralising Vs")
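
To illustrate the effect of the four renaming rules on individual labels, here is a minimal, self-contained sketch. The toy labels are made up for illustration and are not rows from the dataset; the `rules` dictionary is a hypothetical restatement of the same literal replacements used in `standardize_names`, applied in the same order.

```python
import pandas as pd

# Toy labels (illustrative only, not taken from the dataset)
labels = pd.Series([
    "SARS_CoV2_WT",
    "SARS-CoV2_WTSARS-CoV1",
    "SARS-CoV2_BetaSARS-CoV2_Gamma",
    "SRAS-CoV2_Beta",
    None,  # missing values pass through the string methods untouched
])

# The same four literal replacements, in the same order as standardize_names
rules = {
    "SARS_CoV2_": "SARS-CoV2_",
    "SARS-CoV2_WTSARS-CoV1": "SARS-CoV2_WT;SARS-CoV1",
    "SARS-CoV2_BetaSARS-CoV2_Gamma": "SARS-CoV2_Beta;SARS-CoV2_Gamma",
    "SRAS-CoV2_Beta": "SARS-CoV2_Beta",
}

standardized = labels
for old, new in rules.items():
    # regex=False treats each pattern as a plain substring
    standardized = standardized.str.replace(old, new, regex=False)

print(standardized.tolist())
```

Because the replacements are literal substrings, they only fix the specific misspellings listed above; any other variant spelling would need its own rule.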

In [14]:
# Convert values into lists for explosion
covabdab_df_normalized['Neutralising Vs'] = covabdab_df_normalized['Neutralising Vs'].str.split(';')
covabdab_df_normalized['Not Neutralising Vs'] = covabdab_df_normalized['Not Neutralising Vs'].str.split(';')


We now explode both the `Neutralising Vs` and `Not Neutralising Vs` columns, so that each row holds a single antibody-virus pair, to form our dataset
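
As a small illustration of what the split-then-explode step does, the sketch below runs it on a toy frame. The antibody names `ab1` and `ab2` and their labels are hypothetical and only serve to show the reshaping.

```python
import pandas as pd

# Toy frame: one row per antibody, several virus labels packed into one string
toy = pd.DataFrame({
    "Name": ["ab1", "ab2"],
    "Neutralising Vs": ["SARS-CoV2_WT;SARS-CoV2_Beta (weak)", "SARS-CoV2_WT"],
})

# Split the ';'-separated string into a list, then emit one row per list element
toy["Neutralising Vs"] = toy["Neutralising Vs"].str.split(";")
exploded = toy.explode("Neutralising Vs")

print(exploded)
```

Note that `explode` repeats the original index for every row produced from the same antibody, so the index is no longer unique after this step.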

In [15]:
# Explode for neutralizing dataset
neutralizing_dataset = covabdab_df_normalized.copy()
neutralizing_dataset = neutralizing_dataset.dropna(subset=['Neutralising Vs'], how='all')
neutralizing_dataset = neutralizing_dataset.explode('Neutralising Vs')
# Preserve the weak neutralizing property
neutralizing_dataset['weak_neutralisation'] = neutralizing_dataset['Neutralising Vs'].str.contains('(weak)',case=False,na=False,regex=False)
neutralizing_dataset['neutralising'] = True
# Drop the (weak) notation in the column for mapping later
neutralizing_dataset['Neutralising Vs'] = neutralizing_dataset['Neutralising Vs'].str.replace(' (weak)',"",case=False,regex=False)
neutralizing_dataset['Neutralising Vs'] = neutralizing_dataset['Neutralising Vs'].str.replace('(weak)',"",case=False,regex=False)
# Rename columns
neutralizing_dataset = neutralizing_dataset.rename(columns={'Neutralising Vs':'virus_type','VHorVHH':'heavy_chain','VL':'light_chain','CDRH3':'cdrh3','CDRL3':'cdrl3'})
# Merge the virus spike protein with the dataset
neutralizing_dataset = neutralizing_dataset.merge(spike_protein_variant_mapping,left_on="virus_type",right_on="mapped_virus_name")
neutralizing_dataset = neutralizing_dataset.rename(columns={'protein_sequence':'virus_sequence'})
# Retain columns of interest only -- we retain the CDRH3 and CDRL3 information as it may be of interest to us
neutralizing_dataset = neutralizing_dataset[['heavy_chain','light_chain','cdrh3','cdrl3','virus_type','virus_sequence','neutralising','weak_neutralisation']]

neutralizing_dataset.head()

Unnamed: 0,heavy_chain,light_chain,cdrh3,cdrl3,virus_type,virus_sequence,neutralising,weak_neutralisation
0,EVQLVESGGGLAQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLE...,QSALTQPRSVSGSPGQSVTISCTGTSSDVGGYNYVSWYQQHPGKAP...,AKAEVPGYGSGWYQGFAS,CSYAGSYTGL,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,True,True
1,EVQLVESGGGLIQPGGSLRLSCAASGITVSSNYMSWVRQAPGKGLE...,AIQLTQSPSSLSASVGDRVTITCRASQGISTYLAWYQQKPGKAPKL...,ARDLDYYGMDV,QQVNSYPPIT,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,True,False
2,EVQLVESGGGLVQPGGSLRLSCAASGFTVSSHYMSWVRQAPGKGLE...,AIQLTQSPSSLSASVGDRVTITCRASQGISSYLAWYQQKPGKAPKL...,ARDSSWGPGYYGLDV,QQLNSLFT,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,True,True
3,QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYTITWVRQAPGQGLE...,QSLLTQPPSVSGAPGQRVTISCTGSNSNIGAGYDVHWYQQLPGTAP...,ARERGYSSSSSAWYFDL,QSYDSSLTGSL,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,True,True
4,QVQLVESGGGVVQPGRSLRLSCAASGFTFSNFAMYWVRQAPGKGLE...,SYELTQPPSVSVSPGQTARITCSGDALPKQYAYWYQKKPGQAPVLV...,ARDLEGEQWLLRDDYYYYYGMDV,QSADSSGTYRV,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,True,True


In [16]:
# Explode for not-neutralizing dataset
not_neutralizing_dataset = covabdab_df_normalized.copy()
not_neutralizing_dataset = not_neutralizing_dataset.dropna(subset=['Not Neutralising Vs'], how='all')
not_neutralizing_dataset = not_neutralizing_dataset.explode('Not Neutralising Vs')
# Preserve the weak neutralizing property -- for not neutralizing, we set both flags to False
not_neutralizing_dataset['weak_neutralisation'] = False
not_neutralizing_dataset['neutralising'] = False
# Drop the (weak) notation in the column for mapping later
not_neutralizing_dataset['Not Neutralising Vs'] = not_neutralizing_dataset['Not Neutralising Vs'].str.replace(' (weak)',"",case=False,regex=False)
not_neutralizing_dataset['Not Neutralising Vs'] = not_neutralizing_dataset['Not Neutralising Vs'].str.replace('(weak)',"",case=False,regex=False)
# Rename columns
not_neutralizing_dataset = not_neutralizing_dataset.rename(columns={'Not Neutralising Vs':'virus_type','VHorVHH':'heavy_chain','VL':'light_chain','CDRH3':'cdrh3','CDRL3':'cdrl3'})
# Merge the virus spike protein with the dataset
not_neutralizing_dataset = not_neutralizing_dataset.merge(spike_protein_variant_mapping,left_on="virus_type",right_on="mapped_virus_name")
not_neutralizing_dataset = not_neutralizing_dataset.rename(columns={'protein_sequence':'virus_sequence'})
# Retain columns of interest only -- we retain the CDRH3 and CDRL3 information as it may be of interest to us
not_neutralizing_dataset = not_neutralizing_dataset[['heavy_chain','light_chain','cdrh3','cdrl3','virus_type','virus_sequence','neutralising','weak_neutralisation']]

not_neutralizing_dataset.head()

Unnamed: 0,heavy_chain,light_chain,cdrh3,cdrl3,virus_type,virus_sequence,neutralising,weak_neutralisation
0,QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLE...,DIQMTQSPSSLSASVGDRVTIICRASQYINSFLNWYQQKPGQAPKL...,ARGALRFLEWPILAY,QQSYSTPPYT,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,False,False
1,EGQLVESGGRLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLE...,EIVLTQSPPTLSVSPGETATLSCRASLSLDRNLAWYQQKPGQAPRL...,VKDGNSDYGGNPI,QNYNKWPPLFT,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,False,False
2,QVQLVESGGGVVQPGRSLRLSCAAPGLSFRNYGMHWVRQAPGKGLE...,SYVLTQPPSVSVSPGQTARITCGGNNIGGKSVHWYQQKPGQAPVLI...,ARDRSGKDVLTGYPMFPAGMDV,QVWDVSSDHVV,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,False,False
3,QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLE...,DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPG...,AREGYSYGTSWLGSSYYYYMDV,MQALQTPVRT,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,False,False
4,EVQVVESGGGLVQPGGSLRLSCAASGFPLSSYDMHWVRQRTGKGLE...,QSALTQPASVSGSPGQSITISCTGTSSDVGGFHYVSWYQQHPGKAP...,ARAQGYCGGGSCYSAYAFAI,SSYTSSSTWV,SARS-CoV2_WT,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...,False,False
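
Both merges above use the pandas default (`how="inner"`), so any `virus_type` label without a match in `spike_protein_variant_mapping` is dropped silently, and any label that appears on several mapping rows multiplies the antibody rows (the mapping preview shown earlier has several rows sharing one `mapped_virus_name`). The sketch below is one possible way to audit this on toy data; the frames `pairs` and `mapping` and all of their values are hypothetical stand-ins, not the notebook's actual tables.

```python
import pandas as pd

# Toy stand-ins for the exploded dataset and the variant mapping (illustrative values)
pairs = pd.DataFrame({
    "heavy_chain": ["H1", "H2", "H3"],
    "virus_type": ["SARS-CoV2_WT", "SARS-CoV2_Beta", "SARS-CoV2_Unmapped"],
})
mapping = pd.DataFrame({
    "mapped_virus_name": ["SARS-CoV2_WT", "SARS-CoV2_Beta", "SARS-CoV2_Beta"],
    "protein_sequence": ["SEQ_WT", "SEQ_BETA_A", "SEQ_BETA_B"],
})

# A left merge with indicator=True keeps unmatched rows and flags them in "_merge"
audit = pairs.merge(
    mapping,
    left_on="virus_type",
    right_on="mapped_virus_name",
    how="left",
    indicator=True,
)

# Labels that an inner merge would have dropped
unmatched = audit.loc[audit["_merge"] == "left_only", "virus_type"].unique().tolist()

# Labels that map to more than one distinct sequence and therefore duplicate rows
seqs_per_label = mapping.groupby("mapped_virus_name")["protein_sequence"].nunique()
ambiguous = seqs_per_label[seqs_per_label > 1].index.tolist()

print("unmatched labels:", unmatched)
print("labels with several sequences:", ambiguous)
```

Whether duplicated rows are desirable (several spike constructs per variant) or should be reduced to one sequence per variant is a modelling decision; the audit only makes the effect visible.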


In [17]:
# Exports for datasets
neutralizing_dataset.to_csv("training_positive.csv",index=False)
not_neutralizing_dataset.to_csv("training_negative.csv",index=False)
# Concat both datasets together
combined_dataset = pd.concat([neutralizing_dataset,not_neutralizing_dataset])
combined_dataset.to_csv("training_combined.csv",index=False)
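
Before training on the combined file, it may be worth checking whether the same antibody-virus pair carries both a positive and a negative label. The sketch below shows one way to do this on toy data; the frames `pos` and `neg`, their values, and the choice of `heavy_chain`, `light_chain` and `virus_type` as the pair key are assumptions made for illustration.

```python
import pandas as pd

# Toy positive and negative sets (illustrative values only)
pos = pd.DataFrame({
    "heavy_chain": ["H1", "H2"],
    "light_chain": ["L1", "L2"],
    "virus_type": ["SARS-CoV2_WT", "SARS-CoV2_WT"],
    "neutralising": [True, True],
})
neg = pd.DataFrame({
    "heavy_chain": ["H2", "H3"],
    "light_chain": ["L2", "L3"],
    "virus_type": ["SARS-CoV2_WT", "SARS-CoV2_Beta"],
    "neutralising": [False, False],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([pos, neg], ignore_index=True)

# Count distinct labels per pair; dropna=False keeps pairs with a missing chain
key = ["heavy_chain", "light_chain", "virus_type"]
labels_per_pair = combined.groupby(key, dropna=False)["neutralising"].nunique()
conflicts = labels_per_pair[labels_per_pair > 1].reset_index()[key]

print(conflicts)
```

Any pairs listed in `conflicts` have contradictory labels and would need a rule (for example, dropping them or keeping one label) before the dataset is used.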
