# DeepTropism Notebook 1 - Data Wrangling
In this notebook we start organizing our HIV-1 V3 loop dataset to develop a deep learning model.<br>
The goal of the model is to predict the tropism of the virus based solely on the amino acid sequence of the HIV-1 gp120 V3 loop.

In [45]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import torchvision
import matplotlib.pyplot as plt
import os
import time

## Creating one Dataframe with all the sequences from the different datasets

In [2]:
# Define path to local repository
PATH = '/home/gabriel/Documents/Repos'

In [3]:
df_newdb = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/newdb_all.tsv', sep='\t',
                       names=['seq_name', 'dataset', 'label', 'sequence'])
df_webpssm = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/webpssm_all.tsv',sep='\t',
                       names=['seq_name', 'dataset', 'label', 'sequence'])
df_hivcopred = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/hivcopred_all.tsv',sep='\t',
                       names=['seq_name', 'dataset', 'label', 'sequence'])

In [4]:
df_newdb.head()

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC


In [5]:
df_webpssm.head()

Unnamed: 0,seq_name,dataset,label,sequence
0,95ZW84_ZW_C_NSI_u20_BATRA_(2000),webpssm,CCR5,CTRPNNNTRKSMRIGPGQTFYATGDIIGDIRQAHC
1,95ZW295_ZW_C_NSI_u21_BATRA_(2000),webpssm,CCR5,CTRPNNNTRKSMRIGPGQVFYATDGIIGDIRQAHC
2,95ZW377_ZW_C_NSI_u22_BATRA_(2000),webpssm,CCR5,CTRPSNNTRKSIRIGPGQTFYATNDIIGDIRQAHC
3,95ZW530_ZW_C_NSI_u23_BATRA_(2000),webpssm,CCR5,CTRPGNNTRKSIRIGPGQAFFATGDIIGDIRQAHC
4,95ZW560_ZW_C_NSI_u24_BATRA_(2000),webpssm,CCR5,CTRPGNNTRKSIRIGPGQTFYAANGIIGDIRQAHC


In [6]:
df_hivcopred.head()

Unnamed: 0,seq_name,dataset,label,sequence
0,RFJ977091,hivcopred,CCR5,CARPGNNTKKSVRIGPGQTFYATGDIIGDIRQAHC
1,RFJ977094,hivcopred,CCR5,CARPGNNTRKSVRIGPGQAFYATGDIIGDIRQAHC
2,RDQ382364,hivcopred,CCR5,CARPGNNTRKSVRIGPGQTFFATGDIIGDIRKAHC
3,RFJ376003,hivcopred,CCR5,CARPGNNTRKSXRIGPGQSFHATGEIIGNIREAHC
4,RDQ382371,hivcopred,CCR5,CARPGNNTRRSVRIGPGQAFYATGEIIGDIRKAHC


These three datasets were already separated into different lists based on the tropism classification. <br>
For the Geno2pheno and CM datasets, the classification is encoded in the sequence name, so we need to extract it.


In [7]:
df_cm = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/cm.tsv', sep='\t',
                       names=['seq_name', 'dataset', 'sequence'])
df_g2p = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/g2p_str.tsv',sep='\t',
                       names=['seq_name', 'dataset', 'sequence'])

In [8]:
df_cm.head()

Unnamed: 0,seq_name,dataset,sequence
0,-.HM246206.A.CCR5,cm,CVRPNNNTKKSVIGPGQTYANNIIGDIRKAC
1,ACH142.HQ644967.B.CCR5,cm,CTRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAHC
2,TH020.U08754.01_AE.CCR5,cm,CTRPFNNTRTSLTIGPGQVFYRTGDIIGDIRKAYC
3,CW012.AJ418502.B.CCR5,cm,CTRLNNNTRKSIHMGPGRAFYTTGEIIGDIRQAHC
4,BP00069.JN687773.B.CCR5,cm,CTRPYNNTRRSIPIGPGRAFYATGEVIGNIRKAYC


In [9]:
df_g2p.head()

Unnamed: 0,seq_name,dataset,sequence
0,CCR5_1471_29187_CN_2003_B,geno2pheno,CTQTQQQYK-K-KYTSR-------TRASMVCNR---RNNRR---YK...
1,CCR5_AM262114_21502_FR_1995_O,geno2pheno,CVRPGSN-S-V-QEIKI---GP---MAWYSMQL---EQDGKRANAR...
2,CCR5_BCF02_13870_FR_1990_O,geno2pheno,CQRPGHQ-T-V-QEIRI---GP---MAWYS-MG---LAAGNGSESR...
3,CCR5_CA9_357_CM_1993_O,geno2pheno,CERPGNH-T-V-QEIRI---GP---LAWYS-MGIEKNSKNS---SR...
4,CCR5_BCF01_572_FR_1990_O,geno2pheno,CHRPGNL-S-V-QEMKI---GP---LSWYS-MG---LAANSSIKSR...


To make the two remaining datasets easier to process, we concatenate them.

In [10]:
df_g2p_cm = pd.concat([df_cm, df_g2p])

In [11]:
# Print sizes
print(df_cm.shape)
print(df_g2p.shape)
print(df_g2p_cm.shape)

(2679, 3)
(1188, 3)
(3867, 3)
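
By default `pd.concat` keeps each input's original row index, so the combined Dataframe contains repeated index values until the index is reset later in the notebook. A minimal sketch with toy frames (the names `a`, `b`, `kept` and `fresh` are illustrative, not from the notebook) shows the effect and the `ignore_index=True` option:

```python
import pandas as pd

# Toy stand-ins for df_cm and df_g2p (values are illustrative only)
a = pd.DataFrame({'seq_name': ['s1.CCR5', 's2.CXCR4'], 'dataset': ['cm', 'cm']})
b = pd.DataFrame({'seq_name': ['CCR5_s3'], 'dataset': ['geno2pheno']})

kept = pd.concat([a, b])                      # original indices are preserved
fresh = pd.concat([a, b], ignore_index=True)  # a new 0..n-1 index is built

print(kept.index.tolist())
print(fresh.index.tolist())
```

Repeated index values are harmless for the steps that follow, but they matter for any later selection by index label.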


In [12]:
df_g2p_cm.head(10)

Unnamed: 0,seq_name,dataset,sequence
0,-.HM246206.A.CCR5,cm,CVRPNNNTKKSVIGPGQTYANNIIGDIRKAC
1,ACH142.HQ644967.B.CCR5,cm,CTRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAHC
2,TH020.U08754.01_AE.CCR5,cm,CTRPFNNTRTSLTIGPGQVFYRTGDIIGDIRKAYC
3,CW012.AJ418502.B.CCR5,cm,CTRLNNNTRKSIHMGPGRAFYTTGEIIGDIRQAHC
4,BP00069.JN687773.B.CCR5,cm,CTRPYNNTRRSIPIGPGRAFYATGEVIGNIRKAYC
5,I.DQ061525.B.CCR5,cm,CIRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAHC
6,500.HQ377462.B.CCR5,cm,CTRPNNNTRKSISMGPGRAFYATGGIIGNIRQAHC
7,Pat1.AF541040.B.CCR5,cm,CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC
8,CMP013.JX140646.02_AG.CCR5,cm,CMRPNNNTRESVRIGPGQAFYATGEIIGDIRQAHC
9,U.DQ061827.B.CCR5,cm,CTRPNNNTRKGIHMGPGKVFYATGQIIGDIRQAHC


Check that every row of the df_g2p_cm Dataframe has 'CCR5' or 'CXCR4' in its sequence name. An empty result means no row is missing a label.

In [13]:
df_g2p_cm[~((df_g2p_cm.seq_name.str.contains('CCR5'))|
          (df_g2p_cm.seq_name.str.contains('CXCR4')))]

Unnamed: 0,seq_name,dataset,sequence


In [14]:
def get_label(row):
    """
    Function to return co-receptor type based on seq_name
    
    Parameters
     - row (Series): A row of a Dataframe containing information for sample
    
    return (string): A type of co-receptor: 'R5X4', 'CCR5', 'CXCR4'
    
    """
    if 'CCR5' in row['seq_name'] and 'CXCR4' in row['seq_name']:
        return 'R5X4'
    elif 'CCR5' in row['seq_name']:
        return 'CCR5'
    elif 'CXCR4' in row['seq_name']:
        return 'CXCR4'
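
The same rule can also be written without a row-wise `apply`. The sketch below is one possible vectorised variant, not the notebook's method; the third sequence name and the `'unknown'` fallback are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'seq_name': ['ACH142.HQ644967.B.CCR5',
                                 'X138.EU074781.BG.CXCR4',
                                 'hypothetical.CCR5.CXCR4']})  # made-up dual-tropic name

has_r5 = toy['seq_name'].str.contains('CCR5', regex=False)
has_x4 = toy['seq_name'].str.contains('CXCR4', regex=False)

# np.select uses the first condition that is True for each row,
# so the dual-tropic check has to come before the single-receptor checks
toy['label'] = np.select([has_r5 & has_x4, has_r5, has_x4],
                         ['R5X4', 'CCR5', 'CXCR4'],
                         default='unknown')
print(toy)
```

Unlike `get_label`, which returns `None` when neither receptor is found, this variant marks such rows with an explicit placeholder.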

Apply get_label to df_g2p_cm

In [15]:
df_g2p_cm['label'] =  df_g2p_cm.apply(get_label, axis=1)

# Reorder columns
df_g2p_cm = df_g2p_cm[['seq_name', 'dataset', 'label', 'sequence']]

In [16]:
df_g2p_cm.head(10)

Unnamed: 0,seq_name,dataset,label,sequence
0,-.HM246206.A.CCR5,cm,CCR5,CVRPNNNTKKSVIGPGQTYANNIIGDIRKAC
1,ACH142.HQ644967.B.CCR5,cm,CCR5,CTRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAHC
2,TH020.U08754.01_AE.CCR5,cm,CCR5,CTRPFNNTRTSLTIGPGQVFYRTGDIIGDIRKAYC
3,CW012.AJ418502.B.CCR5,cm,CCR5,CTRLNNNTRKSIHMGPGRAFYTTGEIIGDIRQAHC
4,BP00069.JN687773.B.CCR5,cm,CCR5,CTRPYNNTRRSIPIGPGRAFYATGEVIGNIRKAYC
5,I.DQ061525.B.CCR5,cm,CCR5,CIRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAHC
6,500.HQ377462.B.CCR5,cm,CCR5,CTRPNNNTRKSISMGPGRAFYATGGIIGNIRQAHC
7,Pat1.AF541040.B.CCR5,cm,CCR5,CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC
8,CMP013.JX140646.02_AG.CCR5,cm,CCR5,CMRPNNNTRESVRIGPGQAFYATGEIIGDIRQAHC
9,U.DQ061827.B.CCR5,cm,CCR5,CTRPNNNTRKGIHMGPGKVFYATGQIIGDIRQAHC


Now that all Dataframes have labels, we concatenate them into one main Dataframe.

In [17]:
df_datasets = pd.concat([df_newdb,df_webpssm,df_hivcopred, df_g2p_cm])

In [18]:
df_datasets.shape

(9550, 4)

In [19]:
df_datasets.head(10)

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC
5,RAB023804,newdb,CCR5,CTRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAHC
6,RAB287376,newdb,CCR5,CVRPNNNTRTSVRIGPGQTFYATGEIIGDIRQAFC
7,RAB553911,newdb,CCR5,CERPNNNTRRSIQIGPGRAWFEAEDIIGDIRKAHC
8,RAB553912,newdb,CCR5,CTRPNDNTRKSINIAPGRAFYATGDIIGDIRQAHC
9,RAB553913,newdb,CCR5,CTRPNNNTRKGIHMGPGRAIYTTDIIGDIRQAHC


In [20]:
df_datasets_validation = df_datasets[df_datasets.label == 'validation']
df_datasets_validation.head(10)

Unnamed: 0,seq_name,dataset,label,sequence
279,C.ZM.89.ZM20__phen_SI,webpssm,validation,CARPGNNTRKSIRIGPGQTFFATGAIIGDIRQAHC
280,C.ZW.01.TC28_2__phen_SI,webpssm,validation,CGRPNNHRIKGLRIGPGRAFFAMGAIGGEIRQAHC
281,C.ZW.01.TC03_1__phen_SI,webpssm,validation,CIRPGNNTSKSIRIGQRRPVYVH-KIIGDIRQAHC
282,C.ET.97.PHD79C1__phen_SI,webpssm,validation,CIRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAHC
283,C.ZW.01.TC28_1__phen_SI,webpssm,validation,CMRPNNNTRKSVRIGPGQTFFATGAIIGNIRQAHC
284,AC.RW.92.92RW009_di1sCD__phen_SI,webpssm,validation,CPRPNNNTRKSVHIGPGQAFYATGDVIGDIRQAYC
285,AC.RW.92.92RW009_1gCR_AC.RW.92.92RW009_1gER_AC...,webpssm,validation,CSRPNNNTRKSVHIGPGQAFYATGDVIGDIRQAYC
286,C.ZW.01.TC22__phen_SI,webpssm,validation,CTRPGNKTRQSIRIGRGQSFHATGAIIGDIRKAYC
287,C.ZW.01.TC30__phen_SI,webpssm,validation,CTRPGNNT-----IGPGRTFYATDRIIGDIRQAHC
288,C.ZW.01.TC29__phen_SI,webpssm,validation,CTRPGNNTRKGLRIGPGRTIYATEVTVGDIRQAYC


In [21]:
df_datasets_validation.shape

(71, 4)

In [22]:
def label_validation_dataset(row):
    """
    Function to return co-receptor type based on seq_name for Webpssm 
    
    Parameters
     - row (Series): A row of a Dataframe containing information for sample
    
    return (string): A type of co-receptor: 'CCR5', 'CXCR4'
    
    """
    if 'NSI' in row.seq_name:
        return 'CCR5'
    elif 'SI' in row.seq_name:
        return 'CXCR4'
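
The order of the two checks matters because 'SI' is a substring of 'NSI'. A small sketch of a stricter variant follows; `label_from_phenotype` is a hypothetical helper, and it assumes the phenotype is always the final token of the name, which the truncated names in the table above do not confirm.

```python
def label_from_phenotype(seq_name):
    """Map a trailing '_NSI' / '_SI' phenotype tag to a co-receptor label."""
    if seq_name.endswith('_NSI'):
        return 'CCR5'
    if seq_name.endswith('_SI'):
        return 'CXCR4'
    return None  # no recognisable phenotype tag

# A plain substring test cannot tell the two phenotypes apart
print('SI' in 'C.ZW.01.TC33__phen_NSI')
print(label_from_phenotype('C.ZW.01.TC33__phen_NSI'))
print(label_from_phenotype('C.ZM.89.ZM20__phen_SI'))
```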

In [23]:
df_datasets_validation['label'] = df_datasets_validation.apply(label_validation_dataset, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [24]:
df_datasets_validation

Unnamed: 0,seq_name,dataset,label,sequence
279,C.ZM.89.ZM20__phen_SI,webpssm,CXCR4,CARPGNNTRKSIRIGPGQTFFATGAIIGDIRQAHC
280,C.ZW.01.TC28_2__phen_SI,webpssm,CXCR4,CGRPNNHRIKGLRIGPGRAFFAMGAIGGEIRQAHC
281,C.ZW.01.TC03_1__phen_SI,webpssm,CXCR4,CIRPGNNTSKSIRIGQRRPVYVH-KIIGDIRQAHC
282,C.ET.97.PHD79C1__phen_SI,webpssm,CXCR4,CIRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAHC
283,C.ZW.01.TC28_1__phen_SI,webpssm,CXCR4,CMRPNNNTRKSVRIGPGQTFFATGAIIGNIRQAHC
...,...,...,...,...
345,C.ZW.01.TC33__phen_NSI,webpssm,CCR5,CTRPNNNTRTSVRIGPGQAFYATGDIIGDIRQAHC
346,C.FR.93.FRMP37__phen_NSI,webpssm,CCR5,CTRPSNNTRKSIRIGPGQAFYATNGIIGDIRAAHC
347,C.ZW.01.TC32__phen_NSI,webpssm,CCR5,CTRPSNNTRKSVWLGPGRAFYT-NKVIGNIRKAHC
348,C.FR.91.FRMP197__phen_NSI,webpssm,CCR5,CTRPYNNTRQSIRIGPGQTFYATGDIIGDIRKAHC


In [25]:
# Save validation dataset to TSV
df_datasets_validation.to_csv(f'{PATH}/DeepTropism/datasets/webpssm_validation_labeled.tsv', sep='\t')

In [26]:
# Now concatenate the parsed df_datasets_validation to df_datasets
df_datasets_final = pd.concat([df_datasets[df_datasets.label != 'validation'], df_datasets_validation])

In [27]:
df_datasets_final.head()

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC


In [28]:
df_datasets_final.shape

(9550, 4)

Now we remove the '-' gap characters from the sequences so that duplicated sequences can be identified.<br>
First we check the number of sequences containing '-' in the Dataframe.

In [29]:
df_datasets_final[df_datasets_final.sequence.str.contains('-')].shape

(1225, 4)

In [30]:
df_datasets_final[df_datasets_final.sequence.str.contains('-')].head(10)

Unnamed: 0,seq_name,dataset,label,sequence
108,TV013_ZA_C_NSI/CCR5_u125_TREURNICHT_(2002),webpssm,CCR5,CTRPNNNTRRSIRIGPGQAFY-TNDIIGDIRQAHC
127,98TZ013_TZ_C_CCR5_u144_RODENBURG_(2001),webpssm,CCR5,CTRPGNNTRKSVRIGPGQTFY-TNDIIGDIRQAYC
145,S018_MW_C_CCR5_u162_PING_(1999),webpssm,CCR5,CVRPNNNTRKSIRIGPGQTFYA-NDIIGDIRQAHC
153,S031_MW_C_CCR5_u170_PING_(1999),webpssm,CCR5,CTRPNNNTRKSIRIGPGQTFYA-NDIIGDIRQAHC
156,S180_MW_C_CCR5_u173_PING_(1999),webpssm,CCR5,CTRPGNNTRTSIRIGPGQTFFANN-IIGDIRQAHC
171,DU179MAY99U-R5_ZA_C_CCR5_u19_NICD_(UNPUBL),webpssm,CCR5,CTRPGNNTRKSIRIGPGQAFY-TNHIIGDIRQAYC
203,TM3__ZA_C_NSI/CCR5_u194_CHOGE_(IN_PRESS),webpssm,CCR5,CTRPGNNTRKSIRIGPGQTFYA-NDIIGDIRQAYC
207,TM10__ZA_C_NSI/CCR5_u198_CHOGE_(IN_PRESS),webpssm,CCR5,CTRPNNNTRKSIRIGPGQTFYATN-IIGDIRQAYC
216,TM31__ZA_C_NSI/CCR5_u207_CHOGE_(IN_PRESS),webpssm,CCR5,CTRPGSNTRRSIRIGPGQAFY-TQDIIGDIRQAHC
228,95ZW748_ZW_C_SI_u1_BATRA_(2000),webpssm,CXCR4,CTRPNNNVRKHIRIGIGKVFYA-NDIIGDIRQARC


In [31]:
df_datasets_final['sequence'] = df_datasets_final['sequence'].str.replace('-', '', regex=False)

Check if the replace worked:

In [32]:
df_datasets_final[df_datasets_final.sequence.str.contains('-')].shape

(0, 4)

Now that our sequences don't contain '-', we can drop the duplicated sequences to avoid repeated data in our training set.
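
The cells in this section do not show the deduplication call itself, so the following is only a sketch of how it could look on toy data. The choice of `keep='first'` is an assumption; identical sequences that carry conflicting labels would need separate handling.

```python
import pandas as pd

toy = pd.DataFrame({
    'seq_name': ['s1', 's2', 's3'],
    'label': ['CCR5', 'CCR5', 'CXCR4'],
    'sequence': ['CTRPNN-TRK', 'CTRPNNTRK', 'CTRPGNTRK'],
})

# Remove gap characters first so that 's1' and 's2' become identical
toy['sequence'] = toy['sequence'].str.replace('-', '', regex=False)

# keep='first' retains the first occurrence of each distinct sequence
unique = toy.drop_duplicates(subset='sequence', keep='first').reset_index(drop=True)
print(unique)
```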

In [66]:
df_datasets_final.shape

(9550, 4)

In [67]:
df_datasets_final.head(20)

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC
5,RAB023804,newdb,CCR5,CTRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAHC
6,RAB287376,newdb,CCR5,CVRPNNNTRTSVRIGPGQTFYATGEIIGDIRQAFC
7,RAB553911,newdb,CCR5,CERPNNNTRRSIQIGPGRAWFEAEDIIGDIRKAHC
8,RAB553912,newdb,CCR5,CTRPNDNTRKSINIAPGRAFYATGDIIGDIRQAHC
9,RAB553913,newdb,CCR5,CTRPNNNTRKGIHMGPGRAIYTTDIIGDIRQAHC


In [33]:
df_datasets_final = df_datasets_final.reset_index(drop=True)
df_datasets_final

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC
...,...,...,...,...
9545,C.ZW.01.TC33__phen_NSI,webpssm,CCR5,CTRPNNNTRTSVRIGPGQAFYATGDIIGDIRQAHC
9546,C.FR.93.FRMP37__phen_NSI,webpssm,CCR5,CTRPSNNTRKSIRIGPGQAFYATNGIIGDIRAAHC
9547,C.ZW.01.TC32__phen_NSI,webpssm,CCR5,CTRPSNNTRKSVWLGPGRAFYTNKVIGNIRKAHC
9548,C.FR.91.FRMP197__phen_NSI,webpssm,CCR5,CTRPYNNTRQSIRIGPGQTFYATGDIIGDIRKAHC


In [69]:
df_datasets_final = df_datasets_final[['seq_name', 'dataset', 'label', 'sequence']]
df_datasets_final

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC
...,...,...,...,...
9545,C.ZW.01.TC33__phen_NSI,webpssm,validation,CTRPNNNTRTSVRIGPGQAFYATGDIIGDIRQAHC
9546,C.FR.93.FRMP37__phen_NSI,webpssm,validation,CTRPSNNTRKSIRIGPGQAFYATNGIIGDIRAAHC
9547,C.ZW.01.TC32__phen_NSI,webpssm,validation,CTRPSNNTRKSVWLGPGRAFYTNKVIGNIRKAHC
9548,C.FR.91.FRMP197__phen_NSI,webpssm,validation,CTRPYNNTRQSIRIGPGQTFYATGDIIGDIRKAHC


In [34]:
# Create TSV file from df_datasets_final
df_datasets_final.to_csv(f'{PATH}/DeepTropism/datasets/all_datasets_raw.tsv', sep='\t')

# Create fasta file from df_datasets_final
with open(f'{PATH}/DeepTropism/datasets/dataset_all_seqs.fasta', 'w') as f:
    for index, row in df_datasets_final.iterrows():
        f.write(f'>{row.seq_name}|{row.dataset}|{row.label}\n')
        f.write(f'{row.sequence}\n')

In [50]:
df_datasets_final

Unnamed: 0,seq_name,dataset,label,sequence
0,RAB014775,newdb,CCR5,CTRPSNNTRTGITIGPGQVWYRTGDIIGDIRKAYC
1,RAB014776,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRQAYC
2,RAB014778,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGDIRKAYC
3,RAB014781,newdb,CCR5,CTRPSNNTRTSVTIGPGQVWYRTGDIIGDIRQAYC
4,RAB014834,newdb,CCR5,CTRPSNNTRTSITIGPGQVWYRTGDIIGNIRKAYC
...,...,...,...,...
9545,C.ZW.01.TC33__phen_NSI,webpssm,validation,CTRPNNNTRTSVRIGPGQAFYATGDIIGDIRQAHC
9546,C.FR.93.FRMP37__phen_NSI,webpssm,validation,CTRPSNNTRKSIRIGPGQAFYATNGIIGDIRAAHC
9547,C.ZW.01.TC32__phen_NSI,webpssm,validation,CTRPSNNTRKSVWLGPGRAFYT-NKVIGNIRKAHC
9548,C.FR.91.FRMP197__phen_NSI,webpssm,validation,CTRPYNNTRQSIRIGPGQTFYATGDIIGDIRKAHC


# Creating alignment using Los Alamos HIV-1 Sequence Compendium
The HIV-1 envelope V3 loop is a highly diverse region, and its evolution does not follow a point substitution model; instead it is shaped by a poorly understood process that involves recombination and polymerase slippage. In other words, the theoretical basis of alignment doesn't apply, except for very closely related sequences and the less variable boundaries.<br>
[Los Alamos National Laboratory](https://www.hiv.lanl.gov/content/index) offers an [HIV compendium](https://www.hiv.lanl.gov/content/sequence/HIV/COMPENDIUM/compendium.html) with curated alignments made by specialists, following rigorous criteria. To get a better alignment of our dataset, we use the Compendium HIV-1 alignment profile as reference.<br>
In df_datasets_final, sequence lengths range from 21 to 39 residues. Muscle's profile method takes two Multiple Sequence Alignments (MSAs), and within each one all sequences must have the same length. So we generate one fasta file per sequence length from df_datasets_final and progressively align each file to the Compendium reference alignment.
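
As a rough illustration of the splitting step, the sketch below groups a toy Dataframe by sequence length and writes one fasta file per length. The toy rows, the temporary directory and the `written` dictionary are assumptions made for the example; only the header layout `name|dataset|label` and the file-name pattern come from this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    'seq_name': ['s1', 's2', 's3'],
    'dataset': ['cm', 'cm', 'newdb'],
    'label': ['CCR5', 'CXCR4', 'CCR5'],
    'sequence': ['CTRPNNTRK', 'CTRPGNTRK', 'CTRPNNNTRK'],
})

out_dir = Path(tempfile.mkdtemp())  # scratch directory for the illustration
written = {}

# groupby sorts its keys, so files are produced from shortest to longest
for length, group in toy.groupby(toy['sequence'].str.len()):
    path = out_dir / f'dataset_seqlen_{length}.fasta'
    with open(path, 'w') as handle:
        for row in group.itertuples(index=False):
            handle.write(f'>{row.seq_name}|{row.dataset}|{row.label}\n{row.sequence}\n')
    written[int(length)] = len(group)

print(written)
```

Each resulting file holds sequences of one length only, so it can be passed to a profile alignment as a trivially aligned input.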

In [35]:
# Get the set of sequence lengths in df_datasets_final
seq_len_list = set(df_datasets_final['sequence'].apply(len))
seq_len_list

{21, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39}

In [40]:
# Get number of sequences for each length
for l in sorted(seq_len_list):
    n = df_datasets_final[df_datasets_final.sequence.str.len() == l].shape[0]
    print(f'Number of samples with sequence length equal to {l} => {n}')

Number of samples with sequence length equal to 21 => 1
Number of samples with sequence length equal to 28 => 2
Number of samples with sequence length equal to 29 => 12
Number of samples with sequence length equal to 30 => 13
Number of samples with sequence length equal to 31 => 33
Number of samples with sequence length equal to 32 => 44
Number of samples with sequence length equal to 33 => 166
Number of samples with sequence length equal to 34 => 1461
Number of samples with sequence length equal to 35 => 7600
Number of samples with sequence length equal to 36 => 91
Number of samples with sequence length equal to 37 => 107
Number of samples with sequence length equal to 38 => 19
Number of samples with sequence length equal to 39 => 1


In [46]:
# Iterate over seq_len_list, creating fasta files and aligning to Compendium Reference alignment
ref_fasta = f'{PATH}/DeepTropism/datasets/hiv_compedium_v3loop_aligned.fasta'
for l in sorted(seq_len_list):
    print(f'For sequence length = {l}')
    df_len = df_datasets_final[df_datasets_final.sequence.str.len() == l]
    current_fasta = f'{PATH}/DeepTropism/datasets/dataset_seqlen_{l}.fasta'
    if l == 39:
        out_fasta = f'{PATH}/DeepTropism/datasets/dataset_profile_final.fasta'
    else:    
        out_fasta = f'{PATH}/DeepTropism/datasets/dataset_profile_{l}.fasta'
    with open(current_fasta, 'w') as f:
        for index, row in df_len.iterrows():
            f.write(f'>{row.seq_name}|{row.dataset}|{row.label}\n')
            f.write(f'{row.sequence}\n')
            
    # Use Muscle aligner with Compendium alignment as profile.
    # os.system blocks until Muscle exits, so no extra wait is needed afterwards.
    os.system(f'/home/gabriel/Documents/Bioinformatics/muscle3.8.31_i86linux64 -profile -in1 {ref_fasta} '
              f'-in2 {current_fasta} -out {out_fasta}')
    print(f'Finished running Muscle profile for {current_fasta}')
    # The output becomes the reference profile for the next sequence length
    ref_fasta = out_fasta
    

For sequence length = 21
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_21.fasta
For sequence length = 28
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_28.fasta
For sequence length = 29
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_29.fasta
For sequence length = 30
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_30.fasta
For sequence length = 31
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_31.fasta
For sequence length = 32
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_32.fasta
For sequence length = 33
Finished running Muscle profile for /home/gabriel/Documents/Repos/DeepTropism/datasets/dataset_seqlen_33.fasta
For sequence length = 34
Finished running Muscle

We now have a file, dataset_profile_final.fasta, with our sequences aligned to the Compendium profile.<br>
We just need to replace a few characters in the fasta file to turn it into a TSV, then load it as a Dataframe to continue our analysis.

In [48]:
with open(f'{PATH}/DeepTropism/datasets/dataset_profile_final.fasta', 'r') as fasta_in, \
     open(f'{PATH}/DeepTropism/datasets/processed_tsv/dataset_profile_final.tsv', 'w') as tsv_out:
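    for line in fasta_in:
        if line.startswith('>'):
            # Header: drop '>', turn '|' separators and the newline into tabs
            # so the sequence on the next line becomes the last TSV column
            new_line = line.replace('|', '\t').replace('\n', '\t').replace('>', '')
            tsv_out.write(new_line)
        else:
            # Sequence line: written as-is, closing the TSV row
            tsv_out.write(line)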
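
The conversion above relies on every record occupying exactly two lines (one header, one sequence). If an aligner wraps long sequences over several lines, a small parser is safer. The sketch below is a hypothetical alternative, not part of the original pipeline; `fasta_to_rows` and the example records are made up, and it assumes every header has the `name|dataset|label` layout.

```python
import io

def fasta_to_rows(handle):
    """Yield (seq_name, dataset, label, sequence) tuples from a fasta handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield (*header, ''.join(chunks))
            # rsplit keeps any '|' inside the sequence name intact
            header, chunks = line[1:].rsplit('|', 2), []
        else:
            chunks.append(line)
    if header is not None:
        yield (*header, ''.join(chunks))

example = io.StringIO('>s1|cm|CCR5\nCTRP-NN\nTRK\n>s2|newdb|CXCR4\nCTRPGNTRK\n')
rows = list(fasta_to_rows(example))
print(rows)
```

The resulting tuples can be passed to `pd.DataFrame(rows, columns=[...])` instead of going through an intermediate TSV file.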

## Loading the new aligned Dataset

In [49]:
df_datasets_profiled = pd.read_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/dataset_profile_final.tsv', sep='\t',
                                  names=['seq_name', 'dataset', 'label', 'sequence_aligned'])

We remove the LANL sequences: they have no labels and were only used to build the alignment.

In [50]:
df_datasets_profiled_final = df_datasets_profiled[df_datasets_profiled.dataset != 'LANL']

In [51]:
# Check the size of the sequence_aligned
set(df_datasets_profiled_final.sequence_aligned.apply(len))

{44}

In [52]:
df_datasets_profiled_final.head()

Unnamed: 0,seq_name,dataset,label,sequence_aligned
198,X138.EU074781.BG.CXCR4,cm,CXCR4,C-RPNN--TRKS------GPQ-----------YTIIGDIA---C
199,CCR5_AG1030_-_FR_-_02_AG,geno2pheno,CCR5,CSRPNNN-TRKSRI----GPGQTFYAT-----------DIGDQC
200,CCR5_AG1005_-_FR_-_02_AG,geno2pheno,CCR5,CTRPNNN-TRKSIH----PGRAFYATV-----------GPQAHC
201,-.FJ652339.02_AG.CCR5,cm,CCR5,CTRPNNNTRS--------VRIGPGQAF-------YAGDIGIQAC
202,-.FJ375998.C.CCR5,cm,CCR5,CRPNNTRKMR--------IGPGQTYAT-------GDIIGIRAHC


In [53]:
# Check number of duplicated sequences
df_datasets_profiled_final.duplicated(subset='sequence_aligned', keep=False).sum()

8765

In [54]:
df_datasets_profiled_final[df_datasets_profiled_final.sequence_aligned.str.contains('-')]

Unnamed: 0,seq_name,dataset,label,sequence_aligned
198,X138.EU074781.BG.CXCR4,cm,CXCR4,C-RPNN--TRKS------GPQ-----------YTIIGDIA---C
199,CCR5_AG1030_-_FR_-_02_AG,geno2pheno,CCR5,CSRPNNN-TRKSRI----GPGQTFYAT-----------DIGDQC
200,CCR5_AG1005_-_FR_-_02_AG,geno2pheno,CCR5,CTRPNNN-TRKSIH----PGRAFYATV-----------GPQAHC
201,-.FJ652339.02_AG.CCR5,cm,CCR5,CTRPNNNTRS--------VRIGPGQAF-------YAGDIGIQAC
202,-.FJ375998.C.CCR5,cm,CCR5,CRPNNTRKMR--------IGPGQTYAT-------GDIIGIRAHC
...,...,...,...,...
9743,DUR.AM262127.O.CCR5,cm,CCR5,CVRPGDNSVKEMRA----GPMAWYSME--LERNGSRTNSRTAFC
9744,DUR.X84327.O.CCR5,cm,CCR5,CVRPGNNSVQEIKI----GPMAWYSMQ--IEREGKGANSRTAFC
9745,CCR5_AM262114_21502_FR_1995_O,geno2pheno,CCR5,CVRPGSNSVQEIKI----GPMAWYSMQ--LEQDGKRANARTAFC
9746,CXCR4/GPR15_NDK_13796_CD_1983_D,geno2pheno,CXCR4,CTRPYKYTRQRTSI----GLRQSLYTI--TGKKKKTGYIGQAHC


In [55]:
# Get the set of aligned sequence lengths in df_datasets_profiled_final
set(df_datasets_profiled_final['sequence_aligned'].apply(len))

{44}

In [56]:
len(set(df_datasets_profiled_final.sequence_aligned.to_list()))

3608
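
The two counts above measure different things: `duplicated(keep=False)` counts every row whose sequence occurs more than once (8765), while the set size counts distinct sequences (3608). With 9550 rows in total, dropping duplicates would remove 9550 - 3608 = 5942 rows. A toy sketch of the three quantities (the series values are made up):

```python
import pandas as pd

seqs = pd.Series(['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCC'], name='sequence_aligned')

in_dup_group = seqs.duplicated(keep=False).sum()    # rows that share their sequence with another row
extra_copies = seqs.duplicated(keep='first').sum()  # rows beyond the first of each distinct sequence
n_unique = seqs.nunique()                           # number of distinct sequences

print(in_dup_group, extra_copies, n_unique)
```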

In [58]:
df_datasets_profiled_final[df_datasets_profiled_final.label == 'validation'].head()

Unnamed: 0,seq_name,dataset,label,sequence_aligned


In [59]:
df_datasets_profiled_final

Unnamed: 0,seq_name,dataset,label,sequence_aligned
198,X138.EU074781.BG.CXCR4,cm,CXCR4,C-RPNN--TRKS------GPQ-----------YTIIGDIA---C
199,CCR5_AG1030_-_FR_-_02_AG,geno2pheno,CCR5,CSRPNNN-TRKSRI----GPGQTFYAT-----------DIGDQC
200,CCR5_AG1005_-_FR_-_02_AG,geno2pheno,CCR5,CTRPNNN-TRKSIH----PGRAFYATV-----------GPQAHC
201,-.FJ652339.02_AG.CCR5,cm,CCR5,CTRPNNNTRS--------VRIGPGQAF-------YAGDIGIQAC
202,-.FJ375998.C.CCR5,cm,CCR5,CRPNNTRKMR--------IGPGQTYAT-------GDIIGIRAHC
...,...,...,...,...
9743,DUR.AM262127.O.CCR5,cm,CCR5,CVRPGDNSVKEMRA----GPMAWYSME--LERNGSRTNSRTAFC
9744,DUR.X84327.O.CCR5,cm,CCR5,CVRPGNNSVQEIKI----GPMAWYSMQ--IEREGKGANSRTAFC
9745,CCR5_AM262114_21502_FR_1995_O,geno2pheno,CCR5,CVRPGSNSVQEIKI----GPMAWYSMQ--LEQDGKRANARTAFC
9746,CXCR4/GPR15_NDK_13796_CD_1983_D,geno2pheno,CXCR4,CTRPYKYTRQRTSI----GLRQSLYTI--TGKKKKTGYIGQAHC


# Defining the label as numeric 
* 'CCR5' = 0 
* 'CXCR4' = 1 
* 'R5X4' = 1

In [61]:
# Function to encode labels as integers
def tropism_label(row):
    """
    Define numeric label, 'CCR5' as 0 
    and 'CXCR4' or 'R5X4' as 1
    """
    # For CCR5
    if str(row.label).strip() == 'CCR5':
        return 0
    # For CXCR4
    elif str(row.label).strip() == 'CXCR4':
        return 1
    # For R5X4
    elif str(row.label).strip() == 'R5X4':
        return 1
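
The same encoding can be expressed as a dictionary lookup, which also makes unexpected labels visible. This is only a sketch of an alternative; `LABEL_TO_INT` and the toy series are illustrative names and values.

```python
import pandas as pd

LABEL_TO_INT = {'CCR5': 0, 'CXCR4': 1, 'R5X4': 1}

labels = pd.Series(['CCR5', ' CXCR4', 'R5X4', 'validation'])

# Values missing from the dictionary become NaN instead of being silently skipped
encoded = labels.str.strip().map(LABEL_TO_INT)

print(encoded.tolist())
print(labels[encoded.isna()].tolist())  # labels that could not be encoded
```

Checking for NaN before casting to `int` matters here, because the cast fails if any label is left unmapped.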

In [62]:
df_datasets_profiled_final['label_numeric'] = df_datasets_profiled_final.apply(tropism_label, axis=1)
df_datasets_profiled_final['label_numeric'] = df_datasets_profiled_final['label_numeric'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
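
This warning, which also appears earlier in the notebook, is raised because the Dataframe being modified was created by filtering another one. One common way to avoid it is to take an explicit copy when filtering. The sketch below uses toy data; `toy` and `labelled` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'dataset': ['LANL', 'cm', 'newdb'],
                    'label': [None, 'CCR5', 'CXCR4']})

# .copy() makes the filtered frame independent of `toy`,
# so adding a column to it is unambiguous and raises no warning
labelled = toy[toy['dataset'] != 'LANL'].copy()
labelled['label_numeric'] = labelled['label'].map({'CCR5': 0, 'CXCR4': 1, 'R5X4': 1}).astype(int)

print(labelled)
print(toy.columns.tolist())  # the source frame is left untouched
```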
  


Create a new column on the final Dataframe with the sequence without gaps, and another column with its length.

In [72]:
df_datasets_profiled_final['sequence'] = df_datasets_profiled_final.sequence_aligned.str.replace('-', '', regex=False)
df_datasets_profiled_final['seq_len'] = df_datasets_profiled_final.sequence.apply(len)

# Reorder columns
df_datasets_profiled_final = df_datasets_profiled_final[['seq_name', 'dataset', 'sequence', 'seq_len',
                                                           'sequence_aligned','label', 'label_numeric']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [74]:
df_datasets_profiled_final.head()

Unnamed: 0,seq_name,dataset,sequence,seq_len,sequence_aligned,label,label_numeric
198,X138.EU074781.BG.CXCR4,cm,CRPNNTRKSGPQYTIIGDIAC,21,C-RPNN--TRKS------GPQ-----------YTIIGDIA---C,CXCR4,1
199,CCR5_AG1030_-_FR_-_02_AG,geno2pheno,CSRPNNNTRKSRIGPGQTFYATDIGDQC,28,CSRPNNN-TRKSRI----GPGQTFYAT-----------DIGDQC,CCR5,0
200,CCR5_AG1005_-_FR_-_02_AG,geno2pheno,CTRPNNNTRKSIHPGRAFYATVGPQAHC,28,CTRPNNN-TRKSIH----PGRAFYATV-----------GPQAHC,CCR5,0
201,-.FJ652339.02_AG.CCR5,cm,CTRPNNNTRSVRIGPGQAFYAGDIGIQAC,29,CTRPNNNTRS--------VRIGPGQAF-------YAGDIGIQAC,CCR5,0
202,-.FJ375998.C.CCR5,cm,CRPNNTRKMRIGPGQTYATGDIIGIRAHC,29,CRPNNTRKMR--------IGPGQTYAT-------GDIIGIRAHC,CCR5,0


In [76]:
df_datasets_profiled_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9550 entries, 198 to 9747
Data columns (total 7 columns):
seq_name            9550 non-null object
dataset             9550 non-null object
sequence            9550 non-null object
seq_len             9550 non-null int64
sequence_aligned    9550 non-null object
label               9550 non-null object
label_numeric       9550 non-null int64
dtypes: int64(2), object(5)
memory usage: 596.9+ KB


### Saving the final profiled dataset to tsv file


In [80]:
df_datasets_profiled_final.sort_values(by='dataset', inplace=True)
df_datasets_profiled_final.to_csv(f'{PATH}/DeepTropism/datasets/processed_tsv/deeptropism_profiled_dataset.tsv', sep='\t')