# Protein-Protein Interaction Data
**[Work in progress]**

This notebook downloads and standardizes viral-host protein data from IntAct and other sources for ingestion into the Knowledge Graph.

Data sources: [IntAct](https://www.ebi.ac.uk/intact/query/pubid:IM-27814), [Sequence Information](https://docs.google.com/spreadsheets/d/1m2SiCxyU_B1f4Ruu0wZafNXu8VQnjmog73bjWCS834A/edit?usp=sharing), [bioRxiv](https://www.biorxiv.org/content/10.1101/2020.03.22.002386v3.full)

Authors: Kaushik Ganapathy, Eric Yu (krganapa@ucsd.edu, ery010@ucsd.edu)

**Problem Description**
* How does the SARS-CoV-2 virus enter the human body? How does the disease manifest once the virus has entered? These are vital questions, and we now have data that lets us represent the answers qualitatively.


* The SARS-CoV-2 virus enters human cells with the help of the spike protein, which appears as spikes on the surface of the virus. Furthermore, sequencing of the viral genome has identified other proteins that interact with receptors (proteins) in humans. In this project, we present a workflow to assimilate information about such protein-protein interactions by integrating data from experimental papers, online interaction databases, and genomic data.


* The files generated by this workflow can then be ingested by the Knowledge Graph to create linkages with other data sources within the graph. Towards the end, we demonstrate that as well.

### External Package Imports

In [797]:
import os
import re
import hashlib 


import pandas as pd
import numpy as np

from pathlib import Path
from Bio import SeqIO
from io import StringIO

pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [798]:
# NEO4J_HOME = Path(os.getenv('NEO4J_HOME'))
# print(NEO4J_HOME)

**Data downloaded from IntAct in MI-TAB 2.5 format**

Automatic download of this file is a work in progress; for now the export is read from a local copy.

In [799]:
data = pd.read_csv('../reference_data/intact_data.txt', sep = '\t')

### Data Cleanup

**Dropping unnecessary columns.**

In [800]:
columns_to_drop = 'Iteraction detection method(s)	Publication 1st author(s)	Publication Identifier(s)	Taxid interactor A	Taxid interactor B	Interaction type(s)\tSource database(s)\tConfidence value(s)'
columns_to_drop = columns_to_drop.split('\t')
columns_retain = [col for col in data.columns if col not in columns_to_drop]
#columns_retain += ['Taxid interactor A', 'Taxid interactor B']
data = data[columns_retain]

**Minor Column Renaming and Validation of Data Sources**

***Cell should print "All set to clean up SARS-COV-2 Column"***

In [801]:
data = data.rename({'#ID(s) interactor A': 'SARS_COV2_Protein_ID'}, axis = 1)
unique_ids = data['SARS_COV2_Protein_ID'].unique()
unique_data_sources_ids = np.unique([id_.split(':')[0] for id_ in unique_ids])
if len(unique_data_sources_ids) == 1 and unique_data_sources_ids[0] == 'uniprotkb':
    print('All set to clean up SARS-COV-2 Column')
else:
    raise ValueError('Unknown Data Sources present. Please check before proceeding')

All set to clean up SARS-COV-2 Column
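
The prefix check above can be wrapped in a small reusable helper. The following is a minimal sketch, not the notebook's own code: the function names `id_prefixes` and `check_prefixes` and the sample identifiers are hypothetical, and it assumes every identifier follows the MI-TAB `database:accession` pattern. Unlike the exact set comparison used for the human column, it only requires the observed prefixes to be a subset of the allowed ones.

```python
def id_prefixes(identifiers):
    """Return the set of database prefixes (text before the first ':')."""
    return {identifier.split(':', 1)[0] for identifier in identifiers}


def check_prefixes(identifiers, allowed):
    """Raise ValueError if any identifier uses a prefix outside `allowed`."""
    unexpected = id_prefixes(identifiers) - set(allowed)
    if unexpected:
        raise ValueError(f'Unknown data sources present: {sorted(unexpected)}')
    return True


# Hypothetical sample values in the "database:accession" style.
sample_ids = ['uniprotkb:P0DTC4', 'uniprotkb:P0DTC5', 'intact:EBI-25475912']
print(check_prefixes(sample_ids, allowed={'uniprotkb', 'intact'}))  # True
```

Raising an exception in both checks, rather than printing a warning in one of them, would stop the notebook before an unhandled prefix reaches `standardize_names`.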


In [802]:
def standardize_names(identifier_id):
    # Normalize the 'uniprotkb:' prefix to 'uniprot:'; IntAct IDs pass through unchanged.
    if 'uniprot' in identifier_id:
        return identifier_id.replace('uniprotkb', 'uniprot')
    return identifier_id
data['SARS_COV2_Protein_ID'] = data['SARS_COV2_Protein_ID'].apply(standardize_names)

In [803]:
handled = set(['uniprotkb', 'intact'])
data = data.rename({'ID(s) interactor B': 'Human_Protein_ID'}, axis = 1)
unique_ids = data['Human_Protein_ID'].unique()
unique_data_sources_ids = np.unique([id_.split(':')[0] for id_ in unique_ids])
if set(unique_data_sources_ids) == handled: print('All set!')
else:
    print(unique_data_sources_ids)
    print ('Unknown Data Sources present. Please check before proceeding')

All set!


In [804]:
data['Human_Protein_ID'] = data['Human_Protein_ID'].apply(standardize_names)

### Standardizing names to match with other data sources

In [805]:
def find_viral_name(viral_name):
    return viral_name.split(':')[1].split('(')[0].split('_')[0].upper() 


def find_human_name(human_name):
    return human_name.split(':')[1].split('(')[0].upper() 


data['Alias(es) interactor A'] = data['Alias(es) interactor A'].apply(find_viral_name)
data['Alias(es) interactor B'] = data['Alias(es) interactor B'].apply(find_human_name)

data = data.rename({
    'Alias(es) interactor A': 'SARS_COV2_Protein_Name',
    'Alias(es) interactor B': 'Human_Protein_Name',
    'Interaction identifier(s)': 'Interaction_ID',
}, axis = 1)
# This column is still present because its name is misspelled in columns_to_drop above.
data = data.drop('Interaction detection method(s)', axis = 1)
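
To make the name parsing above concrete, here is a worked example on two alias strings. The two functions repeat the notebook's logic; the alias strings themselves are hypothetical stand-ins for the MI-TAB alias field, constructed only to resemble the names that appear in the outputs of this notebook.

```python
def find_viral_name(alias):
    """Text between the first ':' and the first '(', cut at the first '_', upper-cased."""
    return alias.split(':')[1].split('(')[0].split('_')[0].upper()


def find_human_name(alias):
    """Text between the first ':' and the first '(', upper-cased."""
    return alias.split(':')[1].split('(')[0].upper()


# Hypothetical alias strings in a "database:name(type)|..." layout.
viral_alias = 'psi-mi:nsp5_wcpv(display_long)|uniprotkb:3CL-PRO(gene name synonym)'
human_alias = 'psi-mi:ctl2_human(display_long)|uniprotkb:SLC44A2(gene name)'

print(find_viral_name(viral_alias))  # NSP5
print(find_human_name(human_alias))  # CTL2_HUMAN
```

Only the first alias in the field is used; everything after the first `(` is discarded.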

### Manual Fixes to errors in Data Entry 
_Automated Workflow: WIP_

**Correcting ORF3B**

In [806]:
data[data['Human_Protein_ID'] == 'intact:EBI-25475912']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
159,uniprot:Q9UJZ1,intact:EBI-25475912,intact:EBI-1044428|uniprotkb:B4E1K7|uniprotkb:...,-,STML2,ORF3B_WCPV,intact:EBI-25491308


In [807]:
correct = ['intact:EBI-25491308', 'uniprot:Q9UJZ1', '-', 'intact:EBI-1044428|uniprotkb:B4E1K7|uniprotkb:O60376|uniprotkb:Q53G29|uniprotkb:Q96FY2|uniprotkb:Q9P042|uniprotkb:D3DRN3', 'ORF3B', 'STML2', 'intact:EBI-25491308']
data.loc[159] = correct

**Correcting NSP5**

In [808]:
data[data['SARS_COV2_Protein_ID'] == 'uniprot:Q92769']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
108,uniprot:Q92769,uniprot:P0DTD1-PRO_0000449623,intact:EBI-301821|uniprotkb:B3KRS5|uniprotkb:E...,intact:EBI-25475864,HDAC2,NSP5_WCPV,intact:EBI-25490970


In [809]:
correct = ['uniprot:P0DTD1-PRO_0000449623', 'uniprot:Q92769', 'intact:EBI-25475864', 'intact:EBI-301821|uniprotkb:B3KRS5|uniprotkb:E1P561|uniprotkb:Q5SRI8|uniprotkb:Q5SZ86|uniprotkb:Q8NEH4|uniprotkb:B4DL58', 'NSP5', 'HDAC2', 'intact:EBI-25490970']
data.loc[108] = correct

**Correcting NSP11**

In [810]:
data[data['SARS_COV2_Protein_ID'] == 'uniprot:O75347']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID,Alt. ID(s) interactor A,Alt. ID(s) interactor B,SARS_COV2_Protein_Name,Human_Protein_Name,Interaction_ID
33,uniprot:O75347,uniprot:P0DTC1-PRO_0000449645,intact:EBI-2686341|uniprotkb:B4DT30,intact:EBI-25475882,TBCA,NSP11_WCPV,intact:EBI-25490682


In [811]:
correct = ['uniprot:P0DTC1-PRO_0000449645', 'uniprot:O75347', 'intact:EBI-25475882','intact:EBI-2686341|uniprotkb:B4DT30', 'NSP11', 'TBCA', 'intact:EBI-25490682']
data.loc[33] = correct
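
One possible direction for the automated workflow is sketched below. It assumes that a faulty row is a plain interactor A/B swap and that such rows can be recognised by the `_WCPV` suffix in `Human_Protein_Name`, as in the outputs above. The helper `unswap` and the constant `PAIRS` are hypothetical, the toy frame reuses values from the NSP11 row, and the alternate-ID columns would need the same treatment. Rows that need any other correction would still have to be fixed by hand.

```python
import pandas as pd

# Column pairs that trade places when interactor A and B are swapped.
PAIRS = [
    ('SARS_COV2_Protein_ID', 'Human_Protein_ID'),
    ('SARS_COV2_Protein_Name', 'Human_Protein_Name'),
]


def unswap(df, suffix='_WCPV'):
    """Swap paired columns on rows whose human name carries the viral suffix."""
    fixed = df.copy()
    swapped = fixed['Human_Protein_Name'].str.endswith(suffix)
    for viral_col, human_col in PAIRS:
        viral_vals = fixed.loc[swapped, viral_col].copy()
        fixed.loc[swapped, viral_col] = fixed.loc[swapped, human_col]
        fixed.loc[swapped, human_col] = viral_vals
    # Same rule as find_viral_name: keep only the text before the first '_'.
    fixed.loc[swapped, 'SARS_COV2_Protein_Name'] = (
        fixed.loc[swapped, 'SARS_COV2_Protein_Name'].str.split('_').str[0]
    )
    return fixed


toy = pd.DataFrame({
    'SARS_COV2_Protein_ID': ['uniprot:P0DTC4', 'uniprot:O75347'],
    'Human_Protein_ID': ['uniprot:Q8IWA5', 'uniprot:P0DTC1-PRO_0000449645'],
    'SARS_COV2_Protein_Name': ['E', 'TBCA'],
    'Human_Protein_Name': ['CTL2_HUMAN', 'NSP11_WCPV'],
})
print(unswap(toy))
```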

**Correcting NSP5_C145A**

In [812]:
# Assign through a single .loc call; chained indexing (data.loc[102][...]) may write to a copy.
data.loc[[102, 103], 'SARS_COV2_Protein_ID'] = 'uniprot:NSP5_C145A'
data.loc[[102, 103], 'SARS_COV2_Protein_Name'] = 'NSP5_C145A'

**Selecting relevant columns from data to form the initial ```interactions``` dataframe**

In [813]:
interactions = data[['SARS_COV2_Protein_ID', 'Human_Protein_ID', 'Interaction_ID']]

### Creating Node data files

In [814]:
data = data.rename({'Alt. ID(s) interactor A':'SARS_COV2_Alt_ID', 'Alt. ID(s) interactor B':'Human_Alt_ID'}, axis = 1)
virus_df = data[['SARS_COV2_Protein_ID', 'SARS_COV2_Alt_ID', 'SARS_COV2_Protein_Name']]
virus_df = virus_df.drop_duplicates(subset = ['SARS_COV2_Protein_ID', 'SARS_COV2_Protein_Name']).reset_index(drop = True)

**Unnesting the Alternate IDs**

In [815]:
def pre_un_nest(id_):
    # Split a '|'-separated alias string into IntAct and UniProt alternate IDs.
    if '|' not in id_:
        if 'intact' in id_:
            return {'Alt_intact_ID': id_, 'Alt_uniprot_ID': np.nan}
        elif 'uniprot' in id_:
            return {'Alt_intact_ID': np.nan, 'Alt_uniprot_ID': id_}
        # No recognised alias (for example the '-' placeholder).
        return {'Alt_intact_ID': np.nan, 'Alt_uniprot_ID': np.nan}
    else:
        ids = id_.split('|')
        intact_data = []
        uniprot_data = []
        
        for id__ in ids:   
            if 'intact' in id__:
                intact_data.append(id__)
                    
            elif 'uniprot' in id__:
                uniprot_data.append(id__)
                            
        # Collapse single-element lists to a scalar.
        if len(uniprot_data) == 1:
            uniprot_data = uniprot_data[0]
            
        if len(intact_data) == 1:
            intact_data = intact_data[0]
            
        return {'Alt_uniprot_ID': uniprot_data, 'Alt_intact_ID': intact_data}

In [816]:
unnested = (virus_df['SARS_COV2_Alt_ID'].apply(pre_un_nest)).apply(pd.Series)
unnested['SARS_COV2_Protein_ID'] = virus_df['SARS_COV2_Protein_ID']
unnested = unnested[unnested.columns.tolist()[::-1]]
virus_df = virus_df.drop('SARS_COV2_Alt_ID', axis = 1)
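
The unnesting step can be illustrated on a single alias string. This is a minimal sketch of an alternative to `pre_un_nest`, assuming only the two prefixes seen in this dataset; the helper name `split_alt_ids` is hypothetical, and the sample string is the alternate-ID value shown for TBCA above.

```python
def split_alt_ids(alt_ids):
    """Group a '|'-separated alias string by database prefix."""
    grouped = {'intact': [], 'uniprot': []}
    if alt_ids == '-':  # MI-TAB uses '-' for an empty field
        return grouped
    for alias in alt_ids.split('|'):
        prefix = alias.split(':', 1)[0]
        if prefix.startswith('uniprot'):
            grouped['uniprot'].append(alias)
        elif prefix == 'intact':
            grouped['intact'].append(alias)
    return grouped


print(split_alt_ids('intact:EBI-2686341|uniprotkb:B4DT30'))
# {'intact': ['intact:EBI-2686341'], 'uniprot': ['uniprotkb:B4DT30']}
print(split_alt_ids('-'))
# {'intact': [], 'uniprot': []}
```

Always returning lists, rather than a scalar when there is exactly one alias, keeps the resulting columns uniform in type, which may simplify a later explode or CSV export.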

**Load in Jeff Law's Data with Sequences**

In [817]:
sequence_interactions = pd.read_excel('../reference_data/2020-04-krogan-sarscov2-sequences-uniprot-mapping.xlsx')
sequence_interactions = sequence_interactions.loc[:26, :]

**Standardizing Names with IntAct data and selecting appropriate columns**

In [818]:
sequence_interactions['SARS_COV2_Protein_Name'] = sequence_interactions['Krogan name'].apply(lambda name: name.split()[-1].upper())
all_external_data = sequence_interactions[['SARS_COV2_Protein_Name', 'Sequence', 'Length', 'Start Pos', 'End Pos']]
virus_df = virus_df.merge(all_external_data, how = 'outer', indicator=True)
virus_df = virus_df[virus_df.columns[:-1]]

### Generating md5 hash based on sequence

In [819]:
assert len(virus_df) == len(virus_df['Sequence'].unique())

virus_df['md5Hash'] = virus_df['Sequence'].apply(lambda seq: hashlib.md5(seq.encode()).hexdigest())
virus_df = virus_df.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier', 'md5Hash': 'SARS_COV2_Protein_ID'}, axis = 1)

# Copy so that the integer casts below do not write to a view of virus_df.
sequences = virus_df[['SARS_COV2_Protein_ID', 'SARS_COV2_Identifier', 'SARS_COV2_Protein_Name', 'Sequence', 'Length', 'Start Pos', 'End Pos']].copy()
sequences['Start Pos'] = sequences['Start Pos'].astype(int)
sequences['End Pos'] = sequences['End Pos'].astype(int)

map_identifiers = virus_df[['SARS_COV2_Identifier', 'SARS_COV2_Protein_ID']]
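
As a small illustration of the hashing step, the sketch below hashes the NSP11 sequence listed in the node table of this notebook. The helper name `sequence_md5` is hypothetical. The point is that the ID is a deterministic function of the exact sequence string, so identical sequences from different sources collapse to the same node ID, while any difference in case or whitespace produces a different ID.

```python
import hashlib


def sequence_md5(sequence):
    """Return the hex md5 digest used as a protein ID."""
    return hashlib.md5(sequence.encode()).hexdigest()


nsp11 = 'SADAQSFLNGFAV'
digest = sequence_md5(nsp11)

print(len(digest))                            # 32
print(digest == sequence_md5(nsp11))          # True
print(digest == sequence_md5(nsp11.lower()))  # False
```

If sequences come from sources with inconsistent formatting, normalising them (for example stripping whitespace and upper-casing) before hashing may be worth considering.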

### Mapping interactions to the ```interactions``` dataframe and the ```unnested``` dataframe

In [820]:
interactions = interactions.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier'}, axis = 1)
interactions = interactions.merge(map_identifiers)
interactions = interactions.drop('SARS_COV2_Identifier', axis = 1)
interactions = interactions[['SARS_COV2_Protein_ID', 'Human_Protein_ID']]


unnested = unnested.rename({'SARS_COV2_Protein_ID': 'SARS_COV2_Identifier'}, axis = 1)
unnested['Alt_uniprot_ID'] = unnested['Alt_uniprot_ID'].apply(lambda id_: np.nan if id_ == [] else id_)
unnested = unnested.merge(map_identifiers).drop('SARS_COV2_Identifier', axis = 1)
unnested = unnested[unnested.columns[::-1]]

### Creating the Human Proteins File

In [821]:
human_data = data[['Human_Protein_ID', 'Human_Protein_Name', 'Human_Alt_ID']]
unnested_aliases = (human_data['Human_Alt_ID'].apply(pre_un_nest)).apply(pd.Series)
human_data = human_data.drop('Human_Alt_ID', axis = 1)
unnested_aliases['Human_Protein_ID'] = human_data['Human_Protein_ID']
unnested_aliases = unnested_aliases[unnested_aliases.columns[::-1]]

In [822]:
sequences = sequences.rename({'SARS_COV2_Protein_ID':'ID', 'SARS_COV2_Protein_Name': 'Name', 'SARS_COV2_Identifier': 'Identifier'}, axis = 1)
sequences['TaxonomyID'] = 2697049

In [823]:
human_data = human_data.rename({'Human_Protein_ID':'ID', 'Human_Protein_Name': 'Name'}, axis = 1)

In [824]:
human_data['TaxonomyID'] = 9606
human_data['Identifier'] = human_data['ID']

### Generating Sequences for all Human Proteins from UniprotKB using Batch retrieval

In [825]:
import requests


def get_fasta_from_accession_id(ids_, source_fmt='ACC+ID', target_fmt="ACC", output_fmt="fasta"):
    '''
    Adapted from: Webinar on UniProt programmatically.
    https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/UniProt_programmatically_py3.pdf
    
    Name: get_fasta_from_accession_id

    Purpose: Retrieves the FASTA sequences for a batch of UniProt accession IDs.

    Arguments:
        ids_: The list of UniProt accession IDs whose sequences are required.
        source_fmt: Identifier type of the input IDs. Defaults to 'ACC+ID'.
        target_fmt: Identifier type to map to. Defaults to 'ACC'.
        output_fmt: Response format. Defaults to 'fasta', since we want a FASTA file.

    Output:
        List of sequences (as strings), in the order returned by the service, if retrieval succeeds.
        Empty string if retrieval fails.
    '''
    
    BASE = "http://www.uniprot.org"
    TOOL_ENDPOINT = "/uploadlists/"

    # The mapping service takes all IDs as one space-separated query string.
    all_uniprot_ids = ' '.join(ids_)
    payload = {"from": source_fmt, "to": target_fmt, "format": output_fmt, "query": all_uniprot_ids}
    response = requests.get(BASE + TOOL_ENDPOINT, params=payload)
    
    if response.status_code == 200 and response.text != '':
        # Parse the FASTA text in memory and keep only the sequence strings.
        records = SeqIO.parse(StringIO(response.text), "fasta")
        return [str(record.seq) for record in records]
    
    return ''
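
Because the function above returns sequences as a plain list, attaching them to a dataframe relies on the response preserving the order and count of the input IDs. A more defensive option is to key each sequence by the accession in its FASTA header. The sketch below is a pure-Python illustration that needs no network access: the parser `parse_fasta_by_accession` is hypothetical, it assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME ...`, and the accessions and sequences in `sample` are made up.

```python
def parse_fasta_by_accession(fasta_text):
    """Map accession -> sequence for UniProt-style headers ('>db|ACC|NAME ...')."""
    sequences = {}
    accession = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            parts = line[1:].split('|')
            # Fall back to the first header token if the header is not pipe-delimited.
            accession = parts[1] if len(parts) >= 3 else line[1:].split()[0]
            sequences[accession] = ''
        elif accession is not None:
            # Sequence lines may be wrapped; concatenate them.
            sequences[accession] += line
    return sequences


# Made-up records for illustration only.
sample = """>sp|X00001|TOY1_HUMAN Toy protein one OS=Homo sapiens
MKT
AYI
>sp|X00002|TOY2_HUMAN Toy protein two OS=Homo sapiens
GSS
"""
by_acc = parse_fasta_by_accession(sample)
print(by_acc)  # {'X00001': 'MKTAYI', 'X00002': 'GSS'}
```

With such a mapping, the sequence column could be filled by looking up the accession part of each `ID`, which leaves a missing value for any protein the service did not return instead of silently shifting the alignment.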

### md5 Hashing the Sequence to create IDs for Uniformity

In [826]:
human_data['Sequence'] = human_data['ID'].apply(lambda id_: id_.split(':')[1])
human_data['Sequence'] = get_fasta_from_accession_id(human_data['Sequence'])
human_data['ID'] = human_data['Sequence'].apply(lambda seq: hashlib.md5(seq.encode()).hexdigest())

In [827]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
nodes = pd.concat([sequences, human_data], sort=True)
nodes = nodes.fillna('')
nodes = nodes[['ID', 'TaxonomyID', 'Name', 'Identifier', 'Sequence', 'Length', 'Start Pos', 'End Pos']]
nodes['Length'] = nodes['Sequence'].apply(len)

### Displaying one Human and Viral Protein

In [828]:
pd.concat([human_data.head(1), sequences.head(1)], sort=True)[['ID', 'TaxonomyID', 'Name', 'Identifier', 'Sequence', 'Length', 'Start Pos', 'End Pos']].fillna('')

Unnamed: 0,ID,TaxonomyID,Name,Identifier,Sequence,Length,Start Pos,End Pos
0,d371d6023c01420a7c851ef893beaeee,9606,CTL2_HUMAN,uniprot:Q8IWA5,MGDERPHYYGKHGTPQKYDPTFKGPIYNRGCTDIICCVFLLLAIVG...,,,
0,375e0f905c315e06a99c80b736c125d2,2697049,E,uniprot:P0DTC4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,75.0,26245.0,26472.0


## PROBLEM! Viral sequences with no match in the 01c data

In [829]:
data_1c = pd.read_csv('01c-data.csv')

In [830]:
data_1c.head()

Unnamed: 0,start,end,gene,locus_tag,db_xref,product,protein_id,genbank_accession,protein_sequence
0,266,21555,ORF1ab,GU280_gp01,GeneID:43740578,ORF1ab polyprotein,YP_009724389.1,NC_045512.2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
1,266,13483,ORF1ab,GU280_gp01,GeneID:43740578,ORF1a polyprotein,YP_009725295.1,NC_045512.2,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...
2,21563,25384,S,GU280_gp02,GeneID:43740568,surface glycoprotein,YP_009724390.1,NC_045512.2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...
3,25393,26220,ORF3a,GU280_gp03,GeneID:43740569,ORF3a protein,YP_009724391.1,NC_045512.2,MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWL...
4,26245,26472,E,GU280_gp04,GeneID:43740570,envelope protein,YP_009724392.1,NC_045512.2,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...


In [831]:
merged = nodes.merge(data_1c, left_on='Sequence', right_on='protein_sequence', indicator=True, how = 'outer')

In [832]:
merged[(merged['TaxonomyID'] == 2697049) & (merged['_merge'] == 'left_only')]

Unnamed: 0,ID,TaxonomyID,Name,Identifier,Sequence,Length,Start Pos,End Pos,start,end,gene,locus_tag,db_xref,product,protein_id,genbank_accession,protein_sequence,_merge
10,b6e8ea75d0679d091b1dc44cf395aaf4,2697049.0,NSP5_C145A,uniprot:NSP5_C145A,MSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICT...,307.0,3550,4470,,,,,,,,,,left_only
22,20cfd282d01dd605a02d3d084ce320e9,2697049.0,ORF3B,intact:EBI-25491308,MAYCWRCTSCCFSERFQNHNPQKEMATSTLQGCSLCLQLAVVVCNS...,57.0,25524,25697,,,,,,,,,,left_only
31,14d0b1a958f970cb18618c9aa65493fa,2697049.0,ORF8,uniprot:P0DTC8,MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVVDDPCPIHFYSKWY...,121.0,27894,28259,,,,,,,,,,left_only
32,9b66a6405fa5218ae819a76293365391,2697049.0,ORF9B,uniprot:P0DTD2,MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQNNVGPKVYPIIL...,97.0,28284,28577,,,,,,,,,,left_only
33,74fa328ac0995fdbe2a3a0b2c4389f59,2697049.0,ORF9C,uniprot:P0DTD3,MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQ...,73.0,28734,28955,,,,,,,,,,left_only
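
The check above relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column stating whether each row was found in the left frame, the right frame, or both. A minimal sketch on toy frames follows; the frame names and the truncated four-letter sequences are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `nodes` and `data_1c`.
nodes_toy = pd.DataFrame({'Name': ['E', 'ORF8'], 'Sequence': ['MYSF', 'MKFL']})
ref_toy = pd.DataFrame({'gene': ['E', 'S'], 'protein_sequence': ['MYSF', 'MFVF']})

merged_toy = nodes_toy.merge(
    ref_toy,
    left_on='Sequence',
    right_on='protein_sequence',
    how='outer',
    indicator=True,
)

# Rows present only in the left frame have no exact sequence match in the reference.
unmatched = merged_toy.loc[merged_toy['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)  # ['ORF8']
```

Since the join key is the full sequence string, a single differing residue is enough for a protein to land in `left_only`.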


### Changing the ID for Human Proteins within Interactions and showing ```interactions```

In [833]:
interactions = interactions.merge(nodes, left_on='Human_Protein_ID', right_on='Identifier')[['SARS_COV2_Protein_ID', 'ID']]
interactions = interactions.rename({'ID':'Human_Protein_ID'}, axis = 1)
interactions.head()

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID
0,375e0f905c315e06a99c80b736c125d2,d371d6023c01420a7c851ef893beaeee
1,375e0f905c315e06a99c80b736c125d2,bbf43a67b8733f0649b049d494f6872c
2,375e0f905c315e06a99c80b736c125d2,0d02b8a38020b8f429a6cc9e238e3e4f
3,375e0f905c315e06a99c80b736c125d2,5dd925de15997e9e3938ab23517cdcfc
4,375e0f905c315e06a99c80b736c125d2,8dbafa77ef94052d86515e8d7fe1a6f3


## Deliverables for Protein-Protein Interactions

* **```Nodes```**: Contains all the protein nodes (human and viral) and associated information, with most conflicts among the three data sources resolved. Proteins are identified by the md5 hash of their sequence (the protein ID). Also contains the sequence, the start and end positions in the genome, and a standard identifiers.org representation.


* **```Interactions```**:  Contains all the interactions between a viral protein and a human protein. Each one of these interactions also has an ID which is resolvable on identifiers.org.

In [834]:
nodes.head()

Unnamed: 0,ID,TaxonomyID,Name,Identifier,Sequence,Length,Start Pos,End Pos
0,375e0f905c315e06a99c80b736c125d2,2697049,E,uniprot:P0DTC4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,75,26245,26472
1,1cd6abff79ad3633e17582eb0e576539,2697049,M,uniprot:P0DTC5,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,26523,27191
2,5c2c364f44079728c451280435c4236a,2697049,NSP1,uniprot:P0DTD1-PRO_0000449619,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,180,266,805
3,63d2c81f37726f44c600eb5225676a66,2697049,NSP11,uniprot:P0DTC1-PRO_0000449645,SADAQSFLNGFAV,13,13442,13484
4,af0cec59296f3c845a7b04500cd6886b,2697049,NSP10,uniprot:P0DTD1-PRO_0000449628,AGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLC...,139,13025,13441


In [763]:
interactions.head()

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID
0,375e0f905c315e06a99c80b736c125d2,d371d6023c01420a7c851ef893beaeee
1,375e0f905c315e06a99c80b736c125d2,bbf43a67b8733f0649b049d494f6872c
2,375e0f905c315e06a99c80b736c125d2,0d02b8a38020b8f429a6cc9e238e3e4f
3,375e0f905c315e06a99c80b736c125d2,5dd925de15997e9e3938ab23517cdcfc
4,375e0f905c315e06a99c80b736c125d2,8dbafa77ef94052d86515e8d7fe1a6f3


### Writing all files to csv

**Converting all relevant information to strings and replacing NaNs with '' so that Neo4j reads them as empty properties**

In [764]:
nodes = nodes.fillna('')
nodes['Start Pos'] = nodes['Start Pos'].apply(str)
nodes['End Pos'] = nodes['End Pos'].astype(str)
nodes['Start Pos'] = nodes['Start Pos'].apply(lambda val: str(int(float(val))) if val != '' else val)
nodes['End Pos'] = nodes['End Pos'].apply(lambda val: str(int(float(val))) if val != '' else val)
nodes['Length'] = nodes['Length'].apply(lambda val: str(int(float(val))) if val != '' else val)
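
The conversion above turns float-like values such as `26245.0` into the string `'26245'` and leaves missing values as empty strings. A standalone sketch of that rule is shown below; the helper name `as_int_string` is hypothetical, and it additionally guards against NaN values that have not yet been replaced.

```python
import math


def as_int_string(value):
    """Format a numeric-like value as an integer string; missing values become ''."""
    if value is None or value == '':
        return ''
    number = float(value)
    if math.isnan(number):
        return ''
    return str(int(number))


print([as_int_string(v) for v in [26245.0, '26472.0', '', float('nan'), 75]])
# ['26245', '26472', '', '', '75']
```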

In [765]:
#FOR NEO-4J Server Setup (NOT LOCAL)
# nodes.to_csv(NEO4J_HOME / 'import/01e-nodes.csv', index = False)
# interactions.to_csv(NEO4J_HOME / 'import/01e-interactions.csv', index = False)

##########################################################################
# unnested.to_csv(NEO4J_HOME / 'import/01e-virus_alias.csv', index = False)
# unnested_aliases.to_csv(NEO4J_HOME / 'import/01e-human_alias.csv', index = False)
##########################################################################

In [766]:
nodes.to_csv('01e-nodes_OG.csv', index = False)
interactions.to_csv('01e-interactions_OG.csv', index = False)

### End resulting files

**```01e-nodes.csv```**: Contains all the protein sequences from the human and virus sides, along with associated information, with conflicts among the three data sources resolved. Proteins are identified by the md5 hash of their sequence (the protein ID). Also contains the sequence, the start and end positions in the genome, and a standard identifiers.org representation.

**```01e-virus_alias.csv```**: Contains all the alias IDs known for the viral sequences, whether IntAct or UniProt. Keyed by the protein ID.

**```01e-human_alias.csv```**: Contains all the alias IDs known for the human sequences, whether IntAct or UniProt. Keyed by the protein ID.

**```01e-interactions.csv```**: Contains all the interactions between a viral protein and a human protein. Each one of these interactions also has an ID which is resolvable on identifiers.org.

In [767]:
#TODO: 
#Automated Download of Intact Data

### Further Scope: Potential Cool Extensions/Integration

In [768]:
###NSP-9
import py3Dmol
interactions[interactions['SARS_COV2_Protein_ID'] == '8c32758bc2f4b49ed8c6dfe7caa7ea49']

Unnamed: 0,SARS_COV2_Protein_ID,Human_Protein_ID
213,8c32758bc2f4b49ed8c6dfe7caa7ea49,b93aa29fb8777bda1a99351c11cd5ba9
214,8c32758bc2f4b49ed8c6dfe7caa7ea49,1760a7e2012ae33c54aa97b27e3c8406
215,8c32758bc2f4b49ed8c6dfe7caa7ea49,7f4d2b129d44429a4ae8e8e467d484e9
216,8c32758bc2f4b49ed8c6dfe7caa7ea49,f54c2cc57f3ea7da2771fb348966ff0b
217,8c32758bc2f4b49ed8c6dfe7caa7ea49,ad9411e50bea7d1cf2aa06bb87e058ae
218,8c32758bc2f4b49ed8c6dfe7caa7ea49,d0f2785e425f4a8e3ecef669a9384a38
219,8c32758bc2f4b49ed8c6dfe7caa7ea49,3cb6b6b668ed5f7a90159f9d1dcb0ec2
220,8c32758bc2f4b49ed8c6dfe7caa7ea49,292d422968c9195f3868c77f183d7e5f
221,8c32758bc2f4b49ed8c6dfe7caa7ea49,d999520723b2365d2293939612d9ca37
222,8c32758bc2f4b49ed8c6dfe7caa7ea49,c9ddca4904950f36444729ce8e7a4e5f


In [769]:
wd = '/Users/kaushikramganapathy/covid-19-community/notebooks'
os.chdir(wd)

In [770]:
viz = py3Dmol.view(query='pdb:6W4B')
viz.setStyle({'cartoon': {'color':'white'}})

<py3Dmol.view at 0x12680b198>

# TODO / In-progress

* Workflow for integration with 01c data, thus connecting it to KG.
* Add it on Github/Binder.
* Integrate with Production version / Test Scripts on Server 
* Figure out how to deal with mutant entries such as NSP5_C145A.
* Connect other features (Publication source for Interactions for example)

### Simple integration with 01c to get the links that connect to other components of the knowledge graph

In [837]:
human_nodes = nodes[nodes['TaxonomyID'] == 9606]

In [838]:
gene_info = data_1c[['gene', 'db_xref']].drop_duplicates()

In [839]:
gene_info = gene_info.rename({'gene': 'Name', 'db_xref': 'ID'}, axis = 1)

In [840]:
nodes = nodes.merge(data_1c[['db_xref', 'protein_sequence']], left_on='Sequence', right_on='protein_sequence').rename({'db_xref':'GeneID'}, axis = 1)

In [841]:
nodes = nodes[['ID', 'TaxonomyID', 'GeneID', 'Name', 'Identifier', 'Sequence', 'Length', 'Start Pos', 'End Pos']]

In [842]:
gene_interactions = nodes[['GeneID', 'ID']].drop_duplicates().rename({'ID': 'ProteinID'}, axis = 1)

In [843]:
gene_interactions.to_csv('gene_interactions.csv', index = False)

In [844]:
nodes = pd.concat([nodes, human_nodes], sort = True)

In [845]:
nodes = nodes.reset_index(drop = True)

In [846]:
nodes = nodes.fillna('')

In [849]:
nodes = nodes.drop_duplicates()
interactions = interactions.drop_duplicates()

In [852]:
nodes.to_csv('01e-nodes.csv', index = False)
gene_interactions.to_csv('gene_interactions.csv', index = False)
gene_info.to_csv('gene_info.csv', index = False)
interactions.to_csv('01e-interactions.csv', index = False)

### Nodes

In [854]:
nodes[['ID', 'TaxonomyID', 'GeneID', 'Name', 'Identifier', 'Sequence', 'Length', 'Start Pos', 'End Pos']].head()

Unnamed: 0,ID,TaxonomyID,GeneID,Name,Identifier,Sequence,Length,Start Pos,End Pos
0,375e0f905c315e06a99c80b736c125d2,2697049,GeneID:43740570,E,uniprot:P0DTC4,MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...,75,26245,26472
1,1cd6abff79ad3633e17582eb0e576539,2697049,GeneID:43740571,M,uniprot:P0DTC5,MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFL...,222,26523,27191
2,5c2c364f44079728c451280435c4236a,2697049,GeneID:43740578,NSP1,uniprot:P0DTD1-PRO_0000449619,MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHL...,180,266,805
4,63d2c81f37726f44c600eb5225676a66,2697049,GeneID:43740578,NSP11,uniprot:P0DTC1-PRO_0000449645,SADAQSFLNGFAV,13,13442,13484
5,af0cec59296f3c845a7b04500cd6886b,2697049,GeneID:43740578,NSP10,uniprot:P0DTD1-PRO_0000449628,AGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLC...,139,13025,13441


### Gene Info

In [857]:
gene_info.head()

Unnamed: 0,Name,ID
0,ORF1ab,GeneID:43740578
2,S,GeneID:43740568
3,ORF3a,GeneID:43740569
4,E,GeneID:43740570
5,M,GeneID:43740571


### Gene Interactions

In [858]:
gene_interactions.head()

Unnamed: 0,GeneID,ProteinID
0,GeneID:43740570,375e0f905c315e06a99c80b736c125d2
1,GeneID:43740571,1cd6abff79ad3633e17582eb0e576539
2,GeneID:43740578,5c2c364f44079728c451280435c4236a
4,GeneID:43740578,63d2c81f37726f44c600eb5225676a66
5,GeneID:43740578,af0cec59296f3c845a7b04500cd6886b
