## Phosphorylation Site Data Cleaning

The raw kinase phosphorylation site data from PhosphoSitePlus comes as two Excel spreadsheets, Kinase_Substrate_Dataset and Phosphorylation_site_dataset (loaded below as CSV files). We now have to filter, clean and combine the two datasets into a form suitable for use as a table in our database. We start by filtering the Kinase_Substrate_Dataset, as the majority of the data for our table comes from this dataframe.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
#Open the Kinase_Substrate_Dataset raw data file and load it into a pandas dataframe
df = pd.read_csv('Kinase_Substrate_Dataset.csv')

In [3]:
#To see what our dataframe looks like and to ensure pandas has loaded in our data correctly.
df.head()

Unnamed: 0,GENE,KINASE,KIN_ACC_ID,KIN_ORGANISM,SUBSTRATE,SUB_GENE_ID,SUB_ACC_ID,SUB_GENE,SUB_ORGANISM,SUB_MOD_RSD,SITE_GRP_ID,SITE_+/-7_AA,DOMAIN,IN_VIVO_RXN,IN_VITRO_RXN,CST_CAT#
0,Pak2,PAK2,Q64303,rat,MEK1,170851.0,Q01986,Map2k1,rat,S298,448284,RtPGRPLsSYGMDSR,Pkinase,,X,9128; 98195
1,Pak2,PAK2,Q64303,rat,PRKD1,85421.0,Q9WTQ1,Prkd1,rat,S203,449896,GVRRRRLsNVsLTGL,,X,,
2,Pak2,PAK2,Q64303,rat,prolactin,5617.0,P01236,PRL,human,S207,451732,LHCLRRDsHKIDNYL,Hormone_1,,X,
3,Pak2,PAK2,Q64303,rat,prolactin,24683.0,P01237,Prl,rat,S206,451732,IRCLRRDsHKVDNYL,Hormone_1,,X,
4,EIF2AK1,HRI,Q9BQI3,human,eIF2-alpha,54318.0,P68101,Eif2s1,rat,S52,447635,MILLSELsRRRIRSI,S1,,X,3597; 9721; 3398; 5199


The first major thing we notice is that the Kinase Organism and Substrate Organism columns include species other than human. As we are only interested in human kinases, we filter accordingly.

In [4]:
#Filtering the dataframe for rows where the Kinase and Substrate organisms are both human and checking to make sure this has
#been done
df = df[(df.KIN_ORGANISM=='human') & (df.SUB_ORGANISM=='human')]
df.head()

Unnamed: 0,GENE,KINASE,KIN_ACC_ID,KIN_ORGANISM,SUBSTRATE,SUB_GENE_ID,SUB_ACC_ID,SUB_GENE,SUB_ORGANISM,SUB_MOD_RSD,SITE_GRP_ID,SITE_+/-7_AA,DOMAIN,IN_VIVO_RXN,IN_VITRO_RXN,CST_CAT#
6,EIF2AK1,HRI,Q9BQI3,human,eIF2-alpha,1965.0,P05198,EIF2S1,human,S52,447635,MILLsELsRRRIRsI,S1,,X,3597; 9721; 3398; 5199
7,EIF2AK1,HRI,Q9BQI3,human,eIF2-alpha,1965.0,P05198,EIF2S1,human,S49,450210,IEGMILLsELsRRRI,S1,,X,
10,PRKCD,PKCD,Q05655,human,HDAC5,10014.0,Q9UQL6,HDAC5,human,S259,447995,FPLRkTAsEPNLKVR,,,X,3443
11,PRKCD,PKCD,Q05655,human,PTPRA iso2,5786.0,P18433-2,PTPRA,human,S204,447612,PLLARSPsTNRKYPP,,X,,
12,PRKCD,PKCD,Q05655,human,hnRNP K,3190.0,P61978,HNRNPK,human,S302,457408,GrGGrGGsrArNLPL,,X,X,


Next, there are a number of columns in the dataframe that are not useful to us and so can be removed.

In [5]:
#Remove columns in the df that aren't useful to us by their heading name and checking to make sure this has been done.
df = df.drop(['GENE','KIN_ORGANISM', 'SUB_GENE_ID', 'SUB_ORGANISM', 'SITE_GRP_ID', 'DOMAIN', 'IN_VIVO_RXN', 'IN_VITRO_RXN', 
              'CST_CAT#'],axis=1)
df.head()

Unnamed: 0,KINASE,KIN_ACC_ID,SUBSTRATE,SUB_ACC_ID,SUB_GENE,SUB_MOD_RSD,SITE_+/-7_AA
6,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLsELsRRRIRsI
7,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELsRRRI
10,PKCD,Q05655,HDAC5,Q9UQL6,HDAC5,S259,FPLRkTAsEPNLKVR
11,PKCD,Q05655,PTPRA iso2,P18433-2,PTPRA,S204,PLLARSPsTNRKYPP
12,PKCD,Q05655,hnRNP K,P61978,HNRNPK,S302,GrGGrGGsrArNLPL


As SQLite and SQLAlchemy have specific syntax requirements for column names, we rename our columns so that the database and web app integration works without errors, and to make the names clearer.
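
As a rough illustration of the kind of name that causes trouble, the sketch below collapses every run of non-alphanumeric characters into a single underscore. `sanitise_column_name` is a hypothetical helper written for this example, not part of pandas or SQLAlchemy, and the notebook itself renames the columns by hand instead.

```python
import re

def sanitise_column_name(name):
    """Collapse each run of non-alphanumeric characters into one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop any underscore left dangling at either end.
    return cleaned.strip("_")

raw_columns = ["KINASE", "SITE_+/-7_AA", "CST_CAT#"]
safe_columns = {name: sanitise_column_name(name) for name in raw_columns}
print(safe_columns)
```

Renaming by hand, as done in the next cell, has the advantage that the new names can also be made more descriptive.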

In [6]:
#Renaming columns of the dataframe for more clarity and to meet syntax requirements for SQLite and SQLAlchemy.
df = df.rename(index=str, columns = {'KINASE': 'KINASE_GENE_NAME', 'KIN_ACC_ID': 'KIN_UNIPROT_ID', 'SUBSTRATE': 'SUBSTRATE_NAME',
                                     'SUB_ACC_ID': 'SUB_UNIPROT_ID', 'SUB_GENE': 'SUB_GENE_NAME', 'SITE_+/-7_AA': 'SITE_7_AA'}) 
df

Unnamed: 0,KINASE_GENE_NAME,KIN_UNIPROT_ID,SUBSTRATE_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME,SUB_MOD_RSD,SITE_7_AA
6,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLsELsRRRIRsI
7,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELsRRRI
10,PKCD,Q05655,HDAC5,Q9UQL6,HDAC5,S259,FPLRkTAsEPNLKVR
11,PKCD,Q05655,PTPRA iso2,P18433-2,PTPRA,S204,PLLARSPsTNRKYPP
12,PKCD,Q05655,hnRNP K,P61978,HNRNPK,S302,GrGGrGGsrArNLPL
...,...,...,...,...,...,...,...
18450,NuaK1,O60285,MYPT1,O14974,PPP1R12A,S910,sLLGRsGsysyLEER
18451,NuaK1,O60285,LATS1,O95835,LATS1,S464,NIPVRsNsFNNPLGN
18452,ULK2,Q8IYT8,SEC16A,O15027,SEC16A,S846,LAQPINFsVSLSNSH
18453,ULK2,Q8IYT8,DENND3,A2RUS2,DENND3,S472,THRRMVVsMPNLQDI


The neighbouring amino acid sequence shows the 7 amino acids on either side of the residue that has been phosphorylated by the kinase. However, the format PhosphoSitePlus uses isn't the clearest: the same string can contain several lowercase amino acids besides the residue being phosphorylated. So we clean the column so that every sequence is upper case except for the phosphorylated residue at position 8, which stays lower case.

In [7]:
#Putting the Amino acid sequence information into an object
AA_Sequences = df['SITE_7_AA'].values

#Make a function which takes a string and capitalizes the 8th character in a string and makes all of the other characters in
#lower case.
def capitalize_8th(string):
    return string[:7].lower() + string[7:].capitalize()

#Set up an empty list to collect all of the cleaned sequences
CleanedSequences = []

#Running a for loop to clean the amino acid sequences
for sequence in AA_Sequences: #for every sequence in the amino acid sequence
    sequence = str(sequence) #convert the sequence into a string
    sequence = capitalize_8th(sequence) #pass the string through our created function 
    sequence = sequence.swapcase() #swap the cases of the strings to the opposite i.e. all lowercase characters become uppercase
                                   #and vice versa
    CleanedSequences.append(sequence) #append the edited sequence to the empty list
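
The same clean-up can also be written as a single slicing step, which may be easier to follow than the capitalize-then-swapcase approach. This is only a sketch: `normalise_site_window` and its `centre` parameter are names made up for this example, and it assumes the phosphorylated residue always sits at index 7.

```python
def normalise_site_window(window, centre=7):
    """Upper-case every residue except the one at index `centre`, which is lower-cased."""
    window = str(window)
    return (window[:centre].upper()
            + window[centre:centre + 1].lower()
            + window[centre + 1:].upper())

examples = ["MILLsELsRRRIRsI", "GrGGrGGsrArNLPL"]
for raw in examples:
    print(raw, "->", normalise_site_window(raw))
```

A function like this could be applied to the whole column with `df['SITE_7_AA'].map(normalise_site_window)` instead of looping and re-inserting the column.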

In [8]:
#Dropping the old amino acid sequence column in the dataframe
df = df.drop(columns=['SITE_7_AA'])

#Inserting the new column with the cleaned data
df.insert(6, 'SITE_7_AA', CleanedSequences)

df

Unnamed: 0,KINASE_GENE_NAME,KIN_UNIPROT_ID,SUBSTRATE_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME,SUB_MOD_RSD,SITE_7_AA
6,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI
7,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI
10,PKCD,Q05655,HDAC5,Q9UQL6,HDAC5,S259,FPLRKTAsEPNLKVR
11,PKCD,Q05655,PTPRA iso2,P18433-2,PTPRA,S204,PLLARSPsTNRKYPP
12,PKCD,Q05655,hnRNP K,P61978,HNRNPK,S302,GRGGRGGsRARNLPL
...,...,...,...,...,...,...,...
18450,NuaK1,O60285,MYPT1,O14974,PPP1R12A,S910,SLLGRSGsYSYLEER
18451,NuaK1,O60285,LATS1,O95835,LATS1,S464,NIPVRSNsFNNPLGN
18452,ULK2,Q8IYT8,SEC16A,O15027,SEC16A,S846,LAQPINFsVSLSNSH
18453,ULK2,Q8IYT8,DENND3,A2RUS2,DENND3,S472,THRRMVVsMPNLQDI


Now that the Kinase_Substrate_Dataset has been filtered and cleaned to our specifications, we need to incorporate any further required information from the Phosphorylation_site_dataset.

In [9]:
#Load in the Phosphorylation_site_dataset into pandas and check to make sure it has loaded in properly.
df1 = pd.read_csv('Phosphorylation_site_dataset.csv')
df1.head()

Unnamed: 0,GENE,PROTEIN,ACC_ID,HU_CHR_LOC,MOD_RSD,SITE_GRP_ID,ORGANISM,MW_kD,DOMAIN,SITE_+/-7_AA,LT_LIT,MS_LIT,MS_CST,CST_CAT#
0,1110035H17Rik,1110035H17Rik,Q9CTA4,7|7,S10-p,7231581,mouse,24.31,,RPPPGSRstVAQSPP,,1.0,,
1,1110035H17Rik,1110035H17Rik,Q9CTA4,7|7,T11-p,7231583,mouse,24.31,,PPPGSRstVAQSPPQ,,1.0,,
2,YWHAB,14-3-3 beta,P31946,20q13.12,T2-p,15718712,human,28.08,,______MtMDksELV,,3.0,1.0,
3,Ywhab,14-3-3 beta,Q9CQV8,2|2 H3,T2-p,15718712,mouse,28.09,,______MtMDksELV,,2.0,,
4,YWHAB,14-3-3 beta,P31946,20q13.12,S6-p,15718709,human,28.08,,__MtMDksELVQkAk,,8.0,,


In [10]:
#Similar to the Kinase_substrate_dataset we see that there is data on organisms that aren't human. So we filter the dataframe 
#in the same way to only have rows where the organism is human and check to make sure this has been done.
df1 = df1[df1.ORGANISM == 'human']
df1.head()

Unnamed: 0,GENE,PROTEIN,ACC_ID,HU_CHR_LOC,MOD_RSD,SITE_GRP_ID,ORGANISM,MW_kD,DOMAIN,SITE_+/-7_AA,LT_LIT,MS_LIT,MS_CST,CST_CAT#
2,YWHAB,14-3-3 beta,P31946,20q13.12,T2-p,15718712,human,28.08,,______MtMDksELV,,3.0,1.0,
4,YWHAB,14-3-3 beta,P31946,20q13.12,S6-p,15718709,human,28.08,,__MtMDksELVQkAk,,8.0,,
6,YWHAB,14-3-3 beta,P31946,20q13.12,Y21-p,3426383,human,28.08,14-3-3,LAEQAERyDDMAAAM,,,4.0,
8,YWHAB,14-3-3 beta,P31946,20q13.12,T32-p,23077803,human,28.08,14-3-3,AAAMkAVtEQGHELs,,,1.0,
9,YWHAB,14-3-3 beta,P31946,20q13.12,S39-p,27442700,human,28.08,14-3-3,tEQGHELsNEERNLL,,4.0,,


Looking at the data in the Phosphorylation_site_dataset, the only column of interest we'd want to include in our final PhosphoSite table is the Human Chromosome Location. To incorporate it, we first removed all of the columns except for the Human Chromosome Location and a column shared between the two datasets, i.e. ACC_ID, which is equivalent to SUB_UNIPROT_ID in our cleaned Kinase_substrate_dataset.

In [11]:
#Dropping all columns in the Phosphorylation_site_dataset except ACC_ID and HU_CHR_LOC

df1 = df1.drop(columns=['GENE','PROTEIN', 'MOD_RSD', 'SITE_GRP_ID','ORGANISM', 'MW_kD', 'DOMAIN', 'SITE_+/-7_AA', 'LT_LIT',
                        'MS_LIT', 'MS_CST', 'CST_CAT#'])
df1.head()

Unnamed: 0,ACC_ID,HU_CHR_LOC
2,P31946,20q13.12
4,P31946,20q13.12
6,P31946,20q13.12
8,P31946,20q13.12
9,P31946,20q13.12


In [12]:
#We re-name the ACC_ID column to match our other dataset in order to actually carry out the next merge step.
df1 = df1.rename(index=str, columns = {'ACC_ID': 'SUB_UNIPROT_ID', 'HU_CHR_LOC': 'CHR_LOC'}) 
df1

Unnamed: 0,SUB_UNIPROT_ID,CHR_LOC
2,P31946,20q13.12
4,P31946,20q13.12
6,P31946,20q13.12
8,P31946,20q13.12
9,P31946,20q13.12
...,...,...
371197,Q8IYH5,1p31.1
371198,Q8IYH5,1p31.1
371200,Q8IYH5,1p31.1
371201,Q8IYH5,1p31.1


There is a large difference in the number of rows between our two dataframes (10981 vs 239530 rows). Most of the rows in df1 are duplicates, because every phosphosite on a protein repeats the same UniProt ID and chromosome location. If we merged on SUB_UNIPROT_ID as it stands, each row of df would be matched against every duplicate in df1 and the result would be inflated with repeated rows. So we drop the duplicates from df1 first, leaving one row per UniProt ID and chromosome location pair.
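
To see why duplicate keys matter, here is a small sketch on toy dataframes (the IDs and locations are borrowed from the tables in this notebook, but the frames themselves are made up for illustration). It also uses the `validate` argument of `pd.merge` to assert that the lookup table has one row per key.

```python
import pandas as pd

sites = pd.DataFrame({
    "SUB_UNIPROT_ID": ["P05198", "P05198", "Q9NYL2"],
    "SUB_MOD_RSD": ["S52", "S49", "T162"],
})
# One row per phosphosite, so each protein/location pair is repeated.
locations = pd.DataFrame({
    "SUB_UNIPROT_ID": ["P05198", "P05198", "P05198", "Q9NYL2", "Q9NYL2"],
    "CHR_LOC": ["14q23.3", "14q23.3", "14q23.3", "2q31.1", "2q31.1"],
})

# Merging against the duplicated table multiplies the matching rows.
inflated = pd.merge(sites, locations, on="SUB_UNIPROT_ID", how="inner")

# After dropping duplicates, every site matches exactly one location row.
unique_locations = locations.drop_duplicates()
clean = pd.merge(sites, unique_locations, on="SUB_UNIPROT_ID", how="inner",
                 validate="many_to_one")

print(len(sites), len(inflated), len(clean))
```

With `validate="many_to_one"`, pandas raises an error if the right-hand table still contains repeated keys, which makes this kind of row inflation hard to miss.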

In [13]:
#Dropping duplicates in Dataframe1.
df1 = df1.drop_duplicates(keep = 'first', inplace=False)
df1

Unnamed: 0,SUB_UNIPROT_ID,CHR_LOC
2,P31946,20q13.12
49,P62258,17p13.3
85,Q04917,22q12.3
115,P61981,7q11.23
161,P31947,1p36.11
...,...,...
371037,Q9C0D3,1p32.3
371042,Q7Z7L7,9q34.11
371050,Q15942,7q34
371107,O43149,17p13.2


In [14]:
#Now that the rows are reduced, we merge the dataframes together on SUB_UNIPROT_ID so that Human Chromosome Location is now
#included in our cleaned table.
df = pd.merge(df,df1[['SUB_UNIPROT_ID','CHR_LOC']],on='SUB_UNIPROT_ID', how='inner')
df

Unnamed: 0,KINASE_GENE_NAME,KIN_UNIPROT_ID,SUBSTRATE_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME,SUB_MOD_RSD,SITE_7_AA,CHR_LOC
0,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3
1,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3
2,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3
3,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3
4,P38A,Q16539,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3
...,...,...,...,...,...,...,...,...
10976,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T162,SRFHNHTtHMSLVGT,2q31.1
10977,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,S165,HNHTTHMsLVGTFPW,2q31.1
10978,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T161,ASRFHNHtTHMSLVG,2q31.1
10979,MEKK4,Q9Y6R4,MEKK4,Q9Y6R4,MAP3K4,T1494,KLKNNAQtMPGEVNS,6q26
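
Note that an inner merge silently discards any substrate whose UniProt ID has no match in the location table. A quick way to check for that, sketched here on toy data, is a left merge with `indicator=True`; the ID `X00000` is invented purely to show an unmatched row.

```python
import pandas as pd

sites = pd.DataFrame({
    "SUB_UNIPROT_ID": ["P05198", "Q9NYL2", "X00000"],  # X00000 is a made-up ID
    "SUB_MOD_RSD": ["S52", "T162", "S1"],
})
locations = pd.DataFrame({
    "SUB_UNIPROT_ID": ["P05198", "Q9NYL2"],
    "CHR_LOC": ["14q23.3", "2q31.1"],
})

# how="left" keeps every site; the _merge column records where each row matched.
checked = pd.merge(sites, locations, on="SUB_UNIPROT_ID", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched["SUB_UNIPROT_ID"].tolist())
```

If the list of unmatched IDs is empty, the inner merge and the left merge give the same rows.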


On its own, this table would normally be more than enough for our database. However, the data visualisation part of the web application needs to cross-reference the phosphosite table of our database. As some of the substrates in the example phosphoproteomic data file are identified by UniProt entry names in addition to gene names, we thought it would also be useful to include a column containing the corresponding UniProt entry name for each substrate.

To do this we can run our existing UniProt IDs through the UniProt API, which returns the corresponding entry names. Our current df has almost 11k rows, which would take the UniProt API a long time to process. We can reduce the processing time by dropping duplicate UniProt IDs from our dataframe and re-merging the corresponding entries into our dataframe at a later point.
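
The idea of looking each ID up once and mapping the result back can be sketched without pandas or a network connection. Here `fetch_entry_name` and `FAKE_UNIPROT` are stand-ins invented for this example; the notebook queries UniProt itself rather than a local dictionary.

```python
# Stand-in for a slow remote lookup (hypothetical, for illustration only).
FAKE_UNIPROT = {"P05198": "IF2A_HUMAN", "Q9NYL2": "M3K20_HUMAN"}
calls = []

def fetch_entry_name(uniprot_id):
    calls.append(uniprot_id)  # record each lookup so we can count them
    return FAKE_UNIPROT.get(uniprot_id, "-")

substrate_ids = ["P05198", "P05198", "Q9NYL2", "P05198", "Q9NYL2"]

# dict.fromkeys keeps first-seen order while discarding repeats.
unique_ids = list(dict.fromkeys(substrate_ids))
entry_by_id = {uid: fetch_entry_name(uid) for uid in unique_ids}

# Map the results back onto the full list, one entry name per original row.
entry_names = [entry_by_id[uid] for uid in substrate_ids]

print(len(calls), entry_names)
```

Five rows need only two lookups here; the cells that follow do the same thing with `drop_duplicates` and `pd.merge`.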

In [15]:
#We drop all of our columns except for SUB_UNIPROT_ID and SUB_GENE_NAME and place it into another dataframe object then we can 
#drop any duplicates.
df2 = df.drop(columns=['KINASE_GENE_NAME','KIN_UNIPROT_ID', 'SUBSTRATE_NAME', 'SUB_MOD_RSD','SITE_7_AA', 'CHR_LOC'])
df2 = df2.drop_duplicates(keep = 'first', inplace=False)

#Reduces the number of rows from 10981 rows to 2531 rows.
df2 

Unnamed: 0,SUB_UNIPROT_ID,SUB_GENE_NAME
0,P05198,EIF2S1
9,Q9UQL6,HDAC5
30,P18433-2,PTPRA
33,P61978,HNRNPK
48,Q9UQ13,SHOC2
...,...,...
10963,Q9UJV9,DDX41
10964,Q9NWZ3,IRAK4
10976,Q9NYL2,MAP3K20
10979,Q9Y6R4,MAP3K4


In [16]:
#Now we put all of our UniProt IDs into another object in order to feed through the UniProt API.
UniProt_IDs = df2['SUB_UNIPROT_ID'].values

When we tried to run our UniProt IDs through the UniProt API using a for loop, we got errors. After investigating, we found that the UniProt IDs causing these errors fell into three categories:

1) UniProt IDs for isoforms of a protein, which are labelled with a '-', e.g. P18433-2

2) UniProt IDs for a specific variant of a protein, which are labelled with '_VAR_', e.g. O60504_VAR_055019

3) IDs that don't fall into either of the two categories above and don't have a page on UniProt when searched manually, e.g. AAC50053

To fix the errors caused by 1 and 2, we used a regex to remove all of the characters from a '-' or '_' onwards. As we were unable to find corresponding UniProt entries for IDs that fell into category 3, we simply assigned them a blank entry.
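
The ID clean-up for categories 1 and 2 can be tried on its own before any requests are made. This is a minimal sketch; `base_accession` is a name chosen for this example.

```python
import re

def base_accession(uniprot_id):
    """Strip an isoform suffix ('-2') or a variant suffix ('_VAR_...') from an ID."""
    # Remove everything from the first '-' or '_' to the end of the string.
    return re.sub(r"[-_].*$", "", str(uniprot_id))

for raw in ["P18433-2", "O60504_VAR_055019", "P05198", "AAC50053"]:
    print(raw, "->", base_accession(raw))
```

IDs in category 3, such as AAC50053, pass through unchanged, so they still need the blank-entry fallback.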

In [17]:
#Create an empty list to contain all of our UniProt Entries.
EntryList = []

for ID in UniProt_IDs: #for every ID in our object containing all of our UniProt IDs
    ID = str(ID) #convert the ID into a string
    ID = re.sub(r'[-_].*$', '', ID) #strip everything from the first '-' or '_' onwards to clean IDs in categories 1 and 2.
    try: #pass cleaned IDs through UniProt API to get a table containing ID and corresponding UniProt Entry name read by pandas
         #(this is the query URL as used when the notebook was written; check it is still served before re-running)
        data = pd.read_csv("https://www.uniprot.org/uniprot/?query="+ID+"&sort=score&columns=id,entry%20name&format=tab", 
                           sep="\t") 
        entry = data["Entry name"][0] #we then store the first value in the Entry name column of the table as this corresponds
                                      #to our specific ID.
        EntryList.append(entry) #We then append that entry name into our empty list
    except Exception: #catches failed requests and empty results, but still lets us interrupt the loop from the keyboard
        EntryList.append('-') #In the case of IDs in category 3 we just append a '-' to represent a blank.

In [18]:
#Now we insert our list of UniProt entries to our second dataframe...
df2.insert(0, 'SUB_ENTRY_NAME', EntryList)
df2

Unnamed: 0,SUB_ENTRY_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME
0,IF2A_HUMAN,P05198,EIF2S1
9,HDAC5_HUMAN,Q9UQL6,HDAC5
30,PTPRA_HUMAN,P18433-2,PTPRA
33,HNRPK_HUMAN,P61978,HNRNPK
48,SHOC2_HUMAN,Q9UQ13,SHOC2
...,...,...,...
10963,DDX41_HUMAN,Q9UJV9,DDX41
10964,IRAK4_HUMAN,Q9NWZ3,IRAK4
10976,M3K20_HUMAN,Q9NYL2,MAP3K20
10979,M3K4_HUMAN,Q9Y6R4,MAP3K4


In [19]:
#...and merge it with our original dataframe so that UniProt Entries are included.
df = pd.merge(df,df2[['SUB_UNIPROT_ID','SUB_ENTRY_NAME']],on='SUB_UNIPROT_ID', how='inner')
df

Unnamed: 0,KINASE_GENE_NAME,KIN_UNIPROT_ID,SUBSTRATE_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME,SUB_MOD_RSD,SITE_7_AA,CHR_LOC,SUB_ENTRY_NAME
0,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
1,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3,IF2A_HUMAN
2,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3,IF2A_HUMAN
3,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
4,P38A,Q16539,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
...,...,...,...,...,...,...,...,...,...
10976,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T162,SRFHNHTtHMSLVGT,2q31.1,M3K20_HUMAN
10977,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,S165,HNHTTHMsLVGTFPW,2q31.1,M3K20_HUMAN
10978,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T161,ASRFHNHtTHMSLVG,2q31.1,M3K20_HUMAN
10979,MEKK4,Q9Y6R4,MEKK4,Q9Y6R4,MAP3K4,T1494,KLKNNAQtMPGEVNS,6q26,M3K4_HUMAN


In [23]:
#Finally, we insert a PHOSPHO_ID column into our dataframe 
df.insert(0, 'PHOSPHO_ID', range(1, 1 + len(df)))
df

Unnamed: 0,PHOSPHO_ID,KINASE_GENE_NAME,KIN_UNIPROT_ID,SUBSTRATE_NAME,SUB_UNIPROT_ID,SUB_GENE_NAME,SUB_MOD_RSD,SITE_7_AA,CHR_LOC,SUB_ENTRY_NAME
0,1,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
1,2,HRI,Q9BQI3,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3,IF2A_HUMAN
2,3,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S49,IEGMILLsELSRRRI,14q23.3,IF2A_HUMAN
3,4,PKR,P19525,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
4,5,P38A,Q16539,eIF2-alpha,P05198,EIF2S1,S52,MILLSELsRRRIRSI,14q23.3,IF2A_HUMAN
...,...,...,...,...,...,...,...,...,...,...
10976,10977,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T162,SRFHNHTtHMSLVGT,2q31.1,M3K20_HUMAN
10977,10978,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,S165,HNHTTHMsLVGTFPW,2q31.1,M3K20_HUMAN
10978,10979,ZAK,Q9NYL2,ZAK,Q9NYL2,MAP3K20,T161,ASRFHNHtTHMSLVG,2q31.1,M3K20_HUMAN
10979,10980,MEKK4,Q9Y6R4,MEKK4,Q9Y6R4,MAP3K4,T1494,KLKNNAQtMPGEVNS,6q26,M3K4_HUMAN


In [24]:
#Writing our final dataframe into a CSV file
df.to_csv('Phosphosite_Table.csv')

Lastly, we check through our final file for any errors that were either missed or not fixable through code and correct them manually. One example is correcting Substrate Gene Names such as 'SEPT1' and 'Oct1' back to their actual names, as Excel read these genes, and others with similar names, as dates instead of gene names. Once everything has been thoroughly checked, our Phosphosite file is ready to be loaded into our database.
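
Part of this manual check could be narrowed down with a quick scan for gene names that look like dates. The sketch below assumes the mangled values appear in a day-month form such as '1-Sep'; the exact form depends on Excel's locale and cell formatting, so the pattern is an assumption, and `looks_like_excel_date` is a hypothetical helper.

```python
import re

# Assumed pattern: one or two digits, a hyphen, then a three-letter month.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)

def looks_like_excel_date(value):
    """Return True if the value resembles a date Excel may have substituted for a gene name."""
    return bool(DATE_LIKE.match(str(value).strip()))

gene_names = ["EIF2S1", "1-Sep", "HDAC5", "1-Oct", "MAP3K4"]
suspect = [name for name in gene_names if looks_like_excel_date(name)]
print(suspect)
```

A scan like this only flags candidates; working out the correct gene name for each flagged row still needs a manual check against the source data.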