# Phosphosite Data Processing, 2 of 2.
## Combining Phosphosite.org and Phospho.ELM Data.

Three phosphosite data tables have been downloaded as tab-delimited files from https://www.phosphosite.org/staticDownloads:

- Phosphorylation_site_dataset
    - The full list of phosphosites available on the website (371,203 rows long on 2020_01_11)
  
- Kinase_Substrate_Dataset
    - A list of known kinase-phosphosite relationships (18,455 rows long on 2020_01_11)
    
- Disease-associated_sites
    - A list of phosphosites with known links to disease (1,363 rows long on 2020_01_11)

We also have a partially-processed data file from http://phospho.elm.eu.org/dataset.html:

- phosphosites_2.csv
    - The full list of human phosphosites available on the website (37,145 rows long on 2020_01_17)
    
Before running this Jupyter notebook, "Phosphosite-data-processing-1-of-2-Phospho.ELM-downloaded-file.ipynb" needs to have been run to generate phosphosites_2.csv.

In this Jupyter notebook, the three phosphosite.org files will be imported as pandas dataframes, and will have the following changes made to them:

   - Remove any rows with information about non-human proteins (targets and kinases).
   - Remove any rows where "gene" is blank.
   - Remove any rows in the diseases table where "disease" is blank.
   - For tables with a "modified residue" column that includes a flag at the end to denote the type of post-translational modification (PTM), remove any that are not "-p" for "phosphorylation".
   - Generate "phosphosite ID" columns, using information in other columns, to match the format in user-submitted quantitative phosphoproteomics files, and to allow the tables to be linked in the SQL database.
   - Generate a "source" column to direct the user to the phosphosite.org page for that substrate protein.
   - Add "sequence" and "PMID" columns to capture additional information from phospho.ELM.
   - Add rows to the phosphosites table for any missing phosphosites that appear in the other tables, to allow them to be linked, and to include phosphosites only found in phospho.ELM.
   - Add rows to kin_sub_df for any missing phosphosites that appear in the phospho.ELM table.
   - Ensure all genes listed are in uppercase.
   - Remove the unnecessary "-p" from the "modified residue" column in the phosphosite table, as our web app will only specialise in phosphosites.
   - Drop any other unnecessary columns.
   - Add a binary column to the phosphosites table to indicate whether the row is for an isoform (1) or not (0).
   - In the phosphosites table, remove duplicated rows.
   - Add a primary key column to each table.

Finally, they will be exported as .csv files.

Import required packages

In [None]:
import pandas as pd
import re

Read in data files

In [None]:
# Skip malformed lines (on_bad_lines requires pandas 1.3 or later)
phosphosite_df = pd.read_table( "Phosphorylation_site_dataset", on_bad_lines = "skip" )

In [None]:
# Skip malformed lines (on_bad_lines requires pandas 1.3 or later)
kin_sub_df = pd.read_table( "Kinase_Substrate_Dataset", on_bad_lines = "skip" )

In [None]:
# Skip malformed lines (on_bad_lines requires pandas 1.3 or later)
disease_df = pd.read_table( "Disease-associated_sites", on_bad_lines = "skip" )

# A strange column "Unnamed: 19" is created upon import: drop this

disease_df = disease_df.drop( [ 'Unnamed: 19' ], axis = 1 )
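One possible cause of an extra "Unnamed" column is a stray trailing tab on every line of the file; this is an assumption about the download, not something checked here. The sketch below reproduces that situation with a made-up two-column table (the column values are placeholders) and drops any column whose name starts with "Unnamed", which avoids hard-coding the column number.

```python
import io

import pandas as pd

# Made-up table: every line, including the header, ends with a stray tab
raw = "GENE\tDISEASE\t\nGENE_A\tdisease_x\t\nGENE_B\tdisease_y\t\n"
toy_df = pd.read_table(io.StringIO(raw))

# pandas names the empty trailing header field "Unnamed: <position>"
unnamed_cols = [c for c in toy_df.columns if c.startswith("Unnamed")]
cleaned_df = toy_df.drop(columns=unnamed_cols)

print(unnamed_cols)
print(list(cleaned_df.columns))
```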

In [None]:
# Read in partially-processed phospho.ELM data file

phosphosites_2_df = pd.read_csv( "phosphosites_2.csv" )

Remove any rows where ORGANISM is not "human"

In [None]:
phosphosite_df = phosphosite_df.drop( phosphosite_df[ phosphosite_df.ORGANISM != "human" ].index )

In [None]:
# kin_sub_df contains a substrate organism and a kinase organism

kin_sub_df = kin_sub_df.drop( kin_sub_df[ kin_sub_df.SUB_ORGANISM != "human" ].index )
kin_sub_df = kin_sub_df.drop( kin_sub_df[ kin_sub_df.KIN_ORGANISM != "human" ].index )

In [None]:
disease_df = disease_df.drop( disease_df[ disease_df.ORGANISM != "human" ].index )
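The organism filters above could also be written as a single boolean mask that keeps the wanted rows, rather than dropping the unwanted ones. This is a minimal sketch on a made-up table; only the column names are taken from kin_sub_df.

```python
import pandas as pd

# Made-up rows; only the column names come from kin_sub_df
toy_df = pd.DataFrame({
    "SUB_ORGANISM": ["human", "mouse", "human", "human"],
    "KIN_ORGANISM": ["human", "human", "rat", "human"],
})

# Keep rows where both the substrate and the kinase are human
both_human = (toy_df.SUB_ORGANISM == "human") & (toy_df.KIN_ORGANISM == "human")
human_only = toy_df[both_human].reset_index(drop=True)

print(human_only)
```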

Remove any rows where GENE is blank

In [None]:
phosphosite_df = phosphosite_df.dropna( subset = [ "GENE" ]  )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [None]:
# kin_sub_df has a kinase gene and a substrate gene

kin_sub_df = kin_sub_df.dropna( subset = [ "GENE" ] )
kin_sub_df = kin_sub_df.dropna( subset = [ "SUB_GENE" ] )
kin_sub_df = kin_sub_df.reset_index( drop = True )

In [None]:
disease_df = disease_df.dropna( subset = [ "GENE" ] )
disease_df = disease_df.reset_index( drop = True )

Remove any rows where DISEASE is blank

In [None]:
disease_df = disease_df.dropna( subset = [ "DISEASE" ] )
disease_df = disease_df.reset_index( drop = True )

Remove any rows where the PTM is not phosphorylation (indicated by "-p" appended to MOD_RSD). N.B. kin_sub_df is not included here because its SUB_MOD_RSD column does not contain the modification type, just the amino acid and residue number.

In [None]:
# This takes around five minutes to run, and should not be necessary as
# the file just contains phosphosites, but is included in case of error

regex = r'\w{1}\d+-p'

indices = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    if re.findall( regex, str( i[ 1 ]) ):
        continue
    else:
        indices.append( n )

phosphosite_df = phosphosite_df.drop( indices )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [None]:
regex = r'\w{1}\d+-p'

indices = []

for n, i in enumerate( disease_df.iterrows() ):
    if re.findall( regex, str( i[ 1 ]) ):
        continue
    else:
        indices.append( n )

disease_df = disease_df.drop( indices )
disease_df = disease_df.reset_index( drop = True )
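A faster alternative, sketched here on made-up values, is to test the MOD_RSD column directly with the vectorised string methods. Note that this is stricter than the loops above, which search the text of the whole row; the pattern used here (one letter, digits, then "-p") is an assumption about the column's format.

```python
import pandas as pd

# Made-up MOD_RSD values with a mixture of modification flags
toy_df = pd.DataFrame(
    {"MOD_RSD": ["S15-p", "K120-ub", "T308-p", "Y705-p", "K382-ac"]}
)

# True where the whole value looks like "<letter><digits>-p"
is_phospho = toy_df.MOD_RSD.str.contains(r"^[A-Za-z]\d+-p$", regex=True)
phospho_only = toy_df[is_phospho].reset_index(drop=True)

print(phospho_only)
```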

Make columns of phosphosite IDs in a standardised format:
- to enable tables to be linked in the SQL database
- to match the format in the user-submitted quantitative phosphoproteomics files

The diseases table only needs to be linked to the phosphosites table and thus does not require multiple PHOS_ID columns.

Different gene / protein aliases will be combined with the residue information in each column.

In [None]:
phos_id = []
phos_id2 = []
phos_id3 = []
phos_id4 = []
phos_id5 = [] # This will use the UniProt ID and will be used as 
              # a foreign key

for n, i in enumerate( kin_sub_df.iterrows() ):
    phos_id.append( kin_sub_df.SUB_GENE[ n ].upper() + "_" + kin_sub_df.SUB_ORGANISM[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    phos_id2.append( kin_sub_df.SUBSTRATE[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")")
    phos_id3.append( kin_sub_df.SUBSTRATE[ n ].upper() + "_" + kin_sub_df.SUB_ORGANISM[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    phos_id4.append( kin_sub_df.SUB_GENE[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    phos_id5.append( kin_sub_df.SUB_ACC_ID[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    
phos_id = pd.Series( phos_id )
phos_id2 = pd.Series( phos_id2 )
phos_id3 = pd.Series( phos_id3 )
phos_id4 = pd.Series( phos_id4 )
phos_id5 = pd.Series( phos_id5 )

kin_sub_df = kin_sub_df.assign( PHOS_ID = phos_id )
kin_sub_df = kin_sub_df.assign( PHOS_ID2 = phos_id2 )
kin_sub_df = kin_sub_df.assign( PHOS_ID3 = phos_id3 )
kin_sub_df = kin_sub_df.assign( PHOS_ID4 = phos_id4 )
kin_sub_df = kin_sub_df.assign( PHOS_ID5 = phos_id5 )

In [None]:
# This takes around three minutes to run

phos_id = []
phos_id2 = []
phos_id3 = []
phos_id4 = []
phos_id5 = [] # This will use the UniProt ID and will be used as 
              # a foreign key

for n, i in enumerate( phosphosite_df.iterrows() ):
    phos_id.append( phosphosite_df.GENE[ n ].upper() + "_" + phosphosite_df.ORGANISM[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    phos_id2.append( phosphosite_df.PROTEIN[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    phos_id3.append( phosphosite_df.PROTEIN[ n ].upper() + "_" + phosphosite_df.ORGANISM[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    phos_id4.append( phosphosite_df.GENE[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    phos_id5.append( phosphosite_df.ACC_ID[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )

phos_id = pd.Series( phos_id )
phos_id2 = pd.Series( phos_id2 )
phos_id3 = pd.Series( phos_id3 )
phos_id4 = pd.Series( phos_id4 )
phos_id5 = pd.Series( phos_id5 )

phosphosite_df = phosphosite_df.assign( PHOS_ID = phos_id )
phosphosite_df = phosphosite_df.assign( PHOS_ID2 = phos_id2 )
phosphosite_df = phosphosite_df.assign( PHOS_ID3 = phos_id3 )
phosphosite_df = phosphosite_df.assign( PHOS_ID4 = phos_id4 )
phosphosite_df = phosphosite_df.assign( PHOS_ID5 = phos_id5 )

In [None]:
phos_id = [] # This will use the UniProt ID and will be used as 
              # a foreign key

for n, i in enumerate( disease_df.iterrows() ):
    phos_id.append( disease_df.ACC_ID[ n ].upper() + "(" + disease_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    
phos_id = pd.Series( phos_id )

disease_df = disease_df.assign( PHOS_ID = phos_id )
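The ID formats built above can be illustrated on a two-row example. The sketch below uses whole-column string operations instead of a row loop, which should give the same strings as long as no value is missing; the toy rows are for illustration only.

```python
import pandas as pd

# Toy rows using the column names of phosphosite_df
toy_df = pd.DataFrame({
    "GENE": ["Tp53", "AKT1"],
    "ACC_ID": ["P04637", "P31749"],
    "ORGANISM": ["human", "human"],
    "MOD_RSD": ["S15-p", "T308-p"],
})

# Strip the trailing "-p" once and reuse it for every ID format
residue = toy_df.MOD_RSD.str[:-2].str.upper()
gene = toy_df.GENE.str.upper()

toy_df["PHOS_ID"] = gene + "_" + toy_df.ORGANISM.str.upper() + "(" + residue + ")"
toy_df["PHOS_ID4"] = gene + "(" + residue + ")"
toy_df["PHOS_ID5"] = toy_df.ACC_ID.str.upper() + "(" + residue + ")"

print(toy_df[["PHOS_ID", "PHOS_ID4", "PHOS_ID5"]])
```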

Add a "Source" column to direct the user to the phosphosite.org page for that substrate protein

In [None]:
source = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    source.append( "http://www.phosphosite.org/uniprotAccAction?id=" + str( phosphosite_df.ACC_ID[ n ] ))

source = pd.Series( source )
phosphosite_df = phosphosite_df.assign( SOURCE = source )                  

In [None]:
source = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    source.append( "http://www.phosphosite.org/uniprotAccAction?id=" + str( kin_sub_df.SUB_ACC_ID[ n ] ))

source = pd.Series( source )
kin_sub_df = kin_sub_df.assign( SOURCE = source )  

Add "sequence" and "PMID" columns to capture interesting information from phospho.ELM

In [None]:
column = [ None ] * len( phosphosite_df.PHOS_ID )

sequence = pd.Series( column )
PMID = pd.Series( column )

phosphosite_df = phosphosite_df.assign( SEQUENCE = sequence )
phosphosite_df = phosphosite_df.assign( PMID = PMID )

Add rows to phosphosite_df for any missing phosphosites that appear in the other tables, including the phospho.ELM table, so that PHOS_ID5 can be used as the foreign key in the database

In [None]:
# Use PHOS_ID5 to cross-check whether the phosphosites in kin_sub_df
# are already in phosphosite_df

# Known IDs are tracked in a set and the new rows are collected in a
# list, then added in one step with pd.concat (DataFrame.append was
# removed in pandas 2.0)

known_ids = set( phosphosite_df[ 'PHOS_ID5' ] )
new_rows = []

for n, i in enumerate( kin_sub_df[ 'PHOS_ID5' ] ):
    if i not in known_ids:
        # GENE holds the gene symbol, as in phosphosite_df.GENE
        new_rows.append( { 'GENE' : kin_sub_df.SUB_GENE[ n ],
                'PROTEIN' : kin_sub_df.SUBSTRATE[ n ],
                'ACC_ID' : kin_sub_df.SUB_ACC_ID[ n ],
                'HU_CHR_LOC' : '',
                'MOD_RSD' : kin_sub_df.SUB_MOD_RSD[ n ] + "-p",
                'SITE_GRP_ID' : kin_sub_df.SITE_GRP_ID[ n ],
                'ORGANISM' : kin_sub_df.SUB_ORGANISM[ n ],
                'MW_kD' : 0.00,
                'DOMAIN' : kin_sub_df.DOMAIN[ n ],
                'SITE_+/-7_AA' : kin_sub_df[ 'SITE_+/-7_AA' ][ n ],
                'LT_LIT' : 0.0, 'MS_LIT' : 0.0, 'MS_CST' : 0.0,
                'CST_CAT#' : 0.0,
                'PHOS_ID' : kin_sub_df.PHOS_ID[ n ],
                'PHOS_ID2' : kin_sub_df.PHOS_ID2[ n ],
                'PHOS_ID3' : kin_sub_df.PHOS_ID3[ n ],
                'PHOS_ID4' : kin_sub_df.PHOS_ID4[ n ],
                'PHOS_ID5' : i,
                'SEQUENCE' : '', 'PMID' : '',
                'SOURCE':'http://www.phosphosite.org/uniprotAccAction?id=' + str( kin_sub_df.SUB_ACC_ID[ n ] )
               } )
        known_ids.add( i )

if new_rows:
    phosphosite_df = pd.concat( [ phosphosite_df, pd.DataFrame( new_rows ) ],
                               ignore_index = True )

In [None]:
# Cross-check whether the phosphosites in disease_df
# are already in phosphosite_df

# Known IDs are tracked in a set and the new rows are collected in a
# list, then added in one step with pd.concat

known_ids = set( phosphosite_df[ 'PHOS_ID5' ] )
new_rows = []

for n, i in enumerate( disease_df[ 'PHOS_ID' ] ):
    if i not in known_ids:
        new_rows.append( { 'GENE' : disease_df.GENE[ n ],
                'PROTEIN' : disease_df.PROTEIN[ n ],
                'ACC_ID' : disease_df.ACC_ID[ n ],
                'HU_CHR_LOC' : disease_df.HU_CHR_LOC[ n ],
                'MOD_RSD' : disease_df.MOD_RSD[ n ],
                'SITE_GRP_ID' : disease_df.SITE_GRP_ID[ n ],
                'ORGANISM' : disease_df.ORGANISM[ n ],
                'MW_kD' : disease_df.MW_kD[ n ],
                'DOMAIN' : disease_df.DOMAIN[ n ],
                'SITE_+/-7_AA' : disease_df[ 'SITE_+/-7_AA' ][ n ],
                'LT_LIT' : disease_df.LT_LIT[ n ],
                'MS_LIT' : disease_df.MS_LIT[ n ],
                'MS_CST' : disease_df.MS_CST[ n ],
                'CST_CAT#' : disease_df[ 'CST_CAT#' ][ n ],
                'PHOS_ID' : '',
                'PHOS_ID2' : '',
                'PHOS_ID3' : '',
                'PHOS_ID4' : '',
                'PHOS_ID5' : i,
                'SEQUENCE' : '',
                'PMID' : disease_df.PMIDs[ n ],
                'SOURCE' : 'http://www.phosphosite.org/uniprotAccAction?id=' + str( disease_df.ACC_ID[ n ])
               } )
        known_ids.add( i )

if new_rows:
    phosphosite_df = pd.concat( [ phosphosite_df, pd.DataFrame( new_rows ) ],
                               ignore_index = True )

In [None]:
# Cross-check whether the phosphosites in phospho.ELM
# are already in phosphosite_df

# Known IDs are tracked in a set and the new rows are collected in a
# list, then added in one step with pd.concat

known_ids = set( phosphosite_df[ 'PHOS_ID5' ] )
new_rows = []

for n, i in enumerate( phosphosites_2_df.PHOS_ID ):
    if i not in known_ids:
        new_rows.append( { 'GENE' : '',
                'PROTEIN' : '',
                'ACC_ID' : phosphosites_2_df.acc[ n ],
                'HU_CHR_LOC' : '',
                'MOD_RSD' : phosphosites_2_df.code[ n ] + str( phosphosites_2_df.position[ n ]) + '-p',
                'SITE_GRP_ID' : '',
                'ORGANISM' : phosphosites_2_df.species[ n ],
                'MW_kD' : '',
                'DOMAIN' : '',
                'SITE_+/-7_AA' : '',
                'LT_LIT': '',
                'MS_LIT' : '',
                'MS_CST' : '',
                'CST_CAT#' : '',
                'PHOS_ID' : '',
                'PHOS_ID2' : '',
                'PHOS_ID3' : '',
                'PHOS_ID4' : '',
                'PHOS_ID5' : i,
                'SEQUENCE' : phosphosites_2_df.sequence[ n ],
                'PMID' : phosphosites_2_df.pmids[ n ],
                'SOURCE' : 'phospho.ELM' } )
        known_ids.add( i )

if new_rows:
    phosphosite_df = pd.concat( [ phosphosite_df, pd.DataFrame( new_rows ) ],
                               ignore_index = True )
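The cross-checks above amount to an anti-join: find the IDs present in one table but absent from another, then add them. A loop-free way to express this is sketched below on two small made-up tables; the table names and the shortened SOURCE label are hypothetical, and only the PHOS_ID / PHOS_ID5 column names mirror the notebook.

```python
import pandas as pd

# Made-up stand-ins for phosphosite_df and the phospho.ELM table
main_df = pd.DataFrame({
    "PHOS_ID5": ["P04637(S15)", "P31749(T308)"],
    "SOURCE": ["phosphosite.org", "phosphosite.org"],
})
extra_df = pd.DataFrame(
    {"PHOS_ID": ["P31749(T308)", "P31749(S473)", "P31749(S473)"]}
)

# IDs in extra_df that main_df lacks, keeping the first copy of each
missing = extra_df[~extra_df.PHOS_ID.isin(main_df.PHOS_ID5)]
missing = missing.drop_duplicates(subset="PHOS_ID")

# Build the new rows and add them in a single concatenation
new_rows = pd.DataFrame({"PHOS_ID5": missing.PHOS_ID, "SOURCE": "phospho.ELM"})
combined = pd.concat([main_df, new_rows], ignore_index=True)

print(combined)
```

Checking membership once with `isin` avoids rebuilding a list of IDs for every row, which is where most of the running time of a row-by-row check goes.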

Add rows to kin_sub_df for any missing phosphosites that appear in the phospho.ELM table

In [None]:
# As before, add "SEQUENCE" and "PMID" columns to capture
# additional information from phospho.ELM

column = [ None ] * len( kin_sub_df.PHOS_ID )

sequence = pd.Series( column )
PMID = pd.Series( column )

kin_sub_df = kin_sub_df.assign(SEQUENCE = sequence)
kin_sub_df = kin_sub_df.assign(PMID = PMID)

In [None]:
# Do not add any rows with empty kinase accession numbers as this
# will be a foreign key in the database

# Known IDs are tracked in a set and the new rows are collected in a
# list, then added in one step with pd.concat

known_ids = set( kin_sub_df[ 'PHOS_ID5' ] )
new_rows = []

for n, i in enumerate( phosphosites_2_df.PHOS_ID ):
    if str( phosphosites_2_df.ACC_ID[ n ]).upper() != "NAN" and i not in known_ids:
        new_rows.append( { 'GENE' : phosphosites_2_df.kinases[ n ],
                'KINASE' : '',
                'KIN_ACC_ID' : phosphosites_2_df.ACC_ID[ n ],
                'KIN_ORGANISM' : phosphosites_2_df.species[ n ],
                'SUBSTRATE' : '',
                'SUB_GENE_ID' : '',
                'SUB_ACC_ID' : phosphosites_2_df.acc[ n ],
                'SUB_GENE' : '',
                'SUB_ORGANISM' : '',
                'SUB_MOD_RSD' : 0.0, 'SITE_GRP_ID' : '',
                'SITE_+/-7_AA' : '', 'DOMAIN' : '',
                'IN_VIVO_RXN' : '', 'IN_VITRO_RXN' : '',
                'CST_CAT#' : 0.0, 'PHOS_ID' : '', 'PHOS_ID2': '',
                'PHOS_ID3' : '','PHOS_ID4' : '',
                'PHOS_ID5' : i,
                'SEQUENCE' : phosphosites_2_df.sequence[ n ],
                'PMID' : phosphosites_2_df.pmids[ n ],
                'SOURCE' : 'phospho.ELM' } )
        known_ids.add( i )

if new_rows:
    kin_sub_df = pd.concat( [ kin_sub_df, pd.DataFrame( new_rows ) ],
                           ignore_index = True )

Make GENE entries uppercase

In [None]:
# This takes a minute to run

uppercase_kinase = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    uppercase_kinase.append( str(phosphosite_df.GENE[ n ]).upper() )

uppercase_kinase = pd.Series( uppercase_kinase )

phosphosite_df = phosphosite_df.assign( GENE = uppercase_kinase )

In [None]:
uppercase_kinase = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    uppercase_kinase.append( kin_sub_df.GENE[ n ].upper() )

uppercase_kinase = pd.Series( uppercase_kinase )

kin_sub_df = kin_sub_df.assign( GENE = uppercase_kinase )
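As a possible alternative to the loops above, the `.str.upper()` accessor uppercases a whole column at once. One difference worth knowing, shown in this sketch on made-up values: wrapping a missing value in `str()` before uppercasing turns it into the literal text "NAN", whereas the accessor leaves it missing.

```python
import pandas as pd

# Made-up gene symbols, including one missing value
genes = pd.Series(["Tp53", "akt1", None])

# Vectorised uppercase: missing values stay missing
upper = genes.str.upper()

# Row-by-row str() conversion: the missing value becomes text
upper_via_str = [str(g).upper() for g in genes]

print(upper.tolist())
print(upper_via_str)
```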

Remove -p from MOD_RSD in phosphosite_df

In [None]:
# This takes a minute to run

mod_rsd = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    if phosphosite_df.MOD_RSD[ n ][ -2: ] == "-p":
        mod_rsd.append( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] )
    else:
        mod_rsd.append( phosphosite_df.MOD_RSD[ n ] )

mod_rsd = pd.Series( mod_rsd )

phosphosite_df = phosphosite_df.assign( MOD_RSD = mod_rsd )
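The same suffix removal can be sketched with a single vectorised call. The regular expression `-p$` only matches "-p" at the very end of a value, so entries without the flag pass through unchanged, as in the loop above. The values are made up for illustration.

```python
import pandas as pd

# Made-up MOD_RSD values, one of them already without the flag
mod_rsd = pd.Series(["S15-p", "T308-p", "Y705"])

# Remove a trailing "-p" only; other values are left as they are
stripped = mod_rsd.str.replace(r"-p$", "", regex=True)

print(stripped.tolist())
```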

Drop any unnecessary columns

In [None]:
# ORGANISM is not required as the web app will be human-specific
# and we have already ensured that only human targets are included

phosphosite_df = phosphosite_df.drop( [ 'ORGANISM' ], axis = 1 )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [None]:
# KIN_ORGANISM and SUB_ORGANISM are not required as the web app will be
# human-specific and we have already ensured that only human proteins 
# are included
# KINASE,  SUBSTRATE,  SUB_GENE_ID, SUB_MOD_RSD and SUB_GENE 
# are not required as we have KIN_ACC_ID, SUB_ACC_ID and
# various PHOS_IDs
# SITE_GRP_ID, SITE_+/-7_AA and DOMAIN will be available in 
# phosphosite_df

kin_sub_df = kin_sub_df.drop( [ 'KIN_ORGANISM', 'SUB_ORGANISM',
                               'KINASE', 'SUBSTRATE', 
                               'SUB_GENE_ID', 'SUB_MOD_RSD',
                               'SUB_GENE', 'SITE_GRP_ID',
                               'SITE_+/-7_AA', 'DOMAIN' ],
                             axis = 1 )

kin_sub_df = kin_sub_df.reset_index( drop = True)

In [None]:
# PROTEIN and GENE_ID are unnecessary as we have ACC_ID
# ORGANISM is not required as the web app will be human-specific
# GENE, HU_CHR_LOC, MW_kD, SITE_GRP_ID, MOD_RSD, DOMAIN, and SITE_+/-7_AA
# will be available in phosphosite_df

disease_df = disease_df.drop( [ 'PROTEIN', 'GENE_ID',
                               'ORGANISM', 'GENE', 'HU_CHR_LOC',
                               'MW_kD', 'SITE_GRP_ID', 'MOD_RSD',
                               'DOMAIN', 'SITE_+/-7_AA' ],
                             axis = 1 )

disease_df = disease_df.reset_index( drop = True )

Add a column ISOFORM to the phosphosite_df table to indicate whether the row is for an isoform (1) or not (0)

In [None]:
yes_no_isoform = []

for n, i in enumerate( phosphosite_df.ACC_ID ):
    if "-" in i: # In a UniProt ID, "-" signifies an isoform
        
        yes_no_isoform.append( 1 )
        
    elif "iso" in phosphosite_df.PROTEIN[ n]: # For some records,
        # the presence of "iso" in this field is the only way 
        # the isoform is indicated
        
        yes_no_isoform.append( 1 )
        
    else:
        yes_no_isoform.append( 0 )

yes_no_isoform = pd.Series( yes_no_isoform )

phosphosite_df = phosphosite_df.assign( ISOFORM = yes_no_isoform )
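The two isoform rules above can be combined into one boolean expression, sketched here on a made-up table (the PROTEIN labels are invented for illustration). Like the loop, it assumes that ACC_ID and PROTEIN contain no missing values.

```python
import pandas as pd

# Made-up rows using the column names of phosphosite_df
toy_df = pd.DataFrame({
    "ACC_ID": ["P04637", "P04637-2", "Q9Y6K9"],
    "PROTEIN": ["p53", "p53 iso2", "NEMO iso3"],
})

# Rule 1: "-" in the accession; rule 2: "iso" in the protein name
has_dash = toy_df.ACC_ID.str.contains("-", regex=False)
has_iso = toy_df.PROTEIN.str.contains("iso", regex=False)

# Convert True / False to 1 / 0
toy_df["ISOFORM"] = (has_dash | has_iso).astype(int)

print(toy_df)
```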

Remove any duplicated rows

In [None]:
# In testing, this removed only one row

phosphosite_df = phosphosite_df.drop_duplicates( subset = 'PHOS_ID5', keep = 'first' )
phosphosite_df = phosphosite_df.reset_index( drop = True )

Add primary key columns

In [None]:
prim_key = []

count = 1

for i in phosphosite_df.PHOS_ID:
    key = "PH" + "{:07d}".format( count )
    prim_key.append( key )
    count += 1

prim_key = pd.Series( prim_key )

phosphosite_df = phosphosite_df.assign( ID_PH = prim_key )

In [None]:
prim_key = []

count = 1

for i in kin_sub_df.PHOS_ID:
    key = "KS" + "{:07d}".format( count )
    prim_key.append(key)
    count += 1

prim_key = pd.Series( prim_key )

kin_sub_df = kin_sub_df.assign( ID_KS = prim_key )

In [None]:
prim_key = []

count = 1

for i in disease_df.PHOS_ID:
    key = "DP" + "{:07d}".format( count )
    prim_key.append( key )
    count += 1

prim_key = pd.Series( prim_key )

disease_df = disease_df.assign( ID_DP = prim_key )
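The key format used above is a two-letter prefix followed by a counter that starts at 1 and is zero-padded to seven digits. A compact way to generate such keys is sketched below on a made-up three-row table.

```python
import pandas as pd

# Made-up table with three phosphosites
toy_df = pd.DataFrame(
    {"PHOS_ID": ["P04637(S15)", "P31749(T308)", "P31749(S473)"]}
)

# One key per row: prefix plus a seven-digit counter starting at 1
toy_df["ID_PH"] = [f"PH{count:07d}" for count in range(1, len(toy_df) + 1)]

print(toy_df.ID_PH.tolist())
```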

Make csv files

In [None]:
phosphosite_df.to_csv( "phosphosites.csv", index = False )

In [None]:
kin_sub_df.to_csv( "kinases_phosphosites.csv", index = False )

In [None]:
disease_df.to_csv( "phosphosites_diseases.csv", index = False )
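Before loading the files into the database, it may be worth confirming that every ID used as a foreign key has a matching row in the phosphosites table. The sketch below shows one way to do this; `find_orphans` is a hypothetical helper, not part of this notebook, and the ID lists are made up.

```python
import pandas as pd


def find_orphans(child_ids, parent_ids):
    """Return the child IDs that have no match among the parent IDs."""
    return child_ids[~child_ids.isin(parent_ids)].tolist()


# Made-up stand-ins for phosphosite_df.PHOS_ID5 and two child columns
phosphosite_ids = pd.Series(["P04637(S15)", "P31749(T308)"])
linked_ids = pd.Series(["P04637(S15)"])
unlinked_ids = pd.Series(["P31749(T308)", "P31749(S473)"])

# An empty list means every foreign key can be resolved
orphans_linked = find_orphans(linked_ids, phosphosite_ids)
orphans_unlinked = find_orphans(unlinked_ids, phosphosite_ids)

print(orphans_linked)
print(orphans_unlinked)
```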