# Phosphosite Data Processing, 2 of 2
## Combining Phosphosite.org and Phospho.ELM Data

Three phosphosite data tables have been downloaded as tab-delimited files from https://www.phosphosite.org/staticDownloads:

- Phosphorylation_site_dataset
    - The full list of phosphosites available on the website (371,203 rows long on 2020_01_11)
  
- Kinase_Substrate_Dataset
    - A list of known kinase-phosphosite relationships (18,455 rows long on 2020_01_11)
    
- Disease-associated_sites
    - A list of phosphosites with known links to disease (1,363 rows long on 2020_01_11)

We also have a partially-processed data file from http://phospho.elm.eu.org/dataset.html:

- phosphosites_2.csv
    - The full list of human phosphosites available on the website (37,145 rows long on 2020_01_17)
    
Before running this Jupyter notebook, "Phosphosite-data-processing-1-of-2-Phospho.ELM-downloaded-file.ipynb" needs to have been run to generate phosphosites_2.csv.

In this Jupyter notebook, the three phosphosite.org files will be imported as pandas dataframes, and will have the following changes made to them:

   - Remove any rows with information about non-human proteins (targets and kinases).
   - Remove any rows where "gene" is blank.
   - Remove any rows in the diseases table where "disease" is blank.
   - For tables with a "modified residue" column that includes a flag at the end to denote the type of post-translational modification (PTM), remove any that are not "-p" for "phosphorylation".
   - Generate a "source" column to direct the user to the phosphosite.org page for that substrate protein.
   - Add "sequence" and "PMID" columns to capture interesting information from phospho.ELM.
   - Generate several different "phosphosite ID" and "kinase-phosphosite ID" columns, using gene/protein/accession ID/modified residue information in other columns, for several purposes:
       - to match the format in user-submitted quantitative phosphoproteomics files
       - to act as foreign keys to allow the tables to be linked in the SQL database
       - to act as unique identifiers for each phosphosite (or kinase-phosphosite pair) so that, while adding data from one table to another table, duplicates can be skipped
   - Add rows to kin_sub_df for any missing kinase-phosphosite relationships that appear in the phospho.ELM table.
   - Add rows to the phosphosites table for any missing phosphosites that appear in the other tables, to allow them to be linked, and to include phosphosites only found in phospho.ELM.
   - Ensure all genes listed are in uppercase.
   - Remove the unnecessary "-p" from the "modified residue" column in the phosphosite table, as our web app will only specialise in phosphosites.
   - Drop any other unnecessary columns.
   - Add a binary column to the phosphosites table to indicate whether the row is for an isoform (1) or not (0).
   - Make a second KIN_ACC_ID column in kin_sub_df, KIN_ACC_ID_2, with any isoform characters (e.g. "-2" suffix) removed, to allow linkage to the kinase table and enrichment analysis.
   - Remove rows from kin_sub_df with KIN_ACC_ID_2 values that can't be found in the kinase table.
   - In the phosphosites table, remove duplicated rows.
   - Add a primary key column to each table.

Finally, they will be exported as .csv files.

Import required packages

In [1]:
import pandas as pd # v 0.25.1
import re # v 2.2.1

Read in data files

In [2]:
phosphosite_df = pd.read_table( "Phosphorylation_site_dataset", error_bad_lines = False )

In [3]:
kin_sub_df = pd.read_table( "Kinase_Substrate_Dataset", error_bad_lines = False )

In [4]:
disease_df = pd.read_table( "Disease-associated_sites", error_bad_lines = False )

# An extra column, "Unnamed: 19", is created on import: drop it

disease_df = disease_df.drop( [ 'Unnamed: 19' ], axis = 1 )

In [5]:
# Read in partially-processed phospho.ELM data file

phosphosites_2_df = pd.read_csv( "phosphosites_2.csv" )

Remove any rows where ORGANISM is not "human"

In [6]:
phosphosite_df = phosphosite_df.drop( phosphosite_df[ phosphosite_df.ORGANISM != "human" ].index )

In [7]:
# kin_sub_df contains a substrate organism and a kinase organism

kin_sub_df = kin_sub_df.drop( kin_sub_df[ kin_sub_df.SUB_ORGANISM != "human" ].index )
kin_sub_df = kin_sub_df.drop( kin_sub_df[ kin_sub_df.KIN_ORGANISM != "human" ].index )

In [8]:
disease_df = disease_df.drop( disease_df[ disease_df.ORGANISM != "human" ].index )

Remove any rows where GENE is blank

In [9]:
phosphosite_df = phosphosite_df.dropna( subset = [ "GENE" ]  )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [10]:
# kin_sub_df has a kinase gene and a substrate gene

kin_sub_df = kin_sub_df.dropna( subset = [ "GENE" ] )
kin_sub_df = kin_sub_df.dropna( subset = [ "SUB_GENE" ] )
kin_sub_df = kin_sub_df.reset_index( drop = True )

In [11]:
disease_df = disease_df.dropna( subset = [ "GENE" ] )
disease_df = disease_df.reset_index( drop = True )

Remove any rows where DISEASE is blank

In [12]:
disease_df = disease_df.dropna( subset = [ "DISEASE" ] )
disease_df = disease_df.reset_index( drop = True )

Remove any rows where PTM is not "phosphorylation" (indicated as "-p" appended to MOD_RSD). N.B. kin_sub_df is not included here because its MOD_RSD column does not contain modification type, just AA and residue no.

In [None]:
# This takes around five minutes to run, and should not be necessary as
# the file just contains phosphosites, but is included in case of error

regex = r'\w{1}\d+-p'

indices = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    if re.findall( regex, str( i[ 1 ]) ):
        continue
    else:
        indices.append( n )

phosphosite_df = phosphosite_df.drop( indices )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [None]:
regex = r'\w{1}\d+-p'

indices = []

for n, i in enumerate( disease_df.iterrows() ):
    if re.findall( regex, str( i[ 1 ]) ):
        continue
    else:
        indices.append( n )

disease_df = disease_df.drop( indices )
disease_df = disease_df.reset_index( drop = True )
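
The two loops above search the string form of each whole row, so a match in any column keeps the row. As a minimal sketch of a vectorised alternative that inspects only MOD_RSD, the example below filters a small made-up table. The table `toy_df` and its values are illustrative assumptions, and the pattern assumes MOD_RSD holds strings such as "S15-p".

```python
import pandas as pd

# Toy stand-in for phosphosite_df; the values are illustrative only
toy_df = pd.DataFrame( { "GENE" : [ "TP53", "TP53", "AKT1" ],
                         "MOD_RSD" : [ "S15-p", "K120-ac", "T308-p" ] } )

# One residue letter, one or more digits, then the "-p" flag at the end
is_phospho = toy_df.MOD_RSD.str.match( r"^[A-Z]\d+-p$" )

# Keep only the phosphorylation rows and renumber the index
toy_df = toy_df[ is_phospho ].reset_index( drop = True )
```

Because the test is applied to a whole column at once, this form should avoid most of the run time of the row-by-row loop.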

Add a "Source" column to direct the user to the phosphosite.org page for that substrate protein

In [None]:
source = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    source.append( "http://www.phosphosite.org/uniprotAccAction?id=" + str( phosphosite_df.ACC_ID[ n ] ))

source = pd.Series( source )
phosphosite_df = phosphosite_df.assign( SOURCE = source )                  

In [None]:
source = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    source.append( "http://www.phosphosite.org/uniprotAccAction?id=" + str( kin_sub_df.SUB_ACC_ID[ n ] ))

source = pd.Series( source )
kin_sub_df = kin_sub_df.assign( SOURCE = source )  
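
The same SOURCE column can probably be built without a loop, because pandas concatenates a string with every element of a column. The sketch below uses a made-up table `toy_df`; the constant name `BASE_URL` is an assumption introduced here for readability.

```python
import pandas as pd

# URL prefix used in the cells above
BASE_URL = "http://www.phosphosite.org/uniprotAccAction?id="

# Toy stand-in for phosphosite_df; the accession IDs are illustrative only
toy_df = pd.DataFrame( { "ACC_ID" : [ "P04637", "P31749" ] } )

# Prefix every accession ID with the URL in a single operation
toy_df = toy_df.assign( SOURCE = BASE_URL + toy_df.ACC_ID.astype( str ) )
```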

Add "sequence" and "PMID" columns to capture interesting information from phospho.ELM

In [None]:
column = [ None ] * len( phosphosite_df.GENE )

sequence = pd.Series( column )
PMID = pd.Series( column )

phosphosite_df = phosphosite_df.assign( SEQUENCE = sequence )
phosphosite_df = phosphosite_df.assign( PMID = PMID )

In [None]:
column = [ None ] * len( kin_sub_df.GENE )

sequence = pd.Series( column )
PMID = pd.Series( column )

kin_sub_df = kin_sub_df.assign(SEQUENCE = sequence)
kin_sub_df = kin_sub_df.assign(PMID = PMID)

Soon we will add rows to phosphosite_df and kin_sub_df from the phospho.ELM dataset. We will look to see whether the rows need to be added (i.e. are not duplicates) using phosphosite IDs (PHOS_ID5) for phosphosite_df and kinase-phosphosite IDs (KIN_PHOS_ID) for kin_sub_df. Here we add these IDs. Eventually kin_sub_df will also need PHOS_ID5 as a foreign key in the database so we will add it here also.

In [None]:
phos_id5 = [] # The substrate UniProt ID will be incorporated here
              # This will be used as a foreign key

for n, i in enumerate( kin_sub_df.iterrows() ):
    phos_id5.append( kin_sub_df.SUB_ACC_ID[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )

phos_id5 = pd.Series( phos_id5 )

kin_sub_df = kin_sub_df.assign( PHOS_ID5 = phos_id5 )

In [None]:
phos_id5 = [] # The substrate UniProt ID will be incorporated here 
              # This will be used as a foreign key

for n, i in enumerate( phosphosite_df.iterrows() ):
    phos_id5.append( phosphosite_df.ACC_ID[ n ].upper() + "(" + phosphosite_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )

phos_id5 = pd.Series( phos_id5 )

phosphosite_df = phosphosite_df.assign( PHOS_ID5 = phos_id5 )

In [None]:
kin_phos_id = [] # The kinase and substrate UniProt IDs will be
                 # incorporated here

for n, i in enumerate( kin_sub_df.iterrows() ):
    k_p_id = (kin_sub_df.KIN_ACC_ID[ n ] + "_" + kin_sub_df.SUB_ACC_ID[ n ] + "(" + kin_sub_df.SUB_MOD_RSD[ n ] + ")" )
    kin_phos_id.append( k_p_id.upper() )

kin_phos_id = pd.Series( kin_phos_id )

kin_sub_df = kin_sub_df.assign( KIN_PHOS_ID = kin_phos_id )

In [None]:
kin_phos_id = [] # The kinase and substrate UniProt IDs will be
                 # incorporated here

for n, i in enumerate( phosphosites_2_df.iterrows() ):
    k_p_id = str( phosphosites_2_df.ACC_ID[ n ] ) + "_" + phosphosites_2_df.PHOS_ID[ n ]
    kin_phos_id.append( str( k_p_id ).upper() ) 

kin_phos_id = pd.Series( kin_phos_id )

phosphosites_2_df = phosphosites_2_df.assign( KIN_PHOS_ID = kin_phos_id )
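
To make the two ID formats explicit, the sketch below wraps them in small helper functions. The function names `make_phos_id5` and `make_kin_phos_id` are hypothetical and do not appear elsewhere in this notebook, and the accession IDs in the usage lines are illustrative.

```python
def make_phos_id5( acc_id, mod_rsd ):
    """Return e.g. 'P04637(S15)' from an accession ID and a residue label."""
    # Strip a trailing "-p" flag if present, as found in phosphosite_df.MOD_RSD
    if mod_rsd.endswith( "-p" ):
        mod_rsd = mod_rsd[ :-2 ]
    return acc_id.upper() + "(" + mod_rsd.upper() + ")"


def make_kin_phos_id( kin_acc_id, sub_acc_id, mod_rsd ):
    """Return e.g. 'Q13315_P04637(S15)' for a kinase-phosphosite pair."""
    return kin_acc_id.upper() + "_" + make_phos_id5( sub_acc_id, mod_rsd )


example_phos_id5 = make_phos_id5( "P04637", "S15-p" )
example_kin_phos_id = make_kin_phos_id( "Q13315", "P04637", "S15" )
```

Building both IDs from one helper keeps the formats in step, which matters because the duplicate checks that follow rely on exact string equality.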

Iterate through the kin_sub_df and check whether each phosphosite is already captured in phosphosite_df. If not, add rows.

In [None]:
# Use PHOS_ID5 to cross-check whether the phosphosites in kin_sub_df
# are already in phosphosite_df. Add them if not

# This takes around ten minutes to run

for n, i in enumerate( kin_sub_df[ 'PHOS_ID5' ] ):
    if i not in list( phosphosite_df[ 'PHOS_ID5' ] ):
        row = [{ 'GENE' : kin_sub_df.SUB_GENE[ n ],
                'PROTEIN' : kin_sub_df.SUBSTRATE[ n ],
                'ACC_ID' : kin_sub_df.SUB_ACC_ID[ n ],
                'HU_CHR_LOC' : '',
                'MOD_RSD' : kin_sub_df.SUB_MOD_RSD[ n ] + "-p",
                'SITE_GRP_ID' : kin_sub_df.SITE_GRP_ID[ n ],
                'ORGANISM' : kin_sub_df.SUB_ORGANISM[ n ],
                'MW_kD' : 0.00,
                'DOMAIN' : kin_sub_df.DOMAIN[ n ],
                'SITE_+/-7_AA' : kin_sub_df[ 'SITE_+/-7_AA' ][ n ],
                'LT_LIT' : 0.0, 'MS_LIT' : 0.0, 'MS_CST' : 0.0,
                'CST_CAT#' : 0.0,
                'PHOS_ID5' : i,
                'SEQUENCE' : '', 'PMID' : '',
                'SOURCE': 'http://www.phosphosite.org/uniprotAccAction?id=' + str( kin_sub_df.SUB_ACC_ID[ n ] )
               }]
        phosphosite_df = phosphosite_df.append( row, ignore_index = True )
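
The loop above rebuilds a list of every PHOS_ID5 on each iteration and appends one row at a time, which is likely why it takes minutes. Note also that `DataFrame.append` was removed in pandas 2.0, so the cell as written needs an older pandas such as the v0.25.1 used here. A minimal sketch of an alternative, using `isin` and a single `pd.concat`, is shown below on made-up tables. The names `toy_phos` and `toy_kin_sub` and their values are assumptions, and only two columns are carried over for brevity.

```python
import pandas as pd

# Toy stand-ins; column names follow the notebook, values are made up
toy_phos = pd.DataFrame( { "PHOS_ID5" : [ "P04637(S15)" ],
                           "GENE" : [ "TP53" ] } )
toy_kin_sub = pd.DataFrame( { "PHOS_ID5" : [ "P04637(S15)", "P31749(T308)" ],
                              "SUB_GENE" : [ "TP53", "AKT1" ] } )

# Rows of toy_kin_sub whose PHOS_ID5 is not yet in toy_phos
missing = toy_kin_sub[ ~toy_kin_sub.PHOS_ID5.isin( toy_phos.PHOS_ID5 ) ]

# A site can have several kinases, so keep one row per new phosphosite
missing = missing.drop_duplicates( subset = "PHOS_ID5" )

# Rename to the phosphosite table's column names, then add all rows at once
new_rows = missing.rename( columns = { "SUB_GENE" : "GENE" } )
toy_phos = pd.concat( [ toy_phos, new_rows ], ignore_index = True )
```

The same pattern could be applied to the later cells that add rows from disease_df and from the phospho.ELM table, with the column mapping adjusted for each source.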

Iterate through disease_df and check whether each phosphosite is already captured in phosphosite_df. If not, add rows.
Before doing so, create a PHOS_ID column in disease_df, built in the same format as PHOS_ID5.

In [None]:
# Before adding rows from disease_df to phosphosite_df, we need to add a phos_id column for cross-checking

phos_id = [] # The substrate UniProt ID will be incorporated here
             # This will be used as a foreign key

for n, i in enumerate( disease_df.iterrows() ):
    phos_id.append( disease_df.ACC_ID[ n ].upper() + "(" + disease_df.MOD_RSD[ n ][ 0:-2 ].upper() + ")" )
    
phos_id = pd.Series( phos_id )

disease_df = disease_df.assign( PHOS_ID = phos_id )

In [None]:
# Cross-check whether the phosphosites in disease_df
# are already in phosphosite_df. Add them if not.

# This takes around two minutes to run

for n,i in enumerate( disease_df[ 'PHOS_ID' ] ):
    if i not in list( phosphosite_df[ 'PHOS_ID5' ] ):
        row = [{ 'GENE' : disease_df.GENE[ n ],
                'PROTEIN' : disease_df.PROTEIN[ n ],
                'ACC_ID' : disease_df.ACC_ID[ n ],
                'HU_CHR_LOC' : disease_df.HU_CHR_LOC[ n ],
                'MOD_RSD' : disease_df.MOD_RSD[ n ],
                'SITE_GRP_ID' : disease_df.SITE_GRP_ID[ n ],
                'ORGANISM' : disease_df.ORGANISM[ n ],
                'MW_kD' : disease_df.MW_kD[ n ],
                'DOMAIN' : disease_df.DOMAIN[ n ],
                'SITE_+/-7_AA' : disease_df[ 'SITE_+/-7_AA' ][ n ],
                'LT_LIT' : disease_df.LT_LIT[ n ],
                'MS_LIT' : disease_df.MS_LIT[ n ],
                'MS_CST' : disease_df.MS_CST[ n ],
                'CST_CAT#' : disease_df[ 'CST_CAT#' ][ n ],
                'PHOS_ID5' : i,
                'SEQUENCE' : '',
                'PMID' : disease_df.PMIDs[ n ],
                'SOURCE' : 'http://www.phosphosite.org/uniprotAccAction?id=' + str( disease_df.ACC_ID[ n ])
               }]
        phosphosite_df = phosphosite_df.append( row, ignore_index = True )

Iterate through the phospho.ELM data set and check whether each kinase-phosphosite relationship is already captured in kin_sub_df. If not, add rows.

In [None]:
# Do not add any rows with empty kinase accession numbers as this
# will be a foreign key in the database

for n, i in enumerate( phosphosites_2_df.KIN_PHOS_ID ):

    if str( phosphosites_2_df.ACC_ID[ n ] ).upper()  != "NAN" and i not in list( kin_sub_df[ 'KIN_PHOS_ID' ] ):
        row = [{ 'GENE' : phosphosites_2_df.kinases[ n ],
                'KINASE' : '',
                'KIN_ACC_ID' : phosphosites_2_df.ACC_ID[ n ],
                'KIN_ORGANISM' : phosphosites_2_df.species[ n ],
                'SUBSTRATE' : phosphosites_2_df.SUB_PROTEIN[ n ],
                'SUB_GENE_ID' : '',
                'SUB_ACC_ID' : phosphosites_2_df.acc[ n ], 
                'SUB_GENE' : phosphosites_2_df.SUB_GENE[ n ], 
                'SUB_ORGANISM' : '',
                'SUB_MOD_RSD' : phosphosites_2_df.code[ n ] + str(phosphosites_2_df.position[ n ]), 
                'SITE_GRP_ID' : '', 
                'SITE_+/-7_AA' : '', 'DOMAIN' : '',
                'IN_VIVO_RXN' : '', 'IN_VITRO_RXN' : '',
                'CST_CAT#' : 0.0, 
                'SOURCE' : 'phospho.ELM',
                'SEQUENCE' : phosphosites_2_df.sequence[ n ], 
                'PMID' : phosphosites_2_df.pmids[ n ], 
                'PHOS_ID5' : phosphosites_2_df.PHOS_ID[ n ],
                'KIN_PHOS_ID' : i
               }]
        
        kin_sub_df = kin_sub_df.append( row, ignore_index = True )

Iterate through the phospho.ELM data set and check whether each phosphosite is already captured in phosphosite_df. If not, add rows.

In [None]:
# Cross-check whether the phosphosites in phospho.ELM
# are already in phosphosite_df. Add them if not.

# This takes over half an hour to run

for n, i in enumerate( phosphosites_2_df.PHOS_ID ):
    if i not in list( phosphosite_df[ 'PHOS_ID5' ] ):
        row = [{ 'GENE' : phosphosites_2_df.SUB_GENE[ n ],
                'PROTEIN' : phosphosites_2_df.SUB_PROTEIN[ n ],
                'ACC_ID' : phosphosites_2_df.acc[ n ],
                'HU_CHR_LOC' : '',
                'MOD_RSD' : phosphosites_2_df.code[ n ] + str( phosphosites_2_df.position[ n ]) + '-p',
                'SITE_GRP_ID' : '',
                'ORGANISM':phosphosites_2_df.species[n],
                'MW_kD' : '',
                'DOMAIN' : '',
                'SITE_+/-7_AA' : '',
                'LT_LIT': '',
                'MS_LIT' : '',
                'MS_CST' : '',
                'CST_CAT#' : '',
                'PHOS_ID5' : i,
                'SEQUENCE' : phosphosites_2_df.sequence[ n ],
                'PMID' : phosphosites_2_df.pmids[ n ],
                'SOURCE' : 'phospho.ELM' }]
        phosphosite_df = phosphosite_df.append( row, ignore_index = True )

Make some more phosphosite ID columns to match the format in the user-submitted quantitative phosphoproteomics files. Different gene / protein aliases will be combined with the residue information in each column.

In [None]:
phos_id = []
phos_id2 = []
phos_id3 = []
phos_id4 = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    phos_id.append( str( kin_sub_df.SUB_GENE[ n ] ).upper() + "_" + kin_sub_df.SUB_ORGANISM[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    phos_id2.append( str( kin_sub_df.SUBSTRATE[ n ] ).upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")")
    phos_id3.append( str( kin_sub_df.SUBSTRATE[ n ] ).upper() + "_" + kin_sub_df.SUB_ORGANISM[ n ].upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    phos_id4.append( str( kin_sub_df.SUB_GENE[ n ] ).upper() + "(" + kin_sub_df.SUB_MOD_RSD[ n ].upper() + ")" )
    
phos_id = pd.Series( phos_id )
phos_id2 = pd.Series( phos_id2 )
phos_id3 = pd.Series( phos_id3 )
phos_id4 = pd.Series( phos_id4 )

kin_sub_df = kin_sub_df.assign( PHOS_ID = phos_id )
kin_sub_df = kin_sub_df.assign( PHOS_ID2 = phos_id2 )
kin_sub_df = kin_sub_df.assign( PHOS_ID3 = phos_id3 )
kin_sub_df = kin_sub_df.assign( PHOS_ID4 = phos_id4 )

In [None]:
phos_id = []
phos_id2 = []
phos_id3 = []
phos_id4 = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    phos_id.append( str( phosphosite_df.GENE[ n ] ).upper() + "_HUMAN(" + str( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] ).upper() + ")" )
    phos_id2.append( str( phosphosite_df.PROTEIN[ n ] ).upper() + "(" + str( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] ).upper() + ")" )
    phos_id3.append( str( phosphosite_df.PROTEIN[ n ] ).upper() + "_HUMAN(" + str( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] ).upper() + ")" )
    phos_id4.append( str( phosphosite_df.GENE[ n ] ).upper() + "(" + str( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] ).upper() + ")" )

phos_id = pd.Series( phos_id )
phos_id2 = pd.Series( phos_id2 )
phos_id3 = pd.Series( phos_id3 )
phos_id4 = pd.Series( phos_id4 )

phosphosite_df = phosphosite_df.assign( PHOS_ID = phos_id )
phosphosite_df = phosphosite_df.assign( PHOS_ID2 = phos_id2 )
phosphosite_df = phosphosite_df.assign( PHOS_ID3 = phos_id3 )
phosphosite_df = phosphosite_df.assign( PHOS_ID4 = phos_id4 )

Make GENE entries uppercase

In [None]:
# This takes a minute to run

# GENE in phosphosite_df is the substrate gene, not a kinase

uppercase_gene = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    uppercase_gene.append( str(phosphosite_df.GENE[ n ]).upper() )

uppercase_gene = pd.Series( uppercase_gene )

phosphosite_df = phosphosite_df.assign( GENE = uppercase_gene )

In [None]:
uppercase_kinase = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    uppercase_kinase.append( kin_sub_df.GENE[ n ].upper() )

uppercase_kinase = pd.Series( uppercase_kinase )

kin_sub_df = kin_sub_df.assign( GENE = uppercase_kinase )

Remove -p from MOD_RSD in phosphosite_df

In [None]:
# This takes a minute to run

mod_rsd = []

for n, i in enumerate( phosphosite_df.iterrows() ):
    if phosphosite_df.MOD_RSD[ n ][ -2: ] == "-p":
        mod_rsd.append( phosphosite_df.MOD_RSD[ n ][ 0:-2 ] )
    else:
        mod_rsd.append( phosphosite_df.MOD_RSD[ n ] )

mod_rsd = pd.Series( mod_rsd )

phosphosite_df = phosphosite_df.assign( MOD_RSD = mod_rsd )
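
Both the uppercase conversion and the removal of "-p" can be expressed with pandas string methods instead of loops. The sketch below shows this on a made-up table `toy_df` whose values are illustrative only.

```python
import pandas as pd

# Toy stand-in for phosphosite_df; values are illustrative only
toy_df = pd.DataFrame( { "GENE" : [ "Tp53", "akt1" ],
                         "MOD_RSD" : [ "S15-p", "T308" ] } )

toy_df = toy_df.assign(
    # Convert every gene symbol to uppercase
    GENE = toy_df.GENE.astype( str ).str.upper(),
    # Remove "-p" only where it is the final two characters
    MOD_RSD = toy_df.MOD_RSD.str.replace( r"-p$", "", regex = True ) )
```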

Drop any unnecessary columns

In [None]:
# ORGANISM is not required as the web app will be human-specific
# and we have already ensured that only human targets are included

phosphosite_df = phosphosite_df.drop( [ 'ORGANISM' ], axis = 1 )
phosphosite_df = phosphosite_df.reset_index( drop = True )

In [None]:
# KIN_ORGANISM and SUB_ORGANISM are not required as the web app will be
# human-specific and we have already ensured that only human proteins 
# are included
# KINASE, SUBSTRATE, SUB_GENE_ID, SUB_MOD_RSD and SUB_GENE 
# are not required as we have KIN_ACC_ID, SUB_ACC_ID and
# various PHOS_IDs
# SITE_GRP_ID, SITE_+/-7_AA and DOMAIN will be available in 
# phosphosite_df
# KIN_PHOS_ID was only required for duplicate checking while combining
# phosphosite.org and phospho.ELM data

kin_sub_df = kin_sub_df.drop( [ 'KIN_ORGANISM', 'SUB_ORGANISM',
                               'KINASE', 'SUBSTRATE', 
                               'SUB_GENE_ID', 'SUB_MOD_RSD',
                               'SUB_GENE', 'SITE_GRP_ID',
                               'SITE_+/-7_AA', 'DOMAIN', 
                               'KIN_PHOS_ID'],
                             axis = 1 )

kin_sub_df = kin_sub_df.reset_index( drop = True)

In [None]:
# PROTEIN and GENE_ID are unnecessary as we have ACC_ID
# ORGANISM is not required as the web app will be human-specific
# GENE, HU_CHR_LOC, MW_kD, SITE_GRP_ID, MOD_RSD, DOMAIN, and SITE_+/-7_AA
# will be available in phosphosite_df

disease_df = disease_df.drop( [ 'PROTEIN', 'GENE_ID',
                               'ORGANISM', 'GENE', 'HU_CHR_LOC',
                               'MW_kD', 'SITE_GRP_ID', 'MOD_RSD',
                               'DOMAIN', 'SITE_+/-7_AA' ],
                             axis = 1 )

disease_df = disease_df.reset_index( drop = True )

Add a column ISOFORM to the phosphosite_df table to indicate whether the row is for an isoform (1) or not (0)

In [None]:
yes_no_isoform = []

for n, i in enumerate( phosphosite_df.ACC_ID ):
    if "-" in i: # In a UniProt ID, "-" signifies an isoform
        
        yes_no_isoform.append( 1 )
        
    elif "iso" in str( phosphosite_df.PROTEIN[ n ] ): # For some records,
        # the presence of "iso" in this field is the only way 
        # the isoform is indicated
        
        yes_no_isoform.append( 1 )
        
    else:
        yes_no_isoform.append( 0 )

yes_no_isoform = pd.Series( yes_no_isoform )

phosphosite_df = phosphosite_df.assign( ISOFORM = yes_no_isoform )
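
The isoform rule used above can be written as a small function, which makes it easy to check on individual records. The name `is_isoform` is hypothetical, and the example values are illustrative. Note that the substring test would also flag any protein name that happens to contain "iso" for another reason.

```python
def is_isoform( acc_id, protein ):
    """Return 1 if either field marks the record as an isoform, else 0."""
    # In a UniProt ID, "-" signifies an isoform, e.g. "P31749-2"
    if "-" in str( acc_id ):
        return 1
    # Some records only indicate the isoform in the protein name
    if "iso" in str( protein ):
        return 1
    return 0


flags = [ is_isoform( "P31749-2", "Akt1" ),
          is_isoform( "P31749", "Akt1 iso2" ),
          is_isoform( "P31749", "Akt1" ) ]
```

Converting both arguments with `str()` means a missing value does not raise an error in the membership test.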

Make a second KIN_ACC_ID column in kin_sub_df, KIN_ACC_ID_2, with any isoform characters (e.g. "-2" suffix) removed, to allow linkage to the kinase table and enrichment analysis

In [None]:
kin_acc_2 = []

for n, i in enumerate( kin_sub_df.iterrows() ):
    if "-" in kin_sub_df.KIN_ACC_ID[ n ]:
        idx = kin_sub_df.KIN_ACC_ID[ n ].index("-")
        kin_acc_2.append( kin_sub_df.KIN_ACC_ID[ n ][ :( idx ) ] )
    else:
        kin_acc_2.append( kin_sub_df.KIN_ACC_ID[ n ] )

kin_acc_2 = pd.Series( kin_acc_2 )

kin_sub_df = kin_sub_df.assign( KIN_ACC_ID_2 = kin_acc_2 )
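
A vectorised sketch of the same step is given below: splitting on "-" and taking the first piece leaves IDs without a suffix unchanged. The table `toy_df` and its accession IDs are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for kin_sub_df; the accession IDs are illustrative only
toy_df = pd.DataFrame( { "KIN_ACC_ID" : [ "P31749", "P31749-2" ] } )

# Keep everything before the first "-"; IDs without "-" are unchanged
toy_df = toy_df.assign(
    KIN_ACC_ID_2 = toy_df.KIN_ACC_ID.str.split( "-" ).str[ 0 ] )
```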

Remove rows from kin_sub_df with KIN_ACC_ID_2 values that can't be found in the kinase table

In [None]:
# Read in kinase table

human_kinases = pd.read_csv( "human_kinase_dataframe.csv" )

# Make a set of unique UniProt_IDs from the kinase table
# (a set removes repeats and gives fast membership checks)

parent_kin = set( human_kinases.UniProt_ID )

# If the kinase is not in the kinase table, remove the row

indices = []

for n, i in enumerate( kin_sub_df.KIN_ACC_ID_2 ):
    if i not in parent_kin:
        indices.append( n )

kin_sub_df = kin_sub_df.drop( indices )
kin_sub_df = kin_sub_df.reset_index( drop = True)
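
The filter above can also be sketched with `isin`, which tests a whole column against the known kinase IDs in one call. The tables `toy_kinases` and `toy_kin_sub` are made up for illustration, and "X00000" is a placeholder for an ID that is absent from the kinase table.

```python
import pandas as pd

# Toy stand-ins for the kinase table and kin_sub_df
toy_kinases = pd.DataFrame( { "UniProt_ID" : [ "P31749", "Q13315" ] } )
toy_kin_sub = pd.DataFrame(
    { "KIN_ACC_ID_2" : [ "P31749", "X00000", "Q13315" ] } )

# True where the kinase accession is present in the kinase table
known = toy_kin_sub.KIN_ACC_ID_2.isin( set( toy_kinases.UniProt_ID ) )

# Keep only rows that can be linked to the kinase table
toy_kin_sub = toy_kin_sub[ known ].reset_index( drop = True )
```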

Remove any duplicated rows from phosphosite table

In [None]:
phosphosite_df = phosphosite_df.drop_duplicates( subset = 'PHOS_ID5', keep = 'first' )
phosphosite_df = phosphosite_df.reset_index( drop = True )

Add primary key columns

In [None]:
prim_key = []

count = 1

for i in phosphosite_df.PHOS_ID:
    key = "PH" + "{:07d}".format( count )
    prim_key.append( key )
    count += 1

prim_key = pd.Series( prim_key )

phosphosite_df = phosphosite_df.assign( ID_PH = prim_key )

In [None]:
prim_key = []

count = 1

for i in kin_sub_df.PHOS_ID:
    key = "KS" + "{:07d}".format( count )
    prim_key.append(key)
    count += 1

prim_key = pd.Series( prim_key )

kin_sub_df = kin_sub_df.assign( ID_KS = prim_key )

In [None]:
prim_key = []

count = 1

for i in disease_df.PHOS_ID:
    key = "DP" + "{:07d}".format( count )
    prim_key.append( key )
    count += 1

prim_key = pd.Series( prim_key )

disease_df = disease_df.assign( ID_DP = prim_key )
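
The three cells above differ only in the key prefix, so the key format can be captured once. The helper `make_keys` below is a hypothetical name used for illustration; it returns keys made of a prefix followed by a seven-digit, zero-padded counter starting at 1.

```python
def make_keys( prefix, n_rows ):
    """Return keys such as 'PH0000001' for rows 1 to n_rows."""
    return [ "{}{:07d}".format( prefix, k ) for k in range( 1, n_rows + 1 ) ]


example_keys = make_keys( "PH", 3 )
```

Under this sketch, a call such as `phosphosite_df.assign( ID_PH = make_keys( "PH", len( phosphosite_df ) ) )` would add the key column, assuming the index has already been reset.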

Make csv files

In [46]:
phosphosite_df.to_csv( "phosphosites.csv", index = False )

In [47]:
kin_sub_df.to_csv( "kinases_phosphosites.csv", index = False )

In [48]:
disease_df.to_csv( "phosphosites_diseases.csv", index = False )