# Editing Phosphosite Data Tables in Python

Three phosphosite data tables have been downloaded as tab-delimited files from https://www.phosphosite.org/staticDownloads:

- Phosphorylation_site_dataset
    - The full list of phosphosites available on the website (371,203 rows long on 2020_01_11)
  
- Kinase_Substrate_Dataset
    - A list of known kinase-phosphosite relationships (18,455 rows long on 2020_01_11)
    
- Disease-associated_sites
    - A list of phosphosites with known links to disease (1,363 rows long on 2020_01_11)
   
In this Jupyter notebook, they are imported as pandas DataFrames and modified as follows:
   - Remove any rows with information about non-human proteins (targets and kinases).
   - Remove any rows where "gene" is blank (this field is essential for creating phosphosite and kinase IDs that will enable different tables to be linked).
   - Remove any rows in the diseases table where "disease" is blank.
   - For tables with a "modified residue" column that ends in a flag denoting the type of post-translational modification (PTM), remove any rows where the flag is not "-p" for "phosphorylation".
   - Generate "phosphosite ID" columns, using information in other columns, to match the format in user-submitted quantitative phosphoproteomics files, and to allow the tables to be linked in the SQL database.
   - For the kinase-substrate table, generate a new column for the kinase name that will match the kinase table in the SQL database.
   - Ensure all genes listed are in uppercase.
   - Remove the unnecessary "-p" from the "modified residue" column in the phosphosite table, as our web app will only specialise in phosphosites.
   - Drop any other unnecessary columns.
   - Add a primary key column.
   - Add a binary column to the phosphosites table to indicate whether the row is for an isoform (1) or not (0).

Finally, they will be exported as .csv files.

Import required packages

In [39]:
import pandas as pd
import re

Read in data files

In [40]:
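# Skip malformed lines rather than raising an error
# (on_bad_lines requires pandas >= 1.3; it replaces error_bad_lines)
phosphosite_df = pd.read_table("Phosphorylation_site_dataset", on_bad_lines = "skip")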

In [41]:
kin_sub_df = pd.read_table("Kinase_Substrate_Dataset", on_bad_lines = "skip")

In [42]:
disease_df = pd.read_table("Disease-associated_sites", on_bad_lines = "skip")

# A strange column "Unnamed: 19" is created upon import: drop this
disease_df = disease_df.drop(['Unnamed: 19'], axis=1)
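
The stray "Unnamed: 19" column is consistent with each line of the file ending in a trailing tab, which pandas reads as an extra, empty column. That cause is an assumption about the download, not something checked here. A minimal sketch on toy in-memory data (the two-column layout and values are hypothetical) that drops any such column without hard-coding its number:

```python
import io
import pandas as pd

# Toy tab-delimited text in which every line ends with a trailing tab
# (hypothetical data, not the real PhosphoSitePlus file)
raw = "GENE\tDISEASE\t\nTP53\tcancer\t\nAKT1\tdiabetes\t\n"

toy_df = pd.read_csv(io.StringIO(raw), sep="\t")

# The empty header after the last tab is named "Unnamed: N" by pandas
unnamed_cols = [c for c in toy_df.columns if str(c).startswith("Unnamed")]

toy_df = toy_df.drop(columns=unnamed_cols)
```

Selecting by the "Unnamed" prefix keeps the cell working if a future download has a different number of columns.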

Remove any rows where ORGANISM is not human

In [43]:
phosphosite_df = phosphosite_df.drop(phosphosite_df[phosphosite_df.ORGANISM != "human"].index)

In [44]:
# kin_sub_df contains a substrate / target / phosphosite organism and a kinase organism

kin_sub_df = kin_sub_df.drop(kin_sub_df[kin_sub_df.SUB_ORGANISM != "human"].index)
kin_sub_df = kin_sub_df.drop(kin_sub_df[kin_sub_df.KIN_ORGANISM != "human"].index)

In [45]:
disease_df = disease_df.drop(disease_df[disease_df.ORGANISM != "human"].index)
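
The same filtering can also be written as a boolean mask that keeps the wanted rows, rather than dropping the unwanted ones by index. A small sketch on a hypothetical miniature of the kinase-substrate table (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the kinase-substrate table
toy_df = pd.DataFrame({
    "GENE": ["AKT1", "Akt1", "CDK1"],
    "KIN_ORGANISM": ["human", "mouse", "human"],
    "SUB_ORGANISM": ["human", "human", "rat"],
})

# Keep rows where both the kinase and the substrate are human
both_human = (toy_df.KIN_ORGANISM == "human") & (toy_df.SUB_ORGANISM == "human")

toy_df = toy_df[both_human].reset_index(drop=True)
```

Combining both conditions in one mask applies the kinase and substrate checks in a single step.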

Remove any rows where GENE is blank

In [46]:
phosphosite_df = phosphosite_df.dropna(subset=["GENE"])
phosphosite_df = phosphosite_df.reset_index(drop=True)

In [47]:
# kin_sub_df has a kinase gene and a substrate gene

kin_sub_df = kin_sub_df.dropna(subset=["GENE"])
kin_sub_df = kin_sub_df.dropna(subset=["SUB_GENE"])
kin_sub_df = kin_sub_df.reset_index(drop=True)

In [48]:
disease_df = disease_df.dropna(subset=["GENE"])
disease_df = disease_df.reset_index(drop=True)

Remove any rows where DISEASE is blank

In [49]:
disease_df = disease_df.dropna(subset=["DISEASE"])
disease_df = disease_df.reset_index(drop=True)

Remove any rows where the PTM is not phosphorylation (indicated by "-p" appended to MOD_RSD). N.B. kin_sub_df is not included here because its SUB_MOD_RSD column contains only the amino acid and residue number, not the modification type.

In [50]:
# This should not be necessary as the file just contains phosphosites,
# but is included in case of error.
# Only MOD_RSD is tested, so the pattern cannot match text in another column.

regex = r'\w\d+-p$'

is_phospho = phosphosite_df.MOD_RSD.astype(str).str.contains(regex)

phosphosite_df = phosphosite_df[is_phospho]
phosphosite_df = phosphosite_df.reset_index(drop=True)

In [51]:
# Only MOD_RSD is tested, so the pattern cannot match text in another column

regex = r'\w\d+-p$'

is_phospho = disease_df.MOD_RSD.astype(str).str.contains(regex)

disease_df = disease_df[is_phospho]
disease_df = disease_df.reset_index(drop=True)
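
Before removing rows, it can be useful to see which modification flags are actually present. The sketch below splits hypothetical MOD_RSD values into residue and flag and counts the flags. The "-ac" value stands in for any non-phospho flag, and the names `parts` and `ptm` are introduced only for this illustration:

```python
import pandas as pd

# Hypothetical MOD_RSD values; "-ac" stands in for any non-phospho flag
toy_df = pd.DataFrame({"MOD_RSD": ["S473-p", "T308-p", "K20-ac", "Y15-p"]})

# Split e.g. "S473-p" into the residue ("S473") and the PTM flag ("p")
parts = toy_df["MOD_RSD"].str.extract(r'^(?P<residue>\w\d+)-(?P<ptm>\w+)$')

# How many rows carry each flag
ptm_counts = parts["ptm"].value_counts()

# Rows that a phosphorylation-only filter would keep
is_phospho = parts["ptm"] == "p"
```

If `ptm_counts` contains only "p", the filter removes nothing and serves purely as a safeguard.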

Make a column of phosphosite IDs in a standardised format:
- to enable tables to be linked in the SQL database
- to match the format in the user-submitted quantitative phosphoproteomics files

In [52]:
# This takes a minute to run

phos_id = []

for n,i in enumerate(phosphosite_df.iterrows()):
    phos_id.append(phosphosite_df.GENE[n].upper()+"_"+phosphosite_df.ORGANISM[n].upper()+"("+phosphosite_df.MOD_RSD[n][0:-2].upper()+")")

phos_id = pd.Series(phos_id)

phosphosite_df = phosphosite_df.assign(PHOS_ID = phos_id)

In [53]:
phos_id = []

for n,i in enumerate(kin_sub_df.iterrows()):
    phos_id.append(kin_sub_df.SUB_GENE[n].upper()+"_"+kin_sub_df.SUB_ORGANISM[n].upper()+"("+kin_sub_df.SUB_MOD_RSD[n].upper()+")")

phos_id = pd.Series(phos_id)

kin_sub_df = kin_sub_df.assign(PHOS_ID = phos_id)

In [54]:
phos_id = []

for n,i in enumerate(disease_df.iterrows()):
    phos_id.append(disease_df.GENE[n].upper()+"_"+disease_df.ORGANISM[n].upper()+"("+disease_df.MOD_RSD[n][0:-2].upper()+")")

phos_id = pd.Series(phos_id)

disease_df = disease_df.assign(PHOS_ID = phos_id)
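
The ID format built above is GENE_ORGANISM(RESIDUE), all in uppercase. As a sketch only, the same string can be assembled with pandas string methods instead of a row loop; the toy rows below are hypothetical:

```python
import pandas as pd

# Hypothetical rows in the layout of the phosphosite table
toy_df = pd.DataFrame({
    "GENE": ["Akt1", "TP53"],
    "ORGANISM": ["human", "human"],
    "MOD_RSD": ["S473-p", "S15-p"],
})

# GENE_ORGANISM(RESIDUE); .str[:-2] removes the trailing "-p" flag
toy_df["PHOS_ID"] = (
    toy_df.GENE.str.upper() + "_" + toy_df.ORGANISM.str.upper()
    + "(" + toy_df.MOD_RSD.str[:-2].str.upper() + ")"
)
```

Because the columns are combined element-wise on the dataframe's own index, this form does not depend on the index having been reset beforehand.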

Make a new kinase column in uppercase, with "_HUMAN" appended, to match the ID in the kinase table

In [55]:
uppercase_kinase = []

for n,i in enumerate(kin_sub_df.iterrows()):
    uppercase_kinase.append(kin_sub_df.GENE[n].upper()+"_"+kin_sub_df.KIN_ORGANISM[n].upper())

uppercase_kinase = pd.Series(uppercase_kinase)

kin_sub_df = kin_sub_df.assign(HUMAN_KINASE = uppercase_kinase)

Make GENE entries uppercase

In [56]:
# This takes a minute to run

uppercase_gene = []

for n,i in enumerate(phosphosite_df.iterrows()):
    uppercase_gene.append(phosphosite_df.GENE[n].upper())

uppercase_gene = pd.Series(uppercase_gene)

phosphosite_df = phosphosite_df.assign(GENE = uppercase_gene)

In [57]:
uppercase_kinase = []

for n,i in enumerate(kin_sub_df.iterrows()):
    uppercase_kinase.append(kin_sub_df.GENE[n].upper())

uppercase_kinase = pd.Series(uppercase_kinase)

kin_sub_df = kin_sub_df.assign(GENE = uppercase_kinase)

Remove -p from MOD_RSD in phosphosite_df

In [58]:
# This takes a minute to run

mod_rsd = []

for n,i in enumerate(phosphosite_df.iterrows()):
    if phosphosite_df.MOD_RSD[n][-2:] == "-p":
        mod_rsd.append(phosphosite_df.MOD_RSD[n][0:-2])
    else:
        mod_rsd.append(phosphosite_df.MOD_RSD[n])

mod_rsd = pd.Series(mod_rsd)

phosphosite_df = phosphosite_df.assign(MOD_RSD = mod_rsd)
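
A possible vectorised alternative for stripping the flag, sketched on hypothetical values: an anchored regular expression removes "-p" only when it is at the end of the string and leaves other values unchanged.

```python
import pandas as pd

# Hypothetical MOD_RSD values, one of them already without a flag
toy_mod_rsd = pd.Series(["S473-p", "T308-p", "Y15"])

# "$" anchors the match to the end of the string
stripped = toy_mod_rsd.str.replace(r'-p$', '', regex=True)
```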

Drop any unnecessary columns

In [59]:
# PROTEIN is an unnecessary extra alias
# ORGANISM is not required as the web app will be human-specific
# and we have already ensured that only human targets are included

phosphosite_df = phosphosite_df.drop(['PROTEIN', 'ORGANISM'], axis=1)

In [60]:
# KIN_ACC_ID and SUB_ACC_ID will be available in kinase table
# KINASE, KIN_ORGANISM, SUBSTRATE, SUB_ORGANISM, SUB_GENE_ID, 
# SUB_MOD_RSD are not required
# SUB_GENE will be in phosphosite_df as GENE
# SITE_GRP_ID, SITE_+/-7_AA and DOMAIN will be available in phosphosite_df

kin_sub_df = kin_sub_df.drop(['KIN_ACC_ID', 'SUB_ACC_ID', 'KINASE',
                              'KIN_ORGANISM', 'SUBSTRATE', 'SUB_ORGANISM', 
                              'SUB_GENE_ID', 'SUB_MOD_RSD',
                              'SUB_GENE', 'SITE_GRP_ID', 'SITE_+/-7_AA',
                              'DOMAIN'], axis=1)

In [61]:
# PROTEIN, ACC_ID and GENE_ID will be available in the kinase table
# ORGANISM is not required
# GENE, HU_CHR_LOC, MW_kD, SITE_GRP_ID, MOD_RSD, DOMAIN, and SITE_+/-7_AA
# will be available in phosphosite_df

disease_df = disease_df.drop(['PROTEIN', 'ACC_ID', 'GENE_ID',
                              'ORGANISM', 'GENE', 'HU_CHR_LOC', 'MW_kD',
                              'SITE_GRP_ID', 'MOD_RSD', 'DOMAIN',
                              'SITE_+/-7_AA'], axis=1)

Add primary key columns

In [62]:
prim_key = []

count = 1

for i in phosphosite_df.PHOS_ID:
    key = "PH"+"{:07d}".format(count)
    prim_key.append(key)
    count += 1

prim_key = pd.Series(prim_key)

phosphosite_df = phosphosite_df.assign(PRIMARY_KEY = prim_key)

In [63]:
prim_key = []

count = 1

for i in kin_sub_df.PHOS_ID:
    key = "KS"+"{:07d}".format(count)
    prim_key.append(key)
    count += 1

prim_key = pd.Series(prim_key)

kin_sub_df = kin_sub_df.assign(PRIMARY_KEY = prim_key)

In [64]:
prim_key = []

count = 1

for i in disease_df.PHOS_ID:
    key = "DP"+"{:07d}".format(count)
    prim_key.append(key)
    count += 1

prim_key = pd.Series(prim_key)

disease_df = disease_df.assign(PRIMARY_KEY = prim_key)
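
The three cells above differ only in the two-letter prefix. One way to express that, as a sketch, is a small helper; `make_keys` is a hypothetical name introduced here and is not part of the notebook:

```python
def make_keys(prefix, n_rows, width=7):
    """Return 1-based keys such as PH0000001 for n_rows rows."""
    return [f"{prefix}{i:0{width}d}" for i in range(1, n_rows + 1)]


# Example: three keys for the kinase-substrate table
keys = make_keys("KS", 3)
```

It could then be used as `phosphosite_df.assign(PRIMARY_KEY = make_keys("PH", len(phosphosite_df)))`, since a plain list is assigned to rows in order.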

Add a column ISOFORM to the phosphosite_df table to indicate whether the row is for an isoform (1) or not (0)

In [65]:
yes_no_isoform = []

for i in phosphosite_df.ACC_ID:
    if "-" in i:
        yes_no_isoform.append(1)
    else:
        yes_no_isoform.append(0)

yes_no_isoform = pd.Series(yes_no_isoform)

phosphosite_df = phosphosite_df.assign(ISOFORM = yes_no_isoform)
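
The loop above would raise a TypeError if any ACC_ID were missing. A sketch of a vectorised version that tolerates missing values is given below. Treating a missing accession as "not an isoform" is an assumption made for this example, and the accessions are illustrative:

```python
import pandas as pd

# Illustrative accessions: canonical, isoform (with "-"), and missing
toy_df = pd.DataFrame({"ACC_ID": ["P31749", "P31749-2", None]})

# regex=False looks for a literal "-"; na=False maps missing values to False
has_dash = toy_df.ACC_ID.str.contains("-", regex=False, na=False)

toy_df["ISOFORM"] = has_dash.astype(int)
```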

Make csv files

In [66]:
phosphosite_df.to_csv("phosphosites.csv", index = False)

In [67]:
kin_sub_df.to_csv("kinases_phosphosites.csv", index = False)

In [68]:
disease_df.to_csv("phosphosites_diseases.csv", index = False)
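
Since PHOS_ID is what links the tables, a quick check that every ID in a linking table exists in the phosphosite table may be worthwhile before loading into the database. A minimal sketch with hypothetical rows (the variable names `phos`, `kin_sub` and `unmatched` are introduced for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of two exported tables
phos = pd.DataFrame({"PHOS_ID": ["AKT1_HUMAN(S473)", "TP53_HUMAN(S15)"]})
kin_sub = pd.DataFrame({
    "HUMAN_KINASE": ["MTOR_HUMAN", "ATM_HUMAN", "CDK1_HUMAN"],
    "PHOS_ID": ["AKT1_HUMAN(S473)", "TP53_HUMAN(S15)", "RB1_HUMAN(S807)"],
})

# Rows whose PHOS_ID has no counterpart in the phosphosite table
unmatched = kin_sub[~kin_sub.PHOS_ID.isin(phos.PHOS_ID)]
```

Any rows in `unmatched` would fail a foreign-key constraint on PHOS_ID, so they are worth inspecting before import.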