# Phosphosite Data Processing, 1 of 2.
## Phospho.ELM Data.

A phosphosite data set has been downloaded as a tab-delimited file from http://phospho.elm.eu.org/dataset.html:

- phosphoELM_vertebrate_2015-04.dump
    - The full list of vertebrate phosphosites available on the website (46,248 rows as of 17 Jan 2020)

In this Jupyter notebook, the file is imported as a pandas DataFrame and the following changes are made to it:

- Remove any rows with information about non-human proteins.
- Translate important substrates using the UniProt API.
- Translate kinases using "human_kinase_dataframe.csv".
- Generate a phosphosite ID column.
- Remove unnecessary columns.

It will then be exported as "phosphosites_2.csv" for subsequent incorporation into data tables with data from phosphosite.org, in Jupyter notebook "Phosphosite-data-processing-2-of-2-Phosphosite.org-downloaded-files.ipynb".

Import required packages

In [None]:
import pandas as pd
import urllib.parse
import urllib.request

Read in phosphosite data from phospho.ELM (and table of human kinases, for translating kinase aliases)

In [None]:
# This table was downloaded from phospho.ELM

phosphosite_2_df = pd.read_table( "phosphoELM_vertebrate_2015-04.dump" )

# This table was generated by script "SDP_kinase_dataframe_script.ipynb"

kinases_df = pd.read_csv( "human_kinase_dataframe.csv" )

Remove any rows where species is not "Homo sapiens"

In [None]:
# Keep only the human rows, then renumber the index from 0

phosphosite_2_df = phosphosite_2_df[ phosphosite_2_df.species == "Homo sapiens" ]

phosphosite_2_df = phosphosite_2_df.reset_index( drop = True )
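
As a quick illustration of the species filter above, the sketch below applies the same idea to a small made-up table. Only the `species` and `acc` column names come from the phospho.ELM file; the rows, the accession "P00001" and the names `toy_df` and `toy_human_df` are invented for the example.

```python
import pandas as pd

# Toy stand-in for the phospho.ELM table; the rows are invented for illustration
toy_df = pd.DataFrame( {
    "acc" : [ "O00141", "P00001", "P04637" ],
    "species" : [ "Homo sapiens", "Mus musculus", "Homo sapiens" ],
} )

# Keep only the human rows, then renumber the index from 0
toy_human_df = toy_df[ toy_df.species == "Homo sapiens" ].reset_index( drop = True )

print( toy_human_df )
```

Renumbering the index matters later in the notebook, where new columns are built as separate lists or Series and attached to the table by position.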

Translate the important substrate IDs to gene symbols and entry names (minus "_HUMAN", which will be added later). This will allow the phosphopeptides in the quantitative phosphoproteomics file to be queried in the database.

In [None]:
# A subset of substrate IDs was chosen out of necessity. There are 5,374 
# unique substrate IDs in this table, but only 299 rows from this table will
# be added to the kinase-phosphosite table generated from phosphosite.org
# data, as most of the kinase-phosphosite relationships here are already there.
# Of the 299 rows, there are 296 unique substrate IDs to query. I attempted 
# to query all 5,374 on two occasions and the connection timed out before 2,000
# were returned. Hence I have selected the 296 important IDs.

substrates = ['O00141','O00161','O00257','O14745','O14757','O14964','O14994','O15273','O15357','O15492','O15530','O43521','O43521','O43586','O43914','O60381','O60506','O60674','O75582','O75676','O75914','O95997','O96017','P00519','P00533','P01100','P01236','P01350','P02686','P03372','P04049','P04150','P04629','P04637','P05106','P05198','P05412','P05787','P06213','P06239','P06401','P06733','P07101','P07203','P07550','P07947','P07948','P08047','P08069','P08100','P08631','P08670','P09038','P09619','P09769','P0C0S8','P10242','P10244','P10721','P10747','P10828','P10912','P11137','P11171','P11274','P11362','P11388','P11831','P11912','P12272','P12318','P12931','P13569','P13693','P14317','P14598','P14859','P15172','P15260','P15336','P15941','P16104','P16220','P16234','P16410','P16885','P17252','P17275','P17480','P17542','P17600','P17655','P17676','P18031','P18206','P18507','P19105','P19174','P19235','P19419','P19429','P19793','P20138','P20700','P20936','P20963','P22001','P22314','P22607','P22681','P23396','P23443','P23458','P23528','P23634','P24844','P24928','P25963','P27448','P27708','P27986','P28562','P28749','P29320','P29322','P29350','P29353','P29597','P30260','P30304','P30305','P30307','P30419','P30443','P30530','P31152','P31946','P33241','P33778','P33981','P33991','P35222','P35568','P35611','P35612','P35637','P36888','P36956','P38398','P38432','P38936','P39880','P40259','P41212','P41743','P42229','P42702','P42768','P43354','P46527','P46695','P48751','P49023','P49407','P49715','P49768','P49841','P49917','P50549','P50613','P51674','P51692','P51812','P51813','P52735','P52799','P52926','P53355','P53667','P53671','P53778','P54259','P55211','P56270','P56945','P58012','P60953','P61925','P61978','P62158','P62993','P68400','P68431','P78314','P78347','P78527','P78536','P84022','P84243','Q00169','Q00535','Q01892','Q02078','Q02241','Q02363','Q02763','Q03721','Q04206','Q04912','Q05513','Q05655','Q05682','Q06124','Q06413','Q07666','Q07820','Q08379','Q12772','Q12888','Q12972','Q13094','Q13131','Q13153','Q13158','Q13177','Q13224','Q13322','Q13415','Q13418','Q13424','Q13444','Q13480','Q13485','Q13507','Q13541','Q13651','Q13765','Q13813','Q14005','Q14118','Q14191','Q14247','Q14289','Q14847','Q14934','Q15139','Q15149','Q15418','Q15648','Q15653','Q15788','Q15796','Q15831','Q16760','Q16821','Q16828','Q7KZI7','Q86WB0','Q86WV1','Q8TDC3','Q8WXE1','Q92731','Q92918','Q92934','Q92974','Q99683','Q9BUB5','Q9BXS5','Q9BXW9','Q9H0H5','Q9H1A4','Q9H1D0','Q9H8V3','Q9HBA0','Q9NQ66','Q9NQS7','Q9NQU5','Q9NRQ2','Q9NRY4','Q9NXH3','Q9NYV6','Q9UBS0','Q9UEW8','Q9UJX2','Q9UJX3','Q9UJY1','Q9UM73','Q9UPZ9','Q9UQC2','Q9UQQ2','Q9Y281','Q9Y2K2','Q9Y618','Q9Y6I3','Q9Y6K9','Q9Y6W5']
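
The list above contains O43521 twice, so each UniProt query loop requests that ID twice. The sketch below shows one way repeats could be detected and removed before querying, using a four-ID excerpt of the list. The names `example_ids`, `duplicates` and `unique_ids` are illustrative and not part of the notebook.

```python
from collections import Counter

# Short excerpt of the full `substrates` list above
example_ids = [ "O15530", "O43521", "O43521", "O43586" ]

# IDs that occur more than once
duplicates = [ acc for acc, count in Counter( example_ids ).items() if count > 1 ]

# Remove repeats while keeping the original order
unique_ids = list( dict.fromkeys( example_ids ) )

print( duplicates )  # ['O43521']
print( unique_ids )  # ['O15530', 'O43521', 'O43586']
```

Leaving the repeat in place is harmless for the dictionaries built later, because the same key is simply written twice with the same value.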

In [None]:
### RETRIEVE PROTEIN/GENE NAMES FROM UNIPROT

# Use UniProt's recommended Python 3 script (available at https://www.uniprot.org/help/api_idmapping)

# Create partial URL

url = "https://www.uniprot.org/uploadlists/"

# Define parameters

params = {
    "from" : "ID", # Assume protein names are in format "ID"
    "to" : "GENENAME", # Retrieve IDs in "GENENAME" format
    "format" : "tab", # Produce tab-delimited output
    "query" : "", # The query protein ID will be defined during the loop
}

# Create an empty list to store the results in

results = []

for i in substrates:
    params[ "query" ] = str( i ) # Enter the substrate ID
    data = urllib.parse.urlencode( params ) # URL-encode the parameters
    data = data.encode( "utf-8" )
    req = urllib.request.Request( url, data ) # Build the POST request
    with urllib.request.urlopen( req ) as f: # Send the query to UniProt
        response = f.read()
    line = response.decode( "utf-8" )
    results.append( line ) # Store results
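
The earlier comment notes that the connection timed out when thousands of IDs were queried. One possible mitigation, sketched below, is to retry a failed request a few times before giving up. The helper `query_with_retry` and the stand-in `flaky_fetch` are hypothetical: the real request would be the `urllib.request.urlopen` call above, which is replaced here by a fake function so that the example runs offline.

```python
import time

def query_with_retry( fetch, query, retries = 3, pause = 0.0 ):
    """Call fetch( query ), retrying on OSError (urllib.error.URLError and TimeoutError are subclasses)."""
    for attempt in range( 1, retries + 1 ):
        try:
            return fetch( query )
        except OSError:
            if attempt == retries: # Out of attempts, so re-raise the last error
                raise
            time.sleep( pause ) # Wait before the next attempt

# Offline stand-in for the UniProt request: fails once, then succeeds
calls = []

def flaky_fetch( query ):
    calls.append( query )
    if len( calls ) == 1:
        raise TimeoutError( "simulated time-out" )
    return "From\tTo\n" + query + "\tSGK1\n"

result = query_with_retry( flaky_fetch, "O00141" )

print( result.split() )  # ['From', 'To', 'O00141', 'SGK1']
```

In the notebook, `fetch` would wrap the request-building and `urlopen` lines of the loop, and a non-zero `pause` would space out the retries.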

In [None]:
# Split each result on whitespace, giving a list of lists without "\n" or "\t" characters

results2 = [ i.split() for i in results ]
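
Because `split()` flattens each response into a single list, an ID with several gene names produces a longer list in which IDs and names alternate. The sketch below shows an alternative that parses the response row by row. It assumes the response body is a header row followed by one tab-separated pair per line, which is what the snapshot below suggests but is not confirmed here. The helper name `parse_mapping` is hypothetical.

```python
def parse_mapping( text ):
    """Return { query ID : [ translated names ] } from one tab-delimited response."""
    mapping = {}
    for row in text.splitlines()[ 1: ]: # Skip the "From<TAB>To" header row
        source, _, target = row.partition( "\t" )
        if target: # Ignore blank or malformed rows
            mapping.setdefault( source, [] ).append( target.strip() )
    return mapping

# Response text reconstructed from the snapshot entry for P84243
example_response = "From\tTo\nP84243\tH3-3A\nP84243\tH3-3B\n"

print( parse_mapping( example_response ) )  # {'P84243': ['H3-3A', 'H3-3B']}
```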

If the above UniProt-querying loop fails to run, uncomment and run the cell below for a snapshot of the data retrieved on 22 Jan 2020.

In [None]:
# results2 = [['From', 'To', 'O00141', 'SGK1'],
#  ['From', 'To', 'O00161', 'SNAP23'],
#  ['From', 'To', 'O00257', 'CBX4'],
#  ['From', 'To', 'O14745', 'SLC9A3R1'],
#  ['From', 'To', 'O14757', 'CHEK1'],
#  ['From', 'To', 'O14964', 'HGS'],
#  ['From', 'To', 'O14994', 'SYN3'],
#  ['From', 'To', 'O15273', 'TCAP'],
#  ['From', 'To', 'O15357', 'INPPL1'],
#  ['From', 'To', 'O15492', 'RGS16'],
#  ['From', 'To', 'O15530', 'PDPK1'],
#  ['From', 'To', 'O43521', 'BCL2L11'],
#  ['From', 'To', 'O43521', 'BCL2L11'],
#  ['From', 'To', 'O43586', 'PSTPIP1'],
#  ['From', 'To', 'O43914', 'TYROBP'],
#  ['From', 'To', 'O60381', 'HBP1'],
#  ['From', 'To', 'O60506', 'SYNCRIP'],
#  ['From', 'To', 'O60674', 'JAK2'],
#  ['From', 'To', 'O75582', 'RPS6KA5'],
#  ['From', 'To', 'O75676', 'RPS6KA4'],
#  ['From', 'To', 'O75914', 'PAK3'],
#  ['From', 'To', 'O95997', 'PTTG1'],
#  ['From', 'To', 'O96017', 'CHEK2'],
#  ['From', 'To', 'P00519', 'ABL1'],
#  ['From', 'To', 'P00533', 'EGFR'],
#  ['From', 'To', 'P01100', 'FOS'],
#  ['From', 'To', 'P01236', 'PRL'],
#  ['From', 'To', 'P01350', 'GAST'],
#  ['From', 'To', 'P02686', 'MBP'],
#  ['From', 'To', 'P03372', 'ESR1'],
#  ['From', 'To', 'P04049', 'RAF1'],
#  ['From', 'To', 'P04150', 'NR3C1'],
#  ['From', 'To', 'P04629', 'NTRK1'],
#  ['From', 'To', 'P04637', 'TP53'],
#  ['From', 'To', 'P05106', 'ITGB3'],
#  ['From', 'To', 'P05198', 'EIF2S1'],
#  ['From', 'To', 'P05412', 'JUN'],
#  ['From', 'To', 'P05787', 'KRT8'],
#  ['From', 'To', 'P06213', 'INSR'],
#  ['From', 'To', 'P06239', 'LCK'],
#  ['From', 'To', 'P06401', 'PGR'],
#  ['From', 'To', 'P06733', 'ENO1'],
#  ['From', 'To', 'P07101', 'TH'],
#  ['From', 'To', 'P07203', 'GPX1'],
#  ['From', 'To', 'P07550', 'ADRB2'],
#  ['From', 'To', 'P07947', 'YES1'],
#  ['From', 'To', 'P07948', 'LYN'],
#  ['From', 'To', 'P08047', 'SP1'],
#  ['From', 'To', 'P08069', 'IGF1R'],
#  ['From', 'To', 'P08100', 'RHO'],
#  ['From', 'To', 'P08631', 'HCK'],
#  ['From', 'To', 'P08670', 'VIM'],
#  ['From', 'To', 'P09038', 'FGF2'],
#  ['From', 'To', 'P09619', 'PDGFRB'],
#  ['From', 'To', 'P09769', 'FGR'],
#  ['From',
#   'To',
#   'P0C0S8',
#   'H2AC11',
#   'P0C0S8',
#   'H2AC13',
#   'P0C0S8',
#   'H2AC15',
#   'P0C0S8',
#   'H2AC16',
#   'P0C0S8',
#   'H2AC17'],
#  ['From', 'To', 'P10242', 'MYB'],
#  ['From', 'To', 'P10244', 'MYBL2'],
#  ['From', 'To', 'P10721', 'KIT'],
#  ['From', 'To', 'P10747', 'CD28'],
#  ['From', 'To', 'P10828', 'THRB'],
#  ['From', 'To', 'P10912', 'GHR'],
#  ['From', 'To', 'P11137', 'MAP2'],
#  ['From', 'To', 'P11171', 'EPB41'],
#  ['From', 'To', 'P11274', 'BCR'],
#  ['From', 'To', 'P11362', 'FGFR1'],
#  ['From', 'To', 'P11388', 'TOP2A'],
#  ['From', 'To', 'P11831', 'SRF'],
#  ['From', 'To', 'P11912', 'CD79A'],
#  ['From', 'To', 'P12272', 'PTHLH'],
#  ['From', 'To', 'P12318', 'FCGR2A'],
#  ['From', 'To', 'P12931', 'SRC'],
#  ['From', 'To', 'P13569', 'CFTR'],
#  ['From', 'To', 'P13693', 'TPT1'],
#  ['From', 'To', 'P14317', 'HCLS1'],
#  ['From', 'To', 'P14598', 'NCF1'],
#  ['From', 'To', 'P14859', 'POU2F1'],
#  ['From', 'To', 'P15172', 'MYOD1'],
#  ['From', 'To', 'P15260', 'IFNGR1'],
#  ['From', 'To', 'P15336', 'ATF2'],
#  ['From', 'To', 'P15941', 'MUC1'],
#  ['From', 'To', 'P16104', 'H2AFX'],
#  ['From', 'To', 'P16220', 'CREB1'],
#  ['From', 'To', 'P16234', 'PDGFRA'],
#  ['From', 'To', 'P16410', 'CTLA4'],
#  ['From', 'To', 'P16885', 'PLCG2'],
#  ['From', 'To', 'P17252', 'PRKCA'],
#  ['From', 'To', 'P17275', 'JUNB'],
#  ['From', 'To', 'P17480', 'UBTF'],
#  ['From', 'To', 'P17542', 'TAL1'],
#  ['From', 'To', 'P17600', 'SYN1'],
#  ['From', 'To', 'P17655', 'CAPN2'],
#  ['From', 'To', 'P17676', 'CEBPB'],
#  ['From', 'To', 'P18031', 'PTPN1'],
#  ['From', 'To', 'P18206', 'VCL'],
#  ['From', 'To', 'P18507', 'GABRG2'],
#  ['From', 'To', 'P19105', 'MYL12A'],
#  ['From', 'To', 'P19174', 'PLCG1'],
#  ['From', 'To', 'P19235', 'EPOR'],
#  ['From', 'To', 'P19419', 'ELK1'],
#  ['From', 'To', 'P19429', 'TNNI3'],
#  ['From', 'To', 'P19793', 'RXRA'],
#  ['From', 'To', 'P20138', 'CD33'],
#  ['From', 'To', 'P20700', 'LMNB1'],
#  ['From', 'To', 'P20936', 'RASA1'],
#  ['From', 'To', 'P20963', 'CD247'],
#  ['From', 'To', 'P22001', 'KCNA3'],
#  ['From', 'To', 'P22314', 'UBA1'],
#  ['From', 'To', 'P22607', 'FGFR3'],
#  ['From', 'To', 'P22681', 'CBL'],
#  ['From', 'To', 'P23396', 'RPS3'],
#  ['From', 'To', 'P23443', 'RPS6KB1'],
#  ['From', 'To', 'P23458', 'JAK1'],
#  ['From', 'To', 'P23528', 'CFL1'],
#  ['From', 'To', 'P23634', 'ATP2B4'],
#  ['From', 'To', 'P24844', 'MYL9'],
#  ['From', 'To', 'P24928', 'POLR2A'],
#  ['From', 'To', 'P25963', 'NFKBIA'],
#  ['From', 'To', 'P27448', 'MARK3'],
#  ['From', 'To', 'P27708', 'CAD'],
#  ['From', 'To', 'P27986', 'PIK3R1'],
#  ['From', 'To', 'P28562', 'DUSP1'],
#  ['From', 'To', 'P28749', 'RBL1'],
#  ['From', 'To', 'P29320', 'EPHA3'],
#  ['From', 'To', 'P29322', 'EPHA8'],
#  ['From', 'To', 'P29350', 'PTPN6'],
#  ['From', 'To', 'P29353', 'SHC1'],
#  ['From', 'To', 'P29597', 'TYK2'],
#  ['From', 'To', 'P30260', 'CDC27'],
#  ['From', 'To', 'P30304', 'CDC25A'],
#  ['From', 'To', 'P30305', 'CDC25B'],
#  ['From', 'To', 'P30307', 'CDC25C'],
#  ['From', 'To', 'P30419', 'NMT1'],
#  ['From', 'To', 'P30443', 'HLA-A'],
#  ['From', 'To', 'P30530', 'AXL'],
#  ['From', 'To', 'P31152', 'MAPK4'],
#  ['From', 'To', 'P31946', 'YWHAB'],
#  ['From', 'To', 'P33241', 'LSP1'],
#  ['From', 'To', 'P33778', 'HIST1H2BB'],
#  ['From', 'To', 'P33981', 'TTK'],
#  ['From', 'To', 'P33991', 'MCM4'],
#  ['From', 'To', 'P35222', 'CTNNB1'],
#  ['From', 'To', 'P35568', 'IRS1'],
#  ['From', 'To', 'P35611', 'ADD1'],
#  ['From', 'To', 'P35612', 'ADD2'],
#  ['From', 'To', 'P35637', 'FUS'],
#  ['From', 'To', 'P36888', 'FLT3'],
#  ['From', 'To', 'P36956', 'SREBF1'],
#  ['From', 'To', 'P38398', 'BRCA1'],
#  ['From', 'To', 'P38432', 'COIL'],
#  ['From', 'To', 'P38936', 'CDKN1A'],
#  ['From', 'To', 'P39880', 'CUX1'],
#  ['From', 'To', 'P40259', 'CD79B'],
#  ['From', 'To', 'P41212', 'ETV6'],
#  ['From', 'To', 'P41743', 'PRKCI'],
#  ['From', 'To', 'P42229', 'STAT5A'],
#  ['From', 'To', 'P42702', 'LIFR'],
#  ['From', 'To', 'P42768', 'WAS'],
#  ['From', 'To', 'P43354', 'NR4A2'],
#  ['From', 'To', 'P46527', 'CDKN1B'],
#  ['From', 'To', 'P46695', 'IER3'],
#  ['From', 'To', 'P48751', 'SLC4A3'],
#  ['From', 'To', 'P49023', 'PXN'],
#  ['From', 'To', 'P49407', 'ARRB1'],
#  ['From', 'To', 'P49715', 'CEBPA'],
#  ['From', 'To', 'P49768', 'PSEN1'],
#  ['From', 'To', 'P49841', 'GSK3B'],
#  ['From', 'To', 'P49917', 'LIG4'],
#  ['From', 'To', 'P50549', 'ETV1'],
#  ['From', 'To', 'P50613', 'CDK7'],
#  ['From', 'To', 'P51674', 'GPM6A'],
#  ['From', 'To', 'P51692', 'STAT5B'],
#  ['From', 'To', 'P51812', 'RPS6KA3'],
#  ['From', 'To', 'P51813', 'BMX'],
#  ['From', 'To', 'P52735', 'VAV2'],
#  ['From', 'To', 'P52799', 'EFNB2'],
#  ['From', 'To', 'P52926', 'HMGA2'],
#  ['From', 'To', 'P53355', 'DAPK1'],
#  ['From', 'To', 'P53667', 'LIMK1'],
#  ['From', 'To', 'P53671', 'LIMK2'],
#  ['From', 'To', 'P53778', 'MAPK12'],
#  ['From', 'To', 'P54259', 'ATN1'],
#  ['From', 'To', 'P55211', 'CASP9'],
#  ['From', 'To', 'P56270', 'MAZ'],
#  ['From', 'To', 'P56945', 'BCAR1'],
#  ['From', 'To', 'P58012', 'FOXL2'],
#  ['From', 'To', 'P60953', 'CDC42'],
#  ['From', 'To', 'P61925', 'PKIA'],
#  ['From', 'To', 'P61978', 'HNRNPK'],
#  ['From', 'To', 'P62158', 'CALM1', 'P62158', 'CALM2', 'P62158', 'CALM3'],
#  ['From', 'To', 'P62993', 'GRB2'],
#  ['From', 'To', 'P68400', 'CSNK2A1'],
#  ['From',
#   'To',
#   'P68431',
#   'H3C1',
#   'P68431',
#   'H3C10',
#   'P68431',
#   'H3C11',
#   'P68431',
#   'H3C12',
#   'P68431',
#   'H3C2',
#   'P68431',
#   'H3C3',
#   'P68431',
#   'H3C4',
#   'P68431',
#   'H3C6',
#   'P68431',
#   'H3C7',
#   'P68431',
#   'H3C8'],
#  ['From', 'To', 'P78314', 'SH3BP2'],
#  ['From', 'To', 'P78347', 'GTF2I'],
#  ['From', 'To', 'P78527', 'PRKDC'],
#  ['From', 'To', 'P78536', 'ADAM17'],
#  ['From', 'To', 'P84022', 'SMAD3'],
#  ['From', 'To', 'P84243', 'H3-3A', 'P84243', 'H3-3B'],
#  ['From', 'To', 'Q00169', 'PITPNA'],
#  ['From', 'To', 'Q00535', 'CDK5'],
#  ['From', 'To', 'Q01892', 'SPIB'],
#  ['From', 'To', 'Q02078', 'MEF2A'],
#  ['From', 'To', 'Q02241', 'KIF23'],
#  ['From', 'To', 'Q02363', 'ID2'],
#  ['From', 'To', 'Q02763', 'TEK'],
#  ['From', 'To', 'Q03721', 'KCNC4'],
#  ['From', 'To', 'Q04206', 'RELA'],
#  ['From', 'To', 'Q04912', 'MST1R'],
#  ['From', 'To', 'Q05513', 'PRKCZ'],
#  ['From', 'To', 'Q05655', 'PRKCD'],
#  ['From', 'To', 'Q05682', 'CALD1'],
#  ['From', 'To', 'Q06124', 'PTPN11'],
#  ['From', 'To', 'Q06413', 'MEF2C'],
#  ['From', 'To', 'Q07666', 'KHDRBS1'],
#  ['From', 'To', 'Q07820', 'MCL1'],
#  ['From', 'To', 'Q08379', 'GOLGA2'],
#  ['From', 'To', 'Q12772', 'SREBF2'],
#  ['From', 'To', 'Q12888', 'TP53BP1'],
#  ['From', 'To', 'Q12972', 'PPP1R8'],
#  ['From', 'To', 'Q13094', 'LCP2'],
#  ['From', 'To', 'Q13131', 'PRKAA1'],
#  ['From', 'To', 'Q13153', 'PAK1'],
#  ['From', 'To', 'Q13158', 'FADD'],
#  ['From', 'To', 'Q13177', 'PAK2'],
#  ['From', 'To', 'Q13224', 'GRIN2B'],
#  ['From', 'To', 'Q13322', 'GRB10'],
#  ['From', 'To', 'Q13415', 'ORC1'],
#  ['From', 'To', 'Q13418', 'ILK'],
#  ['From', 'To', 'Q13424', 'SNTA1'],
#  ['From', 'To', 'Q13444', 'ADAM15'],
#  ['From', 'To', 'Q13480', 'GAB1'],
#  ['From', 'To', 'Q13485', 'SMAD4'],
#  ['From', 'To', 'Q13507', 'TRPC3'],
#  ['From', 'To', 'Q13541', 'EIF4EBP1'],
#  ['From', 'To', 'Q13651', 'IL10RA'],
#  ['From', 'To', 'Q13765', 'NACA'],
#  ['From', 'To', 'Q13813', 'SPTAN1'],
#  ['From', 'To', 'Q14005', 'IL16'],
#  ['From', 'To', 'Q14118', 'DAG1'],
#  ['From', 'To', 'Q14191', 'WRN'],
#  ['From', 'To', 'Q14247', 'CTTN'],
#  ['From', 'To', 'Q14289', 'PTK2B'],
#  ['From', 'To', 'Q14847', 'LASP1'],
#  ['From', 'To', 'Q14934', 'NFATC4'],
#  ['From', 'To', 'Q15139', 'PRKD1'],
#  ['From', 'To', 'Q15149', 'PLEC'],
#  ['From', 'To', 'Q15418', 'RPS6KA1'],
#  ['From', 'To', 'Q15648', 'MED1'],
#  ['From', 'To', 'Q15653', 'NFKBIB'],
#  ['From', 'To', 'Q15788', 'NCOA1'],
#  ['From', 'To', 'Q15796', 'SMAD2'],
#  ['From', 'To', 'Q15831', 'STK11'],
#  ['From', 'To', 'Q16760', 'DGKD'],
#  ['From', 'To', 'Q16821', 'PPP1R3A'],
#  ['From', 'To', 'Q16828', 'DUSP6'],
#  ['From', 'To', 'Q7KZI7', 'MARK2'],
#  ['From', 'To', 'Q86WB0', 'ZC3HC1'],
#  ['From', 'To', 'Q86WV1', 'SKAP1'],
#  ['From', 'To', 'Q8TDC3', 'BRSK1'],
#  ['From', 'To', 'Q8WXE1', 'ATRIP'],
#  ['From', 'To', 'Q92731', 'ESR2'],
#  ['From', 'To', 'Q92918', 'MAP4K1'],
#  ['From', 'To', 'Q92934', 'BAD'],
#  ['From', 'To', 'Q92974', 'ARHGEF2'],
#  ['From', 'To', 'Q99683', 'MAP3K5'],
#  ['From', 'To', 'Q9BUB5', 'MKNK1'],
#  ['From', 'To', 'Q9BXS5', 'AP1M1'],
#  ['From', 'To', 'Q9BXW9', 'FANCD2'],
#  ['From', 'To', 'Q9H0H5', 'RACGAP1'],
#  ['From', 'To', 'Q9H1A4', 'ANAPC1'],
#  ['From', 'To', 'Q9H1D0', 'TRPV6'],
#  ['From', 'To', 'Q9H8V3', 'ECT2'],
#  ['From', 'To', 'Q9HBA0', 'TRPV4'],
#  ['From', 'To', 'Q9NQ66', 'PLCB1'],
#  ['From', 'To', 'Q9NQS7', 'INCENP'],
#  ['From', 'To', 'Q9NQU5', 'PAK6'],
#  ['From', 'To', 'Q9NRQ2', 'PLSCR4'],
#  ['From', 'To', 'Q9NRY4', 'ARHGAP35'],
#  ['From', 'To', 'Q9NXH3', 'PPP1R14D'],
#  ['From', 'To', 'Q9NYV6', 'RRN3'],
#  ['From', 'To', 'Q9UBS0', 'RPS6KB2'],
#  ['From', 'To', 'Q9UEW8', 'STK39'],
#  ['From', 'To', 'Q9UJX2', 'CDC23'],
#  ['From', 'To', 'Q9UJX3', 'ANAPC7'],
#  ['From', 'To', 'Q9UJY1', 'HSPB8'],
#  ['From', 'To', 'Q9UM73', 'ALK'],
#  ['From', 'To', 'Q9UPZ9', 'ICK'],
#  ['From', 'To', 'Q9UQC2', 'GAB2'],
#  ['From', 'To', 'Q9UQQ2', 'SH2B3'],
#  ['From', 'To', 'Q9Y281', 'CFL2'],
#  ['From', 'To', 'Q9Y2K2', 'SIK3'],
#  ['From', 'To', 'Q9Y618', 'NCOR2'],
#  ['From', 'To', 'Q9Y6I3', 'EPN1'],
#  ['From', 'To', 'Q9Y6K9', 'IKBKG'],
#  ['From', 'To', 'Q9Y6W5', 'WASF2']]

In [None]:
# Only attempt to translate unambiguous accession IDs
# Each sub-list should have length 4, e.g. "['From', 'To', 'Q9Y6W5', 'WASF2']"

proteindict = {}

for i in results2: # i = one substrate and all of its possible translations
    if len( i ) == 4: # Lists of this length have only one, unambiguous translation option
        proteindict[ str( i[2] ) ] = str( i[3] ) # Add to dictionary
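
IDs with more than one translation are skipped silently by the length check above. A minimal sketch of how they could be collected for inspection is shown below, using three entries copied from the 22 Jan 2020 snapshot. The names `example_results`, `unambiguous` and `ambiguous` are illustrative only.

```python
# Three entries copied from the 22 Jan 2020 snapshot
example_results = [
    [ "From", "To", "O00141", "SGK1" ],
    [ "From", "To", "P62158", "CALM1", "P62158", "CALM2", "P62158", "CALM3" ],
    [ "From", "To", "P84243", "H3-3A", "P84243", "H3-3B" ],
]

unambiguous = {}
ambiguous = {}

for entry in example_results:
    pairs = entry[ 2: ] # Drop the "From" / "To" header
    ids = pairs[ 0::2 ] # Query IDs sit at even positions
    names = pairs[ 1::2 ] # Translations sit at odd positions
    if len( names ) == 1:
        unambiguous[ ids[ 0 ] ] = names[ 0 ]
    elif names:
        ambiguous[ ids[ 0 ] ] = names

print( unambiguous )  # {'O00141': 'SGK1'}
print( ambiguous )    # {'P62158': ['CALM1', 'CALM2', 'CALM3'], 'P84243': ['H3-3A', 'H3-3B']}
```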

In [None]:
# Create protein/gene ID column for database

# Look up each substrate acc ID in the dictionary; IDs without a
# translation get an empty string

protein_ids = [ proteindict.get( str( i ), "" ) for i in phosphosite_2_df.acc ]

phosphosite_2_df = phosphosite_2_df.assign( SUB_PROTEIN = protein_ids )
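
The lookup above could also be written with pandas' `Series.map`, as sketched below on toy data. The accession "X00000" is invented to represent an untranslated ID, and `toy_df` and `toy_dict` are illustrative names rather than notebook variables.

```python
import pandas as pd

# Toy data: two accessions from the snapshot and one invented, untranslated ID
toy_df = pd.DataFrame( { "acc" : [ "O00141", "X00000", "P04637" ] } )
toy_dict = { "O00141" : "SGK1", "P04637" : "TP53" }

# map() returns NaN for keys missing from the dictionary;
# fillna( "" ) turns those into empty strings
toy_df = toy_df.assign( SUB_PROTEIN = toy_df.acc.astype( str ).map( toy_dict ).fillna( "" ) )

print( toy_df.SUB_PROTEIN.tolist() )  # ['SGK1', '', 'TP53']
```

Because `map` works on the column itself, the result keeps the table's own index and does not depend on the index having been reset.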

In [None]:
### RETRIEVE ENTRY NAMES IN "_HUMAN" FORMAT FROM UNIPROT

# Use UniProt's recommended Python 3 script (available at https://www.uniprot.org/help/api_idmapping)

# Create partial URL

url = "https://www.uniprot.org/uploadlists/"

# Define parameters

params = {
    "from" : "ID", # Assume protein names are in format "ID"
    "to" : "ID", # Retrieve IDs in <gene>_HUMAN format
    "format" : "tab", # Produce tab-delimited output
    "query" : "", # The query protein ID will be defined during the loop
}

# Create an empty list to store the results in

results3 = []

for i in substrates:
    params[ "query" ] = str( i ) # Enter the substrate ID
    data = urllib.parse.urlencode( params ) # URL-encode the parameters
    data = data.encode( "utf-8" )
    req = urllib.request.Request( url, data ) # Build the POST request
    with urllib.request.urlopen( req ) as f: # Send the query to UniProt
        response = f.read()
    line = response.decode( "utf-8" )
    results3.append( line ) # Store results
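
The two query cells differ only in the "to" parameter. The sketch below shows how the request body could be built by one shared helper, so that the gene-name and entry-name queries use the same code. The function name `build_request_body` is hypothetical, and the example only encodes the parameters; it does not contact UniProt.

```python
import urllib.parse

def build_request_body( query, to_format, from_format = "ID" ):
    """Return the URL-encoded, UTF-8 request body for one substrate ID."""
    params = {
        "from" : from_format,
        "to" : to_format,
        "format" : "tab",
        "query" : str( query ),
    }
    return urllib.parse.urlencode( params ).encode( "utf-8" )

print( build_request_body( "O00141", "GENENAME" ) )  # b'from=ID&to=GENENAME&format=tab&query=O00141'
print( build_request_body( "O00141", "ID" ) )        # b'from=ID&to=ID&format=tab&query=O00141'
```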

In [None]:
# Split each result on whitespace, giving a list of lists without "\n" or "\t" characters

results4 = [ i.split() for i in results3 ]

If the above UniProt-querying loop fails to run, uncomment and run the cell below for a snapshot of the data retrieved on 22 Jan 2020.

In [None]:
# results4 = [['From', 'To', 'O00141', 'SGK1_HUMAN'],
#  ['From', 'To', 'O00161', 'SNP23_HUMAN'],
#  ['From', 'To', 'O00257', 'CBX4_HUMAN'],
#  ['From', 'To', 'O14745', 'NHRF1_HUMAN'],
#  ['From', 'To', 'O14757', 'CHK1_HUMAN'],
#  ['From', 'To', 'O14964', 'HGS_HUMAN'],
#  ['From', 'To', 'O14994', 'SYN3_HUMAN'],
#  ['From', 'To', 'O15273', 'TELT_HUMAN'],
#  ['From', 'To', 'O15357', 'SHIP2_HUMAN'],
#  ['From', 'To', 'O15492', 'RGS16_HUMAN'],
#  ['From', 'To', 'O15530', 'PDPK1_HUMAN'],
#  ['From', 'To', 'O43521', 'B2L11_HUMAN'],
#  ['From', 'To', 'O43521', 'B2L11_HUMAN'],
#  ['From', 'To', 'O43586', 'PPIP1_HUMAN'],
#  ['From', 'To', 'O43914', 'TYOBP_HUMAN'],
#  ['From', 'To', 'O60381', 'HBP1_HUMAN'],
#  ['From', 'To', 'O60506', 'HNRPQ_HUMAN'],
#  ['From', 'To', 'O60674', 'JAK2_HUMAN'],
#  ['From', 'To', 'O75582', 'KS6A5_HUMAN'],
#  ['From', 'To', 'O75676', 'KS6A4_HUMAN'],
#  ['From', 'To', 'O75914', 'PAK3_HUMAN'],
#  ['From', 'To', 'O95997', 'PTTG1_HUMAN'],
#  ['From', 'To', 'O96017', 'CHK2_HUMAN'],
#  ['From', 'To', 'P00519', 'ABL1_HUMAN'],
#  ['From', 'To', 'P00533', 'EGFR_HUMAN'],
#  ['From', 'To', 'P01100', 'FOS_HUMAN'],
#  ['From', 'To', 'P01236', 'PRL_HUMAN'],
#  ['From', 'To', 'P01350', 'GAST_HUMAN'],
#  ['From', 'To', 'P02686', 'MBP_HUMAN'],
#  ['From', 'To', 'P03372', 'ESR1_HUMAN'],
#  ['From', 'To', 'P04049', 'RAF1_HUMAN'],
#  ['From', 'To', 'P04150', 'GCR_HUMAN'],
#  ['From', 'To', 'P04629', 'NTRK1_HUMAN'],
#  ['From', 'To', 'P04637', 'P53_HUMAN'],
#  ['From', 'To', 'P05106', 'ITB3_HUMAN'],
#  ['From', 'To', 'P05198', 'IF2A_HUMAN'],
#  ['From', 'To', 'P05412', 'JUN_HUMAN'],
#  ['From', 'To', 'P05787', 'K2C8_HUMAN'],
#  ['From', 'To', 'P06213', 'INSR_HUMAN'],
#  ['From', 'To', 'P06239', 'LCK_HUMAN'],
#  ['From', 'To', 'P06401', 'PRGR_HUMAN'],
#  ['From', 'To', 'P06733', 'ENOA_HUMAN'],
#  ['From', 'To', 'P07101', 'TY3H_HUMAN'],
#  ['From', 'To', 'P07203', 'GPX1_HUMAN'],
#  ['From', 'To', 'P07550', 'ADRB2_HUMAN'],
#  ['From', 'To', 'P07947', 'YES_HUMAN'],
#  ['From', 'To', 'P07948', 'LYN_HUMAN'],
#  ['From', 'To', 'P08047', 'SP1_HUMAN'],
#  ['From', 'To', 'P08069', 'IGF1R_HUMAN'],
#  ['From', 'To', 'P08100', 'OPSD_HUMAN'],
#  ['From', 'To', 'P08631', 'HCK_HUMAN'],
#  ['From', 'To', 'P08670', 'VIME_HUMAN'],
#  ['From', 'To', 'P09038', 'FGF2_HUMAN'],
#  ['From', 'To', 'P09619', 'PGFRB_HUMAN'],
#  ['From', 'To', 'P09769', 'FGR_HUMAN'],
#  ['From', 'To', 'P0C0S8', 'H2A1_HUMAN'],
#  ['From', 'To', 'P10242', 'MYB_HUMAN'],
#  ['From', 'To', 'P10244', 'MYBB_HUMAN'],
#  ['From', 'To', 'P10721', 'KIT_HUMAN'],
#  ['From', 'To', 'P10747', 'CD28_HUMAN'],
#  ['From', 'To', 'P10828', 'THB_HUMAN'],
#  ['From', 'To', 'P10912', 'GHR_HUMAN'],
#  ['From', 'To', 'P11137', 'MTAP2_HUMAN'],
#  ['From', 'To', 'P11171', '41_HUMAN'],
#  ['From', 'To', 'P11274', 'BCR_HUMAN'],
#  ['From', 'To', 'P11362', 'FGFR1_HUMAN'],
#  ['From', 'To', 'P11388', 'TOP2A_HUMAN'],
#  ['From', 'To', 'P11831', 'SRF_HUMAN'],
#  ['From', 'To', 'P11912', 'CD79A_HUMAN'],
#  ['From', 'To', 'P12272', 'PTHR_HUMAN'],
#  ['From', 'To', 'P12318', 'FCG2A_HUMAN'],
#  ['From', 'To', 'P12931', 'SRC_HUMAN'],
#  ['From', 'To', 'P13569', 'CFTR_HUMAN'],
#  ['From', 'To', 'P13693', 'TCTP_HUMAN'],
#  ['From', 'To', 'P14317', 'HCLS1_HUMAN'],
#  ['From', 'To', 'P14598', 'NCF1_HUMAN'],
#  ['From', 'To', 'P14859', 'PO2F1_HUMAN'],
#  ['From', 'To', 'P15172', 'MYOD1_HUMAN'],
#  ['From', 'To', 'P15260', 'INGR1_HUMAN'],
#  ['From', 'To', 'P15336', 'ATF2_HUMAN'],
#  ['From', 'To', 'P15941', 'MUC1_HUMAN'],
#  ['From', 'To', 'P16104', 'H2AX_HUMAN'],
#  ['From', 'To', 'P16220', 'CREB1_HUMAN'],
#  ['From', 'To', 'P16234', 'PGFRA_HUMAN'],
#  ['From', 'To', 'P16410', 'CTLA4_HUMAN'],
#  ['From', 'To', 'P16885', 'PLCG2_HUMAN'],
#  ['From', 'To', 'P17252', 'KPCA_HUMAN'],
#  ['From', 'To', 'P17275', 'JUNB_HUMAN'],
#  ['From', 'To', 'P17480', 'UBF1_HUMAN'],
#  ['From', 'To', 'P17542', 'TAL1_HUMAN'],
#  ['From', 'To', 'P17600', 'SYN1_HUMAN'],
#  ['From', 'To', 'P17655', 'CAN2_HUMAN'],
#  ['From', 'To', 'P17676', 'CEBPB_HUMAN'],
#  ['From', 'To', 'P18031', 'PTN1_HUMAN'],
#  ['From', 'To', 'P18206', 'VINC_HUMAN'],
#  ['From', 'To', 'P18507', 'GBRG2_HUMAN'],
#  ['From', 'To', 'P19105', 'ML12A_HUMAN'],
#  ['From', 'To', 'P19174', 'PLCG1_HUMAN'],
#  ['From', 'To', 'P19235', 'EPOR_HUMAN'],
#  ['From', 'To', 'P19419', 'ELK1_HUMAN'],
#  ['From', 'To', 'P19429', 'TNNI3_HUMAN'],
#  ['From', 'To', 'P19793', 'RXRA_HUMAN'],
#  ['From', 'To', 'P20138', 'CD33_HUMAN'],
#  ['From', 'To', 'P20700', 'LMNB1_HUMAN'],
#  ['From', 'To', 'P20936', 'RASA1_HUMAN'],
#  ['From', 'To', 'P20963', 'CD3Z_HUMAN'],
#  ['From', 'To', 'P22001', 'KCNA3_HUMAN'],
#  ['From', 'To', 'P22314', 'UBA1_HUMAN'],
#  ['From', 'To', 'P22607', 'FGFR3_HUMAN'],
#  ['From', 'To', 'P22681', 'CBL_HUMAN'],
#  ['From', 'To', 'P23396', 'RS3_HUMAN'],
#  ['From', 'To', 'P23443', 'KS6B1_HUMAN'],
#  ['From', 'To', 'P23458', 'JAK1_HUMAN'],
#  ['From', 'To', 'P23528', 'COF1_HUMAN'],
#  ['From', 'To', 'P23634', 'AT2B4_HUMAN'],
#  ['From', 'To', 'P24844', 'MYL9_HUMAN'],
#  ['From', 'To', 'P24928', 'RPB1_HUMAN'],
#  ['From', 'To', 'P25963', 'IKBA_HUMAN'],
#  ['From', 'To', 'P27448', 'MARK3_HUMAN'],
#  ['From', 'To', 'P27708', 'PYR1_HUMAN'],
#  ['From', 'To', 'P27986', 'P85A_HUMAN'],
#  ['From', 'To', 'P28562', 'DUS1_HUMAN'],
#  ['From', 'To', 'P28749', 'RBL1_HUMAN'],
#  ['From', 'To', 'P29320', 'EPHA3_HUMAN'],
#  ['From', 'To', 'P29322', 'EPHA8_HUMAN'],
#  ['From', 'To', 'P29350', 'PTN6_HUMAN'],
#  ['From', 'To', 'P29353', 'SHC1_HUMAN'],
#  ['From', 'To', 'P29597', 'TYK2_HUMAN'],
#  ['From', 'To', 'P30260', 'CDC27_HUMAN'],
#  ['From', 'To', 'P30304', 'MPIP1_HUMAN'],
#  ['From', 'To', 'P30305', 'MPIP2_HUMAN'],
#  ['From', 'To', 'P30307', 'MPIP3_HUMAN'],
#  ['From', 'To', 'P30419', 'NMT1_HUMAN'],
#  ['From', 'To', 'P30443', 'HLAA_HUMAN'],
#  ['From', 'To', 'P30530', 'UFO_HUMAN'],
#  ['From', 'To', 'P31152', 'MK04_HUMAN'],
#  ['From', 'To', 'P31946', '1433B_HUMAN'],
#  ['From', 'To', 'P33241', 'LSP1_HUMAN'],
#  ['From', 'To', 'P33778', 'H2B1B_HUMAN'],
#  ['From', 'To', 'P33981', 'TTK_HUMAN'],
#  ['From', 'To', 'P33991', 'MCM4_HUMAN'],
#  ['From', 'To', 'P35222', 'CTNB1_HUMAN'],
#  ['From', 'To', 'P35568', 'IRS1_HUMAN'],
#  ['From', 'To', 'P35611', 'ADDA_HUMAN'],
#  ['From', 'To', 'P35612', 'ADDB_HUMAN'],
#  ['From', 'To', 'P35637', 'FUS_HUMAN'],
#  ['From', 'To', 'P36888', 'FLT3_HUMAN'],
#  ['From', 'To', 'P36956', 'SRBP1_HUMAN'],
#  ['From', 'To', 'P38398', 'BRCA1_HUMAN'],
#  ['From', 'To', 'P38432', 'COIL_HUMAN'],
#  ['From', 'To', 'P38936', 'CDN1A_HUMAN'],
#  ['From', 'To', 'P39880', 'CUX1_HUMAN'],
#  ['From', 'To', 'P40259', 'CD79B_HUMAN'],
#  ['From', 'To', 'P41212', 'ETV6_HUMAN'],
#  ['From', 'To', 'P41743', 'KPCI_HUMAN'],
#  ['From', 'To', 'P42229', 'STA5A_HUMAN'],
#  ['From', 'To', 'P42702', 'LIFR_HUMAN'],
#  ['From', 'To', 'P42768', 'WASP_HUMAN'],
#  ['From', 'To', 'P43354', 'NR4A2_HUMAN'],
#  ['From', 'To', 'P46527', 'CDN1B_HUMAN'],
#  ['From', 'To', 'P46695', 'IEX1_HUMAN'],
#  ['From', 'To', 'P48751', 'B3A3_HUMAN'],
#  ['From', 'To', 'P49023', 'PAXI_HUMAN'],
#  ['From', 'To', 'P49407', 'ARRB1_HUMAN'],
#  ['From', 'To', 'P49715', 'CEBPA_HUMAN'],
#  ['From', 'To', 'P49768', 'PSN1_HUMAN'],
#  ['From', 'To', 'P49841', 'GSK3B_HUMAN'],
#  ['From', 'To', 'P49917', 'DNLI4_HUMAN'],
#  ['From', 'To', 'P50549', 'ETV1_HUMAN'],
#  ['From', 'To', 'P50613', 'CDK7_HUMAN'],
#  ['From', 'To', 'P51674', 'GPM6A_HUMAN'],
#  ['From', 'To', 'P51692', 'STA5B_HUMAN'],
#  ['From', 'To', 'P51812', 'KS6A3_HUMAN'],
#  ['From', 'To', 'P51813', 'BMX_HUMAN'],
#  ['From', 'To', 'P52735', 'VAV2_HUMAN'],
#  ['From', 'To', 'P52799', 'EFNB2_HUMAN'],
#  ['From', 'To', 'P52926', 'HMGA2_HUMAN'],
#  ['From', 'To', 'P53355', 'DAPK1_HUMAN'],
#  ['From', 'To', 'P53667', 'LIMK1_HUMAN'],
#  ['From', 'To', 'P53671', 'LIMK2_HUMAN'],
#  ['From', 'To', 'P53778', 'MK12_HUMAN'],
#  ['From', 'To', 'P54259', 'ATN1_HUMAN'],
#  ['From', 'To', 'P55211', 'CASP9_HUMAN'],
#  ['From', 'To', 'P56270', 'MAZ_HUMAN'],
#  ['From', 'To', 'P56945', 'BCAR1_HUMAN'],
#  ['From', 'To', 'P58012', 'FOXL2_HUMAN'],
#  ['From', 'To', 'P60953', 'CDC42_HUMAN'],
#  ['From', 'To', 'P61925', 'IPKA_HUMAN'],
#  ['From', 'To', 'P61978', 'HNRPK_HUMAN'],
#  ['From',
#   'To',
#   'P62158',
#   'CALM3_HUMAN',
#   'P62158',
#   'CALM1_HUMAN',
#   'P62158',
#   'CALM2_HUMAN'],
#  ['From', 'To', 'P62993', 'GRB2_HUMAN'],
#  ['From', 'To', 'P68400', 'CSK21_HUMAN'],
#  ['From', 'To', 'P68431', 'H31_HUMAN'],
#  ['From', 'To', 'P78314', '3BP2_HUMAN'],
#  ['From', 'To', 'P78347', 'GTF2I_HUMAN'],
#  ['From', 'To', 'P78527', 'PRKDC_HUMAN'],
#  ['From', 'To', 'P78536', 'ADA17_HUMAN'],
#  ['From', 'To', 'P84022', 'SMAD3_HUMAN'],
#  ['From', 'To', 'P84243', 'H33_HUMAN'],
#  ['From', 'To', 'Q00169', 'PIPNA_HUMAN'],
#  ['From', 'To', 'Q00535', 'CDK5_HUMAN'],
#  ['From', 'To', 'Q01892', 'SPIB_HUMAN'],
#  ['From', 'To', 'Q02078', 'MEF2A_HUMAN'],
#  ['From', 'To', 'Q02241', 'KIF23_HUMAN'],
#  ['From', 'To', 'Q02363', 'ID2_HUMAN'],
#  ['From', 'To', 'Q02763', 'TIE2_HUMAN'],
#  ['From', 'To', 'Q03721', 'KCNC4_HUMAN'],
#  ['From', 'To', 'Q04206', 'TF65_HUMAN'],
#  ['From', 'To', 'Q04912', 'RON_HUMAN'],
#  ['From', 'To', 'Q05513', 'KPCZ_HUMAN'],
#  ['From', 'To', 'Q05655', 'KPCD_HUMAN'],
#  ['From', 'To', 'Q05682', 'CALD1_HUMAN'],
#  ['From', 'To', 'Q06124', 'PTN11_HUMAN'],
#  ['From', 'To', 'Q06413', 'MEF2C_HUMAN'],
#  ['From', 'To', 'Q07666', 'KHDR1_HUMAN'],
#  ['From', 'To', 'Q07820', 'MCL1_HUMAN'],
#  ['From', 'To', 'Q08379', 'GOGA2_HUMAN'],
#  ['From', 'To', 'Q12772', 'SRBP2_HUMAN'],
#  ['From', 'To', 'Q12888', 'TP53B_HUMAN'],
#  ['From', 'To', 'Q12972', 'PP1R8_HUMAN'],
#  ['From', 'To', 'Q13094', 'LCP2_HUMAN'],
#  ['From', 'To', 'Q13131', 'AAPK1_HUMAN'],
#  ['From', 'To', 'Q13153', 'PAK1_HUMAN'],
#  ['From', 'To', 'Q13158', 'FADD_HUMAN'],
#  ['From', 'To', 'Q13177', 'PAK2_HUMAN'],
#  ['From', 'To', 'Q13224', 'NMDE2_HUMAN'],
#  ['From', 'To', 'Q13322', 'GRB10_HUMAN'],
#  ['From', 'To', 'Q13415', 'ORC1_HUMAN'],
#  ['From', 'To', 'Q13418', 'ILK_HUMAN'],
#  ['From', 'To', 'Q13424', 'SNTA1_HUMAN'],
#  ['From', 'To', 'Q13444', 'ADA15_HUMAN'],
#  ['From', 'To', 'Q13480', 'GAB1_HUMAN'],
#  ['From', 'To', 'Q13485', 'SMAD4_HUMAN'],
#  ['From', 'To', 'Q13507', 'TRPC3_HUMAN'],
#  ['From', 'To', 'Q13541', '4EBP1_HUMAN'],
#  ['From', 'To', 'Q13651', 'I10R1_HUMAN'],
#  ['From', 'To', 'Q13765', 'NACA_HUMAN'],
#  ['From', 'To', 'Q13813', 'SPTN1_HUMAN'],
#  ['From', 'To', 'Q14005', 'IL16_HUMAN'],
#  ['From', 'To', 'Q14118', 'DAG1_HUMAN'],
#  ['From', 'To', 'Q14191', 'WRN_HUMAN'],
#  ['From', 'To', 'Q14247', 'SRC8_HUMAN'],
#  ['From', 'To', 'Q14289', 'FAK2_HUMAN'],
#  ['From', 'To', 'Q14847', 'LASP1_HUMAN'],
#  ['From', 'To', 'Q14934', 'NFAC4_HUMAN'],
#  ['From', 'To', 'Q15139', 'KPCD1_HUMAN'],
#  ['From', 'To', 'Q15149', 'PLEC_HUMAN'],
#  ['From', 'To', 'Q15418', 'KS6A1_HUMAN'],
#  ['From', 'To', 'Q15648', 'MED1_HUMAN'],
#  ['From', 'To', 'Q15653', 'IKBB_HUMAN'],
#  ['From', 'To', 'Q15788', 'NCOA1_HUMAN'],
#  ['From', 'To', 'Q15796', 'SMAD2_HUMAN'],
#  ['From', 'To', 'Q15831', 'STK11_HUMAN'],
#  ['From', 'To', 'Q16760', 'DGKD_HUMAN'],
#  ['From', 'To', 'Q16821', 'PPR3A_HUMAN'],
#  ['From', 'To', 'Q16828', 'DUS6_HUMAN'],
#  ['From', 'To', 'Q7KZI7', 'MARK2_HUMAN'],
#  ['From', 'To', 'Q86WB0', 'NIPA_HUMAN'],
#  ['From', 'To', 'Q86WV1', 'SKAP1_HUMAN'],
#  ['From', 'To', 'Q8TDC3', 'BRSK1_HUMAN'],
#  ['From', 'To', 'Q8WXE1', 'ATRIP_HUMAN'],
#  ['From', 'To', 'Q92731', 'ESR2_HUMAN'],
#  ['From', 'To', 'Q92918', 'M4K1_HUMAN'],
#  ['From', 'To', 'Q92934', 'BAD_HUMAN'],
#  ['From', 'To', 'Q92974', 'ARHG2_HUMAN'],
#  ['From', 'To', 'Q99683', 'M3K5_HUMAN'],
#  ['From', 'To', 'Q9BUB5', 'MKNK1_HUMAN'],
#  ['From', 'To', 'Q9BXS5', 'AP1M1_HUMAN'],
#  ['From', 'To', 'Q9BXW9', 'FACD2_HUMAN'],
#  ['From', 'To', 'Q9H0H5', 'RGAP1_HUMAN'],
#  ['From', 'To', 'Q9H1A4', 'APC1_HUMAN'],
#  ['From', 'To', 'Q9H1D0', 'TRPV6_HUMAN'],
#  ['From', 'To', 'Q9H8V3', 'ECT2_HUMAN'],
#  ['From', 'To', 'Q9HBA0', 'TRPV4_HUMAN'],
#  ['From', 'To', 'Q9NQ66', 'PLCB1_HUMAN'],
#  ['From', 'To', 'Q9NQS7', 'INCE_HUMAN'],
#  ['From', 'To', 'Q9NQU5', 'PAK6_HUMAN'],
#  ['From', 'To', 'Q9NRQ2', 'PLS4_HUMAN'],
#  ['From', 'To', 'Q9NRY4', 'RHG35_HUMAN'],
#  ['From', 'To', 'Q9NXH3', 'PP14D_HUMAN'],
#  ['From', 'To', 'Q9NYV6', 'RRN3_HUMAN'],
#  ['From', 'To', 'Q9UBS0', 'KS6B2_HUMAN'],
#  ['From', 'To', 'Q9UEW8', 'STK39_HUMAN'],
#  ['From', 'To', 'Q9UJX2', 'CDC23_HUMAN'],
#  ['From', 'To', 'Q9UJX3', 'APC7_HUMAN'],
#  ['From', 'To', 'Q9UJY1', 'HSPB8_HUMAN'],
#  ['From', 'To', 'Q9UM73', 'ALK_HUMAN'],
#  ['From', 'To', 'Q9UPZ9', 'ICK_HUMAN'],
#  ['From', 'To', 'Q9UQC2', 'GAB2_HUMAN'],
#  ['From', 'To', 'Q9UQQ2', 'SH2B3_HUMAN'],
#  ['From', 'To', 'Q9Y281', 'COF2_HUMAN'],
#  ['From', 'To', 'Q9Y2K2', 'SIK3_HUMAN'],
#  ['From', 'To', 'Q9Y618', 'NCOR2_HUMAN'],
#  ['From', 'To', 'Q9Y6I3', 'EPN1_HUMAN'],
#  ['From', 'To', 'Q9Y6K9', 'NEMO_HUMAN'],
#  ['From', 'To', 'Q9Y6W5', 'WASF2_HUMAN']]

In [None]:
# Only attempt to translate unambiguous accession IDs
# Each sub-list should have length 4, e.g. "['From', 'To', 'Q9Y6W5', 'WASF2_HUMAN']"

genedict = {}

for i in results4:
    if len( i ) == 4: # Lists of this length have only one, unambiguous translation option
        genedict[ str( i[2] ) ] = str( i[3] )
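
The length filter can be checked on a small hand-made sample. This is a minimal sketch, assuming "results4" is a list of lists of strings in the shape shown in the commented output. The second, longer sub-list is an invented example of an ambiguous result and is not taken from the real data; "sample_results" and "sample_genedict" are hypothetical names.

```python
# Hypothetical sample in the same shape as "results4"
sample_results = [
    [ 'From', 'To', 'Q9Y6W5', 'WASF2_HUMAN' ],
    # Invented ambiguous entry with two candidate translations
    [ 'From', 'To', 'X00000', 'AAA_HUMAN', 'X00000', 'BBB_HUMAN' ],
]

# Keep only sub-lists with exactly one accession / entry name pair
sample_genedict = { str( row[ 2 ] ): str( row[ 3 ] )
                    for row in sample_results if len( row ) == 4 }

print( sample_genedict )
```

Only the four-item sub-list ends up in the dictionary; the ambiguous one is left untranslated.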

In [None]:
# Create gene ID column for database

gene_ids = []

for n, i in enumerate( phosphosite_2_df.acc ): # Check substrate acc IDs
    if str( i ) in genedict.keys(): # If they are in the dictionary
        gene_ids.append( genedict.get( str(i) ) ) # Add the corresponding
        # gene ID to the column
    else: # Otherwise append nothing
        gene_ids.append( "" )

gene_ids = pd.Series( gene_ids )
phosphosite_2_df = phosphosite_2_df.assign( SUB_GENE = gene_ids )
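
The same column could probably be built without an explicit loop by mapping the accession IDs through the dictionary. The sketch below uses small stand-ins ("demo_df", "demo_genedict") rather than the real table, and "P00000" is an invented accession used to show the untranslated case.

```python
import pandas as pd

# Hypothetical stand-ins for "phosphosite_2_df" and "genedict"
demo_df = pd.DataFrame( { "acc": [ "Q9Y6W5", "P00000", "Q92934" ] } )
demo_genedict = { "Q9Y6W5": "WASF2_HUMAN", "Q92934": "BAD_HUMAN" }

# Look up each accession; accessions missing from the dictionary become ""
demo_df = demo_df.assign(
    SUB_GENE = demo_df.acc.astype( str ).map( demo_genedict ).fillna( "" )
)

print( demo_df )
```

"Series.map" returns a missing value for keys that are not in the dictionary, so "fillna" is used to reproduce the empty strings that the loop appends.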

Translate the kinase IDs to UniProt IDs, where possible

In [None]:
# Make a list of kinases from phosphosite_2_df
# Remove duplicates
# Convert them to uppercase

kinases = phosphosite_2_df.kinases

kinases = list( kinases.drop_duplicates() )

for n, i in enumerate( kinases ):
    kinases[ n ] = str( i ).upper()

In [None]:
# Make an empty dictionary for storing kinase aliases
# The keys will be kinase names from phosphosite_2_df
# The values will be the UniProt accession IDs from
# kinases_df

kinase_dict = {}

In [None]:
# Check five columns in kinases_df for each kinase ID in phosphosite_2_df
# If kinase is found, store the corresponding UniProt ID in the dictionary
# and remove it from the list "kinases"

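# Removing items from "kinases" while looping over it makes the loop skip the
# item that follows each removal, so the following five loops are executed
# three times to pick up the kinases that were skipped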

# Run 1/3 of five loops:

# Check whether kinase plus "_HUMAN" matches the "Entry name"

for a in kinases:
    for o, j in enumerate( kinases_df[ 'Entry_name' ] ):
        if str( a ) + "_HUMAN" == str( j ):
            kinase_dict[ a ] = kinases_df.UniProt_ID[ o ]
            kinases.remove( a )

# Check whether the kinase can be found in "Primary Protein Name"           
            
for b in kinases:
    if b != "ABL": # Matches with the word "probable"
        for p, k in enumerate( kinases_df[ 'Primary_Protein_Name' ] ):
                if str( b ) in str( k ).upper():
                    if b in kinases:
                        kinase_dict[ b ] = kinases_df.UniProt_ID[ p ]
                        kinases.remove( b )

# Check whether the kinase can be found in "Alternative Protein Name(s)"
                        
for c in kinases:
    if c != "LOK": # Matches with the word "Telokin"
        for q, l in enumerate( kinases_df[ 'Alternative_Protein_Name(s)' ] ):
            if str( c ) in str( l ).upper():
                if c in kinases:
                    kinase_dict[ c ] = kinases_df.UniProt_ID[ q ]
                    kinases.remove( c )

# Check whether the kinase exactly matches the "Gene Symbol"
                    
for d in kinases:
    for r, m in enumerate( kinases_df[ 'Gene_Symbol' ] ):
        if str( d ) == str( m ).upper():
            if d in kinases:
                kinase_dict[ d ] = kinases_df.UniProt_ID[ r ]
                kinases.remove( d )

# Check whether the kinase can be found in "Alternative Gene Name(s)"
                
for e in kinases:
    for s, z in enumerate( kinases_df[ 'Alternative_Gene_Name(s)' ] ):
        if str( e ) in str( z ).upper():
            if e in kinases:
                kinase_dict[ e ] = kinases_df.UniProt_ID[ s ]
                kinases.remove( e )
                
# Re-running the five loops resolves a further 21 kinase IDs. 
# For example, the previous loops failed to match "ABL2" + "_HUMAN"
# with "ABL2_HUMAN" 

# Run 2/3 of five loops

# Check whether kinase plus "_HUMAN" matches the "Entry name"

for a in kinases:
    for o, j in enumerate( kinases_df[ 'Entry_name' ] ):
        if str( a ) + "_HUMAN" == str( j ):
            kinase_dict[ a ] = kinases_df.UniProt_ID[ o ]
            kinases.remove( a )

# Check whether the kinase can be found in "Primary Protein Name"           
            
for b in kinases:
    if b != "ABL": # Matches with the word "probable"
        for p, k in enumerate( kinases_df[ 'Primary_Protein_Name' ] ):
                if str( b ) in str( k ).upper():
                    if b in kinases:
                        kinase_dict[ b ] = kinases_df.UniProt_ID[ p ]
                        kinases.remove( b )

# Check whether the kinase can be found in "Alternative Protein Name(s)"
                        
for c in kinases:
    if c != "LOK": # Matches with the word "Telokin"
        for q, l in enumerate( kinases_df[ 'Alternative_Protein_Name(s)' ] ):
            if str( c ) in str( l ).upper():
                if c in kinases:
                    kinase_dict[ c ] = kinases_df.UniProt_ID[ q ]
                    kinases.remove( c )

# Check whether the kinase exactly matches the "Gene Symbol"
                    
for d in kinases:
    for r, m in enumerate( kinases_df[ 'Gene_Symbol' ] ):
        if str( d ) == str( m ).upper():
            if d in kinases:
                kinase_dict[ d ] = kinases_df.UniProt_ID[ r ]
                kinases.remove( d )

# Check whether the kinase can be found in "Alternative Gene Name(s)"
                
for e in kinases:
    for s, z in enumerate( kinases_df[ 'Alternative_Gene_Name(s)' ] ):
        if str( e ) in str( z ).upper():
            if e in kinases:
                kinase_dict[ e ] = kinases_df.UniProt_ID[ s ]
                kinases.remove( e )
                
# The second set of loops fails to pair "PDHK4" or "MAP2K7" with a UniProt ID. 
# Re-running the loops will add these kinases and their corresponding UniProt
# IDs to the dictionary    

# Run 3/3 of five loops

# Check whether kinase plus "_HUMAN" matches the "Entry name"

for a in kinases:
    for o, j in enumerate( kinases_df[ 'Entry_name' ] ):
        if str( a ) + "_HUMAN" == str( j ):
            kinase_dict[ a ] = kinases_df.UniProt_ID[ o ]
            kinases.remove( a )

# Check whether the kinase can be found in "Primary Protein Name"           
            
for b in kinases:
    if b != "ABL": # Matches with the word "probable"
        for p, k in enumerate( kinases_df[ 'Primary_Protein_Name' ] ):
                if str( b ) in str( k ).upper():
                    if b in kinases:
                        kinase_dict[ b ] = kinases_df.UniProt_ID[ p ]
                        kinases.remove( b )

# Check whether the kinase can be found in "Alternative Protein Name(s)"
                        
for c in kinases:
    if c != "LOK": # Matches with the word "Telokin"
        for q, l in enumerate( kinases_df[ 'Alternative_Protein_Name(s)' ] ):
            if str( c ) in str( l ).upper():
                if c in kinases:
                    kinase_dict[ c ] = kinases_df.UniProt_ID[ q ]
                    kinases.remove( c )

# Check whether the kinase exactly matches the "Gene Symbol"
                    
for d in kinases:
    for r, m in enumerate( kinases_df[ 'Gene_Symbol' ] ):
        if str( d ) == str( m ).upper():
            if d in kinases:
                kinase_dict[ d ] = kinases_df.UniProt_ID[ r ]
                kinases.remove( d )

# Check whether the kinase can be found in "Alternative Gene Name(s)"
                
for e in kinases:
    for s, z in enumerate( kinases_df[ 'Alternative_Gene_Name(s)' ] ):
        if str( e ) in str( z ).upper():
            if e in kinases:
                kinase_dict[ e ] = kinases_df.UniProt_ID[ s ]
                kinases.remove( e )
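
The repeated runs are needed because each loop removes items from the list it is iterating over. The toy example below (not the real kinase list) shows the skipping behaviour and one common way around it, which is to iterate over a copy. Switching the loops above to this pattern is an untested suggestion: it may change which column a kinase is first matched on, so the resulting dictionary should be compared with the current one before adopting it.

```python
# Toy list standing in for "kinases"; every item is treated as a match
names = [ "A", "B", "C", "D" ]

# Removing inside the loop shifts the remaining items left, so the
# iterator steps over the item that follows each removal
skipping = list( names )
for name in skipping:
    skipping.remove( name )

# Iterating over a copy visits every item, so one pass is enough
complete = list( names )
for name in list( complete ):
    complete.remove( name )

print( skipping ) # "B" and "D" were never visited
print( complete ) # Empty: every item was visited and removed
```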

Several kinases are still not in the dictionary. Translate as many untranslated kinases as possible

In [None]:
# Make an empty list. We will split each untranslated kinase in two and 
# store the resulting list in this list, along with the original kinase
# name, which will be used in a dictionary later

kinases_2 = []

In [None]:
# Check for four non-alphanumeric characters in the untranslated kinases
# Split them by these characters if found, store in "kinases_2", and 
# remove from "kinases"
# Attach the original kinase name for use in a dictionary later

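# Removing items from "kinases" while looping over it makes the loop skip the
# item that follows each removal, so the following four loops are executed
# four times to pick up the kinases that were skipped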

# Run 1/4 of four loops:

for i in kinases:
    if " " in i:
        if i in kinases:
            items = i.split( " " )
            items.append(items[ 0 ] + " " + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( i )
        
for j in kinases:
    if "/" in j:
        if j in kinases:
            items = j.split( "/" )
            items.append(items[ 0 ] + "/" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( j )
    
for k in kinases:
    if "_" in k:
        if k in kinases:
            items = k.split( "_" )
            items.append(items[ 0 ] + "_" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( k )

for l in kinases:
    if "-" in l:
        if l in kinases:
            items = l.split( "-" )
            items.append(items[ 0 ] + "-" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( l )

# Several kinases with underscores still remain in the original list

# Run 2/4 of four loops:

for i in kinases:
    if " " in i:
        if i in kinases:
            items = i.split( " " )
            items.append(items[ 0 ] + " " + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( i )
        
for j in kinases:
    if "/" in j:
        if j in kinases:
            items = j.split( "/" )
            items.append(items[ 0 ] + "/" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( j )
    
for k in kinases:
    if "_" in k:
        if k in kinases:
            items = k.split( "_" )
            items.append(items[ 0 ] + "_" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( k )

for l in kinases:
    if "-" in l:
        if l in kinases:
            items = l.split( "-" )
            items.append(items[ 0 ] + "-" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( l )

# Nine splittable kinases remain in the original list

# Run 3/4 of four loops:

for i in kinases:
    if " " in i:
        if i in kinases:
            items = i.split( " " )
            items.append(items[ 0 ] + " " + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( i )
        
for j in kinases:
    if "/" in j:
        if j in kinases:
            items = j.split( "/" )
            items.append(items[ 0 ] + "/" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( j )
    
for k in kinases:
    if "_" in k:
        if k in kinases:
            items = k.split( "_" )
            items.append(items[ 0 ] + "_" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( k )

for l in kinases:
    if "-" in l:
        if l in kinases:
            items = l.split( "-" )
            items.append(items[ 0 ] + "-" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( l )
    
# Three splittable kinases remain in the original list

# Run 4/4 of four loops:

for i in kinases:
    if " " in i:
        if i in kinases:
            items = i.split( " " )
            items.append(items[ 0 ] + " " + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( i )
        
for j in kinases:
    if "/" in j:
        if j in kinases:
            items = j.split( "/" )
            items.append(items[ 0 ] + "/" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( j )
    
for k in kinases:
    if "_" in k:
        if k in kinases:
            items = k.split( "_" )
            items.append(items[ 0 ] + "_" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( k )

for l in kinases:
    if "-" in l:
        if l in kinases:
            items = l.split( "-" )
            items.append(items[ 0 ] + "-" + items[ 1 ])
            kinases_2.append( items )
            kinases.remove( l )
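
As a possible alternative to four separate loops, the four characters can be handled in one pass with a regular expression. This is only a sketch: "split_kinase_name" is a hypothetical helper, the names are invented, and unlike the loops above it always stores the original name as the last element (the loops store it at index 2, which assumes each name splits into exactly two parts).

```python
import re

def split_kinase_name( name ):
    """Hypothetical helper: split on space, "/", "_" or "-" and
    append the original name as the final element."""
    parts = re.split( r"[ /_-]", name )
    return parts + [ name ]

# Invented kinase-like names, used only for illustration
demo_names = [ "AAA BBB", "CCC/DDD", "EEE_FFF", "GGG-HHH", "III" ]

# Keep only names that contain at least one of the four characters
demo_split = [ split_kinase_name( n ) for n in demo_names
               if re.search( r"[ /_-]", n ) ]

print( demo_split )
```

Because the list comprehension builds a new list instead of removing items from the one being iterated, a single pass covers every name.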

Check the kinases table for the strings in kinases_2

In [None]:
# Check whether the two substrings from the kinase name
# can be found in the kinase data frame under "Primary Protein Name"
# Add to dictionary if so

# Removing items from "kinases_2" while looping over it makes the loop skip
# the item that follows each removal, so the loop is executed twice

for i in kinases_2:
    for n, j in enumerate( kinases_df[ 'Primary_Protein_Name' ] ):
        if i[ 0 ] in j.upper() and i[ 1 ] in j.upper():
            if i in kinases_2:
                kinase_dict[ str( i[ 2 ] )] = kinases_df.UniProt_ID[ n ]
                kinases_2.remove( i )

for i in kinases_2:
    for n, j in enumerate( kinases_df[ 'Primary_Protein_Name' ] ):
        if i[ 0 ] in j.upper() and i[ 1 ] in j.upper():
            if i in kinases_2:
                kinase_dict[ str( i[ 2 ] )] = kinases_df.UniProt_ID[ n ]
                kinases_2.remove( i )

In [None]:
# Check whether the two substrings from the kinase name
# can be found in the kinase data frame under "Alternative Protein Name(s)"
# Add to dictionary if so

# Ensure only unambiguous translations are added by counting
# the number of matches

# Removing items from "kinases_2" while looping over it makes the loop skip
# the item that follows each removal, so the loop is executed twice

for i in kinases_2:
    
    matches = 0
    match = []
    
    for n, j in enumerate( kinases_df[ 'Alternative_Protein_Name(s)' ] ):
        if i[ 0 ] in str( j ).upper() and i[ 1 ] in str( j ).upper():
            matches += 1
            match = kinases_df[ 'UniProt_ID' ][ n ]
    if matches == 1 and i in kinases_2:
        kinase_dict[ str( i[ 2 ] )] = match
        kinases_2.remove( i )     

for i in kinases_2:
    
    matches = 0
    match = []
    
    for n, j in enumerate( kinases_df[ 'Alternative_Protein_Name(s)' ] ):
        if i[ 0 ] in str( j ).upper() and i[ 1 ] in str( j ).upper():
            matches += 1
            match = kinases_df[ 'UniProt_ID' ][ n ]
    if matches == 1 and i in kinases_2:
        kinase_dict[ str( i[ 2 ] )] = match
        kinases_2.remove( i )          
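
The "exactly one match" rule can also be expressed with pandas string methods. The sketch below is an assumption-laden illustration: "demo_kinases_df" is an invented miniature of "kinases_df", "unique_match" is a hypothetical helper, and missing names are treated as empty strings (the loops above instead convert them to the text "NAN" via str()).

```python
import pandas as pd

# Hypothetical miniature of "kinases_df"; names and IDs are invented
demo_kinases_df = pd.DataFrame( {
    "UniProt_ID": [ "X00001", "X00002", "X00003" ],
    "Alternative_Protein_Name(s)": [ "Alpha kinase 1", "Alpha kinase 2", None ],
} )

def unique_match( first, second, table ):
    """Return the UniProt ID if exactly one row contains both
    substrings, otherwise None."""
    names = table[ "Alternative_Protein_Name(s)" ].fillna( "" ).astype( str ).str.upper()
    hits = ( names.str.contains( first, regex = False )
             & names.str.contains( second, regex = False ) )
    ids = table.loc[ hits, "UniProt_ID" ]
    return ids.iloc[ 0 ] if len( ids ) == 1 else None

print( unique_match( "ALPHA", "2", demo_kinases_df ) )      # One row matches
print( unique_match( "ALPHA", "KINASE", demo_kinases_df ) ) # Two rows match
```

Returning None for zero or several matches keeps ambiguous translations out of the dictionary, which is the intent of the match counter above.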

In [None]:
# Check whether the first substring of the kinase name exactly matches
# "Entry name" minus "_HUMAN" in the kinase data frame

for i in kinases_2:
    for n, j in enumerate( kinases_df[ 'Entry_name' ] ):
        if i[ 0 ] == j[ : -6 ].upper(): # Crop "_HUMAN" from entry name
            # This will be added later
            kinase_dict[ str( i[ 2 ] )] = kinases_df[ 'UniProt_ID' ][ n ]
            kinases_2.remove( i ) 

In the phosphosite table, convert the current "kinases" ID to uppercase, in order to allow translation

In [None]:
uppercase_kinase = []

for i in phosphosite_2_df.kinases:
    uppercase_kinase.append( str( i ).upper() )

uppercase_kinase = pd.Series( uppercase_kinase )

phosphosite_2_df = phosphosite_2_df.assign( kinases = uppercase_kinase )
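
A shorter form of the same conversion is sketched below on an invented three-row table ("demo_df"), assuming the "kinases" column holds strings and missing values. It also shows where the "NAN" key comes from: str() turns a missing value into "nan", which becomes "NAN" after uppercasing.

```python
import pandas as pd

# Invented rows; the last one stands in for a missing kinase annotation
demo_df = pd.DataFrame( { "kinases": [ "Abl", "PKC_alpha", float( "nan" ) ] } )

# Apply the same str( ... ).upper() conversion to every row in one call
demo_df = demo_df.assign(
    kinases = demo_df.kinases.map( lambda k: str( k ).upper() )
)

print( demo_df.kinases.tolist() )
```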

Remove "NAN" (the uppercase string form of a missing kinase value) from dictionary

In [None]:
kinase_dict.pop( "NAN", None ) # The default avoids a KeyError if "NAN" is absent

Add kinase UniProt ID column to data frame

In [None]:
uniprot_id = []

# If the kinase can be translated, add the UniProt ID to the column
# Otherwise add an empty string

for n, i in enumerate( phosphosite_2_df.kinases ):
    if i in kinase_dict.keys():
        kinase = kinase_dict.get( str( i ) )
        uniprot_id.append( kinase )
    else:
        uniprot_id.append( "" )

uniprot_id = pd.Series(uniprot_id)

phosphosite_2_df = phosphosite_2_df.assign( ACC_ID = uniprot_id )

Using the UniProt ID and the phosphosite amino acid residue information, add a phosphosite ID column to act as a Foreign Key for several different phosphosite-related tables in the database

In [None]:
phos_id = []

for n, i in phosphosite_2_df.iterrows():
    phos_id.append( phosphosite_2_df.acc[ n ].upper() + "(" + phosphosite_2_df.code[ n ].upper() + str( phosphosite_2_df.position[ n ]) + ")" )

phos_id = pd.Series( phos_id )

phosphosite_2_df = phosphosite_2_df.assign( PHOS_ID = phos_id )
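
The ID has the form accession, then the residue code and position in brackets. A minimal sketch of the same construction using column-wise string operations is given below; the rows in "demo_df" are invented (including the positions) and only the column names "acc", "code" and "position" are taken from the table above.

```python
import pandas as pd

# Invented rows using the column names "acc", "code" and "position"
demo_df = pd.DataFrame( {
    "acc": [ "q9y6w5", "Q92934" ],
    "code": [ "s", "T" ],
    "position": [ 15, 201 ],
} )

# Build "ACC(CODEposition)" for every row at once
demo_df = demo_df.assign(
    PHOS_ID = demo_df.acc.str.upper() + "(" + demo_df.code.str.upper()
              + demo_df.position.astype( str ) + ")"
)

print( demo_df.PHOS_ID.tolist() )
```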

Remove unnecessary columns

In [None]:
# Remove columns "source" and "entry_date"

phosphosite_2_df = phosphosite_2_df.drop( [ 'source', 'entry_date' ],
                                         axis = 1 )

phosphosite_2_df = phosphosite_2_df.reset_index( drop = True )

Write to CSV

In [None]:
phosphosite_2_df.to_csv( "phosphosites_2.csv", index = False)