# Introduction

### Get PhosphoPICK Result Files 
1.  Download the current version of the reviewed human proteome sequences from [UniProt.org](https://www.uniprot.org):
    - use CreateHumanProteomeDF.ipynb
    - Or, go [HERE](https://www.uniprot.org/uniprot/?query=*&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+reviewed%3Ayes), click on 'Download'-> FASTA(canonical)
2.  Split the downloaded FASTA file into smaller files with 1000 sequences each; the last file has 367 sequences (code in CreateHumanProteomeDF.ipynb).
3.  Submit the sequence files to [PhosphoPICK](http://bioinf.scmb.uq.edu.au/phosphopick/submit) with the following settings:
    - **selection type:** Multiple Kinases
    - **species:** human
    - **family:** all 
    - **kinases:** (use the "shift" key to select all Kinases in the list)
    - **calculate p-value:** checked
    - **p-value significance threshold:** no threshold
    - **email address:**(enter the email you want to receive the result)
4.  Download the results from the links in the notification emails, and save them in the **'../Data/Raw/PhosphoPICK/'** directory
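
Step 2 above can be sketched in plain Python as follows. This is only an illustration, not the code in CreateHumanProteomeDF.ipynb: the function name `split_fasta` and the `.fasta` output naming are assumptions, chosen to mirror the `1-1000` style of the result file names seen later in this notebook.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, chunk_size=1000):
    """Split a FASTA file into files holding at most `chunk_size` records each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collect the records; every record starts at a header line beginning with '>'.
    records, current = [], []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith('>') and current:
                records.append(''.join(current))
                current = []
            if line.strip():  # skip blank lines
                current.append(line)
    if current:
        records.append(''.join(current))

    written = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        # Assumed naming scheme: '<first>-<first + chunk_size - 1>.fasta'
        out_path = out_dir / f"{start + 1}-{start + chunk_size}.fasta"
        out_path.write_text(''.join(chunk))
        written.append(out_path)
    return written
```

Calling `split_fasta(HP_fasta, HP_dir)` with the paths defined in the Initializing section would write one file per 1000 sequences; the final file simply holds whatever records remain.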

### About the data
- Version: Nov., 2019
- PhosphoPICK takes about 5 days to complete the prediction and sends one email per submitted file with a link for downloading the results. 
- The raw predictions are available upon request
- PhosphoPICK result files contain the following columns:
    - **identifier:** identifiers from the fasta files submitted for PhosphoPICK prediction	
    - **Uniprot-Acc:** uniprotID of the protein that has been found by the BLAST search (please note if multiple proteins return as the most significant match PhosphoPICK will return all matching Uniprot IDs).
    - **blastp-identity:** blastp-identity of the predicted substrate to the submitted seq
    - **kinase:** the kinase being scored for this substrate/site.
    - **context-score:** probability according to the PhosphoPICK Bayesian network model that your chosen kinase is phosphorylating the protein.
    - **context-p-value**	
    - **site:** the location of a potential phosphorylation site on the protein.
    - **site-score:** probability according to the naive Bayes model that this site is being phosphorylated by the kinase.	
    - **site-p-value**	
    - **combined-score:** Represents the combined probability according to the context model and the sequence model that this site is phosphorylated by the kinase. Calculated as the average of the context score and the site score.
    - **combined-p-value**
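
As a small worked example of the combined-score description above (the average of the context score and the site score), the arithmetic can be written out as below. The function name is hypothetical and the input values are made up for illustration; they are not taken from any result file.

```python
def combined_score(context_score, site_score):
    """Average of the context score and the site score, as described above."""
    return (context_score + site_score) / 2


# Illustrative values only
print(combined_score(0.75, 0.25))  # 0.5
```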


        
### Preprocessing PhosphoPICK Raw
To convert the raw results to the standard format, we apply the following steps:
1. Filtering data: 
    - remove isoforms
    - remove predicted substrates with blastp-identity != 100.0 
        - only keep predictions where the predicted substrate is the input protein submitted for PhosphoPICK prediction
2. Mapping accessions:
    - **Mapping substrate accessions:**  
        - extract the uniprotID from the 'identifier' column (this is the uniprotID of the protein submitted for PhosphoPICK prediction)
        - use the 'Uniprot-Acc' column (the uniprotID of the substrate predicted by PhosphoPICK) to retrieve the current uniprotID for the predicted substrate
        - remove records where the uniprotID of the submitted protein differs from the retrieved current uniprotID of the predicted substrate (this removes records with an outdated uniprotID, as well as predictions whose substrate has 100% blastp-identity to the input sequence but is not the input protein)
    - **Mapping kinase accessions:** 
        - get the uniprotID for the kinases using the 'kinase' column 
3. **Mapping sites:**
    - get the +/- 7 AA peptide seq from the reference human proteome
    - catch outdated records caused by: a deleted uniprotID, an updated uniprotID, or a changed sequence for the given uniprotID that leads to an out-of-range error
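
The site-mapping step can be illustrated with a minimal sketch of the +/- 7 AA window extraction. It assumes 1-based positions and `_` padding at the sequence ends, which matches peptides such as `IRVMLYPSRI_____` in the outputs further down; the function name `get_peptide` is hypothetical and is not the actual `checkSite` implementation.

```python
def get_peptide(sequence, position, flank=7, pad='_'):
    """Return the +/- `flank` residue window around a 1-based position.

    Returns None when the position lies outside the sequence, which is
    how a changed or outdated sequence record would show up.
    """
    if position < 1 or position > len(sequence):
        return None
    idx = position - 1
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    # Pad both sides so the peptide always has 2 * flank + 1 characters.
    return left.rjust(flank, pad) + sequence[idx] + right.ljust(flank, pad)


toy_sequence = 'ACDEFGHIKLMNPQRSTVWY'   # made-up 20-residue sequence
print(get_peptide(toy_sequence, 16))    # KLMNPQRSTVWY___
print(get_peptide(toy_sequence, 21))    # None (out of range)
```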
    

### Creating Resource Files
1.  **'globalKinaseMap.csv':** After step 3 in preprocessing, add unique kinases from PhosphoPICK to the global kinase resource. 
    

### Standard Formatted PhosphoPICK
**'FormattedPhosphoPICK.csv':** Standardize the preprocessed file with following columns:
- **substrate_id** - unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate** - gene name for the substrates
- **substrate_acc** - mapped UniprotIDs for the substrates
- **site** - phosphorylation  site
- **pep** - +/- 7 AA peptide sequence around the site
- **kinase** - Kinase name

# Initializing

In [1]:
# IMPORTS
import pandas as pd
import os
import re
import glob

import humanProteomesReference, phosphoPick_convert, getUniprotID, checkSite

# only needed when timing the code
import time

In [2]:
# DEFINE FILE NAMES/DIRs
##################
# Version (Date) #
##################
version = '2019-12-11'

##################
# File Location  #
##################
# local (../../)
base = '../../'

####################################################
# For Prepare Fasta Files to Submit in PhosphoPICK #
####################################################

# Human Proteome fasta file
HP_fasta = base + 'Data/Raw/HumanProteome/humanProteome_' + version + '.fasta'
# Dir for the split Human Proteome fasta files
HP_dir = base + 'Data/Raw/HumanProteome/'


# human proteome reference file 
HP_csv = base + 'Data/Map/humanProteome_' + version + '.csv'

####################################################
# For Preprocessing PhosphoPICK Prediction Results #
# (for the Nov 2019 results)                       #
#--------------------------------------------------#
# . The files submitted for PhosphoPICK predictor  #
#   is NOT the up-to-date human proteome sequences #
# . There has been an update in human proteomes    #
#   from the time the prediction results were got  #
#   to running the preprocessing steps.            #
####################################################

# PhosphoPICK results dir
PP_dir = base + 'Data/Raw/PhosphoPICK/'
PP_update_dir = base + 'Data/Raw/PhosphoPICK/updated/'

# PhosphoPICK temp dir
PP_temp_dir_acc = base + 'Data/Temp/PhosphoPICK/mappedAcc/'
PP_temp_dir_site = base + 'Data/Temp/PhosphoPICK/mappedSite/'
PP_temp_dir_acc_update = base + 'Data/Temp/PhosphoPICK/mappedAcc/updated/'
PP_temp_dir_site_update = base + 'Data/Temp/PhosphoPICK/mappedSite/updated/'

# Resource Files
HK_org = base + 'Data/Raw/HumanKinase/globalKinaseMap.txt'                  # original manually created kinase file
KinaseMap = base + 'Data/Map/globalKinaseMap.csv'                           # global kinase map; unique PhosphoPICK kinases are added to it

# Standard formatted output file
PP_formatted = base + 'Data/Formatted/PhosphoPICK/PhosphoPICK_formatted_' + version + '.csv'       # preprocessed file with columns: substrate_id/substrate/substrate_acc/kinase/site/pep/score


# Preprocessing PhosphoPICK Raw

### Mapping Accessions (UniprotID) and Site
1. Filtering data: 
    - remove isoforms
    - remove predicted substrates with blastp-identity != 100.0 
        - only keep predictions where the predicted substrate is the input protein submitted for PhosphoPICK prediction
2. Mapping accessions:
    - **Mapping substrate accessions:**  
        - extract the uniprotID from the 'identifier' column (this is the uniprotID of the protein submitted for PhosphoPICK prediction)
        - use the 'Uniprot-Acc' column (the uniprotID of the substrate predicted by PhosphoPICK) to retrieve the current uniprotID for the predicted substrate
        - remove records where the uniprotID of the submitted protein differs from the retrieved current uniprotID of the predicted substrate (this removes records with an outdated uniprotID, as well as predictions whose substrate has 100% blastp-identity to the input sequence but is not the input protein)
    - **Mapping kinase accessions:** 
        - get the uniprotID for the kinases using the 'kinase' column 
3. **Mapping sites:**
    - get the +/- 7 AA peptide seq from the reference human proteome
    - catch outdated records caused by: a deleted uniprotID, an updated uniprotID, or a changed sequence for the given uniprotID that leads to an out-of-range error
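
The filtering and substrate-accession checks listed above can be sketched for a single record as follows. Several things here are assumptions rather than facts about `phosphoPick_convert`: the 'identifier' value is assumed to follow the UniProt FASTA header layout (`sp|P12814|ACTN1_HUMAN`), isoforms are assumed to be recognizable by a `-<n>` suffix on the accession, and `current_acc` is a hypothetical dictionary standing in for the lookup of current uniprotIDs.

```python
import re

# Assumed identifier layout: 'sp|<accession>|<entry name>' (UniProt FASTA header style)
ACC_PATTERN = re.compile(r'^(?:sp|tr)\|([A-Z0-9]+(?:-\d+)?)\|')


def extract_accession(identifier):
    """Pull the accession out of the submitted identifier, or None if it does not match."""
    match = ACC_PATTERN.match(identifier)
    return match.group(1) if match else None


def keep_prediction(row, current_acc=None):
    """Return True when a record passes the isoform, identity and accession checks."""
    current_acc = current_acc or {}   # hypothetical map: old accession -> current accession
    submitted = extract_accession(row['identifier'])
    predicted = row['Uniprot-Acc']
    # Drop unparsable identifiers and isoform accessions (assumed '-<n>' suffix).
    if submitted is None or '-' in submitted or '-' in predicted:
        return False
    # Keep only exact sequence matches.
    if float(row['blastp-identity']) != 100.0:
        return False
    # The predicted substrate (after updating its accession) must be the submitted protein.
    return current_acc.get(predicted, predicted) == submitted
```

A record such as `{'identifier': 'sp|P12814|ACTN1_HUMAN', 'Uniprot-Acc': 'P12814', 'blastp-identity': 100.0}` passes, while the same record with a different 'Uniprot-Acc' or a lower identity is dropped.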

**Output Files Dataframe:**
- **Uniprot-Acc:** the uniprotID of the predicted substrate by PhosphoPICK
- **substrate_id:** unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate_acc:** substrate uniprotID
- **site:** aa + position in protein sequence
- **position:** position in protein sequence
- **pep:** +/- 7 AA
- **kinase:** kinase name used in PhosphoPICK
- **kinase_acc:** kinase uniprotID
- **combined-p-value:** PhosphoPICK score
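
To make the output columns concrete, the sketch below builds `substrate_id`, `site` and `position` for one prediction, following the `O43151_396` / `S396` pattern visible in the outputs further down. The helper names are hypothetical, and using `'outdated'` as the marker for out-of-range positions is an assumption based on the filter applied later in this notebook, not a description of the real `checkSite` module.

```python
def build_site_fields(substrate_acc, sequence, position):
    """Build substrate_id / site / position for a 1-based position in a sequence."""
    if not 1 <= position <= len(sequence):
        # Position no longer exists in the reference sequence.
        return {'substrate_id': 'outdated', 'site': None, 'position': position}
    residue = sequence[position - 1]
    return {
        'substrate_id': f'{substrate_acc}_{position}',
        'site': f'{residue}{position}',
        'position': position,
    }


def is_mapped(fields):
    """A record counts as mapped when the site is S, T or Y and it is not flagged outdated."""
    site = fields['site']
    return site is not None and site[0] in 'STY' and fields['substrate_id'] != 'outdated'


toy_sequence = 'ACDEFGHIKLMNPQRSTVWY'   # made-up sequence; 'P00000' is a placeholder accession
print(build_site_fields('P00000', toy_sequence, 16))
# {'substrate_id': 'P00000_16', 'site': 'S16', 'position': 16}
```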

**Mapping Accessions**

In [17]:
convert_type = 'acc'
phosphoPick_convert.pick_convert_directory(PP_dir, 'na', PP_temp_dir_acc, convert_type)

reading  13001-14000.txt
getting unique sub
getting sub_acc
merge
done 458.9465310573578
getting unique kin
getting kin_acc
RPSK6A5 no hit in human
MAP3KB no hit in human
merge
done 45.92072534561157
saving
Done 25.648595094680786
reading  3001-4000.txt
getting unique sub
getting sub_acc
merge
done 442.121808052063
getting unique kin
getting kin_acc
RPSK6A5 no hit in human
MAP3KB no hit in human
merge
done 47.999385833740234
saving
Done 22.979498863220215
reading  1-1000.txt
getting unique sub
getting sub_acc
Q5JT78 no hit in human
merge
done 399.5749170780182
getting unique kin
getting kin_acc
RPSK6A5 no hit in human
MAP3KB no hit in human
merge
done 44.98341202735901
saving
Done 21.85321807861328
reading  15001-16000.txt
getting unique sub
getting sub_acc
P0CZ20 no hit in human
B3EWG4 no hit in human
B3EWG4 no hit in human
merge
done 388.16595697402954
getting unique kin
getting kin_acc
RPSK6A5 no hit in human
MAP3KB no hit in human
merge
done 41.36503720283508
saving
Done 21.8129420

**Mapping Site**

In [19]:
convert_type = 'site'
phosphoPick_convert.pick_convert_directory(PP_temp_dir_acc, HP_csv, PP_temp_dir_site, convert_type)

Set input file dir...
done
read the Human Proteome df...
done
processing  13001-14000 .txt
Saving  13001-14000
13001-14000 155.31652522087097
processing  3001-4000 .txt
Saving  3001-4000
3001-4000 135.85251212120056
processing  1-1000 .txt
Saving  1-1000
1-1000 128.94466519355774
processing  15001-16000 .txt
Saving  15001-16000
15001-16000 135.2762269973755
processing  10001-11000 .txt
Saving  10001-11000
10001-11000 133.8019778728485
processing  16001-17000 .txt
Saving  16001-17000
16001-17000 137.52903985977173
processing  8001-9000 .txt
Saving  8001-9000
8001-9000 136.03232216835022
processing  2001-3000 .txt
Saving  2001-3000
2001-3000 131.88873171806335
processing  5001-6000 .txt
Saving  5001-6000
5001-6000 149.10758900642395
processing  4001-5000 .txt
Saving  4001-5000
4001-5000 152.20608711242676
processing  6001-7000 .txt
Saving  6001-7000
6001-7000 149.08443593978882
processing  1001-2000 .txt
Saving  1001-2000
1001-2000 149.21809601783752
processing  7001-8000 .txt
Saving  70

**Remove unmapped/outdated results**
- Check if there is any unmapped substrate/site due to outdated uniprot sequence records
    - get a list of outdated uniprotID

In [21]:
all_results = glob.glob(PP_temp_dir_site + '*.csv')
    
updatedSub_li = []
for filename in all_results:
    # read only the columns needed to detect unmapped/outdated records
    df_unmapped = pd.read_csv(filename, usecols = ['substrate_id','substrate_acc', 'site'])
    df_unmapped = df_unmapped[~(df_unmapped['site'].str.contains('S|T|Y', na=False)) | (df_unmapped['substrate_id'] == 'outdated')]
    df_unmapped = df_unmapped.substrate_acc.drop_duplicates()
    unmapped_li = df_unmapped.values.tolist()
    updatedSub_li.extend(unmapped_li)

updatedSub_li = list(dict.fromkeys(updatedSub_li))
updatedSub_li 

['Q6QHF9',
 'Q9H4K1',
 'O43151',
 'Q9C0I4',
 'A8MYV0',
 'Q9HBT7',
 'Q06124',
 'P08913',
 'Q8IXR5']

- Remove any record with unmapped substrate_acc/site in '*_mappedSite.csv'

In [25]:
all_results = glob.glob(PP_temp_dir_site + '*.csv')
for filename in all_results:
    start = time.time()
    df_mapSite = pd.read_csv(filename)
    # remove the outdated records from df_mapSite
    df_update = df_mapSite[~df_mapSite['substrate_acc'].isin(updatedSub_li)]
    df_update.to_csv(filename, chunksize=100000, index=False)
    end = time.time()
    print (f"chunk time\t{(end-start):.3f}")

chunk time	35.869
chunk time	37.469
chunk time	36.316
chunk time	35.933
chunk time	34.634
chunk time	34.238
chunk time	36.334
chunk time	34.589
chunk time	36.695
chunk time	37.314
chunk time	40.518
chunk time	35.472
chunk time	37.720
chunk time	33.727
chunk time	34.665
chunk time	33.893
chunk time	33.305
chunk time	44.794
chunk time	32.537
chunk time	34.160


### Update PhosphoPICK results
- The next 6 cells are only needed when the PhosphoPICK results come from outdated human proteome sequences (i.e. the input sequences for PhosphoPICK prediction are from an earlier version than the reference human proteome sequences)

1. Download the FASTA sequences for the UniprotIDs above (updatedSub_li) from Uniprot.org and save them as '../Data/Raw/HumanProteome/phosphoPICK_updateSub.fasta'. Submit 'phosphoPICK_updateSub.fasta' to PhosphoPICK again.
2. Run the mapAcc and mapSite functions on the result file for 'phosphoPICK_updateSub.fasta'

In [7]:
pp_update_hp = base + 'Data/Raw/HumanProteome/phosphoPICK_updateSub.fasta'
update_hp_csv = base + 'Data/Map/phosphoPICK_updateSub.csv'

In [22]:
convert_type = 'acc'
phosphoPick_convert.pick_convert_directory(PP_update_dir, 'na', PP_temp_dir_acc_update, convert_type)

reading  PICK_updated-9.txt
getting unique sub
getting sub_acc
merge
done 2.027851104736328
getting unique kin
getting kin_acc
RPSK6A5 no hit in human
MAP3KB no hit in human
merge
done 34.19140601158142
saving
Done 0.16403412818908691


In [12]:
humanProteomesReference.fastaToCSV(pp_update_hp, update_hp_csv)

convert_type = 'site'
phosphoPick_convert.pick_convert_directory(PP_temp_dir_acc_update, update_hp_csv, PP_temp_dir_site_update, convert_type)

Formatting  PICK_updated-9 ...
Reading input file...
Get unique substrate sites...
Done. Time	21.207


3. save 'PICK_updated-9_mappedSite.csv' under the same dir as other *_mappedSite.csv files
    - remove any unmapped sites, if any

In [14]:
df_mapUpdateSite = pd.read_csv(PP_temp_dir_site_update + 'PICK_updated-9_mappedSite.csv')  
df_mapUpdateSite = df_mapUpdateSite[(df_mapUpdateSite['site'].str.contains('S|T|Y', na=False)) & (df_mapUpdateSite['substrate_id'] != 'outdated')]
# index=False keeps an extra 'Unnamed: 0' column out of the saved file
df_mapUpdateSite.to_csv(PP_temp_dir_site + 'PICK_updated-9_mappedSite.csv', index=False)

df_mapUpdateSite

Unnamed: 0,Uniprot-Acc,blastp-identity,kinase,position,combined-p-value,substrate_acc,kinase_acc,site,pep,substrate_id
0,O43151,100.0,CDK1,396,0.001721,O43151,P06493,S396,SPPAPFRSPQSYLRA,O43151_396
1,O43151,100.0,CDK1,1549,0.003608,O43151,P06493,S1549,FNSALKGSPGFQDKL,O43151_1549
2,O43151,100.0,CDK1,753,0.006664,O43151,P06493,S753,ESPFATRSPKQIKIE,O43151_753
3,O43151,100.0,CDK1,1000,0.007765,O43151,P06493,T1000,CKYARSKTPRKFRLA,O43151_1000
4,O43151,100.0,CDK1,428,0.008986,O43151,P06493,T428,SSAFPPATPRTEFPE,O43151_428
...,...,...,...,...,...,...,...,...,...,...
42365,P08913,100.0,PRKDC,242,0.798014,P08913,P78527,T242,YQIAKRRTRVPPSRR,P08913_242
42366,P08913,100.0,PRKDC,247,0.808342,P08913,P78527,S247,RRTRVPPSRRGPDAV,P08913_247
42367,P08913,100.0,PRKDC,135,0.817593,P08913,P78527,S135,DVLFCTSSIVHLCAI,P08913_135
42368,P08913,100.0,PRKDC,143,0.826284,P08913,P78527,S143,IVHLCAISLDRYWSI,P08913_143


### Get the Gene Name of the Substrates from the Reference Human Proteome
- get the Gene Name used across all predictors for each substrate and add it to the result files

In [30]:
# get the unique substrates (UniprotID) with the common gene name used across all resources
df_unique_sub =  pd.read_csv(HP_csv, usecols = ['UniprotID','Gene Name'], sep = '\t')
df_unique_sub

Unnamed: 0,Gene Name,UniprotID
0,ACTN1,P12814
1,STAT3,P40763
2,ADD1,P35611
3,ADD2,P35612
4,ADRA2A,P08913
...,...,...
20358,HSFX4,A0A1B0GTS1
20359,TRBJ2-6,A0A0A0MT70
20360,TMEM225B,P0DP42
20361,SMIM29,Q86T20


In [31]:
start = time.time()
all_results = glob.glob(PP_temp_dir_site + '*.csv')

for filename in all_results:
    df = pd.read_csv(filename)
    # merge df_subsMap with df_unique_sub to add the common substrate gene name to the df
    df = df.merge(df_unique_sub, left_on=['substrate_acc'], right_on=['UniprotID'], how = 'left')

    df = df.drop(columns = ['UniprotID'])
    df.to_csv(filename, index=False)
df

Unnamed: 0,Uniprot-Acc,blastp-identity,kinase,position,combined-p-value,substrate_acc,kinase_acc,site,pep,substrate_id,Gene Name
0,P18077,100.0,CDK1,59,0.017008,P18077,P06493,T59,KAKNNTVTPGGKPNK,P18077_59,RPL35A
1,P18077,100.0,CDK1,75,0.303455,P18077,P06493,T75,RVIWGKVTRAHGNSG,P18077_75,RPL35A
2,P18077,100.0,CDK1,67,0.313442,P18077,P06493,T67,PGGKPNKTRVIWGKV,P18077_67,RPL35A
3,P18077,100.0,CDK1,57,0.591686,P18077,P06493,T57,VYKAKNNTVTPGGKP,P18077_57,RPL35A
4,P18077,100.0,CDK1,108,0.621989,P18077,P06493,S108,IRVMLYPSRI_____,P18077_108,RPL35A
...,...,...,...,...,...,...,...,...,...,...,...
6625186,Q9H2W6,100.0,PRKDC,214,0.819006,Q9H2W6,P78527,T214,NAPCGHYTFKFPQAM,Q9H2W6_214,MRPL46
6625187,Q9H2W6,100.0,PRKDC,196,0.876844,Q9H2W6,P78527,S196,ERTLATLSENNMEAK,Q9H2W6_196,MRPL46
6625188,Q9H2W6,100.0,PRKDC,147,0.899907,Q9H2W6,P78527,T147,ADEKNDRTSLNRKLD,Q9H2W6_147,MRPL46
6625189,Q9H2W6,100.0,PRKDC,148,0.918730,Q9H2W6,P78527,S148,DEKNDRTSLNRKLDR,Q9H2W6_148,MRPL46


### Creating Resource Files
**globalKinaseMap**
- create a new global kinase resource file, or add the unique kinases from PhosphoPICK to the existing one.
- get the Kinase Name used across all predictors for each kinase and add it to the result files

In [32]:
# get the kinases that failed to retrieve uniprotID
all_results = glob.glob(PP_temp_dir_site + '*.csv')

frames = []

for filename in all_results:
    df = pd.read_csv(filename, usecols = ['kinase_acc', 'kinase'])
    df = df.drop_duplicates()
    frames.append(df)
    
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_unique_kin = pd.concat(frames, ignore_index=True)
df_unique_kin = df_unique_kin.drop_duplicates()
print (df_unique_kin[df_unique_kin['kinase_acc'] == '(no hit in human)'])
df_unique_kin

     kinase         kinase_acc
67  RPSK6A5  (no hit in human)
94   MAP3KB  (no hit in human)


Unnamed: 0,kinase,kinase_acc
0,CDK1,P06493
1,CDK2,P24941
2,CDK3,Q00526
3,CDK4,P11802
4,CDK5,Q00535
...,...,...
102,VRK1,Q99986
103,ATM,Q13315
104,ATR,Q13535
105,MTOR,P42345


Manually check the kinases for which the UniprotID (column 'kinase_acc') could not be retrieved programmatically:
- there are 2 such kinases (Jan. 2020):
    - RPSK6A5: may refer to RPS6KA5?
    - MAP3KB : no record found in human 
- Create a dictionary for the ones whose kinase_acc must be entered manually
    - key = 'kinase' (the kinase name provided by PhosphoPICK)
    - value = 'kinase_acc' (the mapped uniprotID)

|kinase|Mapped ID|
|---|---|
|RPSK6A5|O75582|
|MAP3KB| |

In [None]:
id_dict = {'RPSK6A5':'O75582'}

- add the uniprotID of the above protein kinases in the '*_mappedSite.csv' files
- get a list of unique kinases in PhosphoPICK

In [33]:
all_results = glob.glob(PP_temp_dir_site + '*.csv')

frames = []

for filename in all_results:
    df = pd.read_csv(filename)
    for key in id_dict:
        df.loc[df.kinase == key, ["kinase_acc"]] = id_dict[key]
    # remove MAP3KB : didn't find any record in human
    df = df[df['kinase_acc'] != '(no hit in human)']
    df.to_csv(filename, index = False)
    df = df[['kinase_acc', 'kinase']]
    df = df.drop_duplicates()
    frames.append(df)
    
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_unique_kin = pd.concat(frames, ignore_index=True)
df_unique_kin = df_unique_kin.drop_duplicates()

df_unique_kin

Unnamed: 0,kinase_acc,kinase
0,P06493,CDK1
1,P24941,CDK2
2,Q00526,CDK3
3,P11802,CDK4
4,Q00535,CDK5
...,...,...
101,Q99986,VRK1
102,Q13315,ATM
103,Q13535,ATR
104,P42345,MTOR


- create a new global kinase resource file, or add the unique kinases from PhosphoPICK to the existing one

In [35]:
start = time.time()
unmapped_rows = []

# load globalKinaseMap.csv if it exists
if os.path.isfile(KinaseMap): 
    df_humanKinase = pd.read_csv(KinaseMap)
# otherwise create a new df from the original human kinase map
else:
    df_humanKinase = pd.read_csv(HK_org, usecols = ['Kinase Name', 'Preferred Name', 'UniprotID', 'Type', 'description'], sep = '\t')
    # strip the '[Source...]' annotation from the description
    df_humanKinase['description'] = df_humanKinase['description'].replace(to_replace=r'\[Source.+\]', value='', regex=True)

for index, row in df_unique_kin.iterrows():
    kinase = df_unique_kin.at[index, 'kinase_acc']
    # if the kinase is already in the globalKinaseMap.csv file
    if any(df_humanKinase.UniprotID == kinase):
        # get the index of the kinase in the globalKinaseMap.csv file 
        idx = df_humanKinase.index[df_humanKinase.UniprotID == kinase].values[0] 

        df_humanKinase.at[idx, 'PhosphoPICK_kinase_name'] = df_unique_kin.at[index, 'kinase']
        
    # if the kinase is not in the globalKinaseMap.csv file, collect it so its annotation can be checked manually
    else:
        unmapped_rows.append(row)
        
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
unmapped_list = pd.DataFrame(unmapped_rows).reset_index(drop=True)
print (unmapped_list)
df_humanKinase.to_csv(KinaseMap, index = False)
df_humanKinase 

Empty DataFrame
Columns: []
Index: []


Unnamed: 0,Kinase Name,UniprotID,description,HPRD_kinase_name,HPRD_kinase_uniprot_id,HPRD_kinase_refseq_id,PhosphoSite_kinase_uniprot_id,PhosphoSite_kinase_name,PhosphoSite_kinase_gene_name,GPS5_kinase_name,NetworKIN_kinase_name,PhosphoPICK_kinase_name
0,SGK1,O00141,serum/glucocorticoid regulated kinase 1,SGK1,O00141,NP_001137148.1,O00141,SGK1,SGK1,SGK1,SGK1,SGK1
1,BMPR1B,O00238,bone morphogenetic protein receptor type 1B,BMPR1B,O00238,NP_001194.1,O00238,BMPR1B,BMPR1B,BMPR1B,,
2,CDC7,O00311,cell division cycle 7,,,,O00311,CDC7,CDC7,CDC7,,
3,PLK4,O00444,polo like kinase 4,,,,O00444,PLK4,PLK4,PLK4,,
4,STK25,O00506,serine/threonine kinase 25,STK25,O00506,NP_006365.2,O00506,YSK1,STK25,STK25,STK25,
...,...,...,...,...,...,...,...,...,...,...,...,...
484,TP53RK,Q96S44,EKC/KEOPS complex subunit TP53RK,,,,Q96S44,PRPK,TP53RK,TP53RK,,
485,TRPM6,Q9BX84,Transient receptor potential cation channel su...,,,,Q9BX84,ChaK2,TRPM6,TRPM6,,
486,BCR/ABL,A9UF07,BCR/ABL fusion protein isoform Y5,,,,A9UF07,BCR-ABL1,BCR/ABL,,,
487,BCKDK,O14874,[3-methyl-2-oxobutanoate dehydrogenase [lipoam...,,,,,,,BCKDK,BCKDK,


In [36]:
# get the kinases with the common kinase name used across all resources, together with their uniprotIDs
df_unique_kin = df_humanKinase[['Kinase Name','UniprotID']]
df_unique_kin

Unnamed: 0,Kinase Name,UniprotID
0,SGK1,O00141
1,BMPR1B,O00238
2,CDC7,O00311
3,PLK4,O00444
4,STK25,O00506
...,...,...
484,TP53RK,Q96S44
485,TRPM6,Q9BX84
486,BCR/ABL,A9UF07
487,BCKDK,O14874


- add the Kinase Name used across all predictors for each kinase to the result files

In [37]:
start = time.time()
all_results = glob.glob(PP_temp_dir_site + '*.csv')

for filename in all_results:
    df = pd.read_csv(filename)
    # merge with df_unique_kinase to add the common kinase name to the df
    df = df.merge(df_unique_kin, left_on='kinase_acc', right_on='UniprotID', how = 'left')
    # drop the duplicated uniprot id for kinases
    df = df.drop(columns = 'UniprotID')

    df.to_csv(filename,index=False)  

df

Unnamed: 0,Uniprot-Acc,blastp-identity,kinase,position,combined-p-value,substrate_acc,kinase_acc,site,pep,substrate_id,Gene Name,Kinase Name
0,P18077,100.0,CDK1,59,0.017008,P18077,P06493,T59,KAKNNTVTPGGKPNK,P18077_59,RPL35A,CDK1
1,P18077,100.0,CDK1,75,0.303455,P18077,P06493,T75,RVIWGKVTRAHGNSG,P18077_75,RPL35A,CDK1
2,P18077,100.0,CDK1,67,0.313442,P18077,P06493,T67,PGGKPNKTRVIWGKV,P18077_67,RPL35A,CDK1
3,P18077,100.0,CDK1,57,0.591686,P18077,P06493,T57,VYKAKNNTVTPGGKP,P18077_57,RPL35A,CDK1
4,P18077,100.0,CDK1,108,0.621989,P18077,P06493,S108,IRVMLYPSRI_____,P18077_108,RPL35A,CDK1
...,...,...,...,...,...,...,...,...,...,...,...,...
6553695,Q9H2W6,100.0,PRKDC,214,0.819006,Q9H2W6,P78527,T214,NAPCGHYTFKFPQAM,Q9H2W6_214,MRPL46,PRKDC
6553696,Q9H2W6,100.0,PRKDC,196,0.876844,Q9H2W6,P78527,S196,ERTLATLSENNMEAK,Q9H2W6_196,MRPL46,PRKDC
6553697,Q9H2W6,100.0,PRKDC,147,0.899907,Q9H2W6,P78527,T147,ADEKNDRTSLNRKLD,Q9H2W6_147,MRPL46,PRKDC
6553698,Q9H2W6,100.0,PRKDC,148,0.918730,Q9H2W6,P78527,S148,DEKNDRTSLNRKLDR,Q9H2W6_148,MRPL46,PRKDC


# Standard Formatted PhosphoPICK
### 'PICK_formatted.csv'
Standardize the preprocessed file with following columns:
- **substrate_id** - unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate** - gene name for the substrates
- **substrate_acc** - mapped UniprotIDs for the substrates
- **site** - phosphorylation  site
- **pep** - +/- 7 AA peptide sequence around the site
- **score** - prediction score
- **kinase** - Kinase name

In [None]:
all_results = glob.glob(PP_temp_dir_site + '*.csv')
PP = []
for filename in all_results:
    df = pd.read_csv(filename, usecols = ['substrate_id', 'Gene Name','Uniprot-Acc','substrate_acc','site','pep', 'combined-p-value','Kinase Name'])
    df = df.rename(columns={'Gene Name': 'substrate_name', 'combined-p-value': 'score'})
    PP.append(df)
df_final = pd.concat(PP)
df_final = df_final.drop_duplicates()
df_final = df_final.reset_index(drop=True)

df_final

- check for substrate/kinase prediction with multiple scores

In [None]:
duplicateRowsDF = df_final[df_final.duplicated(['substrate_id', 'Kinase Name'], keep = False)].sort_values(by=['substrate_id', 'Kinase Name'])
duplicateRowsDF

- get the index of the rows whose "Uniprot-Acc" (the substrate uniprotID predicted by PhosphoPICK) is a secondary UniprotID
- remove them from the data

In [None]:
sub_2nd_acc = duplicateRowsDF.index[duplicateRowsDF['Uniprot-Acc'] != duplicateRowsDF['substrate_acc']].tolist()
sub_2nd_acc

In [None]:
df_final = df_final.drop(df_final.index[sub_2nd_acc])
df_final = df_final.reset_index(drop=True)
df_final.to_csv(PP_formatted, chunksize = 1000000, index = False)