# Introduction

### Get NetworKIN Result Files (Ben)

### About the data (Ben)
- Version:
- NetworKIN result files contain the following columns:
    - #Name
    - Position
    - Tree
    - NetPhorest Group
    - Kinase/Phosphatase/Phospho-binding domain
    - NetworKIN score
    - NetPhorest probability
    - STRING score
    - Target STRING ID
    - Kinase/Phosphatase/Phospho-binding domain STRING ID
    - Target description
    - Kinase/Phosphatase/Phospho-binding domain description
    - Target Name
    - Kinase/Phosphatase/Phospho-binding domain Name
    - Peptide sequence window
    - Intermediate nodes
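
As a minimal sketch of how a result file with these columns might be loaded and restricted to kinase predictions, the snippet below reads a small in-memory table. The rows, accessions, and scores are made up for illustration, and only a subset of the columns is included.

```python
import io
import pandas as pd

# Toy stand-in for a NetworKIN result file (all values are invented for illustration).
toy_tsv = (
    "#Name\tPosition\tTree\tKinase/Phosphatase/Phospho-binding domain\tNetworKIN score\n"
    "P12345\t2\tKIN\tPKCalpha\t1.8\n"
    "P12345\t2\tSH2\tGRB2\t0.7\n"
    "P67890\t15\tKIN\tCDK1\t3.2\n"
)

cols = ['#Name', 'Position', 'Tree',
        'Kinase/Phosphatase/Phospho-binding domain', 'NetworKIN score']
df = pd.read_csv(io.StringIO(toy_tsv), sep='\t', usecols=cols)

# keep only the kinase predictions
kin_only = df[df['Tree'] == 'KIN'].reset_index(drop=True)
print(kin_only)
```

Passing `usecols` keeps memory use down when the real files are large, since the unused columns are never loaded.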
        
### Preprocessing NetworKIN Raw
1. Filtering data: keep only the kinase prediction results (`Tree == 'KIN'`)
2. Mapping accessions:
    - **Mapping substrate accessions:** get the UniProt ID from the sequence identifiers of the original FASTA files that were submitted to NetworKIN for prediction.
    - **Mapping kinase accessions:** get the UniProt ID for each kinase using the 'Kinase/Phosphatase/Phospho-binding domain description' column
        - correct kinase names that share the same description
3. **Mapping sites:**
    - format the peptide sequence (e.g. ----MsGSKSV --> ____MSGSKSV)
    - update 'site' if the original sequence submitted for prediction differs from the reference sequence in a way that shifts the position
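
The peptide formatting step can be sketched as a one-line helper. The function name `format_pep` is hypothetical and is not taken from the project's modules. It only reproduces the transformation shown in the example above: `-` padding becomes `_` and the lower-case site residue is upper-cased.

```python
def format_pep(window: str) -> str:
    """Replace '-' padding with '_' and upper-case all residues."""
    return window.replace('-', '_').upper()


print(format_pep('----MsGSKSV'))
```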

**Output Files Dataframe:**
- **substrate_id:** unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate_name:** substrate gene name used in NetworKIN
- **substrate_acc:** substrate uniprotID
- **site:** aa + position in protein sequence
- **Position:** position in protein sequence
- **pep:** +/- 5 AA
- **kinase_name:** kinase name used in NetworKIN
- **kinase_acc:** kinase uniprotID
- **score:** NetworKIN score
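
A possible way to derive `site` and `substrate_id` from the other columns is sketched below. It rests on assumptions: the `_` separator in `substrate_id` is a guess, the accessions are placeholders, and the peptide window is taken to be padded to 11 characters so that the site residue sits at index 5.

```python
import pandas as pd

rows = pd.DataFrame({
    'substrate_acc': ['P12345', 'P12345'],    # placeholder accessions
    'Position': [2, 6],
    'pep': ['____MSGSKSV', 'MSGSKSVAAAA'],    # padded +/- 5 AA windows
})

# site = central residue of the padded window + position
rows['site'] = rows['pep'].str[5] + rows['Position'].astype(str)

# substrate_id = substrate_acc + position (the '_' separator is an assumption)
rows['substrate_id'] = rows['substrate_acc'] + '_' + rows['Position'].astype(str)
print(rows[['substrate_id', 'site']])
```

Windows that are truncated rather than padded at a protein terminus would need the residue index computed differently.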

    

### Creating Resource Files
1.  **'globalKinaseMap.csv':** 
    - create a new global kinase resource file, or add the unique kinases from NetworKIN to the existing one.
    - get the kinase name that is used across all predictors and add it to the result files
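
The update of the global kinase map might look roughly like the sketch below. The column names `kinase_acc` and `Kinase Name`, the accessions, and the kinase names are all placeholders, because the actual layout of `globalKinaseMap.csv` is not shown here.

```python
import pandas as pd

# existing global map (hypothetical columns, placeholder values)
global_map = pd.DataFrame({'kinase_acc': ['ACC001', 'ACC002'],
                           'Kinase Name': ['KIN_A', 'KIN_B']})
# unique kinases seen in the NetworKIN results
networkin_kin = pd.DataFrame({'kinase_acc': ['ACC002', 'ACC003'],
                              'Kinase Name': ['KIN_B', 'KIN_C']})

# rows already in the global map come first, so keep='first' preserves them
updated = (pd.concat([global_map, networkin_kin], ignore_index=True)
             .drop_duplicates(subset='kinase_acc', keep='first')
             .reset_index(drop=True))

# attach the shared kinase name to the result rows via the accession
results = pd.DataFrame({'kinase_acc': ['ACC003', 'ACC001'], 'score': [2.5, 1.1]})
results = results.merge(updated, on='kinase_acc', how='left')
print(results)
```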

### Standard Formatted NetworKIN
**'NetworKIN_formatted.csv':** Standardize the preprocessed file with the following columns:
- **substrate_id** - unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate_name** - gene name for the substrates
- **substrate_acc** - mapped UniprotIDs for the substrates
- **site** - phosphorylation site
- **pep** - +/- 5 AA peptide sequence around the site
- **score**
- **Kinase Name** 


In [1]:
# IMPORTS
import pandas as pd
import os
import re
import glob

import humanProteomesReference, networKin_convert, getUniprotID, checkSite

# only needed when testing the code
import time

In [5]:
# DEFINE FILE NAMES/DIRs
##################
# Version (Date) #
##################
version = '2019-12-11'

##################
# File Location  #
##################
# local (../../)
base = '../../'

##################################################
# For Prepare Fasta Files to Submit in NetworKIN #
##################################################

# Human Proteome fasta file
HP_fasta = base + 'Data/Raw/HumanProteome/humanProteome_' + version + '.fasta'
# Dir for split Human Proteome fasta files
HP_dir = base + 'Data/Raw/HumanProteome/'

# human proteome reference file 
HP_csv = base + 'Data/Map/humanProteome_' + version + '.csv'

####################################################
# For Preprocessing NetworKIN Prediction Results   #
#--------------------------------------------------#
# . The files submitted to NetworKIN are NOT the   #
#   up-to-date human proteome sequences.           #
# . The human proteome was updated after the       #
#   prediction results were obtained and before    #
#   the preprocessing steps were run.              #
####################################################

# NetworKIN results dir
NW_dir = base + 'Data/Raw/NetworKIN/'
NW_update_dir = base + 'Data/Raw/NetworKIN/updated/'
# NetworKIN temp dir
NW_temp_dir_acc = base + 'Data/Temp/NetworKIN/mappedAcc/'
NW_temp_dir_site = base + 'Data/Temp/NetworKIN/mappedSite/'
NW_temp_dir_acc_update = base + 'Data/Temp/NetworKIN/mappedAcc/updated/'
NW_temp_dir_site_update = base + 'Data/Temp/NetworKIN/mappedSite/updated/'

# Resource Files
HK_org = base + 'Data/Raw/HumanKinase/globalKinaseMap.txt'                  # original manually created kinase file
KinaseMap = base + 'Data/Map/globalKinaseMap.csv'                           # add all unique kinases in NetworKIN to the global file

# Standard formatted output file
NW_formatted = base + 'Data/Formatted/NetworKIN/NetworKIN_formatted_' + version + '.csv'       # preprocessed file with columns: substrate_id/substrate/substrate_acc/kinase/site/pep/score


# Preprocessing NetworKIN Raw

In the NetworKIN raw files, most of the names in the 'Kinase/Phosphatase/Phospho-binding domain' column cannot be used to retrieve the UniProt ID for the kinase, but all the names in the 'Kinase/Phosphatase/Phospho-binding domain description' column can. However, some distinct 'Kinase/Phosphatase/Phospho-binding domain' names share the same 'Kinase/Phosphatase/Phospho-binding domain description'. We need to identify those and correct them if needed.

### Get unique kinases in NetworKIN

In [32]:
all_results = glob.glob(NW_dir + '*.tsv')
name_col = 'Kinase/Phosphatase/Phospho-binding domain'
desc_col = 'Kinase/Phosphatase/Phospho-binding domain description'
# collect the unique kinases found in each result file
unique_kin_frames = []
for filename in all_results:
    df = pd.read_csv(filename, usecols=['Tree', name_col, desc_col], sep='\t')
    # the only type of prediction we are interested in is 'KIN'
    df = df[df.Tree == 'KIN']
    unique_kin_frames.append(df[[name_col, desc_col]].drop_duplicates())

# DataFrame.append was removed in pandas 2.0, so concatenate once instead,
# then drop any duplicated kinases
df_unique_kin = pd.concat(unique_kin_frames).drop_duplicates()
# get the kinase(s) that share the same domain description
duplicateRowsDF = df_unique_kin[df_unique_kin.duplicated([desc_col], keep=False)]
duplicateRowsDF.sort_values([desc_col, name_col])

| | Kinase/Phosphatase/Phospho-binding domain | Kinase/Phosphatase/Phospho-binding domain description |
|---|---|---|
| 1459 | MST2 | STK3 |
| 1484 | MST4 | STK3 |


After manually checking the kinases above, MST4 should be associated with STK26, not STK3. Create a dictionary to correct this in the result files.

In [4]:
correct_kinase = {'MST4' : 'STK26'}
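
One way such a dictionary could be applied is sketched below on a two-row toy frame. How `networKin_convert` actually uses `correct_kinase` is not shown in this notebook, so treat this as an illustration of the idea rather than the module's behavior.

```python
import pandas as pd

name_col = 'Kinase/Phosphatase/Phospho-binding domain'
desc_col = 'Kinase/Phosphatase/Phospho-binding domain description'
correct_kinase = {'MST4': 'STK26'}

df = pd.DataFrame({name_col: ['MST2', 'MST4'],
                   desc_col: ['STK3', 'STK3']})

# where the kinase name has a correction, overwrite the description;
# names without an entry map to NaN and fall back to the original value
df[desc_col] = df[name_col].map(correct_kinase).fillna(df[desc_col])
print(df)
```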

### Mapping Accessions (UniprotID) and Site
1. Filtering data: keep only the kinase prediction results (`Tree == 'KIN'`)
2. Mapping accessions:
    - **Mapping substrate accessions:** get the UniProt ID from the sequence identifiers of the original FASTA files that were submitted to NetworKIN for prediction.
    - **Mapping kinase accessions:** get the UniProt ID for each kinase using the 'Kinase/Phosphatase/Phospho-binding domain description' column
        - correct kinase names that share the same description
3. **Mapping sites:**
    - format the peptide sequence (e.g. ----MsGSKSV --> ____MSGSKSV)
    - update 'site' if the original sequence submitted for prediction differs from the reference sequence in a way that shifts the position

**Output Files Dataframe:**
- **substrate_id:** unique IDs for the substrate phosphorylation site (substrate_acc + position)
- **substrate_name:** substrate gene name used in NetworKIN
- **substrate_acc:** substrate uniprotID
- **site:** aa + position in protein sequence
- **Position:** position in protein sequence
- **pep:** +/- 5 AA
- **kinase_name:** kinase name used in NetworKIN
- **kinase_acc:** kinase uniprotID
- **score:** NetworKIN score
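
The site update can be illustrated by searching for the peptide window in the reference sequence and recomputing the position from where it is found. The helper `remap_site` below is hypothetical and is not the project's `checkSite` implementation. Its status strings only mirror the `seqNotFound` and `idNotFound` tags that appear in the site-mapping log, and the sequences are invented.

```python
def remap_site(pep, position, ref_seq, flank=5):
    """Return (status, position) for a site against a reference sequence.

    Assumes pep is '_'-padded to 2*flank+1 characters, so the site residue
    sits at index `flank` of the window.
    """
    if ref_seq is None:                      # accession missing from the reference
        return 'idNotFound', position
    left_pad = len(pep) - len(pep.lstrip('_'))
    core = pep.strip('_')                    # window without padding
    start = ref_seq.find(core)               # first occurrence only
    if start == -1:                          # window absent from the reference
        return 'seqNotFound', position
    # 0-based window start + offset of the site in the window, as 1-based position
    return 'ok', start + (flank - left_pad) + 1


old_seq = 'MAAGSKSVTPELRQWK'      # invented sequence submitted for prediction
new_seq = 'MGGAAGSKSVTPELRQWK'    # invented reference with two residues inserted

print(remap_site('AAGSKSVTPEL', 7, old_seq))
print(remap_site('AAGSKSVTPEL', 7, new_seq))
print(remap_site('AAGSKSVTPEL', 7, None))
```

Because `str.find` returns only the first match, a window that occurs more than once in the reference would need an extra rule, for example choosing the match closest to the original position.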

**Mapping Accessions**

In [7]:
# convert substrate_acc and kinase_acc
convert_type = 'acc'
networKin_convert.kin_convert_directory(NW_dir, 'na', NW_temp_dir_acc, convert_type)

reading  19.tsv
getting unique sub
getting sub_acc
merge
done 2.626702308654785
getting unique kin
getting kin_acc
merge
done 240.26136994361877
saving
Done 30.140185832977295
reading  18.tsv
getting unique sub
getting sub_acc
merge
done 3.2055108547210693
getting unique kin
getting kin_acc
merge
done 251.04621005058289
saving
Done 38.56568384170532
reading  20.tsv
getting unique sub
getting sub_acc
merge
done 3.681636095046997
getting unique kin
getting kin_acc
merge
done 326.4341878890991
saving
Done 40.38838195800781
reading  21.tsv
getting unique sub
getting sub_acc
merge
done 1.4215641021728516
getting unique kin
getting kin_acc
merge
done 316.3874821662903
saving
Done 17.19068193435669
reading  8.tsv
getting unique sub
getting sub_acc
merge
done 2.9008588790893555
getting unique kin
getting kin_acc
merge
done 276.8457570075989
saving
Done 27.926394939422607
reading  9.tsv
getting unique sub
getting sub_acc
merge
done 2.908118724822998
getting unique kin
getting kin_acc
merge
done

getting unique sub
getting sub_acc
merge
done 2.7888870239257812
getting unique kin
getting kin_acc
merge
done 221.8265450000763
saving
Done 30.584384202957153
reading  7.tsv
getting unique sub
getting sub_acc
merge
done 2.604299783706665
getting unique kin
getting kin_acc
merge
done 260.0229811668396
saving
Done 30.57912302017212
reading  6.tsv
getting unique sub
getting sub_acc
merge
done 2.8214480876922607
getting unique kin
getting kin_acc
merge
done 269.59356689453125
saving
Done 41.610522985458374
reading  2.tsv
getting unique sub
getting sub_acc
merge
done 3.613402843475342
getting unique kin
getting kin_acc
merge
done 251.59176588058472
saving
Done 28.798696756362915
reading  3.tsv
getting unique sub
getting sub_acc
merge
done 2.4144513607025146
getting unique kin
getting kin_acc
merge
done 231.33128333091736
saving
Done 27.983081817626953
reading  1.tsv
getting unique sub
getting sub_acc
merge
done 3.1557531356811523
getting unique kin
getting kin_acc
merge
done 241.9041361808
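
The log above follows a "get unique, look up, merge" pattern for each file. A minimal sketch of that pattern is shown below, with a hypothetical `lookup_acc` function standing in for the slow accession lookup. The names and accessions are placeholders.

```python
import pandas as pd


def lookup_acc(name):
    """Hypothetical stand-in for a slow accession lookup (e.g. a web query)."""
    lookup_acc.calls += 1
    return {'KIN_A': 'ACC001', 'KIN_B': 'ACC002'}.get(name)


lookup_acc.calls = 0

df = pd.DataFrame({'kinase_name': ['KIN_A', 'KIN_B', 'KIN_A', 'KIN_A'],
                   'score': [1.2, 0.8, 2.4, 3.1]})

# 1) collect the unique names, 2) look each one up once, 3) merge back onto all rows
unique_kin = df[['kinase_name']].drop_duplicates().reset_index(drop=True)
unique_kin['kinase_acc'] = unique_kin['kinase_name'].map(lookup_acc)
df = df.merge(unique_kin, on='kinase_name', how='left')
print(df)
```

Looking up each unique name once keeps the number of slow calls proportional to the number of distinct kinases rather than the number of prediction rows.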

**Mapping Site**

In [8]:
# map the site to the updated (new) human proteome reference
convert_type = 'site'
networKin_convert.kin_convert_directory(NW_temp_dir_acc, HP_csv, NW_temp_dir_site, convert_type)

Set input file dir...
done
read the Human Proteome df...
done
processing  9 .tsv
Get unique substrate sites in  9 .tsv
P30988 :   corrected:  (seqNotFound)S4  site:  S4  seq len:  474 pep:  MQFSGEKIS
P30988 :   corrected:  (seqNotFound)S9  site:  S9  seq len:  474 pep:  SGEKISGQRDL
P30988 :   corrected:  (seqNotFound)S17  site:  S17  seq len:  474 pep:  RDLQKSKMRFT
P30988 :   corrected:  (seqNotFound)T22  site:  T22  seq len:  474 pep:  SKMRFTFTSRC
P30988 :   corrected:  (seqNotFound)T195  site:  T195  seq len:  474 pep:  FFRKLTTIFPL
P30988 :   corrected:  (seqNotFound)T196  site:  T196  seq len:  474 pep:  FRKLTTIFPLN
P30988 :   corrected:  (seqNotFound)Y204  site:  Y204  seq len:  474 pep:  PLNWKYRKALS
P30988 :   corrected:  (seqNotFound)S209  site:  S209  seq len:  474 pep:  YRKALSLGCQR
Q9BSG1 :   corrected:  (seqNotFound)S161  site:  S161  seq len:  425 pep:  LRRRRSALSRE
Saving  9 .tsv
9 .tsv Done  256.96863174438477
processing  8 .tsv
Get unique substrate sites in  8 .tsv
Q86VQ6 :

Saving  12 .tsv
12 .tsv Done  227.0921070575714
processing  10 .tsv
Get unique substrate sites in  10 .tsv
Q6P4E1 :   corrected:  (seqNotFound)T424  site:  T424  seq len:  436 pep:  GLHAITMKPTS
Q6P4E1 :   corrected:  (seqNotFound)T428  site:  T428  seq len:  436 pep:  ITMKPTSKFFG
Q6P4E1 :   corrected:  (seqNotFound)S429  site:  S429  seq len:  436 pep:  TMKPTSKFFG
P05534 :   corrected:  (idNotFound)S129  site:  S129 pep:  GCDVGSDGRFL
P05534 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P05534 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLVLLL
P05534 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYHQYAY
P05534 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGYHQYAYDGK
P05534 :   corrected:  (idNotFound)S14  site:  S14 pep:  LVLLLSGALAL
P05534 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQPTV
P05534 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P05534 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALKE


P30457 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30457 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLVLLL
P30457 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYQQDAY
P30457 :   corrected:  (idNotFound)S14  site:  S14 pep:  LVLLLSGALAL
P30457 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQPTI
P30457 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P30457 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30457 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTWAG
P30457 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTWAGSH
P30457 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P30457 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P30457 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFY
P30457 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P30457 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADMA
P30457 :   corrected

Q8N1F1 :   corrected:  (idNotFound)T64  site:  T64 pep:  FCGRATSARAW
Q8N1F1 :   corrected:  (idNotFound)S65  site:  S65 pep:  CGRATSARAWS
Q8N1F1 :   corrected:  (idNotFound)S98  site:  S98 pep:  AYPLQSAEDGV
Q8N1F1 :   corrected:  (idNotFound)Y36  site:  Y36 pep:  GAAARYWTAWQ
Q8N1F1 :   corrected:  (idNotFound)S5  site:  S5 pep:  MFPGSLSRGR
Q8N1F1 :   corrected:  (idNotFound)T38  site:  T38 pep:  AARYWTAWQGS
Q8N1F1 :   corrected:  (idNotFound)S7  site:  S7 pep:  FPGSLSRGRRA
Q8N1F1 :   corrected:  (idNotFound)T105  site:  T105 pep:  EDGVATRLQIR
Q8N1F1 :   corrected:  (idNotFound)S43  site:  S43 pep:  TAWQGSAGPNP
Q8N1F1 :   corrected:  (idNotFound)S78  site:  S78 pep:  RPGPGSPAHSG
Q8N1F1 :   corrected:  (idNotFound)S113  site:  S113 pep:  QIREESASCLA
Q8N1F1 :   corrected:  (idNotFound)S82  site:  S82 pep:  GSPAHSGGVQT
Q8N1F1 :   corrected:  (idNotFound)S115  site:  S115 pep:  REESASCLAAE
Q8N1F1 :   corrected:  (idNotFound)Y121  site:  Y121 pep:  CLAAEYWSQEP
Q8N1F1 :   corrected:  (idNotFo

Saving  10 .tsv
10 .tsv Done  240.43896985054016
processing  11 .tsv
Get unique substrate sites in  11 .tsv
P30475 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30475 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30475 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
P30475 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
P30475 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
P30475 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30475 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30475 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
P30475 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30475 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30475 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30475 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30475 :   corrected:  (idNotFound)S28  site:  S28 p

P10314 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P10314 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLLLLL
P10314 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYQQDAY
P10314 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  YQQDAYDGKDY
P10314 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P10314 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P10314 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTWAG
P10314 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTWAGSH
P10314 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P10314 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P10314 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFF
P10314 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P10314 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADMA
P10314 :   corrected:  (idNotFound)Y31  site:  Y31 pep:  SHSMRYFFTSV
P10314 :   corrected

P30512 :   corrected:  (idNotFound)S129  site:  S129 pep:  GCHVGSDGRFL
P30512 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30512 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLLLLL
P30512 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYRQDAY
P30512 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  YRQDAYDGKDY
P30512 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQPTI
P30512 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P30512 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30512 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTWAG
P30512 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTWAGSH
P30512 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P30512 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P30512 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFT
P30512 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P30512 :   correct

Q52LJ0 :   corrected:  (seqNotFound)S305  site:  S305  seq len:  433 pep:  VGVSFSTVENE
Q52LJ0 :   corrected:  (seqNotFound)S326  site:  S326  seq len:  433 pep:  ILVYFSFMSW
Q52LJ0 :   corrected:  (seqNotFound)S303  site:  S303  seq len:  433 pep:  NKVGVSFSTVE
Q52LJ0 :   corrected:  (seqNotFound)T306  site:  T306  seq len:  433 pep:  GVSFSTVENEL
Q52LJ0 :   corrected:  (seqNotFound)S314  site:  S314  seq len:  433 pep:  NELMISYLMFL
Q52LJ0 :   corrected:  (seqNotFound)Y315  site:  Y315  seq len:  433 pep:  ELMISYLMFLQ
Q52LJ0 :   corrected:  (seqNotFound)Y324  site:  Y324  seq len:  433 pep:  LQILVYFSFMS
Q52LJ0 :   corrected:  (seqNotFound)S329  site:  S329  seq len:  433 pep:  YFSFMSW
Saving  15 .tsv
15 .tsv Done  242.00071382522583
processing  14 .tsv
Get unique substrate sites in  14 .tsv
P59796 :   corrected:  (seqNotFound)Y72  site:  Y72  seq len:  221 pep:  VNVAAYGLAAQ
P07203 :   corrected:  (seqNotFound)T51  site:  T51  seq len:  203 pep:  ASLGTTVRDYT
P07203 :   corrected:  (seqNotF

Q06124 :   corrected:  (seqNotFound)S404  site:  S404  seq len:  593 pep:  RELKLSKVGQA
Q06124 :   corrected:  (seqNotFound)T415  site:  T415  seq len:  593 pep:  LLQGNTERTVW
Q9UJ41 :   corrected:  (seqNotFound)T5  site:  T5  seq len:  491 pep:  MVVVTGREPD
Q9UJ41 :   corrected:  (seqNotFound)S11  site:  S11  seq len:  491 pep:  GREPDSRRQDG
Q9UJ41 :   corrected:  (seqNotFound)S19  site:  S19  seq len:  491 pep:  QDGAMSSSDAE
Q9UJ41 :   corrected:  (seqNotFound)S20  site:  S20  seq len:  491 pep:  DGAMSSSDAED
Q9UJ41 :   corrected:  (seqNotFound)S21  site:  S21  seq len:  491 pep:  GAMSSSDAEDD
Q9UJ41 :   corrected:  (seqNotFound)T32  site:  T32  seq len:  491 pep:  FLEPATPTATQ
Q9UJ41 :   corrected:  (seqNotFound)T34  site:  T34  seq len:  491 pep:  EPATPTATQAG
Q9UJ41 :   corrected:  (seqNotFound)T36  site:  T36  seq len:  491 pep:  ATPTATQAGHA
Q9UJ41 :   corrected:  (seqNotFound)T61  site:  T61  seq len:  491 pep:  LRGPPTQGACS
Q9UJ41 :   corrected:  (seqNotFound)S66  site:  S66  seq len:  4

P10316 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P10316 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P10316 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFY
P10316 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P10316 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADMA
P10316 :   corrected:  (idNotFound)Y31  site:  Y31 pep:  SHSMRYFYTSV
P10316 :   corrected:  (idNotFound)Y33  site:  Y33 pep:  SMRYFYTSVSR
P10316 :   corrected:  (idNotFound)T34  site:  T34 pep:  MRYFYTSVSRP
P10316 :   corrected:  (idNotFound)S35  site:  S35 pep:  RYFYTSVSRPG
P10316 :   corrected:  (idNotFound)S37  site:  S37 pep:  FYTSVSRPGRG
P10316 :   corrected:  (idNotFound)T166  site:  T166 pep:  DMAAQTTKHKW
P10316 :   corrected:  (idNotFound)T167  site:  T167 pep:  MAAQTTKHKWE
P10316 :   corrected:  (idNotFound)S156  site:  S156 pep:  KEDLRSWTAAD
P10316 :   corrected:  (idNotFound)T240  site:  T240 pep:  AEITLTWQRDG
P10316 :   corrected

P10319 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P10319 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P10319 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P10319 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P10319 :   corrected:  (idNotFound)S140  site:  S140 pep:  RGHDQSAYDGK
P10319 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HDQSAYDGKDY
P10319 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTI
P10319 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P10319 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P10319 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P10319 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P10319 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P10319 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P10319 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P10319 :   corrected: 

Q31612 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
Q31612 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
Q31612 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
Q31612 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYNQFAY
Q31612 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
Q31612 :   corrected:  (idNotFound)S344  site:  S344 pep:  GGKGGSYSQAA
Q31612 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
Q31612 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
Q31612 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
Q31612 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
Q31612 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
Q31612 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFH
Q31612 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
Q31612 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADTA
Q31612 :   corrected

Q9TNN7 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
Q9TNN7 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLILLL
Q9TNN7 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYNQFAY
Q9TNN7 :   corrected:  (idNotFound)S14  site:  S14 pep:  LILLLSGALAL
Q9TNN7 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWGPSSQPTI
Q9TNN7 :   corrected:  (idNotFound)S344  site:  S344 pep:  GGKGGSCSQAA
Q9TNN7 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
Q9TNN7 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTETWAC
Q9TNN7 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWACSH
Q9TNN7 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
Q9TNN7 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWACSHSMRY
Q9TNN7 :   corrected:  (idNotFound)S28  site:  S28 pep:  WACSHSMRYFY
Q9TNN7 :   corrected:  (idNotFound)T282  site:  T282 pep:  EEQRYTCHVQH
Q9TNN7 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADKA
Q9TNN7 :   corrected

Q95IE3 :   corrected:  (idNotFound)T129  site:  T129 pep:  VHPKVTVYPSK
Q95IE3 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PKVTVYPSKTQ
Q95IE3 :   corrected:  (idNotFound)S133  site:  S133 pep:  VTVYPSKTQPL
Q95IE3 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPSKTQPLQH
Q95IE3 :   corrected:  (idNotFound)S10  site:  S10 pep:  RLPGGSCMAVL
Q95IE3 :   corrected:  (idNotFound)S173  site:  S173 pep:  KTGVVSTGLIH
Q95IE3 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMAVLTVTLMV
Q95IE3 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVSGFYPGSIE
Q95IE3 :   corrected:  (idNotFound)T18  site:  T18 pep:  AVLTVTLMVLS
Q95IE3 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVSGFY
Q95IE3 :   corrected:  (idNotFound)S149  site:  S149 pep:  LVCSVSGFYPG
Q95IE3 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
Q95IE3 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
Q95IE3 :   corrected:  (idNotFound)S155  site:  S155 pep:  GFYPGSIEVRW
Q95IE3 :   corre

P30466 :   corrected:  (idNotFound)T162  site:  T162 pep:  WTAADTAAQIT
P30466 :   corrected:  (idNotFound)S349  site:  S349 pep:  YSQAASSDSAQ
P30466 :   corrected:  (idNotFound)S336  site:  S336 pep:  MCRRKSSGGKG
P30466 :   corrected:  (idNotFound)S337  site:  S337 pep:  CRRKSSGGKGG
P30466 :   corrected:  (idNotFound)Y83  site:  Y83 pep:  QEGPEYWDRNT
P30466 :   corrected:  (idNotFound)T214  site:  T214 pep:  PKTHVTHHPIS
P30466 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30466 :   corrected:  (idNotFound)T88  site:  T88 pep:  YWDRNTQISKT
P30466 :   corrected:  (idNotFound)S345  site:  S345 pep:  KGGSYSQAASS
P30466 :   corrected:  (idNotFound)S91  site:  S91 pep:  RNTQISKTNTQ
P30466 :   corrected:  (idNotFound)T93  site:  T93 pep:  TQISKTNTQTY
P30466 :   corrected:  (idNotFound)S350  site:  S350 pep:  SQAASSDSAQG
P30466 :   corrected:  (idNotFound)T95  site:  T95 pep:  ISKTNTQTYRE
P30466 :   corrected:  (idNotFound)T224  site:  T224 pep:  SDHEATLRCWA
P30466 :   corre

P30460 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30460 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30460 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
P30460 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30460 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGHNQYAYDGK
P30460 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
P30460 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P30460 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30460 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
P30460 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30460 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30460 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30460 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFD
P30460 :   corrected:  (idNotFound)T282  site:  T282 pep:  EEQRYTCHVQH
P30460 :   corrected

Q04826 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
Q04826 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
Q04826 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
Q04826 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
Q04826 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
Q04826 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGHNQYAYDGK
Q04826 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HNQYAYDGKDY
Q04826 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
Q04826 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
Q04826 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
Q04826 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
Q04826 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
Q04826 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
Q04826 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFH
Q04826 :   corrected:  (

P13761 :   corrected:  (idNotFound)T129  site:  T129 pep:  VHPEVTVYPAK
P13761 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PEVTVYPAKTQ
P13761 :   corrected:  (idNotFound)T262  site:  T262 pep:  SGLQPTGFLS
P13761 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPAKTQPLQH
P13761 :   corrected:  (idNotFound)S10  site:  S10 pep:  KLPGGSCMAAL
P13761 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMAALTVTLMV
P13761 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVSGFYPGSIE
P13761 :   corrected:  (idNotFound)T18  site:  T18 pep:  AALTVTLMVLS
P13761 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVSGFY
P13761 :   corrected:  (idNotFound)S149  site:  S149 pep:  LVCSVSGFYPG
P13761 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
P13761 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
P13761 :   corrected:  (idNotFound)S155  site:  S155 pep:  GFYPGSIEVRW
P13761 :   corrected:  (idNotFound)T32  site:  T32 pep:  ALAGDTQPRFL
P13761 :   correcte

P30461 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30461 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30461 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
P30461 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
P30461 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30461 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HNQLAYDGKDY
P30461 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
P30461 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P30461 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30461 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P30461 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P30461 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30461 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30461 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30461 :   corrected: 

P01892 :   corrected:  (idNotFound)S129  site:  S129 pep:  GCDVGSDWRFL
P01892 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P01892 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLVLLL
P01892 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYHQYAY
P01892 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGYHQYAYDGK
P01892 :   corrected:  (idNotFound)S14  site:  S14 pep:  LVLLLSGALAL
P01892 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P01892 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALKE
P01892 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTWAG
P01892 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTWAGSH
P01892 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P01892 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P01892 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFF
P01892 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P01892 :   corrected

P30480 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30480 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30480 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
P30480 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30480 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGHNQYAYDGK
P30480 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
P30480 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
P30480 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P30480 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30480 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
P30480 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30480 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30480 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30480 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFY
P30480 :   corrected

P30495 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30495 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P30495 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
P30495 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
P30495 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30495 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HNQLAYDGKDY
P30495 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTI
P30495 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30495 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30495 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTETWAG
P30495 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30495 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30495 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30495 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30495 :   corrected: 

Q29974 :   corrected:  (idNotFound)T129  site:  T129 pep:  VQPKVTVYPSK
Q29974 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PKVTVYPSKTQ
Q29974 :   corrected:  (idNotFound)S133  site:  S133 pep:  VTVYPSKTQPL
Q29974 :   corrected:  (idNotFound)T262  site:  T262 pep:  SGLQPTGFLS
Q29974 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPSKTQPLQH
Q29974 :   corrected:  (idNotFound)S10  site:  S10 pep:  KLPGGSCMTAL
Q29974 :   corrected:  (idNotFound)T13  site:  T13 pep:  GGSCMTALTVT
Q29974 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMTALTVTLMV
Q29974 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVSGFYPGSIE
Q29974 :   corrected:  (idNotFound)T18  site:  T18 pep:  TALTVTLMVLS
Q29974 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVSGFY
Q29974 :   corrected:  (idNotFound)S149  site:  S149 pep:  LVCSVSGFYPG
Q29974 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
Q29974 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
Q29974 :   correcte

P18465 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P18465 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P18465 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P18465 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P18465 :   corrected:  (idNotFound)S140  site:  S140 pep:  RGHDQSAYDGK
P18465 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HDQSAYDGKDY
P18465 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
P18465 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P18465 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P18465 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P18465 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P18465 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P18465 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P18465 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P18465 :   corrected: 

P30459 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30459 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLLLLL
P30459 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYQQDAY
P30459 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  YQQDAYDGKDY
P30459 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P30459 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30459 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTRAG
P30459 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTRAGSH
P30459 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P30459 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTRAGSHSMRY
P30459 :   corrected:  (idNotFound)S28  site:  S28 pep:  RAGSHSMRYFF
P30459 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P30459 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADMA
P30459 :   corrected:  (idNotFound)Y31  site:  Y31 pep:  SHSMRYFFTSV
P30459 :   corrected

P30484 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30484 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P30484 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P30484 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P30484 :   corrected:  (idNotFound)S140  site:  S140 pep:  RGHDQSAYDGK
P30484 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSGALAL
P30484 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTI
P30484 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30484 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30484 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTETWAG
P30484 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30484 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30484 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30484 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30484 :   corrected:  (

Q31610 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
Q31610 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
Q31610 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
Q31610 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
Q31610 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGHNQYAYDGK
Q31610 :   corrected:  (idNotFound)T269  site:  T269 pep:  TFQKWTAVVVP
Q31610 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HNQYAYDGKDY
Q31610 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
Q31610 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
Q31610 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
Q31610 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
Q31610 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
Q31610 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
Q31610 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
Q31610 :   corre

Q30134 :   corrected:  (idNotFound)T129  site:  T129 pep:  VHPKVTVYPSK
Q30134 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PKVTVYPSKTQ
Q30134 :   corrected:  (idNotFound)S133  site:  S133 pep:  VTVYPSKTQPL
Q30134 :   corrected:  (idNotFound)T262  site:  T262 pep:  SGLQPTGFLS
Q30134 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPSKTQPLQH
Q30134 :   corrected:  (idNotFound)S10  site:  S10 pep:  RLPGGSCMAVL
Q30134 :   corrected:  (idNotFound)S173  site:  S173 pep:  KTGVVSTGLIH
Q30134 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMAVLTVTLMV
Q30134 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVSGFYPGSIE
Q30134 :   corrected:  (idNotFound)T18  site:  T18 pep:  AVLTVTLMVLS
Q30134 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVSGFY
Q30134 :   corrected:  (idNotFound)S149  site:  S149 pep:  LVCSVSGFYPG
Q30134 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
Q30134 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
Q30134 :   correc

P30488 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30488 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P30488 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P30488 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P30488 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYNQLAY
P30488 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
P30488 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30488 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30488 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
P30488 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30488 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30488 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30488 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30488 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFH
P30488 :   corrected:  (id

P01160 :   corrected:  (seqNotFound)Y151  site:  Y151  seq len:  151 pep:  CNSFRYRR
P01160 :   corrected:  (seqNotFound)S148  site:  S148  seq len:  151 pep:  GLGCNSFRYRR
P16188 :   corrected:  (idNotFound)S129  site:  S129 pep:  GCDVGSDGRFL
P16188 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P16188 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLLLLL
P16188 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYEQHAY
P16188 :   corrected:  (idNotFound)S14  site:  S14 pep:  LLLLLSGALAL
P16188 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWELSSQPTI
P16188 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYTQAAS
P16188 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P16188 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTHTWAG
P16188 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTHTWAGSH
P16188 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P16188 :   corrected:  (idNotFound)S26  site:  S26 pep:  H

P30685 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30685 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P30685 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P30685 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P30685 :   corrected:  (idNotFound)S140  site:  S140 pep:  RGHDQSAYDGK
P30685 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HDQSAYDGKDY
P30685 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTI
P30685 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30685 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30685 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P30685 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P30685 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30685 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30685 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30685 :   corrected: 

Q30167 :   corrected:  (idNotFound)T129  site:  T129 pep:  VQPKVTVYPSK
Q30167 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PKVTVYPSKTQ
Q30167 :   corrected:  (idNotFound)S133  site:  S133 pep:  VTVYPSKTQPL
Q30167 :   corrected:  (idNotFound)T262  site:  T262 pep:  SGLPPTGFLS
Q30167 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPSKTQPLQH
Q30167 :   corrected:  (idNotFound)S10  site:  S10 pep:  RLPGGSCMAVL
Q30167 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMAVLTVTLMV
Q30167 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVNGFYPGSIE
Q30167 :   corrected:  (idNotFound)T18  site:  T18 pep:  AVLTVTLMVLS
Q30167 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVNGFY
Q30167 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
Q30167 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
Q30167 :   corrected:  (idNotFound)S155  site:  S155 pep:  GFYPGSIEVRW
Q30167 :   corrected:  (idNotFound)T32  site:  T32 pep:  ALAGDTRPRFL
Q30167 :   correcte

P18462 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P18462 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLVLLL
P18462 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RFLRGYQQDAY
P18462 :   corrected:  (idNotFound)S14  site:  S14 pep:  LVLLLSGALAL
P18462 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQPTI
P18462 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  RKGGSYSQAAS
P18462 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P18462 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTQTWAG
P18462 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTQTWAGSH
P18462 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GQEQRYTCHVQ
P18462 :   corrected:  (idNotFound)S26  site:  S26 pep:  QTWAGSHSMRY
P18462 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFY
P18462 :   corrected:  (idNotFound)T282  site:  T282 pep:  QEQRYTCHVQH
P18462 :   corrected:  (idNotFound)T158  site:  T158 pep:  DLRSWTAADMA
P18462 :   corrected

P30493 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30493 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTIPIVG
P30493 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
P30493 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
P30493 :   corrected:  (idNotFound)S343  site:  S343 pep:  GGKGGSYSQAA
P30493 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  HNQLAYDGKDY
P30493 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTI
P30493 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
P30493 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30493 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTETWAG
P30493 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30493 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30493 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30493 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30493 :   corrected: 

Q95365 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
Q95365 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
Q95365 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTVLLLL
Q95365 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
Q95365 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
Q95365 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAS
Q95365 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
Q95365 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
Q95365 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
Q95365 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
Q95365 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
Q95365 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
Q95365 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFY
Q95365 :   corrected:  (idNotFound)T282  site:  T282 pep:  EEQRYTCHVQH
Q95365 :   corrected

P30479 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30479 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30479 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTV
P30479 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTVLLLL
P30479 :   corrected:  (idNotFound)Y140  site:  Y140 pep:  RGHNQYAYDGK
P30479 :   corrected:  (idNotFound)S14  site:  S14 pep:  VLLLLSAALAL
P30479 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P30479 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30479 :   corrected:  (idNotFound)T20  site:  T20 pep:  AALALTETWAG
P30479 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWAGSH
P30479 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30479 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30479 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFH
P30479 :   corrected:  (idNotFound)T282  site:  T282 pep:  EEQRYTCHVQH
P30479 :   corrected:  (id

P30508 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30508 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQPTIPIVG
P30508 :   corrected:  (idNotFound)T8  site:  T8 pep:  VMAPRTLILLL
P30508 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYDQSAY
P30508 :   corrected:  (idNotFound)S140  site:  S140 pep:  RGYDQSAYDGK
P30508 :   corrected:  (idNotFound)S14  site:  S14 pep:  LILLLSGALAL
P30508 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQPTI
P30508 :   corrected:  (idNotFound)S344  site:  S344 pep:  GGKGGSCSQAA
P30508 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30508 :   corrected:  (idNotFound)T20  site:  T20 pep:  GALALTETWAC
P30508 :   corrected:  (idNotFound)T22  site:  T22 pep:  LALTETWACSH
P30508 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30508 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWACSHSMRY
P30508 :   corrected:  (idNotFound)S28  site:  S28 pep:  WACSHSMRYFY
P30508 :   corrected

P01912 :   corrected:  (idNotFound)T129  site:  T129 pep:  VHPKVTVYPSK
P01912 :   corrected:  (idNotFound)Y131  site:  Y131 pep:  PKVTVYPSKTQ
P01912 :   corrected:  (idNotFound)S133  site:  S133 pep:  VTVYPSKTQPL
P01912 :   corrected:  (idNotFound)T135  site:  T135 pep:  VYPSKTQPLQH
P01912 :   corrected:  (idNotFound)S10  site:  S10 pep:  RLPGGSCMAVL
P01912 :   corrected:  (idNotFound)T16  site:  T16 pep:  CMAVLTVTLMV
P01912 :   corrected:  (idNotFound)Y152  site:  Y152 pep:  SVSGFYPGSIE
P01912 :   corrected:  (idNotFound)T18  site:  T18 pep:  AVLTVTLMVLS
P01912 :   corrected:  (idNotFound)S147  site:  S147 pep:  NLLVCSVSGFY
P01912 :   corrected:  (idNotFound)S149  site:  S149 pep:  LVCSVSGFYPG
P01912 :   corrected:  (idNotFound)S23  site:  S23 pep:  TLMVLSSPLAL
P01912 :   corrected:  (idNotFound)S24  site:  S24 pep:  LMVLSSPLALA
P01912 :   corrected:  (idNotFound)S155  site:  S155 pep:  GFYPGSIEVRW
P01912 :   corrected:  (idNotFound)T32  site:  T32 pep:  ALAGDTRPRFL
P01912 :   correct

P30485 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P30485 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P30485 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
P30485 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
P30485 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYHQDAY
P30485 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  YHQDAYDGKDY
P30485 :   corrected:  (idNotFound)S301  site:  S301 pep:  LRWEPSSQSTV
P30485 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P30485 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P30485 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P30485 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P30485 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P30485 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P30485 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P30485 :   corrected: 

P03989 :   corrected:  (idNotFound)T257  site:  T257 pep:  TELVETRPAGD
P03989 :   corrected:  (idNotFound)T305  site:  T305 pep:  PSSQSTVPIVG
P03989 :   corrected:  (idNotFound)T4  site:  T4 pep:  MRVTAPRTL
P03989 :   corrected:  (idNotFound)T8  site:  T8 pep:  VTAPRTLLLLL
P03989 :   corrected:  (idNotFound)Y137  site:  Y137 pep:  RLLRGYHQDAY
P03989 :   corrected:  (idNotFound)Y142  site:  Y142 pep:  YHQDAYDGKDY
P03989 :   corrected:  (idNotFound)Y344  site:  Y344 pep:  GKGGSYSQAAC
P03989 :   corrected:  (idNotFound)Y147  site:  Y147 pep:  YDGKDYIALNE
P03989 :   corrected:  (idNotFound)T20  site:  T20 pep:  GAVALTETWAG
P03989 :   corrected:  (idNotFound)T22  site:  T22 pep:  VALTETWAGSH
P03989 :   corrected:  (idNotFound)Y281  site:  Y281 pep:  GEEQRYTCHVQ
P03989 :   corrected:  (idNotFound)S26  site:  S26 pep:  ETWAGSHSMRY
P03989 :   corrected:  (idNotFound)S155  site:  S155 pep:  LNEDLSSWTAA
P03989 :   corrected:  (idNotFound)S28  site:  S28 pep:  WAGSHSMRYFH
P03989 :   corrected:  (

Q9UMY4 :   corrected:  (seqNotFound)S160  site:  S160  seq len:  162 pep:  YVPGKSLAVSC
Q9UMY4 :   corrected:  (seqNotFound)Y155  site:  Y155  seq len:  162 pep:  AIDRNYVPGKS
Q9UMY4 :   corrected:  (seqNotFound)S164  site:  S164  seq len:  162 pep:  KSLAVSCPGWS
Q9UMY4 :   corrected:  (seqNotFound)S169  site:  S169  seq len:  162 pep:  SCPGWSAVA
Saving  4 .tsv
4 .tsv Done  270.4089620113373
processing  6 .tsv
Get unique substrate sites in  6 .tsv
P63302 :   corrected:  (seqNotFound)Y9  site:  Y9  seq len:  87 pep:  AVRVVYCGAGY
P63302 :   corrected:  (seqNotFound)Y14  site:  Y14  seq len:  87 pep:  YCGAGYKSKYL
P63302 :   corrected:  (seqNotFound)S16  site:  S16  seq len:  87 pep:  GAGYKSKYLQL
Q9H2Q1 :   corrected:  (idNotFound)Y128  site:  Y128 pep:  GKMQVYE
Q9H2Q1 :   corrected:  (idNotFound)T2  site:  T2 pep:  MTSHHCV
Q9H2Q1 :   corrected:  (idNotFound)S3  site:  S3 pep:  MTSHHCVG
Q9H2Q1 :   corrected:  (idNotFound)S14  site:  S14 pep:  PGNHISWSGHE
Q9H2Q1 :   corrected:  (idNotFound)S16

**Remove unmapped/outdated results**
- Check for substrates/sites that could not be mapped because the UniProt sequence record is outdated
    - Get a list of the outdated UniProt IDs

In [9]:
all_results = glob.glob(NW_temp_dir_site + '*.csv')
    
updatedSub_li = []
for filename in all_results:
    start = time.time() 
    df_unmapped = pd.read_csv(filename, usecols = ['substrate_id','substrate_acc', 'site'])
    df_unmapped = df_unmapped[~(df_unmapped['site'].str.contains('S|T|Y', na=False)) | (df_unmapped['substrate_id'] == 'outdated')]
    df_unmapped = df_unmapped.substrate_acc.drop_duplicates()
    unmapped_li = df_unmapped.values.tolist()
    updatedSub_li.extend(unmapped_li)
    end = time.time()
    print (f"Time\t{(end-start):.3f}")
updatedSub_li 

Time	8.223
Time	9.399
Time	9.811
Time	10.685
Time	9.016
Time	8.493
Time	9.990
Time	5.003
Time	12.716
Time	9.222
Time	8.385
Time	9.441
Time	8.693
Time	8.373
Time	8.131
Time	8.660
Time	11.187
Time	10.031
Time	8.585
Time	8.566
Time	9.941


['Q86VQ6',
 'O15015',
 'Q96S15',
 'O43493',
 'Q9C0I4',
 'Q9NNW7',
 'Q9BQ50',
 'P63302',
 'Q9H2Q1',
 'Q99611',
 'P04229',
 'P30483',
 'P10316',
 'P30504',
 'P10319',
 'Q07000',
 'P20039',
 'Q31612',
 'P01891',
 'Q9TNN7',
 'Q29718',
 'Q95IE3',
 'P16190',
 'Q29836',
 'P30466',
 'Q29963',
 'P30498',
 'P30460',
 'P30455',
 'Q04826',
 'P30443',
 'P13761',
 'P30487',
 'Q9TQE0',
 'P30461',
 'P30490',
 'P01892',
 'Q09160',
 'P30480',
 'P30499',
 'P30495',
 'P30505',
 'Q29974',
 'P30481',
 'P18465',
 'P30450',
 'P30459',
 'Q29960',
 'P30484',
 'P13746',
 'Q31610',
 'P30462',
 'Q30134',
 'P16189',
 'P30488',
 'P30492',
 'P01160',
 'P16188',
 'Q29940',
 'P30685',
 'P18464',
 'Q30167',
 'Q29865',
 'Q5Y7A7',
 'P18462',
 'P30501',
 'P30493',
 'P30456',
 'Q95365',
 'P13760',
 'P30479',
 'P18463',
 'P30508',
 'P30510',
 'P01912',
 'P30464',
 'P30485',
 'P30486',
 'P03989',
 'P30988',
 'Q9BSG1',
 'E7EML9',
 'Q06124',
 'Q9UJ41',
 'Q8N2Q7',
 'Q9UJX0',
 'Q6ZR98',
 'Q9UMY4',
 'P18283',
 'P36969',
 'Q86Y91',
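
Because `updatedSub_li` is extended once per result file, the same accession can in principle appear more than once. The sketch below shows one way to de-duplicate such a list and write it one accession per line, a plain-text format that is convenient for batch sequence retrieval. The helper `write_accession_list` and the output file name are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def write_accession_list(accessions, out_path):
    """Write unique accessions, one per line, keeping first-seen order."""
    unique = list(dict.fromkeys(accessions))  # dicts preserve insertion order
    Path(out_path).write_text("\n".join(unique) + "\n")
    return unique


# A few accessions from the list above, with one repeated on purpose
example = ['Q86VQ6', 'O15015', 'P63302', 'O15015']

# Hypothetical output location, used only for this illustration
out_file = Path(tempfile.gettempdir()) / 'NetworKIN_updateSub_accessions.txt'
unique = write_accession_list(example, out_file)
print(len(example), '->', len(unique))
```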

- Remove every record with an unmapped substrate_acc/site from the '*_mappedSite.csv' files

In [12]:
all_results = glob.glob(NW_temp_dir_site + '*.csv')

for filename in all_results:
    start = time.time()
    df_mapSite = pd.read_csv(filename)
    # remove the records whose substrate_acc is in the outdated list (updatedSub_li)
    df_update = df_mapSite[~df_mapSite['substrate_acc'].isin(updatedSub_li)]
    df_update.to_csv(filename, chunksize=100000, index=False)
    end = time.time()
    print (f"chunk time\t{(end-start):.3f}")


chunk time	32.552
chunk time	34.708
chunk time	37.731
chunk time	42.370
chunk time	35.697
chunk time	32.222
chunk time	37.086
chunk time	19.179
chunk time	45.215
chunk time	36.513
chunk time	33.216
chunk time	37.718
chunk time	33.076
chunk time	33.881
chunk time	32.461
chunk time	33.712
chunk time	39.761
chunk time	37.249
chunk time	32.760
chunk time	32.690
chunk time	37.350


### Update NetworKIN results
- The next 6 cells apply only to NetworKIN results generated from outdated human proteome sequences (i.e. the input sequences submitted to NetworKIN are an earlier version than the reference human proteome sequences).

1. Download the sequence FASTA file for the UniProt IDs above (updatedSub_li) from uniprot.org and save it as '../Data/Raw/HumanProteome/NetworKIN_updateSub.fasta'. Submit 'NetworKIN_updateSub.fasta' to NetworKIN again.
2. Run the mapAcc and mapSite functions on the result file from 'NetworKIN_updateSub.fasta'.

In [10]:
# convert substrate_acc and kinase_acc
convert_type = 'acc'
networKin_convert.kin_convert_directory(NW_update_dir, 'na', NW_temp_dir_acc_update, convert_type)

reading  2020.tsv
getting unique sub
getting sub_acc
merge
done 0.2106938362121582
getting unique kin
getting kin_acc
merge
done 124.0804557800293
saving
Done 2.124906063079834


In [11]:
# map the site to the updated (new) human proteome reference
convert_type = 'site'
networKin_convert.kin_convert_directory(NW_temp_dir_acc_update, HP_csv, NW_temp_dir_site_update, convert_type)


Set input file dir...
done
read the Human Proteome df...
done
processing  2020 .tsv
Get unique substrate sites in  2020 .tsv
P55073 :   corrected:  (seqNotFound)S167  site:  S167  seq len:  304 pep:  VLNFGSCTPPF
P55073 :   corrected:  (seqNotFound)T169  site:  T169  seq len:  304 pep:  NFGSCTPPFMA
P59797 :   corrected:  (seqNotFound)Y269  site:  Y269  seq len:  346 pep:  LIRVTYCGLSY
P59797 :   corrected:  (seqNotFound)S275  site:  S275  seq len:  346 pep:  CGLSYSLRYIL
P59797 :   corrected:  (seqNotFound)S273  site:  S273  seq len:  346 pep:  TYCGLSYSLRY
P59797 :   corrected:  (seqNotFound)T268  site:  T268  seq len:  346 pep:  VLIRVTYCGLS
P59797 :   corrected:  (seqNotFound)Y274  site:  Y274  seq len:  346 pep:  YCGLSYSLRYI
Q9NZV6 :   corrected:  (seqNotFound)S92  site:  S92  seq len:  116 pep:  PKPGQSRFIFS
Q9NZV6 :   corrected:  (seqNotFound)S97  site:  S97  seq len:  116 pep:  SRFIFSSSLKF
Q9NZV6 :   corrected:  (seqNotFound)S98  site:  S98  seq len:  116 pep:  RFIFSSSLKFV
Q9NZV6 :   

3. Save '2020_mappedSite.csv' in the same directory as the other *_mappedSite.csv files
    - NetworKIN ignores 'U' (selenocysteine) in some substrate sequences, which shifts the positions of the downstream residues
    - sites that remain unmapped are not included

In [13]:
start = time.time()
df_mapUpdateSite = pd.read_csv(NW_temp_dir_site_update + '2020_mappedSite.csv')
df_mapUpdateSite = df_mapUpdateSite[(df_mapUpdateSite['site'].str.contains('S|T|Y', na=False)) & (df_mapUpdateSite['substrate_id'] != 'outdated')]
df_mapUpdateSite.to_csv(NW_temp_dir_site + '2020_mappedSite.csv', index=False)
end = time.time()
print (f"Time\t{(end-start):.3f}")
df_mapUpdateSite


Time	3.122


Unnamed: 0,Position,score,substrate_name,kinase_name,pep,substrate_acc,kinase_acc,site,substrate_id
0,Y256,0.3483,DIO3,SRC,SAYGAYFERLY,P55073,P12931,Y257,P55073_256
1,Y256,0.2997,DIO3,IGF1R,SAYGAYFERLY,P55073,P08069,Y257,P55073_256
2,Y256,0.2369,DIO3,FLT1,SAYGAYFERLY,P55073,P17948,Y257,P55073_256
3,Y256,0.2054,DIO3,ABL1,SAYGAYFERLY,P55073,P00519,Y257,P55073_256
4,Y256,0.1853,DIO3,MST1R,SAYGAYFERLY,P55073,Q04912,Y257,P55073_256
...,...,...,...,...,...,...,...,...,...
677118,Y2038,0.0001,TRIO,HCK,AEYDAYFEEVK,O60229,P08631,Y2038,O60229_2038
677119,Y2038,0.0001,TRIO,BTK,AEYDAYFEEVK,O60229,Q06187,Y2038,O60229_2038
677120,Y2038,0.0001,TRIO,FYN,AEYDAYFEEVK,O60229,P06241,Y2038,O60229_2038
677121,Y2038,0.0001,TRIO,KDR,AEYDAYFEEVK,O60229,P35968,Y2038,O60229_2038
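
The selenocysteine frame shift noted above can be illustrated with a toy sequence. In the DIO3 (P55073) rows of the table, Position Y256 is reported against site Y257, which is the kind of one-residue offset a single dropped 'U' upstream would produce. The sketch below assumes that the only difference between the submitted and reference sequences is that every 'U' was skipped; the function `reference_position` and the toy sequence are hypothetical and are not the notebook's mapSite implementation.

```python
def reference_position(ref_seq, networkin_pos):
    """Map a 1-based position in the U-stripped sequence back to the reference.

    Assumes the two sequences differ only in that every 'U' was dropped.
    """
    seen = 0  # residues counted so far in the U-stripped sequence
    for ref_pos, aa in enumerate(ref_seq, start=1):
        if aa == 'U':
            continue  # this residue does not exist in the stripped sequence
        seen += 1
        if seen == networkin_pos:
            return ref_pos
    raise ValueError("position lies beyond the end of the U-stripped sequence")


toy_ref = "MSAUGTYK"  # hypothetical reference sequence with one selenocysteine
print(reference_position(toy_ref, 2))  # upstream of U: unchanged
print(reference_position(toy_ref, 6))  # downstream of U: shifted by one
```

Positions upstream of the 'U' are unchanged, while every position downstream is shifted by the number of 'U' residues that precede it.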


### Get the Gene Name of the Substrates from the Reference Human Proteome
- Get the gene name used across all predictors for each substrate and add it to the result files

In [16]:
# get the protein gene names and accessions from the Reference Human Proteome
df_unique_sub =  pd.read_csv(HP_csv, usecols = ['UniprotID','Gene Name'], sep = '\t')
df_unique_sub

Unnamed: 0,Gene Name,UniprotID
0,ACTN1,P12814
1,STAT3,P40763
2,ADD1,P35611
3,ADD2,P35612
4,ADRA2A,P08913
...,...,...
20358,HSFX4,A0A1B0GTS1
20359,TRBJ2-6,A0A0A0MT70
20360,TMEM225B,P0DP42
20361,SMIM29,Q86T20


In [17]:
# add the gene name used across all predictors for each substrate to the result files
start = time.time()
all_results = glob.glob(NW_temp_dir_site + '*.csv')

for filename in all_results:
    df = pd.read_csv(filename, usecols = ['substrate_id', 'substrate_acc', 'substrate_name', 'site', 'Position', 'pep', 'score', 'kinase_name', 'kinase_acc'])
    # merge df_subsMap with df_unique_sub to add the common substrate gene name to the df
    df = df.merge(df_unique_sub, left_on=['substrate_acc'], right_on=['UniprotID'], how = 'left')

    df = df.drop(columns = ['UniprotID'])
    df.to_csv(filename, index=False)
df

Unnamed: 0,Position,score,substrate_name,kinase_name,pep,substrate_acc,kinase_acc,site,substrate_id,Gene Name
0,S257,1.0740,NDUFAF1,PAK1,VKIPFSKFFFS,Q9Y375,Q13153,S257,Q9Y375_257,NDUFAF1
1,S257,0.9377,NDUFAF1,PRKCB,VKIPFSKFFFS,Q9Y375,P05771,S257,Q9Y375_257,NDUFAF1
2,S257,0.5971,NDUFAF1,PRKCA,VKIPFSKFFFS,Q9Y375,P17252,S257,Q9Y375_257,NDUFAF1
3,S257,0.3979,NDUFAF1,PRKCZ,VKIPFSKFFFS,Q9Y375,Q05513,S257,Q9Y375_257,NDUFAF1
4,S257,0.3288,NDUFAF1,PRKCD,VKIPFSKFFFS,Q9Y375,Q05655,S257,Q9Y375_257,NDUFAF1
...,...,...,...,...,...,...,...,...,...,...
10144852,Y597,0.0001,WDR65,NTRK2,AFDVTYTAIVI,Q96MR6,Q16620,Y597,Q96MR6_597,CFAP57
10144853,Y597,0.0001,WDR65,KDR,AFDVTYTAIVI,Q96MR6,P35968,Y597,Q96MR6_597,CFAP57
10144854,Y597,0.0000,WDR65,PDGFRB,AFDVTYTAIVI,Q96MR6,P09619,Y597,Q96MR6_597,CFAP57
10144855,Y597,0.0000,WDR65,TYK2,AFDVTYTAIVI,Q96MR6,P29597,Y597,Q96MR6_597,CFAP57
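
A left merge keeps every result row but silently leaves `Gene Name` empty for any `substrate_acc` that is missing from the reference proteome. As a minimal sketch of how that could be checked, the example below uses pandas' `indicator=True` option on toy data; the accession `X00000` is made up to force a miss, and the toy frames merely stand in for a result file and the reference table.

```python
import pandas as pd

# Toy stand-ins for one result file and the reference proteome table
df = pd.DataFrame({'substrate_acc': ['Q9Y375', 'Q96MR6', 'X00000'],
                   'score': [1.0740, 0.0001, 0.5]})
df_unique_sub = pd.DataFrame({'UniprotID': ['Q9Y375', 'Q96MR6'],
                              'Gene Name': ['NDUFAF1', 'CFAP57']})

# indicator=True adds a '_merge' column recording where each row matched
merged = df.merge(df_unique_sub, left_on='substrate_acc', right_on='UniprotID',
                  how='left', indicator=True)

# accessions present in the results but absent from the reference table
missing = merged.loc[merged['_merge'] == 'left_only', 'substrate_acc'].unique().tolist()
print(missing)

# drop the helper columns before saving
merged = merged.drop(columns=['UniprotID', '_merge'])
```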


### Creating Resource Files
**globalKinaseMap**
- Create the global kinase resource file, or add the unique kinases from NetworKIN to it if it already exists.
- Get the kinase name used across all predictors and add it to the result files.

In [18]:
# get unique kinases in the NetworKIN result files

all_results = glob.glob(NW_temp_dir_site + '*.csv')

frames = []

for filename in all_results:
    df = pd.read_csv(filename, usecols = ['kinase_acc', 'kinase_name'])
    frames.append(df.drop_duplicates())

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df_unique_kin = pd.concat(frames, ignore_index=True).drop_duplicates()
df_unique_kin

Unnamed: 0,kinase_name,kinase_acc
0,PRKCB,P05771
1,PRKCA,P17252
2,PRKCZ,Q05513
3,PRKCD,Q05655
4,PRKCG,P05129
...,...,...
189,FYN,P06241
190,MAPK13,O15264
191,MAPK12,P53778
192,MAPK11,Q15759


In [22]:
# start = time.time()
unmapped_list = pd.DataFrame()

if os.path.isfile(KinaseMap): 
    df_humanKinase = pd.read_csv(KinaseMap)
# if globalKinaseMap.csv does not exist, create a new df from the original human kinase map
else:
    df_humanKinase = pd.read_csv(HK_org, usecols = ['Kinase Name', 'Preferred Name', 'UniprotID', 'Type', 'description'], sep = '\t')
    # strip the "[Source...]" annotation; assign back instead of using inplace on a column
    df_humanKinase['description'] = df_humanKinase['description'].replace(to_replace=r'\[Source.+\]', value='', regex=True)

# add unique kinases from NetworKIN to the global kinase resource file
for index, row in df_unique_kin.iterrows():
    kinase = df_unique_kin.at[index, 'kinase_acc']
    # if the kinase/other enzyme is already in the globalKinaseMap.csv file
    if any(df_humanKinase.UniprotID == kinase):
        # get the index of the kinase in the globalKinaseMap.csv file
        idx = df_humanKinase.index[df_humanKinase.UniprotID == kinase].values[0] 

        df_humanKinase.at[idx, 'NetworKIN_kinase_name'] = df_unique_kin.at[index, 'kinase_name']
        
    # if the kinase is not in the globalKinaseMap.csv file, collect it so its annotation can be checked manually
    else:
        # DataFrame.append was removed in pandas 2.0; concatenate the row as a one-row frame
        unmapped_list = pd.concat([unmapped_list, row.to_frame().T], ignore_index=True)
        
print (unmapped_list)

  kinase_acc kinase_name
0     Q16654        PDK4
1     Q15120        PDK3
2     Q15119        PDK2
3     Q86VQ0        LCA5


- Manually check the unmapped kinases above. Create a dictionary of those that are kinases, to be added to globalKinaseMap.csv, and a list of those that are not kinases, whose records will be dropped from the NetworKIN result files.

In [26]:
new_kinase = {'EPHB6':'Ephrin type-B receptor 6'}

not_kinase = ['PDK2','PDK3','PDK4','LCA5']

- Add the 1 protein kinase to globalKinaseMap.csv using the dictionary above

In [58]:
# add the 1 new kinase to globalKinaseMap.csv
new_rows = []
for key in new_kinase:
    # get the index of the new kinase in df_unique_kin
    i = df_unique_kin.index[df_unique_kin['kinase_name'] == key].values[0]
    new_rows.append({
        # kinase name and description for the new kinase
        'Kinase Name': key,
        'Preferred Name': key,
        'description': new_kinase[key],
        # UniProt ID for the new kinase
        'UniprotID': df_unique_kin.at[i, 'kinase_acc'],
        # name used in NetworKIN for the new kinase
        'NetworKIN_kinase_name': df_unique_kin.at[i, 'kinase_name'],
    })

# build the rows first and concatenate once (DataFrame.append was removed in pandas 2.0)
df_humanKinase = pd.concat([df_humanKinase, pd.DataFrame(new_rows)], ignore_index=True)

df_humanKinase.to_csv(KinaseMap,index = False) 
df_humanKinase

Unnamed: 0,Kinase Name,UniprotID,description,HPRD_kinase_name,HPRD_kinase_uniprot_id,HPRD_kinase_refseq_id,PhosphoSite_kinase_uniprot_id,PhosphoSite_kinase_name,PhosphoSite_kinase_gene_name,GPS5_kinase_name,NetworKIN_kinase_name
0,SGK1,O00141,serum/glucocorticoid regulated kinase 1,SGK1,O00141,NP_001137148.1,O00141,SGK1,SGK1,SGK1,SGK1
1,BMPR1B,O00238,bone morphogenetic protein receptor type 1B,BMPR1B,O00238,NP_001194.1,O00238,BMPR1B,BMPR1B,BMPR1B,
2,CDC7,O00311,cell division cycle 7,,,,O00311,CDC7,CDC7,CDC7,
3,PLK4,O00444,polo like kinase 4,,,,O00444,PLK4,PLK4,PLK4,
4,STK25,O00506,serine/threonine kinase 25,STK25,O00506,NP_006365.2,O00506,YSK1,STK25,STK25,STK25
...,...,...,...,...,...,...,...,...,...,...,...
484,TP53RK,Q96S44,EKC/KEOPS complex subunit TP53RK,,,,Q96S44,PRPK,TP53RK,TP53RK,
485,TRPM6,Q9BX84,Transient receptor potential cation channel su...,,,,Q9BX84,ChaK2,TRPM6,TRPM6,
486,BCR/ABL,A9UF07,BCR/ABL fusion protein isoform Y5,,,,A9UF07,BCR-ABL1,BCR/ABL,,
487,BCKDK,O14874,[3-methyl-2-oxobutanoate dehydrogenase [lipoam...,,,,,,,BCKDK,BCKDK


In [24]:
# get the updated kinase list: the common kinase name used across all references and the UniprotID for each kinase
df_unique_kin = df_humanKinase[['Kinase Name','UniprotID']]
df_unique_kin

Unnamed: 0,Kinase Name,UniprotID
0,SGK1,O00141
1,BMPR1B,O00238
2,CDC7,O00311
3,PLK4,O00444
4,STK25,O00506
...,...,...
484,TP53RK,Q96S44
485,TRPM6,Q9BX84
486,BCR/ABL,A9UF07
487,BCKDK,O14874


- remove the records of the 4 enzymes that are not protein kinases from the NetworKIN result files
- add the common Kinase Name, used across all predictors, to the result files

In [27]:
all_results = glob.glob(NW_temp_dir_site + '*.csv')

for filename in all_results:
    df = pd.read_csv(filename)
    # remove the records of the 4 enzymes that are not protein kinases from df
    df = df[~df['kinase_name'].isin(not_kinase)]

    # merge with df_unique_kin to add the common kinase name to df
    df = df.merge(df_unique_kin, left_on='kinase_acc', right_on='UniprotID', how = 'left')
    # drop the duplicated UniprotID column for kinases
    df = df.drop(columns = 'UniprotID')

    df.to_csv(filename,index=False)  

df

Unnamed: 0,Position,score,substrate_name,kinase_name,pep,substrate_acc,kinase_acc,site,substrate_id,Gene Name,Kinase Name
0,S257,1.0740,NDUFAF1,PAK1,VKIPFSKFFFS,Q9Y375,Q13153,S257,Q9Y375_257,NDUFAF1,PAK1
1,S257,0.9377,NDUFAF1,PRKCB,VKIPFSKFFFS,Q9Y375,P05771,S257,Q9Y375_257,NDUFAF1,PRKCB
2,S257,0.5971,NDUFAF1,PRKCA,VKIPFSKFFFS,Q9Y375,P17252,S257,Q9Y375_257,NDUFAF1,PRKCA
3,S257,0.3979,NDUFAF1,PRKCZ,VKIPFSKFFFS,Q9Y375,Q05513,S257,Q9Y375_257,NDUFAF1,PRKCZ
4,S257,0.3288,NDUFAF1,PRKCD,VKIPFSKFFFS,Q9Y375,Q05655,S257,Q9Y375_257,NDUFAF1,PRKCD
...,...,...,...,...,...,...,...,...,...,...,...
9912858,Y597,0.0001,WDR65,NTRK2,AFDVTYTAIVI,Q96MR6,Q16620,Y597,Q96MR6_597,CFAP57,NTRK2
9912859,Y597,0.0001,WDR65,KDR,AFDVTYTAIVI,Q96MR6,P35968,Y597,Q96MR6_597,CFAP57,KDR
9912860,Y597,0.0000,WDR65,PDGFRB,AFDVTYTAIVI,Q96MR6,P09619,Y597,Q96MR6_597,CFAP57,PDGFRB
9912861,Y597,0.0000,WDR65,TYK2,AFDVTYTAIVI,Q96MR6,P29597,Y597,Q96MR6_597,CFAP57,TYK2
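
The filter-and-merge step above can be sketched on toy data. The rows below are illustrative only: the PAK1 and PRKCB accessions come from the output above, and the PDK2 row uses a placeholder accession. The `validate='many_to_one'` argument and the `unmapped` check are additions not present in the notebook, assuming each UniprotID should appear only once in the kinase map.

```python
import pandas as pd

not_kinase = ['PDK2', 'PDK3', 'PDK4', 'LCA5']

# toy result rows; 'X00000' is a placeholder accession, not real data
df = pd.DataFrame({
    'kinase_name': ['PAK1', 'PRKCB', 'PDK2'],
    'kinase_acc': ['Q13153', 'P05771', 'X00000'],
    'score': [1.0740, 0.9377, 0.5],
})

# toy kinase map with the common name for each UniprotID
df_unique_kin = pd.DataFrame({
    'Kinase Name': ['PAK1', 'PRKCB'],
    'UniprotID': ['Q13153', 'P05771'],
})

# drop records whose predicted enzyme is not a protein kinase
df = df[~df['kinase_name'].isin(not_kinase)]

# the left merge keeps every prediction; validate raises a MergeError
# if a UniprotID occurs more than once in the kinase map
df = df.merge(df_unique_kin, left_on='kinase_acc', right_on='UniprotID',
              how='left', validate='many_to_one').drop(columns='UniprotID')

# rows without a common kinase name point to kinases missing from the map
unmapped = df[df['Kinase Name'].isna()]
print(df)
print(len(unmapped))
```

A non-empty `unmapped` frame would suggest that globalKinaseMap.csv is missing a kinase that still appears in the result files.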


# Standard Formatted NetworKIN
### 'NetworKIN_formatted.csv'
Standardize the preprocessed files with the following columns:

- substrate_id - unique IDs for the substrate phosphorylation site (substrate_acc + position)
- substrate_name - gene name for the substrates
- substrate_acc - mapped UniprotIDs for the substrates
- site - phosphorylation site
- pep - +/- 5 AA peptide sequence around the site
- score - NetworKIN prediction score
- Kinase Name - common kinase name used across all predictors

In [8]:
all_results = glob.glob(NW_temp_dir_site + '*.csv')
NW = []
for filename in all_results:
    df = pd.read_csv(filename, usecols = ['substrate_id','substrate_acc','Gene Name', 'site','pep', 'score', 'Kinase Name'])
    df = df.rename(columns={'Gene Name': 'substrate_name'})
    NW.append(df)
        
df_final = pd.concat(NW)
df_final = df_final.drop_duplicates()
df_final.to_csv(NW_formatted, chunksize = 1000000, index = False)
df_final

Unnamed: 0,substrate_id,substrate_acc,substrate_name,site,pep,score,Kinase Name
0,O60294_5,O60294,LCMT2,S5,_MGPRSRERRA,2.4948,PRKCB
1,O60294_5,O60294,LCMT2,S5,_MGPRSRERRA,1.3048,PRKCA
2,O60294_5,O60294,LCMT2,S5,_MGPRSRERRA,1.2440,PRKCZ
3,O60294_5,O60294,LCMT2,S5,_MGPRSRERRA,0.5110,PRKCD
4,O60294_5,O60294,LCMT2,S5,_MGPRSRERRA,0.4732,PRKCG
...,...,...,...,...,...,...,...
192836106,Q96MR6_597,Q96MR6,CFAP57,Y597,AFDVTYTAIVI,0.0001,NTRK2
192836107,Q96MR6_597,Q96MR6,CFAP57,Y597,AFDVTYTAIVI,0.0001,KDR
192836108,Q96MR6_597,Q96MR6,CFAP57,Y597,AFDVTYTAIVI,0.0000,PDGFRB
192836109,Q96MR6_597,Q96MR6,CFAP57,Y597,AFDVTYTAIVI,0.0000,TYK2
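
A quick consistency check for the formatted table, as a sketch. It assumes `substrate_id` is `substrate_acc`, an underscore, and the numeric position taken from `site`, which matches the rows shown above. The helper name `check_formatted` and the toy frame are hypothetical; the toy rows are copied from the output above.

```python
import pandas as pd

EXPECTED_COLS = ['substrate_id', 'substrate_acc', 'substrate_name',
                 'site', 'pep', 'score', 'Kinase Name']


def check_formatted(df):
    """Return rows whose substrate_id is not substrate_acc + '_' + position."""
    missing = [c for c in EXPECTED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f'missing columns: {missing}')
    # site is the residue letter followed by the position, e.g. 'S5' -> '5'
    position = df['site'].str[1:]
    expected_id = df['substrate_acc'] + '_' + position
    return df[df['substrate_id'] != expected_id]


df_toy = pd.DataFrame({
    'substrate_id': ['O60294_5', 'Q96MR6_597'],
    'substrate_acc': ['O60294', 'Q96MR6'],
    'substrate_name': ['LCMT2', 'CFAP57'],
    'site': ['S5', 'Y597'],
    'pep': ['_MGPRSRERRA', 'AFDVTYTAIVI'],
    'score': [2.4948, 0.0001],
    'Kinase Name': ['PRKCB', 'NTRK2'],
})

bad_rows = check_formatted(df_toy)
print(len(bad_rows))
```

Running the same check on `df_final` before writing the CSV would flag any site whose ID and position drifted apart during mapping.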
