## Download and Process Dataset of Interest
Before kinase activities can be predicted, the dataset needs to be mapped to KinPred to obtain the UniProt ID, phosphosite, and +/-7 peptide sequence that KSTAR uses to identify which kinases are associated with each phosphosite. For mapping, the dataframe containing the phosphoproteomic data should include each peptide's UniProt accession, as well as either the site number or the peptide sequence. If the peptide sequence is used, only the phosphorylated residues should be lowercase. An example of a processed dataset is shown below; it is a trimmed and processed version of a publicly available dataset (Chylek, 2014):

In [1]:
#import KSTAR and other necessary packages
import pandas as pd
import numpy as np
import pickle
import os

from kstar import config, helpers, kstar_runner
from kstar.mapper import experiment_mapper

In [2]:
#load data
df = pd.read_csv('example.tsv', index_col = 0, sep = '\t')
df

MS_id,query_accession,mod_sites,peptide,data:time:0,data:time:5,data:time:15,data:time:30,data:time:60
7605136,Q9P2D3-1,Y1104,EAAEVCEyAMSLAK,0.0,-0.01,-0.28,-0.03,-0.27
7605137,A0FGR8-6,Y845,NLIAFSEDGSDPyVR,0.0,0.26,0.27,0.04,0.05
7605138,Q5T4S7-2,Y5156,HNDMPIyEAADK,0.0,0.31,-0.15,0.01,-0.23
7605139,Q16181-1,Y30,NLEGyVGFANLPNQVYR,0.0,-0.14,-0.19,0.07,0.15
7605140,Q16181-1,Y41,NLEGYVGFANLPNQVyR,0.0,-0.14,-0.09,0.04,-0.06
...,...,...,...,...,...,...,...,...
7605855,O95801,Y129,AAAQYyLGNFR,0.0,0.15,0.39,-0.05,0.08
7605859,O60711,Y22,STLQDSDEySNPAPLPLDQHSR,0.0,-0.73,0.76,4.48,5.88
7605860,O60711,Y203,SGLAYCPNDyHQLFSPR,0.0,0.55,-1.77,3.58,5.15
7605861,P47736,Y374,LINAEyACYK,0.0,-0.12,0.19,0.02,-0.16
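
The lowercase convention in the `peptide` column can be checked before mapping. The sketch below is a minimal, hypothetical helper (`phospho_positions` is not part of KSTAR) that lists the lowercase residues in a peptide string with their 1-based positions within the peptide. It assumes lowercase letters mark phosphorylated residues, as described above.

```python
def phospho_positions(peptide):
    """Return (position, residue) pairs for each lowercase residue.

    Positions are 1-based and relative to the peptide, not the protein.
    Hypothetical helper for illustration; not part of KSTAR.
    """
    return [(i, aa.upper()) for i, aa in enumerate(peptide, start=1) if aa.islower()]


# Peptides taken from the example table above
print(phospho_positions("EAAEVCEyAMSLAK"))     # [(8, 'Y')]
print(phospho_positions("NLEGYVGFANLPNQVyR"))  # [(16, 'Y')]
```

A peptide that returns an empty list has no marked phosphosite, and one that returns several entries carries more than one.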


Notice that every data column in the dataset is prefixed with 'data:'. KSTAR uses this prefix to identify which columns to use when making evidence decisions. You can add the prefix manually before mapping, or KSTAR will add it automatically once you indicate which columns to use as evidence.
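
If you prefer to add the prefix yourself, the renaming could be sketched as follows. The helper `data_prefix_mapping` and the unprefixed column names (`time:0`, `time:5`) are hypothetical and used only for illustration. The resulting dictionary can be passed to pandas' `DataFrame.rename(columns=...)`.

```python
def data_prefix_mapping(columns, data_columns, prefix="data:"):
    """Build an {old_name: new_name} mapping that adds `prefix` to data columns.

    Columns that already start with the prefix are skipped.
    Hypothetical helper for illustration; not part of KSTAR.
    """
    wanted = set(data_columns)
    return {
        col: f"{prefix}{col}"
        for col in columns
        if col in wanted and not col.startswith(prefix)
    }


# Hypothetical column names before prefixing
columns = ["query_accession", "mod_sites", "peptide", "time:0", "time:5"]
mapping = data_prefix_mapping(columns, data_columns=["time:0", "time:5"])
print(mapping)
# With a pandas dataframe: df = df.rename(columns=mapping)
```

Building the mapping separately from the rename makes it easy to inspect which columns will be treated as data before the dataframe is changed.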

## Map Dataset

In [3]:
#define the directory where mapped dataset and run information will be saved
odir = './example'
#Define the log name. The log will store information regarding any errors that occur during the mapping process.
logName = 'example_run'

In [4]:
#Construct a dictionary that identifies which columns contain the accession and the peptide and/or site for each peptide.
#Format of this dictionary should be: {'peptide': 'Col_with_PeptideInfo', 'accession_id': 'Col_with_UniprotID'}
mapDict = {'peptide':'peptide', 'accession_id':'query_accession'}

#create the output directory (including odir itself) if it does not already exist
os.makedirs(f"{odir}/MAPPED_DATA", exist_ok=True)
#get mapping log
mapping_log = helpers.get_logger(f"mapping_{logName}", f"{odir}/MAPPED_DATA/mapping_{logName}.log")
#map dataset and record process in the logger
exp_mapper = experiment_mapper.ExperimentMapper(experiment = df,
                                                columns = mapDict, 
                                                logger = mapping_log)
#save mapped dataset
exp_mapper.experiment.to_csv(f"{odir}/MAPPED_DATA/{logName}_mapped.tsv", sep = '\t', index = False)

In [5]:
exp_mapper.experiment

Unnamed: 0,query_accession,mod_sites,peptide,data:time:0,data:time:5,data:time:15,data:time:30,data:time:60,KSTAR_ACCESSION,KSTAR_PEPTIDE,KSTAR_SITE,KSTAR_NUM_COMPENDIA,KSTAR_NUM_COMPENDIA_CLASS
0,Q9P2D3-1,Y1104,EAAEVCEyAMSLAK,0.0,-0.01,-0.28,-0.03,-0.27,Q9P2D3,EAAEVCEyAMSLAKN,Y1104,0,0
1,A0FGR8-6,Y845,NLIAFSEDGSDPyVR,0.0,0.26,0.27,0.04,0.05,A0FGR8,SEDGSDPyVRMYLLP,Y824,2,1
2,Q5T4S7-2,Y5156,HNDMPIyEAADK,0.0,0.31,-0.15,0.01,-0.23,Q5T4S7,RHNDMPIyEAADKAL,Y5135,1,1
3,Q16181-1,Y30,NLEGyVGFANLPNQVYR,0.0,-0.14,-0.19,0.07,0.15,Q16181,QQKNLEGyVGFANLP,Y30,5,2
4,Q16181-1,Y41,NLEGYVGFANLPNQVyR,0.0,-0.14,-0.09,0.04,-0.06,Q16181,ANLPNQVyRKSVKRG,Y41,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,O95801,Y129,AAAQYyLGNFR,0.0,0.15,0.39,-0.05,0.08,O95801,NRAAAQYyLGNFRSA,Y129,1,1
661,O60711,Y22,STLQDSDEySNPAPLPLDQHSR,0.0,-0.73,0.76,4.48,5.88,O60711,TLQDSDEySNPAPLP,Y22,5,2
662,O60711,Y203,SGLAYCPNDyHQLFSPR,0.0,0.55,-1.77,3.58,5.15,O60711,LAYCPNDyHQLFSPR,Y203,2,1
663,P47736,Y374,LINAEyACYK,0.0,-0.12,0.19,0.02,-0.16,P47736,TKLINAEyACYKAEK,Y374,0,0


## Explore the Mapped Dataset

One way to check how well the mapping worked is to compare the number of peptides in the original dataframe with the number in the newly mapped dataframe. A change in the number of peptides can have several causes:
1. Failure to find the corresponding peptide
2. A peptide with multiple phosphorylation sites was separated into separate peptides
3. Multiple peptides are identical and provide the same evidence

For details on mapping failures, see the log file.
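
Two of the causes listed above can be anticipated from the input alone. The sketch below counts peptides carrying more than one lowercase (phosphorylated) residue and input rows that repeat the same accession and peptide. The records and the helper `summarize_peptides` are hypothetical. The sketch does not reproduce how KSTAR handles these cases, which should still be confirmed from the mapping log.

```python
from collections import Counter


def summarize_peptides(records):
    """Count multi-site peptides and repeated (accession, peptide) pairs.

    `records` is an iterable of (accession, peptide) tuples in which
    lowercase letters mark phosphorylated residues.
    Hypothetical helper for illustration; not part of KSTAR.
    """
    records = list(records)
    multi_site = sum(
        1 for _, peptide in records if sum(aa.islower() for aa in peptide) > 1
    )
    counts = Counter(records)
    # Rows beyond the first occurrence of each (accession, peptide) pair
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {"rows": len(records), "multi_site": multi_site, "duplicate_rows": duplicates}


# Made-up records for illustration only
toy = [
    ("P00001", "AAAyAAAK"),
    ("P00001", "AAAyAAAK"),   # repeated row
    ("P00002", "GGsGGyGGR"),  # two phosphorylated residues
]
print(summarize_peptides(toy))
```

With the example dataframe, the same helper could be called as `summarize_peptides(zip(df['query_accession'], df['peptide']))`.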

In [6]:
#Look at how many of the peptides in the dataset were actually mapped. This indicates how well the mapping worked
print("Original number of lines in df: %d\nNew number of lines in mapped:%d"%(len(df), len(exp_mapper.experiment)))

Original number of lines in df: 665
New number of lines in mapped:665
