Kinase activity calculation requires the following choices:

1. The phosphorylation type: tyrosine ['Y'], serine/threonine ['ST'], or both ['Y', 'ST'] (default).
2. How to aggregate duplicate (or more) measurements of the same peptide. 'count' counts the number of times a non-NaN value of that peptide occurred in an experiment. 'mean' averages the non-NaN values found for that peptide.
3. The threshold to apply to the 'mean' or 'count' aggregate. By default, values greater than or equal to the threshold are kept (greater=True).
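To make the interaction of these choices concrete, here is a minimal standard-library sketch of aggregation followed by thresholding. It is not KSTAR's implementation: the helper names `aggregate` and `is_evidence` and the toy site labels are hypothetical, and treating `greater=False` as "less than or equal to the threshold" is an assumption.

```python
import math
from statistics import mean

def aggregate(values, agg="mean"):
    """Collapse repeated measurements of one peptide into a single number."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return math.nan  # peptide never observed in this sample
    if agg == "count":
        return float(len(observed))
    if agg == "mean":
        return mean(observed)
    raise ValueError(f"unknown agg: {agg}")

def is_evidence(value, threshold, greater=True):
    """Decide whether an aggregated value counts as evidence.

    Assumption: greater=False keeps values <= threshold.
    """
    if math.isnan(value):
        return False
    return value >= threshold if greater else value <= threshold

# Hypothetical toy data: site -> repeated measurements in one sample
measurements = {
    "P1_Y10": [0.5, math.nan, 0.1],
    "P2_Y20": [math.nan, math.nan],
    "P3_Y30": [0.1, 0.15],
}

mean_evidence = {s: is_evidence(aggregate(v, "mean"), 0.2) for s, v in measurements.items()}
count_evidence = {s: is_evidence(aggregate(v, "count"), 2) for s, v in measurements.items()}
print(mean_evidence)
print(count_evidence)
```

With 'mean', a site is kept only if its average signal clears the threshold; with 'count', the threshold is a minimum number of observations, so the two aggregates need thresholds on very different scales.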

In [17]:
#Import preamble of kstar and other necessary functions
import pandas as pd
import os
import pickle

from kstar import config, helpers
from kstar.activity import kstar_activity

#define log name and out directory. Ideally use the same values used for mapping
odir = './example'
logName = 'example_run'

## Determine Thresholds
Before calculating activities, determine which threshold to use, since the threshold decides which sites count as evidence for each sample. A simple benchmark is to count the number of sites that would be used as evidence at different thresholds. For tyrosine kinase activities, it is recommended that each sample have at least 50 tyrosine sites used as evidence; for serine/threonine kinase activities, at least 1000 sites. While not a necessity, it is also beneficial to have comparable site numbers across samples.
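The benchmark described above can be sketched as a simple sweep over candidate thresholds. This is an illustration on made-up values, not a KSTAR function: `evidence_size`, `MIN_Y_SITES`, and the toy `sample` list are hypothetical names, and only the 50-site minimum comes from the text.

```python
import math

MIN_Y_SITES = 50  # recommended minimum number of tyrosine sites (from the text)

def evidence_size(values, threshold):
    """Count non-NaN aggregated values that are >= threshold."""
    return sum(1 for v in values if not math.isnan(v) and v >= threshold)

# Hypothetical aggregated values for one sample: 100 observed sites, 20 missing
sample = [i / 100 for i in range(100)] + [math.nan] * 20

# Evidence size at each candidate threshold
sizes = {t: evidence_size(sample, t) for t in (0.2, 0.5, 0.8)}

# Thresholds that leave enough sites for tyrosine kinase predictions
usable = [t for t, n in sizes.items() if n >= MIN_Y_SITES]
print(sizes, usable)
```

In practice the same count is run per sample column, and a threshold is preferred when every sample clears the minimum and the counts are of similar size.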

In [18]:
#load mapped data, if necessary
experiment = pd.read_csv(f'{odir}/MAPPED_DATA/{logName}_mapped.tsv', sep = '\t', index_col = 0)

In [19]:
#indicate phosphomod of interest: either 'Y' for tyrosine or 'ST' for serine/threonine
phospho_type = ['Y']
logName_new = logName + '_Y'
#create activity log
if not os.path.exists(f"{odir}/RESULTS"): 
    os.mkdir(f"{odir}/RESULTS")
activity_log = helpers.get_logger(f"activity_{logName_new}", f"{odir}/RESULTS/activity_{logName_new}.log")

In [20]:
#If your data columns already have data: in front of their name, then data_columns can be set to None. 
#Otherwise, indicate which columns contain the data of interest here:
data_columns = None

agg = 'mean' # average the non-NaN values of duplicate peptides
threshold = 0.2


#create the activity object and then build the binary evidence
kinact = kstar_activity.KinaseActivity(experiment, activity_log, phospho_type=phospho_type)
kinact.set_data_columns()
evidence_binary = kinact.create_binary_evidence(data_columns = data_columns, agg = agg, threshold = threshold, greater = True)

#inspect the number of sites used for each sample
data_cols = [col for col in evidence_binary.columns if 'data:' in col]
evidence_binary[data_cols].sum()

data:time:0       0.0
data:time:5     128.0
data:time:15    135.0
data:time:30    191.0
data:time:60    200.0
dtype: float64

All samples above (except for the 0 time point, which will be removed prior to activity calculation) have greater than 50 sites being used as evidence, and the evidence sizes are generally comparable. This is a good threshold to use.

## Calculate Statistical Enrichment (Hypergeometric p-values)

The first step to obtaining activity predictions is calculating the statistical enrichment of kinase substrates in the dataset using the hypergeometric test. From this test, the median p-value from the pruned networks can be used to indicate the level of activity. To obtain these predictions, use the following steps:
1. Load the pruned kinase-substrate networks
2. Generate an activity log, which will store information about each run, including any errors that arise.
3. Define the data columns containing each experiment and the threshold to use
4. Perform enrichment calculations
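The enrichment in step 4 can be illustrated with a small standard-library sketch. It assumes one common parametrization of the hypergeometric test (background of N sites, K substrates of a kinase in a given network, n evidence sites, k of them substrates) and summarizes across networks with the median, as described above. The function names and toy sets are hypothetical, and KSTAR's own implementation may differ in detail.

```python
from math import comb
from statistics import median

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n sites are drawn from N, of which K are substrates."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def median_enrichment(evidence, networks, background_size):
    """Median hypergeometric p-value of one kinase across several networks."""
    pvals = []
    for substrates in networks:
        k = len(evidence & substrates)
        pvals.append(hypergeom_sf(k, background_size, len(substrates), len(evidence)))
    return median(pvals)

# Hypothetical toy example: 10 background sites, 3 evidence sites,
# and 3 networks that each assign 4 substrates to the same kinase
evidence = {"s0", "s1", "s2"}
networks = [
    {"s0", "s1", "s5", "s6"},
    {"s0", "s7", "s8", "s9"},
    {"s0", "s1", "s2", "s3"},
]
p_median = median_enrichment(evidence, networks, background_size=10)
print(p_median)
```

Taking the median across networks keeps a single network with an unusually strong or weak overlap from dominating the activity estimate.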

In [21]:
phospho_types = ['Y'] #running on this type of kinase/substrate network

#Setup the network dictionary. Here, using the default pickles from config
# only have to load one of these if running analysis on only one substrate type
networks = {}
with open(config.NETWORK_Y_PICKLE, "rb") as f: #close the file once the pickle is loaded
    networks['Y'] = pickle.load(f)

In [6]:
#Create activity log: skip this cell if the log was already created above.
if not os.path.exists(f"{odir}/RESULTS"): 
    os.mkdir(f"{odir}/RESULTS")
activity_log = helpers.get_logger(f"activity_{logName}", f"{odir}/RESULTS/activity_{logName}.log")

In [7]:
data_columns = None #by passing None, all columns prefixed by data: will be used to calculate activity
agg = 'mean'
threshold = 0.2
greater = True
kinact_dict = kstar_activity.run_kstar_analysis(experiment, activity_log, networks, phospho_types = phospho_types, 
                                                data_columns = data_columns, agg =agg, threshold = threshold,  
                                                greater = greater)

In [8]:
#Kinase activities, summarized as the median p-value seen across all networks,
# have now been calculated and are stored in the activities attribute.
# Sort ascending by p-value to see the most active kinases at 30 seconds
kinact_dict['Y'].activities.sort_values('data:time:30').head()

KSTAR_KINASE,data:time:5,data:time:15,data:time:30,data:time:60
LCK,7e-06,8.355183e-08,4.000094e-11,1.05971e-13
FYN,0.000587,0.0001244823,6.028671e-06,5.408169e-08
ITK,0.019084,0.004479018,7.399574e-05,2.249004e-05
HCK,0.036456,0.03438913,0.0002025043,3.41367e-06
BTK,0.023044,0.03009696,0.0002667463,3.553248e-06


## Generate random datasets, run kinase activity on random datasets, normalize original analysis

The p-values obtained from the hypergeometric test above often suffer from high false positive rates for kinases that are well studied in phosphosite compendia, such as LCK and FYN. To account for these false positive rates, we generate random datasets and rerun the enrichment calculations on each of them. These random results can then be used to obtain new activity predictions that have been adjusted for the enrichment that might be found by random chance.
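As a rough illustration of what a random dataset is, the sketch below draws random site sets that match the evidence size of a parent sample. The only property taken from this tutorial is that each random experiment selects the same number of sites as its parent. Uniform sampling, the function name `make_random_experiments`, and the toy site labels are assumptions; KSTAR's own sampling strategy may differ.

```python
import random

def make_random_experiments(all_sites, n_selected, num_random, seed=0):
    """Draw num_random random site sets, each with n_selected sites.

    Assumption: sites are sampled uniformly without replacement.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    return [set(rng.sample(all_sites, n_selected)) for _ in range(num_random)]

# Hypothetical background of 1000 sites; the parent sample used 128 as evidence
all_sites = [f"site{i}" for i in range(1000)]
randoms = make_random_experiments(all_sites, n_selected=128, num_random=150)

print(len(randoms), {len(r) for r in randoms})
```

Each random set is then scored with the same enrichment test as the real sample, which gives a per-kinase picture of how small a p-value can get by chance alone.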

In [9]:
#indicate the number of random experiments to generate. 150 experiments provides a good balance between statistical power and
#    computational cost
num_random_experiments=150
#indicate the desired false positive rate
target_alpha=0.05
#run the normalize analysis. This will generate the random experiments then normalize p-values to adjust for fpr.
kstar_activity.normalize_analysis(kinact_dict, activity_log, num_random_experiments, target_alpha)

In [10]:
#This object holds the random datasets that were used, where NaN means a site was not selected and 1 means it was
kinact_dict['Y'].random_experiments.head()

,KSTAR_ACCESSION,KSTAR_SITE,data:time:5:0,data:time:5:1,data:time:5:2,data:time:5:3,data:time:5:4,data:time:5:5,data:time:5:6,data:time:5:7,...,data:time:60:140,data:time:60:141,data:time:60:142,data:time:60:143,data:time:60:144,data:time:60:145,data:time:60:146,data:time:60:147,data:time:60:148,data:time:60:149
0,A0AUZ9,Y792,,,,,,,,,...,,,,,,,,,,
1,A0AV02,Y107,,,,,,,,,...,,,,,,,,,,
2,A0AVF1,Y167,,,,,,,,,...,,,,,,,,,,
3,A0AVF1,Y174,,,,,,,,,...,,,,,,,,,,1.0
4,A0AVI2,Y1801,,,,,,,,,...,,,,,,,,,,


In [11]:
# If you sum the random_experiments it will tell you how many total sites were selected, which should match
# the parent experiment 
kinact_dict['Y'].random_experiments.sum()

KSTAR_ACCESSION     A0AUZ9A0AV02A0AVF1A0AVF1A0AVI2A0AVK6A0AVK6A0AV...
KSTAR_SITE          Y792Y107Y167Y174Y1801Y202Y316Y1046Y44Y558Y979Y...
data:time:5:0                                                     128
data:time:5:1                                                     128
data:time:5:2                                                     128
                                          ...                        
data:time:60:145                                                  200
data:time:60:146                                                  200
data:time:60:147                                                  200
data:time:60:148                                                  200
data:time:60:149                                                  200
Length: 602, dtype: object

## Calculate Mann-Whitney Significance

Finally, the preferred activity estimate uses the Mann-Whitney test to compare the distribution of real p-values (activity predictions on the real dataset) to the distribution of random p-values (activity predictions on the random datasets). In addition, a false positive rate is calculated by pulling out one of the random datasets and comparing it to all other random datasets using the same Mann-Whitney test. The parameter 'number_sig_trials' sets the number of times this calculation is repeated to obtain the false positive rate.
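The comparison can be sketched with a plain Mann-Whitney U test. The version below uses tie-averaged ranks and a normal approximation without tie or continuity correction, and it tests the one-sided alternative that the real p-values tend to be smaller than the random ones. Both the direction of the test and the function names are assumptions for illustration; this is not the code behind `Mann_Whitney_analysis`.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def mann_whitney_less(x, y):
    """Return (U_x, p) for the alternative 'x tends to be smaller than y'."""
    nx, ny = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u_x = sum(ranks[:nx]) - nx * (nx + 1) / 2
    mu = nx * ny / 2
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    z = (u_x - mu) / sigma
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
    return u_x, p

# Hypothetical p-values for one kinase: real networks vs. random datasets
real_p = [1e-6, 2e-6, 3e-6]
random_p = [0.1, 0.2, 0.3, 0.4, 0.5]
u, p = mann_whitney_less(real_p, random_p)
print(u, p)
```

The false positive rate described above reuses the same test: a held-out random dataset plays the role of `real_p`, and the fraction of such trials that still look significant estimates how often chance alone produces a hit.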

In [12]:
#run MW analysis on tyrosine. These will provide the final distribution based p-values used for final kinase activity scores
kstar_activity.Mann_Whitney_analysis(kinact_dict, activity_log, number_sig_trials = 100)

## Save KSTAR Results

There are several options for saving the results obtained from KSTAR. If all information is desired, use save_kstar, which saves all attributes found in the KinaseActivity object.

In [13]:
kstar_activity.save_kstar(kinact_dict, logName, odir)

However, this produces a very large file. If you do not need the random experiments that were generated for the normalization and Mann-Whitney calculations, it is recommended to use save_kstar_slim instead. This will still save all activity and fpr predictions.

In [14]:
kstar_activity.save_kstar_slim(kinact_dict, logName, odir)

Lastly, it is possible to just save each KinaseActivity attribute individually using pandas:

In [16]:
kinact_dict['Y'].activities_mann_whitney.to_csv(f'{logName}_Y_mann_whitney_activities.tsv', sep = '\t')
kinact_dict['Y'].fpr_mann_whitney.to_csv(f'{logName}_Y_mann_whitney_fpr.tsv', sep = '\t')