# Analysis of PTMs and Splicing

This notebook contains all analysis used to generate data for the analysis of PTMs and Splicing (detailed in our manuscript here). To run this notebook, you must have the resulting files from a complete mapping run of the [ExonPTMapper](https://github.com/NaegleLab/ExonPTMapper/tree/main) Python package.

## Table of Contents

1. [Load Data](#load-data)
2. [Exclusion of PTMs (Figure 2)](#ptm-exclusion)
    1. [Constitutive PTM Rates (Figure 2A/B)](#constitutive-ptm-rate)
        1. [All alternative isoforms (Figure 2B)](#all-alternative-isoforms)
        2. [Filtering out potentially non-functional transcripts (Supplementary Figure 8)](#filtering-out-potentially-non-functional-transcripts)
    2. [Biological Process and Function Enrichment of Splicing-Controlled PTMs (Supplementary Figure 9)](#biological-process-and-function-enrichment-of-splicing-controlled-ptms)
    3. [Splice Events Responsible for Exclusion (Figure 2C)](#splice-events-responsible-for-ptm-exclusion) 
    4. [Density of PTMs in Tissue Specific Exons (Figure 2D)](#density-of-ptms-in-tissue-specific-exons)
3. [Altered Flanking Sequences of PTMs (Figure 3)](#altered-flanking-sequences)
    1. [Modification specific flanking sequence alteration rates (Figure 3C)](#mod-specific-rates-of-alteration)
    2. [Events Causing Altered Flanking Sequences (Figure 3D)](#events-causing-altered-flanking-sequences)
    3. [Sequence Similarity Between Canonical and Alternative Flanking Sequences (Figure 3E)](#sequence-similarity-between-flanking-sequences-in-canonical-and-alternative)
    4. [Position of Altered Residues in Flanking Sequences (Supplementary Figure 11)](#position-of-altered-residues-in-flanking-sequence)
    5. [Kinase Library Analysis (Figure 3F-G)](#kinase-library-analysis)
4. [ESRP1-mediated Splicing in Prostate Cancer (Figure 4)](#esrp1-mediated-splicing-in-prostate-cancer)
    1. [Project PTMs onto the SpliceSeq splicegraph](#projecting-ptms-onto-spliceseq-exons)
    2. [Identifying differentially included PTMs in ESRP1-High Prostate Cancer Patients](#identify-differentially-included-ptm-containing-exons)
    3. [Gene set enrichment of ESRP1-correlated genes (Supplementary Figure 12)](#gene-set-enrichment-of-esrp1-correlated-genes)
    4. [Site-specific set enrichment of ESRP1-correlated PTMs (Supplementary Figure 13)](#site-enrichment-of-esrp1-correlated-ptms)
    5. [Kinase Enrichment of ESRP1-correlated Substrates (Supplementary Figure 16)](#kinase-enrichment-of-esrp1-correlated-substrates)
5. [Analysis of PTMs in Canonical and Alternative UniProt Isoforms (Supplementary Figure 2)](#analysis-of-ptms-in-canonical-and-alternative-uniprot-isoforms)

## Load Data

In [1]:
import pandas as pd
import numpy as np
import os
from collections import defaultdict
from Bio import pairwise2
from ExonPTMapper import mapping, config

from tqdm import tqdm

#import custom statistic functions
import stat_utils

#location of figshare data
figshare_dir = 'C:/Users/Sam/OneDrive/Documents/GradSchool/Research/Splicing/Paper_Prep/PTM_Splicing_FigShare_Update/'
#where to find projected PTMs
#ptm_data_dir = figshare_dir + 'PTM_Projection_Data/'
ptm_data_dir = 'C:/Users/Sam/OneDrive/Documents/GradSchool/Research/Splicing/Data/April18_2024/'
#where to find information from different databases
database_dir = figshare_dir + 'External_Data/'
#where analysis data will be saved
#analysis_dir = figshare_dir + 'Analysis_For_Paper/'
analysis_dir = figshare_dir + '/Analysis_For_Paper/'



#set global figure parameters
min_mods = 150



#load and process mapping data
mapper = mapping.PTM_mapper()
mapper.ptm_info.index = mapper.ptm_info['Protein'] + '_' + mapper.ptm_info['Residue']+mapper.ptm_info['PTM Location (AA)'].astype(str)
if os.path.exists(config.processed_data_dir + 'isoform_ptms.csv'):
    mapper.isoform_ptms = pd.read_csv(config.processed_data_dir + 'isoform_ptms.csv', header = 0)


#get modification class names and subtypes (converting between classes and subtypes)
mod_groups = pd.read_csv(database_dir + 'modification_conversion.csv', header = 0)


#separate ptm_info into unique exon/PTM pairs
exploded_ptms = mapper.explode_PTMinfo(explode_cols=['Transcripts', 'Gene Location (NC)', 'Transcript Location (NC)', 'Exon Location (NC)', 'Exon stable ID', 'Exon rank in transcript', 'Distance to C-terminal Splice Boundary (NC)', 'Distance to N-terminal Splice Boundary (NC)'])
exploded_ptms = exploded_ptms.dropna(subset = ['Exon stable ID'])


#separate based on modifications (unique PTM/modification class pairs)
exploded_mods = mapper.ptm_info.copy()
exploded_mods["Modification Class"] = exploded_mods['Modification Class'].apply(lambda x: x.split(';'))
exploded_mods = exploded_mods.explode('Modification Class').reset_index()
exploded_mods = exploded_mods.rename({'index':'PTM'}, axis = 1)
exploded_mods = exploded_mods.drop_duplicates()



Downloading Canonical UniProt isoforms
Downloading ID translator file
Loading mapper object from .tsv files
Loading exon-specific data
Loading transcript-specific data
Loading gene-specific info
Loading unique protein isoforms
Loading protein-specific info
Loading information on PTMs on canonical proteins
Loading genomic coordinates of PTMs associated with canonical proteins




Loading information on PTMs on alternative proteins




In [8]:
tmp = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical']
tmp[tmp["PTM Conservation Score"] == 1].shape[0]/tmp.shape[0]

0.6880696014811886

## Get isoform-specific PTM information

The code below collapses PTM information to unique protein isoforms (isoforms with a unique amino acid sequence), after which the constitutive rates and the rates of altered flanking sequences can be calculated. These functions are part of the standard pipeline code but are not run automatically when the pipeline is executed.


[Return to Table of Contents](#table-of-contents)

In [7]:
#collapse alternative ptms dataframe to only include unique protein isoforms
mapper.getIsoformSpecificPTMs()
#calculate fraction of isoforms containing a given ptm
mapper.calculate_PTMconservation()

#get flanking sequence changes
for size in range(1,11):
    mapper.isoform_ptms[f'Conserved Flank (Size = {size})'] = mapper.compareAllFlankSeqs(flank_size=size)

#get tryptic fragments
mapper.isoform_ptms['Conserved Fragment'] = mapper.compareAllTrypticFragments()

100%|██████████| 1339434/1339434 [00:30<00:00, 44110.40it/s]


In [8]:
canonical_isoforms = config.translator.loc[config.translator['UniProt Isoform Type'] == 'Canonical', 'Transcript stable ID'].unique()
alternative_isoform_ptms = mapper.isoform_ptms.copy()
#alternative_isoform_ptms['Isoform ID'] = alternative_isoform_ptms['Source of PTM'].apply(lambda x: x.split('_')[0])
alternative_isoform_ptms = alternative_isoform_ptms[~alternative_isoform_ptms['Isoform ID'].isin(canonical_isoforms)]
alternative_isoform_ptms.groupby('Mapping Result').size()/alternative_isoform_ptms.shape[0]

Mapping Result
Not Found           0.175347
Ragged Insertion    0.000372
Residue Mismatch    0.000037
Success             0.824244
dtype: float64

In [9]:
mapper.isoform_ptms.to_csv(ptm_data_dir + 'processed_data_dir/isoform_ptms.csv', index = False)
mapper.ptm_info.to_csv(ptm_data_dir + 'processed_data_dir/ptm_info.csv')


## PTM Exclusion

The following code was used to generate data for Figure 2 and supplementary figures 4-10, all focused on the exclusion/inclusion of PTMs across isoforms.

[Return to Table of Contents](#table-of-contents)

### Constitutive PTM Rate

The first analysis we performed after projecting PTMs onto alternative isoforms was to determine the fraction of PTMs that could be defined as constitutive (found in all isoforms of a given gene). 
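
As a minimal sketch of that definition, the toy example below scores each PTM by the fraction of a gene's isoforms that retain it and calls a PTM constitutive when that fraction equals 1. The isoform IDs, PTM labels, and the `conservation_scores` helper are illustrative assumptions, not the ExonPTMapper implementation, which may treat the canonical isoform or unmapped transcripts differently.

```python
# Toy illustration only: isoform IDs and PTM labels are made up.
# Maps each isoform of one hypothetical gene to the PTM sites it contains.
isoform_ptms_toy = {
    "iso1": {"S10", "T45", "Y102"},
    "iso2": {"S10", "Y102"},
    "iso3": {"S10", "T45"},
}

def conservation_scores(isoform_to_ptms, reference):
    """Fraction of isoforms that contain each PTM of the reference isoform."""
    n_isoforms = len(isoform_to_ptms)
    return {
        ptm: sum(ptm in ptms for ptms in isoform_to_ptms.values()) / n_isoforms
        for ptm in sorted(isoform_to_ptms[reference])
    }

scores_toy = conservation_scores(isoform_ptms_toy, reference="iso1")
# A PTM is constitutive when every isoform retains it (score of exactly 1).
constitutive_toy = [ptm for ptm, score in scores_toy.items() if score == 1]
constitutive_rate_toy = len(constitutive_toy) / len(scores_toy)
print(scores_toy)
print(constitutive_rate_toy)
```

Here only `S10` is present in all three isoforms, so one of the three reference PTMs is constitutive.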

In [7]:
overall_rate, num_isoforms = mapper.calculate_PTMconservation(return_score = True)
print('Overall PTM Constitutive Rate: ' + str(overall_rate))

Overall PTM Constitutive Rate: 0.6822873222451133


#### All Alternative Isoforms

Constitutive PTM rates were calculated for all protein-coding transcripts in Ensembl, grouped by broad modification classes and modification subtypes.

[Return to Table of Contents](#table-of-contents)

In [9]:
groupings = ['Modification Class', 'Modification']
fname = ['ByModificationClass', 'ByModificationSubtype']

for group, f in zip(groupings, fname):
    #get non-duplicate data for the group type
    if group == 'Modification Class':
        mod_data = exploded_mods[exploded_mods['Isoform Type'] == 'Canonical'].copy()
    elif group == 'Modification':
        exploded_subtype = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical'].copy()
        exploded_subtype["Modification"] = exploded_subtype['Modification'].apply(lambda x: x.split(';'))
        exploded_subtype = exploded_subtype.explode('Modification').reset_index()
        exploded_subtype = exploded_subtype.rename({'index':'PTM'}, axis = 1)
        exploded_subtype = exploded_subtype.drop_duplicates(subset = ['Modification', 'PTM'])
        mod_data = exploded_subtype.copy()

    #get number of modifications
    sizes = mod_data.groupby(group).size()
    sizes = sizes.sort_values(ascending = False)
    #get the number of constitutive ptms, then add in any mod types that don't have any constitutive ptms
    constitutive_ptms = mod_data[mod_data['PTM Conservation Score'] == 1]
    grouped_conserved = constitutive_ptms.groupby(group).size()
    #add in any modification types that don't have any constitutive ptms
    for mod in sizes.index:
        if mod not in grouped_conserved.index.values:
            grouped_conserved[mod] = 0
    rate_data = grouped_conserved[sizes.index]/sizes
    rate_data = pd.concat([sizes, rate_data], axis = 1)
    rate_data.columns = ['Number of Instances in Proteome', 'Rate']
    
    #save data
    if not os.path.exists(analysis_dir + '/Constitutive_Rates'):
        os.makedirs(analysis_dir + '/Constitutive_Rates')
    rate_data.to_csv(analysis_dir + f'/Constitutive_Rates/{f}.csv')
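
The grouping logic in the cell above can be tried on a small made-up table. This sketch reuses the notebook's column names but invented values, and it swaps the explicit fill-in loop for `reindex(..., fill_value=0)`, which is one possible way to give classes without constitutive PTMs a count of zero.

```python
import pandas as pd

# Toy table: one row per PTM, classes separated by ';' (illustrative values only).
toy = pd.DataFrame({
    "PTM": ["P1_S10", "P1_K20", "P2_T5", "P2_K8"],
    "Modification Class": ["Phosphorylation", "Acetylation;Ubiquitination",
                           "Phosphorylation", "Ubiquitination"],
    "PTM Conservation Score": [1.0, 0.5, 0.25, 0.5],
})

# One row per (PTM, modification class) pair.
toy["Modification Class"] = toy["Modification Class"].str.split(";")
toy = toy.explode("Modification Class")

# Total and constitutive counts per class.
sizes = toy.groupby("Modification Class").size()
constitutive = toy[toy["PTM Conservation Score"] == 1].groupby("Modification Class").size()
# Classes without any constitutive PTM are missing from the groupby result; fill them with 0.
constitutive = constitutive.reindex(sizes.index, fill_value=0)

rate_toy = pd.DataFrame({"Number of Instances in Proteome": sizes,
                         "Rate": constitutive / sizes})
print(rate_toy)
```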

#### Filtering out potentially non-functional transcripts

To account for the possibility that some transcripts may not ultimately code for functional proteins, we repeated the analysis of constitutive PTM rates after filtering using different criteria for functional transcripts.

[Return to Table of Contents](#table-of-contents)

##### By Database

In [14]:
scores = {}
num_isoforms = {}
label = 'Ensembl (All)'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(return_score = True)

label = 'UniProt'
uniprot_isoforms = config.translator.loc[config.translator['UniProt Isoform Type'] == 'Alternative', 'Transcript stable ID'].values
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = uniprot_isoforms, save_col = 'UniProt Isoform Score', return_score = True)

appris_isoforms = mapper.transcripts.dropna(subset = 'APPRIS annotation').index.values
label = 'APPRIS'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = appris_isoforms, save_col = 'APPRIS Score', return_score = True)

label = 'CCDS'
ccds_isoforms = config.translator.dropna(subset = 'CCDS ID')['Transcript stable ID'].values
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = ccds_isoforms, save_col = 'CCDS Isoform Score', return_score = True)

label = 'PDB'
pdb_isoforms = config.translator.dropna(subset = 'PDB ID')['Transcript stable ID'].values
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = pdb_isoforms, save_col = 'PDB Isoform Score', return_score = True)

label = 'RefSeq'
refseq_isoforms = config.translator.dropna(subset = 'RefSeq mRNA ID')['Transcript stable ID'].values
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = refseq_isoforms, save_col = 'Refseq Isoform Score', return_score = True)

#combine data and save
scores = pd.Series(scores, name = 'Constitutive Rate')
num_isoforms = pd.Series(num_isoforms, name = 'Number of Isoforms')
scores = pd.concat([scores, num_isoforms], axis = 1)  

if not os.path.exists(analysis_dir + '/Constitutive_Rates/Filtered'):
    os.makedirs(analysis_dir + '/Constitutive_Rates/Filtered')
scores.to_csv(analysis_dir + '/Constitutive_Rates/Filtered/ByDatabase.csv')


#save mapper ptm info with added rate columns
mapper.ptm_info.to_csv(ptm_data_dir + 'processed_data_dir/ptm_info.csv', index = False)

100%|██████████| 51133/51133 [01:58<00:00, 430.23it/s]
100%|██████████| 51133/51133 [00:52<00:00, 974.91it/s] 
100%|██████████| 51133/51133 [05:59<00:00, 142.22it/s]
100%|██████████| 51133/51133 [04:47<00:00, 177.88it/s]
100%|██████████| 51133/51133 [10:56<00:00, 77.90it/s]   


##### By Transcript Support Level

In [15]:
scores = {}
num_isoforms = {}

#all isoforms
label = 'All'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(return_score = True)

#transcript support level
tsl_transcripts = mapper.transcripts.dropna(subset = 'Transcript support level (TSL)').copy()
tsl_col = tsl_transcripts['Transcript support level (TSL)']
#each subset adds the next support level; the regex alternation matches any of the listed levels
tsl1_isoforms = tsl_transcripts.loc[tsl_col.str.contains('tsl1')].index.values
tsl1_2_isoforms = tsl_transcripts.loc[tsl_col.str.contains('tsl1|tsl2')].index.values
tsl1_2_3_isoforms = tsl_transcripts.loc[tsl_col.str.contains('tsl1|tsl2|tsl3')].index.values
tsl1_2_3_4_isoforms = tsl_transcripts.loc[tsl_col.str.contains('tsl1|tsl2|tsl3|tsl4')].index.values
tsl1_2_3_4_5_isoforms = tsl_transcripts.loc[tsl_col.str.contains('tsl1|tsl2|tsl3|tsl4|tsl5')].index.values
label = 'TSL1/2/3/4/5'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = tsl1_2_3_4_5_isoforms, save_col = 'TSL1/2/3/4/5 Score', return_score = True)
label = 'TSL1/2/3/4'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = tsl1_2_3_4_isoforms, save_col = 'TSL1/2/3/4 Score', return_score = True)
label = 'TSL1/2/3'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = tsl1_2_3_isoforms, save_col = 'TSL1/2/3 Score', return_score = True)
label = 'TSL1/2'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = tsl1_2_isoforms, save_col = 'TSL1/2 Score', return_score = True)
label = 'TSL1'
scores[label], num_isoforms[label] = mapper.calculate_PTMconservation(transcript_subset = tsl1_isoforms, save_col = 'TSL1 Score', return_score = True)


#combine data and save
scores = pd.Series(scores, name = 'Constitutive Rate')
num_isoforms = pd.Series(num_isoforms, name = 'Number of Isoforms')
scores = pd.concat([scores, num_isoforms], axis = 1)  

if not os.path.exists(analysis_dir + '/Constitutive_Rates/Filtered'):
    os.makedirs(analysis_dir + '/Constitutive_Rates/Filtered')
scores.to_csv(analysis_dir + '/Constitutive_Rates/Filtered/ByTranscriptSupportLevel.csv')

100%|██████████| 51133/51133 [01:29<00:00, 569.27it/s]
100%|██████████| 51133/51133 [01:07<00:00, 756.58it/s]
100%|██████████| 51133/51133 [01:00<00:00, 846.71it/s]
100%|██████████| 51133/51133 [00:51<00:00, 992.27it/s] 
100%|██████████| 51133/51133 [00:35<00:00, 1441.98it/s]


##### By TRIFID Score

In [16]:
scores ={}
num_isoforms = {}
for cutoff in tqdm(np.arange(0, 1, 0.05)):
    alt_transcripts = mapper.transcripts[mapper.transcripts['TRIFID Score'] > cutoff].index.values
    scores[cutoff], num_isoforms[cutoff] = mapper.calculate_PTMconservation(transcript_subset = alt_transcripts, save_col = 'Tmp TRIFID Col', return_score = True)

#combine data and save
scores = pd.Series(scores, name = 'Constitutive Rate')
num_isoforms = pd.Series(num_isoforms, name = 'Number of Isoforms')
scores = pd.concat([scores, num_isoforms], axis = 1)  

if not os.path.exists(analysis_dir + '/Constitutive_Rates/Filtered'):
    os.makedirs(analysis_dir + '/Constitutive_Rates/Filtered')
scores.to_csv(analysis_dir + '/Constitutive_Rates/Filtered/ByTRIFID.csv')

100%|██████████| 51133/51133 [01:34<00:00, 539.01it/s]
100%|██████████| 51133/51133 [01:14<00:00, 682.26it/s]
100%|██████████| 51133/51133 [00:57<00:00, 886.95it/s]
100%|██████████| 51133/51133 [00:56<00:00, 897.21it/s] 
100%|██████████| 51133/51133 [00:47<00:00, 1067.76it/s]
100%|██████████| 51133/51133 [00:48<00:00, 1063.01it/s]
100%|██████████| 51133/51133 [00:46<00:00, 1105.58it/s]
100%|██████████| 51133/51133 [00:44<00:00, 1149.40it/s]
100%|██████████| 51133/51133 [00:47<00:00, 1082.72it/s]
100%|██████████| 51133/51133 [00:47<00:00, 1071.27it/s]
100%|██████████| 51133/51133 [00:47<00:00, 1074.49it/s]
100%|██████████| 51133/51133 [00:45<00:00, 1125.22it/s]
100%|██████████| 51133/51133 [00:43<00:00, 1167.73it/s]
100%|██████████| 51133/51133 [00:45<00:00, 1125.83it/s]
100%|██████████| 51133/51133 [00:44<00:00, 1136.81it/s]
100%|██████████| 51133/51133 [00:43<00:00, 1176.25it/s]
100%|██████████| 51133/51133 [00:42<00:00, 1193.58it/s]
100%|██████████| 51133/51133 [00:41<00:00, 1229.65it/s]
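
The cutoff sweep in this cell can be pictured with a toy example. The sketch below is an assumption about the general idea rather than a reproduction of `calculate_PTMconservation`: transcripts whose (made-up) functional score does not exceed the cutoff are dropped, and the constitutive rate is recomputed over the transcripts that remain.

```python
# Hypothetical transcripts of one gene: a functional score and the PTMs each retains.
transcripts_toy = {
    "tx_canonical": {"score": 0.95, "ptms": {"S10", "T45"}},
    "tx_alt1": {"score": 0.60, "ptms": {"S10", "T45"}},
    "tx_alt2": {"score": 0.10, "ptms": {"S10"}},
}
canonical_ptms = transcripts_toy["tx_canonical"]["ptms"]

def constitutive_rate(cutoff):
    """Return (constitutive rate, number of transcripts kept) for a score cutoff."""
    kept = [t for t in transcripts_toy.values() if t["score"] > cutoff]
    # A canonical PTM is constitutive if every retained transcript contains it.
    n_const = sum(all(ptm in t["ptms"] for t in kept) for ptm in canonical_ptms)
    return n_const / len(canonical_ptms), len(kept)

sweep_toy = {cutoff: constitutive_rate(cutoff) for cutoff in (0.0, 0.5)}
print(sweep_toy)
```

Raising the cutoff removes the low-scoring transcript that lacks `T45`, so the rate rises from 0.5 to 1.0 while the number of transcripts considered falls from 3 to 2.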

### Biological Process and Function Enrichment of Splicing-Controlled PTMs

Identify the biological processes and functions enriched in genes with splicing-controlled PTMs, based on annotations from PhosphoSitePlus. Used for supplementary figure 9.

[Return to Table of Contents](#table-of-contents)
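
The internals of `stat_utils.generate_site_enrichment` are not shown in this notebook. As a hedged sketch of what an enrichment test with `fishers = True` typically involves, the block below computes a one-sided Fisher's exact p-value from a 2x2 table using only the standard library. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment in the 2x2 table [[a, b], [c, d]].

    a: in subset, annotated        b: in subset, not annotated
    c: not in subset, annotated    d: not in subset, not annotated
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    max_a = min(row1, col1)
    # Hypergeometric tail: probability of seeing at least `a` annotated sites in the subset.
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, max_a + 1))
    return tail / denom

# Made-up counts: 3 of 4 subset PTMs carry the annotation, versus 1 of 4 background PTMs.
p_toy = fisher_exact_greater(3, 1, 1, 3)
print(p_toy)
```

In practice `scipy.stats.fisher_exact` covers the same calculation, including the two-sided case.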

In [17]:
#annotations from phosphositeplus
annotations = pd.read_csv(database_dir + '/PhosphoSitePlus/Regulatory_sites.gz', sep = '\t',compression = 'gzip', on_bad_lines='skip', header = 2)
annotations['Substrate'] = annotations['ACC_ID'] + '_' + annotations['MOD_RSD'].apply(lambda x: x.split('-')[0])

#add ptm column to match phosphositeplus information (isoform id for only alternative isoforms)
ptm_info = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical'].copy()
ptm_info['PTM'] = ptm_info.apply(lambda x: x['Protein'].split('-')[0]+'_'+x['Residue']+str(x['PTM Location (AA)']) if x['Isoform Type'] == 'Canonical' else x['Protein']+'_'+x['Residue']+str(x['PTM Location (AA)']), axis = 1) 

print('Getting enrichment of function for all ptms')
#get list of constitutive and non-constitutive ptms
constitutive_ptms = ptm_info[ptm_info['PTM Conservation Score'] == 1]
non_constitutive_ptms = ptm_info[ptm_info['PTM Conservation Score'] != 1]


function_table = stat_utils.constructPivotTable(ptm_info, annotations, reference_col='ON_FUNCTION', collapse_on_similar = True, include_unknown=True)
process_table = stat_utils.constructPivotTable(ptm_info,annotations, reference_col='ON_PROCESS', collapse_on_similar = True, include_unknown=True)

function_enrichment = stat_utils.generate_site_enrichment(constitutive_ptms['PTM'].values, function_table, subset_name = 'Constitutive', type = 'Function', fishers = True)
process_enrichment = stat_utils.generate_site_enrichment(constitutive_ptms['PTM'].values, process_table, subset_name = 'Constitutive', type = 'Process', fishers = True)

function_enrichment.to_csv(analysis_dir + '/Constitutive_Rates/Enrichment/Function_Enrichment.csv')
process_enrichment.to_csv(analysis_dir + '/Constitutive_Rates/Enrichment/Process_Enrichment.csv')

Getting enrichment of function for all ptms


In [2]:

def constructPivotTable(annotated_ptms, reference_col, database = 'PhosphoSitePlus', collapse_on_similar = False, include_unknown = False):
    """
    Given a PTM dataframe that already carries annotations (for example, regulatory data from PhosphoSitePlus), create a table with PTMs in the rows and annotations in the columns, with 1 indicating that the PTM has that annotation

    Parameters
    ----------
    annotated_ptms : pandas dataframe
        dataframe containing PTM data, with a 'PTM' column and the annotation column named by reference_col
    reference_col : str
        column in annotated_ptms to use as annotations (e.g. 'ON_FUNCTION')
    database : str, optional
        name of the annotation source. Not used in the function body. The default is 'PhosphoSitePlus'.
    collapse_on_similar : bool, optional
        whether to collapse similar annotations into one category (keeps the text before the first comma). The default is False.
    include_unknown : bool, optional
        whether to label PTMs without an annotation as 'unknown' instead of dropping them. The default is False.

    Returns
    -------
    annotation : pandas dataframe
        dataframe with PTMs in the rows and annotations in the columns, with 1 indicating that the PTM has that annotation
    """
    #create matrix indicating the annotations of each ptm: ptm in the rows, annotation in columns, and 1 indicating that the ptm has that annotation
    annotation = annotated_ptms.copy()
    if include_unknown:
        annotation.loc[annotation[reference_col].isna(), reference_col] = 'unknown'
    annotation = annotation.dropna(subset = reference_col)
    annotation[reference_col] = annotation[reference_col].apply(lambda x: x.split(';') if x == x else x)
    annotation = annotation.explode(reference_col).reset_index()
    if collapse_on_similar:
        annotation[reference_col] = annotation[reference_col].apply(lambda x: x.split(',')[0].strip(' ') if x == x else x)
    else:
        annotation[reference_col] = annotation[reference_col].apply(lambda x: x.strip(' ') if x == x else x)
    annotation['value'] = 1
    annotation = annotation[['PTM',reference_col, 'value']].drop_duplicates()
    annotation = annotation.pivot(index = 'PTM', columns = reference_col, values = 'value')
    #remove any sites with no functions
    annotation = annotation.dropna(how = 'all')
    return annotation
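
To see what this table looks like, the steps of the function can be traced on a small made-up input. The sketch below inlines the same sequence (fill unknowns, split on `;`, keep the text before the first comma, pivot) with `include_unknown` and `collapse_on_similar` both enabled. The PTM labels and annotation strings are illustrative only.

```python
import pandas as pd

# Toy annotated PTMs (made-up values in the style of a ';'-separated annotation column).
toy = pd.DataFrame({
    "PTM": ["P1_S10", "P1_T20", "P2_Y5"],
    "ON_FUNCTION": ["activity, induced; phosphorylation", None, "activity, inhibited"],
})

col = "ON_FUNCTION"
toy[col] = toy[col].fillna("unknown")                     # include_unknown=True
toy[col] = toy[col].str.split(";")                        # one list entry per annotation
toy = toy.explode(col)
toy[col] = toy[col].str.split(",").str[0].str.strip()     # collapse_on_similar=True
toy["value"] = 1

# PTMs in rows, annotations in columns, 1 where the PTM has the annotation (NaN otherwise).
table_toy = (toy[["PTM", col, "value"]]
             .drop_duplicates()
             .pivot(index="PTM", columns=col, values="value"))
print(table_toy)
```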

In [3]:
ks_dataset = pd.read_csv(database_dir + '/PhosphoSitePlus/Kinase_Substrate_Dataset.tsv', sep = '\t', on_bad_lines='skip')
ks_dataset  = ks_dataset[(ks_dataset['SUB_ORGANISM'] == 'human') & (ks_dataset['KIN_ORGANISM'] == 'human')]
ks_dataset['Substrate'] = ks_dataset['SUB_ACC_ID'] + '_' + ks_dataset['SUB_MOD_RSD']


#add ptm column to match phosphositeplus information (isoform id for only alternative isoforms)
annotated_ptms = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical'].copy()
annotated_ptms['PTM'] = annotated_ptms.apply(lambda x: x['Protein'].split('-')[0]+'_'+x['Residue']+str(x['PTM Location (AA)']) if x['Isoform Type'] == 'Canonical' else x['Protein']+'_'+x['Residue']+str(x['PTM Location (AA)']), axis = 1) 
annotated_ptms = annotated_ptms.merge(ks_dataset[['GENE', 'Substrate']], right_on = 'Substrate', left_on = 'PTM', how = 'left')


regphos = pd.read_csv('http://140.138.144.141/~RegPhos/download/RegPhos_Phos_human.txt', sep = '\t')


regphos = regphos.dropna(subset = 'catalytic kinase')
#regphos['Residue'] = regphos['code'] + regphos['position'].astype(str)
regphos = regphos.rename(columns = {'code': 'Residue', 'position':'PTM Position in Canonical Isoform', 'AC': 'UniProtKB Accession', 'catalytic kinase': 'RegPhos:Kinase'})
regphos['PTM'] = regphos['UniProtKB Accession'] + '_' + regphos['Residue'] + regphos['PTM Position in Canonical Isoform'].astype(str)
regphos = regphos[['PTM', 'RegPhos:Kinase']].dropna()
regphos = regphos.groupby(['PTM']).agg(';'.join).reset_index()

#add to splice data
annotated_ptms = annotated_ptms.merge(regphos, how = 'left', on = ['PTM'])



In [4]:
regphos_conversion = {'CK2A1':'CSNK2A1', 'PKACA':'PRKACA', 'ABL1(ABL)':'ABL1'}
def combine_kinases(row):
    psp = row['GENE'].split(';') if row['GENE'] == row['GENE'] else []
    regphos = row['RegPhos:Kinase'].split(';') if row['RegPhos:Kinase'] == row['RegPhos:Kinase'] else []
    for i, rp in enumerate(regphos):
        if rp.upper() in regphos_conversion:
            regphos[i] = regphos_conversion[rp.upper()]
        else:
            regphos[i] = rp.upper()
    combined = np.unique(psp+regphos)
    if len(combined) > 0:
        return ';'.join(combined)
    else:
        return np.nan

annotated_ptms['Combined:Kinase'] = annotated_ptms.apply(combine_kinases, axis = 1)

In [6]:
#add ptm column to match phosphositeplus information (isoform id for only alternative isoforms)
ptm_info = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical'].copy()
ptm_info['PTM'] = ptm_info.apply(lambda x: x['Protein'].split('-')[0]+'_'+x['Residue']+str(x['PTM Location (AA)']) if x['Isoform Type'] == 'Canonical' else x['Protein']+'_'+x['Residue']+str(x['PTM Location (AA)']), axis = 1) 

print('Getting enrichment of function for all ptms')
#get list of constitutive and non-constitutive ptms
constitutive_ptms = ptm_info[ptm_info['PTM Conservation Score'] == 1]
non_constitutive_ptms = ptm_info[ptm_info['PTM Conservation Score'] != 1]


kinase_table = constructPivotTable(annotated_ptms, reference_col='Combined:Kinase', collapse_on_similar = False, include_unknown=True)

kinase_enrichment = stat_utils.generate_site_enrichment(constitutive_ptms['PTM'].values, kinase_table, subset_name = 'Constitutive', type = 'Kinase', fishers = True)
kinase_enrichment.to_csv(analysis_dir + '/Constitutive_Rates/Enrichment/Kinase_Enrichment.csv')

Getting enrichment of function for all ptms


### Splice events responsible for PTM exclusion

Given splice event data obtained from the ExonPTMapper package, calculate the fraction of PTM exclusion events that are caused by skipped exon events, alternative splice sites, and mutually exclusive exons.

[Return to Table of Contents](#table-of-contents)

In [10]:
#grab ptms in alternative isoforms
alternative_ptms = mapper.alternative_ptms.copy()

#extract events causing loss of PTMs
alternative_ptms = alternative_ptms[alternative_ptms['Mapping Result'] == 'Not Found']
#remove canonical isoforms and transcripts not associated with gene name in translator object, which might mean entry is old and makes splice event mapping more difficult
canonical_transcripts = mapper.transcripts[mapper.transcripts['UniProt Isoform Type'] == 'Canonical'].index
missing_gene_name=config.translator.loc[config.translator['Gene name'].isna(), 'Transcript stable ID'].unique()
alternative_ptms = alternative_ptms[(~alternative_ptms['Event Type'].isna()) & (~alternative_ptms['Alternative Transcript'].isin(canonical_transcripts)) & (~alternative_ptms['Alternative Transcript'].isin(missing_gene_name))]

alternative_ptms['Event Type'] = alternative_ptms['Event Type'].apply(lambda x: x.split(';'))
alternative_ptms = alternative_ptms.explode('Event Type').dropna(subset = 'Event Type')

#separate based on modifications (unique PTM/modification class pairs)
exploded_res = alternative_ptms.copy()
exploded_res['Modification Class'] = exploded_res['Modification Class'].apply(lambda x: x.split(';'))
exploded_res = exploded_res.explode('Modification Class')
#exploded_res = exploded_res.merge(mod_groups[['Modification', 'Mod Class Code']], left_on = 'Modification', right_on = 'Modification')

#extract only true splice events (ignore alternative promoters, etc.)
events_to_keep = ["3' ASS", "3' and 5' ASS", "5' ASS", 'Skipped', 'Mutually Exclusive']
exploded_res = exploded_res[exploded_res['Event Type'].isin(events_to_keep)]
	
#group by modification type and event type
grouped_res = exploded_res.groupby(['Modification Class','Event Type'])
grouped_res = grouped_res.size().reset_index()

#count number of events for each modification
sizes = grouped_res.groupby('Modification Class')[0].sum().sort_values(ascending = False)
mods_to_keep = sizes.index

total_events = {}
for mod in grouped_res['Modification Class'].unique():
    mod_grouped = grouped_res[grouped_res['Modification Class'] == mod]
    total_events[mod] = mod_grouped[0].sum()

possible_events = grouped_res['Event Type'].unique()
plt_data = []
mods = mods_to_keep
for event in possible_events:
    event_data = []
    event_grouped = grouped_res[grouped_res['Event Type'] == event]
    for mod in mods:
        if mod in event_grouped['Modification Class'].values:
            event_data.append(int(event_grouped.loc[event_grouped['Modification Class'] == mod, 0].values[0])/total_events[mod])
        else:
            event_data.append(0)
    plt_data.append(event_data)

event_data = pd.DataFrame(plt_data, index = possible_events, columns = mods_to_keep)

In [11]:
event_data.to_csv(analysis_dir + '/Constitutive_Rates/Splice_Event_Fractions.csv')
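
The saved table holds, for each modification class, the fraction of its exclusion events attributed to each event type. A compact way to check that layout on made-up data is `pd.crosstab` with `normalize="columns"`. This is offered as an equivalent formulation on toy input, not as the code used for the figure.

```python
import pandas as pd

# Toy (modification class, event type) pairs; values are illustrative only.
toy_events = pd.DataFrame({
    "Modification Class": ["Phosphorylation"] * 4 + ["Acetylation"] * 2,
    "Event Type": ["Skipped", "Skipped", "3' ASS", "Mutually Exclusive",
                   "Skipped", "5' ASS"],
})

# Event types in rows, modification classes in columns; each column sums to 1.
fractions_toy = pd.crosstab(toy_events["Event Type"],
                            toy_events["Modification Class"],
                            normalize="columns")
print(fractions_toy)
```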

### Density of PTMs in tissue specific exons

Tissue-specific exons were extracted from three publications, and our projected PTM data was used to determine the density of PTMs in these exons.

[Return to Table of Contents](#table-of-contents)

In [31]:
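import importlib
import tissue_specificity
#reload to pick up any edits made to tissue_specificity.py since it was first imported
importlib.reload(tissue_specificity)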


In [5]:
import tissue_specificity
all_data, densities = tissue_specificity.getTSData(mapper, exploded_ptms, exploded_mods, mod_groups, figshare_dir = figshare_dir)

Getting PTM Density across the entire proteome
Getting PTMs in tissue specific exons from Buljan et al. (2012)
Getting PTMs in tissue specific exons from Rodriguez et al. (2020)
Getting PTMs in tissue specific exons from Gonzalez-Porta et al. (2013)
Calculating PTM density in tissue-specific exons


In [6]:
if not os.path.exists(analysis_dir + '/Tissue_Specificity'):
    os.mkdir(analysis_dir + '/Tissue_Specificity')
densities.to_csv(analysis_dir + '/Tissue_Specificity/PTM_Densities.csv')
all_data.to_csv(analysis_dir + '/Tissue_Specificity/PTMs_in_TissueSpecificExons.csv')

## Altered Flanking Sequences (Figure 3)

### Mod specific rates of alteration

Here, we looked at the rate of flanking sequence alteration across different window sizes (the number of residues on either side of the PTM), broken down by modification type. This data was used for Figure 3C.

[Return to Table of Contents](#table-of-contents)
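
To make the window size concrete, the sketch below extracts the flanking sequence around a site for several window sizes and checks whether it matches between a canonical and an alternative sequence. The sequences, the `flank` helper, and the assumption that the site sits at the same position in both isoforms are illustrative and do not come from ExonPTMapper.

```python
def flank(seq, pos, size):
    """Residues within `size` positions of a 1-based site; the site itself is lowercased."""
    i = pos - 1
    left = seq[max(0, i - size):i]
    right = seq[i + 1:i + 1 + size]
    return left + seq[i].lower() + right

# Made-up sequences: the alternative differs at positions 8 and 9.
canonical_toy = "MKTAYSPEELRQ"
alternative_toy = "MKTAYSPGGLRQ"
site = 6  # serine at position 6 (1-based), assumed to be at the same position in both

# True when the flanking sequence of that size is identical in both sequences.
conserved_toy = {
    size: flank(canonical_toy, site, size) == flank(alternative_toy, site, size)
    for size in range(1, 6)
}
print(conserved_toy)
```

The flank is conserved only at a window size of 1. From size 2 onward the window reaches the altered residue at position 8, which is why the alteration rate grows with window size.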

In [23]:
def getModSpecificChanges(mapper, mod_groups, flank_size = 5, return_fraction_only = True):
    """
    Calculate the fraction of ptms with altered flanking sequence of the given flank size for each modification group

    Parameters
    ----------
    mapper: PTM_mapper object
        PTM_mapper object containing alternative and canonical flanking sequence data
    mod_groups: pandas dataframe
        dataframe for conversion from modification subtypes ('Mod Name') to modification classes ('Mod Class')
    flank_size: int
        size of the flanking sequence to compare. IMPORTANT: this should not be larger than the available flanking sequence in the ptm_info dataframe
    return_fraction_only: bool
        If True, return only the fraction of ptms with altered flanking sequence. If False, return the fraction of ptms with altered flanking sequence and the total number of ptms altered + total number of ptms
    
    Returns
    -------
    If return_fraction_only is True:
        fraction_altered: pandas Series
            Series containing the fraction of ptms with altered flanking sequence for each modification group
    If return_fraction_only is False:
        results: pandas dataframe
            dataframe containing the fraction of ptms with altered flanking sequence and the total number of ptms altered + total number of ptms for each modification group

    """
    conserved = mapper.isoform_ptms[mapper.isoform_ptms['Mapping Result'] == 'Success']
    conserved = conserved[conserved["Isoform Type"] == 'Alternative']
    conserved = conserved.drop_duplicates()

    #separate into mod specific rows
    conserved['Modification Class'] = conserved['Modification Class'].apply(lambda x: x.split(';'))
    conserved = conserved.explode('Modification Class')
    #conserved = conserved.merge(mod_groups[['Mod Name', 'Mod Class']], left_on = 'Modification', right_on = 'Mod Name', how = 'left')
    #conserved = conserved.drop(['Mod Name', 'Modification'], axis = 1)
    #conserved = conserved.drop_duplicates()
    #calculate number of each modification type that are conserved
    num_single_mods_conserved = conserved.groupby('Modification Class').size()
    #calculate number of each modification type that are conserved AND have conserved flank
    num_single_mods_conserved_flank = conserved[conserved[f'Conserved Flank (Size = {flank_size})'] == 1].groupby('Modification Class').size()
    #calculate number of each modification type that are conserved AND have altered flank
    num_single_mods_altered_flank = conserved[conserved[f'Conserved Flank (Size = {flank_size})'] == 0].groupby('Modification Class').size()

    #fill in any mods without a conserved or altered flank
    for mod in num_single_mods_conserved.index:
        if mod not in num_single_mods_altered_flank.index.values:
            num_single_mods_altered_flank[mod] = 0
        if mod not in num_single_mods_conserved_flank.index.values:
            num_single_mods_conserved_flank[mod] = 0
    fraction_of_mod_with_altered_flank = num_single_mods_altered_flank/num_single_mods_conserved

    if return_fraction_only:
        fraction_of_mod_with_altered_flank.name = flank_size
        return fraction_of_mod_with_altered_flank
    else:
        results = pd.concat([num_single_mods_conserved, num_single_mods_conserved_flank, num_single_mods_altered_flank, fraction_of_mod_with_altered_flank], axis = 1)
        results.columns = ['Number of Conserved PTMs', 'Number of PTMs with Matching Flanking Sequence', 'Number of PTMs with Altered Flanking Sequence', 'Fraction of PTMs with Altered Flanking Sequence']
        return results
    
def alteredByWindowSize(mapper, mod_groups, flank_size = list(range(1,6)), unique_isoforms = True):
    """
    Calculate the fraction of ptms with altered flanking sequence for each modification group for each flank size

    Parameters
    ----------
    mapper: PTMmapper object
        PTMmapper object containing alternative and canonical flanking sequence data
    mod_groups: pandas dataframe
        dataframe for conversion from modification subtypes ('Mod Name') to modification classes ('Mod Class')
    flank_size: list
        list of flank sizes to compare. IMPORTANT, maximum value should not be larger than the available flanking sequence in the ptm_info dataframe
    unique_isoforms: bool
        If True, compare ptms found in unique isoforms only. If False, compare ptms found in all transcripts, regardless of if there is redundant protein sequences
    
    Returns
    -------
    fractions: pandas dataframe
        dataframe containing the fraction of ptms with altered flanking sequence for each modification group for each flank size
    results: pandas dataframe or None
        dataframe containing the number of conserved ptms and the number with matching/altered flanking sequences for a flank size of 5 (None if 5 is not in flank_size)
    """
    fractions = []
    results = None
    for flank in flank_size:
        if flank == 5:
            #for a flank size of 5, also keep the full breakdown of ptm counts
            results = getModSpecificChanges(mapper, mod_groups, flank, return_fraction_only = False)
            tmp_fraction = results['Fraction of PTMs with Altered Flanking Sequence'].copy()
            tmp_fraction.name = flank
        else:
            tmp_fraction = getModSpecificChanges(mapper, mod_groups, flank, return_fraction_only = True)
        fractions.append(tmp_fraction)
    #combine the fractions into a single dataframe with one column per flank size
    fractions = pd.concat(fractions, axis = 1)
    return fractions, results

In [21]:
fractions, window5_flanks = alteredByWindowSize(mapper, mod_groups)
fractions.to_csv(analysis_dir + '/FlankingSequences/Mod_Specific_Alteration_Rates.csv')
window5_flanks.to_csv(analysis_dir + '/FlankingSequences/Window5_FlankingSequences.csv')

### Events Causing Altered Flanking Sequences

Analysis of the splice events that impact flanking sequences, and whether they occur in the PTM-containing exon or in an adjacent exon. These data were used in Figure 3D.

[Return to Table of Contents](#table-of-contents)
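The tests in this section rest on two simple ideas: a splice boundary can only affect a flanking sequence if it lies within `flank_size` residues (that is, `flank_size * 3` nucleotides) of the PTM, and the upstream and downstream findings are then combined into a single label. A minimal sketch of both steps follows. The helper names `boundary_within_flank` and `combine_causes` are hypothetical and exist only for this illustration.

```python
def boundary_within_flank(distance_nt, flank_size=5):
    """Return True if a splice boundary `distance_nt` nucleotides from the PTM
    falls inside a flanking window of `flank_size` residues (3 nt per residue)."""
    return distance_nt <= flank_size * 3


def combine_causes(upstream_result, downstream_result):
    """Join the non-empty upstream/downstream labels; report 'unclear' if both are None."""
    causes = [c for c in (upstream_result, downstream_result) if c is not None]
    return ','.join(causes) if causes else 'unclear'


# A boundary 12 nt (4 residues) away can alter a 5-residue flank; one 40 nt away cannot
print(boundary_within_flank(12), boundary_within_flank(40))
print(combine_causes('Skipped (Upstream Exon)', None))
```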

In [21]:
def upstream_test(mapper, sevents, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript):
    """
    Test for event impacting flanking sequence upstream of PTM
    """
    #look at what is happening in upstream exon
    rank = int(canonical_ptm_info['Exon rank in transcript'])
    if rank != 1:
        upstream_rank = rank - 1
        upstream_exon = mapper.exons[(mapper.exons['Exon rank in transcript'] == upstream_rank) 
                                     & (mapper.exons['Transcript stable ID'] == canonical_transcript)].squeeze()
        #check for inserted exon
        if mapper.genes.loc[canonical_exon_info['Gene stable ID'], 'Strand'] == 1:
            upstream_end = upstream_exon['Exon End (Gene)']
            canonical_start = canonical_exon_info['Exon Start (Gene)']
            alt_exons = mapper.exons[mapper.exons['Transcript stable ID'] == alt_transcript]
            alt_exons = alt_exons[(alt_exons['Exon Start (Gene)'] > upstream_end) &
                                    (alt_exons['Exon End (Gene)'] < canonical_start)]
            if alt_exons.shape[0] > 0:
                return 'Inserted Exon (Upstream Exon)'

        else:
            upstream_start = upstream_exon['Exon Start (Gene)']
            canonical_end = canonical_exon_info['Exon End (Gene)']
            alt_exons = mapper.exons[mapper.exons['Transcript stable ID'] == alt_transcript]
            alt_exons = alt_exons[(alt_exons['Exon Start (Gene)'] > canonical_end) &
                                    (alt_exons['Exon End (Gene)'] < upstream_start)]
            if alt_exons.shape[0] > 0:
                return 'Inserted Exon (Upstream Exon)'


        sevent_of_interest = sevents[(sevents['Exon ID (Canonical)'] == upstream_exon['Exon stable ID'])
                                            & (sevents['Canonical Transcript'] == canonical_transcript)]
        if sevent_of_interest.shape[0] ==0:
            return 'missing sevent'
        else:
            sevent_of_interest = sevent_of_interest.iloc[0]

        if sevent_of_interest['Event Type'] == 'Skipped':
            return 'Skipped (Upstream Exon)'
        elif sevent_of_interest['Event Type'] == "3' ASS" or sevent_of_interest['Event Type'] == "3' and 5' ASS":
            return 'ASS (Upstream Exon)'
        elif sevent_of_interest['Event Type'] == 'Mutually Exclusive':
            return 'MXE (Upstream Exon)'
        elif sevent_of_interest['Event Type'] == 'Conserved':
            return None
        else:
            return 'unclear'
    else:
        return None
    
def downstream_test(mapper, sevents, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript):
    """
    Test for event impacting flanking sequence downstream of PTM
    """
    #look at what is happening in downstream exon
    rank = int(canonical_ptm_info['Exon rank in transcript'])
    canonical_transcript = canonical_ptm_info['Transcripts']
    downstream_rank = rank + 1
    downstream_exon = mapper.exons[(mapper.exons['Exon rank in transcript'] == downstream_rank) 
                                 & (mapper.exons['Transcript stable ID'] == canonical_transcript)].squeeze()

    #check for inserted exon
    if mapper.genes.loc[canonical_exon_info['Gene stable ID'], 'Strand'] == -1:
        downstream_end = downstream_exon['Exon End (Gene)']
        canonical_start = canonical_exon_info['Exon Start (Gene)']
        alt_exons = mapper.exons[mapper.exons['Transcript stable ID'] == alt_transcript]
        alt_exons = alt_exons[(alt_exons['Exon Start (Gene)'] > downstream_end) &
                                (alt_exons['Exon End (Gene)'] < canonical_start)]
        if alt_exons.shape[0] > 0:
            return 'Inserted Exon (Downstream Exon)'

    else:
        downstream_start = downstream_exon['Exon Start (Gene)']
        canonical_end = canonical_exon_info['Exon End (Gene)']
        alt_exons = mapper.exons[mapper.exons['Transcript stable ID'] == alt_transcript]
        alt_exons = alt_exons[(alt_exons['Exon Start (Gene)'] > canonical_end) &
                                (alt_exons['Exon End (Gene)'] < downstream_start)]
        if alt_exons.shape[0] > 0:
            return 'Inserted Exon (Downstream Exon)' 

    sevent_of_interest = sevents[(sevents['Exon ID (Canonical)'] == downstream_exon['Exon stable ID'])
                                        & (sevents['Canonical Transcript'] == canonical_transcript)]

    if sevent_of_interest.shape[0] == 0:
        return 'missing sevent'
    else:
        sevent_of_interest = sevent_of_interest.iloc[0]
        
    if sevent_of_interest['Event Type'] == 'Skipped':
        return 'Skipped (Downstream Exon)'
    elif sevent_of_interest['Event Type'] == "5' ASS" or sevent_of_interest['Event Type'] == "3' and 5' ASS":
        return 'ASS (Downstream Exon)'
    elif sevent_of_interest['Event Type'] == 'Mutually Exclusive':
        return 'MXE (Downstream Exon)'
    elif sevent_of_interest['Event Type'] == 'Conserved':
        return None
    else:
        return 'unclear'
    
def complete_test_forConserved(mapper, sevents, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript, flank_size = 5):
    """
    If exon containing PTM is conserved, check to see what events upstream/downstream of the PTM-containing exon are impacting the flanking sequence
    """
    if int(canonical_ptm_info['Distance to C-terminal Splice Boundary (NC)']) <= flank_size*3:
        downstream_result = downstream_test(mapper, sevents, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript)
    else:
        downstream_result = None
        
    if int(canonical_ptm_info['Distance to N-terminal Splice Boundary (NC)']) <= flank_size*3:
        upstream_result = upstream_test(mapper, sevents, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript)
    else:
        upstream_result = None
        
    if upstream_result is not None and downstream_result is not None:
        return ','.join([upstream_result,downstream_result])
    elif upstream_result is not None:
        return upstream_result
    elif downstream_result is not None:
        return downstream_result
    else:
        return 'unclear'
    
    

def complete_test_forASS(mapper, sevents, event_type, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript,alt_exon_id, flank_size = 5):
    """
    If the PTM-containing exon undergoes an alternative splice site, check to see if this is causing the altered flanking sequence or if it is the result of a change to an adjacent exon
    """
    alt_exon = mapper.exons[(mapper.exons['Transcript stable ID'] == alt_transcript) &
             (mapper.exons['Exon stable ID'] == alt_exon_id)].squeeze()
    if event_type == "3' ASS":
        distance_to_altered_boundary = alt_exon['Exon End (Gene)'] -int(canonical_ptm_info['Gene Location (NC)'])
    elif event_type == "5' ASS":
        distance_to_altered_boundary = alt_exon['Exon End (Gene)'] -int(canonical_ptm_info['Gene Location (NC)'])
    else:
        distance_to_altered_boundary = np.min([alt_exon['Exon End (Gene)'] -int(canonical_ptm_info['Gene Location (NC)']), alt_exon['Exon End (Gene)'] -int(canonical_ptm_info['Gene Location (NC)'])])
    
    if distance_to_altered_boundary <= flank_size*3:
        return 'ASS (Exon with PTM)'
    else:
        relevant_events = sevents[sevents['Alternative Transcript'] == alt_transcript]
        return complete_test_forConserved(mapper, relevant_events, canonical_ptm_info, canonical_exon_info, canonical_transcript, alt_transcript, flank_size = flank_size)

In [22]:
mapper.ptm_info.index = mapper.ptm_info['Protein'] + '_' + mapper.ptm_info['Residue']+mapper.ptm_info['PTM Location (AA)'].astype(str)

In [23]:
flank_size = 5
mapper.alternative_ptms[f'Conserved Flank (Size = {flank_size})'] = mapper.compareAllFlankSeqs(flank_size=flank_size, unique_isoforms=False)

In [36]:
#load splice event information
sevents = pd.read_csv(config.processed_data_dir + 'splice_events.csv').drop_duplicates()
flank_size = 5
#grab flanking sequences that are different from canonical, separate by event and canonical exon id
altered_flanks = mapper.alternative_ptms[mapper.alternative_ptms[f'Conserved Flank (Size = {flank_size})'] == 0].copy()
#remove canonical isoforms and transcripts not associated with gene name in translator object, which might mean entry is old and makes splice event mapping more difficult
canonical_transcripts = mapper.transcripts[mapper.transcripts['UniProt Isoform Type'] == 'Canonical'].index
missing_gene_name=config.translator.loc[config.translator['Gene name'].isna(), 'Transcript stable ID'].unique()
altered_flanks = altered_flanks[(~altered_flanks['Event Type'].isna()) & (~altered_flanks['Alternative Transcript'].isin(canonical_transcripts)) & (~altered_flanks['Alternative Transcript'].isin(missing_gene_name))]

#extract events
altered_flanks['Event Type'] = altered_flanks['Event Type'].apply(lambda x: x.split(';'))
altered_flanks['Exon ID (Canonical)'] = altered_flanks['Exon ID (Canonical)'].apply(lambda x: x.split(';'))
altered_flanks = altered_flanks.explode(['Event Type'])

test = altered_flanks.copy()
cause = []
for i, row in tqdm(test.iterrows(), total = test.shape[0]):
    #grab the ptm, canonical exon(s), and splice events associated with the alternative transcript
    ptm = row['Source of PTM'].split(';')[0]
    canonical_exon_id = row['Exon ID (Canonical)']
    relevant_events = sevents[sevents['Alternative Transcript'] == row['Alternative Transcript']]
    canonical_ptm_info = exploded_ptms[(exploded_ptms['PTM'] == ptm) & (exploded_ptms['Exon stable ID'].isin(canonical_exon_id))]
    #grab first relevant entry (if only one, will just convert to series)
    if canonical_ptm_info.shape[0] == 0:
        cause.append('PTM not found')
    elif ';' in row['Exon ID (Alternative)']:
        cause.append('Alternative Exon ID Discrepancy')
    else:
        #grab canonical exon and transcript information
        canonical_ptm_info = canonical_ptm_info.iloc[0]
        canonical_transcript = canonical_ptm_info['Transcripts']
        canonical_exon_info = mapper.exons[(mapper.exons['Exon stable ID'].isin(canonical_exon_id))
                                         & (mapper.exons['Transcript stable ID'] == canonical_transcript)].squeeze()
        #for events with a conserved exon, check for changes adjacent to exon
        if row['Event Type'] == 'Conserved' or row['Event Type'] == 'No Difference':
            cause.append(complete_test_forConserved(mapper, relevant_events, canonical_ptm_info, canonical_exon_info, canonical_transcript, row['Alternative Transcript'], flank_size = flank_size))
        elif 'ASS' in row['Event Type']: #if ptm is an exon involved in ass, check if this is causing altered flank
            cause.append(complete_test_forASS(mapper, sevents, row['Event Type'], canonical_ptm_info, canonical_exon_info, canonical_transcript, row['Alternative Transcript'], row['Exon ID (Alternative)'], flank_size = flank_size))
        elif row['Event Type'] == 'Mutually Exclusive': #if ptm is in a mxe, annotate this as the cause
            cause.append('MXE (Exon with PTM)')
        else:
            cause.append(row['Event Type'])

altered_flanks['Cause'] = cause

cause_breakdown = altered_flanks.groupby('Cause').size()
cause_breakdown.name = 'Number of Instances'
cause_breakdown.to_csv(analysis_dir + '/FlankingSequences/cause_of_alteration.csv')

100%|██████████| 16190/16190 [1:20:19<00:00,  3.36it/s]


### Sequence Similarity Between Flanking Sequences in Canonical and Alternative Isoforms

Compare the flanking sequence in the canonical isoform to the altered flanking sequence in the alternative isoform by sequence similarity. These data were used for Figure 3E.

[Return to Table of Contents](#table-of-contents)
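To give a feel for the normalized score, here is a simplified, ungapped stand-in that needs no alignment library: for two flanks of equal length, it reports the fraction of identical positions. This is not the Biopython alignment used in this notebook, which can also introduce (heavily penalized) gaps; the function name `ungapped_similarity` is hypothetical. The example sequences are the canonical and alternative flanks of Q14847_Y86 from the data in this notebook.

```python
def ungapped_similarity(can_flank, alt_flank):
    """Fraction of positions that are identical between two equal-length flanks.

    Simplified illustration only: no gaps are considered, so the sequences
    must have the same, non-zero length.
    """
    if len(can_flank) != len(alt_flank) or len(can_flank) == 0:
        raise ValueError('flanking sequences must be non-empty and of equal length')
    matches = sum(a == b for a, b in zip(can_flank, alt_flank))
    # Aligning the canonical flank to itself would match at every position,
    # so the maximum possible score is simply the sequence length
    return matches / len(can_flank)


canonical = 'QQSELQSQVRyKEEFEKNKGK'
alternative = 'REEALLQRVRyKEEFEKNKGK'
print(round(ungapped_similarity(canonical, alternative) * 100, 1))
```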

In [2]:
from Bio import pairwise2

def getSequenceSimilarity(can_flank, alt_flank):
    """
    Given two flanking sequences, calculate the sequence similarity between them using Biopython and criteria defined by Pillman et al. BMC Bioinformatics 2011

    Parameters
    ----------
    can_flank: str
        flanking sequence for PTM in canonical protein isoform
    alt_flank: str
        flanking sequence for PTM in alternative protein isoform

    Returns
    -------
    normalized_score: float
        normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
    """
    #align canonical and alternative flanks, return only the score
    actual_similarity = pairwise2.align.globalxs(can_flank, alt_flank, -10, -2, score_only = True)
    #align the canonical flank to itself, return only the score
    control_similarity = pairwise2.align.globalxs(can_flank, can_flank, -10, -2, score_only = True)
    #normalize score
    normalized_score = actual_similarity/control_similarity
    return normalized_score

In [3]:
from collections import defaultdict

#option where we consider regulatory region for each flanking sequence size
isoform_ptms = mapper.isoform_ptms.copy()
similarity = defaultdict(list)
flank_sizes = [3, 5, 7, 10]
for size in flank_sizes:
    altered_flanks = isoform_ptms[(isoform_ptms[f'Conserved Flank (Size = {size})'] == 0) & (isoform_ptms['Mapping Result'] == 'Success')].copy()
    for i,row in altered_flanks.iterrows():
        ptm = row['Source of PTM']
        if ';' in ptm:
            ptm = ptm.split(';')[0]
        alt_flank = row['Flanking Sequence'][10-size:10+size+1]
        can_flank = mapper.ptm_info.loc[ptm, 'Flanking Sequence'][10-size:10+size+1]
        similarity[size].append(getSequenceSimilarity(can_flank, alt_flank))

#construct dataframe from dictionary, adding column indicating window size
similarity_data = []
for size in flank_sizes:
    subset = pd.DataFrame(similarity[size], columns = ['Similarity'])
    subset['Window Size'] = size
    similarity_data.append(subset)
similarity_data = pd.concat(similarity_data)

#multiply similarity by 100 to get percentage
similarity_data['Similarity'] = similarity_data['Similarity']*100

In [4]:
similarity_data.to_csv(analysis_dir + '/FlankingSequences/SequenceSimilarity.csv', index = False)

### Position of Altered Residues in Flanking Sequence
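This step records which positions in the flanking sequence differ between the canonical and alternative isoforms, indexed relative to the PTM (negative offsets are N-terminal, positive offsets are C-terminal). A minimal sketch of that indexing on toy sequences is shown here. The function name `altered_offsets` is hypothetical, and the sketch assumes the PTM sits at the center of an odd-length sequence.

```python
def altered_offsets(can_flank, alt_flank):
    """Return the offsets (relative to the central PTM) at which two flanks differ.

    Returns None when the sequences cannot be compared position by position
    (different lengths, or an even length with no central residue).
    """
    if len(can_flank) != len(alt_flank) or len(can_flank) % 2 == 0:
        return None
    center = len(can_flank) // 2
    return [i - center
            for i, (can_res, alt_res) in enumerate(zip(can_flank, alt_flank))
            if can_res != alt_res]


# Toy 5-mers with the modified serine (lowercase) in the center
offsets = altered_offsets('AAsAA', 'TAsAC')
print(offsets)
```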

In [5]:
def findAlteredPositions(seq1, seq2, desired_seq_size = 21):
    """
    Given two sequences, identify the location of positions that have changed

    Parameters
    ----------
    seq1, seq2: str
        sequences to compare (order does not matter)
    desired_seq_size: int
        expected total length of each sequence (flanking residues on both sides plus the PTM itself). Sequences that differ in length from each other or from this value are not compared
    
    Returns
    -------
    altered_positions: list
        list of positions that have changed
    residue_change: list
        list of residues that have changed associated with that position
    flank_side: str
        indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
    """
    altered_positions = []
    residue_change = []
    flank_side = []
    seq_size = len(seq1)
    flank_size = (seq_size - 1)//2
    if seq_size == len(seq2) and seq_size == desired_seq_size:
        for i in range(seq_size):
            if seq1[i] != seq2[i]:
                altered_positions.append(i-(flank_size))
                residue_change.append(f'{seq1[i]}->{seq2[i]}')
        #check to see which side flanking sequence
        altered_positions = np.array(altered_positions)
        n_term = any(altered_positions < 0)
        c_term = any(altered_positions > 0)
        if n_term and c_term:
            flank_side = 'Both'
        elif n_term:
            flank_side = 'N-term only'
        elif c_term:
            flank_side = 'C-term only'
        else:
            flank_side = 'Unclear'
        return altered_positions, residue_change, flank_side
    else:
        return np.nan, np.nan, np.nan

In [6]:
altered_flanks = mapper.isoform_ptms[(mapper.isoform_ptms['Conserved Flank (Size = 10)'] == 0) & (mapper.isoform_ptms['Mapping Result'] == 'Success')].copy()
altered_flanks = altered_flanks.dropna(subset = 'Alternative Residue')

altered_positions = []
residue_changes = []
flank_side = []
for i, row in altered_flanks.iterrows():
    alt_flank = row['Flanking Sequence']
    ptm = row['Source of PTM']
    if ';' in ptm:
        ptm = ptm.split(';')[0]
    can_flank = mapper.ptm_info.loc[ptm, 'Flanking Sequence']
    results = findAlteredPositions(can_flank, alt_flank)
    altered_positions.append(results[0])
    residue_changes.append(results[1])
    flank_side.append(results[2])
altered_flanks['Altered_Positions'] = altered_positions
altered_flanks['Residue Changes'] = residue_changes
altered_flanks["Location of Altered Flank"] = flank_side

altered_flanks = altered_flanks[['Isoform ID', 'Source of PTM', 'Flanking Sequence', 'Altered_Positions', 'Residue Changes', 'Location of Altered Flank']]
altered_flanks.to_csv(analysis_dir + '/FlankingSequences/PositionOfAlteredFlanks.csv', index = False)

In [7]:
exploded_positions = altered_flanks.explode('Altered_Positions')
exploded_positions = exploded_positions.groupby('Altered_Positions').size()
exploded_positions.name = 'Number of PTMs'
exploded_positions.to_csv(analysis_dir + '/FlankingSequences/PositionBreakdown.csv')

### Kinase Library Analysis

To explore how changes to flanking sequences might alter protein interactions, we used kinase-substrate interactions as an example. The Kinase Library tool returns a score indicating the likelihood that a kinase interacts with a given substrate based on that kinase's motif, which allowed us to score the flanking sequences found in both the canonical and alternative isoforms of our PTMs of interest. These data were used for Figure 3F-G.

[Return to Table of Contents](#table-of-contents)

In [5]:
#dictionary for converting from kinase library kinase names to kinase names in phosphositeplus
kinase_conversion = {'AURKA':'AURA','AURKB':'AURB','AURKC':'AURC','CHEK2':'CHK2','CHEK1':'CHK1','CSNK1A1':'CK1A','CSNK2A1':'CK2A1', 'CSNK2A2':'CK2A2', 'PRKACA':'PKACA',
                     'PRKACB':'PKACB','PRKCA':'PKCA','PRKCB':'PKCB','PRKCE':'PKCE','PRKCI': 'PKCI','PRKG2':'PKG2', 'TRPM7':'CHAK1', 'MAPK14':'P38A',
                     'MAPK1':'ERK2','MAPK3':'ERK1','RPS6KA1':'P90RSK','PRKCQ':'PKCT','PDPK1':'PDK1', 'PRKAA1':'AMPKA1','PRKAA2':'AMPKA2', 'EIF2AK2':'PKR'}

#### Processing for Kinase Library Analysis

Before running the Kinase Library tool, we extracted and formatted the flanking sequences of interest. The following code generates the flanking sequences that were ultimately submitted to Kinase Library.
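As a small illustration of the formatting step, the sketch below marks the modified residue, which is stored in lowercase in the flanking sequences, with a trailing asterisk. The helper name `mark_modified_residue` is hypothetical, and the example sequence is the canonical flank of Q14847_Y86 from this dataset.

```python
def mark_modified_residue(seq):
    """Append '*' to each lowercase s, t, or y (the modified residue) in a flanking sequence."""
    return ''.join(res + '*' if res in ('s', 't', 'y') else res for res in seq)


formatted = mark_modified_residue('QQSELQSQVRyKEEFEKNKGK')
print(formatted)
```

Only lowercase residues are marked, so unmodified S, T, and Y residues elsewhere in the flank are left untouched.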

In [6]:
def editSequence(seq):
    """
    Convert flanking sequence to version accepted by kinase library (modified residue denoted by asterisk)
    """
    seq = seq.replace('t','t*')
    seq = seq.replace('s','s*')
    seq = seq.replace('y','y*')
    return seq

def identify_flanks_for_KinaseLibrary(mapper):
    """
    Using the mapper object data, identify phosphorylation sites in alternative isoforms that have an altered flanking sequence (flank size of 5) but a conserved tryptic fragment (these changes would be missed by MS)

    Parameters
    ----------
    mapper : PTMmapper.Mapper
        Mapper object with data already processed

    Returns
    -------
    conserved : pandas.DataFrame
        Dataframe with the isoform ID, source PTM, alternative flanking sequence, and canonical flanking sequence for each PTM of interest
    """
    if 'Conserved Flank (Size = 5)' not in mapper.isoform_ptms.columns:
        mapper.isoform_ptms['Conserved Flank (Size = 5)'] = mapper.compareAllFlankSeqs(flank_size = 5)
    if 'Conserved Fragment' not in mapper.isoform_ptms.columns:
        mapper.isoform_ptms['Conserved Fragment'] = mapper.compareAllTrypticFragments()

    #extract ptms with altered flank sequences
    conserved = mapper.isoform_ptms[mapper.isoform_ptms['Mapping Result'] == 'Success'].copy()
    conserved = conserved[conserved["Isoform Type"] == 'Alternative']

    conserved = conserved[(conserved['Conserved Fragment'] == 1) & (conserved['Conserved Flank (Size = 5)'] == 0)]
    #conserved = conserved[(conserved['Conserved Flank (Size = 5)'] == 0)]


    #restrict to phosphorylation sites
    conserved = conserved[(conserved['Modification Class'].str.contains('Phosphorylation'))]
    conserved = conserved.drop_duplicates(subset = ['Source of PTM', 'Flanking Sequence'])

    #separate ptm sources
    conserved['Source of PTM'] = conserved['Source of PTM'].str.split(';')
    conserved = conserved.explode('Source of PTM')
    conserved['Source of PTM'] = conserved['Source of PTM'].apply(lambda x: x.split('-')[0]+'_'+ x.split('_')[1])
    conserved = conserved[['Isoform ID', 'Source of PTM', 'Flanking Sequence']].drop_duplicates()


    #if requested, restrict to known kinase-substrate interactions

    #restrict to human interactions
    #ks_dataset = ks_dataset[(ks_dataset['KIN_ORGANISM'] == 'human') & (ks_dataset['SUB_ORGANISM'] == 'human')].copy()
    #inner merge kinase-substrate info with conserved flank info
    #ks_dataset['Source of PTM'] = ks_dataset['SUB_ACC_ID']+'_'+ks_dataset['SUB_MOD_RSD']
    #conserved = conserved.merge(ks_dataset, on = 'Source of PTM')

    #extract only the columns with relevant info
    conserved = conserved[['Isoform ID', 'Source of PTM', 'Flanking Sequence']].drop_duplicates()
    conserved = conserved.rename({'Flanking Sequence':'Alternative Sequence'}, axis = 1)
    #add canonical flanking sequence to data
    canonical_ptms = mapper.ptm_info[mapper.ptm_info['Isoform Type'] == 'Canonical'].copy()
    canonical_sequence = mapper.ptm_info['Flanking Sequence'].reset_index().rename({'index':'PTM', 'Flanking Sequence':'Canonical Sequence'}, axis = 1)
    canonical_sequence['PTM'] = canonical_sequence['PTM'].apply(lambda x: x.split('-')[0]+'_'+ x.split('_')[1])
    conserved = conserved.merge(canonical_sequence, right_on = 'PTM', left_on = 'Source of PTM', how = 'left')
    
    return conserved

In [7]:
#get flanking sequences to use for kinase library analysis 
flanking_sequences = identify_flanks_for_KinaseLibrary(mapper)

#save data
os.makedirs(analysis_dir + '/FlankingSequences/Kinase_Library', exist_ok = True)
flanking_sequences.to_csv(analysis_dir + '/FlankingSequences/Kinase_Library/sequences_to_analyze.csv', index = False)

#generate files to input into Kinase Library
canonical_sequences = flanking_sequences[['PTM', 'Canonical Sequence']].drop_duplicates()
canonical_sequences['Canonical Sequence'] = canonical_sequences['Canonical Sequence'].apply(editSequence)
#write sequences to text file
with open(analysis_dir + '/FlankingSequences/Kinase_Library/Canonical_Flanking_Sequences.txt', 'w') as f:
    for index, row in canonical_sequences.iterrows():
        f.write(row['Canonical Sequence']+'\n')

alternative_sequences = flanking_sequences[['Isoform ID', 'PTM', 'Alternative Sequence']].drop_duplicates()
alternative_sequences['Alternative Sequence'] = alternative_sequences['Alternative Sequence'].apply(editSequence)
alternative_sequences = alternative_sequences['Alternative Sequence'].drop_duplicates()
#write sequences to text file
with open(analysis_dir + '/FlankingSequences/Kinase_Library/Alternative_Flanking_Sequences.txt', 'w') as f:
    for seq in alternative_sequences.values:
        f.write(seq+'\n')

#### Processing Kinase Library Results

Kinase library analysis outputs a single file for each flanking sequence, containing scores for all kinases for the given flanking sequence. We processed these files in order to merge the information into a single file. 

[Return to Table of Contents](#table-of-contents)
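The core of this processing step is to pivot the scores into a site-by-kinase table and subtract the canonical percentile from the alternative percentile for each site. A minimal sketch on toy data is shown here. The kinase names, labels, and percentiles are invented, and the vectorized lookup is one possible way to do the subtraction rather than the exact code used in this notebook.

```python
import pandas as pd

# Toy long-format scores: one row per (site, kinase) pair; all values are invented
canonical = pd.DataFrame({
    'PTM': ['P1_S10', 'P1_S10'],
    'kinase': ['K1', 'K2'],
    'site_percentile': [90.0, 20.0],
})
alternative = pd.DataFrame({
    'Label': ['P1-2;P1_S10', 'P1-2;P1_S10'],  # '<Isoform ID>;<PTM>'
    'kinase': ['K1', 'K2'],
    'site_percentile': [40.0, 25.0],
})

# Pivot to wide format: one row per site, one column per kinase
can_pct = canonical.pivot_table(index='PTM', columns='kinase', values='site_percentile')
alt_pct = alternative.pivot_table(index='Label', columns='kinase', values='site_percentile')

# The PTM follows the ';' in each label; use it to look up the matching canonical row
ptm_for_label = alt_pct.index.str.split(';').str[1]
matched_canonical = can_pct.loc[ptm_for_label, alt_pct.columns]

# Alternative minus canonical percentile, row by row
percentile_diff = alt_pct - matched_canonical.to_numpy()
print(percentile_diff)
```

A negative value means the kinase is predicted to favor the canonical flanking sequence over the alternative one.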

In [1]:
import pandas as pd

In [4]:
flanking_sequences = pd.read_csv(analysis_dir + '/FlankingSequences/Kinase_Library/sequences_to_analyze.csv')

In [5]:
flanking_sequences

| | Isoform ID | Source of PTM | Alternative Sequence | PTM | Canonical Sequence |
|---|---|---|---|---|---|
| 0 | Q14847-3 | Q14847_Y86 | REEALLQRVRyKEEFEKNKGK | Q14847_Y86 | QQSELQSQVRyKEEFEKNKGK |
| 1 | Q9P2N7-2 | Q9P2N7_S46 | MKLsLGGSEMGLSS | Q9P2N7_S46 | VEEEDQHMKLsLGGSEMGLSS |
| 2 | ENS-CFLAR-2 | O15519_Y171 | RIDLKTKIQKyKQSGGWNGTW | O15519_Y171 | RIDLKTKIQKyKQSVQGAGTS |
| 3 | ENS-POLDIP2-1 | Q9Y2S7_Y103 | GKYETGQARLyDRDVASAAPE | Q9Y2S7_Y103 | VVLFPWQARLyDRDVASAAPE |
| 4 | Q9H6T3-2 | Q9H6T3_S391 | PGNKQAVTELsKIKKKPLKKV | Q9H6T3_S391 | PGNKQAVTELsKIKKELIEKG |
| ... | ... | ... | ... | ... | ... |
| 1901 | Q13085-2 | Q13085_S80 | RYYMLQRSSMsGLHLVKQGRD | Q13085_S80 | GLALHIRSSMsGLHLVKQGRD |
| 1902 | ENS-TBCE-2 | Q15813_S495 | PVSDLLLSYEsPKVSCPAKYK | Q15813_S495 | PVSDLLLSYEsPKKPGREIEL |
| 1903 | Q5JSZ5-5 | Q5JSZ5_S1470 | DTLAMDMRVRsPDEALPGGLS | Q5JSZ5_S1470 | QNGTPLKVKRsPDEALPGGLS |
| 1904 | Q5JSZ5-5 | Q5JSZ5_S776 | DTLAMDMRVRsPDEALPGGLS | Q5JSZ5_S776 | DTLAMDMRVRsPDEALPGGLS |


In [None]:
#load sequence data
flanking_sequences = pd.read_csv(analysis_dir + '/FlankingSequences/Kinase_Library/sequences_to_analyze.csv')
#grab canonical sequences and match kinase library formatting
canonical_sequences = flanking_sequences[['PTM', 'Canonical Sequence']].drop_duplicates()
canonical_sequences['Canonical Sequence'] = canonical_sequences['Canonical Sequence'].apply(lambda x: x[10-5:10+6+1].upper().replace(' ', '_'))
#grab alternative sequences and match kinase library formatting
alternative_sequences = flanking_sequences[['Isoform ID', 'PTM', 'Alternative Sequence']].drop_duplicates()
alternative_sequences['Alternative Sequence'] = alternative_sequences['Alternative Sequence'].apply(lambda x: x[10-5:10+6+1].upper().replace(' ','_'))
alternative_sequences['Label'] = alternative_sequences['Isoform ID'] + ';' + alternative_sequences['PTM']
alternative_sequences = alternative_sequences[['Label', 'Alternative Sequence']].drop_duplicates()


#add kinase library scores to sequence info
canonical_scores = pd.read_csv(analysis_dir + '/FlankingSequences/Kinase_Library/Results/canonical_sequence_scores.tsv', sep = '\t')
canonical_sequences = canonical_sequences.merge(canonical_scores, left_on = 'Canonical Sequence', right_on = 'sequence', how = 'left')

alternative_scores = pd.read_csv(analysis_dir + '/FlankingSequences/Kinase_Library/Results/alternative_sequence_scores.tsv', sep = '\t')
alternative_sequences = alternative_sequences.merge(alternative_scores, left_on = 'Alternative Sequence', right_on = 'sequence', how = 'left')


#pivot and extract scores
canonical_sequences_y = canonical_sequences[canonical_sequences['PTM'].str.contains('_Y')]
canonical_percentiles_y = canonical_sequences_y.pivot_table(index = 'PTM', columns = 'kinase', values = 'site_percentile')
canonical_sequences_st = canonical_sequences[(canonical_sequences['PTM'].str.contains('_S')) | (canonical_sequences['PTM'].str.contains('_T'))]
canonical_percentiles_st = canonical_sequences_st.pivot_table(index = 'PTM', columns = 'kinase', values = 'site_percentile')

alternative_sequences_y = alternative_sequences[alternative_sequences['Label'].str.contains('_Y')]
alternative_percentiles_y = alternative_sequences_y.pivot_table(index = 'Label', columns = 'kinase', values = 'site_percentile')
alternative_sequences_st = alternative_sequences[(alternative_sequences['Label'].str.contains('_S')) | (alternative_sequences['Label'].str.contains('_T'))]
alternative_percentiles_st = alternative_sequences_st.pivot_table(index = 'Label', columns = 'kinase', values = 'site_percentile')

#calculate the difference in percentiles

percentiles_diff_y = alternative_percentiles_y.copy()
percentiles_diff_y = percentiles_diff_y[canonical_percentiles_y.columns]
for i, row in percentiles_diff_y.iterrows():
    percentiles_diff_y.loc[i] = row - canonical_percentiles_y.loc[i.split(';')[1]]

percentiles_diff_st = alternative_percentiles_st.copy()
percentiles_diff_st = percentiles_diff_st[canonical_percentiles_st.columns]
for i, row in percentiles_diff_st.iterrows():
    percentiles_diff_st.loc[i] = row - canonical_percentiles_st.loc[i.split(';')[1]]

#save results
percentiles_diff_y.to_csv(analysis_dir + '/FlankingSequences/Kinase_Library/Results/Percentile_Differences_Y.csv')
percentiles_diff_st.to_csv(analysis_dir + '/FlankingSequences/Kinase_Library/Results/Percentile_Differences_ST.csv')
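
The row-wise subtraction above uses the `Isoform ID;PTM` label to find the matching canonical row. A minimal, dependency-free sketch of that lookup follows. The labels, kinase names, and percentile values are made up for illustration and are not Kinase Library output; `percentile_differences` is a hypothetical helper name.

```python
# Hypothetical percentiles keyed by PTM (canonical) or 'Isoform ID;PTM' (alternative)
canonical = {'P00001_Y10': {'SRC': 90.0, 'ABL1': 40.0}}
alternative = {'ENST0001;P00001_Y10': {'SRC': 60.0, 'ABL1': 55.0}}

def percentile_differences(alternative, canonical):
    """Return alternative - canonical percentile for every label and kinase."""
    diffs = {}
    for label, alt_scores in alternative.items():
        ptm = label.split(';')[1]  # label format assumed to be 'Isoform ID;PTM'
        can_scores = canonical[ptm]
        diffs[label] = {kin: alt_scores[kin] - can_scores[kin] for kin in can_scores}
    return diffs

diffs = percentile_differences(alternative, canonical)
print(diffs)
```

A negative value means the alternative flanking sequence scores at a lower percentile for that kinase than the canonical sequence.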

## ESRP1-mediated Splicing in Prostate Cancer

The following code was used to analyze how ESRP1-mediated splicing is altered in prostate cancer and how it changes the presence or absence of different PTMs. This analysis was used to generate Figure 4.

[Return to Table of Contents](#table-of-contents)

### Projecting PTMs onto SpliceSeq Exons

Here, we used the genomic coordinates of PTMs obtained from the projection pipeline to determine which PTMs fall within exons of the SpliceSeq splicegraph.

In [73]:
#load splicegraph from SpliceSeq
splicegraph = pd.read_csv(database_dir + '/TCGA/TCGASpliceSeq_splicegraph.txt', sep = r'\s+') #whitespace-delimited file (delim_whitespace is deprecated in recent pandas)


splicegraph_ptms = None
#iterate through all ptms and project them onto the splicegraph (breaking up the data by chromosome and strand)
for chromosome in mapper.ptm_coordinates['Chromosome/scaffold name'].unique():
    for strand in mapper.ptm_coordinates['Strand'].unique():
        if strand == -1:
            sg_strand = '-'
        else:
            sg_strand = '+'
        #grab splicegraph exons and ptms associated with the same chromosome/strand as the ptm
        trim_sg = splicegraph[(splicegraph['Chromosome'] == chromosome) & (splicegraph['Strand'] == sg_strand)]
        trim_ptms = mapper.ptm_coordinates[(mapper.ptm_coordinates['Chromosome/scaffold name'] == chromosome) & (mapper.ptm_coordinates['Strand'] == strand)]
        #iterate through all ptms on the chromosome/strand and project the ptms onto the splicegraph exons
        for i,row in trim_ptms.iterrows():
            sg_exons = trim_sg[(trim_sg['Chr_Start'] <= row['HG19 Location']) & (trim_sg['Chr_Stop'] >= row['HG19 Location'])].copy()
            sg_exons['PTM'] = row['Source of PTM']
            if splicegraph_ptms is None:
                splicegraph_ptms = sg_exons.copy()
            else:
                splicegraph_ptms = pd.concat([splicegraph_ptms, sg_exons])


#save data
splicegraph_ptms.to_csv(analysis_dir + '/TCGA/splicegraph_PTMs.csv')
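
The projection above amounts to an interval lookup: a PTM is assigned to every splicegraph exon on the same chromosome and strand whose start and stop coordinates bracket the PTM's genomic location. The sketch below shows that logic without pandas. The exon and PTM records are hypothetical, and `project_ptms` is an illustrative name rather than a function from ExonPTMapper.

```python
# Hypothetical splicegraph exons: (symbol, exon number, chromosome, strand, start, stop)
exons = [
    ('GENE1', 1.0, '1', '+', 100, 200),
    ('GENE1', 2.0, '1', '+', 300, 400),
    ('GENE2', 1.0, '1', '-', 150, 250),
]
# Hypothetical PTMs: (PTM id, chromosome, strand as 1/-1, genomic location)
ptms = [('P00001_S5', '1', 1, 150), ('P00002_Y9', '1', -1, 150), ('P00003_T2', '1', 1, 250)]

def project_ptms(ptms, exons):
    """Return (PTM, symbol, exon) for every exon whose range contains the PTM location."""
    hits = []
    for ptm, chrom, strand, loc in ptms:
        sg_strand = '-' if strand == -1 else '+'  # convert 1/-1 to the '+'/'-' convention
        for symbol, exon, e_chrom, e_strand, start, stop in exons:
            if e_chrom == chrom and e_strand == sg_strand and start <= loc <= stop:
                hits.append((ptm, symbol, exon))
    return hits

hits = project_ptms(ptms, exons)
print(hits)
```

Collecting matches in a list and combining them once at the end avoids repeatedly copying a growing table, which is the main cost of concatenating inside a loop.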

### Identify differentially included PTM-containing exons

Using percent spliced in (PSI) data downloaded from TCGASpliceSeq, we identified exons that are differentially included in ESRP1-high or ESRP1-low prostate cancer samples. We then determined which of these exons contain PTMs.  

[Return to Table of Contents](#table-of-contents)
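
The comparison in the next cell calls `stat_utils.calculateMW_EffectSize`, whose implementation is not shown in this notebook. As a rough illustration of what a Mann-Whitney comparison computes, the sketch below derives the U statistic and a rank-biserial correlation from two groups of PSI values. Using the rank-biserial correlation as the effect size is an assumption made here for illustration, the PSI values are invented, and the p-value is omitted (in practice it could come from `scipy.stats.mannwhitneyu`).

```python
def rank_with_ties(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mann_whitney_u(high, low):
    """Return (U for the high group, rank-biserial correlation)."""
    ranks = rank_with_ties(list(high) + list(low))
    rank_sum_high = sum(ranks[:len(high)])
    u_high = rank_sum_high - len(high) * (len(high) + 1) / 2
    effect = 2 * u_high / (len(high) * len(low)) - 1  # ranges from -1 to 1
    return u_high, effect

# Hypothetical PSI values for one splice event
u, r = mann_whitney_u([0.9, 0.8, 0.85, 0.7], [0.2, 0.3, 0.75, 0.1])
print(u, r)
```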

In [76]:
tissue = 'PRAD'
#load ESRP1 expression data
ESRP1 = pd.read_csv(database_dir + f"/TCGA/{tissue}/{tissue}_ESRP1_expression.txt", sep = '\t') # Z-score of ESRP1 
ESRP1 = ESRP1.dropna(subset = 'ESRP1')

#load percent spliced in data from TCGA SpliceSeq, restrict to exon skip (ES), retained intron (RI), alternate donor (AD), and alternate acceptor (AA) events with a PSI range greater than 0.25 across all patients
PRAD = pd.read_csv(database_dir + f"/TCGA/{tissue}/PSI_download_{tissue}.txt", sep = '\t') # PSI data
PRAD = PRAD[PRAD['splice_type'].isin(['ES','RI','AD','AA'])].copy()
PRAD = PRAD[PRAD['psi_range'] > 0.25].copy()

#remove patients with no measured ESRP1 mRNA from splice data, or vice versa
ESRP1['Edited Patient ID'] = ESRP1['SAMPLE_ID'].apply(lambda x: '_'.join(x.split('-')[0:3])).to_list()
patients_to_drop = [col for col in PRAD.columns if col not in ESRP1['Edited Patient ID'].values and 'TCGA' in col]
PRAD = PRAD.drop(patients_to_drop, axis = 1)

#remove patients with no measured splice data from ESRP1 data
patients_to_drop = [col for col in ESRP1['Edited Patient ID'] if col not in PRAD.columns]
ESRP1 = ESRP1[~ESRP1['Edited Patient ID'].isin(patients_to_drop)].copy()

#grab PTMs mapped onto SpliceSeq splicegraph
mapped_ptms = pd.read_csv(analysis_dir + '/TCGA/splicegraph_PTMs.csv')
#load SpliceSeq splicegraph
spliceseq = pd.read_csv(database_dir + "/TCGA/TCGASpliceSeq_splicegraph.txt", sep = r'\s+') 


#identify patients with high (z-score > 1) or low (z-score < -1) ESRP1 expression
ESPR_low = ESRP1[ESRP1["ESRP1"] < -1]
ESPR_low_id = ESPR_low["SAMPLE_ID"].str.split("-").apply(lambda x: x[0:3]).apply(lambda x: '_'.join(x)).to_list()
ESPR_high = ESRP1[ESRP1["ESRP1"] > 1]
ESPR_high_id = ESPR_high["SAMPLE_ID"].str.split("-").apply(lambda x: x[0:3]).apply(lambda x: '_'.join(x)).to_list()
print('Number of ESRP1 low patients: ', len(ESPR_low_id))
print('Number of ESRP1 high patients: ', len(ESPR_high_id))


Number of ESRP1 low patients:  61
Number of ESRP1 high patients:  69
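
The grouping above reduces each sample ID to its first three hyphen-separated fields, joined with underscores so that it matches the PSI column names, and then thresholds the ESRP1 z-score at -1 and 1. A small sketch of those two steps, using invented sample IDs and z-scores (`to_patient_id` is a hypothetical helper name):

```python
def to_patient_id(sample_id):
    """Convert e.g. 'TCGA-AB-1234-01' to 'TCGA_AB_1234' (first three fields, '_'-joined)."""
    return '_'.join(sample_id.split('-')[0:3])

# Hypothetical ESRP1 z-scores keyed by sample ID
zscores = {'TCGA-AB-1234-01': 1.7, 'TCGA-CD-5678-01': -1.3, 'TCGA-EF-9012-01': 0.2}

# patients with a z-score between -1 and 1 belong to neither group
high = [to_patient_id(s) for s, z in zscores.items() if z > 1]
low = [to_patient_id(s) for s, z in zscores.items() if z < -1]
print(high, low)
```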


In [None]:
#for each splice event, record the p-value, the effect size, and which group (ESRP1-high or ESRP1-low) has the higher mean PSI
direction = [] 
p_list = []
effect_list =[]
for index, row in PRAD.iterrows():
    #get list of ESRP1-high and ESRP1-low samples
    high_sample = PRAD.loc[index, ESPR_high_id].values
    high_sample = list(high_sample[~pd.isnull(high_sample)])
    low_sample = PRAD.loc[index, ESPR_low_id].values
    low_sample = list(low_sample[~pd.isnull(low_sample)])
    
    #calculate Mann Whitney p-value and effect size comparing ESRP1-high and low PSI values
    if len(low_sample) > 1 and len(high_sample) > 1:
        p_value, effect_size = stat_utils.calculateMW_EffectSize(high_sample, low_sample)
    else:
        #too few samples in one group to run the test
        p_value = np.nan
        effect_size = np.nan
    p_list.append(p_value)
    effect_list.append(effect_size)
    
    #Determine whether ESRP1-high or low samples have higher PSI values (based on mean)
    if pd.isnull(p_value):
        direction.append(np.nan)
    elif np.mean(high_sample) > np.mean(low_sample):
        direction.append('High')
    else: 
        direction.append('Low')
        
#save results to dataframe
PRAD['ESRP1'] = direction
PRAD['p'] = p_list
PRAD['Effect Size'] = effect_list
PRAD = PRAD.dropna(subset = 'p')

#adjust p-values using Benjamini-Hochberg method
PRAD = PRAD.sort_values(by = 'p', ascending = True)
PRAD['Adj p'] = stat_utils.adjustP(PRAD['p'].values)

#add PTMs to differentially included exons
PRAD_ptms = PRAD.copy()
#remove mutually exclusive exon events from analysis
PRAD_ptms = PRAD_ptms[~PRAD_ptms['exons'].apply(lambda x: '|' in x)]
#separate exons in events with multiple exons
PRAD_ptms['exons'] = PRAD_ptms["exons"].apply(lambda x: x.split(':'))
PRAD_ptms = PRAD_ptms.explode('exons').drop_duplicates(subset = ['symbol','exons','ESRP1','p'])
PRAD_ptms['exons'] = PRAD_ptms['exons'].astype(float)
#merge with mapped PTMs
PRAD_ptms = PRAD_ptms.merge(mapped_ptms, left_on = ['symbol','exons'], right_on = ['Symbol', 'Exon'])
PRAD_ptms = PRAD_ptms.sort_values(by = 'Adj p')
PRAD_ptms = PRAD_ptms.drop_duplicates(subset = ['PTM', 'ESRP1'])

#save data
PRAD_ptms.to_csv(analysis_dir + f'/TCGA/{tissue}_ESRP1_PTMs.csv')
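
The multiple-testing correction above is delegated to `stat_utils.adjustP`, which is not shown in this notebook. For reference, a minimal sketch of the standard Benjamini-Hochberg procedure is given below. It is an illustration of the method under the assumption that `adjustP` follows the textbook definition, and it does not require the p-values to be sorted beforehand.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so adjusted values never increase with rank
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adj)
```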

### Gene set enrichment of ESRP1-correlated genes

To assess the general impact of ESRP1 on gene splicing and biological pathways, we used the gseapy package and the Enrichr API to perform gene set enrichment analysis (ESRP1-related genes vs. all genes in the prostate dataset). This data was used for Supplementary Figure 12.

[Return to Table of Contents](#table-of-contents)

In [114]:
from gseapy import enrichr

#load prostate data
PRAD_ptms = pd.read_csv(analysis_dir + f'/TCGA/PRAD_ESRP1_PTMs.csv')
sig_ptms = PRAD_ptms[(PRAD_ptms['Adj p'] <= 0.05) & (PRAD_ptms['Effect Size'] >= 0.25)].copy()
sig_ptms = sig_ptms.drop_duplicates(subset = ['PTM', 'ESRP1'])
sig_ptms = sig_ptms.drop_duplicates(subset = 'PTM', keep = False) #remove PTMs that are significant in both directions

#run enrichr analysis
gene_set_names = {'GO_Cellular_Component_2023':'GO Cellular Component', 'GO_Molecular_Function_2023':'GO Molecular Function', 'GO_Biological_Process_2023':'GO Biological Process', 'KEGG_2021_Human':'KEGG Pathways'}
enrichr_results = enrichr(list(sig_ptms['symbol'].unique()), background = list(PRAD_ptms['symbol'].unique()), gene_sets = list(gene_set_names.keys()), organism='human').results
enrichr_results['Gene Set Title'] = enrichr_results['Gene_set'].apply(lambda x: gene_set_names[x])

#save data
if not os.path.exists(analysis_dir + '/TCGA/Enrichment'):
    os.mkdir(analysis_dir + '/TCGA/Enrichment')
enrichr_results.to_csv(analysis_dir + '/TCGA/Enrichment/Gene_Set_Enrichr.csv', index = False)
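
The two `drop_duplicates` calls above first collapse repeated (PTM, direction) pairs and then discard any PTM that is significant in both the ESRP1-high and ESRP1-low directions. The sketch below shows the same filter in plain Python. The PTM identifiers are hypothetical and `single_direction_ptms` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical (PTM, direction) pairs that passed the significance filter
sig = [('P00001_S5', 'High'), ('P00001_S5', 'High'), ('P00002_Y9', 'High'),
       ('P00002_Y9', 'Low'), ('P00003_T2', 'Low')]

def single_direction_ptms(pairs):
    """Keep PTMs associated with exactly one direction (High or Low)."""
    directions = defaultdict(set)
    for ptm, direction in pairs:
        directions[ptm].add(direction)
    return {ptm: next(iter(d)) for ptm, d in directions.items() if len(d) == 1}

kept = single_direction_ptms(sig)
print(kept)
```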

### Site enrichment of ESRP1-correlated PTMs

To determine whether ESRP1-correlated PTMs are enriched in PTM sites with specific functions or are involved in specific processes, we obtained PhosphoSitePlus annotations of PTM function and performed enrichment analysis using Fisher's exact test (ESRP1-related PTMs vs. all PTMs in the prostate dataset). This data was used for Supplementary Figure 13.

[Return to Table of Contents](#table-of-contents)

In [121]:
#load annotation data
annotations = pd.read_csv(database_dir + '/PhosphoSitePlus/Regulatory_sites.gz', sep = '\t',compression = 'gzip', on_bad_lines='skip', header = 2)
annotations['Substrate'] = annotations['ACC_ID'] + '_' + annotations['MOD_RSD'].apply(lambda x: x.split('-')[0])


#load prostate data
PRAD_ptms = pd.read_csv(analysis_dir + f'/TCGA/PRAD_ESRP1_PTMs.csv')
sig_ptms = PRAD_ptms[(PRAD_ptms['Adj p'] <= 0.05) & (PRAD_ptms['Effect Size'] >= 0.25)].copy()
sig_ptms = sig_ptms.drop_duplicates(subset = ['PTM', 'ESRP1'])
sig_ptms = sig_ptms.drop_duplicates(subset = 'PTM', keep = False) #remove PTMs that are significant in both directions

#get function enrichment
functions = stat_utils.constructPivotTable(PRAD_ptms, annotations, reference_col = 'ON_FUNCTION', collapse_on_similar=True)
function_enrichment = stat_utils.generate_site_enrichment(sig_ptms['PTM'].unique(), functions, subset_name = 'ESRP1-Regulated', type = 'Function')
processes = stat_utils.constructPivotTable(PRAD_ptms, annotations, reference_col = 'ON_PROCESS', collapse_on_similar=True)
process_enrichment = stat_utils.generate_site_enrichment(sig_ptms['PTM'].unique(), processes, subset_name = 'ESRP1-Regulated', type = 'Process')


#save data
if not os.path.exists(analysis_dir + '/TCGA/Enrichment'):
    os.mkdir(analysis_dir + '/TCGA/Enrichment')

function_enrichment.to_csv(analysis_dir + '/TCGA/Enrichment/Site_Function_Enrichment.csv')
process_enrichment.to_csv(analysis_dir + '/TCGA/Enrichment/Site_Process_Enrichment.csv')
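
Matching annotations to PTMs depends on the `Substrate` key built at the top of the cell above: PhosphoSitePlus writes modified residues as residue, position, and a modification suffix (for example `S473-p`), and the suffix is dropped so the key has the same `accession_residue` form as the PTM identifiers used in this notebook. A minimal sketch of that conversion (`site_key` is a hypothetical helper name):

```python
def site_key(acc_id, mod_rsd):
    """Build an 'accession_residue' key from a UniProt accession and a 'S473-p'-style MOD_RSD."""
    return acc_id + '_' + mod_rsd.split('-')[0]  # drop the '-p', '-ub', ... suffix

# AKT1 (P31749) phosphorylated at S473, written in PhosphoSitePlus style
key = site_key('P31749', 'S473-p')
print(key)
```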

### Kinase Enrichment of ESRP1-correlated Substrates

To determine whether cell signaling might be altered by splicing-controlled phosphorylation sites in ESRP1-high prostate cancer, we tested for enrichment of each kinase's substrates, based on either known substrates annotated in PhosphoSitePlus or predicted substrates from the ensemble of kinase-substrate networks in KSTAR, a kinase activity inference tool. This data was used for Supplementary Figure 16.

[Return to Table of Contents](#table-of-contents)

In [124]:
#import the hypergeometric distribution used for the enrichment tests below
from scipy.stats import hypergeom

#load splice seq data
prostate = pd.read_csv(analysis_dir + '/TCGA/PRAD_ESRP1_PTMs.csv') #file name matches the case used when the file was saved

#add modification information
prostate = prostate.merge(mapper.ptm_info['Modification'].reset_index(), on='PTM')

#get significant prostate
sig_prostate = prostate[(prostate['Adj p'] < 0.05) & (prostate['Effect Size'] >= 0.25)]

#separate unique modifications into unique rows so that data can be restricted to phosphotyrosine or phosphoserine/threonine
exploded_prostate = prostate.copy()
exploded_prostate['Modification'] = exploded_prostate['Modification'].str.split(';')
exploded_prostate = exploded_prostate.explode('Modification')

#get sig
exploded_sig_prostate = exploded_prostate[(exploded_prostate['Adj p'] < 0.05) & (exploded_prostate['Effect Size'] >= 0.25)]

#restrict to phosphotyrosine data
sig_prostate_y = exploded_sig_prostate[exploded_sig_prostate['Modification'] == 'Phosphotyrosine']
prostate_y = exploded_prostate[exploded_prostate['Modification'] == 'Phosphotyrosine']

#restrict to phosphoserine/threonine data
sig_prostate_st = exploded_sig_prostate[exploded_sig_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]
prostate_st = exploded_prostate[exploded_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]

#### Known Substrates from PhosphoSitePlus

In [128]:
# load known kinases from PhosphoSitePlus
ks_dataset = pd.read_csv(database_dir + '/PhosphoSitePlus/Kinase_Substrate_Dataset.tsv', sep ='\t')
ks_dataset = ks_dataset[(ks_dataset['KIN_ORGANISM'] == 'human') & (ks_dataset['SUB_ORGANISM'] == 'human')] #require both kinase and substrate to be human
ks_dataset['PTM'] = ks_dataset['SUB_ACC_ID']+'_'+ks_dataset['SUB_MOD_RSD']

##### Phosphotyrosines

In [129]:
#add annotations info to esrp1-regulated phosphotyrosines
prostate_known = prostate_y.merge(ks_dataset[['GENE','PTM']], on = 'PTM')
sig_prostate_known = sig_prostate_y.merge(ks_dataset[['GENE','PTM']], on = 'PTM')

#iterate through each kinase and perform enrichment for substrates using hypergeometric test
results = pd.DataFrame(np.nan, index = sig_prostate_known['GENE'].unique(), columns = ['k','n','M','N','p'])
for kinase in sig_prostate_known['GENE'].unique():
    #get numbers for a hypergeometric test to look for enrichment of kinase substrates
    k = sig_prostate_known[sig_prostate_known['GENE'] == kinase].shape[0]
    n = prostate_known[prostate_known['GENE'] == kinase].shape[0]
    M = prostate_y.shape[0]
    N = sig_prostate_y.shape[0]


    results.loc[kinase,'p'] = hypergeom.sf(k-1, M, n, N)
    results.loc[kinase, 'M'] = M
    results.loc[kinase, 'N'] = N
    results.loc[kinase, 'k'] = k
    results.loc[kinase, 'n'] = n

results.to_csv(analysis_dir + '/TCGA/Enrichment/tyrosine_kinase_enrichment_known.csv')

##### Phosphoserines/threonines

In [130]:
#add annotations info to esrp1-regulated phosphoserines/threonines
prostate_known = prostate_st.merge(ks_dataset[['GENE','PTM']], on = 'PTM')
sig_prostate_known = sig_prostate_st.merge(ks_dataset[['GENE','PTM']], on = 'PTM')

results = pd.DataFrame(np.nan, index = sig_prostate_known['GENE'].unique(), columns = ['k','n','M','N','p'])
for kinase in sig_prostate_known['GENE'].unique():
    #get numbers for a hypergeometric test to look for enrichment of kinase substrates
    k = sig_prostate_known[sig_prostate_known['GENE'] == kinase].shape[0]
    n = prostate_known[prostate_known['GENE'] == kinase].shape[0]
    M = prostate_st.shape[0]
    N = sig_prostate_st.shape[0]


    results.loc[kinase,'p'] = hypergeom.sf(k-1, M, n, N)
    results.loc[kinase, 'M'] = M
    results.loc[kinase, 'N'] = N
    results.loc[kinase, 'k'] = k
    results.loc[kinase, 'n'] = n

results = results.sort_values(by = 'p', ascending = False)
results.to_csv(analysis_dir + '/TCGA/Enrichment/st_kinase_enrichment_known.csv')
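
The enrichment p-value in the two cells above is the probability of drawing at least `k` substrates of a kinase when `N` significant sites are sampled from `M` total sites, `n` of which are substrates of that kinase. Because the survival function gives P(X > x), it is evaluated at `k-1` to include `k` itself. The sketch below reproduces that calculation with only the standard library. The counts are invented, and `hypergeom_sf` is a stand-in written for illustration that mirrors the argument order of `scipy.stats.hypergeom.sf`.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X > k) for X ~ Hypergeometric(M population, n successes, N draws)."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, x) * comb(M - n, N - x) for x in range(k + 1, upper + 1)) / total

# Hypothetical counts: 20 phosphosites in total (M), 5 are substrates of a kinase (n),
# 4 sites are significant (N), and 2 of the significant sites are substrates (k)
M, n, N, k = 20, 5, 4, 2
p_enriched = hypergeom_sf(k - 1, M, n, N)  # P(X >= k)
print(p_enriched)
```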

#### KSTAR Predicted Kinase Substrate Interactions

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy.stats import hypergeom

from kstar import config

#location of figshare data
figshare_dir = 'C:/Users/Sam/OneDrive/Documents/GradSchool/Research/Splicing/Paper_Prep/PTM_Splicing_FigShare/'
#where to find projected PTMs
ptm_data_dir = figshare_dir + 'PTM_Projection_Data/'
#where to find information from different databases
database_dir = figshare_dir + 'External_Data/'
#where analysis data will be saved
analysis_dir = figshare_dir + 'Analysis_For_Paper/'

#load ptm info
ptm_info = pd.read_csv(ptm_data_dir + '/processed_data_dir/ptm_info.csv', index_col = 0)

#load splice seq data
prostate = pd.read_csv(analysis_dir + '/TCGA/PRAD_ESRP1_PTMs.csv') #file name matches the case used when the file was saved

#add modification information
prostate = prostate.merge(ptm_info['Modification'].reset_index(), on='PTM')

#get significant prostate
sig_prostate = prostate[(prostate['Adj p'] < 0.05) & (prostate['Effect Size'] >= 0.25)]

#separate unique modifications into unique rows so that data can be restricted to phosphotyrosine or phosphoserine/threonine
exploded_prostate = prostate.copy()
exploded_prostate['Modification'] = exploded_prostate['Modification'].str.split(';')
exploded_prostate = exploded_prostate.explode('Modification')

#get sig
exploded_sig_prostate = exploded_prostate[(exploded_prostate['Adj p'] < 0.05) & (exploded_prostate['Effect Size'] >= 0.25)]

#restrict to phosphotyrosine data
sig_prostate_y = exploded_sig_prostate[exploded_sig_prostate['Modification'] == 'Phosphotyrosine']
prostate_y = exploded_prostate[exploded_prostate['Modification'] == 'Phosphotyrosine']

#restrict to phosphoserine/threonine data
sig_prostate_st = exploded_sig_prostate[exploded_sig_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]
prostate_st = exploded_prostate[exploded_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]

#grab networks 
networks = {}
networks['Y'] =pickle.load(open(config.NETWORK_Y_PICKLE, "rb" ) )
networks['ST'] = pickle.load(open(config.NETWORK_ST_PICKLE, "rb" ) )


##### Functions for calculating enrichment

In [2]:
def get_enrichment_single_network(network, prostate, sig_prostate, type = 'All'):
    """
    Given prostate data and a single KSTAR network, get enrichment for each kinase in the network in the prostate data. Assumes the prostate data has already been reduced to the modification of interest (phosphotyrosine or phosphoserine/threonine)

    Parameters
    ----------
    network : pandas dataframe
        KSTAR network
    prostate : pandas dataframe
        all PTMs identified in TCGA prostate data, regardless of significance (reduced to only include mods of interest)
    sig_prostate : pandas dataframe
        significant PTMs identified in TCGA prostate data, adjusted p < 0.05 and effect size >= 0.25 (reduced to only include mods of interest)
    type : str
        which significant PTMs to test: 'All', or only those with higher inclusion in ESRP1 'High' or 'Low' patients

    Returns
    -------
    results : pandas dataframe
        hypergeometric test inputs (k, n, M, N) and p-value for each kinase
    """
    network['PTM'] = network['KSTAR_ACCESSION'] + '_' + network['KSTAR_SITE']
    #if focusing on high or low groups restrict to those, otherwise add network information to sig prostate without changing
    if type == 'All':
        sig_prostate_kstar = sig_prostate.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')
    elif type == 'High':
        sig_prostate = sig_prostate[sig_prostate['ESRP1'] == 'High']
        sig_prostate_kstar = sig_prostate.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')
    elif type == 'Low':
        sig_prostate = sig_prostate[sig_prostate['ESRP1'] == 'Low']
        sig_prostate_kstar = sig_prostate.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')

    #add network information to all prostate data
    prostate_kstar = prostate.merge(network[['KSTAR_KINASE','PTM']], on = 'PTM')

    results = pd.DataFrame(np.nan, index = sig_prostate_kstar['KSTAR_KINASE'].unique(), columns = ['k','n','M','N','p'])
    for kinase in sig_prostate_kstar['KSTAR_KINASE'].unique():
        #get numbers for a hypergeometric test to look for enrichment of kinase substrates
        k = sig_prostate_kstar[sig_prostate_kstar['KSTAR_KINASE'] == kinase].shape[0]
        n = prostate_kstar[prostate_kstar['KSTAR_KINASE'] == kinase].shape[0]
        M = prostate.shape[0]
        N = sig_prostate.shape[0]

        #run hypergeometric test
        results.loc[kinase,'p'] = hypergeom.sf(k-1, M, n, N)
        results.loc[kinase, 'M'] = M
        results.loc[kinase, 'N'] = N
        results.loc[kinase, 'k'] = k
        results.loc[kinase, 'n'] = n

    return results

def get_enrichment_all_networks(networks, prostate, sig_prostate, type = 'All'):
    """
    Given prostate data and a dictionary of KSTAR networks, get enrichment for each kinase in each network in the prostate data. Assumes the prostate data has already been reduced to the modification of interest (phosphotyrosine or phosphoserine/threonine)

    Parameters
    ----------
    networks : dict
        dictionary of KSTAR networks
    prostate : pandas dataframe
        all PTMs identified in TCGA prostate data, regardless of significance (reduced to only include mods of interest)
    sig_prostate : pandas dataframe
        significant PTMs identified in TCGA prostate data, adjusted p < 0.05 and effect size >= 0.25 (reduced to only include mods of interest)
    type : str
        which significant PTMs to test: 'All', 'High', or 'Low'

    Returns
    -------
    results : dict
        dictionary of enrichment results, one dataframe per network
    """
    results = {}
    for network in networks:
        results[network] = get_enrichment_single_network(networks[network], prostate, sig_prostate, type = type)
    return results

def extract_enrichment(results):
    """
    Given a dictionary of results from get_enrichment_all_networks, extract the p-values for each network and kinase, and then calculate the median p-value across all networks for each kinase

    Parameters
    ----------
    results : dict
        dictionary of results from get_enrichment_all_networks

    Returns
    -------
    enrichment : pandas dataframe
        p-value for each kinase (rows) in each network (columns), plus a 'median' column
    """
    enrichment = pd.DataFrame(index = results['nkin0'].index, columns = results.keys())
    for network in results:
        enrichment[network] = results[network]['p']
    enrichment['median'] = enrichment.median(axis = 1)
    return enrichment
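
Because KSTAR provides an ensemble of networks, each kinase receives one p-value per network, and `extract_enrichment` summarizes them with the median. The sketch below shows that summary step on invented p-values. Only the network name `nkin0` comes from the code above; the other names and all values are hypothetical. One difference to note: the function above only keeps kinases present in the `nkin0` results, whereas this sketch takes the union of kinases across networks.

```python
from statistics import median

# Hypothetical p-values per network (outer keys) and kinase (inner keys)
results = {
    'nkin0': {'SRC': 0.01, 'ABL1': 0.40},
    'nkin1': {'SRC': 0.03, 'ABL1': 0.20},
    'nkin2': {'SRC': 0.02},  # ABL1 has no significant substrates in this network
}

def median_p_per_kinase(results):
    """Return the median p-value across networks for every kinase seen in any network."""
    kinases = sorted({kin for res in results.values() for kin in res})
    # networks without a value for a kinase are skipped, as pandas does with NaN
    return {kin: median(res[kin] for res in results.values() if kin in res) for kin in kinases}

medians = median_p_per_kinase(results)
print(medians)
```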

##### Phosphotyrosines


In [4]:
#restrict to phosphotyrosine data
sig_prostate_y = exploded_sig_prostate[exploded_sig_prostate['Modification'] == 'Phosphotyrosine']
prostate_y = exploded_prostate[exploded_prostate['Modification'] == 'Phosphotyrosine']


#combined median values for each group
median_y = pd.DataFrame(index = networks['Y']['nkin0']["KSTAR_KINASE"].unique(), columns = ['All','High','Low'])

#run enrichment on all networks
for type in ['All', 'High', 'Low']:
    results_y = get_enrichment_all_networks(networks['Y'], prostate_y, sig_prostate_y, type = type)

    #extract enrichment
    enrichment_y = extract_enrichment(results_y)
    median_y[type] = enrichment_y['median']

median_y.to_csv(analysis_dir + '/TCGA/Enrichment/tyrosine_kinase_enrichment_predicted.csv')

##### Serine/threonines 

In [6]:
sig_prostate_st = exploded_sig_prostate[exploded_sig_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]
prostate_st = exploded_prostate[exploded_prostate['Modification'].isin(['Phosphoserine', 'Phosphothreonine'])]

#combined median values for each group
median_st = pd.DataFrame(index = networks['ST']['nkin0']["KSTAR_KINASE"].unique(), columns = ['All','High','Low'])

#run enrichment on all networks
for type in ['All', 'High', 'Low']:
    results_st = get_enrichment_all_networks(networks['ST'], prostate_st, sig_prostate_st, type = type)

    #extract enrichment
    enrichment_st = extract_enrichment(results_st)
    median_st[type] = enrichment_st['median']

median_st.to_csv(analysis_dir + '/TCGA/Enrichment/st_kinase_enrichment_predicted.csv')

### Annotate ESRP1-Regulated SGK substrates

In [8]:
from kstar.analysis import interactions

# load known kinases from PhosphoSitePlus
ks_dataset = pd.read_csv(database_dir + '/PhosphoSitePlus/Kinase_Substrate_Dataset.tsv', sep ='\t')
ks_dataset = ks_dataset[(ks_dataset['KIN_ORGANISM'] == 'human') & (ks_dataset['SUB_ORGANISM'] == 'human')] #require both kinase and substrate to be human
ks_dataset['PTM'] = ks_dataset['SUB_ACC_ID']+'_'+ks_dataset['SUB_MOD_RSD']

#extract phosphoserines/threonines with higher inclusion in ESRP1-high patients
high_sig = sig_prostate_st[sig_prostate_st['ESRP1'] == 'High'].copy()
high_sig['data:sig_ptms'] = 1
high_sig.rename({'PTM':'KSTAR_SUBSTRATE'},  axis = 1, inplace = True)
high_sig['KSTAR_SITE'] = high_sig['KSTAR_SUBSTRATE'].apply(lambda x: x.split('_')[1])
high_sig['KSTAR_ACCESSION'] = high_sig['KSTAR_SUBSTRATE'].apply(lambda x: x.split('_')[0])

#get substrate influence on prediction
kinase = 'SGK1'
experiment_influence = interactions.getSubstrateInfluence_inExperiment(networks['ST'], high_sig, kinase)

#process and annotate known substrates
known_kin = ks_dataset[ks_dataset['GENE'] == kinase]
sub_data = experiment_influence['data:sig_ptms'].reset_index()
sub_data['UniProt_ID'] = sub_data['index'].apply(lambda x: x.split('_')[0])
sub_data['Site'] = sub_data['index'].apply(lambda x: x.split('_')[1])
sub_data = sub_data.merge(known_kin['PTM'], left_on = ['index'], right_on = ['PTM'], how = 'left')
sub_data["Known"] = sub_data["PTM"].apply(lambda x: 0 if pd.isnull(x) else 1)
sub_data = sub_data.drop('PTM', axis = 1)

#save data
sub_data.to_csv(analysis_dir + f'/TCGA/Enrichment/ESRP1_Regulated_SGK1_Substrates.csv')

## Analysis of PTMs in Canonical and Alternative UniProt Isoforms

Using data from ProteomeScout, we assessed how many PTMs are annotated in the canonical and alternative UniProtKB isoforms. This data was used in Supplementary Figure 2.

[Return to Table of Contents](#table-of-contents)

In [2]:
def find_ptms(isoform_id, transcript_id):
    """
    Given an isoform id, find all PTMs present in the protein

    Parameters
    ----------
    isoform_id: string
        UniProt isoform ID of the protein of interest
    transcript_id: string
        Ensembl transcript for the protein of interest

    Returns
    -------
    ptm_df: pandas dataframe
        Dataframe containing transcript id, isoform id, location of residue, residue modified, and modification type. Each row
        corresponds to a unique ptm. None if no PTMs were found for the isoform
    """
    ptms = config.ps_api.get_PTMs(isoform_id)

    #extract ptm position
    if isinstance(ptms, int) or ptms == '-1':
        ptm_df = None
    else:
        ptm_df = pd.DataFrame(ptms)
        ptm_df.columns = ['PTM Location (AA)', 'Residue', 'Modification']
        ptm_df.insert(0, 'Isoform', isoform_id)
        ptm_df.insert(0, 'Transcript', transcript_id)

    return ptm_df
    
def findIsoformInfo(ptm_df, isoform_id, transcript_id):
    """
    Given a dataframe of ptm information associated with a specific isoform and transcript (such as one obtained from find_ptms()), find information regarding that isoform, including number of ptms, length, etc.

    Parameters
    ----------
    ptm_df: pandas dataframe
        Dataframe containing transcript id, isoform id, location of residue, residue modified, and modification type. Each row corresponds to a unique ptm
    isoform_id: string
        UniProt isoform ID
    transcript_id: string
        Ensembl transcript for the isoform of interest

    Returns
    -------
    isoform_info: pandas dataframe
        Single-row dataframe with the transcript, isoform, protein length, and number of PTMs
    """
    #get sequence length and number of ptms
    if ptm_df is None:
        num_ptms = 0
    else:
        num_ptms = ptm_df.shape[0]


    prot_length = config.ps_api.get_sequence(isoform_id)
    if isinstance(prot_length, int) or prot_length == '-1':
        prot_length = None
    else:
        prot_length = len(prot_length)

    isoform_info = pd.DataFrame({'Transcript':transcript_id,'Isoform': isoform_id, 
                                 'Protein Length': prot_length, 'Number of PTMs': num_ptms}, index = [isoform_id])
    return isoform_info

In [6]:


isoform_info_dict = {}
isoform_ptm_dict = {}

for type in ['Canonical', 'Alternative']:
    isoforms = config.translator[config.translator['Uniprot Canonical'] == type]
    info_list = []
    df_list = []
    for i, row in isoforms.iterrows():
        #if canonical, use normal uniprot id (i.e no '-1' at end)
        if type == 'Canonical':
            isoform_id = row['UniProtKB/Swiss-Prot ID']
        else:
            isoform_id = row['UniProtKB isoform ID']
            
        transcript_id = row['Transcript stable ID']
        #find ptms associated with isoform
        ptm_df = find_ptms(isoform_id, transcript_id)

        #get summary info about each isoform
        isoform_info = findIsoformInfo(ptm_df, isoform_id, transcript_id)
        if isoform_info is not None:
            info_list.append(isoform_info)

        #if desired save info about each ptm in isoforms
        if ptm_df is not None:
            df_list.append(ptm_df)

    info_df = pd.concat(info_list)
    info_df['UniProt ID'] = info_df['Isoform'].str.split('-').str[0]
    info_df['Type'] = type
    isoform_info_dict[type] = info_df
    isoform_ptm_dict[type] = pd.concat(df_list)

#calculate ratio of PTMs in alternative isoforms relative to canonical isoforms
#combine canonical and alternative info
ratio_data = isoform_info_dict['Canonical'].merge(isoform_info_dict['Alternative'], on = 'UniProt ID')
#calculate the ratio of ptms in the alternative isoform to the canonical isoform
ratio_list = []
for i, row in ratio_data.iterrows():
    if row['Number of PTMs_x'] == 0:
        ratio_list.append(row['Number of PTMs_y'])
    else:
        ratio_list.append(row['Number of PTMs_y']/row['Number of PTMs_x'])
ratio_data['Ratio'] = ratio_list
ratio_data = ratio_data.rename({'Transcript_x': 'Canonical Transcript', 'Isoform_x':'Canonical Isoform', 'Protein Length_x':'Canonical Protein Length', 'Number of PTMs_x':'Number of PTMs in Canonical', 'Transcript_y':'Alternative Transcript', 'Isoform_y':'Alternative Isoform', 'Protein Length_y':'Alternative Protein Length', 'Number of PTMs_y':'Number of PTMs in Alternative'}, axis = 1)
ratio_data = ratio_data.drop(['UniProt ID', 'Type_x', 'Type_y'], axis = 1)
ratio_data = ratio_data.drop_duplicates()
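
The ratio computed above divides the number of PTMs in the alternative isoform by the number in the canonical isoform, and falls back to the raw alternative count when the canonical isoform has no annotated PTMs, which avoids division by zero. A minimal sketch of that rule (`ptm_ratio` is a hypothetical helper name):

```python
def ptm_ratio(n_canonical, n_alternative):
    """Alternative-to-canonical PTM ratio, or the alternative count if canonical has no PTMs."""
    if n_canonical == 0:
        return n_alternative  # not a true ratio: canonical isoform has no annotated PTMs
    return n_alternative / n_canonical

ratios = [ptm_ratio(10, 5), ptm_ratio(4, 4), ptm_ratio(0, 3)]
print(ratios)
```

Rows where the canonical count is zero are therefore on a different scale from the others and may need to be handled separately when the ratios are plotted or summarized.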

In [21]:
#save data
isoform_ptm_dict['Alternative'].to_csv(analysis_dir + '/PTMs_in_Isoforms/uniprot_alt_isoform_ptms.csv', index = False)
ratio_data.to_csv(analysis_dir + '/PTMs_in_Isoforms/alternative_to_canonical_ratio.csv', index = False)
