# Process SRM trait association data

This script combines two spreadsheets:
1. Peptide associations with AD-related traits
2. Descriptions of each peptide

The trait association file contains only raw p-values, which have not been adjusted for multiple testing. This script adjusts the p-values for each trait using the Bonferroni correction and marks a peptide as significant if any of its adjusted p-values (adjP) is <= 0.05. It also looks up the Ensembl gene ID for each protein and adds it to the data frame.

This script requires several libraries to be installed:
```
pip install openpyxl statsmodels unipressed
```

**TODO:** upload the output file to Synapse with the correct provenance.

In [1]:
import pandas as pd
import statsmodels.stats.multitest as mt
from unipressed import IdMappingClient
import time
#import synapseclient

#syn = synapseclient.Synapse()
#syn.login(silent=True)

These files were provided by the lab via email. 

In [3]:
stats_srm = pd.read_excel('../../input/srm/Round_1_2.output.combined.finaldta.xlsx', sheet_name=0)
stats_srm = stats_srm.rename(columns = {'Peptide': 'Peptide Name'})

peptide_info = pd.read_excel('../../input/srm/Supplementary Tables R2.xlsx', sheet_name=0)
peptide_info = peptide_info.dropna(subset = ['UniProtAC'])

Some peptide names do not match between the two files. The mapping below comes from the peptide info spreadsheet. Some names not listed here are still mismatched after the remap, but we have no further information about them.

In [4]:
remaps = {"bA": "APP_3", "bA38": "APP_5", "bA42": "APP_6", "tau_AT8_s202": "MAPT_2", 
          "HLA_B_2": "HLA-B_2", "HLA_B_5": "HLA-B_5"}

for key, value in remaps.items():
    stats_srm.loc[stats_srm['Peptide Name'] == key, 'Peptide Name'] = value
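
To see which names remain mismatched after the remap, one option is to compare the two sets of peptide names directly. The sketch below uses small toy data frames (`stats_toy` and `info_toy` are hypothetical stand-ins for `stats_srm` and `peptide_info`, and their values are illustrative only).

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets (illustrative values only)
stats_toy = pd.DataFrame({'Peptide Name': ['APP_3', 'HLA-B_2', 'XYZ_1']})
info_toy = pd.DataFrame({'Peptide Name': ['APP_3', 'HLA-B_2', 'MAPT_2']})

stats_names = set(stats_toy['Peptide Name'])
info_names = set(info_toy['Peptide Name'])

# Names that have statistics but no description, and the reverse
only_in_stats = sorted(stats_names - info_names)
only_in_info = sorted(info_names - stats_names)

print("Only in stats:", only_in_stats)
print("Only in info:", only_in_info)
```

Running the same comparison on the real data frames would list the peptides that silently drop out of, or come through empty in, the later left merge.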

Get the mapping between UniProt AC and Ensembl ID

In [5]:
uniprot_ids = peptide_info['UniProtAC'].drop_duplicates()
request = IdMappingClient.submit(source='UniProtKB_AC-ID', dest='Ensembl', ids=uniprot_ids)

found = False
while not found:
    time.sleep(2)
    try:
        ensembl_ids = list(request.each_result())
        found = True
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print("Waiting for response from UniProt...")
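
The loop above retries forever if the job never completes. A minimal sketch of a bounded alternative follows; `poll_until_ready` is a hypothetical helper, not part of `unipressed`, and the interval and attempt limit are arbitrary assumptions.

```python
import time

def poll_until_ready(fetch, interval=2.0, max_attempts=30):
    """Call fetch() until it succeeds or max_attempts is exhausted."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as err:  # the real exception type depends on the client
            last_error = err
            print(f"Attempt {attempt}: not ready yet ({err!r})")
            time.sleep(interval)
    # Give up with an explicit error, chaining the last failure for context
    raise TimeoutError(f"No result after {max_attempts} attempts") from last_error
```

In this notebook the call might look like `ensembl_ids = poll_until_ready(lambda: list(request.each_result()))`, which fails with a clear error instead of hanging.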

Rename columns to match, and get rid of the Ensembl version at the end of each ID

In [6]:
ensembl_ids = pd.DataFrame(ensembl_ids).rename(columns = {'from': 'UniProtAC', 'to': 'ensembl_gene_id'})
ensembl_ids['ensembl_gene_id'] = ensembl_ids['ensembl_gene_id'].str.extract(r'(\w+)', expand=False)
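
As a quick illustration of why the `(\w+)` pattern drops the version: `\w` matches letters, digits and underscores but not the `.` that separates the ID from its version, so the first match stops there. The version number in this toy example is illustrative only.

```python
import pandas as pd

# One versioned ID and one unversioned ID (the version suffix is made up)
ids = pd.Series(['ENSG00000142192.22', 'ENSG00000186868'])

# expand=False returns a Series rather than a one-column DataFrame
stripped = ids.str.extract(r'(\w+)', expand=False)
print(stripped.tolist())
```

IDs that have no version suffix pass through unchanged.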

Add the IDs to the peptide info data frame

**TODO:** how to handle one UniProt ID -> multiple Ensembl IDs?

In [7]:
peptide_info = peptide_info.merge(ensembl_ids, how = 'left', on = 'UniProtAC')
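
The left merge above repeats a peptide row once per Ensembl ID when a UniProt accession maps to several genes. Two possible ways to avoid that are sketched below on toy data; neither is a decision made in this notebook, and the accessions and gene IDs are placeholders.

```python
import pandas as pd

# Toy mapping where one accession maps to two gene IDs (placeholder values)
mapping = pd.DataFrame({
    'UniProtAC': ['P00001', 'P00001', 'P00002'],
    'ensembl_gene_id': ['ENSG_A', 'ENSG_B', 'ENSG_C'],
})

# Option 1: keep only the first Ensembl ID per accession
first_only = mapping.drop_duplicates(subset='UniProtAC', keep='first')

# Option 2: collapse all IDs into one semicolon-separated string per accession
collapsed = (
    mapping.groupby('UniProtAC')['ensembl_gene_id']
    .agg(';'.join)
    .reset_index()
)

print(first_only)
print(collapsed)
```

Either result has one row per accession, so merging it into `peptide_info` would no longer duplicate peptide rows. Option 1 discards information, while Option 2 changes the column from a single ID to a delimited list.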

For every p-value field, adjust it for multiple testing. All p-value fields start with "P_". 

Then create a new field called "adjP_\<field\>" with the adjusted values. 

In [8]:
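# Match on the prefix so that re-running the cell does not pick up the 'adjP_' columns
fields = [col for col in stats_srm.columns if col.startswith('P_')]

# Adjust using the Bonferroni correction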
for field in fields:
    (_, adjP, _, _) = mt.multipletests(pvals=stats_srm[field], alpha = 0.05, method='bonferroni')
    stats_srm['adj' + field] = adjP
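
For reference, the Bonferroni adjustment amounts to multiplying each p-value by the number of tests and capping the result at 1. The sketch below does this by hand on made-up p-values, without `statsmodels`, to show the arithmetic.

```python
import pandas as pd

# Made-up raw p-values for a single trait
pvals = pd.Series([0.001, 0.02, 0.04, 0.5])
n_tests = len(pvals)

# Bonferroni: multiply by the number of tests, then cap at 1
adj = (pvals * n_tests).clip(upper=1.0)
significant = adj <= 0.05

print(adj.tolist())
print(significant.tolist())
```

With four tests, only a raw p-value of 0.0125 or smaller stays at or below 0.05 after adjustment, so just the first value remains significant here.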

Take the minimum adjusted p-value across all traits for each peptide. The peptide is significant if that minimum is <= 0.05.

In [9]:
adj_fields = [col for col in stats_srm.columns if col.startswith('adjP_')]
stats_srm['adjP_min'] = stats_srm[adj_fields].min(axis = 1)

stats_srm['SRM_signifTF'] = stats_srm['adjP_min'] <= 0.05
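
One detail worth keeping in mind is how missing values behave in this step. The toy example below (made-up values, hypothetical column names) shows that the row-wise minimum skips NaN, that an all-NaN row stays NaN, and that comparing NaN with `<=` yields False.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'adjP_a': [0.20, np.nan, 0.01],
    'adjP_b': [0.03, np.nan, np.nan],
})

# NaN is skipped by default; a row with only NaN values stays NaN
toy['adjP_min'] = toy[['adjP_a', 'adjP_b']].min(axis=1)

# NaN <= 0.05 evaluates to False, so peptides with no p-values are not flagged
toy['signif'] = toy['adjP_min'] <= 0.05

print(toy)
```

A peptide with no p-values at all is therefore marked not significant rather than missing, which may or may not be the desired encoding.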

Merge the peptide descriptions with the p-value data frame. Some fields in the descriptions data frame are numeric (0 or 1) and need to be converted to boolean (True or False).

In [10]:
srm_final = peptide_info.merge(stats_srm, how = 'left', on = 'Peptide Name')
srm_final = srm_final.rename(columns = {'Gene': 'hgnc_symbol', 'Round #': 'Round'})

booleans = ['Detectable', 'Passed S/N QC', 'Best S/N within protein or targeted specie*']
srm_final[booleans] = srm_final[booleans].astype('boolean')
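
The nullable `'boolean'` dtype is used here rather than plain `bool`. The short sketch below, on a made-up series, shows the difference when a value is missing.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 flags with one missing value
flags = pd.Series([1, 0, np.nan])

# Nullable boolean keeps the missing value as <NA>
as_nullable = flags.astype('boolean')

# Plain bool has no missing value, so NaN is coerced to True
as_plain = flags.astype(bool)

print(as_nullable.tolist())
print(as_plain.tolist())
```

With plain `bool`, a missing flag would silently become True, so the nullable dtype is the safer choice whenever the spreadsheet has blanks in these columns.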

In the original SRM data for Agora, any field not filled in was blank (''). 

In [11]:
srm_final = srm_final.fillna('')

In [12]:
srm_final

| | Round | UniProtAC | hgnc_symbol | Description | Peptide Name | Peptide | Sequence with mod | Detectable | conc in spike (nM) | Passed S/N QC | ... | SE_tangles | P_tangles | adjP_cog_lv | adjP_cog_decline | adjP_diag | adjP_gpath | adjP_amyloid | adjP_tangles | adjP_min | SRM_signifTF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Q9UKV3 | ACIN1 | Apoptotic chromatin condensation inducer in th... | ACIN1_1 | TAQVPSPPR^ | TAQVPSPPR | False | | False | ... | | | | | | | | | | |
| 1 | 1 | Q9UKV3 | ACIN1 | Apoptotic chromatin condensation inducer in th... | ACIN1_2 | GVPAGNSDTEGGQPGR^ | GVPAGNSDTEGGQPGR | True | 0.2 | False | ... | | | | | | | | | | |
| 2 | 1 | Q6AI12 | ANKRD40 | Ankyrin repeat domain-containing protein 40 | ANKRD40_1 | NAASTLTERP(Cam)YNR^ | NAASTLTERPCYNR | True | 0.2 | False | ... | | | | | | | | | | |
| 3 | 1 | Q6AI12 | ANKRD40 | Ankyrin repeat domain-containing protein 40 | ANKRD40_2 | GEMPVQLTSR^ | GEMPVQLTSR | True | 0.22 | True | ... | 0.076819 | 0.61705 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | False |
| 4 | 1 | O94973 | AP2A2 | AP-2 complex subunit alpha-2 (100 kDa coated v... | AP2A2_1 | FFQPTEMASQDFFQR^ | FFQPTEMASQDFFQR | True | 2.6 | True | ... | 0.217253 | 0.000029 | 1.0 | 1.0 | 1.0 | 0.014246 | 0.003145 | 0.006629 | 0.003145 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 663 | 2 | P10636 | MAPT | Microtubule-associated protein tau (Neurofibri... | MAPT_47 | SPVVSGDTSPR^ | SPVVSGDTSPR | True | | True | ... | | | | | | | | | | |
| 664 | 2 | P10636 | MAPT | Microtubule-associated protein tau (Neurofibri... | MAPT_47 | SPVVSGDTSPR^ | SPVVSGDTSPR | True | | True | ... | | | | | | | | | | |
| 665 | 2 | P10636 | MAPT | Microtubule-associated protein tau (Neurofibri... | MAPT_48 | SPVVSGDT(pS)PR^ | SPVVSGDTS*PR | True | | True | ... | | | | | | | | | | |
| 666 | 2 | P10636 | MAPT | Microtubule-associated protein tau (Neurofibri... | MAPT_48 | SPVVSGDT(pS)PR^ | SPVVSGDTS*PR | True | | True | ... | | | | | | | | | | |


Write to CSV. 

**TODO:** upload to Synapse. 

In [13]:
srm_final.to_csv('../../output/SRMdata.csv', index = False)