# Process SRM differential expression for Agora
This script combines SRM differential expression data from round 1 and round 2 and formats it to be consistent with LFQ and TMT proteomics differential expression data. 

This script adjusts the p-values for each protein using FDR correction, and it also looks up the Ensembl gene IDs for each protein.

This script requires install of several libraries:

```
pip install openpyxl statsmodels unipressed
```

**TODO:** upload the output file to Synapse with the correct provenances.

In [1]:
import pandas as pd
import statsmodels.stats.multitest as mt
from unipressed import IdMappingClient
import time
import synapseclient
import agoradatatools.etl.extract as extract

# Step 1: Get and combine differential expression data
Rounds 1 & 2 differential expression are available on Synapse: https://www.synapse.org/#!Synapse:syn21444847

In [None]:
syn = synapseclient.Synapse()
syn.login(silent=True)

round1 = extract.get_entity_as_df(syn_id="syn21448389",
                                  source="csv",
                                  syn=syn)
round1 = round1.rename(columns = {"Unnamed: 0": "GeneName"})

round2 = extract.get_entity_as_df(syn_id="syn21448395",
                                  source="csv",
                                  syn=syn)
round2 = round2.rename(columns = {"Unnamed: 0": "GeneName"})

Some peptide names are mis-matched between files. This mapping comes from the peptide info spreadsheet. Some not listed here are still mis-matched after the re-map but we don't have any more information.

In [3]:
remaps = {"bA": "APP_3", "bA38": "APP_5", "bA42": "APP_6", "tau_AT8_s202": "MAPT_2", 
              "HLA_B_2": "HLA-B_2", "HLA_B_5": "HLA-B_5"}
        
round1["GeneName"] = round1["GeneName"].replace(remaps)
round2["GeneName"] = round2["GeneName"].replace(remaps)

Adjust p-values for multiple testing *before* combining the data frames, as they were run as separate ANOVAs. 

In [4]:
def adjust_pvals(df):
    df['PVal'] = df['Control-AD']
    (_, adjP, _, _) = mt.multipletests(pvals=df['PVal'], alpha = 0.05, method='fdr_bh')
    df['Cor_PVal'] = adjP
    return df
    
round1 = adjust_pvals(round1)
round2 = adjust_pvals(round2)

Combine round 1 and round 2. Only 4 genes overlap between them, but round 2 needs a little extra cleaning. Unlike round 1, which selected a single peptide variant per gene, they ran differential expression on individual peptides and included all peptide variants in the round 2 table. Variants are denoted by \<Gene\>_\<#\>, e.g. BIN1_2.

For Agora, we will pick the variant/gene with the smallest corrected p-value between the two rounds. 

In [5]:
diffexp = pd.concat([round1, round2], axis = 0, ignore_index = True)
diffexp["GeneName"] = diffexp["GeneName"].str.replace(r"_.*", "", regex = True)

rows = diffexp.groupby("GeneName").agg({"Cor_PVal": "idxmin"})
diffexp = diffexp.loc[rows["Cor_PVal"].sort_values()]

diffexp

Unnamed: 0,GeneName,F-Value,Pr(>F),AsymAD-AD,Control-AD,Control-AsymAD,diff AsymAD-AD,diff Control-AD,diff Control-AsymAD,PVal,Cor_PVal
0,AK4,10.797378,2.291591e-05,0.011495,0.000026,0.090434,-0.042417,-0.081583,-0.039166,0.000026,0.000161
1,ANKRD40,0.372186,6.893199e-01,0.679559,0.867997,0.984687,0.028536,0.021340,-0.007196,0.867997,0.999697
2,AP2A2,4.318748,1.356259e-02,0.944784,0.026749,0.015119,-0.003595,0.035770,0.039365,0.026749,0.061800
3,APOE,7.220149,7.699608e-04,0.904732,0.002536,0.000962,0.012679,-0.122649,-0.135328,0.002536,0.008736
5,APP,635.642124,1.761202e-179,0.000000,0.000000,0.000000,-1.079320,-5.757035,-4.677715,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
215,TPRG1L,7.630556,5.139289e-04,0.007615,0.001742,0.586144,-0.065940,-0.093259,-0.027320,0.001742,0.007593
216,UGT8,1.881349,1.529181e-01,0.167573,0.377589,0.991187,-0.123945,-0.112960,0.010985,0.377589,0.580954
218,UQCR10,4.399957,1.251457e-02,0.092170,0.017405,0.565345,0.029624,0.047856,0.018233,0.017405,0.049458
220,UQCRC2,7.667740,4.954171e-04,0.014425,0.001008,0.396875,0.041283,0.065476,0.024193,0.001008,0.004763


# Step 2: Get peptide IDs and Ensembl IDs for each gene
Peptide info was provided by the lab via email. 

In [6]:
peptide_info = pd.read_excel('../../input/srm/Supplementary Tables R2.xlsx', sheet_name=0)
peptide_info = peptide_info.dropna(subset = ['UniProtAC'])

Get mapping between Uniprot AC and Ensembl ID

In [7]:
uniprot_ids = peptide_info['UniProtAC'].drop_duplicates()
request = IdMappingClient.submit(source='UniProtKB_AC-ID', dest='Ensembl', ids=uniprot_ids)

found = False
while not found:
    time.sleep(2)
    try:
        ensembl_ids = list(request.each_result())
        found = True
    except: 
        print("Waiting for response from UniProt...")

Rename columns to match, and get rid of the Ensembl version at the end of each ID

In [8]:
ensembl_ids = pd.DataFrame(ensembl_ids).rename(columns = {'from': 'UniProtAC', 'to': 'ENSG'})
ensembl_ids['ENSG'] = ensembl_ids['ENSG'].str.extract(r'(\w+)')

Add the IDs to the peptide info data frame

In [9]:
peptide_info = peptide_info.merge(ensembl_ids, how = 'left', on = 'UniProtAC')

Remove unneeded columns from peptide_info and rename remaining columns to match intended output

In [10]:
peptide_info = peptide_info[['UniProtAC', 'Gene', 'ENSG']].drop_duplicates()
peptide_info = peptide_info.rename(columns = {"UniProtAC": "UniProtID", "Gene": "GeneName"})
peptide_info

Unnamed: 0,UniProtID,GeneName,ENSG
0,Q9UKV3,ACIN1,ENSG00000100813
2,Q6AI12,ANKRD40,ENSG00000154945
4,O94973,AP2A2,ENSG00000183020
6,P02649,APOE,ENSG00000130203
8,P05067,APP,ENSG00000142192
...,...,...,...
510,P45880,VDAC2,ENSG00000165637
512,O95619,YEATS4,ENSG00000127337
514,P49750,YLPM1,ENSG00000119596
516,Q9H0M4,ZCWPW1,ENSG00000078487


# Step 4: Restructure round 1 & 2 data to match other proteomics data
Data must have the same columns and format as https://www.synapse.org/#!Synapse:syn18689335

First, create the same columns that the other proteomics data has:

In [11]:
# Make log fold-change values negative/opposite what's in the original data because the comparison in the original data
# is control - AD rather than AD - control, which is what the other proteomics datasets use. 
diffexp['Log2_FC'] = -diffexp['diff Control-AD']
diffexp['PVal'] = -diffexp['Control-AD']
diffexp['Tissue'] = 'DLPFC'

# Fake values since we don't have this info from the file. 
# CI_Upr and CI_Lwr are not used in Agora but the transform doesn't allow NA values. 
# We could get these values by re-calculating the ANOVAs if we need to. 
diffexp['CI_Upr'] = diffexp['Log2_FC'] 
diffexp['CI_Lwr'] = diffexp['Log2_FC']

Merge in the peptide info to get Ensembl IDs and UniProt IDs. Use an inner join to get rid of the "averaged" peptide rows that are in the round 2 data. 

In [12]:
diffexp_final = diffexp.merge(peptide_info, how = 'inner', on = 'GeneName')

As in the other proteomics data, the unique ID is "\<GeneName\>|\<UniProtID\>"

In [13]:
diffexp_final["UniqID"] = diffexp_final["GeneName"] + "|" + diffexp_final["UniProtID"]

Put the necessary columns in the correct order

In [14]:
diffexp_final = diffexp_final[["UniqID", "GeneName", "UniProtID", "ENSG", "Tissue", 
                               "Log2_FC", "CI_Upr", "CI_Lwr", "PVal", "Cor_PVal"]]

In [15]:
diffexp_final

Unnamed: 0,UniqID,GeneName,UniProtID,ENSG,Tissue,Log2_FC,CI_Upr,CI_Lwr,PVal,Cor_PVal
0,AK4|P27144,AK4,P27144,ENSG00000162433,DLPFC,0.081583,0.081583,0.081583,-0.000026,0.000161
1,ANKRD40|Q6AI12,ANKRD40,Q6AI12,ENSG00000154945,DLPFC,-0.021340,-0.021340,-0.021340,-0.867997,0.999697
2,AP2A2|O94973,AP2A2,O94973,ENSG00000183020,DLPFC,-0.035770,-0.035770,-0.035770,-0.026749,0.061800
3,APOE|P02649,APOE,P02649,ENSG00000130203,DLPFC,0.122649,0.122649,0.122649,-0.002536,0.008736
4,APP|P05067,APP,P05067,ENSG00000142192,DLPFC,5.757035,5.757035,5.757035,-0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...
132,TPRG1L|Q5T0D9,TPRG1L,Q5T0D9,ENSG00000158109,DLPFC,0.093259,0.093259,0.093259,-0.001742,0.007593
133,UGT8|Q16880,UGT8,Q16880,ENSG00000174607,DLPFC,0.112960,0.112960,0.112960,-0.377589,0.580954
134,UQCR10|Q9UDW1,UQCR10,Q9UDW1,ENSG00000184076,DLPFC,-0.047856,-0.047856,-0.047856,-0.017405,0.049458
135,UQCRC2|P22695,UQCRC2,P22695,ENSG00000140740,DLPFC,-0.065476,-0.065476,-0.065476,-0.001008,0.004763


Write to csv and upload to Synapse.

In [None]:
diffexp_final.to_csv('../../output/SRM_diff_expr.csv', index = False)
file = synapseclient.File('../../output/SRM_diff_expr.csv', parent = 'syn7525089') # syn7525089 is the Agora Raw Data folder
file = syn.store(file, used = ['syn21448389', 'syn21448395'], forceVersion=False)