# AlphaPept API

This notebook shows how to call the AlphaPept LFQ implementation on MaxQuant output files.


## Installation

AlphaPept can be installed with `pip install alphapept`. Note that this installs only the Python package, not the GUI version. Raw data import works only on Windows, unless you are on a Mac with Mono installed. See the GitHub page for more [information](https://github.com/MannLabs/alphapept) about the different installation modes.

This notebook additionally needs `wget` to download the external files.


## Example file

The example file is from PXD006109 and was run with MaxQuant 2.0.3.1. For LFQ the minimum ratio was set to two.


## Requirements

To run this notebook, install `wget` to download the example files. `pandas` and `matplotlib` are also required.

In [1]:
import os 
import wget
import pandas as pd

external_files = {}

external_files['evidence.txt'] = 'https://datashare.biochem.mpg.de/s/KAbRdiHX7rWlTAB/download'
external_files['proteinGroups.txt'] = 'https://datashare.biochem.mpg.de/s/XNbraDaneZlztzg/download'

TEMP_DIR = './lfq_demo/'

for file in external_files:
    target = os.path.join(TEMP_DIR, file)
    os.makedirs(TEMP_DIR, exist_ok=True)
    if not os.path.isfile(target):
        print(f'Downloading {file} to {target}')
        wget.download(external_files[file], target)

In [2]:
import alphapept

print(alphapept.__version__)

0.4.9


In [3]:
from alphapept.quantification import protein_profile_parallel_mq

## Inspecting function header
Two parameters are relevant: `minimum_occurence` for the first step of LFQ (delayed normalization) and `minimum_ratios` for the second step (extracting optimal protein ratios).

When calling the function, we set `minimum_ratios` to two so that the result is easier to compare to the MaxQuant output. Note that the default value for AlphaPept is 1, which typically leads to fewer proteins being discarded.
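To make the effect of a minimum ratio threshold more concrete, here is a toy sketch of the general idea: a protein ratio between two samples is only accepted when enough peptides are shared between them. The function `pairwise_protein_ratio` and the peptide names are hypothetical and written for illustration only; this is not AlphaPept's implementation.

```python
from statistics import median


def pairwise_protein_ratio(sample_a, sample_b, minimum_ratios=1):
    """Toy illustration (hypothetical helper, not part of AlphaPept).

    sample_a, sample_b: dicts mapping peptide name -> intensity for one protein.
    Returns the median peptide ratio, or None if fewer than
    `minimum_ratios` peptides are quantified in both samples.
    """
    shared = [p for p in sample_a
              if p in sample_b and sample_a[p] > 0 and sample_b[p] > 0]
    ratios = [sample_a[p] / sample_b[p] for p in shared]
    if len(ratios) < minimum_ratios:
        return None  # not enough evidence to compare the two samples
    return median(ratios)


# Made-up intensities: only one peptide is seen in both samples
a = {'PEPTIDEA': 200.0, 'PEPTIDEB': 90.0}
b = {'PEPTIDEA': 100.0, 'PEPTIDEC': 50.0}

ratio_default = pairwise_protein_ratio(a, b, minimum_ratios=1)  # accepted
ratio_strict = pairwise_protein_ratio(a, b, minimum_ratios=2)   # rejected
```

With a threshold of 1 the single shared peptide is enough, whereas a threshold of 2 leaves this sample pair without a ratio. This is why a lower threshold tends to retain more proteins.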

In [4]:
help(protein_profile_parallel_mq)

Help on function protein_profile_parallel_mq in module alphapept.quantification:

protein_profile_parallel_mq(evidence_path: str, protein_groups_path: str, minimum_ratios: int = 1, minimum_occurence: bool = None, delayed: bool = True, callback=None) -> pandas.core.frame.DataFrame
    Derives protein LFQ intensities from Maxquant quantified features.
    
    Args:
        evidence_path (str): path to the Maxquant standard output table evidence.txt.
        protein_groups_path (str): path to the Maxquant standard output table proteinGroups.txt.
        minimum_ratios (int): minimum ratios (LFQ parameter)
        minimum_occurence (int): minimum occurence (LFQ parameter)
        delayed (bool): toggle for delayed normalization (on/off)
        callback ([type], optional): [description]. Defaults to None.
    
    Raises:
        FileNotFoundError: if Maxquant files cannot be found.
    
    Returns:
        pd.DataFrame: table containing the LFQ intensities of each protein in each sample

In [None]:
%%time 

# This may take a while; the callback argument could be used to display progress

evidence_path = os.path.join(TEMP_DIR, 'evidence.txt')
protein_group_path = os.path.join(TEMP_DIR, 'proteinGroups.txt')

pt = protein_profile_parallel_mq(evidence_path, protein_group_path, minimum_ratios=2)
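The docstring does not describe what the `callback` receives. The sketch below assumes it is called with a single float between 0 and 1 that indicates progress; this is an assumption and should be checked against the AlphaPept source. The name `print_progress` is hypothetical.

```python
def print_progress(progress):
    """Hypothetical progress callback.

    Assumes `progress` is a fraction between 0 and 1 (an assumption,
    not confirmed by the docstring shown above).
    """
    # Clamp to [0, 1] so unexpected values do not produce odd output
    percent = max(0.0, min(1.0, float(progress))) * 100
    message = f'LFQ progress: {percent:.1f} %'
    # Overwrite the same line instead of printing one line per update
    print(message, end='\r')
    return message
```

Under that assumption, it could be passed as `protein_profile_parallel_mq(evidence_path, protein_group_path, minimum_ratios=2, callback=print_progress)`.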

In [None]:
display(pt.head())

## Comparing MaxQuant LFQ to AlphaPept LFQ

The example dataset consists of six files from two conditions: (Shotgun_02-01_1, Shotgun_02-01_2, Shotgun_02-01_3) and (Shotgun_12-01_1, Shotgun_12-01_2, Shotgun_12-01_3).
Intensities are available both with and without LFQ.

For comparison, we plot the distribution of the coefficient of variation (CV) before and after LFQ for one condition, for both the MaxQuant output and the AlphaPept LFQ.
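As a small worked example of the CV used here (standard deviation divided by the mean across replicates, in percent), the sketch below applies it to a made-up table. The helper `coefficient_of_variation` and the column names `rep_1` to `rep_3` are hypothetical; zeros are treated as missing values, in the same way the MaxQuant table is handled in this notebook.

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(table, columns):
    """Row-wise CV in percent (hypothetical helper for illustration)."""
    # Treat zero intensities as missing so they do not distort the CV
    values = table[columns].replace(0, np.nan)
    # pandas skips NaN values and uses the sample standard deviation (ddof=1)
    return values.std(axis=1) / values.mean(axis=1) * 100


# Made-up intensities for three proteins in three replicates
toy = pd.DataFrame(
    {
        'rep_1': [1.0, 10.0, 4.0],
        'rep_2': [2.0, 10.0, 0.0],
        'rep_3': [3.0, 10.0, 6.0],
    },
    index=['protein_a', 'protein_b', 'protein_c'],
)

cv = coefficient_of_variation(toy, ['rep_1', 'rep_2', 'rep_3'])
print(cv.round(2))
```

A protein with identical intensities in all replicates has a CV of 0 %, so a distribution shifted towards lower values indicates more reproducible quantification.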

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
columns = ['Shotgun_02-01_1','Shotgun_02-01_2', 'Shotgun_02-01_3']

In [None]:
bins = np.linspace(0, 100, 100)

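# Read the MaxQuant protein groups from the download directory
df = pd.read_csv(os.path.join(TEMP_DIR, 'proteinGroups.txt'), sep='\t')
# Treat zero intensities as missing values
df = df.replace(0, np.nan)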

In [None]:
pt

In [None]:
plt.figure(figsize=(10,5))

# AlphaPept table: raw intensities of the three replicates
case = 'AlphaPept on MaxQuant w/o LFQ'
cv = pt[columns].std(axis=1) / pt[columns].mean(axis=1) * 100
plt.hist(cv, bins=bins, alpha=0.5, label=case)
print(f'Mean CV for {case} {np.mean(cv):.2f} % - n: {(~cv.isna()).sum():,}')

# AlphaPept table: LFQ intensities carry the '_LFQ' suffix
case = 'AlphaPept on MaxQuant w/ LFQ'
lfq_columns = [_+'_LFQ' for _ in columns]
cv = pt[lfq_columns].std(axis=1) / pt[lfq_columns].mean(axis=1) * 100
plt.hist(cv, bins=bins, alpha=0.5, label=case)
print(f'Mean CV for {case} {np.mean(cv):.2f} % - n: {(~cv.isna()).sum():,}')

# MaxQuant proteinGroups.txt: raw intensity columns
columns_mq = ['Intensity Shotgun_02-01_1', 'Intensity Shotgun_02-01_2','Intensity Shotgun_02-01_3']

case = 'MaxQuant w/o LFQ'
cv = df[columns_mq].std(axis=1) / df[columns_mq].mean(axis=1) * 100
plt.hist(cv, bins=bins, alpha=0.5, label=case)
print(f'Mean CV for {case} {np.mean(cv):.2f} % - n: {(~cv.isna()).sum():,}')

# MaxQuant proteinGroups.txt: 'Intensity X' becomes 'LFQ intensity X'
case = 'MaxQuant w/ LFQ'
lfq_columns_mq = ['LFQ i'+_[1:] for _ in columns_mq]
cv = df[lfq_columns_mq].std(axis=1) / df[lfq_columns_mq].mean(axis=1) * 100
plt.hist(cv, bins=bins, alpha=0.5, label=case)
print(f'Mean CV for {case} {np.mean(cv):.2f} % - n: {(~cv.isna()).sum():,}')

plt.legend()
plt.show()

Observation: The CV decreases when applying the LFQ optimization.
With the LFQ optimization, the mean CV for AlphaPept is slightly lower (11.25% vs. 12.18%) while covering slightly more proteins (3,973 vs. 3,801).