# Protein Group readers

In [1]:
%reload_ext autoreload 
%autoreload 2 

In [2]:
# Helper packages
import io
from copy import copy
from typing import Literal, Optional

import anndata as ad
import numpy as np
import pandas as pd

# alphabase
from alphabase.pg_reader import pg_reader_provider
from alphabase.tools.data_downloader import DataShareDownloader



## Background 

The `alphabase.pg_reader` module provides a unifying interface **to read protein group (PG) tables** from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of `pandas.DataFrame`s, regardless of the input file format.

### Introduction to protein group matrices

Protein group matrices are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide-spectrum matches (PSMs, see [PSM-reader tutorial](../nbs/psm_readers.ipynb)), they aggregate peptide-level evidence to infer protein-level abundances. The resulting protein group table is a structured matrix that maps protein groups (features) to samples (observations), with estimated intensity values as entries.


A minimal protein group table could look something like this:

| proteins | sample_1 | sample_2 | sample_3 |
|----------|----------|----------|----------|
| P12345   | 1000.5   | 892.3    | 1150.7   |
| Q67890   | 2500.1   | 2780.9   | 2340.2   |



> 💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into *protein groups* as features.


In this example, protein P12345 has quantified intensities of 1000.5, 892.3, and 1150.7 in samples 1, 2, and 3 respectively.
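As a small illustration, the minimal table above can be written down as a `pandas.DataFrame`. This is only a sketch with the toy values from the table; the variable name `pg_table` is chosen here for illustration.

```python
import pandas as pd

# Toy protein group table mirroring the minimal example above
pg_table = pd.DataFrame(
    {
        "proteins": ["P12345", "Q67890"],
        "sample_1": [1000.5, 2500.1],
        "sample_2": [892.3, 2780.9],
        "sample_3": [1150.7, 2340.2],
    }
).set_index("proteins")

# Rows are protein groups (features), columns are samples (observations)
print(pg_table.shape)
print(pg_table.loc["P12345", "sample_2"])
```

Setting the protein identifiers as the index leaves only intensity values in the body of the frame, so it can be treated directly as a features × samples matrix.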

### Search engine outputs

In practice, protein group tables are considerably more complex: they carry additional feature-level information about the proteins (e.g., gene names, descriptions) and often several intensity types per sample (e.g., raw, LFQ, iBAQ). This additional information can be valuable for downstream analyses, but it also makes the tables harder to work with, because the exact column names and formats differ between search engines, versions, and file formats.

#### Unifying properties 

`alphabase` aligns the column names to a unified vocabulary, facilitating cross-engine comparisons. We can categorize protein group tables into several common types:

**Type 1 — Minimal**: A basic features × samples matrix. Only intensity values are stored, with sample names as columns and protein groups as the index. *Example*: AlphaDIA.

**Type 2 — Multiple Intensity Fields**: A wide matrix where each sample may appear multiple times with different quantification types (e.g., `SampleA_LFQ`, `SampleA_raw`). *Example*: AlphaPept.

**Type 3 — Feature Metadata**: A features × samples matrix with one intensity value per sample, plus additional feature-level metadata columns (e.g., gene names, descriptions). *Example*: DIA-NN.

**Type 4 — Combined**: A composite structure including both multiple intensity fields (Type 2) and feature-level metadata (Type 3). *Examples*: Spectronaut, MZTab, MaxQuant.


## Code | Read and parse protein group tables

The alphabase `pg_reader` module lets users parse protein group reports from the most common search engines into a dataframe with a single line of code via its `alphabase.pg_reader.pg_reader_provider` factory.


All readers return a standardized pandas DataFrame with:
- **Features as index**: Protein identifiers and metadata in `pandas.DataFrame.index`
- **Samples as columns**: Sample/run identifiers in `pandas.DataFrame.columns`
- **Intensity values**: Protein quantification data as `pandas.DataFrame.values`
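
To make this layout concrete, here is a minimal sketch that builds a toy frame by hand and pulls out its three parts. The identifiers, the gene names `GENE1` and `GENE2`, and all values are made up for illustration; the frame is not produced by an alphabase reader.

```python
import pandas as pd

# Toy frame in the layout described above (all values are illustrative)
index = pd.MultiIndex.from_tuples(
    [("P12345", "GENE1"), ("Q67890", "GENE2")],
    names=["uniprot_ids", "genes"],
)
report = pd.DataFrame(
    [[1000.5, 892.3], [2500.1, 2780.9]],
    index=index,
    columns=["sample_1", "sample_2"],
)

feature_metadata = report.index.to_frame(index=False)  # one row per protein group
sample_names = list(report.columns)  # one entry per sample
intensities = report.to_numpy()  # features x samples matrix

print(feature_metadata.columns.tolist(), sample_names, intensities.shape)
```

Keeping the metadata in the index means that numeric operations on the frame only ever touch intensity values.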



The readers **support different quantification methods** by matching regular expression patterns in the output tables and the **retrieval of desired metadata columns to standardized names**.


The unified alphabase format enables seamless comparison and analysis across different search engines, facilitating:
- Method comparison studies
- Data integration workflows
- Standardized downstream analysis pipelines

### Available readers 


`alphabase.pg_reader.pg_reader_provider` has registered reader classes for the most common proteomics search engines. A list of implemented readers can be accessed via its `reader_dict` property:

In [3]:
all_registered_readers = pg_reader_provider.reader_dict.keys()

# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)

Registered readers in alphabase:
	- alphadia
	- alphapept
	- diann
	- fragpipe
	- maxquant
	- mztab
	- spectronaut
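
For readers unfamiliar with the provider pattern, the sketch below shows one way such a name-to-class registry could work. `ToyReaderProvider` and `ToyAlphaDIAReader` are hypothetical classes written for this illustration; they are not alphabase's implementation, and the case-insensitive lookup is an assumption of the sketch.

```python
from typing import Callable, Dict


class ToyReaderProvider:
    """Hypothetical registry mapping reader names to reader classes."""

    def __init__(self) -> None:
        self.reader_dict: Dict[str, Callable[..., object]] = {}

    def register_reader(self, name: str, reader_class: Callable[..., object]) -> None:
        # Store under a lower-cased key so that lookups ignore case
        self.reader_dict[name.lower()] = reader_class

    def get_reader(self, name: str, **kwargs) -> object:
        try:
            reader_class = self.reader_dict[name.lower()]
        except KeyError:
            options = ", ".join(sorted(self.reader_dict))
            raise KeyError(f"Unknown reader '{name}', choose one of: {options}") from None
        # Keyword arguments are forwarded to the reader's constructor
        return reader_class(**kwargs)


class ToyAlphaDIAReader:
    """Hypothetical reader that only stores its configuration."""

    def __init__(self, measurement_regex: str = "raw") -> None:
        self.measurement_regex = measurement_regex


provider = ToyReaderProvider()
provider.register_reader("alphadia", ToyAlphaDIAReader)

toy_reader = provider.get_reader("alphadia", measurement_regex="lfq")
print(type(toy_reader).__name__, toy_reader.measurement_regex)
```

The benefit of this design is that calling code only needs the search engine's name as a string, so switching between engines does not require changing imports.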


### Interact with the reader provider

In [None]:
def get_pg_matrix_example(output_dir: Optional[str] = None, search_engine: Literal["alphadia", "alphapept", "spectronaut"] = "alphadia") -> str:
    """Get example data for the tutorial

    The function downloads the example file and stores it in `output_dir`
    or, if no directory is given, in the system temporary directory.

    Parameters
    ----------
    output_dir
        Output directory. If `None`, the system temporary directory is used.
    search_engine
        Search engine whose example report should be downloaded.

    Returns
    -------
    Path to the downloaded file
    """
    EXAMPLE_URLS = {
        "alphadia": "https://datashare.biochem.mpg.de/s/4AtCZassaUzRR8K",
        "alphapept": "https://datashare.biochem.mpg.de/s/6G6KHJqwcRPQiOO",
        "spectronaut": "https://datashare.biochem.mpg.de/s/2u7U03wvmQDVT4y",
    }

    if search_engine not in EXAMPLE_URLS:
        raise KeyError(f"{search_engine} not found, select one of {', '.join(EXAMPLE_URLS.keys())}")

    if output_dir is None:
        # `tempfile.tempdir` is `None` until `gettempdir()` has been called,
        # so query the directory through the function instead
        from tempfile import gettempdir

        output_dir = gettempdir()

    downloader = DataShareDownloader(url=EXAMPLE_URLS[search_engine], output_dir=output_dir)

    return downloader.download()

### Example 1 - AlphaDIA

We demonstrate how to interact with protein group tables via alphabase based on a minimal example output of the AlphaDIA search engine. 

First, let's get some minimal example data for the AlphaDIA output. The example data represents DIA runs of 6 HeLa samples on the Orbitrap Astral. 

You can see that the output data contains the feature names in the column `pg` and the computed protein group intensities per sample in the remaining columns.


In [5]:
alphadia_example_path = get_pg_matrix_example(search_engine="alphadia")

# Parse with pandas for visualization purposes
pd.read_csv(alphadia_example_path, sep="\t")

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia1.10.4__pg_matrix.tsv already exists (0.8597145080566406 MB)


Unnamed: 0,pg,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
0,A0A024RBG1,5.597816e+05,6.285112e+05,0.000000e+00,3.153867e+05,2.753702e+05,4.505648e+05
1,A0A024RBG1;Q9NZJ9,1.331061e+06,1.400360e+06,1.551987e+06,1.606095e+06,1.464152e+06,1.397026e+06
2,A0A075B759;A0A075B767;P62937,2.024742e+08,8.552202e+06,1.837425e+08,1.674874e+08,1.768245e+08,1.595220e+08
3,A0A096LP01,6.355092e+05,4.589410e+05,4.184495e+05,4.032932e+05,2.317467e+05,2.731363e+05
4,A0A096LP49,1.777069e+05,1.387537e+05,2.513601e+05,1.296699e+05,1.276095e+05,1.623200e+05
...,...,...,...,...,...,...,...
9359,Q9Y6X3,3.898963e+05,4.353048e+05,4.150456e+05,5.069992e+05,4.195746e+05,3.675962e+05
9360,Q9Y6X6,1.869312e+05,0.000000e+00,0.000000e+00,2.304623e+05,2.421623e+05,0.000000e+00
9361,Q9Y6X9,3.362758e+06,3.395221e+06,3.541975e+06,2.704210e+06,3.141519e+06,2.995787e+06
9362,Q9Y6Y0,5.924220e+06,6.183842e+06,6.190598e+06,6.025724e+06,5.920595e+06,6.754984e+06


Then use the `pg_reader_provider.get_reader` method to get the AlphaDIA protein group reader. Its `import_file` method reads the file and returns the result directly as a `pandas.DataFrame`. 

Note how the dataframe values only contain the actual measurements and how the `pg` column was mapped to the standardized name `uniprot_ids`.

In [6]:
alphadia_reader = pg_reader_provider.get_reader('alphadia')

# Import the file or a bytestream
alphadia_report = alphadia_reader.import_file(alphadia_example_path)

# Display the result
alphadia_report

Unnamed: 0_level_0,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02,20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02,20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
uniprot_ids,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A0A024RBG1,5.597816e+05,6.285112e+05,0.000000e+00,3.153867e+05,2.753702e+05,4.505648e+05
A0A024RBG1;Q9NZJ9,1.331061e+06,1.400360e+06,1.551987e+06,1.606095e+06,1.464152e+06,1.397026e+06
A0A075B759;A0A075B767;P62937,2.024742e+08,8.552202e+06,1.837425e+08,1.674874e+08,1.768245e+08,1.595220e+08
A0A096LP01,6.355092e+05,4.589410e+05,4.184495e+05,4.032932e+05,2.317467e+05,2.731363e+05
A0A096LP49,1.777069e+05,1.387537e+05,2.513601e+05,1.296699e+05,1.276095e+05,1.623200e+05
...,...,...,...,...,...,...
Q9Y6X3,3.898963e+05,4.353048e+05,4.150456e+05,5.069992e+05,4.195746e+05,3.675962e+05
Q9Y6X6,1.869312e+05,0.000000e+00,0.000000e+00,2.304623e+05,2.421623e+05,0.000000e+00
Q9Y6X9,3.362758e+06,3.395221e+06,3.541975e+06,2.704210e+06,3.141519e+06,2.995787e+06
Q9Y6Y0,5.924220e+06,6.183842e+06,6.190598e+06,6.025724e+06,5.920595e+06,6.754984e+06


### Example 2 - AlphaPept with different quantification methods

AlphaPept is a DDA search engine that returns multiple quantification methods (raw intensities, LFQ) in its protein group report. We can use the reader to extract these different types of measurements by specifying the `measurement_regex` parameter.

AlphaPept reports can be in either `.hdf` or `.tsv` format. The protein group readers support all common data formats out of the box: text-based formats such as `.tsv` and `.csv`, and binary formats such as `.hdf` (via the extra `alphabase[hdf]` dependency) and `.parquet`. 

In [7]:
# Download example AlphaPept data with multiple quantification types
alphapept_example_path = get_pg_matrix_example(search_engine="alphapept")
pd.read_csv(alphapept_example_path)

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphapept0.5.3__pg_matrix_csv.csv already exists (0.33005523681640625 MB)


Unnamed: 0.1,Unnamed: 0,A_LFQ,B_LFQ,A,B
0,sp|P36578|RL4_HUMAN,4.669329e+08,4.844083e+08,4.452735e+08,5.060678e+08
1,sp|Q9P258|RCC2_HUMAN,4.074842e+08,4.138132e+08,4.177856e+08,4.035118e+08
2,sp|O60518|RNBP6_HUMAN,4.960386e+06,2.022553e+06,1.295621e+06,5.687318e+06
3,sp|P55036|PSMD4_HUMAN,1.157420e+08,1.123571e+08,1.130880e+08,1.150112e+08
4,sp|A1X283|SPD2B_HUMAN,1.247112e+07,1.180582e+07,1.380177e+07,1.047516e+07
...,...,...,...,...,...
3776,sp|Q14966|ZN638_HUMAN,,1.139844e+06,,1.139844e+06
3777,sp|P84095|RHOG_HUMAN,,9.466796e+05,,9.466796e+05
3778,sp|Q99766|ATP5S_HUMAN,,3.577785e+05,,3.577785e+05
3779,"sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN",,9.237994e+05,,9.237994e+05


#### Default - raw intensities
Let's first use the default option, which imports raw intensities. You can see that the reader automatically extracts only the raw intensity columns and that it parses the UniProt FASTA headers into separate, standardized index fields.

In [8]:
# Default: raw intensities
alphapept_reader_default = pg_reader_provider.get_reader('alphapept')
alphapept_reader_default.import_file(alphapept_example_path)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,A,B
proteins,uniprot_ids,ensembl_ids,source_db,is_decoy,Unnamed: 5_level_1,Unnamed: 6_level_1
RL4_HUMAN,P36578,na,sp,False,445273477.0318756,506067774.6891948
RCC2_HUMAN,Q9P258,na,sp,False,417785611.6324583,403511752.8857417
RNBP6_HUMAN,O60518,na,sp,False,1295621.2466679448,5687318.493374016
PSMD4_HUMAN,P55036,na,sp,False,113087994.44403341,115011156.7335174
SPD2B_HUMAN,A1X283,na,sp,False,13801771.733223092,10475164.42857083
...,...,...,...,...,...,...
ZN638_HUMAN,Q14966,na,sp,False,,1139843.6453892316
RHOG_HUMAN,P84095,na,sp,False,,946679.6466570131
ATP5S_HUMAN,Q99766,na,sp,False,,357778.52002529387
TIM23_HUMAN;TI23B_HUMAN,O14925;Q5SRD1,na;na,sp;sp,False,,923799.3856913601
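
The header parsing visible in this output can be approximated with a few lines of plain Python. The function `parse_fasta_headers` below is a hypothetical helper written for this tutorial, not alphabase's implementation; it assumes headers of the form `db|accession|name`, with multiple proteins in a group separated by commas as in the raw AlphaPept file.

```python
from typing import Dict, List


def parse_fasta_headers(entry: str) -> Dict[str, str]:
    """Split 'db|accession|name' headers; grouped headers are comma-separated."""
    source_db: List[str] = []
    uniprot_ids: List[str] = []
    proteins: List[str] = []

    for header in entry.split(","):
        db, accession, name = header.strip().split("|")
        source_db.append(db)
        uniprot_ids.append(accession)
        proteins.append(name)

    # Members of one protein group are joined with ';' as in the output above
    return {
        "proteins": ";".join(proteins),
        "uniprot_ids": ";".join(uniprot_ids),
        "source_db": ";".join(source_db),
    }


print(parse_fasta_headers("sp|P36578|RL4_HUMAN"))
print(parse_fasta_headers("sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN"))
```

The sketch covers only the three fields that can be read directly from the header; fields such as `ensembl_ids` and `is_decoy` in the reader output are not reproduced here.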


#### LFQ runs
To extract the LFQ intensities instead, select the pre-configured `"lfq"` pattern via the `measurement_regex` parameter:

In [9]:
# LFQ intensities
alphapept_reader_lfq = pg_reader_provider.get_reader('alphapept', measurement_regex="lfq")
alphapept_reader_lfq.import_file(alphapept_example_path)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,A_LFQ,B_LFQ
proteins,uniprot_ids,ensembl_ids,source_db,is_decoy,Unnamed: 5_level_1,Unnamed: 6_level_1
RL4_HUMAN,P36578,na,sp,False,466932936.27537036,484408315.44570005
RCC2_HUMAN,Q9P258,na,sp,False,407484183.9302226,413813180.5879775
RNBP6_HUMAN,O60518,na,sp,False,4960386.374516514,2022553.3655254466
PSMD4_HUMAN,P55036,na,sp,False,115742020.94987468,112357130.22767611
SPD2B_HUMAN,A1X283,na,sp,False,12471120.728621317,11805815.433172602
...,...,...,...,...,...,...
ZN638_HUMAN,Q14966,na,sp,False,,1139843.6453892316
RHOG_HUMAN,P84095,na,sp,False,,946679.6466570131
ATP5S_HUMAN,Q99766,na,sp,False,,357778.52002529387
TIM23_HUMAN;TI23B_HUMAN,O14925;Q5SRD1,na;na,sp;sp,False,,923799.3856913601


#### Explore all pre-configured patterns

You can also pass any valid regular expression as a custom pattern. The `get_preconfigured_regex` method lists all pre-configured patterns of a reader:

In [10]:
alphapept_reader_default.get_preconfigured_regex()

{'raw': '^.*(?<!_LFQ)$', 'lfq': '_LFQ$'}
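
To see how these two patterns separate the columns, the sketch below applies them to the intensity column names of the AlphaPept example with Python's `re` module. Using `re.search` on each column name is an assumption made for this illustration; how alphabase applies the patterns internally is not shown here.

```python
import re

# Patterns as returned by `get_preconfigured_regex` above
preconfigured = {"raw": r"^.*(?<!_LFQ)$", "lfq": r"_LFQ$"}

# Intensity columns of the AlphaPept example report
columns = ["A_LFQ", "B_LFQ", "A", "B"]

selected = {
    name: [col for col in columns if re.search(pattern, col)]
    for name, pattern in preconfigured.items()
}
print(selected)
```

The `raw` pattern uses a negative lookbehind, `(?<!_LFQ)`, so it accepts any column name that does not end in `_LFQ`, while the `lfq` pattern accepts exactly those that do.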

### Example 3 - Spectronaut reports

Next, we explore how users can extract non-standard columns to a unified vocabulary based on a Spectronaut PG report. Spectronaut allows users to flexibly export custom feature-level metadata. `alphabase` allows users to extract this metadata by adding new columns to the streamlined column mapping.

In [11]:
spectronaut_example_path = get_pg_matrix_example(search_engine="spectronaut")

# Parse with pandas for visualization purposes
pd.read_csv(spectronaut_example_path, sep="\t")

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv does not yet exist
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv successfully downloaded (27.531264305114746 MB)


Unnamed: 0,PG.Genes,PG.Organisms,PG.ProteinNames,PTM.CollapseKey,PTM.FlankingRegion,PTM.ModificationTitle,PTM.Multiplicity,PTM.ProteinId,PTM.SiteAA,PTM.SiteLocation,...,[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity,[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity,[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity,[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity,[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity,[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity,[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity,[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity,[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity,[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
0,TRBV19;TRB,Homo sapiens,TVB19_HUMAN;TRBR1_HUMAN,A0A075B6N1_S86_M3,IAEGYSVSREKKESF,Phospho (STY),3,A0A075B6N1,S,86,...,69968.8359375,103632.6015625,90488.9296875,113429.859375,96970.2734375,61069.171875,99673.2734375,109199.875,112307.4765625,112374.84375
1,TRBV19;TRB,Homo sapiens,TVB19_HUMAN;TRBR1_HUMAN,A0A075B6N1_S84_M3,GDIAEGYSVSREKKE,Phospho (STY),3,A0A075B6N1,S,84,...,69968.8359375,103632.6015625,90488.9296875,113429.859375,96970.2734375,61069.171875,99673.2734375,109199.875,112307.4765625,112374.84375
2,TRBV19;TRB,Homo sapiens,TVB19_HUMAN;TRBR1_HUMAN,A0A075B6N1_Y83_M3,KGDIAEGYSVSREKK,Phospho (STY),3,A0A075B6N1,Y,83,...,69968.8359375,103632.6015625,90488.9296875,113429.859375,96970.2734375,61069.171875,99673.2734375,109199.875,112307.4765625,112374.84375
3,TRBV19;TRB,Homo sapiens,TVB19_HUMAN;TRBR1_HUMAN,P0DSE2_S86_M3,IAEGYSVSREKKESF,Phospho (STY),3,P0DSE2,S,86,...,69968.8359375,103632.6015625,90488.9296875,113429.859375,96970.2734375,61069.171875,99673.2734375,109199.875,112307.4765625,112374.84375
4,TRBV19;TRB,Homo sapiens,TVB19_HUMAN;TRBR1_HUMAN,P0DSE2_S84_M3,GDIAEGYSVSREKKE,Phospho (STY),3,P0DSE2,S,84,...,69968.8359375,103632.6015625,90488.9296875,113429.859375,96970.2734375,61069.171875,99673.2734375,109199.875,112307.4765625,112374.84375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54858,MORC2,Homo sapiens,MORC2_HUMAN,Q9Y6X9_S739_M2,ATPSRKRSVAVSDEE,Phospho (STY),2,Q9Y6X9,S,739,...,23552.466796875,22144.580078125,20846.8515625,24248.41796875,22490.0546875,22095.990234375,25553.849609375,22250.546875,14592.869140625,19265.998046875
54859,MORC2,Homo sapiens,MORC2_HUMAN,Q9Y6X9-2_S681_M2,RKRSVAVSDEEEVEE,Phospho (STY),2,Q9Y6X9-2,S,681,...,23552.466796875,22144.580078125,20846.8515625,24248.41796875,22490.0546875,22095.990234375,25553.849609375,22250.546875,14592.869140625,19265.998046875
54860,MORC2,Homo sapiens,MORC2_HUMAN,Q9Y6X9-2_S677_M2,ATPSRKRSVAVSDEE,Phospho (STY),2,Q9Y6X9-2,S,677,...,23552.466796875,22144.580078125,20846.8515625,24248.41796875,22490.0546875,22095.990234375,25553.849609375,22250.546875,14592.869140625,19265.998046875
54861,IVNS1ABP,Homo sapiens,NS1BP_HUMAN,Q9Y6Y0_M341_M1,SKSLSFEMQQDELIE,Oxidation (M),1,Q9Y6Y0,M,341,...,Filtered,17287.40625,Filtered,15751.861328125,14749.724609375,12410.79296875,14130.1396484375,Filtered,13198.474609375,13553.0908203125


By default, the reader extracts the protein names and gene names as feature-level metadata:

In [12]:
# Example with the default column mapping
reader = pg_reader_provider.get_reader('spectronaut')
reader.import_file(spectronaut_example_path)

Unnamed: 0_level_0,Unnamed: 1_level_0,[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity,[2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity,[3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity,[4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity,[5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity,[6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity,[7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity,[8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity,[9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity,[10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity,...,[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity,[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity,[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity,[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity,[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity,[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity,[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity,[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity,[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity,[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins,genes,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MORC2_HUMAN,MORC2,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
MORC2_HUMAN,MORC2,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
MORC2_HUMAN,MORC2,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
NS1BP_HUMAN,IVNS1ABP,,,38411.285156,,,,10104.601562,12773.764648,10412.311523,11411.670898,...,,17287.406250,,15751.861328,14749.724609,12410.792969,14130.139648,,13198.474609,13553.090820


Let's say that we are also interested in the modified amino acid at each PTM site. We can extract this information as well by using the `add_column_mapping` method:

In [13]:
# Add custom column mapping for the amino acid at the PTM site
reader.add_column_mapping({"ptm_site_amino_acid": "PTM.SiteAA"})
reader.import_file(spectronaut_example_path)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity,[2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity,[3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity,[4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity,[5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity,[6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity,[7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity,[8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity,[9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity,[10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity,...,[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity,[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity,[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity,[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity,[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity,[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity,[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity,[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity,[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity,[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins,genes,ptm_site_amino_acid,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,S,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,S,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,Y,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,S,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
TVB19_HUMAN;TRBR1_HUMAN,TRBV19;TRB,S,,,,,,,89374.656250,,90181.578125,96197.070312,...,69968.835938,103632.601562,90488.929688,113429.859375,96970.273438,61069.171875,99673.273438,109199.875000,112307.476562,112374.843750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MORC2_HUMAN,MORC2,S,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
MORC2_HUMAN,MORC2,S,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
MORC2_HUMAN,MORC2,S,,,6817.745605,,,,18010.679688,12501.521484,17377.408203,13730.358398,...,23552.466797,22144.580078,20846.851562,24248.417969,22490.054688,22095.990234,25553.849609,22250.546875,14592.869141,19265.998047
NS1BP_HUMAN,IVNS1ABP,M,,,38411.285156,,,,10104.601562,12773.764648,10412.311523,11411.670898,...,,17287.406250,,15751.861328,14749.724609,12410.792969,14130.139648,,13198.474609,13553.090820
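
Conceptually, a column mapping of the form `{standardized_name: source_column}` amounts to renaming columns and moving them into the index. The sketch below reproduces that idea with plain pandas on a toy frame. The column `run_1.PTM.Quantity`, the intensity values, and the default mapping for `proteins` and `genes` are assumptions for illustration; this is not alphabase's implementation.

```python
import pandas as pd

# Toy excerpt of a Spectronaut-style report (values are illustrative)
raw = pd.DataFrame(
    {
        "PG.ProteinNames": ["MORC2_HUMAN", "NS1BP_HUMAN"],
        "PG.Genes": ["MORC2", "IVNS1ABP"],
        "PTM.SiteAA": ["S", "M"],
        "run_1.PTM.Quantity": [18010.7, 10104.6],
    }
)

# Assumed default mapping: standardized name -> source column
column_mapping = {"proteins": "PG.ProteinNames", "genes": "PG.Genes"}

# Extend the mapping with a custom entry, as in the cell above
column_mapping.update({"ptm_site_amino_acid": "PTM.SiteAA"})

# Invert the mapping for `rename`, then move the metadata into the index
rename = {source: standard for standard, source in column_mapping.items()}
tidy = raw.rename(columns=rename).set_index(list(column_mapping))

print(list(tidy.index.names))
print(list(tidy.columns))
```

After this step only the quantity column remains in the body of the frame, which matches the layout of the reader output shown above.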


## scverse compatibility 

The standardized format also allows users to easily convert the protein group tables to widely used `-omics` formats like `anndata.AnnData`.

In [14]:
def create_anndata_from_pg_matrix(file_path: str, search_engine: str, **kwargs) -> ad.AnnData:
    """Get anndata object from PG matrix."""

    reader = pg_reader_provider.get_reader(search_engine, **kwargs)
    df = reader.import_file(file_path)
    return ad.AnnData(
        X=df.values.T,
        var=df.index.to_frame(),
        obs=df.columns.to_frame(name="sample_id"),
    )

In [15]:
adata = create_anndata_from_pg_matrix(
    alphadia_example_path, search_engine="alphadia"
)

adata

AnnData object with n_obs × n_vars = 6 × 9364
    obs: 'sample_id'
    var: 'uniprot_ids'
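
The transpose in `create_anndata_from_pg_matrix` is needed because the readers return features × samples, while `AnnData` expects observations (samples) × variables (features). The following sketch checks the resulting shapes with pandas alone on a toy frame; the values are made up, and no `AnnData` object is created.

```python
import pandas as pd

# Toy reader output: 2 protein groups (rows) x 3 samples (columns)
report = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.Index(["P12345", "Q67890"], name="uniprot_ids"),
    columns=["sample_1", "sample_2", "sample_3"],
)

X = report.to_numpy().T  # samples x features
obs = report.columns.to_frame(name="sample_id")  # one row per sample
var = report.index.to_frame()  # one row per protein group

# The matrix dimensions must agree with the two annotation tables
assert X.shape == (len(obs), len(var))
print(X.shape)
```

This mirrors the `6 × 9364` shape reported for the AlphaDIA example: six samples as observations and 9364 protein groups as variables.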

## Conclusion

The alphabase protein group reader module provides:

- **Unified interface** for reading protein group tables from multiple search engines
- **Standardized output format** that facilitates cross-engine comparisons and downstream analyses
- **Flexible quantification options** to extract different measurement types (raw, LFQ, iBAQ)
- **Extensible architecture** that supports custom column mappings and new search engines

This standardization enables researchers to focus on biological insights rather than data format complexities.