# Developer Tutorial: Proteomics meets the scverse

This tutorial shows how `alphabase` can connect proteomics search engine outputs to the [scverse](https://scverse.org), a Python-centric software ecosystem that implements various tools for the analysis of (single-cell) omics data.

The central data structure to represent and store omics data in the scverse is the [anndata object](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html), an optimized data container for the representation of high-dimensional experimental data. 

`alphabase`'s core functions can be used by developers to easily create `anndata` objects from search engine outputs, more specifically peptide-spectrum matches (see the tutorial on [PSM readers](../nbs/psm_readers.ipynb)). `anndata` can then interface with various software tools for downstream tasks, many of which might also be applicable to proteomics data. In this use case, **alphabase acts as an adapter** that provides an interface between many search engines upstream and the various software tools downstream.

## Load libraries

Note that you need `anndata` and `mudata` installed in your software environment, either by installing them alongside your `alphabase` installation

```shell
pip install alphabase anndata mudata
```

or by installing `alphabase` with its `docs` extra

```shell
pip install "alphabase[docs]"
```
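
To confirm that the required packages are importable before running the cells below, a small check such as the following can help. This is a minimal sketch that uses only the standard library; the helper name `find_missing_packages` is hypothetical and not part of `alphabase`.

```python
from importlib.util import find_spec


def find_missing_packages(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical helper, not part of alphabase
missing = find_missing_packages(["alphabase", "anndata", "mudata"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```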

In [1]:
import warnings
from typing import Optional

import anndata as ad
import mudata as md
import numpy as np
import pandas as pd

from alphabase.psm_reader import psm_reader_provider
from alphabase.psm_reader.keys import PsmDfCols
from alphabase.tools.data_downloader import DataShareDownloader


## Implement anndata reader

Here, we implement an `anndata` reader that reads the output of multiple search engines into a standardized `anndata` object.

### Walk through 


#### Load file into unified structure

The first step is to load the PSM table into a standardized `pandas.DataFrame`. As described in the [PSM reader tutorial](../nbs/psm_readers.ipynb), alphabase abstracts away search engine-specific formats and harmonizes column names into a consistent schema.
This is accomplished by selecting the appropriate reader for your search engine:

```python
reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)
```

Then, load the file:

```python
psm_report = reader.load(file_path)
```

The resulting dataframe will include standardized column names such as:

- **intensity** – protein intensity (or via custom mapping, the precursor intensity)
- **proteins** – feature ID
- **raw_name** – sample/run identifier

**✅ Tip**: You don’t need to worry about the original column names—alphabase maps them automatically based on the search engine.


#### Customization
If your data contains non-standard or custom column names, or if you want to extract other data (e.g. precursor-level information), you can override the defaults using alphabase's built-in column mapping functionality:

```python
reader.add_column_mapping({
    # Maps standardized attributes to custom columns
    PsmDfCols.INTENSITY: "CustomIntensityCol",
    PsmDfCols.PROTEINS: "MyProteinIDs",
    PsmDfCols.RAW_NAME: "ExperimentName",
})
```
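
When the custom names arrive as optional function arguments, the mapping can be assembled from only those arguments that were actually set. The sketch below illustrates this with plain Python. The helper `build_column_mapping` is hypothetical, and the string keys `"intensity"`, `"proteins"` and `"raw_name"` are stand-ins for the corresponding `PsmDfCols` members (an assumption based on the standardized column names listed above).

```python
from typing import Dict, Optional


def build_column_mapping(
    intensity_column: Optional[str] = None,
    feature_column: Optional[str] = None,
    sample_column: Optional[str] = None,
) -> Dict[str, str]:
    """Collect only those overrides that were actually provided."""
    candidates = {
        "intensity": intensity_column,  # stand-in for PsmDfCols.INTENSITY
        "proteins": feature_column,  # stand-in for PsmDfCols.PROTEINS
        "raw_name": sample_column,  # stand-in for PsmDfCols.RAW_NAME
    }
    return {key: column for key, column in candidates.items() if column is not None}


# Only the feature and intensity columns are overridden here
mapping = build_column_mapping(
    feature_column="Precursor.Id",
    intensity_column="Precursor.Quantity",
)
print(mapping)
```

An empty result means that no override was requested, so the reader's default mapping can stay untouched.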

#### Pivot to wide format 

PSM tables are typically in long format—each row corresponds to a single precursor-spectrum match. To create an AnnData object, we need a matrix with samples as rows and features (e.g., proteins or peptides) as columns.

We use `pandas.pivot_table` to reshape the data:

```python
pivot_psm_report = pd.pivot_table(
    psm_report,
    index=PsmDfCols.RAW_NAME,
    columns=PsmDfCols.PROTEINS,
    values=PsmDfCols.INTENSITY,
    aggfunc="first",
    fill_value=np.nan,
)
```

💡 Protein-level columns in PSM tables can be redundant: many precursors point to the same protein group, and the PSM report repeats the same intensity value for each of them. We use `aggfunc="first"` to select the first observed intensity for a given `(sample, feature)` pair.

💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into *protein groups* as features.

💡 Here, we fill missing values with `NaN`, following standard practice in proteomics.
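
The effect of the pivot can be checked on a toy table. The following is a minimal sketch with made-up values. It uses the plain column names `raw_name`, `proteins` and `intensity` instead of the `PsmDfCols` constants, and the `precursor` column is a hypothetical addition that only illustrates the redundancy.

```python
import numpy as np
import pandas as pd

# Toy long-format report (made-up values): one row per precursor
psm_report = pd.DataFrame(
    {
        "raw_name": ["run1", "run1", "run1", "run2"],
        "proteins": ["P1", "P1", "P2", "P1"],
        "precursor": ["PEPTIDEK2", "LONGERPEPTIDER2", "ANOTHERK3", "PEPTIDEK2"],
        "intensity": [100.0, 100.0, 50.0, 80.0],
    }
)

# Reshape to (samples x features); redundant rows collapse to the first value
wide = pd.pivot_table(
    psm_report,
    index="raw_name",
    columns="proteins",
    values="intensity",
    aggfunc="first",
    fill_value=np.nan,
)
print(wide)
```

In this toy table, `P1` appears twice in `run1` with the same intensity and collapses to a single entry, while `P2` was not observed in `run2` and therefore remains `NaN`.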


#### Build anndata object
Now that we have a wide-format matrix of intensities, creating the AnnData object is straightforward:

```python
adata = ad.AnnData(
    X=pivot_psm_report.values,
    obs=pivot_psm_report.index.to_frame(index=False),
    var=pivot_psm_report.columns.to_frame(index=False)
)
```
In the implementation below, we slightly modify this logic to account for custom column names.
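
There are two ways to derive the `obs` and `var` frames from the wide table: storing the names as a regular column, as in the snippet above, or carrying them as the index of an otherwise empty frame, as the `read_psm` implementation in this tutorial does. The sketch below contrasts both on a toy table with made-up values. It uses `pandas` only, so the final `ad.AnnData(...)` call is left out.

```python
import numpy as np
import pandas as pd

# Toy wide table (made-up values): samples as rows, features as columns
wide = pd.DataFrame(
    [[100.0, 50.0], [80.0, np.nan]],
    index=pd.Index(["run1", "run2"], name="raw_name"),
    columns=pd.Index(["P1", "P2"], name="proteins"),
)

# Variant 1: names stored as a regular column with a default integer index
obs_as_column = wide.index.to_frame(index=False)

# Variant 2: empty frames that carry the names as their index
obs_as_index = pd.DataFrame(index=wide.index)
var_as_index = pd.DataFrame(index=wide.columns)

# The intensity matrix must match the lengths of obs and var
X = wide.to_numpy()
print(obs_as_column.columns.tolist(), obs_as_column.index.tolist())
print(obs_as_index.index.tolist(), var_as_index.index.tolist())
print(X.shape)
```

With the second variant, sample and feature names remain available as row and column labels without adding extra metadata columns.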

### Implementation

The complete implementation might look a bit like this:

In [2]:
def read_psm(
    file_path: str,
    reader_type: str,
    intensity_column: Optional[str] = None,
    feature_column: Optional[str] = None,
    sample_column: Optional[str] = None,
    **reader_kwargs,
) -> ad.AnnData:
    """Convert a PSM table to an anndata object that stores the feature intensities

    Parameters
    ----------
    file_path
        Path to file
    reader_type
        Type of search engine output. Must be one of the implemented readers
        (see: `alphabase.psm_reader.psm_reader_provider.reader_dict.keys()`)
    intensity_column
        Name of the column storing intensity data. Defaults to `intensity` key in `psm_reader.yaml`
    feature_column
        Name of the column storing proteins ids. Defaults to `proteins` key in `psm_reader.yaml`
    sample_column
        Name of the column storing raw (or run) name. Defaults to `raw_name` key in `psm_reader.yaml`
    **reader_kwargs
        Passed to :meth:`alphabase.psm_reader.psm_reader_provider.get_reader`

    Returns
    -------
    :class:`anndata.AnnData`

        with
            - .X: Intensities as specified in `intensity_column`
            - .obs: Empty `.obs` dataframe with sample name (run) as index
            - .var: Empty `.var` dataframe with feature name as index

    Warns
    -----
    UserWarning
        If a sample contains redundant features. Only the first value is kept.
    """
    # Get correct reader
    reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)

    # Enable customized column mapping
    custom_column_mapping = {
        k: v
        for k, v in {
            PsmDfCols.INTENSITY: intensity_column if intensity_column else None,
            PsmDfCols.PROTEINS: feature_column if feature_column else None,
            PsmDfCols.RAW_NAME: sample_column if sample_column else None,
        }.items()
        if v is not None
    }

    if custom_column_mapping:
        reader.add_column_mapping(custom_column_mapping)

    # Read file
    psm_report = reader.load(file_path)

    # Warn if duplicated features per sample exist which will get dropped
    # This check will typically warn users when working with protein groups, as the protein group data 
    # is redundant for each precursor belonging to a specific protein group
    duplicated_features = psm_report.groupby(PsmDfCols.RAW_NAME)[PsmDfCols.PROTEINS].apply(
        lambda df: df.duplicated().sum()
    )
    if any(duplicates > 0 for duplicates in duplicated_features):
        warnings.warn(
            f"Found {sum(duplicated_features)} duplicated features. Using only first.",
            stacklevel=1,
        )

    # Pivot from long to wide format
    # The psm report is oriented in a long format, while the count table has a wide format
    # Thus, we pivot the psm report so that it has the shape (samples x features)
    pivot_psm_report = pd.pivot_table(
        psm_report,
        index=PsmDfCols.RAW_NAME,
        columns=PsmDfCols.PROTEINS,
        values=PsmDfCols.INTENSITY,
        aggfunc="first",
        fill_value=np.nan,
    )

    obs = pd.DataFrame(index=pivot_psm_report.index)
    var = pd.DataFrame(index=pivot_psm_report.columns)

    # Use custom names instead of streamlined names for custom columns to prevent incorrect
    # naming (e.g. "protein" for precursor indices for custom features)
    obs = obs.rename_axis(index=sample_column) if sample_column is not None else obs
    var = var.rename_axis(index=feature_column) if feature_column is not None else var

    # Assemble to anndata
    return ad.AnnData(X=pivot_psm_report.values, obs=obs, var=var)
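
The duplicate check inside `read_psm` can be tried in isolation. The following is a minimal sketch on a toy table with made-up values; it uses the plain column names `raw_name` and `proteins` in place of the `PsmDfCols` constants.

```python
import pandas as pd

# Toy long-format report (made-up values): P1 occurs twice in run1
psm_report = pd.DataFrame(
    {
        "raw_name": ["run1", "run1", "run1", "run2"],
        "proteins": ["P1", "P1", "P2", "P1"],
        "intensity": [100.0, 100.0, 50.0, 80.0],
    }
)

# Count, per sample, how many rows repeat an already seen feature
duplicated_features = psm_report.groupby("raw_name")["proteins"].apply(
    lambda features: features.duplicated().sum()
)
total_duplicates = int(duplicated_features.sum())
print(duplicated_features)
print(total_duplicates)
```

Because `duplicated()` marks every occurrence after the first one, the count corresponds to the number of rows that the subsequent pivot with `aggfunc="first"` ignores.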

### Usage - Interact with anndata reader

We can explore the behaviour of this reader with some sample data, in this case a `DiaNN` PSM report. The report contains data from 5 HeLa cell digests that were analysed by label-free DIA on an Orbitrap Astral.

In [3]:
def get_diann_example(output_dir: Optional[str] = None) -> str:
    """Get example data for the tutorial

    The function downloads an example DiaNN v1.9.2 report (.parquet) and stores it
    in `output_dir` or, if none is given, in the system's temporary directory

    Parameters
    ----------
    output_dir
        Output directory. If `None`, the system's temporary directory is used

    Returns
    -------
    File location

    Notes
    -----
    File size: ca. 365 MB
    """
    EXAMPLE_URL = "https://datashare.biochem.mpg.de/s/i66mHaInHUP8HwS"

    if output_dir is None:
        # `tempfile.tempdir` can still be `None` here; `gettempdir()` always resolves a path
        from tempfile import gettempdir

        output_dir = gettempdir()

    downloader = DataShareDownloader(url=EXAMPLE_URL, output_dir=output_dir)

    return downloader.download()

In [4]:
# Download example data
diann_path = get_diann_example()

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/diann_1.9.2_HELA_QC_psm_report.parquet already exists (49.487714767456055 MB)


By default, the function reads protein-level information from the PSM report. We see that `DiaNN` identified 9190 distinct protein groups in the data.

In [5]:
protein_groups = read_psm(diann_path, reader_type="diann")
protein_groups



AnnData object with n_obs × n_vars = 5 × 9190

If, instead of using the default settings, we pass the custom columns that identify precursors and precursor intensities, we can also read precursor-level data into an `anndata.AnnData` object. Note that this requires users to specify the (search engine-specific) column names for precursor identifiers (=features) and their intensity values. In the given dataset, `DiaNN` identified `99478` distinct precursors.

In [6]:
precursors = read_psm(
    diann_path,
    reader_type="diann",
    feature_column="Precursor.Id",
    intensity_column="Precursor.Quantity", # use unnormalized intensities here
)
precursors

AnnData object with n_obs × n_vars = 5 × 99478

## Usage - Downstream applications

We can use the `anndata` objects for downstream computations. Here, for example, we define a small function that computes the proportion of missing values per sample based on the standardized `anndata` structure. In contrast to working with PSM tables directly, this function is applicable to data derived from any PSM report, as the data structure has been standardized.

In [7]:
def compute_missing_value_proportion(adata: ad.AnnData, sample_column: Optional[str] = None) -> pd.DataFrame:
    """Compute proportion of missing values per sample"""
    index = adata.obs_names if sample_column is None else adata.obs[sample_column]
    
    return pd.DataFrame(
        np.isnan(adata.X).sum(axis=1) / adata.n_vars,
        index=index, columns=["proportion_missing_values"]
    )
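
The core computation of this function can be verified on a toy matrix. The following is a minimal sketch with made-up values that uses `numpy` only and stands in for `adata.X` and `adata.n_vars`.

```python
import numpy as np

# Toy intensity matrix (made-up values): 2 samples x 4 features
X = np.array(
    [
        [100.0, np.nan, 50.0, 20.0],
        [80.0, np.nan, np.nan, 25.0],
    ]
)

# Count missing values per row (sample) and divide by the number of features
n_vars = X.shape[1]
proportion_missing = np.isnan(X).sum(axis=1) / n_vars
print(proportion_missing)
```

Summing along `axis=0` and dividing by the number of samples would give the corresponding proportion per feature.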

In [8]:
compute_missing_value_proportion(protein_groups)

| raw_name | proportion_missing_values |
|---|---|
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_1 | 0.04679 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_2 | 0.037867 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_3 | 0.037867 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_4 | 0.039173 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_5 | 0.038629 |


In [9]:
compute_missing_value_proportion(precursors)

| raw_name | proportion_missing_values |
|---|---|
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_1 | 0.083144 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_2 | 0.069483 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_3 | 0.070458 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_4 | 0.071936 |
| 20250808_OA1_Eno13_16p3min_SBM_ADIAMA_HeLa_5ng_F-40_HT_5 | 0.069453 |


## Bind precursor and protein data in a single container

We can use `mudata`, the multimodal extension of `anndata`, to store protein and precursor data in a single container.

In [10]:
mdata = md.MuData(
    {"proteins": protein_groups, "precursors": precursors}
)

mdata



## Conclusion

In this tutorial, we demonstrated how `alphabase` can serve as a bridge between proteomics search engine outputs and the `anndata` structure—an efficient, standardized format widely adopted in the scverse ecosystem. By converting [PSM reports](../nbs/psm_readers.ipynb) into AnnData objects, developers can seamlessly integrate proteomics data with a growing suite of tools for analysis, visualization, and downstream modeling.