# Tutorial: Reading PSM tables into AnnData format

This notebook demonstrates how to use the `AnnDataFactory` class to convert proteomics PSM (peptide-spectrum match) data into the AnnData format, which is widely used in single-cell analysis pipelines.

In [None]:
import tempfile

import pandas as pd
from alphabase.psm_reader.keys import PsmDfCols
from alphabase.tools.data_downloader import DataShareDownloader

from alphatools.io.anndata_factory import AnnDataFactory


## 1. Creating an AnnDataFactory from a DataFrame

First, let's create a sample PSM DataFrame with the required columns and pass it to the `AnnDataFactory` constructor.

The resulting AnnData object has:
   - Rows (obs) representing samples (raw names)
   - Columns (var) representing proteins
   - X matrix containing intensity values

In [None]:
# Create sample PSM data
sample_psm_data = {
    PsmDfCols.RAW_NAME: ["sample1", "sample1", "sample2", "sample2"],
    PsmDfCols.PROTEINS: ["proteinA", "proteinB", "proteinA", "proteinB"],
    PsmDfCols.INTENSITY: [100, 200, 150, 250],
}

psm_df = pd.DataFrame(sample_psm_data)

# Create AnnDataFactory instance
factory = AnnDataFactory(psm_df)

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)
print("\nObservations (samples):", adata.obs_names)
print("\nVariables (proteins):", adata.var_names)
print("\nIntensity matrix:\n", adata.X)

AnnData shape: (2, 2)

Observations (samples): Index(['sample1', 'sample2'], dtype='object', name='raw_name')

Variables (proteins): Index(['proteinA', 'proteinB'], dtype='object', name='proteins')

Intensity matrix:
 [[100 200]
 [150 250]]
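To build intuition for the result above, the same long-to-wide reshaping can be sketched with pandas alone. This is only an illustration of the concept, not necessarily how `AnnDataFactory` is implemented. The column names `raw_name` and `proteins` are taken from the index names in the printed output; `intensity` is an assumed string value for `PsmDfCols.INTENSITY`.

```python
import pandas as pd

# Same toy data as above, with plain string column names.
# "intensity" is an assumption for the value of PsmDfCols.INTENSITY.
psm_df = pd.DataFrame(
    {
        "raw_name": ["sample1", "sample1", "sample2", "sample2"],
        "proteins": ["proteinA", "proteinB", "proteinA", "proteinB"],
        "intensity": [100, 200, 150, 250],
    }
)

# Reshape from long format (one row per sample/protein pair)
# to wide format (one row per sample, one column per protein).
wide = psm_df.pivot(index="raw_name", columns="proteins", values="intensity")

print(wide)
```

Note that `DataFrame.pivot` raises an error if a sample/protein pair occurs more than once, so real PSM tables with several entries per pair would need an explicit aggregation step first.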


## 2. Loading Data from Files (AlphaDIA Example)

`AnnDataFactory` can also read data directly from PSM files. Here's how to use it with AlphaDIA output:

In [None]:
url = "https://datashare.biochem.mpg.de/public.php/dav/files/Hk41INtwBvBl0kP/alphadia_1.8.1_report_head.tsv"
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()

    factory = AnnDataFactory.from_files(file_paths=file_path, reader_type="alphadia")

# Convert to AnnData
adata = factory.create_anndata()


print("AnnData shape:", adata.shape)

adata.to_df()

## 3. Customizing Column Names

If your input files use column names that differ from the defaults preconfigured in `AnnDataFactory`, you can specify them explicitly:

In [None]:
url = "https://datashare.biochem.mpg.de/public.php/dav/files/Hk41INtwBvBl0kP/diann_1.9.0_report_head.tsv"

with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()

    factory = AnnDataFactory.from_files(
        file_paths=file_path,
        reader_type="diann",
        raw_name_column="File.Name",
        protein_id_column="Protein.Group",
        # intensity_column="PG.MaxLFQ",
    )

adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

adata.to_df()
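If a report is already loaded as a DataFrame, a similar effect might be achieved by renaming its columns before passing it to the constructor as in section 1. The sketch below uses made-up rows for illustration. The target names `raw_name` and `proteins` come from the output in section 1, while `intensity` is an assumed value for `PsmDfCols.INTENSITY`.

```python
import pandas as pd

# Hypothetical rows using DIA-NN-style column names (values are made up).
report = pd.DataFrame(
    {
        "File.Name": ["run1.raw", "run1.raw", "run2.raw"],
        "Protein.Group": ["P1", "P2", "P1"],
        "PG.MaxLFQ": [1.0e5, 2.0e5, 1.5e5],
    }
)

# Map source column names to the names assumed to be expected downstream.
column_map = {
    "File.Name": "raw_name",
    "Protein.Group": "proteins",
    "PG.MaxLFQ": "intensity",  # assumption, see the note above
}
renamed = report.rename(columns=column_map)

# Fail early if any expected column is absent after renaming.
missing = set(column_map.values()) - set(renamed.columns)
if missing:
    raise KeyError(f"Missing required columns: {sorted(missing)}")

print(renamed.columns.tolist())
```

Checking for missing columns up front gives a clearer error than a failure later in the conversion.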