# Tutorial: Using AnnDataFactory for Proteomics Data Analysis

This notebook demonstrates how to use the `AnnDataFactory` class to convert proteomics PSM (peptide-spectrum match) data into the AnnData format, which is widely used in single-cell analysis pipelines.

In [None]:
import pandas as pd
import tempfile

from alphabase.psm_reader.keys import PsmDfCols
from alphabase.anndata.anndata_factory import AnnDataFactory
from alphabase.tools.data_downloader import DataShareDownloader


## 1. Creating an AnnDataFactory from a DataFrame

First, let's create a sample PSM DataFrame with the required columns and pass it to the `AnnDataFactory` constructor.

The resulting AnnData object has:
   - Rows (obs) representing samples (raw names)
   - Columns (var) representing proteins
   - X matrix containing intensity values

In [None]:
# Create sample PSM data
sample_psm_data = {
    PsmDfCols.RAW_NAME: ['sample1', 'sample1', 'sample2', 'sample2'],
    PsmDfCols.PROTEINS: ['proteinA', 'proteinB', 'proteinA', 'proteinB'],
    PsmDfCols.INTENSITY: [100, 200, 150, 250]
}
psm_df = pd.DataFrame(sample_psm_data)

# Create AnnDataFactory instance
factory = AnnDataFactory(psm_df)

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)
print("\nObservations (samples):", adata.obs_names)
print("\nVariables (proteins):", adata.var_names)
print("\nIntensity matrix:\n", adata.X)
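
To make the layout described above concrete, here is a minimal, library-free sketch of the long-to-wide reshaping: one row per sample, one column per protein, and intensities in the cells. It is not the `AnnDataFactory` implementation. The dictionary keys (`"raw_name"`, `"proteins"`, `"intensity"`) and the helper `pivot_long_to_wide` are hypothetical names chosen for illustration, and letting later duplicates overwrite earlier ones is an arbitrary choice made here.

```python
# Long-format records: one entry per (sample, protein) measurement.
# The keys below are illustrative placeholders, not alphabase constants.
records = [
    {"raw_name": "sample1", "proteins": "proteinA", "intensity": 100},
    {"raw_name": "sample1", "proteins": "proteinB", "intensity": 200},
    {"raw_name": "sample2", "proteins": "proteinA", "intensity": 150},
    {"raw_name": "sample2", "proteins": "proteinB", "intensity": 250},
]


def pivot_long_to_wide(records):
    """Return (samples, proteins, matrix) with samples as rows and proteins as columns."""
    samples = sorted({r["raw_name"] for r in records})
    proteins = sorted({r["proteins"] for r in records})
    row_index = {name: i for i, name in enumerate(samples)}
    col_index = {name: j for j, name in enumerate(proteins)}

    # Start with NaN everywhere so unmeasured combinations stay marked as missing.
    matrix = [[float("nan")] * len(proteins) for _ in samples]
    for r in records:
        i = row_index[r["raw_name"]]
        j = col_index[r["proteins"]]
        # Assumption for this sketch: a later duplicate overwrites an earlier one.
        matrix[i][j] = float(r["intensity"])
    return samples, proteins, matrix


samples, proteins, matrix = pivot_long_to_wide(records)
print("obs (samples):", samples)
print("var (proteins):", proteins)
for name, row in zip(samples, matrix):
    print(name, row)
```

In this sketch, `samples` plays the role of `adata.obs_names`, `proteins` the role of `adata.var_names`, and `matrix` the role of `adata.X`.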

## 2. Loading Data from Files (AlphaDIA Example)

The `AnnDataFactory` can also read data directly from PSM files. The example below downloads a small AlphaDIA report and loads it:

In [None]:
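# Small AlphaDIA report used as example input
url = "https://datashare.biochem.mpg.de/s/Hk41INtwBvBl0kP/download?files=alphadia_1.8.1_report_head.tsv"

with tempfile.TemporaryDirectory() as temp_dir:
    # Download the example file into a temporary directory
    file_path = DataShareDownloader(
        url=url, output_dir=temp_dir
    ).download()

    # Read the file while the temporary directory still exists
    factory = AnnDataFactory.from_files(
        file_paths=file_path,
        reader_type="alphadia"
    )

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

# Show the intensity matrix as a DataFrame (samples x proteins)
adata.to_df()
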
## 3. Customizing Column Names

If your input files use different column names than what is preconfigured in `AnnDataFactory`, you can specify them:

In [None]:
# factory = AnnDataFactory.from_files(
#     file_paths="path/to/psm_file.txt",
#     reader_type="alphadia",
#     intensity_column="CustomIntensity",
#     protein_id_column="ProteinIdentifier",
#     raw_name_column="RunName"
# )
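
As a conceptual illustration of what such a column mapping amounts to, the sketch below reads a tab-separated table with non-default headers and assigns each header a role (run name, protein identifier, intensity). It uses only the standard library and does not reflect how `AnnDataFactory` handles the mapping internally. The file content, the `COLUMN_ROLES` mapping, and the helper `read_with_roles` are all made up for this example.

```python
import csv
import io

# Made-up TSV content whose headers differ from the defaults.
custom_tsv = (
    "RunName\tProteinIdentifier\tCustomIntensity\n"
    "run1\tprotA\t1200.5\n"
    "run2\tprotA\t980.0\n"
)

# Hypothetical mapping from each custom header to the role it plays.
COLUMN_ROLES = {
    "RunName": "raw_name",
    "ProteinIdentifier": "proteins",
    "CustomIntensity": "intensity",
}


def read_with_roles(text, column_roles):
    """Parse TSV text and return rows keyed by role instead of by original header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Fail early if the file lacks a column the mapping expects.
    missing = set(column_roles) - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"Missing expected columns: {sorted(missing)}")
    return [
        {role: row[name] for name, role in column_roles.items()}
        for row in reader
    ]


rows = read_with_roles(custom_tsv, COLUMN_ROLES)
for row in rows:
    print(row)
```

Checking for missing headers up front gives a clear error when a file does not contain the expected columns, rather than a failure further downstream.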