# What is AnnData and why is it the main data structure of `alphatools`?

AnnData is a Python data format from scverse (an open-source single-cell software ecosystem) that keeps numeric data (arrays) and metadata aligned. This addresses a core limitation of pandas DataFrames, where numerical and non-numerical data must coexist in the same structure, which makes dataframe-wide numeric operations difficult. Below is a schematic of an AnnData object, which is created by the `AnnData` class of the `anndata` package [1]:

<div align="center">
<img src="../../assets/anndata_schema.svg" width="400" height="300">
</div>
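
The DataFrame limitation described above can be illustrated with plain pandas. This is a minimal sketch with made-up sample names, column names and values (all assumptions, not taken from the example data): once metadata and measurements share one table, selecting "the numeric columns" also picks up numeric metadata such as age, so the intensity columns have to be tracked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical table mixing metadata (cohort, age) with intensities (PROT1, PROT2)
df = pd.DataFrame(
    {
        "cohort": ["A", "B"],
        "age": [34, 51],
        "PROT1": [1000.0, 1200.0],
        "PROT2": [2000.0, 2200.0],
    },
    index=["sample_1", "sample_2"],
)

# Selecting by dtype cannot tell numeric metadata apart from measurements
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)  # ['age', 'PROT1', 'PROT2']

# The intensity columns therefore have to be listed explicitly
intensity_columns = ["PROT1", "PROT2"]
log_intensities = np.log2(df[intensity_columns])
print(log_intensities)
```

In an AnnData object the intensities would live in `X` and `cohort` and `age` in `obs`, so no such bookkeeping is needed.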

## Core Components for `alphatools`

For most `alphatools` applications, you'll primarily work with three key components:

### **X**: The Numeric Expression Matrix
A NumPy array where rows represent **samples** and columns represent **features** (e.g., proteins, precursors, genes).

### **obs**: Sample Metadata
A DataFrame where rows are **samples** and columns contain metadata properties (e.g., age, disease state, cohort, batch).

### **var**: Feature Metadata
A DataFrame where rows are **features** and columns contain feature properties (e.g., for proteins: gene names, GO terms, functional annotations).

This structure ensures that when you filter samples or features, all associated metadata automatically stays synchronized, preventing common annotation misalignment issues in analysis workflows.

## References

- [1] [AnnData Documentation](https://anndata.readthedocs.io/en/stable/)

In [None]:
%load_ext autoreload
%autoreload 2

import anndata as ad
import numpy as np
import pandas as pd

numerical_data = np.array([[1, 0, 0], [0, 2, 3]])

sample_metadata = pd.DataFrame({"obs_names": ["cell1", "cell2"]}).set_index("obs_names")

feature_metadata = pd.DataFrame({"var_names": ["gene1", "gene2", "gene3"]}).set_index("var_names")

# Generate AnnData object
adata = ad.AnnData(
    X=numerical_data,
    obs=sample_metadata,
    var=feature_metadata,
)

# We can get a dataframe back
df = adata.to_df()
display(df)

# And also look at the sample and feature metadata
display(adata.obs)
display(adata.var)


### `alphatools` has a suite of filtering and processing functions built around AnnData, making your analyses simpler, more robust and, importantly, compatible with Scanpy/scverse functions

## How do I load search engine data into AnnData with `alphatools`?

This notebook demonstrates core `alphatools` functionality to load protein and precursor data from different sources. 

The functionality relies on the `alphabase` backbone of PSM and PG readers, which allows for loading and parsing of common search engine output formats. In this notebook we will look at reading data for three common search engines: DIANN, AlphaDIA and Spectronaut.

### What kind of data do we expect?

Search engines output data in two main ways: either as a *long table*, where each row is an individual precursor in a given sample, or as a *wide table*, where each row is a feature (e.g., a protein group) and each column is a sample:

### *PSM report* (long format)

| Precursor | Run | Stripped.Sequence | ... |
|-----------|-----|-------------------|-----|
| PEPTIDEK1 | file_1.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_1.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_1.raw | PEPTIDR | ... |
| PEPTIDEK1 | file_2.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_2.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_2.raw | PEPTIDR | ... |
| PEPTIDEK1 | file_3.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_3.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_3.raw | PEPTIDR | ... |

### *Protein group report* (wide format)

| Protein | file_1.raw | file_2.raw | file_3.raw | ... |
|---------|------------|------------|------------|-----|
| PROT1 | 1000 | 1200 | 1100 | ... |
| PROT2 | 2000 | 2200 | 2100 | ... |
| PROT3 | 1500 | 1600 | 1550 | ... |
| ... | ... | ... | ... | ... |

For most data analysis, we are interested in a transposed version of the wide format (samples as rows, features as columns), but sometimes we first need to transform the long table into a wide format. Both can be done with `alphatools` and the underlying `alphabase`.
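
To make the two reshaping steps concrete, here is a minimal sketch in plain pandas. It is not how `alphatools` or `alphabase` implement them. The `Intensity` column and all values are assumptions added for illustration; the `Precursor`, `Run` and `Protein` columns follow the schematic tables above.

```python
import pandas as pd

# Wide protein group report: features as rows, runs as columns
wide_report = pd.DataFrame(
    {
        "Protein": ["PROT1", "PROT2"],
        "file_1.raw": [1000, 2000],
        "file_2.raw": [1200, 2200],
    }
)
# Transpose so that samples are rows and features are columns
samples_by_proteins = wide_report.set_index("Protein").T

# Long PSM report: one row per precursor per run ("Intensity" is hypothetical)
long_report = pd.DataFrame(
    {
        "Precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDEK1", "PEPTIDERK2"],
        "Run": ["file_1.raw", "file_1.raw", "file_2.raw", "file_2.raw"],
        "Intensity": [10.0, 20.0, 11.0, 21.0],
    }
)
# Pivot into a samples x precursors matrix
samples_by_precursors = long_report.pivot(
    index="Run", columns="Precursor", values="Intensity"
)

print(samples_by_proteins)
print(samples_by_precursors)
```

Both results have samples as rows and features as columns, which is the orientation expected for the `X` matrix of an AnnData object.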

### Getting the example data

Small versions of larger real-world datasets are stored in this repository: `alphatools/docs/_example_data/pg_tables` for wide-format protein reports and `alphatools/docs/_example_data/psm_tables` for long-format precursor reports. Data for three DIA search engines (DIANN, AlphaDIA and Spectronaut) are deposited there.



In [None]:
from alphatools.io import pg_reader



## Wide format Protein Tables

### AlphaDIA

In [None]:
%load_ext autoreload
%autoreload 2

# Using the pg reader
alphadia_pg_anndata = pg_reader.read_pg_table(
    path="../_example_data/pg_tables/alphadia/pg.matrix_top100.tsv",
    search_engine="alphadia",
)

# Inspect the data
display(alphadia_pg_anndata.to_df().iloc[:5, :5])

| uniprot_ids | A0A024R6N5;A0A0G2JRN3 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 |
|-------------|-----------------------|------------|------------|------------|------------|
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01 | 0.0 | 20983750.0 | 1603786000.0 | 1234308000.0 | 99818110.0 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05 | 52488540.0 | 15586760.0 | 1264443000.0 | 1017132000.0 | 143841900.0 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11 | 306331100.0 | 7633061.0 | 985900600.0 | 648213500.0 | 54237010.0 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12 | 155801600.0 | 14422270.0 | 355841100.0 | 1124133000.0 | 87516710.0 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03 | 258506700.0 | 46279530.0 | 1224894000.0 | 873871500.0 | 84600910.0 |


### DIANN

In [None]:
# Using the pg reader
pg_reader.read_pg_table(
    path="../_example_data/pg_tables/diann/pg.matrix_top100.tsv",
    search_engine="diann",
)



TypeError: Setting a MultiIndex dtype to anything other than object is not supported

### Spectronaut

In [None]:
# Using the pg reader
# `path` and `search_engine` are placeholders (None) and must be set before running
pg_reader.read_pg_table(
    path=None,
    search_engine=None,
)