In [None]:
%load_ext autoreload
%autoreload 2

import anndata as ad
import numpy as np
import pandas as pd
from alphatools.io import pg_reader
from alphatools.io import psm_reader



# What is AnnData and why is it the main data structure of `alphatools`?

AnnData is a Python data format from scverse (an open-source single-cell software ecosystem) that keeps numeric data (arrays) and metadata aligned. This addresses a core limitation of pandas DataFrames, where numerical and non-numerical data must coexist in the same structure, making dataframe-wide numeric operations difficult. Below is a schematic of an AnnData object, which is created by the `AnnData` class of the `anndata` package [1]:

<div align="center">
<img src="../../assets/anndata_schema.svg" width="400" height="300">
</div>
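
The DataFrame limitation mentioned above can be illustrated with a small sketch. The table, its column names and its values are made up for illustration; the exact exception raised depends on the installed pandas and numpy versions, so the code only reports its type.

```python
import numpy as np
import pandas as pd

# One table holding both intensities and a text annotation (illustrative values)
mixed = pd.DataFrame(
    {
        "gene1": [1.0, 4.0],
        "gene2": [2.0, 8.0],
        "cohort": ["control", "disease"],
    },
    index=["cell1", "cell2"],
)

# A table-wide numeric transform trips over the text column
try:
    np.log2(mixed)
except Exception as err:  # exact exception type depends on pandas/numpy versions
    print(f"log2 on the mixed table failed with {type(err).__name__}")

# Plain-pandas workaround: split off the numeric part, transform, re-attach
numeric = mixed.select_dtypes(include="number")
logged = np.log2(numeric).join(mixed[["cohort"]])
print(logged)
```

With AnnData, the numeric part would live in `X` and the annotation in `obs`, so this splitting and re-joining is not needed.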

# Core components for AlphaTools

For most `alphatools` applications, you'll primarily work with three key components:

### **X**: The Numeric Expression Matrix
A numpy array where rows represent **samples** and columns represent **features** (e.g., proteins, precursors, genes)

### **obs**: Sample Metadata  
A DataFrame where rows are **samples** and columns contain metadata properties (e.g., age, disease state, cohort, batch)

### **var**: Feature Metadata
A DataFrame where rows are **features** and columns contain feature properties (e.g., for proteins: gene names, GO terms, functional annotations)

This structure ensures that when you filter samples or features, all associated metadata automatically stays synchronized, preventing common annotation misalignment issues in analysis workflows.

[1] [AnnData Documentation](https://anndata.readthedocs.io/en/stable/)

# Basic AnnData syntax

In [None]:
numerical_data = np.array([[1, 0, 0], [0, 2, 3]])

sample_metadata = pd.DataFrame(
    {
        "obs_names": ["cell1", "cell2"],
        "age": [28, 29],
    }
).set_index("obs_names")

feature_metadata = pd.DataFrame({"var_names": ["gene1", "gene2", "gene3"]}).set_index("var_names")

# Generate AnnData object
adata = ad.AnnData(
    X=numerical_data,
    obs=sample_metadata,
    var=feature_metadata,
)

# We can get a dataframe back
df = adata.to_df()
display(df)

# And also look at the sample and feature metadata
display(adata.obs)
display(adata.var)

obs_names,gene1,gene2,gene3
cell1,1,0,0
cell2,0,2,3


obs_names,age
cell1,28
cell2,29


gene1
gene2
gene3


### To "get out" of AnnData and back to the more familiar world of dataframes, just call the built-in `.to_df()` method

In [None]:
df = adata.to_df()
display(df)

# Get your sample metadata added as columns
df = df.join(adata.obs)
display(df)

obs_names,gene1,gene2,gene3
cell1,1,0,0
cell2,0,2,3


obs_names,gene1,gene2,gene3,age
cell1,1,0,0,28
cell2,0,2,3,29


### `alphatools` has a suite of useful filtering and processing functions built around AnnData, making your analyses simpler, more robust and - importantly - compatible with Scanpy/Scverse functions

---

# How to load search engine data into AnnData with `alphatools`?

Loading relies on the `alphabase` backbone of PSM and PG readers, which load and parse common search engine output formats. In this notebook we look at reading data from three common search engines: DIANN, AlphaDIA and Spectronaut. Search engines output data in two main ways: as a *long table* (PSM report), where each row is an individual precursor in one sample, or as a *wide table* (protein group report), where each row is a protein group and each column is a sample.

### *PSM report* (long format)

| Precursor | Run | Stripped.Sequence | ... |
|-----------|-----|-------------------|-----|
| PEPTIDEK1 | file_1.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_1.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_1.raw | PEPTIDR | ... |
| PEPTIDEK1 | file_2.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_2.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_2.raw | PEPTIDR | ... |
| PEPTIDEK1 | file_3.raw | PEPTIDEK | ... |
| PEPTIDERK2 | file_3.raw | PEPTIDERK | ... |
| PEPTIDR2 | file_3.raw | PEPTIDR | ... |

### *Protein group report* (wide format)

| Protein | file_1.raw | file_2.raw | file_3.raw | ... |
|---------|------------|------------|------------|-----|
| PROT1 | 1000 | 1200 | 1100 | ... |
| PROT2 | 2000 | 2200 | 2100 | ... |
| PROT3 | 1500 | 1600 | 1550 | ... |
| ... | ... | ... | ... | ... |


In almost all data analysis, we want a transposed version of the wide format, where rows are samples and columns are features. Another frequent task is to pivot the long table into a wide protein table.
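
Both reshaping steps can be sketched in plain pandas. The wide table reuses the values from the protein group report above. The long table is hypothetical: its `Protein` and `Intensity` columns and all of its values are assumptions made for illustration, and summing precursor intensities per protein is only a stand-in for real protein quantification, not what the `alphatools` readers do.

```python
import pandas as pd

# Wide protein group report: rows = proteins, columns = runs
wide = pd.DataFrame(
    {
        "Protein": ["PROT1", "PROT2", "PROT3"],
        "file_1.raw": [1000, 2000, 1500],
        "file_2.raw": [1200, 2200, 1600],
        "file_3.raw": [1100, 2100, 1550],
    }
)

# Transpose so that rows = samples and columns = features
samples_by_features = wide.set_index("Protein").T
print(samples_by_features)

# Hypothetical long precursor report with made-up intensities
long = pd.DataFrame(
    {
        "Run": ["file_1.raw"] * 3 + ["file_2.raw"] * 3,
        "Precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDR2"] * 2,
        "Protein": ["PROT1", "PROT1", "PROT2"] * 2,
        "Intensity": [10.0, 20.0, 5.0, 12.0, 18.0, 7.0],
    }
)

# Pivot long -> wide: one row per run, one column per protein
protein_wide = long.pivot_table(
    index="Run", columns="Protein", values="Intensity", aggfunc="sum"
)
print(protein_wide)
```

Both results have samples as rows and features as columns, which is the orientation AnnData expects for `X`.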

# Getting the example data

Small versions of larger real-world datasets are stored in this repository: `alphatools/docs/_example_data/pg_tables` for wide-format protein reports and `alphatools/docs/_example_data/psm_tables` for long-format precursor reports. Data for three DIA search engines (DIANN, AlphaDIA and Spectronaut) are deposited there.



# Wide format Protein Tables

### AlphaDIA

In [None]:
# Using the pg reader
alphadia_pg_anndata = pg_reader.read_pg_table(
    path="../_example_data/pg_tables/alphadia/pg.matrix_top100.tsv",
    search_engine="alphadia",
)

# Inspect the data
display(alphadia_pg_anndata.to_df().iloc[:5, :5])
display(alphadia_pg_anndata.obs.head())
display(alphadia_pg_anndata.var.head())


uniprot_ids,A0A024R6N5;A0A0G2JRN3,A0A075B6H7,A0A075B6H9,A0A075B6I0,A0A075B6I1
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01,0.0,20983750.0,1603786000.0,1234308000.0,99818110.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05,52488540.0,15586760.0,1264443000.0,1017132000.0,143841900.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11,306331100.0,7633061.0,985900600.0,648213500.0,54237010.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12,155801600.0,14422270.0,355841100.0,1124133000.0,87516710.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03,258506700.0,46279530.0,1224894000.0,873871500.0,84600910.0


20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03


A0A024R6N5;A0A0G2JRN3
A0A075B6H7
A0A075B6H9
A0A075B6I0
A0A075B6I1


### DIANN

In [None]:
# Using the pg reader
diann_pg_anndata = pg_reader.read_pg_table(
    path="../_example_data/pg_tables/diann/pg.matrix_top100.tsv",
    search_engine="diann",
)

# Inspect the data
display(diann_pg_anndata.to_df().iloc[:5, :5])
display(diann_pg_anndata.obs.head())
display(diann_pg_anndata.var.head())

proteins,LV469_HUMAN,LV861_HUMAN,LV460_HUMAN,LV548_HUMAN,LV746_HUMAN
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw,77220500.0,103374000.0,10956700.0,,125708000.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw,39566800.0,71580200.0,12655900.0,,55909900.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw,27392100.0,42695000.0,4543130.0,,89816500.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw,12829700.0,55057200.0,7905720.0,202590.0,53537600.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw,33979400.0,66161200.0,4035670.0,,112146000.0


/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw


proteins,uniprot_ids,genes,description,peptide_count,proteotypic_peptide_count
LV469_HUMAN,A0A075B6H9,IGLV4-69,,2,2
LV861_HUMAN,A0A075B6I0,IGLV8-61,,3,3
LV460_HUMAN,A0A075B6I1,IGLV4-60,,2,2
LV548_HUMAN,A0A075B6I7,IGLV5-48,,1,1
LV746_HUMAN,A0A075B6I9,IGLV7-46,,3,1


### Spectronaut

In [None]:
sn_pg_anndata = pg_reader.read_pg_table(
    # Note: this path points into `_not_track`, which appears to be an untracked local directory
    path="../_example_data/_not_track/spectronaut/pg_table/spectronaut_protein_table.tsv",
    # path="../_example_data/pg_tables/spectronaut/pg.matrix_top100.tsv",
    search_engine="spectronaut",
)

# Inspect the data
display(sn_pg_anndata.to_df().iloc[:5, :5])
display(sn_pg_anndata.obs.head())
display(sn_pg_anndata.var.head())

proteins,A0A023T672;A0A0G2JFX7,A0A023T778;P61327;Q9CQL1,A0A067XG53;O70589;O70589-3;O70589-4;O70589-5,A0A075B5N7;A0A0G2JGY3;P01633;P01634,A0A075B5P4;A0A0A6YWR2;P01868;P01869
[1] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_01_D1_1_4763.htrms.PG.Quantity,649.585999,517.322876,44.137974,936.102295,127.588615
[2] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_02_D1_1_4766.htrms.PG.Quantity,596.54071,531.168152,18.098984,749.433472,180.142838
[3] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_03_D1_1_4770.htrms.PG.Quantity,600.497559,555.865112,41.415237,735.710571,194.361816
[4] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_04_D1_1_4774.htrms.PG.Quantity,643.466675,588.871277,44.147488,1171.752808,195.871048
[5] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_05_D1_1_4776.htrms.PG.Quantity,776.993408,543.412537,52.47094,1081.803711,134.120056


[1] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_01_D1_1_4763.htrms.PG.Quantity
[2] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_02_D1_1_4766.htrms.PG.Quantity
[3] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_03_D1_1_4770.htrms.PG.Quantity
[4] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_04_D1_1_4774.htrms.PG.Quantity
[5] 20210610_TIMS05_MCT_SA_DBS05_01_delta_control1_05_D1_1_4776.htrms.PG.Quantity


A0A023T672;A0A0G2JFX7
A0A023T778;P61327;Q9CQL1
A0A067XG53;O70589;O70589-3;O70589-4;O70589-5
A0A075B5N7;A0A0G2JGY3;P01633;P01634
A0A075B5P4;A0A0A6YWR2;P01868;P01869


# Long format precursor tables to protein data

### AlphaDIA

In [None]:
ad_pg_andata = psm_reader.read_psm_table(
    file_paths="../_example_data/psm_tables/alphadia/top20_precursors.tsv",
    search_engine="alphadia",
)

# Inspect the data
display(ad_pg_andata.to_df().iloc[:5, :5])
display(ad_pg_andata.obs.head())
display(ad_pg_andata.var.head())

raw_name,P01009;A0A024R6N5,P01011;A0A087WY93;G3V3A0;G3V595,P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02,P02675;D6REL8,P02787
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01,,,5122159000.0,,138340800000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02,3942457000.0,1895097000.0,7004725000.0,5359053000.0,116675000000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03,3670290000.0,1466623000.0,10705130000.0,4963886000.0,109660700000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04,5022584000.0,1653896000.0,6425221000.0,5095784000.0,107328200000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05,3314550000.0,1044103000.0,5690721000.0,4237927000.0,85038230000.0


20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05


P01009;A0A024R6N5
P01011;A0A087WY93;G3V3A0;G3V595
P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02
P02675;D6REL8
P02787


### DIANN

In [None]:
diann_pg_andata = psm_reader.read_psm_table(
    file_paths="../_example_data/psm_tables/diann/top20_report.parquet",
    search_engine="diann",
)

# Inspect the data
display(diann_pg_andata.to_df().iloc[:5, :5])
display(diann_pg_andata.obs.head())
display(diann_pg_andata.var.head())

raw_name,A4-11_HUMAN;A4-8_HUMAN;A4_HUMAN,AACT_HUMAN,ALBU_HUMAN,APOE_HUMAN,CLUS-6_HUMAN
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01,840603300.0,979814800.0,19297070000.0,6036072000.0,7492970000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02,821383100.0,1396344000.0,22874990000.0,7632371000.0,7724988000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03,876018800.0,1025414000.0,12640760000.0,4464353000.0,6336454000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04,1073826000.0,1272798000.0,15577330000.0,6688002000.0,6312889000.0
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05,1120733000.0,890513000.0,15796700000.0,4973349000.0,5925785000.0


20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05


A4-11_HUMAN;A4-8_HUMAN;A4_HUMAN
AACT_HUMAN
ALBU_HUMAN
APOE_HUMAN
CLUS-6_HUMAN


### Spectronaut

In [None]:
# sn_pg_andata = psm_reader.read_psm_table(
#     file_paths="../_example_data/psm_tables/spectronaut/example_dataset_mouse_sn_top20peptides.tsv",
#     search_engine="spectronaut",
#     intensity_column="F.PeakArea",
#     feature_id_column="FG.Id",
# )

# # Inspect the data
# display(sn_pg_andata.to_df().iloc[:5, :5])
# display(sn_pg_andata.obs.head())
# display(sn_pg_andata.var.head())

# In summary, we learned...

- how AnnData can help us keep our data and metadata aligned
- how to generate AnnData objects from our input dataframes
- how to get dataframes back from AnnData objects
- how to load protein tables using `alphatools` readers
- how to load and pivot PSM tables using `alphatools` readers

# From here on,

Data can be used for all further downstream analyses. Continue with notebook `01_basic_workflow.ipynb` to learn how to merge your metadata into AnnData objects and perform filtering & EDA!

In [None]:
pd.read_csv("../_example_data/psm_tables/spectronaut/example_dataset_mouse_sn_top20peptides.tsv", sep="\t").head()

Unnamed: 0,R.Label,PG.Genes,F.Charge,FG.Charge,PG.ProteinGroups,F.PeakArea,F.FrgLossType,PEP.StrippedSequence,FG.MS1IsotopeIntensities (Measured),EG.ModifiedSequence,FG.Id,F.FrgIon
0,20230926_OA2_CaWe_aQuant_mBrain_200ng_01.raw,Mapt,1,2,A2A5Y6,13116.519531,noloss,HVPGGGSVQIVYK,1832162.8;1392438.6;463366;196847.3;1,_HVPGGGSVQIVYK_,_HVPGGGSVQIVYK_.2,y11
1,20230926_OA2_CaWe_aQuant_mBrain_200ng_01.raw,Mapt,1,2,A2A5Y6,40.350075,noloss,HVPGGGSVQIVYK,1832162.8;1392438.6;463366;196847.3;1,_HVPGGGSVQIVYK_,_HVPGGGSVQIVYK_.2,b11
2,20230926_OA2_CaWe_aQuant_mBrain_200ng_01.raw,Mapt,1,2,A2A5Y6,116.450981,noloss,HVPGGGSVQIVYK,1832162.8;1392438.6;463366;196847.3;1,_HVPGGGSVQIVYK_,_HVPGGGSVQIVYK_.2,y10
3,20230926_OA2_CaWe_aQuant_mBrain_200ng_01.raw,Mapt,1,2,A2A5Y6,77.492371,noloss,HVPGGGSVQIVYK,1832162.8;1392438.6;463366;196847.3;1,_HVPGGGSVQIVYK_,_HVPGGGSVQIVYK_.2,b10
4,20230926_OA2_CaWe_aQuant_mBrain_200ng_01.raw,Mapt,1,2,A2A5Y6,52.135025,noloss,HVPGGGSVQIVYK,1832162.8;1392438.6;463366;196847.3;1,_HVPGGGSVQIVYK_,_HVPGGGSVQIVYK_.2,b12
