# Example `alphatools` workflow with proteomics data

This notebook demonstrates core `alphatools` functionality for proteomics data loading, preprocessing and visualization. 

Functions are intended to stay as close to plain Python as possible and to avoid closed end-to-end implementations. This is reflected in several design choices:

1. AnnData is used in favor of a custom data class to enable interoperability with any other tool from the Scverse.
2. matplotlib *Axes* and *Figure* instances are used for visualization, giving the user full autonomy to layer on custom visualizations with seaborn, matplotlib, or any other compatible visualization package.
3. Statistical and preprocessing functions are standalone and set with strong defaults, meaning that any function can be used outside of the `alphatools` context. 
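
As a minimal sketch of the second point, the snippet below uses plain matplotlib only; it does not call any `alphatools` plotting function, and the simulated values and labels are made up for illustration. The idea is that anything which hands back an *Axes* can be extended with further layers in the same way.

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated log2 intensities (made-up data, for illustration only)
rng = np.random.default_rng(0)
values = rng.normal(loc=20, scale=2, size=200)

# A basic plot drawn on an Axes that we keep a handle to
fig, ax = plt.subplots(figsize=(4, 3))
ax.hist(values, bins=30, color="lightgrey")

# Because the Axes is a regular matplotlib object, custom layers can be added freely
median = float(np.median(values))
ax.axvline(median, color="black", linestyle="--", label="median")
ax.set_xlabel("log2 intensity (simulated)")
ax.set_ylabel("count")
ax.legend()
```

The same `ax` could equally be passed to a seaborn function through its `ax=` argument.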

### Design choices of `alphatools`:
- **Data handling**: `AnnData` was chosen as the data container for two main reasons. 1) It is a lightweight, powerful solution to a fundamental challenge with dataframes: keeping numerical data and metadata aligned at all times. With dataframes, the options are either to include non-numeric metadata columns, which complicates data operations, or to add cumbersome multi-level indices. 2) It is compatible with the Scverse, Scanpy and all associated tools, which essentially removes the barrier between proteomics and transcriptomics data analysis and enables multi-omics analyses.
- **Plotting**: Inspired by the [`stylia`] package, `alphatools` applies one consistent, aesthetically pleasing design to all plots. A core component of this implementation is that `create_figure` returns subplots as an iterable data structure: once the basic layout of a figure is decided, users simply step from one subplot to the next and populate each one with figure elements.
- **Standardization**: A key consideration of this package is the loading of proteomics data, whose biggest pain point is the non-standard output of the various proteomics search engines. By building on `alphabase`, we handle this complexity early and provide the user with AnnData objects containing either proteins or precursors. These can be converted to dataframes that include the metadata with almost no friction by running `df = adata.to_df().join(adata.obs)`, and they are compatible with any foreseeable downstream analysis task.

[`stylia`]: https://github.com/ersilia-os/stylia.git

### An example dataset: Alzheimer study

AlphaTools is designed to perform two main functions: first, to provide a unified interface between search engine outputs and downstream processing; second, to provide downstream proteomics workflows that are entirely compatible with transcriptomics and the AnnData framework. Below we step through an `alphatools` example using a dataset published by Bader et al. [2], who measured cerebrospinal fluid proteomes in order to discover biomarkers for Alzheimer's disease.

[2]: Bader, Jakob M., et al. "Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease." Molecular systems biology 16.6 (2020): e9356.

In [None]:
%load_ext autoreload
%autoreload 2

import tempfile

import pandas as pd
from alphabase.tools.data_downloader import DataShareDownloader

from alphatools.io.anndata_factory import AnnDataFactory
from alphatools.pp.data import add_metadata



### 1. Prepare the dataset using `alphatools` loaders and AnnData factory. 

The full output of a DIA-NN search is saved as a report file of precursors, from which precursor- or protein-level data can be extracted. This table is unfiltered, so false-discovery control for the first and second search passes must be applied. `alphatools` handles this filtering with its AnnData factory class, which is based on `alphabase`. The resulting AnnData object contains precursor, protein-group or gene quantities and any number of feature-metadata columns (for example, protein groups may carry genes as a secondary annotation, and precursors may carry protein groups and genes).
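
To make the filtering step concrete, here is a rough pandas-only sketch of what going from a long precursor report to a samples-by-proteins matrix can look like. It is not the `AnnDataFactory` implementation: the toy values, the 1% cutoff, and the choice to filter on the DIA-NN-style `Q.Value` and `PG.Q.Value` columns are all assumptions made for illustration.

```python
import pandas as pd

# Toy precursor report with DIA-NN-style column names (all values are made up)
report = pd.DataFrame(
    {
        "Run": ["s1", "s1", "s2", "s2", "s2"],
        "Protein.Group": ["P1", "P2", "P1", "P2", "P3"],
        "Q.Value": [0.001, 0.005, 0.002, 0.200, 0.004],
        "PG.Q.Value": [0.001, 0.002, 0.001, 0.003, 0.300],
        "PG.MaxLFQ": [100.0, 50.0, 120.0, 55.0, 10.0],
    }
)

FDR_CUTOFF = 0.01  # assumed threshold, chosen for illustration only

# Keep rows that pass both the precursor-level and the protein-group-level cutoff
passed = report[
    (report["Q.Value"] <= FDR_CUTOFF) & (report["PG.Q.Value"] <= FDR_CUTOFF)
]

# Reshape to a wide matrix: one row per sample, one column per protein group
matrix = passed.pivot_table(
    index="Run", columns="Protein.Group", values="PG.MaxLFQ", aggfunc="first"
)
```

Entries that fail the cutoff in a given run end up as missing values in `matrix`, and protein groups that never pass are absent altogether.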

In [None]:
# Download the dataset
url = "https://datashare.biochem.mpg.de/s/Wa9jAJvOJO35D5e/download?path=%2Fpg_matrix%2FBaderEtAl2020%2Fdiann&files=report.parquet"
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()
    factory = AnnDataFactory.from_files(file_paths=file_path, reader_type="diann")

# Create the AnnData object, where the row index corresponds to samples and the column names correspond to proteins
adata = factory.create_anndata()

# Use the builtin dataframe conversion to get a quick overview of the data
display(adata.to_df().iloc[:5, :5])

/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpv1v02i58/report.parquet does not yet exist


100% |########################################################################|


/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpv1v02i58/report.parquet successfully downloaded (91.135817527771 MB)




### 2. Integrate the study metadata

The AnnData format provides a solution to a key problem encountered in every data-analysis project: how to keep a matrix of numerical values permanently and safely aligned with its column and row annotations. This is notably difficult with dataframes, as multilevel column indices are cumbersome and non-numeric columns cause problems with downstream analysis methods that expect numerical features. The `add_metadata` function ensures alignment of observations and variables from the start.

☝️ Importantly, while the original AnnData implementation only enforces shape compatibility, `alphatools.pp.data.add_metadata()` enforces matching indices. This means that even if the initial data and the incoming metadata are in different orders, the proteomic and metadata information for a given sample is always matched correctly.
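
The difference between shape compatibility and index matching can be illustrated with plain pandas. This is only a sketch of the idea, not the `add_metadata` implementation; the sample names and group labels are invented.

```python
import numpy as np
import pandas as pd

# Toy intensity matrix: 3 samples x 2 proteins (made-up values)
data = pd.DataFrame(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    index=["s1", "s2", "s3"],
    columns=["P1", "P2"],
)

# Metadata in a different order, with one extra sample that is not in the data
metadata = pd.DataFrame(
    {"group": ["control", "AD", "AD", "control"]},
    index=["s3", "s1", "s4", "s2"],
)

# Positional matching has the right length but pairs samples with the wrong labels
positional = list(metadata["group"].iloc[: len(data)])

# Index-based matching looks each sample up by name, regardless of row order
aligned = metadata.reindex(data.index)
```

In this toy case `positional` would label `s1` as a control, whereas `aligned` correctly assigns it to the AD group and ignores the unmatched sample `s4`.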

In [None]:
# Download the metadata
url = "https://datashare.biochem.mpg.de/s/Wa9jAJvOJO35D5e/download?path=%2Fpg_matrix%2FBaderEtAl2020%2Fdiann&files=annotation%20of%20samples_AM1.5.11.xlsx"
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()
    metadata = pd.read_excel(file_path).dropna(subset=["sample name"])

/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpnf_b60uy/annotation%20of%20samples_AM1.5.11.xlsx does not yet exist
/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpnf_b60uy/annotation%20of%20samples_AM1.5.11.xlsx successfully downloaded (0.028104782104492188 MB)


100% |########################################################################|


In [None]:
# Basic cleaning of the metadata prior to merging
metadata["sample name"] = metadata["sample name"].str.replace(".raw.PG.Quantity", "", regex=False)
metadata = metadata.set_index("sample name", drop=False)

# The metadata contains information for more samples than are in our data
print(f"AnnData shape: {adata.shape}")
print(f"Metadata shape: {metadata.shape}")

# Match the metadata to the AnnData object
print(f"Anndata shape prior to matching: {adata.shape}")
adata = add_metadata(
    adata=adata,  # The AnnData object we want to add metadata to. Its obs index should match the index of the metadata
    incoming_metadata=metadata,  # The metadata dataframe we want to add. Its index should match the index of adata.obs
    axis=0,  # This means that we add metadata to the rows (0) and not columns (1)
    keep_data_shape=False,  # This means that we will drop any samples for which there is no corresponding row in the metadata
)
print(f"Anndata shape after matching: {adata.shape}")

AnnData shape: (61, 2162)
Metadata shape: (210, 14)
Anndata shape prior to matching: (61, 2162)
Anndata shape after matching: (61, 2162)
