# Example `alphatools` workflow with proteomics data

This notebook demonstrates core `alphatools` functionality for proteomics data loading, preprocessing and visualization. 

The functionality is intended to stay as close to pure Python as possible and to avoid closed end-to-end implementations. This is reflected in several design choices:

1. `AnnData` is used instead of a custom data class to enable interoperability with any other tool from the Scverse.
2. matplotlib *Axes* and *Figure* instances are used for visualization, giving the user full autonomy to layer on custom visualizations with seaborn, matplotlib, or any other compatible visualization package.
3. Statistical and preprocessing functions are standalone and come with strong defaults, so any function can also be used outside of the `alphatools` context.
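
Point 2 can be illustrated with plain matplotlib. The sketch below does not use the `alphatools` API: `plot_intensities` is a hypothetical helper standing in for any plotting function that draws onto a given *Axes*, and the values are made up. It only shows why working with *Axes* instances makes it easy to layer additional elements on top.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt


def plot_intensities(ax, values):
    """Hypothetical plotting helper: draws a histogram onto the given Axes."""
    ax.hist(values, bins=5)
    ax.set_xlabel("intensity")
    return ax


fig, axes = plt.subplots(1, 2, figsize=(6, 3))

# Step through the subplots and populate each one in turn.
for ax, values in zip(axes.flat, [[1, 2, 2, 3, 5], [2, 4, 4, 4, 6]]):
    plot_intensities(ax, values)
    # The helper works on a plain Axes, so any matplotlib call can be layered on top.
    ax.axvline(sum(values) / len(values), linestyle="--", color="black")

n_axes = len(fig.axes)
n_lines_per_axis = [len(ax.lines) for ax in fig.axes]
```

The same pattern should apply to seaborn, whose axes-level functions accept an `ax` argument.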

### Design choices of `alphatools`:
- **Data handling**: `AnnData` was chosen as the data container for two main reasons. First, it is a lightweight yet powerful solution to a fundamental challenge with dataframes: keeping numerical data and metadata aligned at all times. With plain dataframes, the options are either to include non-numeric metadata columns in the dataframe, which complicates data operations, or to add cumbersome multi-level indices. Second, it is compatible with the Scverse, Scanpy and all associated tools, which essentially removes the barrier between proteomics and transcriptomics data analysis and enables multi-omics analyses.
- **Plotting**: Inspired by the [`stylia`] package, `alphatools` applies one consistent, aesthetically pleasing design to all plots. A core component of this implementation is that `create_figure` returns subplots as an iterable data structure: once the basic layout of a plot is decided, users simply step from one subplot to the next and populate each one with figure elements.
- **Standardization**: A key consideration of this package is the loading of proteomics data, the biggest pain point of which is the nonstandard output of the various proteomics search engines. By building on `alphabase`, we handle this complexity early and provide the user with AnnData objects containing either proteins or precursors. These objects can be converted to dataframes that include the metadata with almost no friction by running `df = adata.to_df().join(adata.obs)`, and they remain compatible with any foreseeable downstream analysis task.

[`stylia`]: https://github.com/ersilia-os/stylia.git

### An example dataset: Alzheimer study

We demonstrate `alphatools` functionality on a published dataset by Bader et al. [2], who measured cerebrospinal fluid proteomes in order to discover biomarkers for Alzheimer's disease.

[2]: Bader, Jakob M., et al. "Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease." Molecular Systems Biology 16.6 (2020): e9356.

In [None]:
import tempfile

from alphabase.tools.data_downloader import DataShareDownloader

from alphatools.io.anndata_factory import AnnDataFactory



Download the dataset and load it using the `alphabase` downloader and the `alphatools` AnnData factory.

In [None]:
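# Download the dataset into a temporary directory and load it while the file still exists.
# Note: the directory and its contents are deleted when the `with` block exits,
# so `file_path` is only valid inside the block.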
url = "https://datashare.biochem.mpg.de/s/Wa9jAJvOJO35D5e/download?path=%2Fpg_matrix%2FBaderEtAl2020%2Fdiann&files=report.parquet"
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()
    factory = AnnDataFactory.from_files(file_paths=file_path, reader_type="diann")

/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpu8qgu1uk/report.parquet does not yet exist


100% |########################################################################|


/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpu8qgu1uk/report.parquet successfully downloaded (91.135817527771 MB)


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 7: invalid start byte

In [None]:
file_path

'/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpvtfripcn/report.parquet'

In [None]:
# Load the DIA-NN report from file (`file_path` must still exist on disk)
factory = AnnDataFactory.from_files(file_paths=file_path, reader_type="diann")

FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpvtfripcn/report.parquet'
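
The `FileNotFoundError` above is consistent with how `tempfile.TemporaryDirectory` behaves: the directory and everything in it are removed as soon as the `with` block exits, so a path created inside the block is no longer valid in later cells. The sketch below reproduces this with the standard library only, using a placeholder file instead of the actual download. Using `tempfile.mkdtemp` is one possible workaround and an assumption on our part, not necessarily the intended fix.

```python
import os
import tempfile

# Path created inside the context manager: valid only until the block exits.
with tempfile.TemporaryDirectory() as temp_dir:
    scoped_path = os.path.join(temp_dir, "report.parquet")
    with open(scoped_path, "wb") as handle:
        handle.write(b"placeholder")  # stand-in for the downloaded file
    exists_inside = os.path.exists(scoped_path)

# The directory has been cleaned up, so the path no longer exists.
exists_after = os.path.exists(scoped_path)

# Possible workaround: a directory that is not removed automatically.
persistent_dir = tempfile.mkdtemp()
persistent_path = os.path.join(persistent_dir, "report.parquet")
with open(persistent_path, "wb") as handle:
    handle.write(b"placeholder")
exists_persistent = os.path.exists(persistent_path)

print(exists_inside, exists_after, exists_persistent)  # True False True
```

A directory created with `mkdtemp` has to be deleted manually, for example with `shutil.rmtree(persistent_dir)`, once it is no longer needed. Alternatively, all work that needs the file can stay inside the `with` block.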