## 01: Showcase basic data loading and manipulation

In [105]:
%load_ext autoreload
%autoreload 2

import logging

import pandas as pd

from alphatools.io.anndata_factory import AnnDataFactory
from alphatools.pp import add_metadata, transform

logging.basicConfig(level=logging.INFO)



## 1. Read a dataset:

DIA-NN returns precursor report tables (.tsv or .parquet format). Typically, we load this table, filter it on precursor and protein FDR, and pivot it to obtain protein-group intensities. The pivot collapses the multiple precursor rows that belong to one protein group into a single value per sample, producing the familiar sample x protein matrix we can work with.
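To make the filter-and-pivot idea concrete, here is a minimal pandas sketch on a toy table. It is not what AnnDataFactory does internally. The columns 'Run', 'Protein.Group' and 'PG.MaxLFQ' are taken from the cell below; 'Precursor.Id', 'Q.Value', 'PG.Q.Value' and the 1 % FDR cutoff are assumptions made for illustration.

```python
import pandas as pd

# Toy precursor report: two runs, two protein groups (all values are made up).
report = pd.DataFrame(
    {
        "Run": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "Protein.Group": ["P1", "P1", "P2", "P1", "P1", "P2"],
        "Precursor.Id": ["AAK2", "CCR2", "DDK3", "AAK2", "CCR2", "DDK3"],
        "Q.Value": [0.001, 0.002, 0.2, 0.001, 0.003, 0.004],
        "PG.Q.Value": [0.001, 0.001, 0.001, 0.001, 0.001, 0.001],
        "PG.MaxLFQ": [100.0, 100.0, 50.0, 120.0, 120.0, 60.0],
    }
)

# Assumed FDR cutoff of 1 % on precursor and protein-group level.
fdr_cutoff = 0.01
passed = report[(report["Q.Value"] <= fdr_cutoff) & (report["PG.Q.Value"] <= fdr_cutoff)]

# PG.MaxLFQ is repeated on every precursor row of a protein group within a run,
# so taking the first value per (Run, Protein.Group) collapses the rows.
matrix = passed.pivot_table(
    index="Run", columns="Protein.Group", values="PG.MaxLFQ", aggfunc="first"
)
print(matrix)
```

The only precursor of P2 in run s1 fails the precursor cutoff, so that cell of the matrix ends up missing. This is how missing values typically arise in the sample x protein matrix.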

In [106]:
# Adjust this path to wherever the file is located. This is a randomized version of a plasma report
# with spiked-in outliers to showcase the loading and plotting functions.
report_path = "../data/report_random_scrambled.tsv"

# The factory instance takes care of loading and filtering the data
factory = AnnDataFactory.from_files(
    file_paths=report_path,
    reader_type="diann",
    raw_name_column="Run",
    protein_id_column="Protein.Group",
    intensity_column="PG.MaxLFQ",
)

# Pivot the table in order to get Protein.Group intensities
adata = factory.create_anndata()



## 2. Inspect the AnnData object

AnnData objects work like DataFrames, but better. When working with plain DataFrames, one runs into the same issues over and over again:

1. Some columns are numeric, while others are annotation (metadata) that we have to exclude every time we run t-tests, visualizations, etc. If we move them to a separate DataFrame instead, we constantly have to keep the 'metadata' DataFrame and the 'data' DataFrame aligned.
2. Indices can be multi-level, but a MultiIndex easily causes confusion, and accessing its individual levels is cumbersome.
3. Multiple DataFrames that are just same-shaped transformations of each other (for example raw, log-transformed, and normalized data) clutter the workspace and are difficult to manage.

AnnData, the data format used by Scanpy and now the standard of the scverse ecosystem, solves these issues and more. Simplified for our purposes, an AnnData object has a central data array ('X'), flanked by one DataFrame for row-wise annotation ('obs') and one for column-wise annotation ('var'). These annotation DataFrames link to the main array via their respective indices [*] and are easy to access. Let's explore the AnnData object we just created: it should contain protein group abundance values, with samples as rows and proteins as columns. It also contains sample metadata ('obs') and feature metadata ('var'), which do not interfere with whatever numerical analyses we run on the values.

[*] It is possible to attach non-matching metadata to a conventional AnnData object by assigning 'obs' or 'var' directly. AlphaTools' 'add_metadata' function removes this risk and ensures that everything stays correctly aligned.
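To illustrate the layout described above, the following sketch writes the three parts out by hand with numpy and pandas. It does not use the AnnData class itself, and all sample, protein and gene names are made up. The point is the bookkeeping AnnData does for us: row i of 'obs' describes row i of 'X', and row j of 'var' describes column j of 'X'.

```python
import numpy as np
import pandas as pd

# Toy data: 3 samples (rows) x 2 proteins (columns); all names are hypothetical.
X = np.array([[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]])
obs = pd.DataFrame(
    {"treatment": ["control", "treatment", "control"]},
    index=["sample_0", "sample_1", "sample_2"],
)
var = pd.DataFrame({"Genes": ["GENE_A", "GENE_B"]}, index=["P1", "P2"])

# The alignment that has to hold at all times.
assert X.shape == (len(obs), len(var))

# Select control samples through obs and one protein through var.
control_mask = (obs["treatment"] == "control").to_numpy()
protein_idx = var.index.get_loc("P2")
control_values = X[control_mask, protein_idx]
print(control_values)
```

With three separate objects, every filtering step has to be applied to the array and to the matching annotation table by hand. An AnnData object bundles them so that slicing keeps all parts in sync.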

In [107]:
# Check out the anndata object
print(adata)
print("\n---\n")

# Let's see the protein values
print("Some protein values from the anndata object")
print(
    "Note that this is a numpy array, so there are no column or row indices: \nThis information is stored in the obs and var dataframes, and their alignment to the data matrix.\n"
)
print(adata.X[:5, :5])
print("\n---\n")

# Let's check out the obs dataframe
print("Obs dataframe, accessed by adata.obs")
display(adata.obs.head())
print("Not much here yet, but this is what we will use to match metadata to the data matrix.")
print("\n---\n")

# Let's check out the var dataframe
print("Var dataframe, accessed by adata.var")
display(adata.var.head())
print("Same as above, think of this as the column names of the data matrix, which we can use to add more metadata.")

AnnData object with n_obs × n_vars = 111 × 2112

---

Some protein values from the anndata object
Note that this is a numpy array, so there are no column or row indices: 
This information is stored in the obs and var dataframes, and their alignment to the data matrix.

[[4.65202e+03 7.25722e+03 2.44311e+04 1.50955e+04 2.64588e+05]
 [7.58200e+03 4.69074e+03 1.01745e+04 5.48231e+02 2.65509e+05]
 [8.29987e+03 1.91775e+03 6.94765e+03 3.89882e+03 4.61355e+05]
 [4.44628e+03 7.36536e+03 1.43340e+04 4.42771e+03 1.36799e+06]
 [9.46845e+03 8.76362e+03 8.00786e+03 1.46423e+04 1.62578e+05]]

---

Obs dataframe, accessed by adata.obs


sample_0
sample_1
sample_10
sample_100
sample_101


Not much here yet, but this is what we will use to match metadata to the data matrix.

---

Var dataframe, accessed by adata.var


A0A024QZX5
A0A024R0K5
A0A024R1R8;Q9Y2S6
A0A024R6N5;A0A0G2JRN3
A0A075B6H7


Same as above, think of this as the column names of the data matrix, which we can use to add more metadata.


## Adding metadata to AnnData objects

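Adding metadata takes two steps: load the metadata as DataFrames indexed by sample name or protein group, then attach them with 'add_metadata'.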

In [108]:
sample_metadata = pd.read_csv("../data/report_random_scrambled_sample_metadata.tsv", sep="\t", index_col=0)

print("Sample metadata, accessed by sample_metadata")
display(sample_metadata.head())
print("This is the metadata we will use to add to the obs dataframe.")
print("\n---\n")

feature_metadata = pd.read_csv("../data/report_random_scrambled_feature_metadata.tsv", sep="\t", index_col=0)

print("Feature metadata, accessed by feature_metadata")
display(feature_metadata.head())
print("This is the metadata we will use to add to the var dataframe.")
print("\n---\n")

Sample metadata, accessed by sample_metadata


| Run | treatment |
|---|---|
| sample_0 | control |
| sample_1 | treatment |
| sample_2 | treatment |
| sample_3 | control |
| sample_4 | control |


This is the metadata we will use to add to the obs dataframe.

---

Feature metadata, accessed by feature_metadata


| Protein.Group | Genes |
|---|---|
| P36578 | RPL4 |
| A6NIH7 | UNC119B |
| P05154 | SERPINA5 |
| Q9Y490;Q9Y490-2 | TLN1 |
| P13497-5 | BMP1 |


This is the metadata we will use to add to the var dataframe.

---



In [109]:
# Let's add the sample metadata to the obs dataframe
adata = add_metadata(adata=adata, incoming_metadata=sample_metadata, axis=0)  # Mind the axis argument: 0 is for rows
print("Obs dataframe after adding sample metadata with add_metadata()")
display(adata.obs.head())
print("Note that matching happened on the index. If the indices had not matched, the rows would be NAN-rows.")
print(
    "This is different from just adding obs columns, which would also work with nonmatching indices as long as the lengths match."
)
print("\n---\n")

# Now let's add the feature metadata
adata = add_metadata(
    adata=adata, incoming_metadata=feature_metadata, axis=1
)  # Mind the axis argument: 1 is for columns
print("Var dataframe after adding feature metadata with add_metadata()")
display(adata.var.head())
print(
    "Note that even though this annotates the protein columns, the 'var' index is still row-based: whether we are dealing with obs or var, \nthe matching is always row-wise."
)

Obs dataframe after adding sample metadata with add_metadata()


| raw_name | treatment |
|---|---|
| sample_0 | control |
| sample_1 | treatment |
| sample_10 | control |
| sample_100 | treatment |
| sample_101 | treatment |


Note that matching happened on the index. If the indices had not matched, the rows would be NAN-rows.
This is different from just adding obs columns, which would also work with nonmatching indices as long as the lengths match.

---

Var dataframe after adding feature metadata with add_metadata()


| proteins | Genes |
|---|---|
| A0A024QZX5 | SERPINB6 |
| A0A024R0K5 | CEACAM5 |
| A0A024R1R8;Q9Y2S6 | TMA7;TMA7B |
| A0A024R6N5;A0A0G2JRN3 | SERPINA1 |
| A0A075B6H7 | IGKV3-7 |


Note that even though this annotates the protein columns, the 'var' index is still row-based: whether we are dealing with obs or var, 
the matching is always row-wise.
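The difference between index-based matching and plain column assignment can be sketched in pandas alone. This is an illustration of the idea on made-up data, not the implementation of 'add_metadata'.

```python
import pandas as pd

# obs index as it comes out of the data matrix (note: sorted as strings).
obs = pd.DataFrame(index=pd.Index(["sample_0", "sample_1", "sample_10"], name="raw_name"))

# Incoming metadata in a different order, with one sample that is not in obs.
incoming = pd.DataFrame(
    {"treatment": ["control", "treatment", "treatment"]},
    index=pd.Index(["sample_0", "sample_1", "sample_2"], name="Run"),
)

# Index-based matching: a left join keeps the rows of obs and fills
# samples without metadata with NaN.
aligned = obs.join(incoming, how="left")

# Positional assignment: works whenever the lengths match, but silently
# gives sample_10 the label that belongs to sample_2.
positional = obs.copy()
positional["treatment"] = incoming["treatment"].to_numpy()

print(aligned)
print(positional)
```

A NaN row in 'aligned' is an honest signal that metadata is missing for a sample, whereas the positional result looks complete but is wrong.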


## Preliminary summary:

It might not seem like much has happened, but this takes a huge burden off our shoulders:

1. Thanks to AnnData, we don't have to worry about aligning DataFrame indices or handling multi-level indices.
2. Thanks to the AlphaTools 'add_metadata' function, we can be certain that everything stays aligned when we add metadata later. We could go ahead and add more metadata DataFrames with matching indices.

## Log-transformation

Log-transformation seems trivial, but there are edge cases: what do we do with zeros, negative values, or -Inf/Inf values? AlphaTools' nanlog() function handles these by replacing values that have no valid logarithm with NaN:

In [110]:
from alphatools.pp.transform import nanlog

adata = nanlog(adata, log=2)

# nanlog also works on DataFrames, Series and numpy arrays
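For intuition, here is a minimal numpy sketch of a NaN-safe log2. It is not the implementation of nanlog(); in particular, treating +Inf as invalid is an assumption of this sketch, and the function name 'nan_safe_log2' is hypothetical.

```python
import numpy as np


def nan_safe_log2(values):
    """Return log2 of values, with NaN wherever the input is not a finite positive number."""
    arr = np.asarray(values, dtype=float)
    # Valid inputs are finite and strictly positive; zeros, negatives, NaN and +/-Inf are not.
    valid = np.isfinite(arr) & (arr > 0)
    out = np.full(arr.shape, np.nan)
    out[valid] = np.log2(arr[valid])
    return out


result = nan_safe_log2([8.0, 0.0, -1.0, np.inf, np.nan, 1.0])
print(result)
```

Computing the logarithm only on the valid entries also avoids the runtime warnings that np.log2 would raise for zeros and negative numbers.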

## Filtering for data completeness

Sometimes we only want to keep protein features that have a certain fraction of valid values, for example features with non-NA values in at least 30 % of all samples. AlphaTools' completeness filter enforces this, and feature annotations are filtered accordingly. Note that the cell below uses a stricter setting: judging by the parameter name, 'max_missing=0.1' allows at most 10 % missing values per feature.

In [116]:
df = adata.to_df()
df

| raw_name | A0A024QZX5 | A0A024R0K5 | A0A024R1R8;Q9Y2S6 | A0A024R6N5;A0A0G2JRN3 | A0A075B6H7 | A0A075B6I0 | A0A075B6I4 | A0A075B6I9 | A0A075B6J1 | A0A075B6J9 | ... | Q9Y6I9 | Q9Y6M5 | Q9Y6R7 | Q9Y6W3 | Q9Y6W5 | Q9Y6Y9 | Q9Y6Z7 | S4R471 | V9GYG9 | V9GYJ8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample_0 | 12.183642 | 12.825201 | 14.576431 | 13.881831 | 18.013388 | 18.663475 | 14.357490 | 10.359804 | NaN | 15.836679 | ... | NaN | 9.258660 | 11.506997 | 10.324203 | NaN | 12.192527 | 15.705438 | NaN | 12.193522 | 11.026288 |
| sample_1 | 12.888363 | 12.195600 | 13.312670 | 9.098640 | 18.018401 | 17.245172 | 12.401997 | NaN | NaN | 14.439532 | ... | NaN | 11.406752 | NaN | 11.788975 | NaN | 11.033616 | 15.130209 | 14.336151 | 14.298270 | 12.376494 |
| sample_10 | 13.018873 | 10.905199 | 12.762309 | 11.928822 | 18.815518 | 15.833273 | 14.405786 | 12.021129 | 12.418606 | 16.226442 | ... | 10.417578 | NaN | NaN | 11.391415 | NaN | 11.616664 | 14.990950 | NaN | 12.214966 | 10.727265 |
| sample_100 | 12.118383 | 12.846540 | 13.807154 | 12.112345 | 20.383626 | 18.287983 | 13.501962 | 12.208859 | NaN | 16.483860 | ... | NaN | NaN | NaN | 10.976070 | NaN | 11.554728 | 14.989142 | 12.521946 | 10.739747 | 11.781925 |
| sample_101 | 13.208913 | 13.097311 | 12.967201 | 13.837855 | 17.310773 | 17.928131 | 11.702480 | 10.376690 | 12.186981 | 16.909401 | ... | NaN | NaN | NaN | 12.561794 | 10.000254 | 11.653947 | 15.265245 | NaN | NaN | 11.156854 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sample_95 | 13.504024 | 13.083988 | 13.471358 | 17.093696 | 17.846525 | 15.454650 | 13.098104 | 10.884109 | NaN | 14.658390 | ... | NaN | NaN | NaN | 13.417404 | NaN | 11.746380 | 14.919096 | NaN | 12.427776 | 11.372691 |
| sample_96 | 13.381381 | 11.627497 | 15.321534 | 11.940358 | 19.339463 | 17.762376 | 12.379246 | 12.229741 | 12.999982 | 16.454759 | ... | 10.794725 | 11.705507 | NaN | 11.814398 | 9.366174 | 10.968969 | 15.672986 | NaN | 12.092751 | 10.714177 |
| sample_97 | 12.981594 | 11.859647 | 12.327974 | 10.283308 | 18.140485 | 15.871236 | 13.416349 | 10.764598 | NaN | 15.398209 | ... | NaN | 10.079245 | 9.166688 | NaN | NaN | 11.927659 | 15.032188 | 14.626428 | 14.297626 | 12.537898 |
| sample_98 | 11.899221 | 11.361450 | 14.428903 | 18.212264 | 17.984936 | 17.270396 | 13.020216 | 12.393648 | NaN | 16.338573 | ... | 12.820203 | 12.463864 | NaN | 11.637168 | NaN | 10.427973 | 14.932824 | 13.170410 | 13.520975 | 11.846156 |


In [124]:
from alphatools.pp.data import filter_data_completeness

print(f"Before filtering, adata has the shape {adata.shape}")
adata_filtered = filter_data_completeness(adata, max_missing=0.1, axis=1)
print(f"After filtering, adata has the shape {adata_filtered.shape}")

Before filtering, adata has the shape (111, 2112)
After filtering, adata has the shape (111, 1032)
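The logic of a completeness filter along the feature axis can be sketched in a few lines of pandas. This is an illustration under assumptions, not the implementation of filter_data_completeness: the function name 'filter_features_by_completeness' is hypothetical, and whether the real threshold is inclusive is not stated in this notebook.

```python
import numpy as np
import pandas as pd


def filter_features_by_completeness(df, max_missing):
    """Keep columns whose fraction of missing values is at most max_missing."""
    missing_fraction = df.isna().mean(axis=0)  # per-column fraction of NaN
    return df.loc[:, missing_fraction <= max_missing]


# Toy matrix: 4 samples x 3 proteins with 0 %, 25 % and 75 % missing values.
toy = pd.DataFrame(
    {
        "P1": [1.0, 2.0, 3.0, 4.0],
        "P2": [1.0, np.nan, 3.0, 4.0],
        "P3": [np.nan, np.nan, np.nan, 4.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

kept = filter_features_by_completeness(toy, max_missing=0.3)
print(kept.columns.tolist())
```

On an AnnData object the same column mask would also have to be applied to 'var', which is the part the AlphaTools filter handles when it filters the feature annotations accordingly.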


In [57]:
# Load a dataset into AnnData

# Explore AnnData

# Apply filters and transformations
# log transform
# filter completeness

# Filter for sample groups that we want to keep

# Plotting
# Explain the plot logic (AxisManager, colors, etc.)
# Scatterplotting + Labels + Legend

In [63]:
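# nanlog accepts several input types; the raw matrix is shown first for comparison
display(adata.X)  # raw intensities
display(transform.nanlog(adata).X)  # AnnData in, AnnData out
display(transform.nanlog(adata.to_df()))  # DataFrame
display(transform.nanlog(adata.to_df()["A0A024R1R8;Q9Y2S6"]))  # Series
display(transform.nanlog(adata.to_df()["A0A024R1R8;Q9Y2S6"].values))  # numpy array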

array([[84788000. , 37541000. ,   975192. , ...,   130693. , 11476200. ,
        24149700. ],
       [86609700. , 37697200. ,   886428. , ...,    91687.5, 11217000. ,
        24096300. ],
       [87261000. , 38341900. ,   965717. , ...,    91642.7, 11154100. ,
        27586300. ],
       [86922600. , 38314100. ,  1139240. , ...,   168758. , 11793000. ,
        25791600. ]])

array([[26.33735676, 25.16196374, 19.89532677, ..., 16.99582235,
        23.45214168, 24.52550193],
       [26.36802528, 25.16795403, 19.75764393, ..., 16.48443744,
        23.41918354, 24.5223083 ],
       [26.37883367, 25.19241849, 19.88124095, ..., 16.48373234,
        23.41107077, 24.71744863],
       [26.37322799, 25.19137208, 20.11964028, ..., 17.36459637,
        23.49142743, 24.62039794]])

| Run | A0A024R1R8;Q9Y2S6 | A0A024R4E5 | A0A024RBG1;Q9NZJ9-2 | A0A024RBG1;Q9NZJ9;Q9NZJ9-2 | A0A024RCL3;A0A0G2JK11;Q96QC4 | A0A075B6E5;Q8N8S7 | A0A087WT44 | A0A087WT44;P30519;P30519-2 | A0A087WUL8;Q6P3W6;P0DPF3;A0A087WVU4;A0A087WZJ2;A0A087WZE1;A0A8V8TMC1;H7BY70;A0A087WTW4;A0A075B6G5 | A0A087WV86;Q6PGQ7;Q6PGQ7-2 | ... | U3KQ91;O60266;A0A0A0MSC1 | U3KQP1;Q9NWL6;C9IYZ1 | U3KQQ1 | V9GY43 | V9GY95;Q8IVH8;Q8IVH8-2;Q8IVH8-3 | V9GYF0 | V9GYM2 | X6REB3 | X6RLX0 | X6RLX0;O15083;H7C4G9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_14 | 26.337357 | 25.161964 | 19.895327 | 20.560146 | 21.218858 | 23.734093 | 17.494496 | 24.633095 | 22.837435 | NaN | ... | 20.279903 | 16.609551 | 16.413573 | NaN | 10.567424 | 22.150192 | 14.973056 | 16.995822 | 23.452142 | 24.525502 |
| 20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_15 | 26.368025 | 25.167954 | 19.757644 | 20.279733 | 21.313716 | 23.771075 | 18.592734 | 24.648801 | 22.452604 | NaN | ... | 20.09294 | 16.865673 | 15.703479 | 19.739327 | 10.40663 | 22.194633 | 15.104255 | 16.484437 | 23.419184 | 24.522308 |
| 20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_16 | 26.378834 | 25.192418 | 19.881241 | 20.27776 | 21.050825 | 23.795348 | NaN | 24.710928 | 23.684167 | NaN | ... | 19.730016 | 16.729422 | 16.482433 | 19.375188 | 11.07224 | 22.126469 | 18.199893 | 16.483732 | 23.411071 | 24.717449 |
| 20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_17 | 26.373228 | 25.191372 | 20.11964 | 20.279778 | 21.202558 | 23.694321 | NaN | 24.623946 | 22.765175 | NaN | ... | 19.82235 | 16.460392 | 16.210875 | NaN | 9.724224 | 22.517685 | 19.098847 | 17.364596 | 23.491427 | 24.620398 |


Run
20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_14    26.337357
20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_15    26.368025
20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_16    26.378834
20240321_OA2_Evo1_21min_TiHe_ADIAMA_HeLa_200ng_F-40_iO_17    26.373228
Name: A0A024R1R8;Q9Y2S6, dtype: float64

array([26.33735676, 26.36802528, 26.37883367, 26.37322799])