# Getting Started

In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and (b) two aligned embeddings of _different_ data.

The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents an individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression.

In [None]:
import pandas as pd
from cev.widgets import Embedding, EmbeddingComparisonWidget

This notebook requires downloading three embeddings derived from the data of [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w):
- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/)
- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformed with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)
- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformed with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)

All three embeddings are annotated with [Ozette's FAUST method](https://doi.org/10.1016/j.patter.2021.100372).

In [None]:
# download the data
!curl -sL https://figshare.com/ndownloader/articles/23063615/versions/1 -o data.zip
!unzip data.zip -d data

## Comparing Two Embeddings of the Same Data

In the first example, we are going to use `cev` to compare two different embedding methods that were run on the very same data (the tissue sample): standard UMAP and annotation-transformed UMAP.

Different embedding methods can produce very different embedding spaces, and it's often hard to assess the difference holistically. `cev` enables us to quantify two properties based on shared point labels:

1. Confusion: the degree to which two or more labels are visually intermixed
2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings

Visualized as a heatmap, these two properties can quickly guide us to point clusters that are better or less well resolved in either of the two embeddings. They can also help us find compositional changes between the two embeddings.
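To build some intuition for what "visually intermixed" means, here is a minimal sketch of a toy confusion score. It is not `cev`'s actual implementation: the function `toy_confusion`, the brute-force nearest-neighbor search, and the tiny 2D point sets are all assumptions made purely for illustration. For each label, it averages the fraction of each point's `k` nearest neighbors that carry a different label.

```python
from collections import defaultdict
from math import dist


def toy_confusion(points, labels, k=3):
    """Per label: mean fraction of each point's k nearest neighbors with a different label.

    Hypothetical helper for illustration only; cev's confusion metric may differ.
    """
    per_label = defaultdict(list)
    for i, p in enumerate(points):
        # Brute-force k nearest neighbors (fine for a toy example only)
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        foreign = sum(labels[j] != labels[i] for j in nearest)
        per_label[labels[i]].append(foreign / k)
    return {label: sum(v) / len(v) for label, v in per_label.items()}


# Toy data: two well-separated clusters vs. the same labels interleaved on a line
labels = ["A", "A", "A", "B", "B", "B"]
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
intermixed = [(0, 0), (2, 0), (4, 0), (1, 0), (3, 0), (5, 0)]

print(toy_confusion(separated, labels, k=2))
print(toy_confusion(intermixed, labels, k=2))
```

In the separated layout every neighbor shares its point's label, so both scores are zero. In the interleaved layout most neighbors carry the other label, so both scores are high.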

In [None]:
tissue_umap_embedding = Embedding.from_ozette(df=pd.read_parquet("./data/mair-2022-tissue-138-umap.pq"))
tissue_ozette_embedding = Embedding.from_ozette(df=pd.read_parquet("./data/mair-2022-tissue-138-ozette.pq"))

In [None]:
umap_vs_ozette = EmbeddingComparisonWidget(
    tissue_umap_embedding,
    tissue_ozette_embedding,
    titles=["Standard UMAP (Tissue)", "Annotation-Transformed UMAP (Tissue)"],
    metric="confusion",
    selection="synced",
    auto_zoom=True,
    row_height=320,
)
umap_vs_ozette

In this example, we can see that the point labels are much more intermixed in the standard UMAP embedding than in the annotation-transformed UMAP embedding. This is not surprising, as the standard UMAP embedding is not optimized for flow cytometry data in any way and thus only resolves broad cell phenotypes based on a few markers. You can see this by holding down `SHIFT` and clicking on `CD8` under _Markers_, which reduces the label resolution and shows that, at a reduced label resolution, the confusion is much lower in the standard UMAP embedding.
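The effect of reducing the label resolution can be illustrated with a small sketch. It assumes that labels are strings of marker names, each followed by `+` or `-` (the form passed to `select()` elsewhere in this notebook). The helpers `parse_label` and `reduce_label` are hypothetical and not part of `cev`.

```python
import re


def parse_label(label):
    """Split a label such as 'CD4-CD8+' into [('CD4', '-'), ('CD8', '+')]."""
    return re.findall(r"([^+-]+)([+-])", label)


def reduce_label(label, active_markers):
    """Keep only the markers in `active_markers`, preserving their original order."""
    return "".join(
        marker + sign
        for marker, sign in parse_label(label)
        if marker in active_markers
    )


# Two labels that differ only in CD45RA collapse to the same coarse label
# once only CD4 and CD8 are active.
label_a = "CD4-CD8+CD3+CD45RA+"
label_b = "CD4-CD8+CD3+CD45RA-"
active = {"CD4", "CD8"}

print(reduce_label(label_a, active))
print(reduce_label(label_b, active))
```

With fewer active markers, many fine-grained labels merge into a handful of coarse ones, which is why clusters that looked intermixed at full resolution can appear well separated at a lower one.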

When selecting _Neighborhood_ from the _Metric_ dropdown menu, we switch to the neighborhood composition difference quantification. When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhoods remain unchanged. When we gradually add more markers, we can see how the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.
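As a rough illustration of what a neighborhood composition difference could look like, the sketch below collects the set of other labels found near a given label in each embedding and compares the two sets with a Jaccard distance. This is an assumption-laden toy, not `cev`'s actual metric: `label_neighborhood`, `neighborhood_change`, and the choice of Jaccard distance are all hypothetical.

```python
from math import dist


def label_neighborhood(points, labels, target, k=2):
    """Set of other labels among the k nearest neighbors of points labeled `target`."""
    found = set()
    for i, p in enumerate(points):
        if labels[i] != target:
            continue
        # Brute-force nearest neighbors (toy example only)
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        found.update(labels[j] for j in nearest if labels[j] != target)
    return found


def neighborhood_change(points_a, points_b, labels, target, k=2):
    """Jaccard distance between the neighborhoods of `target` in two embeddings.

    0.0 means the neighboring labels are identical; 1.0 means they are disjoint.
    """
    a = label_neighborhood(points_a, labels, target, k)
    b = label_neighborhood(points_b, labels, target, k)
    union = a | b
    return 0.0 if not union else 1 - len(a & b) / len(union)


# Toy data: label A sits next to B in the first embedding and next to C in the second
labels = ["A", "B", "C"]
embedding_a = [(0, 0), (1, 0), (5, 0)]
embedding_b = [(0, 0), (5, 0), (1, 0)]

print(neighborhood_change(embedding_a, embedding_b, labels, "A", k=1))
print(neighborhood_change(embedding_a, embedding_a, labels, "A", k=1))
```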

To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection, or you can programmatically select points by their label via the `select()` method. For instance, the next call will select all CD4+ T cells.

In [None]:
umap_vs_ozette.select(['CD3+', 'CD4+', 'CD8-'])

## Abundance Differences Between _Tissue_ and _Tumor_

Instead of comparing two embeddings of identical data, let's take a look at two transformed and aligned embeddings of different data: tissue vs. tumor. Both embeddings are annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric helps identify potential shifts in phenotype abundance, providing a visually intuitive way to analyze complex cytometry data. Remember, our metric should be used as an exploratory tool to guide exploration and quickly surface potentially interesting phenotypes, but robust statistical methods must be applied to confirm whether any abundance differences exist.
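Because the two samples contain different numbers of cells, abundance has to be compared in relative terms. The following minimal sketch shows the idea under simple assumptions: the helpers `relative_abundance` and `abundance_difference` and the toy labels `P1` and `P2` are hypothetical, and `cev`'s abundance metric may be computed differently.

```python
from collections import Counter


def relative_abundance(labels):
    """Fraction of cells per phenotype label within one sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def abundance_difference(labels_a, labels_b):
    """Relative abundance in sample B minus sample A, per label."""
    a = relative_abundance(labels_a)
    b = relative_abundance(labels_b)
    return {
        label: b.get(label, 0.0) - a.get(label, 0.0)
        for label in sorted(set(a) | set(b))
    }


# Toy samples of different sizes (100 vs. 200 cells)
sample_a = ["P1"] * 30 + ["P2"] * 70
sample_b = ["P1"] * 20 + ["P2"] * 180

print(abundance_difference(sample_a, sample_b))
```

Here `P1` drops from 30% to 10% of all cells even though the raw counts are fairly similar. This is the kind of shift an exploratory abundance view is meant to surface before it is tested statistically.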

In [None]:
tumor_ozette_embedding = Embedding.from_ozette(df=pd.read_parquet("./data/mair-2022-tumor-006-ozette.pq"))

In [None]:
tissue_vs_tumor = EmbeddingComparisonWidget(
    tissue_ozette_embedding,
    tumor_ozette_embedding,
    titles=["Tissue", "Tumor"],
    metric="abundance",
    selection="phenotype",
    auto_zoom=True,
    row_height=320,
)

tissue_vs_tumor

The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) than in `tumor` (i.e., the relative abundance is lower on the right):

In [None]:
tissue_vs_tumor.select("CD4-CD8+CD3+CD45RA+CD27+CD19-CD103-CD28-CD69+PD1+HLADR-GranzymeB-CD25-ICOS-TCRgd-CD38-CD127-Tim3-")