# Tutorial Notebook: Analysis workflow for single-cell-resolved interaction data with the Boosting Autoencoder (BAE)

**Workflow tutorial for analyzing interaction patterns of single-cell cell-cell interaction (CCI) data with the [Boosting Autoencoder (BAE)](https://github.com/NiklasBrunn/BoostingAutoencoder).** 


### The workflow is divided into five main steps:

- [Setup](#Setup)
- [Load the gene expression data](#Load-the-gene-expression-data)
- [Construct a CCIM](#Construct-a-CCIM)
- [Pattern analysis of the CCIM with the BAE](#Pattern-analysis-of-the-CCIM-with-the-BAE)
- [Result visualization and plots saving](#Result-visualization-and-plots-saving)


To run the workflow, you can either start with one of the example datasets that can be loaded within the notebook, or load and process your own single-cell RNA sequencing (scRNA-seq) or spatial transcriptomics (ST) data. The [Load the gene expression data](#Load-the-gene-expression-data) section describes how to load the example data for an exemplary analysis and how to load and prepare your own data for the workflow.

*The notebook is written in [Julia v1.9.3](https://julialang.org/downloads/oldreleases/). Since we use NICHES to construct CCIMs, which is implemented using the R programming language, we provide wrapper functions to call NICHES directly from Julia functions in this notebook.*

## Setup:

First, you can activate the Julia environment and load all the packages needed to run the interaction pattern analysis workflow. The first time you run the following cell, all required packages will be downloaded and precompiled, which may take a moment.

In [None]:
#---Activate the environment:
using Pkg;

Pkg.activate("../");
Pkg.instantiate();
Pkg.status()

#---Set the path to the project:
projectpath = joinpath(@__DIR__, "../"); 

#---Load the BoostingAutoEncoder module:
include(joinpath(projectpath, "src", "BAE.jl"));
using .BoostingAutoEncoder

#---Load required packages for this notebook:
using RCall, DelimitedFiles, Plots, Random, StatsBase, VegaLite, DataFrames, StatsPlots, CSV;

Next, you can specify the path to the directory containing the data you want to analyze and where you want to save the results. By default, two folders are created by executing the following code cell, one for loading and saving the analyzed data and one for saving the generated result figures. Alternatively, you can change the paths below.

In [None]:
#---Set paths to the data directory and the figures directory:

# Set the path to the data directory and create the folder if it does not already exist 
# (exchange the path below with the path to your data directory):
datapath = projectpath * "data/tutorial/";
@info "Data path: $datapath"
if !isdir(datapath)
    mkpath(datapath)  # mkpath also creates missing parent directories
end

# Set the path to the figures directory and create the folder if it does not already exist 
# (exchange the path below with the path to where you want to store your results):
figurespath = projectpath * "figures/tutorial/"
@info "Figures path: $figurespath"
if !isdir(figurespath)
    mkpath(figurespath)  # mkpath also creates missing parent directories
end

## Load the gene expression data:

In this section you can either choose to run the analysis workflow with some example data (scRNA-seq or ST) or upload your own data. 

Currently, the following options are supported:

- We provide access to two example datasets that were also used in the [NICHES paper](https://academic.oup.com/bioinformatics/article/39/1/btac775/6865029). The first is [scRNA-seq data from rat lungs](#A.-Starting-with-the-scRNA-seq-example-data) (A). The second is [spot-based ST data from the mouse cortex](#B.-Starting-with-the-ST-example-data) (B).

- Since NICHES is already implemented in R and is part of our workflow, you can alternatively [run NICHES directly on your data](#C.-Starting-with-an-own-Seurat-object) if it is already stored as a Seurat object (C).

- If you have your data stored as an AnnData object or in some other format, another way to run the workflow is to [manually extract and load the gene expression matrix, relevant metadata, and gene names](#D.-Starting-with-own-data-from-other-sources-and-preparing-for-the-signaling-pattern-analysis-workflow) (D). 

### A. Start with the scRNA-seq example data:

As a first example, you can load [rat lung scRNA-seq data](https://www.science.org/doi/10.1126/sciadv.aaw3851) from [https://zenodo.org/record/6846618/files/raredon_2019_rat.Robj](https://zenodo.org/record/6846618/files/raredon_2019_rat.Robj), which was also used in the original NICHES publication. You can download the data directly and save it locally as a Seurat object by running the following cell without having to download the data manually. To save the data in another directory, you can specify a path in the `data_path` argument of the `load_rat_scRNAseq_data()` function.

For the workflow of cell-cell interaction pattern analysis with the BAE, we follow the steps of the [R vignette](https://github.com/msraredon/NICHES/blob/master/vignettes/03%20Rat%20Alveolus.Rmd) provided by the NICHES authors and subset the data to include only the cell types that they are confident are spatially close to each other. You can also access the whole dataset by setting `subset_data=false` in the `load_rat_scRNAseq_data()` function. 
The data is already pre-processed, including cell and gene filtering, normalization and transformation. In addition, [ALRA](https://www.nature.com/articles/s41467-021-27729-z) imputation has been applied to impute missing values. 

**The first time you download the data**, it may take a few minutes. If the data has already been downloaded and a file with the downloaded data already exists in the directory where the data is stored, the download will be skipped.

In [None]:
#---Load example scRNA-seq data:
# Download a subset of the example rat lung data:
X_rat, tSNE_embeddings, celltypes_rat, genenames_rat = load_rat_scRNAseq_data(; transfer_data=true, assay="alra");

# Alternatively, you can download the full dataset by setting subset_data=false and define another directory for saving the data:
#X_rat, tSNE_embeddings, celltypes_rat, genenames_rat = load_rat_scRNAseq_data(; data_path=datapath, subset_data=false, transfer_data=true, assay="alra");

For demonstration purposes, you can visualize the data using the precomputed [tSNE](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) coordinates in the Seurat object. Note that tSNE was run on the entire dataset, but only the cells of the selected types are shown here.

In [None]:
vegascatterplot(tSNE_embeddings, celltypes_rat; 
    path=figurespath * "ratLung_data_tSNE.png",
    legend_title="Cell type",
    Title="tSNE embedding of rat lung cells",
    color_field="labels:o",
    scheme="category10",
    domain_mid=nothing,
    range=nothing,
    save_plot=true,
    marker_size="15"
)

### B. Start with the ST example data:

You can also use data from an ST experiment in the interaction pattern analysis workflow. As an example, you can download the [10X Visium Mouse Brain Data](https://www.10xgenomics.com/datasets?menu%5Bproducts.name%5D=Spatial%20Gene%20Expression&query=&page=1&configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000) that was previously analyzed with Seurat [here](https://satijalab.org/seurat/articles/spatial_vignette.html#x-visium). We use the `SeuratData` function `LoadData()`, which we have wrapped in a Julia function to load the data.

For illustration purposes, we cluster the data using `Seurat`. We extract the spatial coordinates and Seurat clusters along with the imputed counts from [ALRA](https://www.nature.com/articles/s41467-021-27729-z) to run the workflow.

**The first time you download the data**, it may take a few minutes! If the data has already been downloaded and a file with the downloaded data already exists in the directory where the data is stored, the download will be skipped.

In [None]:
#---Load example ST data:
# Download the example spatial mouse brain Visium data accessible via SeuratData and save it to the data directory:
X_brain, cell_locations, seurat_clusters_brain, genenames_brain = load_spatial_mousebrain_data();

For demonstration purposes, you can visualize the data using the spatial cell locations from the Seurat object. The spots are colored according to the cluster labels.

In [None]:
vegascatterplot(cell_locations, seurat_clusters_brain; 
    path=figurespath * "mouseBrain_data_locations.png",
    legend_title="Cluster",
    Title="Spatial locations of mouse brain spots",
    color_field="labels:o",
    scheme="category20",
    domain_mid=nothing,
    range=nothing,
    save_plot=true,
    marker_size="100"
)

### C. Start with your own Seurat object:

If your scRNA-seq or ST data is already stored in a Seurat object, you can start the analysis and continue by running NICHES to construct a CCIM.
Please make sure that cell type annotations are available in the meta.data of your Seurat object (in case of scRNA-seq data), as NICHES requires them for the subsampling process.

- >Continue by [running NICHES](#Construct-a-CCIM-with-NICHES) on your data to construct a CCIM.

### D. Start with your own data from other sources:

If you have your data in another format that is not a Seurat object, you can still construct a Seurat object with your data, which can then be passed as input to NICHES to construct a CCIM. To do this, you must load the relevant parts of your data into the code cell below.

**Data required to run the CCI pattern analysis workflow:**
- **Filtered count matrix (*either counts, normalized counts or ALRA imputed counts*)**
- **Genenames**
- **Cell types/clusters (*required for scRNA-seq data*) or spatial coordinates (*required for ST data*)**

Optional data:
- Further metadata ...
- Cell embeddings

To store the genenames and cell related information, such as cell types, we have created a Julia structure called `MetaData`. 
Below we show an example of how to load data from .txt files and construct a `MetaData` object `MD` consisting of the genenames and relevant metadata. The example data is from [Tasic et al. (2016)](https://www.nature.com/articles/nn.4216) and was preprocessed following the steps in the [BAE](https://github.com/NiklasBrunn/BoostingAutoencoder) GitHub repository.

To construct a Seurat object with the loaded data, we also provide a Julia function `create_seurat_object()` that passes the loaded data to `R` and creates a Seurat object.





In [None]:
#---Tasic data CCIM construction:
# Set to true for loading example data
other_source_data = false;

if other_source_data
    
    X = readdlm(datapath * "../cortical_mouse/corticalMouseDataMat_allgenes_log1.txt", Float32);
    genenames = vec(readdlm(datapath * "../cortical_mouse/genenames.txt", String));
    celltypes = vec(readdlm(datapath * "../cortical_mouse/celltype.txt", String));

    #---Summarize metadata information:
    MD = MetaData(; featurename=genenames);
    MD.obs_df[!, :Celltype] = celltypes;

    #---Create a Seurat object:
    create_seurat_object(X', MD; 
        data_path=nothing, 
        file_name=nothing,
        assay= "RNA",
        normalize_data=false, 
        alra_imputation=false,
        indents="Celltype",
        data_is_normalized=true
    );
end

## Construct a CCIM:

Based on either single-cell RNA sequencing (scRNA-seq) or spatial transcriptomics (ST) data, we use [NICHES](https://github.com/msraredon/NICHES) to reconstruct CCIs based on known ligand-receptor interactions from the [FANTOM5](https://fantom.gsc.riken.jp/5/) or [OmniPath](https://omnipathdb.org) database. The reconstructed CCIs are in the form of a **cell-cell interaction matrix (CCIM)**, where observations correspond to cell pairs that represent edges in a directed communication graph, where nodes correspond to cells. The features of the CCIM are the ligand-receptor interactions selected by NICHES and represent the edge features in the communication graph.
The resulting CCIM can be passed to BAE for interaction pattern analysis of cell pairs. 
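
To make the structure of a CCIM concrete, here is a minimal sketch that builds a toy cell-to-cell CCIM by hand. It is not the NICHES implementation: the gene names, the ligand-receptor list, and the scoring rule (ligand expression in the sender multiplied by receptor expression in the receiver) are assumptions made for illustration, and all ordered cell pairs are enumerated, whereas NICHES works with sampled pairs.

```julia
# Toy expression matrix: rows are cells, columns are genes (values are made up).
genes = ["LigA", "RecA", "LigB", "RecB"]
X = Float32[1.0 0.0 2.0 0.5;
            0.0 3.0 0.0 1.0;
            2.0 1.0 1.0 0.0]

# Hypothetical ligand-receptor pairs (in practice taken from a database such as FANTOM5 or OmniPath).
lr_pairs = [("LigA", "RecA"), ("LigB", "RecB")]
gene_idx = Dict(g => i for (i, g) in enumerate(genes))

# All ordered (sender, receiver) pairs: the directed edges of the communication graph.
n_cells = size(X, 1)
cell_pairs = [(s, r) for s in 1:n_cells for r in 1:n_cells]

# One row per cell pair, one column per ligand-receptor interaction.
CCIM_toy = zeros(Float32, length(cell_pairs), length(lr_pairs))
for (row, (s, r)) in enumerate(cell_pairs)
    for (col, (lig, rec)) in enumerate(lr_pairs)
        # Assumed scoring rule: ligand expression in the sender times receptor expression in the receiver.
        CCIM_toy[row, col] = X[s, gene_idx[lig]] * X[r, gene_idx[rec]]
    end
end
```

Each row of `CCIM_toy` is an edge feature vector of the directed graph. For example, row 2 describes the pair with cell 1 as sender and cell 2 as receiver.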

In addition, NICHES can construct CCIMs representing **cell-to-system** or **system-to-cell** communication. Each observation corresponds to a pair that represents either a cell pair with a sender and a receiver cell, a pair with a sender cell and a receiver system, or a sender system and a receiver cell (more details can be found in the [NICHES paper](https://academic.oup.com/bioinformatics/article/39/1/btac775/6865029)).

To run NICHES on your data from the previous step to create and save a CCIM along with the desired metadata, you can set the `filepath` to your data below, as well as define a `datapath`. We have created a wrapper Julia function that calls the NICHES R function using the Julia package `RCall`. The function takes the `filepath` as input and you can specify optional arguments to determine what output the algorithm will compute and what metadata it will use.

NICHES calculates CCIMs based on normalized counts. Therefore, please make sure that your data is normalized. If not, you can normalize your data by setting the optional function argument `normalize_data=true`. This will call the Seurat function `NormalizeData()` on your Seurat object and compute *log1p* normalized counts within the Julia wrapper function `run_NICHES_wrapper()`.
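
As a rough illustration of what *log1p* normalization does, the following sketch scales each cell by its total counts and applies `log1p`. It assumes a scale factor of 10,000 (the default of Seurat's `NormalizeData()` as far as we know) and a cells-by-genes layout; the function name `lognormalize` is hypothetical and not part of the wrapper.

```julia
# Toy raw count matrix: rows are cells, columns are genes (values are made up).
counts = [10 0 90;
          5 5 0]

# Library-size normalization followed by log1p (assumed scale factor: 10,000).
function lognormalize(counts::AbstractMatrix; scale_factor=10_000)
    totals = sum(counts, dims=2)                      # total counts per cell
    return log1p.(counts ./ totals .* scale_factor)   # zeros stay zero
end

X_norm = lognormalize(counts)
```

Note that Seurat stores genes in rows and cells in columns, so the sum would run over the other dimension there.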

Optionally, you can perform ALRA imputation on your data to reduce the sparsity level of the resulting CCIM, which can occur when the sparsity level in the original gene expression data is extremely high. To do this, you can set the function argument `alra_imputation=true`, but note that you are using pseudo values in this case.

By calling the wrapper function in the code cell below, one or more NICHES Seurat objects are created based on the type of CCIMs you want to construct. Each CCIM is then stored in your specified data path with a filename based on the CCIM type. For example, setting `CellToCell=true` will create a NICHES Seurat object consisting of a CCIM where observations correspond to single-cell pairs, and the Seurat object will be saved with the filename `NICHES_CellToCell.rds`. 

**Note:** *In general, any tool that constructs CCIM-like matrices can be used.*

### A. Construct a CCIM from the scRNA-seq example data:

In [None]:
#---Run NICHES on scRNA-seq data:
filepath_expData = datapath * "Rat_Seurat_sub.rds";
run_NICHES_wrapper(filepath_expData; 
    data_path=nothing, 
    alra_imputation=false, 
    assay="alra", 
    species="rat", 
    LR_database="fantom5", 
    cell_types="cell_types"
);

### B. Construct a CCIM from the ST example data:

In [None]:
#---Run NICHES on ST data:
filepath_expData = datapath * "MouseBrain_Seurat.rds";
run_NICHES_wrapper(filepath_expData;
    alra_imputation=false, 
    assay="alra",
    species="mouse",
    cell_types="seurat_clusters",
    position_x="x",
    position_y="y",
    n_neighbors=4,
    meta_data_to_map=["orig.ident","seurat_clusters", "x", "y"],    
    CellToCell=false,
    SystemToCell=false,
    CellToSystem=false,
    CellToCellSpatial=false,
    NeighborhoodToCell=true,
    CellToNeighborhood=false,
);

### C. Construct a CCIM from your own data:

In [None]:
#---Run NICHES on your own data:
filepath_expData = datapath * "Seurat_object.rds";
run_NICHES_wrapper(filepath_expData; 
    data_path=nothing, 
    normalize_data=true, 
    alra_imputation=true, 
    assay="alra", 
    species="mouse", 
    LR_database="fantom5", 
    cell_types="Celltype"
);

## Pattern analysis of the CCIM with the BAE:

The high-dimensional CCIM computed in the previous step is the starting point for single-cell interaction pattern analysis. Each row of the CCIM corresponds to an individual cell pair (observation), and the columns (features) store the interaction scores of the ligand-receptor interactions. 

The goal of the interaction pattern analysis is to group cell pairs based on similar interaction profiles and to be able to identify specific characterizing ligand-receptor interactions for these groups. For this purpose we use the Boosting Autoencoder (BAE), a neural network based tool for sparse and structured dimension reduction.

**During training, the BAE** 

- **learns a low-dimensional representation of the cell pairs, where cell pairs with similar interaction profiles are grouped together,**
- **learns to assign cluster-membership probabilities to the cell pairs based on which cluster labels can be defined,**
- **learns to sparsely link characterizing ligand-receptor interactions to the different clusters.** 

### A. Load and preprocess the CCIM:

In order to pass the CCIM to BAE, which is currently stored in a Seurat object, you must first load and transfer your data to Julia.
You can do this by executing one of the following code cells (1-6). Each cell calls a Julia function that, given the file path to your NICHES Seurat object, loads the object in R, extracts the relevant data, and transfers it to Julia. Currently, the metadata that can be transferred comprises the cell types/cell type pairs, the sender cell types, and the receiver cell types. 

Part of the functionality is feature standardization, i.e. z-score transformation of the CCIM. This is a necessary step since BAE requires feature standardized data as input.
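
The loading functions take care of this step for you. For reference, a z-score transformation of the columns can be sketched as follows. The helper name `standardize_columns` is hypothetical, and the handling of constant features (left at zero instead of dividing by zero) is an assumption, not necessarily what the package does.

```julia
using Statistics

# Standardize each column (feature) to zero mean and unit standard deviation.
function standardize_columns(X::AbstractMatrix{<:Real})
    μ = mean(X, dims=1)
    σ = std(X, dims=1)
    σ_safe = map(s -> s > 0 ? s : one(s), σ)   # avoid division by zero for constant features
    return (X .- μ) ./ σ_safe
end

# Toy matrix: rows are cell pairs, columns are interactions; the third column is constant.
X_toy = Float32[1 10 5; 2 20 5; 3 30 5]
X_st = standardize_columns(X_toy)
```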

**1. Load the CellToCell NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_CellToCell.rds";
CCIM, CCIM_st, MD = load_CCIM_CtC(filepath_CCIM); 

**2. Load the SystemToCell NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_SystemToCell.rds";
CCIM, CCIM_st, MD = load_CCIM_StC(filepath_CCIM); 

**3. Load the CellToSystem NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_CellToSystem.rds";
CCIM, CCIM_st, MD = load_CCIM_CtS(filepath_CCIM); 

**4. Load the spatial CellToCell NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_CellToCell_Spatial.rds";
CCIM, CCIM_st, MD = load_CCIM_CtC_Spatial(filepath_CCIM); 

**5. Load the spatial NeighborhoodToCell NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_NeighborhoodToCell.rds";
CCIM, CCIM_st, MD = load_CCIM_NtC(filepath_CCIM); 

**6. Load the spatial CellToNeighborhood NICHES results:**

In [None]:
filepath_CCIM = datapath * "NICHES_CellToNeighborhood.rds";
CCIM, CCIM_st, MD = load_CCIM_CtN(filepath_CCIM); 

### B. Set hyperparameters for the BAE training:

Before you can train a BAE with the CCIM, you must specify the hyperparameters for training. 

Here we give a brief description of the different hyperparameters (for more details you can check the `Hyperparameter` documentation):

- `zdim`: Number of latent dimensions of the BAE model. Cell pairs can be assigned to `2*zdim` clusters by the BAE.
- `n_runs`: Can be set to 1 or a larger integer. If > 1, the encoder weight matrix is reset to zero after each of the first `n_runs`-1 runs, where each run trains for at most `max_iter` epochs. 
- `max_iter`: Maximum number of training epochs per run.
- `tol`: This parameter controls whether early stopping is enabled or not. If set to nothing, there will be no early stopping. Otherwise, if a tolerance value is given, the training per run will stop if the absolute difference between the mean train loss of the current and the last training epoch is less than the tolerance.
- `batchsize`: Mini-batch size used for each parameter update iteration during each training epoch.
- `η`: Learning rate for the decoder parameter optimization ([AdamW](https://arxiv.org/abs/1711.05101)).
- `λ`: Regularization parameter for decoder parameter updates ([AdamW](https://arxiv.org/abs/1711.05101)).
- `ϵ`: Step size for the boosting component to update encoder weights.
- `M`: Number of boosting steps performed to update the encoder weights during each parameter update iteration.
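
To give an intuition for the roles of `ϵ` and `M`, the sketch below performs generic componentwise least-squares boosting toward a target vector for a single latent dimension. It is not the package's implementation: the function name `boosting_step!`, the toy data, and the selection criterion are assumptions, and how the BAE derives its targets from the decoder gradients is not shown.

```julia
using LinearAlgebra

# One componentwise boosting step for a single latent dimension.
# β: current weights (length p), X: standardized data (n × p), g: target vector (length n).
function boosting_step!(β::AbstractVector, X::AbstractMatrix, g::AbstractVector; ϵ=0.01)
    residual = g .- X * β
    best_j, best_coef, best_gain = 0, 0.0, -Inf
    for j in axes(X, 2)
        xj = view(X, :, j)
        denom = dot(xj, xj)
        denom == 0 && continue                 # skip all-zero features
        coef = dot(xj, residual) / denom       # univariate least-squares estimate
        gain = coef^2 * denom                  # reduction of the residual sum of squares
        if gain > best_gain
            best_j, best_coef, best_gain = j, coef, gain
        end
    end
    if best_j != 0
        β[best_j] += ϵ * best_coef             # only the selected weight is updated
    end
    return best_j
end

# Toy example: the target depends only on the second feature.
X_toy = [1.0 0.0 1.0; -1.0 1.0 0.0; 0.0 -1.0 -1.0]
g_toy = 2.0 .* X_toy[:, 2]
β_toy = zeros(3)
M = 1
selected = [boosting_step!(β_toy, X_toy, g_toy; ϵ=0.01) for _ in 1:M]
```

Because each step changes a single weight by a small amount, most weights stay exactly zero, which is what makes the resulting encoder sparse.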



In [None]:
#---Define hyperparameters for training a BAE:

# Rat lung scRNA-seq data
#HP = Hyperparameters(zdim=30, n_runs=2, max_iter=50, tol=1e-5, batchsize=2^11, η=0.01, λ=0.1, ϵ=0.01, M=1); 

# Spatial mouse brain data 
HP = Hyperparameters(zdim=10, n_runs=2, max_iter=100, tol=1e-5, batchsize=2^9, η=0.01, λ=0.1, ϵ=0.01, M=1);  

# Mouse brain scRNA-seq data
#HP = Hyperparameters(zdim=8, n_runs=2, max_iter=50, tol=1e-5, batchsize=2^9, η=0.01, λ=0.1, ϵ=0.01, M=1); 


# Hyperparameters for reproducing the results on rat lung scRNA-seq data in our paper:
#HP = Hyperparameters(zdim=30, n_runs=1, max_iter=1000, tol=1e-5, batchsize=2^12, η=0.01, λ=0.1, ϵ=0.001, M=1); 


# Customize Hyperparameters:
#HP = Hyperparameters(zdim=30, n_runs=1, max_iter=2000, tol=1e-5, batchsize=2^12, η=0.01, λ=0.1, ϵ=0.001, M=1);  

### C. Define the neural network model architecture for the BAE:

Next, you can create the neural network architecture for the BAE, which is subdivided into an encoder and a decoder. 

The encoder consists of a single-layer linear neural network solely parameterized by a weight matrix called `coeffs`. 
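
As a small illustration with made-up numbers, a linear encoder with a `p × zdim` weight matrix maps standardized data of shape `n × p` to a latent representation of shape `zdim × n`. The variable names below are hypothetical. The orientation of the latent matrix is an assumption based on how `BAE.Z` is transposed later in this notebook.

```julia
# Toy dimensions: p features (ligand-receptor interactions), zdim latent dimensions.
p, zdim = 4, 2

# Sparse encoder weights (p × zdim): each latent dimension uses only a few features.
coeffs_toy = zeros(Float32, p, zdim)
coeffs_toy[1, 1] = 0.5f0
coeffs_toy[3, 2] = -1.0f0

# Toy data with three cell pairs (n × p); values are made up.
X_toy = Float32[1 0 2 0;
                -1 1 0 0;
                0 2 -2 1]

# Linear encoder without bias or activation function: result has shape zdim × n.
Z_toy = coeffs_toy' * X_toy'

# Features with nonzero weights, listed per latent dimension.
selected_features = [findall(!iszero, coeffs_toy[:, d]) for d in 1:zdim]
```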

The decoder consists of three distinct layers: 
The first layer is the *split-softmax* transformation, which is a structured soft-clustering component to disentangle two different groups of cells that are potentially represented within a latent dimension. The *split-softmax* thus doubles the number of latent dimensions. 
The second layer of the decoder is a dense layer with a *tanh* activation function, followed by a third layer with no activation function (see the commented-out line below the line where the decoder is defined).

The hyperparameters used to train the BAE are stored as part of the model structure.

**Note:** *The decoder can in principle be defined as an arbitrary multi-layer feed forward neural network.* 
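
The *split-softmax* idea described above can be sketched in plain Julia as follows. This is an illustration under assumptions, not the package code: we assume that splitting means separating each latent dimension into its positive and negative parts, and the helper names `split_vectors_sketch` and `softmax_columns` are hypothetical.

```julia
# Split each latent vector into positive and negative parts (assumed behaviour of the split step).
# Z has shape zdim × n (latent dimensions × cell pairs); the result has shape 2*zdim × n.
split_vectors_sketch(Z::AbstractMatrix) = vcat(max.(Z, 0), max.(-Z, 0))

# Column-wise softmax, shifted by the column maximum for numerical stability.
function softmax_columns(A::AbstractMatrix)
    E = exp.(A .- maximum(A, dims=1))
    return E ./ sum(E, dims=1)
end

# Toy latent representation with zdim = 2 and two cell pairs.
Z_toy = Float32[2.0 -3.0;
                -1.0 0.5]

# Soft cluster memberships: one probability vector of length 2*zdim per cell pair.
P = softmax_columns(split_vectors_sketch(Z_toy))
```

In this toy example, the first cell pair has its largest membership in row 1 (positive part of dimension 1), while the second has it in row 3 (negative part of dimension 1), so the two pairs end up in different clusters although the same latent dimension drives both.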

In [None]:
#---Define the decoder architecture:
p = size(CCIM_st, 2);
decoder = generate_BAEdecoder(p, HP; soft_clustering=true); #Below is the decoder structure ...
#decoder = Chain(x -> softmax(split_vectors(x)), Dense(2*HP.zdim, p, tanh_fast), Dense(p, p));

#---Initialize the BAE model:
BAE = BoostingAutoencoder(; coeffs=zeros(Float32, p, HP.zdim), decoder=decoder, HP=HP);
summary(BAE)

### D. Train the BAE:

You can now train your BAE model with the specified hyperparameters by executing the following code cell. 

In [None]:
#---Train the BAE model:
@time begin
    output_dict = train_BAE!(CCIM_st, BAE; MD=MD);
end

## Result visualization and plots saving:

**The BAE results consist of:**

- **A sparse encoder weight matrix that links model-selected ligand-receptor interactions to the different latent dimensions.**
- **A low-dimensional representation of the CCIM in `zdim` dimensions, which can be accessed via `BAE.Z`.**
- **Cluster labels that are defined as the `argmax` values of the soft clustering component per cell pair.**

Since BAE is constrained to learning disentangled latent dimensions, each dimension captures patterns that are characteristic of a group of cell pairs. Sometimes two groups are captured by a single dimension. In this case, the split-softmax transformation helps to further disentangle these groups into separate dimensions, which we refer to as clusters.
In addition, each latent dimension, and thus each cluster, is associated with a sparse set of model-selected ligand-receptor interactions that characterize the learned patterns for that dimension/cluster.
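
As a minimal illustration of the `argmax` rule for cluster labels, the sketch below uses a made-up membership matrix with clusters in rows and cell pairs in columns. The layout and variable names are assumptions for this example.

```julia
# Toy soft-clustering output: rows are clusters, columns are cell pairs; each column sums to 1.
P_toy = [0.7 0.1 0.2;
         0.1 0.6 0.2;
         0.1 0.2 0.5;
         0.1 0.1 0.1]

# Hard cluster label per cell pair: index of the largest membership probability.
cluster_labels = [argmax(col) for col in eachcol(P_toy)]

# Number of cell pairs assigned to each cluster.
cluster_sizes = [count(==(k), cluster_labels) for k in 1:size(P_toy, 1)]
```

Clusters that never attain the largest probability for any cell pair, such as cluster 4 here, simply stay empty.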

The next step is to visually inspect the learned patterns and use the information about the selected features for the different clusters.
For visualization, we use [UMAP](https://arxiv.org/abs/1802.03426) to construct a 2D embedding of the cell pairs based on the BAE latent representation. In addition, we create a custom discrete color scheme to help visualize the clustering results when the number of latent dimensions is large, e.g., > 20 (otherwise, the `"category10"` or `"category20"` color schemes can be used instead). To create feature plots later, we also define two custom continuous color ranges: `color_range_dark` for dark backgrounds and `color_range_light` for light backgrounds.
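
If you want to build such a discrete color scheme without the helper used in this notebook, the standard HSL-to-RGB conversion can be sketched as follows. The function `hsl_to_hex_sketch` is a hypothetical stand-in and may differ from the package's `hsl_to_hex` in details such as rounding.

```julia
using Printf

# Convert an HSL color (h, s, l all in [0, 1]) to a hex string such as "#ff0000".
function hsl_to_hex_sketch(h::Real, s::Real, l::Real)
    c = (1 - abs(2l - 1)) * s                 # chroma
    hp = mod(h, 1) * 6                        # hue sector in [0, 6)
    x = c * (1 - abs(mod(hp, 2) - 1))
    r, g, b = hp < 1 ? (c, x, 0.0) :
              hp < 2 ? (x, c, 0.0) :
              hp < 3 ? (0.0, c, x) :
              hp < 4 ? (0.0, x, c) :
              hp < 5 ? (x, 0.0, c) : (c, 0.0, x)
    m = l - c / 2                             # lightness offset added to every channel
    to_byte(v) = round(Int, clamp(v + m, 0, 1) * 255)
    return @sprintf("#%02x%02x%02x", to_byte(r), to_byte(g), to_byte(b))
end

# Evenly spaced hues give visually distinct colors.
n_colors = 6
palette = [hsl_to_hex_sketch(i / n_colors, 0.7, 0.5) for i in 1:n_colors]
```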

In [None]:
#---Compute 2D UMAP embedding of the learned BAE latent representation and add to the metadata:
BAE.UMAP = generate_umap(BAE.Z');
MD.obs_df[!, :UMAP1] = BAE.UMAP[:, 1];
MD.obs_df[!, :UMAP2] = BAE.UMAP[:, 2];

#---Generate a custom color scheme of distinct colors:
n_cols = 2 * BAE.HP.zdim; 
custom_colorscheme = shuffle([hsl_to_hex(i / n_cols, 0.7, 0.5 + 0.1 * sin(i * 4π / BAE.HP.zdim)) for i in 1:n_cols]); 

#---Set color ranges for scatter plots (one for dark and one for light backgrounds):
#For dark backgrounds:
color_range_dark = [
    "#fff5f5", "#ffe0e0", "#ffcccc", "#ffb8b8", "#ffa3a3", "#ff8f8f", "#ff7a7a", "#ff6666",
    "#ff5252", "#ff3d3d", "#ff2929", "#ff1414", "#ff0000", "#e50000", "#cc0000", "#b20000",
    "#990000", "#7f0000", "#660000", "#4c0000", "#330000"
];
#For light backgrounds:
color_range_light = [
    "#000000", "#220022", "#440044", "#660066", "#880088", "#aa00aa", "#cc00cc", "#ee00ee",
    "#ff00ff", "#ff19ff", "#ff33ff", "#ff4cff", "#ff66ff", "#ff7fff", "#ff99ff", "#ffb2ff",
    "#ffccff", "#ffe5ff", "#ffccf5", "#ff99eb", "#ff66e0"
];

With the 2D UMAP coordinates, you can now create a scatter plot of the cell pairs in a two-dimensional space that reflects similarities in their interaction patterns based on the learned BAE representation. To inspect the interaction patterns, you can look at this plot from different perspectives.

**Cell pairs can be colored based on:**

- **1. Cell type pair information (Sender-receiver types)**
- **2. Sender types**
- **3. Receiver types**
- **4. BAE cluster labels**

In the scatterplot functions below, you can specify whether to save the generated plots (`save_plot`), as well as the file type and path where the plots should be saved (`path`). You can also set the color scheme based on an existing one using `scheme` (we recommend using `"category10"` or `"category20"` for discrete data if there are <=20 labels). If there are >20 labels, you can use the `custom_colorscheme` defined above. Finally, you can adjust the `marker_size` depending on how many observations, i.e. points, are plotted.

**1. Cell type pair information (Sender-receiver types)**

In [None]:
if "CellTypePair" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.CellTypePair; 
        path=figurespath * "CellTypePair_(BAE)umap.png",
        legend_title="Sender-Receiver",
        color_field="labels:o",
        scheme=nothing,
        domain_mid=nothing,
        range=custom_colorscheme,
        save_plot=true,
        marker_size="7"
    )

    display(pl)
    
else
    @warn "CellTypePair not found in metadata!"
end

**2. Sender types**

In [None]:
if "SenderType" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.SenderType; 
        path=figurespath * "SenderType_(BAE)umap.png",
        legend_title="Sender",
        color_field="labels:o",
        scheme="category20",
        domain_mid=nothing,
        range=nothing, 
        save_plot=true,
        marker_size="7"
    )

    display(pl)
    
else
    @warn "MetaData has no column named: SenderType."
end

**3. Receiver types**

In [None]:
if "ReceiverType" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.ReceiverType; 
        path=figurespath * "ReceiverType_(BAE)umap.png",
        legend_title="Receiver",
        color_field="labels:o",
        scheme="category20",
        domain_mid=nothing,
        range=nothing, 
        save_plot=true,
        marker_size="7"
    )

    display(pl)
    
else
    @warn "MetaData has no column named: ReceiverType."
end

**4. BAE cluster labels**

Inspecting the plot with the cluster labels together with one or more of the plots above (*Cell type pair / sender type / receiver type*) can reveal which cells communicate with others in a similar way. For example, a BAE cluster could consist of cell pairs that share the same sender type but have two different receiver types.

In [None]:
vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.Cluster; 
    path=figurespath * "Cluster_(BAE)umap.png",
    legend_title="Cluster",
    color_field="labels:o",
    scheme=nothing,
    domain_mid=nothing,
    range=custom_colorscheme,
    save_plot=true,
    marker_size="7"
)

You can also inspect the BAE's learned latent patterns for each cluster individually. We provide a function for generating a 2D UMAP scatter plot for each cluster, colored by the softmax activation of cell pairs, indicating the probability that a cell pair belongs to that cluster. Because the latent dimensions are largely disentangled, different clusters tend to capture different groups of cell pairs.

By examining the plots, you can select the ones that capture interesting patterns, i.e., groups of cell pairs that stand out from the rest.

In [None]:
# Create scatter plots of the UMAP embedding of the learned BAE latent representation colored by activations for different clusters:
if !isdir(figurespath * "/UMAPplotsCluster")
    # Create the folder if it does not exist
    mkdir(figurespath * "/UMAPplotsCluster")
end
create_colored_vegascatterplots(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), BAE.Z_cluster;
    path=figurespath * "/UMAPplotsCluster/",
    filename="BAE_cluster",
    filetype="scatter.png",
    legend_title="Activation",
    color_field="labels:q",
    scheme=nothing, 
    domain_mid=nothing,
    range=color_range_light,
    save_plot=true,
    marker_size="7"
)
@info "UMAP scatter plots per cluster generated and saved to $(figurespath * "/UMAPplotsCluster")!"

Next, you can further investigate the ligand-receptor interactions that drive the learned patterns displayed in the 2D UMAP plots. You can do this by examining the top selected features per cluster. You can create and save data frames of the top selected interactions by running the code cell below. Each dataframe consists of the selected interaction names, the actual encoder weights learned by BAE, and the normalized weights (a relative score for the selected interactions ranging from 0 to 1).
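
To illustrate how such a table could be derived from the encoder weights of one latent dimension, consider the sketch below. The interaction names and weights are made up, and the normalization (absolute weight divided by the largest absolute weight) is an assumption that may differ from what `topFeatures_per_Cluster` computes. For simplicity, the sketch also ignores the split of a latent dimension into a positive and a negative cluster.

```julia
# Toy encoder weights of one latent dimension for five hypothetical interactions.
interactions = ["Lig1-Rec1", "Lig2-Rec2", "Lig3-Rec3", "Lig4-Rec4", "Lig5-Rec5"]
weights = Float32[0.0, 0.8, -0.2, 0.0, 0.4]

# Keep only selected (nonzero) interactions and rank them by absolute weight.
sel = findall(!iszero, weights)
order = sel[sortperm(abs.(weights[sel]); rev=true)]

# Assumed normalization: relative scores between 0 and 1, with 1 for the strongest interaction.
scores = abs.(weights[order]) ./ maximum(abs.(weights[order]))

# Collect the result as a vector of named tuples (one entry per selected interaction).
top = [(interaction=interactions[i], weight=weights[i], score=s) for (i, s) in zip(order, scores)]
```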

In [None]:
# Create dataframes of top features/interactions per cluster:
topFeatures_per_Cluster(BAE, MD; save_data=true, data_path=datapath);
@info "Top features per cluster computed and saved to $(datapath)!"

# Load an example dataframe for a specific cluster:
cluster = 4;
top_features_cluster = CSV.read(datapath * "TopFeaturesCluster_CSV/topFeatures_Cluster_$(cluster).csv", DataFrame)

Alternatively, you can visually inspect the top *k* selected ligand-receptor interactions per cluster. We provide a function to generate a scatter plot based on the normalized encoder weights per cluster.

In [None]:
#---Create scatter plots of the top selected interactions per cluster:
k = 10;
if !isdir(figurespath * "/TopFeaturesCluster")
    # Create the folder if it does not exist
    mkdir(figurespath * "/TopFeaturesCluster")
end
for key in keys(MD.Top_features)
    if length(MD.Top_features[key].Scores) > 0
        FeatureScatter_plot = TopFeaturesPerCluster_scatterplot(MD.Top_features[key], key; top_n=k)
        savefig(FeatureScatter_plot, figurespath * "/TopFeaturesCluster/" * "BAE_Cluster$(key)_Interactions.png")
    end
end
@info "Figures saved to: $(figurespath * "/TopFeaturesCluster")!"

The 2D UMAP scatter plots from above can also be colored based on the scores of selected ligand-receptor interactions for different patterns. You can compare whether the displayed interaction score patterns match the cluster, sender, or receiver patterns in the above plots to further select interesting interactions. The plots can be generated by running the code cell below.

In [None]:
#---Create scatter plots of the UMAP embedding of the learned BAE latent representation colored by interaction scores of top selected interactions for different clusters:
if !isdir(figurespath * "/FeaturePlots")
    # Create the folder if it does not exist
    mkdir(figurespath * "/FeaturePlots")
end
FeaturePlots(MD.Top_features, MD.featurename, CCIM, Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]); 
    top_n=5,
    marker_size="7", 
    fig_type=".png",
    path=figurespath * "/FeaturePlots/",
    legend_title="Score",
    color_field="labels:q",
    scheme=nothing, 
    domain_mid=nothing,
    range=color_range_light
);
@info "Feature plots saved to: $(figurespath * "/FeaturePlots")!"

As a final check, you can go back to the original scRNA-seq or ST data and inspect the expression of the ligand and receptor genes belonging to the ligand-receptor interactions you find interesting. 

**1. Rat lung data**

In [None]:
origData_FeaturePlots(X_rat, tSNE_embeddings, MD, genenames_rat; 
    n_features=4,
    marker_size="15",
    color_range=color_range_light,
    color_scheme=nothing,
    legend_title="ALRA",
    file_type=".png",
    figures_path=figurespath
);
@info "Figures saved to: $(figurespath)!"

**2. Mouse brain data**

In [None]:
origData_FeaturePlots(X_brain, cell_locations, MD, genenames_brain; 
    n_features=4,
    marker_size="150",
    color_range=color_range_light,
    color_scheme=nothing,
    legend_title="ALRA",
    file_type=".png",
    figures_path=figurespath
);
@info "Figures saved to: $(figurespath)!"

**3. Your own data** (2D cell embeddings or 2D cell locations of the data are required)

In [None]:
origData_FeaturePlots(X, embedding, MD, genenames; 
    n_features=4,
    marker_size="15",
    color_range=color_range_light,
    color_scheme=nothing,
    legend_title="Counts",
    file_type=".png",
    figures_path=figurespath
);
@info "Figures saved to: $(figurespath)!"

**Note:** *The following plots require spatial coordinates of cells and can only be generated when analyzing ST data.*

If you are analyzing a `NeighborhoodToCell` or `CellToNeighborhood` CCIM, you can color the spatial map of cells by the interaction scores of the top interactions selected by the BAE (for the different clusters). 

In [None]:
create_spatialInteractionPlots(filepath_CCIM, MD, CCIM; 
    figures_path=figurespath,
    background_color="light", 
    legend_title="Value", 
    n_Interactions=5, 
    marker_size="150", 
    fig_type=".png"
);
@info "Figures saved to: $(figurespath)!"

**Note:** *The following plot requires spatial coordinates of cells and can only be generated when analyzing a `NeighborhoodToCell` or `CellToNeighborhood` CCIM.*

You can also color the cells in the spatial map by the BAE cluster labels, which allows you to investigate the spatial arrangement of all the interaction programs determined by the BAE.

In [None]:
n_cols = 2 * BAE.HP.zdim; 
custom_colorscheme = shuffle([hsl_to_hex(i / n_cols, 0.7, 0.5 + 0.1 * sin(i * 4π / BAE.HP.zdim)) for i in 1:n_cols]); 

vegascatterplot(Matrix(MD.obs_df[!, [:x, :y]]), MD.obs_df.Cluster; 
    path=figurespath * "mouseBrain_data_BAEclusterLabels.png",
    legend_title="Cluster",
    Title="Mouse brain spots colored by BAE cluster labels",
    color_field="labels:o",
    scheme=nothing,
    domain_mid=nothing,
    range=custom_colorscheme,
    save_plot=true,
    marker_size="100"
)