# Tutorial Notebook: Analysis workflow for single-cell-resolved interaction data with the Boosting Autoencoder (BAE)

This Jupyter notebook demonstrates a workflow for analyzing interaction patterns in cell-cell interaction (CCI) data at the single-cell level with the Boosting Autoencoder (BAE) (REF).

Running the analysis presented here requires a filtered and pre-processed single-cell RNA sequencing (scRNA-seq) or spatial transcriptomics (ST) dataset with available cell type annotations.

### The workflow is divided into five main steps:

- [Setup](#Setup)
- [Load the gene expression data](#Load-the-gene-expression-data)
- [Construct a CCIM with NICHES](#Construct-a-CCIM-with-NICHES)
- [Perform the BAE analysis of the CCIM](#Perform-the-BAE-analysis-of-the-CCIM)
- [Result visualization and data saving](#Result-visualization-and-data-saving)

As an example, we provide the analysis of a scRNA-seq dataset from (REF), which can be loaded as part of this tutorial notebook. Alternatively, the BAE analysis can be run on a CCIM constructed from ST data (REF) following the tutorials provided in (REF).

## Setup:

First, we activate the environment and load all packages required for running the analysis.

In [None]:
#---Activate the environment:
using Pkg;

Pkg.activate("../");
Pkg.instantiate();
Pkg.status()

#---Set the path to the project:
projectpath = joinpath(@__DIR__, "../"); 

#---Load the BoostingAutoEncoder module:
include(joinpath(projectpath, "src", "BAE.jl"));
using .BoostingAutoEncoder

#---Load required packages for this notebook:
using RCall;
using DelimitedFiles;
using Plots;
using Random;
using StatsBase;
using VegaLite;  
using DataFrames;
using StatsPlots;

Next, we define the path to the directory that contains the data we want to analyze and specify where we want to save the results.

In [None]:
#---Set paths to the data directory and the figures directory:

# Set the path to the data directory and create the folder if it does not already exist 
# (replace the path below with the path to your data directory):
datapath = projectpath * "data/tutorial/";
# mkpath also creates missing parent folders and does nothing if the folder already exists:
mkpath(datapath);

# Set the path to the figures directory and create the folder if it does not already exist 
# (replace the path below with the path to where you want to store your results):
figurespath = projectpath * "figures/tutorial/";
mkpath(figurespath);

## Load the gene expression data:

...

### A) Starting with the scRNA-seq example data:

...

Downloading the data for the first time may take a few minutes. If a file with the downloaded data already exists in the data directory, the download is skipped.

In [None]:
#---Load example scRNA-seq data:
# Download a subset of the example rat lung data from "https://zenodo.org/record/6846618/files/raredon_2019_rat.Robj" and save it to the data directory:
load_rat_scRNAseq_data();
# Alternatively, you can download the full dataset by setting subset_data=false and set another directory for saving the data via data_path:
#load_rat_scRNAseq_data(; data_path=datapath, subset_data=false);

### B) Starting with the ST example data:

...

Downloading the data for the first time may take a few minutes. If a file with the downloaded data already exists in the data directory, the download is skipped.

In [None]:
#---Load example ST data:
# Download the example spatial mouse brain Visium data accessible via SeuratData and save it to the data directory:
load_spatial_mousebrain_data();

### C) Starting with your own Seurat object:

If your scRNA-seq or ST data is already stored in a Seurat object, you can skip the data preparation and continue directly with running NICHES to construct a CCIM.
Please make sure that cell type annotations are available in the `meta.data` of your Seurat object, because NICHES requires them for the subsampling process.

-> Continue with step 3 ([Construct a CCIM with NICHES](#Construct-a-CCIM-with-NICHES)).

### D) Starting with your own data from other sources and preparing it for the signaling pattern analysis workflow:

If your data is stored in a format other than a Seurat object, the workflow currently first constructs a Seurat object from your data, which can then be passed as input to the NICHES algorithm for constructing a CCIM.

In [None]:
#---Tasic data CCIM construction:
other_source_data = false;
if other_source_data
    X_norm_hvg = readdlm(datapath * "../cortical_mouse/corticalMouseDataMat_allgenes_log1.txt", Float32);
    genenames = vec(readdlm(datapath * "../cortical_mouse/genenames.txt", String));
    celltypes = vec(readdlm(datapath * "../cortical_mouse/celltype.txt", String));

    # Alternatively, provide a DataFrame df that consists of the cell_ids in one column and the (normalized) counts
    # (cells x genes) in the other columns. It can be taken as input to the create_seurat_object function.

    #---Summarize meta data information:
    MD = MetaData(; featurename=genenames);
    MD.obs_df[!, :Celltype] = celltypes;


    #---Create a Seurat object:
    create_seurat_object(X_norm_hvg', MD; 
        data_path=nothing, 
        file_name=nothing,
        assay= "RNA",
        normalize_data=false, 
        alra_imputation=false,
        indents="Celltype",
        data_is_normalized=true
    );
end

## Construct a CCIM with NICHES:

NICHES can be used to compute CCIMs representing cell-to-cell communication. Alternatively, CCIMs can be constructed that represent cell-to-system or system-to-cell communication. Each observation corresponds to a pair: a sender cell and a receiver cell, a sender cell and a receiver system, or a sender system and a receiver cell.
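
To make the structure of a cell-to-cell CCIM concrete, here is a minimal toy sketch. It assumes, as a simplification of the NICHES approach, that each entry is the product of the ligand expression in the sender cell and the receptor expression in the receiver cell. All names (`X`, `lr_pairs`, `cell_pairs`, `CCIM_toy`) are hypothetical and not part of the BAE or NICHES API; the actual computation is done by `run_NICHES_wrapper`.

```julia
# Toy expression matrix (hypothetical data): rows are cells, columns are genes.
genes = ["LigA", "RecA", "LigB", "RecB"]
X = Float32[1 0 2 0
            0 3 0 1
            2 1 0 4]

# Ligand-receptor pairs as (ligand column, receptor column) of X.
lr_pairs = [(1, 2), (3, 4)]
lr_names = ["LigA-RecA", "LigB-RecB"]

# All ordered sender-receiver pairs of distinct cells.
n_cells = size(X, 1)
cell_pairs = [(s, r) for s in 1:n_cells for r in 1:n_cells if s != r]

# One row per cell pair, one column per ligand-receptor pair:
# ligand expression in the sender times receptor expression in the receiver.
CCIM_toy = [X[cell_pairs[i][1], lr_pairs[j][1]] * X[cell_pairs[i][2], lr_pairs[j][2]]
            for i in eachindex(cell_pairs), j in eachindex(lr_pairs)]

println(size(CCIM_toy))  # (number of cell pairs, number of ligand-receptor pairs)
```

In this sketch the observations are cell pairs and the features are ligand-receptor interactions, which is the orientation the rest of the notebook assumes for the CCIM.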

...

In [None]:
#---Run NICHES on scRNA-seq data:
#filepath_expData = datapath * "Rat_Seurat_sub.rds";
#run_NICHES_wrapper(filepath_expData; data_path=nothing, alra_imputation=false, assay="alra", species="rat", LR_database="fantom5", cell_types="cell_types");

#---Run NICHES on ST data:
filepath_expData = datapath * "MouseBrain_Seurat.rds";
run_NICHES_wrapper(filepath_expData;
    alra_imputation=true, 
    assay="alra",
    species="mouse",
    cell_types="seurat_clusters",
    position_x="x",
    position_y="y",
    n_neighbors=4,
    meta_data_to_map=["orig.ident","seurat_clusters"],
    CellToCell=false,
    SystemToCell=false,
    CellToSystem=false,
    CellToCellSpatial=false,
    NeighborhoodToCell=true,
    CellToNeighborhood=false,
);

#---Run NICHES on other data:
#filepath_expData = datapath * "Seurat_object.rds";
#run_NICHES_wrapper(filepath_expData; data_path=nothing, alra_imputation=true, assay="alra", species="mouse", LR_database="fantom5", cell_types="Celltype");

## Perform the BAE analysis of the CCIM:

...

### A) Loading and preprocessing the NICHES CCIM for the BAE analysis:

...

In [None]:
#---Load the NICHES results:
#filepath_CCIM = datapath * "NICHES_CellToCell.rds";
#CCIM, CCIM_st, MD = load_CCIM_CtC(filepath_CCIM); #CellToCell

#filepath_CCIM = datapath * "NICHES_SystemToCell.rds";
#CCIM, CCIM_st, MD = load_CCIM_StC(filepath_CCIM); #SystemToCell

#filepath_CCIM = datapath * "NICHES_CellToSystem.rds";
#CCIM, CCIM_st, MD = load_CCIM_CtS(filepath_CCIM); #CellToSystem


#---Load spatial NICHES results:
#filepath_CCIM = datapath * "NICHES_CellToCell_Spatial.rds";
#CCIM, CCIM_st, MD = load_CCIM_CtC_Spatial(filepath_CCIM); #CellToCell (spatial)

filepath_CCIM = datapath * "NICHES_NeighborhoodToCell.rds";
CCIM, CCIM_st, MD = load_CCIM_NtC(filepath_CCIM); #NeighborhoodToCell (spatial)

#filepath_CCIM = datapath * "NICHES_CellToNeighborhood.rds";
#CCIM, CCIM_st, MD = load_CCIM_CtN(filepath_CCIM); #CellToNeighborhood (spatial)
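
The loaders above return both `CCIM` and `CCIM_st`. Assuming that `CCIM_st` is a column-wise standardized version of `CCIM` (an assumption based on the variable name, not confirmed here), the following sketch shows what such a z-score standardization could look like. The helper `standardize_columns` is hypothetical and not part of the BAE package.

```julia
using Statistics

# Hypothetical helper: z-score every column (feature) of a matrix.
# Constant columns have zero standard deviation and are mapped to zeros.
function standardize_columns(X::AbstractMatrix{<:Real})
    μ = mean(X; dims=1)
    σ = std(X; dims=1)
    σ_safe = map(s -> s > 0 ? s : one(s), σ)  # avoid division by zero
    return (X .- μ) ./ σ_safe
end

# Toy matrix: 3 observations x 3 features, the second feature is constant.
X_toy = Float32[1 2 5
                3 2 7
                5 2 9]
X_st = standardize_columns(X_toy)
println(X_st)
```

After standardization every non-constant feature has mean zero and unit standard deviation, so features measured on different scales contribute comparably during training.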

### B) Set hyperparameters for the BAE training:

Before we train a BAE on the data, we need to specify the hyperparameters for training.
...

In [None]:
#---Define hyperparameters for training a BAE:
#HP = Hyperparameters(zdim=30, n_restarts=3, epochs=20, batchsize=2^11, η=0.01, λ=0.1, ϵ=0.01, M=1); # Rat lung scRNA-seq data
HP = Hyperparameters(zdim=15, n_restarts=3, epochs=100, batchsize=2^9, η=0.01, λ=0.1, ϵ=0.01, M=1); # Spatial mouse brain data
#HP = Hyperparameters(zdim=8, n_restarts=3, epochs=50, batchsize=2^9, η=0.01, λ=0.1, ϵ=0.01, M=1); # Mouse brain scRNA-seq data

#---Hyperparameters for reconstructing the published results:
#HP = Hyperparameters(zdim=30, n_restarts=1, epochs=2000, batchsize=2^12, η=0.01, λ=0.1, ϵ=0.001, M=1); 

### C) Define the neural network architecture for the BAE:

Next, we create the neural network architecture for the BAE. 
...

In [None]:
#---Define the decoder architecture:
p = size(CCIM_st, 2);
decoder = generate_BAEdecoder(p, HP; soft_clustering=true);

#---Initialize the BAE model:
BAE = BoostingAutoencoder(; coeffs=zeros(Float32, p, HP.zdim), decoder=decoder, HP=HP);
summary(BAE)

### D) Train the BAE:

We are now ready to train the model.

In [None]:
#---Train the BAE model:
@time begin
    output_dict = train_BAE!(CCIM_st, BAE; MD=MD, track_coeffs=true, save_data=false, path=nothing);
end

@info "Minimum Trainloss at: $(argmin(output_dict["trainloss"]))"

## Result visualization and data saving:

...

In [None]:
#---Generate a custom color scheme of distinct colors:
n_cols = 2 * BAE.HP.zdim; 
custom_colorscheme = [hsl_to_hex(i / n_cols, 0.7, 0.5 + 0.1 * sin(i * 4π / BAE.HP.zdim)) for i in 1:n_cols]; 
custom_colorscheme_shuffled = shuffle(custom_colorscheme);

#---Compute a 2D UMAP embedding of the learned BAE latent representation and add it to the metadata:
BAE.UMAP = generate_umap(BAE.Z');
MD.obs_df[!, :UMAP1] = BAE.UMAP[:, 1];
MD.obs_df[!, :UMAP2] = BAE.UMAP[:, 2];

In [None]:
#---Plot the mean trainloss per epoch:
mean_trainlossPerEpoch = output_dict["trainloss"];
loss_plot = plot(1:length(mean_trainlossPerEpoch), mean_trainlossPerEpoch,
     title = "Mean train loss per epoch",
     xlabel = "Epoch",
     ylabel = "Loss",
     legend = true,
     label = "Train loss",
     linecolor = :red,
     linewidth = 2
);
savefig(loss_plot, figurespath * "/Trainloss_BAE.png");
loss_plot

In [None]:
#---Plot the Sparsity score per epoch:
sparsity_level = output_dict["sparsity"];
loss_plot = plot(1:length(sparsity_level), sparsity_level,
     title = "Sparsity level per epoch",
     xlabel = "Epoch",
     ylabel = "Sparsity",
     legend = true,
     label = "Sparsity",
     linecolor = :orange,
     linewidth = 2
);
savefig(loss_plot, figurespath * "/Sparsity_BAE.png");
loss_plot

In [None]:
#---Plot the entanglement score per epoch:
entanglement_score = output_dict["entanglement"];
loss_plot = plot(1:length(entanglement_score), entanglement_score,
     title = "Entanglement score per epoch",
     xlabel = "Epoch",
     ylabel = "Entanglement of dimensions",
     legend = true,
     label = "Entanglement",
     linecolor = :orange,
     linewidth = 2
);
savefig(loss_plot, figurespath * "/Entanglement_BAE.png");
loss_plot

In [None]:
#---Plot the clustering score per epoch:
clustering_score = output_dict["clustering"];
loss_plot = plot(1:length(clustering_score), clustering_score,
     title = "Clustering score per epoch",
     xlabel = "Epoch",
     ylabel = "Clustering score",
     legend = true,
     label = "Score",
     linecolor = :orange,
     linewidth = 2
);
savefig(loss_plot, figurespath * "/ClusteringScore_BAE.png");
loss_plot

In [None]:
#---Create scatter plots of the top selected interactions per latent dimension:
if !isdir(figurespath * "/TopFeaturesLatentDim")
    # Create the folder if it does not exist
    mkdir(figurespath * "/TopFeaturesLatentDim")
end
for dim in 1:BAE.HP.zdim
    Featurescatter_plot = normalizedFeatures_scatterplot(BAE.coeffs[:, dim], MD.featurename, dim; top_n=10)
    savefig(Featurescatter_plot, figurespath * "/TopFeaturesLatentDim/" * "BAE_dim$(dim)_topInteractions.png")
end

In [None]:
#---Create scatter plots of the top selected interactions per cluster:
if !isdir(figurespath * "/TopFeaturesCluster")
    # Create the folder if it does not exist
    mkdir(figurespath * "/TopFeaturesCluster")
end
for key in keys(MD.Top_features)
    if length(MD.Top_features[key].Scores) > 0
        FeatureScatter_plot = TopFeaturesPerCluster_scatterplot(MD.Top_features[key], key; top_n=10)
        savefig(FeatureScatter_plot, figurespath * "/TopFeaturesCluster/" * "BAE_Cluster$(key)_Interactions.png")
    end
end

In [None]:
#---Plot the absolute values of Pearson correlation coefficients between latent dimensions:
vegaheatmap(abs.(cor(BAE.Z, dims=2)); 
    path=figurespath * "cor_latentDimensions_BAE.png", 
    Title="Absolute correlations of latent dimensions",
    xlabel="Latent dimension", 
    ylabel="Latent dimension",
    legend_title="Value",
    scheme="orangered",
    domain_mid=nothing,
    save_plot=true,
    Width=500, 
    Height=500
)
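
The heatmap above is built from `abs.(cor(BAE.Z, dims=2))`. As a small self-contained illustration of that computation, the sketch below uses a toy matrix `Z_toy` (hypothetical, randomly generated) with latent dimensions as rows and observations as columns, which is the orientation the calls in this notebook suggest for `BAE.Z`.

```julia
using Statistics, Random

Random.seed!(1)

# Toy latent representation: 3 latent dimensions (rows) x 200 observations (columns).
Z_toy = randn(Float32, 3, 200)
# Make dimension 2 strongly depend on dimension 1 to get one clearly correlated pair.
Z_toy[2, :] .= 0.9f0 .* Z_toy[1, :] .+ 0.1f0 .* Z_toy[2, :]

# dims=2 treats each row as a variable, so the result is a 3 x 3 matrix
# of absolute Pearson correlations between latent dimensions.
C = abs.(cor(Z_toy; dims=2))
println(round.(C; digits=2))
```

Large off-diagonal values indicate latent dimensions that carry overlapping information, while values near zero suggest that the dimensions capture distinct patterns.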

In [None]:
#---Plot the absolute Spearman rank correlations between the latent dimensions:
vegaheatmap(abs.(corspearman(BAE.Z')); 
    path=figurespath * "spearman_cor_latentDimensions_BAE.png", 
    Title="Absolute Spearman rank correlations of latent dimensions",
    xlabel="Latent dimension", 
    ylabel="Latent dimension",
    legend_title="Value",
    scheme="orangered",
    domain_mid=nothing,
    save_plot=true,
    Width=500, 
    Height=500
)

In [None]:
#---Plot boxplots of the latent activations of cells per latent dimension:
plot_row_boxplots(BAE.Z; xlabel="Latent dimension", ylabel="Cell activation", saveplot=true, path=figurespath * "/BAE_Z_boxplot.png")

In [None]:
#---Plot a heatmap of the cluster probabilities of cells:
Cluster_df = DataFrame(BAE.Z_cluster', :auto);
Cluster_df[!, :Cluster] = copy(MD.obs_df.Cluster);
sort!(Cluster_df, :Cluster);

#ClusterProbabilities_plot = heatmap(Matrix(Cluster_df[:, 1:end-1]), ylabel="Cell", title="Cluster probabilities", color=:dense, xlabel="Cluster", size=(700, 500));
#savefig(ClusterProbabilities_plot, figurespath * "/clusterProbabilities_BAE_plots.svg");

vegaheatmap(Matrix(Cluster_df[:, 1:end-1]); #!Currently does not work if zdim > 30 ... (in that case use the heatmap function from Plots.jl above)
    path=figurespath * "clusterProbabilities_BAE.png", 
    Title="Cluster probabilities of cells",
    xlabel="Cluster", 
    ylabel="Cell",
    legend_title="Probability",
    scheme="purpleblue",
    domain_mid=nothing,
    save_plot=true
)

In [None]:
#---Plot the UMAP embedding of the learned BAE latent representation of cell pairs colored by the sending-receiving type pair:
if "CellTypePair" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.CellTypePair; 
        path=figurespath * "CellTypePair_(BAE)umap.png",
        legend_title="Sender-Receiver",
        color_field="labels:o",
        scheme=nothing,
        domain_mid=nothing,
        range=custom_colorscheme_shuffled,
        save_plot=true,
        marker_size="5"
    )

    display(pl)
    
else
    @warn "CellTypePair not found in metadata!"
end

In [None]:
#---Plot the UMAP embedding of the learned BAE latent representation colored by the cluster labels:
vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.Cluster; 
    path=figurespath * "Cluster_(BAE)umap.png",
    legend_title="Cluster",
    color_field="labels:o",
    scheme=nothing,
    domain_mid=nothing,
    range=custom_colorscheme_shuffled,
    save_plot=true,
    marker_size="5"
)

In [None]:
#---Plot the UMAP embedding of the learned BAE latent representation colored by the sender cell types:
if "SenderType" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.SenderType; 
        path=figurespath * "SenderType_(BAE)umap.png",
        legend_title="Sender",
        color_field="labels:o",
        scheme="category20",
        domain_mid=nothing,
        range=nothing, #custom_colorscheme[[1, 3, 14, 26, 31, 36, 42, 45, 53]],
        save_plot=true,
        marker_size="5"
    )

    display(pl)
    
else
    @warn "MetaData has no column named: SenderType."
end

In [None]:
#---Plot the UMAP embedding of the learned BAE latent representation colored by the receiver cell types:
if "ReceiverType" in names(MD.obs_df)
    pl = vegascatterplot(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), MD.obs_df.ReceiverType; 
        path=figurespath * "ReceiverType_(BAE)umap.png",
        legend_title="Receiver",
        color_field="labels:o",
        scheme="category20",
        domain_mid=nothing,
        range=nothing, #custom_colorscheme[[1, 3, 14, 26, 31, 36, 42, 45, 53]],
        save_plot=true,
        marker_size="5"
    )

    display(pl)
    
else
    @warn "MetaData has no column named: ReceiverType."
end

In [None]:
#---Create scatter plots of the UMAP embedding of the learned BAE latent representation colored by activations in different latent dimensions:
if !isdir(figurespath * "/UMAPplotsLatDims")
    # Create the folder if it does not exist
    mkdir(figurespath * "/UMAPplotsLatDims")
end
create_colored_vegascatterplots(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), BAE.Z;
    path=figurespath * "/UMAPplotsLatDims/",
    filename="Rat_BAE_dim",
    filetype="scatter.png",
    legend_title="Activation",
    color_field="labels:q",
    scheme="blueorange", 
    domain_mid=0,
    range=nothing,
    save_plot=true,
    marker_size="10"
)

In [None]:
#---Create scatter plots of the UMAP embedding of the learned BAE latent representation colored by activations for different clusters:
if !isdir(figurespath * "/UMAPplotsCluster")
    # Create the folder if it does not exist
    mkdir(figurespath * "/UMAPplotsCluster")
end
#Bright color:
#color_range = [
#    "#fff5f5", "#ffe0e0", "#ffcccc", "#ffb8b8", "#ffa3a3", "#ff8f8f", "#ff7a7a", "#ff6666",
#    "#ff5252", "#ff3d3d", "#ff2929", "#ff1414", "#ff0000", "#e50000", "#cc0000", "#b20000",
#    "#990000", "#7f0000", "#660000", "#4c0000", "#330000"
#];
#Dark color:
color_range = [
    "#000000", "#220022", "#440044", "#660066", "#880088", "#aa00aa", "#cc00cc", "#ee00ee",
    "#ff00ff", "#ff19ff", "#ff33ff", "#ff4cff", "#ff66ff", "#ff7fff", "#ff99ff", "#ffb2ff",
    "#ffccff", "#ffe5ff", "#ffccf5", "#ff99eb", "#ff66e0"
]
create_colored_vegascatterplots(Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]), BAE.Z_cluster;
    path=figurespath * "/UMAPplotsCluster/",
    filename="Rat_BAE_dim",
    filetype="scatter.png",
    legend_title="Activation",
    color_field="labels:q",
    scheme=nothing, 
    domain_mid=nothing,
    range=color_range,
    save_plot=true,
    marker_size="10"
)

In [None]:
#---Create scatter plots of the UMAP embedding of the learned BAE latent representation colored by the CCIM values of the top selected interactions for different clusters:
if !isdir(figurespath * "/FeaturePlots")
    # Create the folder if it does not exist
    mkdir(figurespath * "/FeaturePlots")
end
#Bright color:
#color_range = [
#    "#fff5f5", "#ffe0e0", "#ffcccc", "#ffb8b8", "#ffa3a3", "#ff8f8f", "#ff7a7a", "#ff6666",
#    "#ff5252", "#ff3d3d", "#ff2929", "#ff1414", "#ff0000", "#e50000", "#cc0000", "#b20000",
#    "#990000", "#7f0000", "#660000", "#4c0000", "#330000"
#];
#Dark color:
color_range = [
    "#000000", "#220022", "#440044", "#660066", "#880088", "#aa00aa", "#cc00cc", "#ee00ee",
    "#ff00ff", "#ff19ff", "#ff33ff", "#ff4cff", "#ff66ff", "#ff7fff", "#ff99ff", "#ffb2ff",
    "#ffccff", "#ffe5ff", "#ffccf5", "#ff99eb", "#ff66e0"
]
FeaturePlots(MD.Top_features, MD.featurename, CCIM, Matrix(MD.obs_df[:, [:UMAP1, :UMAP2]]); 
    top_n=5,
    marker_size="10", 
    fig_type=".png",
    path=figurespath * "/FeaturePlots/",
    legend_title="log1p",
    color_field="labels:q",
    scheme=nothing, 
    domain_mid=nothing,
    range=color_range
)

In [None]:
#---Create coefficient plots for visually inspecting the coefficient update trajectories of the last training run:
if haskey(output_dict, "coefficients")
    if !isdir(figurespath * "/CoefficientsPlots")
        # Create the folder if it does not exist
        mkdir(figurespath * "/CoefficientsPlots")
    end
    for dim in 1:BAE.HP.zdim
        pl = track_coefficients(output_dict["coefficients"], dim; iters=nothing, xscale=:log10)
        savefig(pl, figurespath * "/CoefficientsPlots/CoefficientsPlot_BAE_dim$(dim).png")
    end
else 
    @warn "No coefficient trajectories were saved during training."
end