# Generative Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/IntroToML/blob/HEAD/Day3/scVAE.ipynb)

In this exercise, we will use a variational autoencoder (VAE) to model single-cell RNA-sequencing (scRNA-seq) gene expression data. scVAE ([Grønbech *et al.*, 2020](https://academic.oup.com/bioinformatics/article/36/16/4415/5838187)) is designed for this.

A VAE can encode a data set into a latent representation using an inference model (encoder) and decode the latent representation to reconstruct the data set using a generative model (decoder).
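
The encode, sample, and decode steps can be sketched with a toy model. This is only an illustration in plain Python, not how scVAE is implemented: the hand-picked weights, the one-dimensional latent space, and the function names `encode`, `sample_latent`, and `decode` are all hypothetical, and a real VAE learns its parameters with neural networks.

```python
import math
import random

# Toy, hand-picked parameters (hypothetical; a real VAE learns these).
ENCODER_WEIGHTS = [0.5, -0.25, 0.1]  # one weight per gene
DECODER_WEIGHTS = [0.8, -0.4, 0.2]   # one weight per gene
DECODER_BIASES = [1.0, 0.5, 0.0]
LATENT_LOG_VARIANCE = -2.0           # fixed here for simplicity


def encode(counts):
    """Inference model: mean and standard deviation of q(z | x)."""
    mean = sum(w * math.log1p(c) for w, c in zip(ENCODER_WEIGHTS, counts))
    std = math.exp(0.5 * LATENT_LOG_VARIANCE)
    return mean, std


def sample_latent(mean, std, rng):
    """Draw z = mean + std * epsilon with epsilon ~ N(0, 1)."""
    return mean + std * rng.gauss(0.0, 1.0)


def decode(z):
    """Generative model: expected count (rate) for each gene."""
    return [math.exp(w * z + b)
            for w, b in zip(DECODER_WEIGHTS, DECODER_BIASES)]


rng = random.Random(0)
counts = [3, 0, 1]  # made-up counts for three genes in one cell
mean, std = encode(counts)
z = sample_latent(mean, std, rng)
rates = decode(z)
print(f"latent mean={mean:.3f}, std={std:.3f}, sample z={z:.3f}")
print("reconstructed rates:", [round(rate, 3) for rate in rates])
```

The latent value is sampled rather than computed deterministically, which is what makes the autoencoder variational.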

## Installation

We will use a development version of scVAE, so ignore any warnings that occur.

Install scVAE and ScanPy:

In [None]:
%pip install -U -q https://people.compute.dtu.dk/chegr/scvae/scvae-3.0.0.dev0.tar.gz

In [None]:
%pip install -U -q scanpy

Restart the kernel (Kernel > Restart Kernel...).

Import scVAE and other packages:

In [None]:
import scvae
import scanpy as sc
import anndata as ad
import tensorflow as tf

## Data

We will work with a data set of single-cell RNA-sequencing (scRNA-seq) gene expression data from lupus patients and healthy controls ([Perez *et al.*, 2022](https://www.science.org/doi/10.1126/science.abf1970)). Since the full data set is very large (1.2 million cells and 31,000 genes), it has been preprocessed to reduce its size considerably.

Download preprocessed lupus scRNA-seq data set:

In [None]:
%%bash
wget https://people.compute.dtu.dk/chegr/scvae/lupus.h5ad

Load data set:

In [None]:
lupus = sc.read("lupus.h5ad")
lupus

This is an annotated data set: `obs` refers to the observations, which are the cells, and `var` refers to the variables, which are the genes. The preprocessed gene expression levels for each cell and each gene are stored in the `X` attribute:

In [None]:
# Convert the sparse matrix to a dense array (`.A` is removed in newer SciPy).
lupus.X.toarray()

However, we will use the raw gene expression counts stored in the "counts" layer.

In [None]:
# Convert the sparse count matrix to a dense array for inspection.
lupus.layers["counts"].toarray()

We can visualise the data set using PCA:

In [None]:
sc.tl.pca(lupus)
sc.pl.pca(lupus, color="cell_type")

We can also use UMAP for visualisation:

In [None]:
sc.pp.neighbors(lupus)
sc.tl.umap(lupus)

In [None]:
sc.pl.umap(lupus, color="cell_type")

### Exercise

Try using different cell annotations (see the `obs` annotation names above) for the PCA and UMAP plots using the `color` argument.

## Variational autoencoder (VAE)

First, the hyperparameters are set:

In [None]:
FEATURE_SIZE = lupus.n_vars
HIDDEN_SIZES = [200, 200]
LATENT_SIZE = 50
LIKELIHOOD_NAME = "negative_binomial"
LEARNING_RATE = 1e-3

Here:

* The feature size is the number of genes in the data set.
* The hidden sizes are the number of units in each hidden layer of the neural networks.
* The latent size is the dimension of the latent representation.
* The likelihood name is the distribution we think the counts follow (other possible options are `poisson`, `zero_inflated_poisson`, and `zero_inflated_negative_binomial`).
* The learning rate is the step size used by the optimiser during training.
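
To illustrate why the choice of likelihood matters, the sketch below compares a Poisson and a negative binomial distribution on a handful of made-up, overdispersed counts. The function names, the mean/dispersion parameterisation, and the dispersion value are assumptions for this example and do not reflect how scVAE parameterises its likelihoods.

```python
import math


def poisson_log_pmf(k, mean):
    """Log-probability of count k under a Poisson distribution."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def negative_binomial_log_pmf(k, mean, dispersion):
    """Log-probability of count k under a negative binomial distribution.

    Parameterised by its mean and dispersion r, so that the variance is
    mean + mean**2 / r (always larger than the Poisson variance).
    """
    r = dispersion
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mean))
            + k * math.log(mean / (r + mean)))


# Made-up counts with many zeros and a few large values.
counts = [0, 0, 1, 0, 12, 0, 3, 0, 0, 25]
mean = sum(counts) / len(counts)

poisson_ll = sum(poisson_log_pmf(k, mean) for k in counts)
nb_ll = sum(negative_binomial_log_pmf(k, mean, dispersion=0.3)
            for k in counts)
print(f"Poisson log-likelihood:           {poisson_ll:.2f}")
print(f"Negative binomial log-likelihood: {nb_ll:.2f}")
```

On counts like these, the negative binomial attains a higher log-likelihood because its extra dispersion parameter lets the variance exceed the mean.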

The VAE model is then initialised and compiled:

In [None]:
vae = scvae.models.VariationalAutoEncoder(
    original_dim=FEATURE_SIZE,
    intermediate_dims=HIDDEN_SIZES,
    latent_dim=LATENT_SIZE,
    likelihood_name=LIKELIHOOD_NAME)
optimiser = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
vae.compile(optimizer=optimiser)

Now, the VAE model can be trained:

In [None]:
vae_history = vae.fit(lupus, layer="counts", epochs=50)

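The model reports a [lower bound of the marginal log-likelihood](https://en.wikipedia.org/wiki/Evidence_lower_bound). It is the reconstruction term (the expected log-likelihood of the data given the latent representation) minus the [KL divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the approximate posterior and the prior of the latent representation. A higher lower bound indicates a better fit.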
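
As a minimal sketch of how the two terms combine, assume a Gaussian approximate posterior with independent dimensions and a standard normal prior, for which the KL divergence has a closed form. The function names and the numbers are made up for illustration, and scVAE's own computation may differ.

```python
import math


def gaussian_kl_to_standard_normal(means, log_variances):
    """KL(N(mean, variance) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(
        mean ** 2 + math.exp(log_variance) - 1.0 - log_variance
        for mean, log_variance in zip(means, log_variances))


def lower_bound(reconstruction_log_likelihood, kl_divergence):
    """Lower bound: reconstruction term minus KL divergence."""
    return reconstruction_log_likelihood - kl_divergence


# Two latent dimensions: one shifted away from the prior, one matching it.
kl = gaussian_kl_to_standard_normal(means=[1.0, 0.0],
                                    log_variances=[0.0, 0.0])
print(kl)                       # 0.5
print(lower_bound(-120.0, kl))  # -120.5
```

The KL term is zero only when the approximate posterior equals the prior, so it acts as a penalty for encoding information in the latent representation.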

To see if the model has been trained for long enough, we can plot the metrics against the epochs (learning curves):

In [None]:
scvae.visualisation.plot_learning_curves(vae_history)

If the learning curves are flattening out, the model has been trained for long enough.
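
"Flattening out" can also be checked numerically. The helper below is a hypothetical sketch that works on a plain list of per-epoch values (for example, the lower bound for each epoch). The window size and tolerance are arbitrary choices, and how to extract such a list from `vae_history` depends on scVAE and is not shown here.

```python
def has_plateaued(values, window=5, tolerance=1e-3):
    """Return True if the relative change over the last `window` epochs
    is at most `tolerance`."""
    if len(values) <= window:
        return False  # not enough epochs to judge
    previous, latest = values[-window - 1], values[-1]
    return abs(latest - previous) <= tolerance * abs(previous)


# Made-up lower-bound values for 11 epochs.
lower_bounds = [-500.0, -300.0, -250.0, -240.0, -238.0, -237.9,
                -237.85, -237.8, -237.8, -237.79, -237.79]

print(has_plateaued(lower_bounds[:7]))  # still improving quickly
print(has_plateaued(lower_bounds))      # changes have become small
```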

The model can be evaluated on the data set:

In [None]:
vae_evaluation = vae.evaluate(lupus, layer="counts")

To visualise the latent representation, we can map the data set to the latent space using the encoder:

In [None]:
vae_latent_representation = ad.AnnData(
    vae.encoder.predict(lupus, layer="counts"),
    obs=lupus.obs)

In [None]:
scvae.visualisation.plot_latent_representation(
    vae_latent_representation, annotation_name="cell_type", model=vae)

Here, the black ellipse shows the prior distribution of the latent representation.

We can also plot the latent representation using UMAP, but since this method does not generalise to points outside the data it was fitted on (see [transduction](https://en.wikipedia.org/wiki/Transduction_(machine_learning))), we cannot plot the prior distribution.

In [None]:
sc.pp.neighbors(vae_latent_representation)
sc.tl.umap(vae_latent_representation)

In [None]:
sc.pl.umap(vae_latent_representation, color="cell_type")

### Exercises

* Try training the model with different hyperparameters and compare the log-likelihood lower bounds.
* Visualise the latent representation using different cell annotations.
* Compare with the PCA and UMAP plots of the original data set.

## Gaussian-mixture VAE (GMVAE)

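The Gaussian-mixture VAE uses a mixture of several Gaussian components, instead of a single Gaussian, as the distribution of the latent representation. First, set the number of components: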

In [None]:
COMPONENT_COUNT = 7
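
A Gaussian mixture combines several Gaussian components, each with its own weight, mean, and standard deviation. The sketch below evaluates the log-density of a one-dimensional mixture. The function names and parameter values are made up for illustration, and scVAE's latent space has `LATENT_SIZE` dimensions rather than one.

```python
import math


def gaussian_log_density(z, mean, std):
    """Log-density of a one-dimensional Gaussian at z."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((z - mean) / std) ** 2)


def mixture_log_density(z, weights, means, stds):
    """Log-density of a Gaussian mixture, using a stable log-sum-exp."""
    terms = [math.log(weight) + gaussian_log_density(z, mean, std)
             for weight, mean, std in zip(weights, means, stds)]
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


# Two made-up components; the weights must sum to one.
weights = [0.3, 0.7]
means = [-2.0, 1.5]
stds = [0.5, 1.0]

for z in (-2.0, 0.0, 1.5):
    log_density = mixture_log_density(z, weights, means, stds)
    print(f"z={z:+.1f}: log-density={log_density:.3f}")
```

With several components, the density can have several modes, which allows different groups of cells to cluster around different components.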

Initialise, compile, and train a GMVAE model:

In [None]:
gmvae = scvae.models.VariationalAutoEncoder(
    original_dim=FEATURE_SIZE,
    intermediate_dims=HIDDEN_SIZES,
    latent_dim=LATENT_SIZE,
    likelihood_name=LIKELIHOOD_NAME,
    approximate_posterior_name="gaussian_mixture",
    prior_name="gaussian_mixture",
    mixture_component_size=COMPONENT_COUNT)
optimiser = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
gmvae.compile(optimizer=optimiser)

In [None]:
gmvae_history = gmvae.fit(lupus, layer="counts", epochs=50)

In [None]:
scvae.visualisation.plot_learning_curves(gmvae_history)

Evaluate the GMVAE model and visualise the latent representation:

In [None]:
gmvae_evaluation = gmvae.evaluate(lupus, layer="counts")

In [None]:
gmvae_latent_values, gmvae_latent_category_logits = gmvae.encoder.predict(
    lupus, layer="counts")
gmvae_latent_representation = ad.AnnData(
    gmvae_latent_values,
    obs=lupus.obs)
gmvae_latent_representation.obs["latent_category"] = (
    gmvae_latent_category_logits.argmax(axis=-1))
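
Taking the `argmax` of the logits assigns each cell to its most probable component. The sketch below uses made-up logits and annotations to show that the same assignment is obtained from the softmax probabilities, and how the assignments could be counted against a cell annotation. The helper names are hypothetical, and the sketch assumes that the logits are unnormalised log-probabilities.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert logits to probabilities that sum to one."""
    largest = max(logits)
    exponentials = [math.exp(logit - largest) for logit in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for three cells and three mixture components.
logits_per_cell = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.5, 1.4, -2.0]]
cell_types = ["B", "T", "B"]  # made-up annotations

categories = [argmax(logits) for logits in logits_per_cell]
probabilities = [softmax(logits) for logits in logits_per_cell]

# Count how often each (cell type, latent category) pair occurs.
crosstab = Counter(zip(cell_types, categories))
print(categories)
print(dict(crosstab))
```

The probabilities also show how confident each assignment is: the third cell is split almost evenly between the first two components, which the `argmax` alone hides.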

PCA:

In [None]:
scvae.visualisation.plot_latent_representation(
    gmvae_latent_representation, annotation_name="cell_type", model=gmvae)

The ellipses show the individual Gaussian components, and the plot to the right shows their corresponding contribution to the total distribution.

UMAP:

In [None]:
sc.pp.neighbors(gmvae_latent_representation)
sc.tl.umap(gmvae_latent_representation)

In [None]:
sc.pl.umap(gmvae_latent_representation, color="cell_type")

### Exercises

* Try training the model with different hyperparameters and compare the log-likelihood lower bounds.
* Visualise the latent representation using different cell annotations.
* Compare with the PCA and UMAP plots of the original data set and the VAE latent representation.
* Compare the log-likelihood lower bound with the one for the VAE.