# Exploratory analysis of single-cell data with SAUCIE

In this notebook, we will use SAUCIE, a multitasking neural network that can be used for visualization, clustering, batch correction, and denoising of single-cell data. We will apply it once again to the Shekhar et al. retinal bipolar data.

## 1. Imports

In [None]:
!pip install scprep
!pip install tensorflow==1.12.0

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

import sklearn.decomposition
import scprep
import sys
%matplotlib inline

SAUCIE is not available on PyPI, but we can download it from GitHub and add it to our Python path to run it without any further installation.

In [None]:
# download SAUCIE from GitHub
!git clone https://github.com/KrishnaswamyLab/SAUCIE.git

In [None]:
# add SAUCIE to the python path
sys.path.append('./SAUCIE/')
import SAUCIE

## 2. Loading the retinal bipolar data

We'll use the same retinal bipolar data you saw in preprocessing and visualization.

Alternatively, you may load your own data by replacing the Google Drive file ids with your own file ids.

Note that if you do, you will likely not have annotated cell type labels yet. Replace all references to `metadata['CELLTYPE']` with another column from `metadata`, or with your favorite gene. Parts of this notebook are only applicable if you have multiple batches, which you should encode in `metadata['sample_id']` as integers and in `metadata['sample_name']` as strings.

In [None]:
scprep.io.download.download_google_drive("1GYqmGgv-QY6mRTJhOCE1sHWszRGMFpnf", "data.pickle.gz")
scprep.io.download.download_google_drive("1q1N1s044FGWzYnQEoYMDJOjdWPm_Uone", "metadata.pickle.gz")

In [None]:
data_raw = pd.read_pickle("data.pickle.gz")
data_raw.head()

Technically, the retinal bipolar data comes from six separate sequencing runs with subtle batch effects between them. For simplicity, we'll treat these six runs as coming from two batches: runs 1-3 and runs 4-6.

In [None]:
metadata = pd.read_pickle("metadata.pickle.gz")
# the batch ids are in the cell barcode names
metadata['batch'] = [int(index[7]) for index in metadata.index]
# for simplicity, we'll split the six batches into two groups -- 1-3 and 4-6
metadata['sample_id'] = np.where(metadata['batch'] < 4, 1, 2)
metadata['sample_name'] = np.where(metadata['batch'] < 4, 'Samples 1-3', 'Samples 4-6')
metadata.head()
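To make the batch grouping above concrete, here is a minimal sketch on a handful of made-up barcodes. The barcode strings are hypothetical; the only assumption carried over from the cell above is that the run number is the character at position 7 of each barcode name.

```python
import numpy as np

# Hypothetical barcodes for illustration only; the run number sits at index 7
barcodes = ["Bipolar1_AAAC", "Bipolar3_CCGT", "Bipolar4_GGTA", "Bipolar6_TTAG"]

# extract the sequencing run from each barcode name
batch = np.array([int(name[7]) for name in barcodes])

# collapse the six runs into two groups: 1-3 and 4-6
sample_id = np.where(batch < 4, 1, 2)
sample_name = np.where(batch < 4, "Samples 1-3", "Samples 4-6")

print(batch)
print(sample_id)
print(sample_name)
```

If your own barcodes encode the batch differently, only the list comprehension needs to change.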

## 3. Preparing the data

Data for input to neural networks should generally have 100 or fewer input dimensions. If you have more than that, you should run PCA. If you have fewer, you should ensure that each of your data features is roughly normally distributed with mean 0 and standard deviation 1.

In [None]:
pca_op = sklearn.decomposition.PCA(100)
data = pca_op.fit_transform(data_raw)
data

In [None]:
n_features = data.shape[1]
n_features

In [None]:
scprep.plot.scatter2d(data, c=metadata['sample_name'], ticks=False, label_prefix="PC", legend_title="Batch")
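For data that already has fewer than 100 features, the standardization mentioned at the top of this section might look like the sketch below. The toy matrix and its scales are invented for illustration. Note that z-scoring only fixes the mean and standard deviation; a strongly skewed feature may additionally need a transform (for example a log or square root) before it looks roughly normal.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy matrix: 500 cells x 5 features on very different scales (made-up values)
toy = rng.normal(loc=[0, 5, -3, 100, 0.1],
                 scale=[1, 10, 0.5, 50, 0.01],
                 size=(500, 5))

# z-score each feature (column): subtract its mean, divide by its standard deviation
toy_standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_standardized.mean(axis=0).round(3))
print(toy_standardized.std(axis=0).round(3))
```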

## 4. Running SAUCIE for visualization

SAUCIE contains a convenience class `SAUCIE.Loader` which allows us to easily send our data to the SAUCIE model without worrying about the underlying neural network mechanics. We'll build two of these -- one for training the model, which randomizes the order of the data points, and one for evaluating the model, which sends the points through in order so we can compare them to our metadata object.

In [None]:
# in training: get random order
loader_train = SAUCIE.Loader(data, labels=metadata['sample_id'], shuffle=True)
# to evaluate: get same order, so we know which row is which
loader_eval = SAUCIE.Loader(data, labels=metadata['sample_id'], shuffle=False)
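The difference between the two loaders can be illustrated with a small stand-in written in plain NumPy. This is not SAUCIE's implementation: `iter_batches` and its parameters are hypothetical names used only to show why training shuffles the rows while evaluation keeps them in order.

```python
import numpy as np

def iter_batches(data, labels, shuffle, batch_size=4, seed=0):
    """Yield (data, labels) minibatches; a hypothetical stand-in for a data loader."""
    order = np.arange(len(data))
    if shuffle:
        # training: visit the rows in a random order
        order = np.random.default_rng(seed).permutation(len(data))
    for start in range(0, len(data), batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

toy_data = np.arange(20).reshape(10, 2)
toy_labels = np.array([1] * 5 + [2] * 5)

# evaluation: rows come back in their original order, so row i matches metadata row i
eval_rows = np.concatenate(
    [x for x, _ in iter_batches(toy_data, toy_labels, shuffle=False)])
# training: the same rows, visited in a permuted order
train_rows = np.concatenate(
    [x for x, _ in iter_batches(toy_data, toy_labels, shuffle=True)])

print(np.array_equal(eval_rows, toy_data))
```

Any output taken from a shuffled loader can no longer be matched row by row against `metadata`, which is why embeddings and reconstructions should come from the ordered loader.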

Each time we run SAUCIE, we should call `tf.reset_default_graph()` to clear any previous runs. Then all we need to do to actually run SAUCIE is to create the model `SAUCIE.SAUCIE(n_features)` and run `model.train(loader)`!

In [None]:
# clear the computational graph
tf.reset_default_graph()
# build the SAUCIE model
model = SAUCIE.SAUCIE(n_features)
# train the model!
model.train(loader_train, steps=1000)

# get the visualization layer
embedding, _ = model.get_embedding(loader_eval)

# plot the results
scprep.plot.scatter2d(embedding, c=metadata['sample_name'], ticks=False, label_prefix="SAUCIE")

### Exercise - examine the visualization

Try coloring the SAUCIE visualization by features in the `metadata` data frame, or by gene expression from `data_raw`. Compare this to what you saw when visualizing the same dataset with PCA, PHATE, t-SNE or UMAP.

In [None]:
# =======
# plot `embedding` colored by the feature or meta-feature of your choice
scprep.plot.scatter2d(
# =======

In [None]:
# ======
# run the visualization of your choice
# you may need to install PHATE or UMAP (`!pip install phate` or `!pip install umap-learn`)
embedding_op = 
alt_embedding = embedding_op.fit_transform(data)
# plot `alt_embedding` colored by the sample labels and the feature or meta-feature of your choice
scprep.plot.scatter2d(
# ======

### Discussion

1. What do you notice about the SAUCIE visualization?
2. How does the visualization compare to UMAP and PHATE?
3. When do you think it would be useful to use SAUCIE instead of any other visualization technique?

## 5. Running SAUCIE for batch correction

### Characterizing the batch effect

We noticed in the visualization above that there is a small but noticeable difference between the batches. Let's take a look at the differences between batches to understand the batch effect.

In [None]:
# Rank genes by differential expression using the t-statistic between the two sample groups
de_results = scprep.stats.differential_expression(data_raw.loc[metadata['sample_name'] == 'Samples 1-3'],
                                                  data_raw.loc[metadata['sample_name'] == 'Samples 4-6'],
                                                  measure='ttest')
de_results.iloc[0:20,:]
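As a rough picture of what a t-statistic ranking does, the sketch below computes Welch's (unequal-variance) t-statistic per gene on toy data. It assumes that `measure='ttest'` corresponds to a statistic of this kind; check the scprep documentation for the exact definition. The function name `welch_t` and the toy values are illustrative.

```python
import numpy as np

def welch_t(x, y):
    """Welch's t-statistic for each column of x versus the same column of y."""
    mean_diff = x.mean(axis=0) - y.mean(axis=0)
    # unbiased per-group variances, each scaled by its group size
    se = np.sqrt(x.var(axis=0, ddof=1) / len(x) + y.var(axis=0, ddof=1) / len(y))
    return mean_diff / se

rng = np.random.default_rng(0)
# toy expression: 300 and 400 cells x 3 genes; only gene 0 differs between groups
group_1 = rng.normal([0.0, 1.0, 2.0], 1.0, size=(300, 3))
group_2 = rng.normal([2.0, 1.0, 2.0], 1.0, size=(400, 3))

t_stats = welch_t(group_1, group_2)
# rank genes from most to least differentially expressed
ranking = np.argsort(-np.abs(t_stats))
print(t_stats.round(2), ranking)
```

A large absolute t-statistic means the difference in group means is large relative to the variability within each group.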

In [None]:
fig, axes = plt.subplots(4, 5, figsize=(20, 16))
for gene, ax in zip(de_results.index, axes.flatten()):
    scprep.plot.histogram([
        data_raw.loc[metadata['sample_name'] == 'Samples 1-3', gene],
        data_raw.loc[metadata['sample_name'] == 'Samples 4-6', gene],
    ], color=['red', 'blue'], ax=ax, title=gene, log='y')

scprep.plot.tools.generate_legend({'Samples 1-3':'red', 'Samples 4-6':'blue'}, 
                                  ax=axes[0,-1], fontsize=14)
plt.tight_layout()

There seem to be two different types of genes here: one set in which Samples 1-3 have nearly zero expression (_Xist, Platr17, Smim10l1, Tsix,_ etc.) and another in which one group of samples has systematically higher expression (_BC033916, 2700089E24Rik, Rsrp1,_ etc.). It's worth noting that many genes in this second group are poorly characterized in the literature. We should also be careful when we see a gene like _Xist_ in a list of differentially expressed genes: it is expressed almost exclusively in female cells, so the difference may reflect the sex of the animals in each batch rather than a technical artifact.

Let's assume that we want to correct this batch effect. For more discussion of why you might _not_ want to correct it, see our materials on batch correction.

### Correcting the batch effect

In order to run batch correction with SAUCIE, we can run SAUCIE in the same way as before, but using the keyword argument `lambda_b` (the MMD coefficient, which is set to 0 by default). The larger the coefficient, the more batch correction SAUCIE will apply.
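MMD stands for maximum mean discrepancy, a kernel-based distance between two distributions. The sketch below estimates it between two toy "batches" to show what the penalty measures. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions made for illustration; this is not SAUCIE's actual loss.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
batch_a = rng.normal(0.0, 1.0, size=(200, 2))
batch_b_same = rng.normal(0.0, 1.0, size=(200, 2))     # same distribution
batch_b_shifted = rng.normal(1.5, 1.0, size=(200, 2))  # shifted: a toy "batch effect"

print(mmd2(batch_a, batch_b_same))     # close to zero
print(mmd2(batch_a, batch_b_shifted))  # clearly larger
```

Penalizing a quantity like this during training pushes the two batches toward the same distribution, and a larger coefficient gives that penalty more weight relative to reconstruction.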

In [None]:
# compile the tf computations for saucie
tf.reset_default_graph()
model = SAUCIE.SAUCIE(n_features, lambda_b=1)

# train the model
model.train(loader_train, steps=2000)

### Examining the corrected data

Now we can obtain the reconstructed data from SAUCIE, and see how well we have corrected the batch effect by looking at those same differentially expressed genes from before.

In [None]:
# get the output of SAUCIE
data_reconstructed, _ = model.get_reconstruction(loader_eval)

# invert PCA to get the reconstructed data in the ambient gene space
data_raw_reconstructed = pca_op.inverse_transform(data_reconstructed)
data_raw_reconstructed = pd.DataFrame(data_raw_reconstructed, index=data_raw.index, columns=data_raw.columns)

# plot the same genes that were differentially expressed in the original data
fig, axes = plt.subplots(4, 5, figsize=(20, 16))
for gene, ax in zip(de_results.index, axes.flatten()):
    scprep.plot.histogram([
        data_raw_reconstructed.loc[metadata['sample_name'] == 'Samples 1-3', gene],
        data_raw_reconstructed.loc[metadata['sample_name'] == 'Samples 4-6', gene],
    ], color=['red', 'blue'], ax=ax, title=gene, log='y')

scprep.plot.tools.generate_legend({'Samples 1-3':'red', 'Samples 4-6':'blue'}, 
                                  ax=axes[0,-1], fontsize=14)
plt.tight_layout()

Note that the differences we saw earlier are now gone.
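The PCA inversion used in the cell above maps points from principal component space back to gene space. A minimal NumPy sketch of the forward and inverse transforms on toy data follows; it assumes unwhitened PCA, where the inverse is a projection back through the same components plus the feature means. The toy matrix is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "expression" matrix: 200 cells x 50 genes driven by 3 latent factors plus noise
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 50))
toy = latent @ loadings + 0.05 * rng.normal(size=(200, 50))

# PCA via SVD of the centered data
mean = toy.mean(axis=0)
_, _, vt = np.linalg.svd(toy - mean, full_matrices=False)
components = vt[:3]                       # top 3 principal axes (3 x 50)

scores = (toy - mean) @ components.T      # forward transform: 200 x 3
toy_back = scores @ components + mean     # inverse transform: 200 x 50

# relative reconstruction error; small here because 3 factors explain most variance
rel_error = np.linalg.norm(toy - toy_back) / np.linalg.norm(toy - mean)
print(rel_error)
```

The inverse transform is lossy: any variation outside the retained components is discarded, so the gene-space values recovered after SAUCIE are smoothed versions of the originals even before batch correction is taken into account.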

We can also examine the batch effect in the visualization space. Here we look at the visualization from SAUCIE as it was correcting the batch effect.

In [None]:
# visualize the data
batch_correction_embedding, _ = model.get_embedding(loader_eval)

scprep.plot.scatter2d(batch_correction_embedding, c=metadata['sample_name'], ticks=False, label_prefix="SAUCIE")

### Exercise - rerunning SAUCIE on batch corrected data

You'll notice that the embedding is far less granular than the one we saw at the beginning -- this is due to the additional constraint on the network enforced by the batch correction regularization. We can improve this somewhat by running SAUCIE again to visualize the reconstructed data.

In [None]:
# =======
# create loaders for batch-corrected data
loader_train = SAUCIE.Loader(
    data = , 
    labels = , 
    shuffle=True
)
loader_eval = SAUCIE.Loader(
    data = , 
    labels = , 
    shuffle=False
)

# compile tf computations for saucie
tf.reset_default_graph()

# build and run the model
model = 
model.train(load = , 
            steps=1000)
# =======

# look at the embedding layer
reconstructed_embedding, _ = model.get_embedding(loader_eval)

# plot the output
scprep.plot.scatter2d(reconstructed_embedding, 
                      c=metadata['sample_name'],
                      ticks=False, label_prefix="SAUCIE", figsize=(4,4))
scprep.plot.scatter2d(reconstructed_embedding, 
                      c=metadata['CELLTYPE'],
                      ticks=False, label_prefix="SAUCIE", figsize=(10,4), legend_anchor=(1,1))

### Discussion

1. What do you notice about the gene expression pre- and post-batch correction?
2. What do you notice about the SAUCIE visualizations from the batch correcting model, and from the secondary model that we ran on batch corrected data?
3. When might you use SAUCIE for batch correction instead of other methods like MNN?

## 6. Running SAUCIE for clustering

In order to run clustering with SAUCIE, we can run SAUCIE in the same way as before, but using the keyword arguments `lambda_c` (the ID regularization coefficient, which is set to 0 by default) and `lambda_d` (the intra-cluster distance coefficient, which is set to 0 by default). The larger we set `lambda_c`, the stronger the binary assignments will be; the larger we set `lambda_d`, the more SAUCIE will expect clusters to be distinct from each other.

For the clustering to work well, we should scale the data to range between -10 and 10.

In [None]:
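# rescale the data to lie within [-10, 10] for better clustering
# (dividing by the largest absolute value bounds both ends of the range)
data_scaled = data / np.abs(data).max() * 10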
loader_train = SAUCIE.Loader(data_scaled, labels=metadata['sample_id'], shuffle=True)
loader_eval = SAUCIE.Loader(data_scaled, labels=metadata['sample_id'], shuffle=False)
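When scaling to a symmetric range such as -10 to 10, the choice of divisor matters. The toy example below, with invented values, compares dividing by the maximum with dividing by the largest absolute value.

```python
import numpy as np

# toy PCA-like matrix whose most negative value is larger in magnitude than its maximum
toy = np.array([[-8.0, 1.0],
                [ 2.0, -0.5],
                [ 4.0, 0.25]])

# dividing by the maximum pins the largest value at 10 but leaves the minimum unbounded
scaled_by_max = toy / toy.max() * 10

# dividing by the largest absolute value keeps every entry within [-10, 10]
scaled_by_absmax = toy / np.abs(toy).max() * 10

print(scaled_by_max.min(), scaled_by_max.max())        # -20.0 10.0
print(scaled_by_absmax.min(), scaled_by_absmax.max())  # -10.0 5.0
```

Principal component scores are centered at zero but not symmetric, so it is worth checking `data_scaled.min()` and `data_scaled.max()` after rescaling.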

In [None]:
# compile the tf computations for the clustering model
tf.reset_default_graph()
model = SAUCIE.SAUCIE(n_features, lambda_c=.1, lambda_d=.5)

# train the clustering model
model.train(loader_train, steps=5000)

# get the clusters out
_, clusters = model.get_clusters(loader_eval, binmin=10)
cluster_embedding, _ = model.get_embedding(loader_eval)

In [None]:
scprep.plot.scatter2d(cluster_embedding, c=clusters, ticks=False, label_prefix="SAUCIE", 
                      discrete=True, figsize=(10, 4), legend_anchor=(1,1))
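To build intuition for the "binary assignments" mentioned at the start of this section, the sketch below turns a small matrix of layer activations into cluster labels by thresholding each unit and grouping cells that share the same on/off pattern. This is an illustration under stated assumptions, not SAUCIE's code: `codes_to_clusters`, the 0.5 threshold, and the activations are all hypothetical, and `min_size` is assumed to play a role similar to `binmin` (small groups are marked as unassigned with -1).

```python
import numpy as np

def codes_to_clusters(activations, threshold=0.5, min_size=2):
    """Group rows by their binarized activation pattern; tiny groups become -1."""
    binary = (activations > threshold).astype(int)
    # each distinct binary pattern is one candidate cluster
    _, labels = np.unique(binary, axis=0, return_inverse=True)
    labels = labels.reshape(-1)
    counts = np.bincount(labels)
    # mark clusters with fewer than min_size members as unassigned
    return np.where(counts[labels] >= min_size, labels, -1)

# toy activations: 5 cells x 3 units (made-up values)
acts = np.array([[0.90, 0.1, 0.8],
                 [0.95, 0.2, 0.7],
                 [0.10, 0.9, 0.1],
                 [0.20, 0.8, 0.3],
                 [0.90, 0.9, 0.9]])

toy_clusters = codes_to_clusters(acts)
print(toy_clusters)  # [ 1  1  0  0 -1]
```

In this picture, activations that sit close to 0 or 1 give stable patterns, while activations near the threshold make cluster membership sensitive to small changes, which is one way to think about what a stronger `lambda_c` buys.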

### Exercise - understanding SAUCIE clustering

In groups, explore the effect of `lambda_c` and `lambda_d` on SAUCIE's clustering.

1. Pick one of these two coefficients to hold constant while you vary the other
2. Pick one value larger than what we used above and one value smaller.
3. Visualize the cluster assignments on both the SAUCIE visualization and another visualization of your choice (e.g. PHATE, UMAP).

In [None]:
# =======
# pick new values for lambda_c and lambda_d
lambda_c = 
lambda_d = 
# =======

# compile the tf computations for the clustering model
tf.reset_default_graph()
model = SAUCIE.SAUCIE(n_features, lambda_c=lambda_c, lambda_d=lambda_d)

# train the clustering model
model.train(loader_train, steps=5000)

# get the clusters out
_, clusters = model.get_clusters(loader_eval, binmin=10)
cluster_embedding, _ = model.get_embedding(loader_eval)

# ======
# run the visualization of your choice
# you may need to install PHATE or UMAP (`!pip install phate` or `!pip install umap-learn`)
embedding_op = 
alt_embedding = embedding_op.fit_transform(data)
# ======

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))
scprep.plot.scatter2d(embedding, c=clusters, ticks=False, label_prefix="SAUCIE", 
                      ax=ax1, discrete=True, legend=False, title="Default SAUCIE Embedding")
scprep.plot.scatter2d(cluster_embedding, c=clusters, ticks=False, label_prefix="SAUCIE", 
                      ax=ax2, discrete=True, legend=False, title="Clustered SAUCIE Embedding")
scprep.plot.scatter2d(alt_embedding, c=clusters, ticks=False, label_prefix="Alt. Embedding ", 
                      ax=ax3, discrete=True, legend=False, title="Alternative Embedding")
plt.tight_layout()

### Discussion

1. How does `lambda_c` affect the clustering output?
2. How does `lambda_d` affect the clustering output?
3. How does SAUCIE's clustering compare to the other clustering algorithms we have learned about?
4. Do `lambda_c` and `lambda_d` affect the SAUCIE visualization? How might you mitigate this?
5. When might you choose to use SAUCIE for clustering?