# Exercise 5 (Hands-on): Single-cell RNA Sequencing and Deep Clustering

This notebook demonstrates the implementation of a deep learning-based clustering pipeline for single-cell RNA sequencing (scRNA-seq) data.
The workflow includes:
1. Data preprocessing for PBMC gene expression data.
2. Dimensionality reduction using an autoencoder.
3. Clustering cells with k-means.

Clustering quality is evaluated with the Rand Index (RI) and the Adjusted Rand Index (ARI).

## Install requirements
This section installs all necessary Python libraries and dependencies for running the notebook.

In [None]:
%pip install -U --quiet scikit-learn==1.3.2 torch==2.5.1 matplotlib==3.9.2 umap-learn==0.5.7 requests seaborn

## Import necessary libraries
We import the libraries needed for data manipulation, model building, and evaluation.

In [None]:
# For file handling, data fetching, and extraction.
import os
from requests import get
import tarfile

# For numerical operations
import numpy as np
import pandas as pd
import seaborn as sns

# For building and training deep learning models
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

# For clustering (KMeans), evaluation metrics (Rand index, adjusted Rand index,
# NMI, silhouette score), and data splitting utilities
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, adjusted_rand_score, normalized_mutual_info_score, silhouette_score
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# For timing, label encoding, and visualising the final results
import time
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import umap

# For filesystem paths
from pathlib import Path

## Configurable Constants
Define key constants for the notebook, such as file paths, model parameters, or dataset configurations.

In [None]:
# DATASET_URL: URL to download the compressed gene expression dataset.
DATASET_URL = "https://github.com/BackofenLab/ML_LS_resources/raw/refs/heads/master/exercise_5_scrna_deep_clustering_hands_on/data/gene_expression.tar.xz"

# BASE_DATA_DIR: Directory where all data is stored.
# RAW_DATASET_DIR: Path to the compressed archive (a file, despite the name).
# DATA_DIR: Directory that holds the extracted gene expression dataset.
BASE_DATA_DIR = Path("./data")
RAW_DATASET_DIR = BASE_DATA_DIR / "gene_expression.tar.xz"
DATA_DIR = BASE_DATA_DIR / "gene_expression"

# EPOCH_NUM: Number of epochs for training the model.
EPOCH_NUM = 20

## Setup GPU
Check for GPU availability and run the computations on CUDA if a compatible GPU is available; otherwise fall back to the CPU.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

## Extract the data
Extract the scRNA-seq dataset, which contains gene expression data
for 32,738 genes from 2,700 peripheral blood mononuclear cells (PBMCs).
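
The extraction cell below expects the archive to already exist at `RAW_DATASET_DIR`, so it has to be downloaded first. A minimal sketch of that step, assuming the `requests` library; `download_if_missing` and its `fetch` parameter are illustrative names, not part of the original notebook:

```python
from pathlib import Path

import requests


def download_if_missing(url, dest, fetch=requests.get):
    """Download `url` to `dest` unless the file already exists.

    `fetch` is a hypothetical hook that defaults to `requests.get`;
    it only exists so the function can be exercised without network access.
    """
    dest = Path(dest)
    if dest.exists():
        return dest  # Reuse the cached file instead of downloading again.
    dest.parent.mkdir(parents=True, exist_ok=True)
    response = fetch(url, timeout=60)
    response.raise_for_status()  # Fail loudly on HTTP errors.
    dest.write_bytes(response.content)
    return dest
```

In this notebook the call would presumably be `download_if_missing(DATASET_URL, RAW_DATASET_DIR)`.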

In [None]:
# Extract the gene expression dataset if not already extracted
# (assumes the archive unpacks into DATA_DIR).
if not DATA_DIR.exists():
    with tarfile.open(RAW_DATASET_DIR, mode="r:xz") as tar:
        tar.extractall(path=BASE_DATA_DIR)

## Data Preprocessing
Load, preprocess, and prepare the scRNA-seq dataset for model training and evaluation.

In [None]:
# Define a custom Dataset class for loading and preprocessing the gene expression data.
class RNADataset(Dataset):
    def __init__(self, dataset_path, label_path):
        # Load the dataset and corresponding labels.
        self.x_values, self.y_values = self.get_dataset(dataset_path, label_path)

        # TODO: Set the input shape and number of distinct labels for further use.

    def get_dataset(self, dataset_path, label_path):
        # TODO: Load gene expression matrix from the dataset path using numpy.

        # TODO: Load labels from the label file.

        # TODO: Remove genes with zero expression across all cells.

        return dataset, labels

    def return_y(self):
        # Return the labels.
        return self.y_values

    def return_x(self):
        # Return the dataset.
        return self.x_values

    def __len__(self):
        # Return the total number of samples.
        return len(self.y_values)

    def __getitem__(self, idx):
        # Return a specific sample and its corresponding label.
        return self.x_values[idx], self.y_values[idx]
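
The gene-filtering TODO in `get_dataset` can be illustrated on a toy matrix. This is only a sketch: it assumes cells are stored as rows and genes as columns, and the counts are made up.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns). Illustrative values only.
counts = np.array([
    [0, 3, 0, 1, 0],
    [0, 0, 0, 2, 0],
    [0, 5, 0, 0, 0],
    [0, 1, 0, 4, 0],
])

# A gene is kept if it is expressed (non-zero) in at least one cell.
expressed = (counts != 0).any(axis=0)
filtered = counts[:, expressed]

print("Kept genes:", np.flatnonzero(expressed))
print("Shape before:", counts.shape, "after:", filtered.shape)
```

If the real matrix is stored as genes x cells instead, the same idea applies with `axis=1` and row indexing.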

In [None]:
# TODO: Load and check the shape of the raw gene expression data.

In [None]:
# TODO: Instantiate the RNADataset class

print("Shape of the filtered dataset:", dataset.return_x().shape)

In [None]:
# TODO: Initialize lists to store loss metrics

# TODO: Define a stratified split for cross-validation
# StratifiedShuffleSplit ensures proportional representation of labels in the splits

# TODO: Split the dataset into training and testing sets
# Using train_test_split to split data indices into train and test sets with a fixed random state

# TODO: Assign indices to training and test/validation splits
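
One way to obtain a split that preserves label proportions is `train_test_split` with its `stratify` argument. The sketch below uses synthetic labels and an assumed 70/30 split; the ratio and the variable names are illustrative, not prescribed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for cell types (60/30/10).
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
indices = np.arange(len(labels))

# Split the indices, not the data, so they can later feed samplers.
train_idx, test_val_idx = train_test_split(
    indices,
    test_size=0.3,      # Assumed ratio for illustration.
    stratify=labels,    # Keep the class proportions in both parts.
    random_state=42,    # Fixed seed for reproducibility.
)

print("Train class counts:", np.bincount(labels[train_idx]))
print("Test/val class counts:", np.bincount(labels[test_val_idx]))
```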

In [None]:
print("Number of cell types: {}".format(len(np.unique(dataset.return_y()))))

## Define the Autoencoder
This section defines the autoencoder architecture, which will be used for dimensionality reduction of the scRNA-seq data.
The compressed representation learned by the autoencoder will serve as input for the clustering algorithm.

In [None]:
# TODO: Define an Autoencoder (AE) class (hint: use nn.Module as in the previous exercises)
class AE(nn.Module):
    def __init__(self, input_shape):
        super().__init__()
        # TODO: define the encoder: Maps input features to a lower-dimensional representation (latent space)

        # TODO: define the decoder: Reconstructs the input features from the latent space

    def forward(self, features):
        # TODO: Encoder forward pass

        # TODO: Decoder forward pass

        reconstructed = self.decoder_output_layer(activation)  # Reconstructed output

        return reconstructed, code
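
For orientation, here is a minimal sketch of a fully connected autoencoder that matches the skeleton's interface (it returns `reconstructed, code`). Only `decoder_output_layer` is taken from the skeleton; the class name `AESketch`, the other layer names, and the layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class AESketch(nn.Module):
    """Small fully connected autoencoder with illustrative layer sizes."""

    def __init__(self, input_shape, hidden_dim=128, latent_dim=32):
        super().__init__()
        # Encoder: input features -> hidden layer -> latent code.
        self.encoder_hidden_layer = nn.Linear(input_shape, hidden_dim)
        self.encoder_output_layer = nn.Linear(hidden_dim, latent_dim)
        # Decoder: latent code -> hidden layer -> reconstructed features.
        self.decoder_hidden_layer = nn.Linear(latent_dim, hidden_dim)
        self.decoder_output_layer = nn.Linear(hidden_dim, input_shape)

    def forward(self, features):
        activation = torch.relu(self.encoder_hidden_layer(features))
        code = torch.relu(self.encoder_output_layer(activation))
        activation = torch.relu(self.decoder_hidden_layer(code))
        reconstructed = self.decoder_output_layer(activation)
        return reconstructed, code


model = AESketch(input_shape=50)
reconstructed, code = model(torch.randn(8, 50))
print(reconstructed.shape, code.shape)
```

The latent `code` is what would later be passed to k-means; its dimensionality is a tunable choice.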

In [None]:
# TODO: Initialize the Autoencoder

# TODO: Define the optimizer and loss function

In [None]:
# TODO: Split the test/validation indices into validation and test sets

# TODO: Set a fixed random seed for reproducibility

# TODO: Create samplers for different data splits

In [None]:
# TODO: Create DataLoader objects for efficient data batch processing (total, train, validation, test)

# TODO: Initialize a variable to track the best validation loss (used for early stopping or model evaluation)
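
The sampler and DataLoader TODOs can be sketched with `SubsetRandomSampler`, which draws only from a given list of indices. The toy dataset, the index ranges, the batch size, and the helper `make_loader` are all assumptions for illustration; in the notebook the indices would come from the stratified split.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

torch.manual_seed(0)  # Fixed seed so the sampling order is reproducible.

# Toy dataset: 100 samples with 20 features and 3 classes (illustrative values).
toy_dataset = TensorDataset(torch.randn(100, 20), torch.randint(0, 3, (100,)))

# Hypothetical index splits.
train_idx = list(range(0, 70))
val_idx = list(range(70, 85))
test_idx = list(range(85, 100))


def make_loader(indices, batch_size=16):
    """Build a DataLoader that only visits the given indices, in random order."""
    sampler = SubsetRandomSampler(indices)
    return DataLoader(toy_dataset, batch_size=batch_size, sampler=sampler)


train_loader = make_loader(train_idx)
val_loader = make_loader(val_idx)
test_loader = make_loader(test_idx)

n_train = sum(len(x) for x, _ in train_loader)
n_val = sum(len(x) for x, _ in val_loader)
print("train samples:", n_train, "train batches:", len(train_loader))
```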

## Build, train, and evaluate the model
Define functions to build, train, and evaluate the autoencoder model.
The training process minimizes reconstruction loss to ensure the autoencoder effectively captures latent representations of the data.

In [None]:
def train_model(trainloader, net, criterion, optimizer, epoch):
    # TODO: Enable anomaly detection for debugging potential issues during backpropagation

    # TODO: initialize running loss and reset network gradients

    for i, data in enumerate(trainloader):  # Loop through the training data in batches
        input_data, target_data = data  # Extract inputs and targets
        loss = 0

        # TODO: Clear optimizer gradients, perform forward pass, compute loss, backpropagate, and update parameters

    # Return the trained model and average loss over all batches
    return net, running_loss / len(trainloader)
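
The per-batch TODO follows the usual PyTorch pattern: clear gradients, forward pass, loss, backward pass, optimizer step. The sketch below shows that pattern on toy data with a stand-in model that returns only the reconstruction. The function name `train_one_epoch`, the model, and the hyperparameters are assumptions; with the notebook's `AE`, the forward call would unpack a tuple instead (`reconstructed, _ = net(inputs)`).

```python
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(loader, net, criterion, optimizer):
    """Run one pass over `loader` and return the mean batch loss."""
    net.train()
    running_loss = 0.0
    for inputs, _ in loader:
        optimizer.zero_grad()                    # Clear gradients from the previous step.
        reconstructed = net(inputs)              # Forward pass.
        loss = criterion(reconstructed, inputs)  # Reconstruction loss against the input.
        loss.backward()                          # Backpropagate.
        optimizer.step()                         # Update the parameters.
        running_loss += loss.item()
    return running_loss / len(loader)


torch.manual_seed(0)
data = torch.randn(64, 20)  # Toy data standing in for expression profiles.
loader = DataLoader(TensorDataset(data, torch.zeros(64)), batch_size=16)
net = nn.Sequential(nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20))  # Stand-in model.
optimizer = optim.Adam(net.parameters(), lr=1e-2)
criterion = nn.MSELoss()

losses = [train_one_epoch(loader, net, criterion, optimizer) for _ in range(20)]
print(f"first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```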

In [None]:
def eval_model(dataloader, net, criterion, epoch):
    # TODO: Define a function to evaluate the autoencoder model
    running_loss = 0  # Initialize running loss
    pearson_running_loss2 = 0  # Placeholder for additional loss metrics (if needed)

    # Lists to store input, output, and target data for analysis
    input_data_list = []
    output_data_list = []
    target_data_list = []

    # Return total loss and collected data
    return running_loss, output_data_list, input_data_list, target_data_list, pearson_running_loss2
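
Evaluation typically mirrors training without gradient tracking or parameter updates. A minimal sketch, assuming a model that returns only the reconstruction; the function name `evaluate` is hypothetical, and unlike the skeleton above it returns the mean batch loss rather than the total.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(loader, net, criterion):
    """Return the mean batch loss and all reconstructions, without updating `net`."""
    net.eval()
    running_loss = 0.0
    outputs = []
    with torch.no_grad():  # No gradients are needed for evaluation.
        for inputs, _ in loader:
            reconstructed = net(inputs)
            running_loss += criterion(reconstructed, inputs).item()
            outputs.append(reconstructed)
    return running_loss / len(loader), torch.cat(outputs)


# Sanity check: an identity "model" reconstructs its input perfectly.
data = torch.randn(32, 10)
loader = DataLoader(TensorDataset(data, torch.zeros(32)), batch_size=8)
mean_loss, outputs = evaluate(loader, nn.Identity(), nn.MSELoss())
print("mean loss:", mean_loss)
```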

In [None]:
# TODO: Train and validate the autoencoder model for the specified number of epochs
print("Training the model...")


In [None]:
# TODO: Evaluate the final trained model on the test set
# The `eval_model` function computes the test loss and collects output/input/target data

In [None]:
# TODO: Define function to extract latent embeddings from the autoencoder

In [None]:
# TODO: Define a function to perform k-means clustering

In [None]:
# TODO: Generate embeddings from the autoencoder's latent space

# TODO: Perform k-means clustering on the latent embeddings

# TODO: Evaluate k-means clustering performance using Rand Index and Adjusted Rand Index
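
The clustering and evaluation steps above can be tried out on synthetic data before plugging in real embeddings. In this sketch the "latent embeddings" are three artificial, well-separated blobs, so the scores are much higher than what real scRNA-seq data would be expected to give; all values and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for latent embeddings: 3 blobs with 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
true_labels = np.repeat([0, 1, 2], 50)
embeddings = centers[true_labels] + rng.normal(scale=0.5, size=(150, 2))

# Cluster with as many clusters as there are known cell types.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predicted = kmeans.fit_predict(embeddings)

ri = rand_score(true_labels, predicted)
ari = adjusted_rand_score(true_labels, predicted)
print(f"Rand Index: {ri:.3f}, Adjusted Rand Index: {ari:.3f}")
```

Both scores compare partitions, so it does not matter that k-means numbers its clusters differently from the true labels.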

## Analyze the results
Present and analyze the evaluation results, including:
- Test loss from the autoencoder.
- Clustering performance metrics (Rand Index and Adjusted Rand Index) for k-means clustering.

In [None]:
# TODO: Print evaluation results with improved formatting

- NMI: Normalized Mutual Information rescales the Mutual Information (MI) score to lie between 0 (no mutual information) and 1 (perfect agreement between the two labelings).

- ARI: The Adjusted Rand Index measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and the true clustering. The chance correction ensures a value close to 0.0 for random labelings, independently of the number of clusters and samples, and exactly 1.0 when the clusterings are identical.

- Silhouette Coefficient: For each sample, it is computed from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) as (b - a) / max(a, b), where b is the mean distance between the sample and the nearest cluster that the sample is not a part of. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, because a different cluster is more similar.
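
A small worked example may help make these three metrics concrete. The labels and points below are made up: the second labeling is the first with cluster names swapped, and the points form two tight, well-separated groups.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

true_labels = np.array([0, 0, 0, 1, 1, 1])
swapped_labels = np.array([1, 1, 1, 0, 0, 0])  # Same partition, different names.

# NMI and ARI compare partitions, so renaming clusters does not lower the score:
# both equal 1 here, up to floating-point error.
nmi = normalized_mutual_info_score(true_labels, swapped_labels)
ari = adjusted_rand_score(true_labels, swapped_labels)

# The silhouette score needs the data itself, not a second labeling.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
silhouette = silhouette_score(points, true_labels)

# Per-sample check for the first point: a = mean distance within its own cluster,
# b = mean distance to the other cluster, s = (b - a) / max(a, b).
a = np.mean([0.1, 0.2])
b = np.mean([5.0, 5.1, 5.2])
s_first = (b - a) / max(a, b)

print(f"NMI={nmi:.3f}, ARI={ari:.3f}, silhouette={silhouette:.3f}, s_first={s_first:.3f}")
```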

In [None]:
# TODO: Ensure running_losses and running_losses_val are detached tensors before plotting

# TODO: Plot training vs. validation loss
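
A sketch of the plotting step, assuming the per-epoch losses were collected in two lists that may contain either Python floats or tensors. The helper names `to_floats` and `plot_losses` and the loss values are illustrative only.

```python
import matplotlib.pyplot as plt
import torch


def to_floats(losses):
    """Convert floats or tensors (possibly on GPU or tracking gradients) to floats."""
    return [
        loss.detach().cpu().item() if torch.is_tensor(loss) else float(loss)
        for loss in losses
    ]


def plot_losses(train_losses, val_losses):
    """Plot training and validation loss per epoch and return the figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    epochs = range(1, len(train_losses) + 1)
    ax.plot(epochs, to_floats(train_losses), label="training loss")
    ax.plot(epochs, to_floats(val_losses), label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("reconstruction loss")
    ax.legend()
    return fig


# Made-up loss values, mixing tensors and floats, purely for illustration.
example_train = [torch.tensor(1.0, requires_grad=True) * 0.9, torch.tensor(0.6), 0.4]
example_val = [1.0, 0.7, 0.55]
fig = plot_losses(example_train, example_val)
```

A validation curve that starts rising while the training curve keeps falling is the usual sign of overfitting.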

#### Dimension reduction of preprocessed single-cell data using UMAP

In [None]:
# TODO: Perform UMAP on the original dimensions

In [None]:
# distinct colors for labelling clusters
color_list = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

# TODO: Store the UMAP embeddings in a DataFrame

# TODO: Add cell type labels to the DataFrame for visualization

# TODO: Visualize UMAP embeddings with original cell types

#### Dimension reduction of latent dimensions using UMAP

In [None]:
# TODO: Perform UMAP dimensionality reduction of latent dimensions

# TODO: Visualize UMAP embeddings with k-means cluster labels