# A Variational Autoencoder for Single Cell Transcriptomics in the CELLxGENE Dataset

This notebook complements the source code for a variational autoencoder (VAE) on the CELLxGENE Dataset. It is part of the course "Big Data Praktikum" at Leipzig University. In this notebook, we document our data pipeline and comment on the decisions we made and the experience we gained during the implementation. The notebook includes Python code for illustrative purposes, but the main training pipeline is contained in separate Python scripts optimized for deployment to HPC infrastructure.

## Background

### Motivation



### Dataset



### Variational autoencoders

In [1]:
from autoCell.data_loader import SingleCellDataset
import src
import cellxgene_census
import anndata
import torch
import pandas as pd
import numpy as np

## Data loading and preprocessing

### Accessing the CELLxGENE Dataset

In [4]:
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        "Homo sapiens",
        # obs_coords selects soma_joinids 0 to 100 and is combined with the value
        # filter below, so only matching cells inside that range are returned
        # (none here, hence n_obs = 0 in the output).
        obs_coords=slice(0, 100),
        # Lung tissue, normal or lung carcinoma samples, primary data only
        obs_value_filter=(
            "tissue_general == 'lung' and disease in ['normal', "
            "'lung adenocarcinoma', 'squamous cell lung carcinoma', "
            "'small cell lung carcinoma', 'non-small cell lung carcinoma', "
            "'pleomorphic carcinoma', 'lung large cell carcinoma'] "
            "and is_primary_data == True"
        ),
        # var_value_filter="feature_name in ['GAPDH', 'ACTB']",  # Specific genes
        obs_column_names=["cell_type", "tissue", "disease"],  # Minimal metadata
    )

print(adata)

The "stable" release is currently 2025-01-30. Specify 'census_version="2025-01-30"' in future calls to open_soma() to ensure data consistency.


AnnData object with n_obs × n_vars = 0 × 61888
    obs: 'cell_type', 'tissue', 'disease', 'tissue_general', 'is_primary_data'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'
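
The filter string passed to `obs_value_filter` above is long and easy to mistype. As a small sketch, it could be assembled from a list of disease labels. The helper name `build_obs_filter` and the constant `LUNG_DISEASES` are hypothetical and not part of `cellxgene_census` or of our source code; only the column names and labels are taken from the query above.

```python
def build_obs_filter(tissue_general, diseases, primary_only=True):
    """Assemble a value-filter string for the obs table (hypothetical helper)."""
    disease_list = ", ".join(f"'{d}'" for d in diseases)
    parts = [
        f"tissue_general == '{tissue_general}'",
        f"disease in [{disease_list}]",
    ]
    if primary_only:
        # Keep only primary data, as in the query above.
        parts.append("is_primary_data == True")
    return " and ".join(parts)


# Disease labels copied from the query above.
LUNG_DISEASES = [
    "normal",
    "lung adenocarcinoma",
    "squamous cell lung carcinoma",
    "small cell lung carcinoma",
    "non-small cell lung carcinoma",
    "pleomorphic carcinoma",
    "lung large cell carcinoma",
]

obs_filter = build_obs_filter("lung", LUNG_DISEASES)
print(obs_filter)
```

The resulting string can be passed as `obs_value_filter=obs_filter` to `cellxgene_census.get_anndata`.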


### Preprocessing pipeline

TODO: Describe the data preparation that we've done, point out important code snippets

In [13]:
## Highlights from the preprocessing

In [15]:
## Generate and describe dataset statistics

### Defining the PyTorch Dataset

In [2]:
dataset = SingleCellDataset(
    file_path="data.h5ad",
    cell_subset=list(range(1000)),
    log_transform=True,
    normalize=True,
    scale_factor=1.0
)

Dataset loaded: 1000 cells × 61888 genes
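
The internals of `SingleCellDataset` are not shown in this notebook. The flags `normalize`, `scale_factor` and `log_transform` suggest a per-cell library-size normalization followed by a `log1p` transform; the following NumPy sketch illustrates that idea under this assumption. The function `normalize_counts` is hypothetical and may differ from what the class actually does.

```python
import numpy as np


def normalize_counts(counts, scale_factor=1.0, log_transform=True):
    """Library-size normalize each cell (row), then optionally apply log1p.

    Assumed behaviour for illustration, not the project's implementation.
    """
    counts = np.asarray(counts, dtype=np.float32)
    library_size = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells so that we never divide by zero.
    library_size = np.where(library_size == 0, 1.0, library_size)
    normalized = counts / library_size * scale_factor
    return np.log1p(normalized) if log_transform else normalized


# Toy count matrix: 3 cells x 3 genes, the second cell is empty.
toy_counts = np.array([[1, 3, 0], [0, 0, 0], [5, 5, 10]])
print(normalize_counts(toy_counts, log_transform=False))
```

With `scale_factor=1.0`, each non-empty cell sums to one before the log transform, which removes differences in sequencing depth between cells.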


In [4]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False)

## Defining the variational autoencoder

### Architecture

In [11]:
# TODO: add the code for the model initialization here
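
To illustrate the general shape of such a model, here is a minimal fully connected VAE in PyTorch. This is a sketch only: the class name `SketchVAE`, the layer sizes and the single hidden layer are assumptions chosen for brevity and do not reflect the architecture in our training scripts.

```python
import torch
from torch import nn


class SketchVAE(nn.Module):
    """Minimal fully connected VAE; all sizes are illustrative assumptions."""

    def __init__(self, n_genes, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_genes),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar


# Toy forward pass with 200 "genes" instead of the full 61888.
model = SketchVAE(n_genes=200)
x = torch.rand(4, 200)
x_hat, mu, logvar = model(x)
print(x_hat.shape, mu.shape, logvar.shape)
```

The encoder outputs a mean and a log-variance per latent dimension; predicting the log-variance rather than the variance keeps the output unconstrained.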

### Definition of the ELBO

In [12]:
# TODO: add the code for the ELBO here
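
For reference, the evidence lower bound has the form $\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}(q(z \mid x) \,\|\, p(z))$, and training minimizes its negative. The sketch below assumes a Gaussian likelihood (squared error up to a constant), a standard normal prior and a diagonal Gaussian posterior; count-based likelihoods are a common alternative for transcriptomic data. The function name `negative_elbo` and the `beta` weight are illustrative assumptions, not our final loss.

```python
import torch
import torch.nn.functional as F


def negative_elbo(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO averaged over the batch (Gaussian likelihood up to a constant)."""
    # Reconstruction term: squared error summed over genes, one value per cell.
    recon = F.mse_loss(x_hat, x, reduction="none").sum(dim=1)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return (recon + beta * kl).mean(), recon.mean(), kl.mean()


# Toy check: perfect reconstruction and a posterior equal to the prior.
x = torch.zeros(2, 5)
loss, recon, kl = negative_elbo(x, x, torch.zeros(2, 3), torch.zeros(2, 3))
print(loss.item(), recon.item(), kl.item())
```

Returning the two terms separately makes it easy to log them individually during training.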

### (Other architectural considerations - if needed)

## Training the variational autoencoder

### Training infrastructure

Describe the WandB setup, describe aspects needed for on-cluster training

### Parameterization of the training

Explain also how to run the script

### Training statistics

Show and interpret some of the charts from WandB

## Evaluation of the latent space

### Obtaining latent representation for the data samples

In [16]:
### Load model, forward pass data, store latent means
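
A possible shape for this step is sketched below: iterate over a `DataLoader` without gradients and stack the posterior means. The function `collect_latent_means`, the stand-in encoder and the random data are assumptions that only make the example runnable; the batch format of our `SingleCellDataset` may differ.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def collect_latent_means(encode_fn, loader):
    """Run every batch through encode_fn and stack the posterior means."""
    means = []
    for (batch,) in loader:
        mu, _logvar = encode_fn(batch)
        means.append(mu.cpu())
    return torch.cat(means, dim=0)


# Stand-in encoder and data, only to make the sketch runnable.
n_genes, latent_dim = 50, 8
stand_in = nn.Linear(n_genes, 2 * latent_dim)


def encode_fn(x):
    # Split the output into a mean half and a log-variance half.
    return stand_in(x).chunk(2, dim=1)


loader = DataLoader(TensorDataset(torch.rand(10, n_genes)), batch_size=4, shuffle=False)
latents = collect_latent_means(encode_fn, loader)
print(latents.shape)
```

Using the means rather than sampled latents gives a deterministic embedding per cell, and `shuffle=False` keeps the rows aligned with the cell order in the dataset.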

### Comparison: PCA

Compute a PCA on the input data and on the latent representation, then compare the two.
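
As a sketch of how the two projections could be computed with the same routine, the following uses a plain SVD-based PCA in NumPy. The helper `pca_project` and the random stand-in matrices are assumptions; the real inputs would be the preprocessed expression matrix and the stored latent means.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project rows of `data` onto their top principal components via SVD."""
    centered = data - data.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Fraction of the total variance captured by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 30))  # stand-in for the input matrix
latent = rng.normal(size=(100, 8))       # stand-in for the latent means

expr_pcs, expr_var = pca_project(expression)
latent_pcs, latent_var = pca_project(latent)
print(expr_pcs.shape, latent_pcs.shape)
```

Both projections have one row per cell, so they can be plotted side by side and colored with the same cell-type labels.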

### Comparison: UMAP

Compute a UMAP on the input data and on the latent representation, then compare the two.

### Comparison: PCA + UMAP

Compute PCA followed by UMAP on the input data and on the latent representation, then compare the two.

### Latent factor analysis

Vary individual latent dimensions one at a time and observe the effect on the cluster-colored UMAP.
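
The traversal itself could look like the sketch below: start from a base latent vector, sweep one coordinate over a range of values, and decode each point. The helper `traverse_latent`, the linear stand-in decoder and the sweep range are assumptions; embedding the decoded profiles into the UMAP is not covered here.

```python
import torch
from torch import nn


def traverse_latent(z_base, dim, values):
    """Return copies of z_base in which only coordinate `dim` takes each value."""
    z = z_base.repeat(len(values), 1)
    z[:, dim] = values
    return z


latent_dim, n_genes = 8, 50
decoder = nn.Linear(latent_dim, n_genes)  # stand-in for a trained decoder
z_base = torch.zeros(1, latent_dim)       # e.g. the latent mean of one cell
values = torch.linspace(-3.0, 3.0, steps=7)

with torch.no_grad():
    z_sweep = traverse_latent(z_base, dim=2, values=values)
    decoded = decoder(z_sweep)
print(z_sweep.shape, decoded.shape)
```

Because all other coordinates stay fixed, any change in the decoded profiles can be attributed to the swept dimension.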

## Discussion

Concluding remarks on the results, further steps, ...