# Main

Single-cell RNA sequencing (scRNA-seq) reveals the gene expression profiles of individual cells, offering insights into cellular diversity and dynamics. However, the high-dimensional and sparse nature of scRNA-seq data presents challenges for analysis. Variational autoencoders (VAEs) provide a powerful solution by learning low-dimensional representations of this complex data.

VAEs extend traditional autoencoders with a probabilistic latent space: the encoder outputs the parameters of a distribution rather than a single point, so the model can sample new data points that resemble the original dataset. This generative capability is useful for denoising and for imputing missing values, which makes VAEs well suited to scRNA-seq analysis.
> Generated by ChatGPT

## Loading Libraries

Library | Version | Channel
--- | --- | ---
anndata | 0.7.0 | bioconda?
PyTorch | 2.2.2 | pytorch
Torchvision | 0.17.2 | pytorch
Tensorboard | / | conda-forge

In [2]:
# Built-in libraries
from datetime import datetime
import os
import sys

# Third-party libraries
import anndata as ad
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
from torch.utils.tensorboard import SummaryWriter

# Get the absolute path of the 'notebooks' directory. `__file__` is not
# defined inside a notebook, so use the current working directory instead.
notebooks_dir = os.getcwd()

# Construct the path to the 'src' directory
src_path = os.path.abspath(os.path.join(notebooks_dir, "..", "src"))

# Add the 'src' directory to the Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Self-build modules
from utils.data_utils import SparseDataset
import variational_autoencoder.vae_model as vae
import variational_autoencoder.vae_training as T

SyntaxError: f-string: expecting '}' (vae_training.py, line 22)
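
The `SyntaxError` above is raised while importing `vae_training.py`, whose source is not shown here. One common cause of this message on Python versions before 3.12 is reusing the f-string's own quote character inside a replacement field. The sketch below is an assumption about the cause, not a diagnosis; the `stats` dictionary is hypothetical.

```python
# Hypothetical data, used only for illustration.
stats = {"loss": 0.1234}

# On Python < 3.12 the next line is a SyntaxError, because the inner double
# quotes end the f-string early:
#   message = f"Loss: {stats["loss"]:.2f}"

# Portable alternative 1: use a different quote type inside the braces.
message_a = f"Loss: {stats['loss']:.2f}"

# Portable alternative 2: bind the value to a name first.
loss = stats["loss"]
message_b = f"Loss: {loss:.2f}"

print(message_a)
```

Either form parses on every Python 3 version that supports f-strings.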

## Hyperparameters

In [2]:
# Batch size
batch_size = 128  # Powers of 2 tend to map efficiently onto GPU hardware

# Model architecture
size_layers = [
    (10000, nn.ReLU()),
    (6000, nn.ReLU()),
    (3000, nn.ReLU()),
    (1000, nn.ReLU()),
    (200, nn.ReLU()),
]

# Optimizer
learning_rate = 1e-3
weight_decay = 1e-4

# Training
num_epochs = 3
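
How `size_layers` is turned into layers is decided inside `vae.VariationalAutoencoder`, which is not shown. As a rough sketch only, the hypothetical helper `build_encoder` below chains consecutive sizes into `nn.Linear` layers and doubles the final width so that the encoder emits two values per latent dimension; the real class may differ. A scaled-down size list keeps the example fast.

```python
import torch
import torch.nn as nn


def build_encoder(size_layers):
    """Hypothetical helper: turn [(size, activation), ...] into an encoder."""
    layers = []
    last = len(size_layers) - 2
    for i in range(len(size_layers) - 1):
        in_features, activation = size_layers[i]
        out_features, _ = size_layers[i + 1]
        if i == last:
            # Final layer: two outputs per latent dimension, no activation.
            layers.append(nn.Linear(in_features, 2 * out_features))
        else:
            layers.append(nn.Linear(in_features, out_features))
            layers.append(activation)
    return nn.Sequential(*layers)


# Scaled-down stand-in for the notebook's 10000 -> ... -> 200 configuration.
small_layers = [(100, nn.ReLU()), (60, nn.ReLU()), (30, nn.ReLU()), (8, nn.ReLU())]
encoder = build_encoder(small_layers)
output = encoder(torch.rand(5, 100))
print(output.shape)  # torch.Size([5, 16])
```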

### Device Specification

CUDA is NVIDIA's platform for general-purpose parallel computing on GPUs. Running tensor operations on a CUDA device typically speeds up deep learning workloads considerably. The cell below selects the GPU when one is available and falls back to the CPU otherwise.
> Generated by ChatGPT

In [3]:
# Select the device used for model computation
device = "cuda" if torch.cuda.is_available() else "cpu"
cuda = device == "cuda"
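
Selecting the device does not move anything by itself; modules and tensors still have to be placed on it. A minimal sketch with a toy layer and batch (both are stand-ins, not part of the notebook):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

toy_layer = nn.Linear(4, 2).to(device)       # parameters now live on `device`
toy_batch = torch.rand(3, 4, device=device)  # create the batch on the same device

result = toy_layer(toy_batch)
print(result.device.type == device)
```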

## Loading Data

The h5ad format (HDF5-backed AnnData) is widely used to store large single-cell RNA sequencing (scRNA-seq) datasets efficiently. The anndata library loads such a file into an `AnnData` object that keeps the expression matrix together with per-cell (`obs`) and per-gene (`var`) annotations, which simplifies data integration, preprocessing, and visualization.
> Generated by ChatGPT

In [4]:
# file_path = "../data/adata_normalized_sample.h5ad"
file_path = "../data/adata_30kx10k_normalized_sample.h5ad"

adata = ad.read_h5ad(filename=file_path)

In [5]:
adata

AnnData object with n_obs × n_vars = 30000 × 10000
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45'
    uns: 'cell_type_colors'
    obsm: 'X_pca', 'X_umap'
    layers: 'cpm_normalized', 'min_max_normalized'

In [6]:
count_data = adata.layers["min_max_normalized"]

In [7]:
count_data

<30000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 12490183 stored elements in Compressed Sparse Row format>
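
The printed summary is enough to estimate how sparse the matrix is and why it is kept in CSR form. The arithmetic below uses only the numbers shown above and assumes 8-byte values with 4-byte (int32) indices, which is an assumption about the index dtype.

```python
n_obs, n_vars = 30_000, 10_000
stored = 12_490_183  # stored entries reported by the matrix summary

density = stored / (n_obs * n_vars)

# Dense float64 storage: one 8-byte value per cell of the matrix.
dense_bytes = n_obs * n_vars * 8

# CSR storage: a value (8 bytes) and a column index (4 bytes) per stored entry,
# plus one 4-byte row pointer per row and one extra.
csr_bytes = stored * (8 + 4) + (n_obs + 1) * 4

print(f"density: {density:.2%}")
print(f"dense: {dense_bytes / 1e9:.1f} GB, CSR: {csr_bytes / 1e6:.0f} MB")
```

Roughly 4% of the entries are stored, so the CSR layout needs about 150 MB where a dense array would need 2.4 GB.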

## Data Split

Split the data into a training set (80%) and a test set (20%) using a seeded random permutation of the cells.

In [8]:
train_size = int(0.8 * count_data.shape[0])
test_size = count_data.shape[0] - train_size

torch.manual_seed(2406)
perm = torch.randperm(count_data.shape[0])
train_split, test_split = perm[:train_size], perm[train_size:]

In [9]:
train_data = SparseDataset(count_data[train_split, :])
test_data = SparseDataset(count_data[test_split, :])
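
`SparseDataset` comes from the project's own `utils.data_utils` module and its source is not shown. A minimal sketch of what such a wrapper could look like follows; the class name `SparseRowDataset` and its behavior are assumptions for illustration. It keeps the matrix sparse in memory and converts only the requested row to a dense tensor.

```python
import scipy.sparse as sp
import torch
from torch.utils.data import Dataset


class SparseRowDataset(Dataset):
    """Hypothetical stand-in for SparseDataset: densifies one row per item."""

    def __init__(self, matrix):
        self.matrix = sp.csr_matrix(matrix)

    def __len__(self):
        return self.matrix.shape[0]

    def __getitem__(self, index):
        row = self.matrix[index].toarray().ravel()  # dense 1-D NumPy array
        return torch.from_numpy(row).float()


# Small random sparse matrix standing in for the expression data.
toy = sp.random(6, 5, density=0.3, format="csr", random_state=0)
dataset = SparseRowDataset(toy)
print(len(dataset), dataset[0].shape)
```

Densifying per row keeps peak memory low, because the default `DataLoader` collation only ever stacks `batch_size` dense rows at a time.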

In [10]:
# Create data loaders
train_loader = torch.utils.data.DataLoader(
    train_data,
    batch_size=batch_size,
    shuffle=True,
)
test_loader = torch.utils.data.DataLoader(
    test_data,
    batch_size=batch_size,
    shuffle=False,
)
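
With an 80/20 split of 30,000 cells and `batch_size = 128`, the number of batches per epoch follows directly, assuming the default `drop_last=False`:

```python
import math

n_cells = 30_000
batch_size = 128

train_size = int(0.8 * n_cells)
test_size = n_cells - train_size

# The last batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)
print(train_batches, test_batches)  # 188 47
```

The training figure agrees with the `0/188` shown by the progress bar in the training output.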

In [11]:
train_loader.dataset

<modules.sparse_dataset.SparseDataset at 0x7fe2b56609d0>

In [12]:
test_loader.dataset

<modules.sparse_dataset.SparseDataset at 0x7fe2b5663ee0>

## Model Structure

**Variational autoencoders** are a type of deep generative model that extend traditional autoencoders by incorporating a probabilistic framework. This allows VAEs to learn meaningful latent representations of data, which can be used for tasks such as data generation, denoising, and imputation. Their ability to handle complex data distributions makes them particularly useful in fields like image processing and single-cell RNA sequencing (scRNA-seq) analysis.
> Generated by ChatGPT

In [13]:
model = vae.VariationalAutoencoder(size_layers, F.mse_loss, opt.Adam)

Constructed encoder...
Constructed decoder...


In [14]:
model

VariationalAutoencoder(
  (encoder): Sequential(
    (0): Linear(in_features=10000, out_features=6000, bias=True)
    (1): ReLU()
    (2): Linear(in_features=6000, out_features=3000, bias=True)
    (3): ReLU()
    (4): Linear(in_features=3000, out_features=1000, bias=True)
    (5): ReLU()
    (6): Linear(in_features=1000, out_features=400, bias=True)
  )
  (softplus): Softplus(beta=1, threshold=20)
  (decoder): Sequential(
    (0): Linear(in_features=200, out_features=1000, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1000, out_features=3000, bias=True)
    (3): ReLU()
    (4): Linear(in_features=3000, out_features=6000, bias=True)
    (5): ReLU()
    (6): Linear(in_features=6000, out_features=10000, bias=True)
  )
)
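
The encoder's last layer has 400 outputs for a 200-dimensional latent space, and the model carries a `Softplus` module. One plausible reading, assumed here rather than confirmed by the source, is that the output is split into a mean and a raw scale, with softplus keeping the scale positive. Under that assumption, sampling and the KL term could look like this:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
latent_dim = 200
encoder_out = torch.randn(4, 2 * latent_dim)  # stand-in for the encoder output

# Split into mean and raw scale; softplus maps the raw scale to positive values.
mu, raw_scale = encoder_out.chunk(2, dim=-1)
scale = F.softplus(raw_scale) + 1e-8  # small epsilon avoids a zero scale

posterior = Normal(mu, scale)
z = posterior.rsample()  # reparameterized sample: mu + scale * noise

# KL divergence to a standard normal prior, summed over latent dimensions.
prior = Normal(torch.zeros_like(mu), torch.ones_like(scale))
kl = kl_divergence(posterior, prior).sum(dim=-1)
print(z.shape, kl.shape)
```

`rsample` keeps the sample differentiable with respect to `mu` and `scale`, which is what allows the encoder to be trained by backpropagation.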

## Training

In [15]:
# writer = SummaryWriter(
#     f'runs/hca/vae_{num_epochs}_{datetime.now().strftime("%Y%m%d-%H%M%S")}'
# )

prev_updates = 0
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    prev_updates = T.train(
        model,
        train_loader,
        model.optimizer,
        prev_updates,
        device,
        model_type="vae",
    )
    T.test(
        model,
        test_loader,
        prev_updates,
        device=device,
        model_type="vae",
    )

Epoch 1/3


  0%|          | 0/188 [00:00<?, ?it/s]


RuntimeError: all elements of input should be between 0 and 1
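
This message matches the input check in `torch.nn.functional.binary_cross_entropy`, which suggests that the training code computes a binary cross-entropy reconstruction loss even though `F.mse_loss` was passed to the constructor. The training module is not shown, so treat that as an assumption. The decoder printed earlier ends in a plain `Linear` layer, so its outputs are unbounded. Two common remedies are to squash the outputs with a sigmoid or to use the logits variant of the loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)  # stand-in for unbounded decoder outputs
target = torch.rand(4, 10)   # stand-in for min-max normalized inputs in [0, 1]

# Option 1: map the outputs into (0, 1) before computing the loss.
loss_sigmoid = F.binary_cross_entropy(
    torch.sigmoid(logits), target, reduction="sum"
)

# Option 2: pass raw logits to the fused, numerically more stable variant.
loss_logits = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")

print(torch.allclose(loss_sigmoid, loss_logits, rtol=1e-4))
```

If a mean-squared-error reconstruction loss is what was intended, no sigmoid is required and the fix belongs in the loss call instead.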