# Main

Autoencoders are powerful neural network architectures used for unsupervised learning, enabling the extraction of meaningful features from high-dimensional datasets such as single-cell RNA sequencing (scRNA-seq) data. When applied to scRNA-seq data from the Heart Cell Atlas, autoencoders can efficiently compress and reconstruct gene expression profiles, aiding in the identification of cellular heterogeneity and underlying biological patterns in cardiac tissues.
> Generated by ChatGPT

This notebook covers the full workflow: preprocessing, training, and evaluation.

***

## Loading Libraries

Library | Version | Channel
--- | --- | ---
anndata | 0.7.0 | bioconda?
PyTorch | 2.2.2 | pytorch
Torchvision | 0.17.2 | pytorch
Tensorboard | / | conda-forge

In [2]:
# Built-in libraries
from datetime import datetime
import os
import sys

# Third-party libraries
import anndata as ad
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
from torch.utils.tensorboard import SummaryWriter

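# Get the absolute path of the 'notebooks' directory.
# Notebooks do not define __file__, so the literal string "__file__" is resolved
# against the current working directory, and dirname() returns that directory.
notebooks_dir = os.path.dirname(os.path.abspath("__file__"))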

# Construct the path to the 'src' directory
src_path = os.path.abspath(os.path.join(notebooks_dir, "..", "src"))

# Add the 'src' directory to the Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Project modules
import autoencoder.ae_model as ae
import autoencoder.ae_training as T
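# Aliased as `utils` because later cells call utils.import_*;
# this assumes those helpers live in utils/data_utils.py
import utils.data_utils as utils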

## Hyperparameters

Load model structure and hyperparameters from JSON file.

In [2]:
file_path = "../config/autoencoder_test.json"
model_params = utils.import_model_params(file_path)

model_architecture = model_params["model"]
model_training = model_params["training"]
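
The contents of `autoencoder_test.json` are not shown in this notebook. As a rough sketch, the keys read by the cells in this notebook imply a structure like the one below. All values are placeholders: the layer lists are left empty because their format is not visible here, the batch size of 128 and the 5 epochs are only inferred from the training log (7 training batches per epoch, 5 epochs), and `check_config` is a hypothetical helper, not part of the project's `utils` module.

```python
import json

# Placeholder config: key names come from this notebook, values are assumed.
EXAMPLE_CONFIG = """
{
  "model": {
    "layers": {"encoder": [], "decoder": []},
    "loss_function": "mse",
    "optimization": {
      "optimizer": "adam",
      "learning_rate": 0.001,
      "weight_decay": 0.0
    }
  },
  "training": {"batch_size": 128, "training_epochs": 5}
}
"""

# Key paths that the notebook reads from the parsed config.
REQUIRED_PATHS = [
    ("model", "layers", "encoder"),
    ("model", "layers", "decoder"),
    ("model", "loss_function"),
    ("model", "optimization", "optimizer"),
    ("model", "optimization", "learning_rate"),
    ("model", "optimization", "weight_decay"),
    ("training", "batch_size"),
    ("training", "training_epochs"),
]


def check_config(cfg):
    """Return the dotted key paths from REQUIRED_PATHS that are missing in cfg."""
    missing = []
    for path in REQUIRED_PATHS:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


config = json.loads(EXAMPLE_CONFIG)
missing_keys = check_config(config)
print(missing_keys)
```

Checking the keys right after loading makes a typo in the JSON file fail early instead of in the middle of training.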

In [3]:
# Batch size (powers of 2 are a common convention and often align well with GPU memory)
batch_size = model_training["batch_size"]

# Training
num_epochs = model_training["training_epochs"]

### Device Specification

The CUDA architecture from NVIDIA enables high-performance parallel computing on GPUs, optimizing tasks through concurrent execution and accelerating applications like deep learning and scientific simulations.
> Generated by ChatGPT

In [4]:
# Select the device used for model computation
device = "cuda" if torch.cuda.is_available() else "cpu"
cuda = device == "cuda"
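
Selecting a device only has an effect once both the model and each batch are moved to it. A minimal sketch with a toy layer (the layer and tensor sizes are arbitrary and are not the notebook's model):

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no CUDA device is visible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Parameters and inputs must live on the same device before the forward pass.
layer = nn.Linear(4, 2).to(device)
batch = torch.rand(3, 4, device=device)

out = layer(batch)
print(out.shape, out.device.type)
```

Mixing a CPU tensor with a GPU model raises a runtime error, which is why training helpers usually take `device` as an argument.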

## Loading Data

Annotated data in the h5ad format is widely used for efficiently storing and accessing large-scale single-cell RNA sequencing (scRNA-seq) datasets. Leveraging the anndata library, researchers can seamlessly manipulate and analyze these datasets, facilitating tasks such as data integration, preprocessing, and visualization, thereby enhancing insights into complex biological systems.
> Generated by ChatGPT

In [5]:
file_path = "../data/adata_normalized_sample.h5ad"
# file_path = "../data/adata_30kx10k_normalized_sample.h5ad"

adata = ad.read_h5ad(filename=file_path)

In [6]:
adata

AnnData object with n_obs × n_vars = 1000 × 1000
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45'
    uns: 'cell_type_colors'
    obsm: 'X_pca', 'X_umap'
    layers: 'cpm_normalized', 'min_max_normalized'

In [7]:
count_data = adata.layers["min_max_normalized"]

In [8]:
count_data

<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 39019 stored elements in Compressed Sparse Row format>
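
The repr above implies that only about 3.9% of the entries are stored, which is why the matrix is kept in sparse form. A small sketch of that arithmetic; the `density` helper and the toy matrix are illustrative and not part of the notebook:

```python
from scipy import sparse


def density(matrix):
    """Fraction of entries that are explicitly stored in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Same arithmetic with the numbers from the repr above.
reported_density = 39019 / (1000 * 1000)
print(f"{reported_density:.2%}")  # 3.90%

# Toy CSR matrix to exercise the helper.
toy = sparse.random(50, 40, density=0.1, format="csr", random_state=0)
toy_density = density(toy)
print(toy_density)
```

Note that `nnz` counts stored elements, so explicitly stored zeros would be included in this figure.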

## Data Split

Split the data into training (80%) and test (20%) sets using a seeded random permutation of the cells.

In [9]:
train_size = int(0.8 * count_data.shape[0])
test_size = count_data.shape[0] - train_size

torch.manual_seed(2406)
perm = torch.randperm(count_data.shape[0])
train_split, test_split = perm[:train_size], perm[train_size:]

In [10]:
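# Import path taken from the object repr printed below; adjust it if the module has moved
from modules.sparse_dataset import SparseDataset
train_data = SparseDataset(count_data[train_split, :])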
test_data = SparseDataset(count_data[test_split, :])
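
The source of `SparseDataset` is not included in this notebook. As a hedged sketch of the general idea, a dataset wrapper can keep the matrix in CSR form and densify only the requested row. The class name `SparseRowDataset`, the float32 cast, and the toy data are assumptions for illustration, not the project's implementation.

```python
import numpy as np
import torch
from scipy import sparse
from torch.utils.data import DataLoader, Dataset


class SparseRowDataset(Dataset):
    """Hypothetical stand-in: serves one dense row of a CSR matrix per item."""

    def __init__(self, matrix):
        self.matrix = sparse.csr_matrix(matrix)

    def __len__(self):
        return self.matrix.shape[0]

    def __getitem__(self, idx):
        # Densify a single row only; the full matrix stays sparse in memory.
        row = self.matrix[idx].toarray().ravel()
        return torch.from_numpy(row.astype(np.float32))


# Toy data standing in for the min-max normalized layer.
toy = sparse.random(10, 6, density=0.3, format="csr", random_state=0)

# Same split pattern as above: seeded permutation, then slice.
torch.manual_seed(0)
perm = torch.randperm(toy.shape[0]).numpy()
train_idx, test_idx = perm[:8], perm[8:]

train_set = SparseRowDataset(toy[train_idx, :])
test_set = SparseRowDataset(toy[test_idx, :])

first_batch = next(iter(DataLoader(train_set, batch_size=4, shuffle=False)))
print(first_batch.shape, first_batch.dtype)
```

Casting to float32 matches the default dtype of `nn.Linear` weights, whereas the stored matrix is float64.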

In [11]:
# Create data loaders
train_loader = torch.utils.data.DataLoader(
    train_data,
    batch_size=batch_size,
    shuffle=True,
)
test_loader = torch.utils.data.DataLoader(
    test_data,
    batch_size=batch_size,
    shuffle=False,
)

In [12]:
train_loader.dataset

<modules.sparse_dataset.SparseDataset at 0x7f8926da9c30>

In [13]:
test_loader.dataset

<modules.sparse_dataset.SparseDataset at 0x7f8926dab2e0>

## Model Structure

The **autoencoder** consists of two components: the **encoder** and the **decoder**. The encoder reduces the dimensionality of the input tensor (here from 1000 genes to a 50-dimensional latent representation). The decoder then attempts to reconstruct the original input from that reduced representation.

In [14]:
encoder_layers, decoder_layers = utils.import_model_architecture(
    forward=model_architecture["layers"]["encoder"],
    backward=model_architecture["layers"]["decoder"],
)

loss_function = utils.import_loss_function(model_architecture["loss_function"])

model = ae.Autoencoder(encoder_layers, decoder_layers, loss_function=loss_function)

In [15]:
model

Autoencoder(
  (encoder): Sequential(
    (0): Linear(in_features=1000, out_features=600, bias=True)
    (1): SiLU()
    (2): Linear(in_features=600, out_features=300, bias=True)
    (3): SiLU()
    (4): Linear(in_features=300, out_features=50, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=50, out_features=300, bias=True)
    (1): SiLU()
    (2): Linear(in_features=300, out_features=600, bias=True)
    (3): SiLU()
    (4): Linear(in_features=600, out_features=1000, bias=True)
  )
  (loss_function): MSELoss()
)
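
The printed structure above can be reproduced with a few lines of plain PyTorch. The sketch below is not the project's `ae.Autoencoder`, whose source is not shown here; `SketchAutoencoder` and its constructor arguments are hypothetical, and only the layer sizes and activations are taken from the printout.

```python
import torch
import torch.nn as nn


class SketchAutoencoder(nn.Module):
    """Illustrative autoencoder mirroring the printed 1000-600-300-50 layout."""

    def __init__(self, n_features=1000, hidden=(600, 300), latent=50):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden[0]),
            nn.SiLU(),
            nn.Linear(hidden[0], hidden[1]),
            nn.SiLU(),
            nn.Linear(hidden[1], latent),
        )
        # The decoder mirrors the encoder in reverse order.
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden[1]),
            nn.SiLU(),
            nn.Linear(hidden[1], hidden[0]),
            nn.SiLU(),
            nn.Linear(hidden[0], n_features),
        )
        self.loss_function = nn.MSELoss()

    def encode(self, x):
        return self.encoder(x)

    def forward(self, x):
        return self.decoder(self.encode(x))


sketch = SketchAutoencoder()
x = torch.rand(8, 1000)  # 8 cells, 1000 genes, values in [0, 1)
latent = sketch.encode(x)
reconstruction = sketch(x)
loss = sketch.loss_function(reconstruction, x)
n_params = sum(p.numel() for p in sketch.parameters())
print(latent.shape, reconstruction.shape, n_params)
```

With these layer sizes the model has roughly 1.6 million trainable parameters, most of them in the first encoder layer and the last decoder layer.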

## Training

In [28]:
optimizer = utils.import_optimizer(
    model.parameters(),
    model_architecture["optimization"]["optimizer"],
    learning_rate=model_architecture["optimization"]["learning_rate"],
    weight_decay=model_architecture["optimization"]["weight_decay"],
)

writer = SummaryWriter(
    f'../runs/hca/ae_{num_epochs}_{datetime.now().strftime("%Y%m%d-%H%M%S")}'
)

prev_updates = 0
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    prev_updates = T.train(
        model, train_loader, optimizer, prev_updates, device, writer=writer
    )
    T.test(model, test_loader, prev_updates, device=device, writer=writer)

Epoch 1/5


 14%|█▍        | 1/7 [00:00<00:00,  8.79it/s]

Step 0 (N samples: 0), Loss: 0.0008 Grad: 0.0020


100%|██████████| 7/7 [00:00<00:00, 19.82it/s]
Testing: 100%|██████████| 2/2 [00:00<00:00, 58.43it/s]


====> Test set loss: 0.0002
Epoch 2/5


100%|██████████| 7/7 [00:00<00:00, 25.60it/s]
Testing: 100%|██████████| 2/2 [00:00<00:00, 48.47it/s]


====> Test set loss: 0.0001
Epoch 3/5


100%|██████████| 7/7 [00:00<00:00, 17.73it/s]
Testing: 100%|██████████| 2/2 [00:00<00:00, 53.66it/s]


====> Test set loss: 0.0001
Epoch 4/5


100%|██████████| 7/7 [00:00<00:00, 20.23it/s]
Testing: 100%|██████████| 2/2 [00:00<00:00, 40.05it/s]


====> Test set loss: 0.0001
Epoch 5/5


100%|██████████| 7/7 [00:00<00:00, 24.72it/s]
Testing: 100%|██████████| 2/2 [00:00<00:00, 55.45it/s]

====> Test set loss: 0.0001
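
The `T.train` and `T.test` helpers used above are defined in the project's `ae_training` module and are not shown in this notebook. As a minimal sketch of what one training epoch and one evaluation pass typically involve for a reconstruction loss, assuming a model whose forward pass returns only the reconstruction (the function names, the toy model, and the random data below are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Run one pass over the loader and return the mean per-sample loss."""
    model.train()
    total_loss, n_samples = 0.0, 0
    for batch in loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(batch), batch)  # reconstruction vs. input
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * batch.size(0)
        n_samples += batch.size(0)
    return total_loss / n_samples


@torch.no_grad()
def evaluate(model, loader, loss_fn, device):
    """Return the mean per-sample loss without updating any weights."""
    model.eval()
    total_loss, n_samples = 0.0, 0
    for batch in loader:
        batch = batch.to(device)
        total_loss += loss_fn(model(batch), batch).item() * batch.size(0)
        n_samples += batch.size(0)
    return total_loss / n_samples


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins: a tiny autoencoder and uniform random "expression" data.
toy_model = nn.Sequential(nn.Linear(20, 5), nn.SiLU(), nn.Linear(5, 20)).to(device)
data = torch.rand(80, 20)
toy_train_loader = DataLoader(data[:64], batch_size=16, shuffle=True)
toy_test_loader = DataLoader(data[64:], batch_size=16, shuffle=False)

toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

history = []
for epoch in range(20):
    train_loss = train_one_epoch(toy_model, toy_train_loader, toy_optimizer, loss_fn, device)
    test_loss = evaluate(toy_model, toy_test_loader, loss_fn, device)
    history.append((train_loss, test_loss))

print(history[0], history[-1])
```

Weighting each batch loss by its size keeps the reported mean correct when the last batch is smaller than the others, as it is for the 800 training cells here.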



