# ACTIVA Tutorial: Generating Realistic scRNA-Seq

In this notebook, we will go over how to load a pre-trained ACTIVA model, and how to generate synthetic data using the generator.

## Load in pre-trained ACTIVA model

Since our implementation is in PyTorch, we can use the `load` function that PyTorch provides. Our model is stored as a dict, with `epoch` corresponding to the epoch at which the checkpoint was saved, and `Saved_Model` corresponding to the model itself.

In [1]:
import torch 

# here is the path to a pre-trained ACTIVA model
path_to_pretrained = "/home/ubuntu/SindiLab/ACTIVA-Saved_Model/model_epoch_600_iter_0.pth"

model_dict = torch.load(path_to_pretrained)

activa = model_dict["Saved_Model"]

print(activa)

SCIV(
  (encoder): Encoder(
    (enc_sequential): Sequential(
      (0): Linear(in_features=17789, out_features=1024, bias=True)
      (1): ReLU()
      (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Linear(in_features=1024, out_features=512, bias=True)
      (4): ReLU()
      (5): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): Linear(in_features=512, out_features=256, bias=True)
      (7): ReLU()
      (8): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (9): Linear(in_features=256, out_features=256, bias=True)
      (10): ReLU()
      (11): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (decoder): Decoder(
    (lsn): LSN()
    (thres_layer): ReLU()
    (dec_sequential): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=True)
      (1): ReLU()
      (2): BatchNorm1d(256, eps=1e-05, momen
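
The checkpoint layout described above (a dict with `epoch` and `Saved_Model` keys) can be sketched as a small save/load round trip. This is a minimal illustration, not the ACTIVA training code: `toy_model`, the file name, and the temporary directory are hypothetical stand-ins. It also assumes a PyTorch version whose `torch.load` accepts `weights_only`; older versions may not have this argument.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the ACTIVA model (the real checkpoint holds an SCIV module).
toy_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU())

# Same dictionary layout as described above: the epoch plus the whole model object.
checkpoint = {"epoch": 600, "Saved_Model": toy_model}

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "toy_checkpoint.pth")
    torch.save(checkpoint, ckpt_path)
    # map_location remaps the saved tensors onto the requested device, which
    # helps when a checkpoint saved on a GPU is opened on a CPU-only machine.
    # weights_only=False permits unpickling a full module object; only use it
    # on checkpoint files you trust.
    loaded = torch.load(ckpt_path, map_location="cpu", weights_only=False)

loaded_model = loaded["Saved_Model"]
print(loaded["epoch"])
print(next(loaded_model.parameters()).device)
```

Passing `map_location` is optional when the checkpoint is loaded on the same kind of device it was saved from.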

### Choose the device on which to generate data

We recommend using GPUs for *training*. For inference, either a CPU or a GPU works fine, although a GPU will be faster.

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print('Using GPU (CUDA)')
else:
    print('Using CPU')

Using GPU (CUDA)


## Generating Gaussian noise latent tensors for input to the generator

As mentioned in our paper, we use a VAE (an IntroVAE, to be exact) as the core of our model. This means that, after training, we can feed Gaussian noise tensors to the generator, which maps them onto the same manifold as the single-cell data.


In [3]:
import time 
start = time.time()

# for reproducibility 
torch.manual_seed(0)

num_cells = 6991
# the latent dimension must match the input size of ACTIVA's generator network
# (the decoder's first Linear layer printed above has in_features=128)
latent_dim = 128
z_g = torch.randn(num_cells, latent_dim).to(device)
# generate synthetic cells with ACTIVA
generated_cells = activa.decoder(z_g)

count_matrix = generated_cells.detach().cpu().numpy()

print(f"We generated {num_cells} cells in {time.time() - start} seconds (on {device})")

We generated 6991 cells in 0.5929989814758301 seconds (on cuda)
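
When sampling from a network that contains `BatchNorm1d` layers, a common PyTorch inference pattern is to switch the module to evaluation mode and disable gradient tracking. The sketch below illustrates that pattern on a hypothetical toy decoder. It is not the ACTIVA architecture, `n_genes = 50` is a toy value, and it is a general PyTorch practice rather than a step prescribed by this tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

latent_dim, n_genes = 128, 50  # n_genes is a toy value, not ACTIVA's 17789

# Hypothetical toy decoder used only to illustrate the pattern.
toy_decoder = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
    nn.Linear(64, n_genes),
    nn.ReLU(),  # keeps the outputs non-negative, like a thresholding layer
)

# In eval mode, BatchNorm uses its stored running statistics instead of the
# statistics of the current batch, so each output row depends only on its own input row.
toy_decoder.eval()

z = torch.randn(10, latent_dim)
with torch.no_grad():  # no autograd graph is built, which saves memory
    cells_batch = toy_decoder(z)
    cells_single = toy_decoder(z[:1])  # the first latent vector on its own

print(cells_batch.shape)
print(torch.allclose(cells_batch[:1], cells_single, atol=1e-5))
```

In training mode, by contrast, the output for one latent vector would depend on the other vectors in the same batch.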


### Turning count matrix into a Compressed Sparse Row matrix

Here we convert our matrix to a Compressed Sparse Row (CSR) matrix for faster computation and lower storage cost.

In [4]:
import scipy.sparse as sp

count_matrix_sparse = sp.csr_matrix(count_matrix)
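
To get a feel for the storage benefit, the sketch below compares the memory footprint of a dense array with its CSR counterpart. The matrix is a toy Poisson sample chosen to be mostly zeros (an assumption made for illustration, not the ACTIVA output); CSR only pays off when a large fraction of the entries are zero.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy count matrix: Poisson draws with a small rate, so most entries are zero.
dense = rng.poisson(lam=0.1, size=(1000, 200)).astype(np.float32)

sparse = sp.csr_matrix(dense)

# CSR stores three arrays: the nonzero values, their column indices,
# and one row pointer per row (plus one).
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"fraction of nonzeros: {sparse.nnz / dense.size:.3f}")
print(f"dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")
```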

## Merge the gene-names with the generated data

Here we read in the original AnnData object and replace its count matrix with the one we generated. Our model does not know the specific gene names, but it is aware of the importance of each feature (column). Replacing the count matrix with the generated one, while keeping the rest of the AnnData object, therefore lets us run analyses with the gene names included.

In [5]:
import scanpy as sc

# read in the original data 
adata = sc.read("/home/ubuntu/RawData/raw_68kPBMCs.h5ad")

# subsample the original data to the same number of cells as we generated (we replace the count matrix in the next step)

sc.pp.subsample(adata, n_obs=num_cells, random_state=0, copy=False)

adata.X = count_matrix_sparse

print(adata)

AnnData object with n_obs × n_vars = 6991 × 17789
    obs: 'cluster', 'n_genes', 'n_counts', 'split'
    var: 'n_cells'


In [6]:
# delete the extra attributes that the original data had 
del adata.obs["cluster"]
del adata.obs["split"]
print(adata)

AnnData object with n_obs × n_vars = 6991 × 17789
    obs: 'n_genes', 'n_counts'
    var: 'n_cells'


## Save the generated cells as a Scanpy object

Now we save the generated cells as a Scanpy (AnnData) object, which can also be used in Seurat (as we do for post-processing).

In [8]:
import os

# check if the directory exists 
dir_name = 'ACTIVA-Generated'
if not os.path.isdir(dir_name):
    os.mkdir(dir_name) 
    print(f"Created {dir_name} directory") 

path = "./" + dir_name + f"/68kPBMC-{num_cells}Generated.h5ad"
adata.write(path)
print(f"Saved the new cells to {path}")

Saved the new cells to ./ACTIVA-Generated/68kPBMC-6991Generated.h5ad


And that is it. In another tutorial, we will go over post-processing with Seurat and generating specific cell types using ACTIVA.