Source: https://mira-multiome.readthedocs.io/en/latest/notebooks/tutorial_atlas_integration.html

# Atlas-level integration
In this tutorial, I cover some tips and tricks for modeling large collections of single cells, using a scATAC-seq dataset of brain development produced by 10x Genomics.

One key problem when working with large-scale atlases is that it can be hard to know how many topics will best represent the dataset: complex systems could require many tens of topics to capture all of the apparent heterogeneity. Even though we provide an automated method for determining this, Bayesian search over extremely large ranges is time-consuming and inefficient. In this tutorial, I demonstrate how to use gradient descent to estimate the number of topics in a dataset using a Dirichlet Process model.
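To build intuition for why a Dirichlet Process prior helps here, the sketch below draws topic weights from a truncated stick-breaking construction using NumPy. It illustrates the general idea only and is not MIRA's implementation; the function name `stick_breaking_weights`, the truncation level `max_topics`, and the concentration value `alpha` are assumptions chosen for the example.

```python
import numpy as np

def stick_breaking_weights(alpha, max_topics, rng):
    """Draw topic weights from a truncated stick-breaking process (hypothetical helper)."""
    # Each Beta(1, alpha) draw is the fraction of the remaining stick given to the next topic.
    fractions = rng.beta(1.0, alpha, size=max_topics)
    fractions[-1] = 1.0  # truncation: the last topic takes whatever is left
    # Stick length remaining before each break: product of (1 - fraction) over earlier topics.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - fractions[:-1])])
    return fractions * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=2.0, max_topics=50, rng=rng)

# Topics with non-negligible weight give a rough sense of how many are "in use".
n_active = int((weights > 0.01).sum())
print(f"weights sum to {weights.sum():.3f}; {n_active} of {len(weights)} topics exceed 1%")
```

With a smaller `alpha`, more of the mass lands on the first few topics, so the number of topics carrying appreciable weight can be read off after fitting rather than fixed in advance.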

# Preprocessing ATAC-seq data

The previous tutorial outlined some best practices for preprocessing scRNA-seq data and selecting genes to model. For scATAC-seq, preprocessing is somewhat less straightforward. The basic pipeline we recommend closely follows the one employed by 10x Genomics:

1. Align ATAC-seq fragments

2. Generate fragment file

3. Call peaks from fragment file

4. Intersect fragments with peaks to generate a sparse, near-binary count matrix of size Ncells x Npeaks

5. Filter out extremely rare peaks (accessible in fewer than ~30 cells) and non-cell droplets.
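Step 4 above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the 10x or MIRA implementation: the fragment and peak tuples are hypothetical, a fragment is counted toward every peak it overlaps (real pipelines may instead count transposase cut sites), and the linear scan over peaks would be replaced by an interval tool such as `bedtools intersect` for real data.

```python
from collections import defaultdict
import numpy as np

# Toy inputs (hypothetical): fragments as (chrom, start, end, barcode),
# peaks as (chrom, start, end), using half-open coordinates.
fragments = [
    ("chr1", 100, 250, "cellA"),
    ("chr1", 120, 300, "cellB"),
    ("chr1", 900, 1100, "cellA"),
    ("chr2", 50, 180, "cellB"),
    ("chr2", 5000, 5200, "cellA"),  # overlaps no peak
]
peaks = [("chr1", 80, 320), ("chr1", 1000, 1400), ("chr2", 0, 200)]

# Group peaks by chromosome, remembering each peak's column index.
peaks_by_chrom = defaultdict(list)
for idx, (chrom, start, end) in enumerate(peaks):
    peaks_by_chrom[chrom].append((start, end, idx))

# Assign each barcode a row index.
barcodes = sorted({frag[3] for frag in fragments})
cell_index = {bc: i for i, bc in enumerate(barcodes)}

# Dense here for readability; a real Ncells x Npeaks matrix should be sparse.
counts = np.zeros((len(barcodes), len(peaks)), dtype=np.int32)

for chrom, f_start, f_end, barcode in fragments:
    for p_start, p_end, p_idx in peaks_by_chrom.get(chrom, []):
        # Two half-open intervals overlap when each starts before the other ends.
        if p_start < f_end and p_end > f_start:
            counts[cell_index[barcode], p_idx] += 1

print(barcodes)
print(counts)
```

Rows follow the sorted barcode order and columns follow the order of `peaks`, so the matrix can be paired with barcode and peak annotations afterwards.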

The 10x pipeline employs an in-house peak caller, which performs adequately. If possible, we recommend re-calling peaks with MACS and re-aggregating fragments. Since highly variable peaks are hard to identify due to the sparsity of the data, we recommend using as features all called peaks that are accessible in some reasonable number of cells (for example, more than 30 cells).
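The peak filter described above can be sketched with NumPy on a toy matrix. The matrix, the per-peak cell counts, and the variable names are assumptions made for illustration; with an AnnData object, `sc.pp.filter_genes(adata, min_cells=...)` applies a comparable filter to the columns.

```python
import numpy as np

n_cells = 100
# Toy cells x peaks matrix (hypothetical): peak j is accessible in the first
# accessible_cells[j] cells and closed everywhere else.
accessible_cells = [80, 45, 31, 30, 5, 0]
counts = np.zeros((n_cells, len(accessible_cells)), dtype=np.int8)
for j, n in enumerate(accessible_cells):
    counts[:n, j] = 1

min_cells = 30  # threshold suggested in the text

# Number of cells in which each peak has at least one count.
cells_per_peak = (counts > 0).sum(axis=0)

# Keep peaks accessible in more than `min_cells` cells.
keep = cells_per_peak > min_cells
filtered = counts[:, keep]

print(f"kept {int(keep.sum())} of {counts.shape[1]} peaks")
```

The threshold is a judgment call: it should be low enough to retain peaks specific to rare cell types and high enough to drop peaks supported by only a handful of cells.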

First, import some packages:

In [1]:
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import mira

Since we’re training an accessibility model in this tutorial, we want to make sure we are working on a GPU:

In [2]:
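import torch
# Fail early if PyTorch cannot see a CUDA device; this raises an AssertionError
# (as in the output shown here) when no GPU is visible.
assert torch.cuda.is_available()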

AssertionError: 

Now, load some data: