In [None]:
import os
import sys

os.chdir("../..")
sys.path.append("../../")

# Simulate single-cell ATAC-seq data

## Introduction

In this example, we show how to use scDesign3Py to simulate the peak-by-cell matrix of scATAC-seq data.

## Import packages and read in data

### Import packages

In [None]:
import anndata as ad
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
import scDesign3Py

### Read in the reference data

The raw data come from the [Signac](https://stuartlab.org/signac/articles/pbmc_vignette.html) PBMC vignette and consist of human peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. We pre-selected the peaks that are differentially accessible between clusters and converted the data to an `.h5ad` file using the R package `sceasy`.

To save time, we subset the data to 1000 randomly sampled cells and the first 100 peaks.

In [None]:
data = ad.read_h5ad("data/ATAC.h5ad")
data = data[data.obs.sample(1000, random_state=123).index,0:100]
data

## Simulation

Here we choose the zero-inflated Poisson (ZIP) distribution because of its good empirical performance. Users may explore other distributions (Poisson, NB, ZINB), since there is no consensus on which distribution best fits scATAC-seq counts.
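
For readers unfamiliar with the ZIP distribution, the following is a minimal NumPy sketch of what it models: each count is a structural zero with some probability and a Poisson draw otherwise. The helper `sample_zip` and the values of `lam` and `pi` are hypothetical and chosen only for illustration; they are not part of scDesign3Py and are not estimated from the PBMC data.

```python
import numpy as np


def sample_zip(rng, lam, pi, size):
    """Draw from a zero-inflated Poisson: 0 with probability pi, else Poisson(lam)."""
    counts = rng.poisson(lam, size=size)
    structural_zero = rng.random(size) < pi
    return np.where(structural_zero, 0, counts)


rng = np.random.default_rng(0)
lam, pi = 2.0, 0.6  # illustrative values (assumption), not fitted parameters
draws = sample_zip(rng, lam, pi, size=100_000)

# Theoretical mean and zero probability of a ZIP(lam, pi) variable
expected_mean = (1 - pi) * lam
expected_zero = pi + (1 - pi) * np.exp(-lam)

print("empirical mean:", draws.mean(), "theoretical:", expected_mean)
print("empirical zero fraction:", (draws == 0).mean(), "theoretical:", expected_zero)
```

The extra mass at zero is what distinguishes ZIP from a plain Poisson with the same mean, which is one reason it can suit sparse peak counts.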

In [None]:
test = scDesign3Py.scDesign3(n_cores=3)
test.set_r_random_seed(123)
simu_res = test.scdesign3(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    mu_formula="cell_type",
    sigma_formula="1",
    family_use="zip",
    usebam=False,
    corr_formula="cell_type",
    copula="gaussian",
)

In [None]:
simu_res["new_count"]

We also apply the TF-IDF transformation to both the reference counts and the simulated counts.
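
As a rough guide to what this transformation computes, here is a NumPy sketch of TF-IDF under the default settings of scikit-learn's `TfidfTransformer` (smoothed IDF followed by L2 normalization of each row). The helper `tfidf_l2` and the toy matrix are hypothetical and used only for illustration; rows stand for cells and columns for peaks.

```python
import numpy as np


def tfidf_l2(counts):
    """TF-IDF with smoothed IDF and row-wise L2 normalization."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    # Number of cells in which each peak has a nonzero count
    df = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency
    idf = np.log((1 + n_cells) / (1 + df)) + 1
    weighted = counts * idf
    # Scale each cell to unit L2 norm, leaving all-zero cells unchanged
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms


# Toy 3-cell by 3-peak count matrix (made-up values)
toy = np.array([[1, 0, 2],
                [0, 0, 3],
                [4, 1, 0]])
out = tfidf_l2(toy)
print(out)
```

Peaks that are open in few cells receive a larger IDF weight, so rare peaks contribute more to each cell's normalized profile than ubiquitous ones.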

In [None]:
tfidf = TfidfTransformer()
# fit_transform refits the IDF weights on each matrix separately
org_tfidf = tfidf.fit_transform(data.X)
simu_tfidf = tfidf.fit_transform(simu_res["new_count"])

We then build a new `AnnData` object from the simulated count matrix and covariates, and store the TF-IDF values in a `tfidf` layer of both the simulated and the reference data.

In [None]:
simu_data = ad.AnnData(X=simu_res["new_count"], obs=simu_res["new_covariate"], layers={"tfidf": simu_tfidf})
data.layers["tfidf"] = org_tfidf

## Visualization

In [None]:
plot = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="tfidf",
    if_plot=True,
    color_by="cell_type",
    n_pc=20,
    point_size=5,
)

In [None]:
plot["p_umap"]