In [None]:
import os
import sys

os.chdir("../..")
sys.path.append("../../")

# Simulate datasets with multiple lineages

## Introduction

In this example, we show how to use pyscDesign3 to simulate single-cell data with multiple lineages.

## Import packages and read in data

### Import packages

In [None]:
import anndata as ad
import numpy as np
import pyscDesign3

### Read in the reference data

The raw data come from [GEO accession GSE72859](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72859), which profiles myeloid progenitors from mouse bone marrow.

We pre-selected the top 1000 highly variable genes. To save time, this example uses only the first 30 of them.

In [None]:
data = ad.read_h5ad("data/MARROW.h5ad")
data = data[:, 0:30].copy()  # keep the first 30 genes; copy so later layer assignments do not act on a view
data

As the output shows, this example dataset has two sets of pseudotime values and therefore two lineages. The variables `pseudotime1` and `pseudotime2` contain the corresponding pseudotime for each cell. The variables `l1` and `l2` indicate whether a given cell belongs to the first lineage, the second lineage, or both.

In [None]:
data.obs[["pseudotime1","pseudotime2","l1","l2"]].head()
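To see how cells are split between the two lineages, the indicator columns can be cross-tabulated. The sketch below uses a small made-up table rather than the real `data.obs`, and it assumes the indicators are coded as 0/1; the actual coding in `MARROW.h5ad` may differ (for example, booleans).

```python
import pandas as pd

# Toy stand-in for data.obs[["l1", "l2"]]; the values are invented for illustration,
# and the 0/1 coding of the indicators is an assumption.
obs = pd.DataFrame(
    {
        "l1": [1, 1, 1, 0, 0],
        "l2": [1, 0, 0, 1, 1],
    }
)

# Rows index l1 membership, columns index l2 membership.
membership = pd.crosstab(obs["l1"], obs["l2"])
print(membership)

# Cells that belong to both lineages (typically the shared early progenitors).
n_shared = int(((obs["l1"] == 1) & (obs["l2"] == 1)).sum())
print("cells in both lineages:", n_shared)
```

On the real data, replacing `obs` with `data.obs` should give the same kind of summary, provided the indicator coding matches the assumption above.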

## Simulation

Next, we use this multiple-lineage dataset to generate new data by setting the parameter `mu_formula` to the sum of two smooth terms, one per lineage.

In [None]:
test = pyscDesign3.scDesign3(n_cores=6)
test.set_r_random_seed(123)
simu_res = test.scdesign3(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime=["pseudotime1", "pseudotime2", "l1", "l2"],
    mu_formula="s(pseudotime1, k = 10, by = l1, bs = 'cr') + s(pseudotime2, k = 10, by = l2, bs = 'cr')",
    sigma_formula="1",
    family_use="nb",
    usebam=True,
    corr_formula="1",
    copula="gaussian",
)
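The `mu_formula` string used above follows a regular pattern: one smooth term per lineage, each pairing a pseudotime variable with its lineage indicator through `by`. As a minimal sketch, a small helper can assemble that string for any number of lineages. The helper name `build_mu_formula` is hypothetical and not part of pyscDesign3; it only does string formatting.

```python
def build_mu_formula(lineages, k=10, bs="cr"):
    """Join one smooth term per (pseudotime, indicator) pair with ' + '.

    Hypothetical helper for illustration; it is not a pyscDesign3 function.
    """
    terms = [
        f"s({pseudotime}, k = {k}, by = {indicator}, bs = '{bs}')"
        for pseudotime, indicator in lineages
    ]
    return " + ".join(terms)


# Reproduces the formula passed to scdesign3 in the cell above.
mu_formula = build_mu_formula([("pseudotime1", "l1"), ("pseudotime2", "l2")])
print(mu_formula)
```

Building the formula this way keeps `k` and `bs` consistent across lineages and avoids copy-paste errors when a dataset has more than two.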

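Then we build a new AnnData object from the simulated count matrix and covariates, and add a log-transformed layer to both the simulated and the reference data for plotting.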

In [None]:
simu_data = ad.AnnData(X=simu_res["new_count"], obs=simu_res["new_covariate"])
simu_data.layers["log_transformed"] = np.log1p(simu_data.X)
data.layers["log_transformed"] = np.log1p(data.X)
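The `log_transformed` layers above are computed with `np.log1p`, that is, log(1 + x). A short sketch on a made-up count matrix shows why this transform suits count data: zeros stay at zero, and the original counts can be recovered with `np.expm1`.

```python
import numpy as np

# Made-up count matrix (cells x genes), for illustration only.
counts = np.array([[0, 1, 9], [3, 0, 99]])

# log1p(x) = log(1 + x): defined at zero and maps 0 to 0.
log_counts = np.log1p(counts)
print(log_counts)

# expm1 is the inverse transform, so the counts are recoverable.
recovered = np.expm1(log_counts)
print(np.allclose(recovered, counts))
```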

## Visualization

In [None]:
plot1 = pyscDesign3.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime1",
    n_pc=20,
    point_size=5,
)
plot2 = pyscDesign3.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime2",
    n_pc=20,
    point_size=5,
)

### Pseudotime1

In [None]:
plot1["p_umap"]

### Pseudotime2

In [None]:
plot2["p_umap"]