In [1]:
import os
import sys

sys.path.append("../../../../")

# Compare Gaussian copula and vine copula

## Introduction

In this example, we show the differences between using a Gaussian copula and a vine copula when simulating new data. A vine copula can better estimate high-dimensional gene-gene correlation; however, simulation with a vine copula takes more time than with a Gaussian copula. If your reference dataset has more than **1000 genes**, we recommend simulating data with a Gaussian copula.
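As a rough illustration of what a Gaussian copula does, the sketch below draws correlated counts for two genes: it samples correlated standard normals, maps them to uniforms with the normal CDF, and then pushes each uniform through a negative binomial quantile function. This is only a toy under stated assumptions, not how `pyscDesign3` is implemented; the function names (`nb_quantile`, `normal_cdf`, `simulate_pair`) and all parameter values are hypothetical and chosen for illustration.

```python
import math
import random


def nb_quantile(u, mu, theta):
    """Smallest count k whose negative binomial CDF reaches u (mean mu, dispersion theta)."""
    p = theta / (theta + mu)  # success probability in the (size, prob) parameterisation
    k, cdf = 0, 0.0
    while True:
        log_pmf = (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
                   + theta * math.log(p) + k * math.log1p(-p))
        cdf += math.exp(log_pmf)
        if cdf >= u or k > 10_000:  # the cap guards against u rounding to exactly 1.0
            return k
        k += 1


def normal_cdf(z):
    """Standard normal CDF, used to turn a normal draw into a uniform one."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_pair(n_cells, rho, mu=(5.0, 2.0), theta=(1.0, 1.0), seed=0):
    """Simulate counts for two genes whose dependence is a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        # Step 1: correlated standard normals (2 x 2 Cholesky factor written out by hand)
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: normals -> uniforms; Step 3: uniforms -> negative binomial counts
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        counts.append((nb_quantile(u1, mu[0], theta[0]),
                       nb_quantile(u2, mu[1], theta[1])))
    return counts


pairs = simulate_pair(1000, rho=0.8)
print(pairs[:5])
```

With many genes, the Gaussian copula needs only one correlation matrix per group, whereas a vine copula builds the dependence from many pairwise copulas, which is one intuition for the difference in running time mentioned above.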

## Step 1: Import packages and read in data

### Import packages

In [67]:
import re
import anndata as ad
import pandas as pd
import scanpy as sc
import pyscDesign3

### Read in data

The raw data is from the R package `DuoClustering2018` and was converted to an `.h5ad` file using the R package `sceasy`.

In [64]:
data = ad.read_h5ad("../../data/Zhengmix4eq.h5ad")
data.obs["cell_type"] = data.obs["phenoid"]
data.var.index = data.var["symbol"]

For demonstration purposes, we use the top 100 highly variable genes. We further filter out some highly expressed housekeeping genes and add transcription factor (TF) genes.

In [65]:
# load the list of human transcription factor (TF) names
humantfs = pd.read_csv("http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt", header=None)
# choose the top 100 highly variable genes (HVGs)
sc.pp.highly_variable_genes(data, layer="logcounts", n_top_genes=100)
gene_list = data.var[data.var["highly_variable"]].index.to_series()
# combine TFs and HVGs into one candidate list
gene_list = pd.unique(pd.concat([humantfs, gene_list])[0])
# drop genes whose symbol starts with "RP" or "TMSB", plus selected housekeeping genes
housekeeping = ["B2M", "MALAT1", "ACTB", "ACTG1", "GAPDH", "FTL", "FTH1"]
gene_list = [x for x in gene_list if re.match("RP|TMSB", x) is None and x not in housekeeping]
# keep only the candidate genes that are present in the dataset
subdata = data[:, list(set(gene_list).intersection(set(data.var_names)))]

In [66]:
subdata

View of AnnData object with n_obs × n_vars = 3555 × 139
    obs: 'barcode', 'phenoid', 'total_features', 'log10_total_features', 'total_counts', 'log10_total_counts', 'pct_counts_top_50_features', 'pct_counts_top_100_features', 'pct_counts_top_200_features', 'pct_counts_top_500_features', 'sizeFactor', 'cell_type'
    var: 'id', 'symbol', 'mean_counts', 'log10_mean_counts', 'rank_counts', 'n_cells_counts', 'pct_dropout_counts', 'total_counts', 'log10_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg'
    obsm: 'X_pca', 'X_tsne'
    layers: 'logcounts', 'normcounts'

## Simulation

We then use `pyscDesign3` to simulate two new datasets, one with a Gaussian copula and one with a vine copula.

In [None]:
bpparam = pyscDesign3.get_bpparam("SnowParam",show=False)
gaussian = pyscDesign3.scDesign3(n_cores=3,parallelization="bpmapply",bpparam=bpparam)
vine = pyscDesign3.scDesign3(n_cores=3,parallelization="bpmapply",bpparam=bpparam)

In [None]:
gaussian_res = gaussian.scdesign3(sce = subdata,
                            celltype = 'cell_type',
                            corr_formula = "cell_type",
                            mu_formula = "cell_type",
                            sigma_formula = "cell_type",
                            copula = "gaussian",
                            assay_use = "normcounts",
                            family_use = "nb",
                            pseudo_obs = True, 
                            return_model = True)

vine_res = vine.scdesign3(sce = subdata,
                        celltype = 'cell_type',
                        corr_formula = "cell_type",
                        mu_formula = "cell_type",
                        sigma_formula = "cell_type",
                        copula = "vine",
                        assay_use = "normcounts",
                        family_use = "nb",
                        pseudo_obs = True, 
                        return_model = True)
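Since the vine copula fit is expected to be slower than the Gaussian one, it can be useful to measure both on your own data. A minimal sketch of a timing wrapper follows; `timed` is a hypothetical helper written for this example and is not part of `pyscDesign3`. It is demonstrated on a stand-in function so that it runs on its own.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


# Stand-in workload; in the notebook one would pass gaussian.scdesign3 or
# vine.scdesign3 together with the same keyword arguments used above.
total, seconds = timed("toy sum", sum, range(1_000_000))
```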

## Visualization

For the simulation result using the Gaussian copula, the returned object contains a `corr_list`, which holds the gene-gene correlation matrix for each user-specified group; in this case, the groups are cell types. For the simulation result using the vine copula, `corr_list` instead gives the vine structure for each group. We then reformat the two `corr_list` objects and visualize them.
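One simple way to reformat a correlation matrix before plotting is to flatten its upper triangle into (gene, gene, correlation) records and rank them. The sketch below assumes that a group's correlation matrix can be indexed as `matrix[i][j]` (a nested list or a NumPy array) and that the gene names are available in matching order; how `corr_list` is actually laid out in the Python return object is not shown here, so treat this layout as an assumption. The function name `top_correlated_pairs` and the toy values are made up for illustration.

```python
def top_correlated_pairs(matrix, gene_names, n_top=5):
    """Return the n_top gene pairs with the largest absolute correlation."""
    pairs = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the matrix is symmetric
            pairs.append((gene_names[i], gene_names[j], float(matrix[i][j])))
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:n_top]


# Toy 3 x 3 correlation matrix with placeholder gene names (not real results)
toy_matrix = [
    [1.0, 0.6, -0.1],
    [0.6, 1.0, 0.3],
    [-0.1, 0.3, 1.0],
]
toy_genes = ["geneA", "geneB", "geneC"]

for gene_a, gene_b, corr in top_correlated_pairs(toy_matrix, toy_genes, n_top=2):
    print(gene_a, gene_b, corr)
```

The same long-format records can be fed to a heatmap or a network plot, one per cell type.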

### Gaussian copula

### Vine copula

Compared with the visualization above, the plots below show more directly which genes are connected in the vine structure and display the resulting gene networks.