In [None]:
import os
import sys

os.chdir("..")
sys.path.append("../../")

# Run pyscDesign3 pipeline step by step

## Introduction

This section shows how to run the whole pyscDesign3 pipeline step by step.

A basic introduction to the pyscDesign3 package is given in the [All in one simulation](./all_in_one.ipynb) section; if you are new to the package, go through that tutorial first.

The methods are introduced below in the order in which they are executed.

- [Construct data](step-1-construct-data) ([API](../set_up/_autosummary/pyscDesign3.scDesign3.construct_data.rst))
- [Fit marginal](step-2-fit-marginal) ([API](../set_up/_autosummary/pyscDesign3.scDesign3.fit_marginal.rst))
- [Fit copula](step-3-fit-copula) ([API](../set_up/_autosummary/pyscDesign3.scDesign3.fit_copula.rst))
- [Extract parameters](step-4-extract-parameters) ([API](../set_up/_autosummary/pyscDesign3.scDesign3.extract_para.rst))
- [Simulate new data](step-5-simulate-new-data) ([API](../set_up/_autosummary/pyscDesign3.scDesign3.simu_new.rst))

## Step 0: Preparation

### Import packages


In [None]:
import anndata as ad
import pyscDesign3

### Read in data

The raw data comes from [scvelo](https://scvelo.readthedocs.io/scvelo.datasets.pancreas/); we keep only the first 30 genes to save time.


In [None]:
data = ad.read_h5ad("data/PANCREAS.h5ad")
data = data[:, 0:30]
data

### Create the `scDesign3` instance


In [None]:
test = pyscDesign3.scDesign3(n_cores=6, parallelization="mcmapply", return_py=True)
test.set_r_random_seed(123)

(step-1-construct-data)=
## Step 1: Construct data

This function constructs the input dataset.

```{eval-rst}
.. Note::
    The default assay counts stored in `anndata.AnnData.X` have no name. To use the default assay, assign it a name via the `default_assay_name` parameter. To use an assay stored in `anndata.AnnData.layers[assay_use]` instead, pass its name via the `assay_use` parameter.
```


In [None]:
const_data = test.construct_data(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime="pseudotime",
    corr_formula="1",
)

The results are all converted to `pandas.DataFrame` objects so that you can easily inspect and manipulate them in Python.


In [None]:
const_data["dat"].head()

(step-2-fit-marginal)=
## Step 2: Fit marginal

Fit regression models for each gene (feature) based on your specification.


```{eval-rst}
.. Note::
    Although the parallelization method was set when creating the instance, it can be overridden temporarily for an individual method call.

    Here we change the parallelization method to `bpmapply`, which requires an extra bpparam object obtained from the get_bpparam() function. (Details of the function are given in the :doc:`Get BPPARAM <./bpparam>` section.)
```

In [None]:
bpparam = pyscDesign3.get_bpparam(mode="MulticoreParam", show=False)
marginal = test.fit_marginal(
    data=const_data,
    mu_formula="s(pseudotime, k = 10, bs = 'cr')",
    sigma_formula="1",
    usebam=True,
    family_use="nb",
    n_cores=3,
    parallelization="bpmapply",
    bpparam=bpparam,
)

```{eval-rst}
.. Warning::
    Converting the marginal list to OrdDict is a known problem that has not been fixed yet. Use the .rx2 method to access values.
```

**If you want to manipulate the results**


In [None]:
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))

In [None]:
marginal.rx2("Pyy").rx2("fit").rx2("coefficients")[0] = 1.96
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))

In [None]:
# marginal["Pyy"]["fit"]["coefficients"]

In [None]:
# marginal["Pyy"]["fit"]["coefficients"][0] = 1.96
# marginal["Pyy"]["fit"]["coefficients"]

(step-3-fit-copula)=
## Step 3: Fit copula

Fit a copula and obtain its AIC and BIC.


In [None]:
copula = test.fit_copula(
    input_data=const_data["dat"],
    marginal_dict=marginal,
    important_feature="auto",
    copula="vine"
)

We can evaluate the model by checking the AIC.


In [None]:
copula["model_aic"]
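
As a reminder of what this number measures, the sketch below computes AIC and BIC from their textbook definitions. It is a generic illustration with made-up log-likelihoods and parameter counts, not the package's internal code; how pyscDesign3 counts the parameters of a vine copula is not covered in this tutorial. Lower values indicate a better trade-off between fit and complexity.

```python
import math


def aic(log_likelihood, n_params):
    # Akaike information criterion: 2k - 2 ln(L)
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    # Bayesian information criterion: k ln(n) - 2 ln(L)
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Made-up numbers for two hypothetical models fitted to the same 500 cells
n_obs = 500
ll_a, k_a = -1200.0, 10  # simpler model
ll_b, k_b = -1195.0, 30  # more flexible model with a slightly better fit

aic_a, aic_b = aic(ll_a, k_a), aic(ll_b, k_b)
bic_a, bic_b = bic(ll_a, k_a, n_obs), bic(ll_b, k_b, n_obs)

print("AIC:", aic_a, aic_b)
print("BIC:", round(bic_a, 2), round(bic_b, 2))
```

In this toy comparison the simpler model has the lower AIC and BIC, because the small gain in log-likelihood does not pay for the 20 extra parameters.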

```{eval-rst}
.. Note::
    The return value is a `rpy2.rlike.container.OrdDict`. Not every element in this `dict`-like object has to be named, but the elements have a fixed order. A key of **None** means the element has no name. For values without a named key, call the `byindex` method to fetch them by index (rank).
```

Here is an example that fetches vine copula values. **Calling the `byindex` method returns a tuple whose first item is the key and whose second item is the value.**

The example accesses the `pair_copulas` property of the R vinecop class and reads its `family` entry. The equivalent R code is `copula$copula_list$"1"$pair_copulas[[1]][[1]]$"family"`.

In [None]:
print(copula["copula_list"]["1"]["pair_copulas"].byindex(0)[-1].byindex(0)[-1]["family"])
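
To make the `byindex(...)[-1]` chaining easier to follow, here is a minimal pure-Python stand-in for an ordered container with optional names. `MiniOrdDict` is a hypothetical class written only for this illustration (it is not rpy2's implementation), and the nested structure and the `"gaussian"` value are placeholders rather than real fit results.

```python
class MiniOrdDict:
    """Hypothetical stand-in for an ordered container with optional names."""

    def __init__(self, pairs):
        # pairs: list of (key, value); key is None for unnamed elements
        self._pairs = list(pairs)

    def __getitem__(self, key):
        # Look up a named element, like a dict
        for k, v in self._pairs:
            if k is not None and k == key:
                return v
        raise KeyError(key)

    def byindex(self, i):
        # Look up by position; returns the (key, value) tuple
        return self._pairs[i]


# Placeholder structure mirroring pair_copulas[[1]][[1]]$family in R
pair = MiniOrdDict([("family", "gaussian"), ("rotation", 0)])
tree = MiniOrdDict([(None, pair)])          # unnamed element
pair_copulas = MiniOrdDict([(None, tree)])  # unnamed element

key, value = pair_copulas.byindex(0)  # key is None, value is `tree`
family = pair_copulas.byindex(0)[-1].byindex(0)[-1]["family"]
print(key, family)
```

Each `byindex(0)` returns a `(key, value)` tuple, so `[-1]` picks the value before descending one level further.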

(step-4-extract-parameters)=
## Step 4: Extract parameters

Extract the estimated parameters so that you can modify them and, if needed, use the modified parameters to generate new data. The following parameters can be extracted:

- a cell-by-gene mean matrix
- a sigma matrix which is:
  - a cell-by-gene matrix of $\frac{1}{\phi}$ for the negative binomial distribution
  - a cell-by-gene matrix of the standard deviation $\sigma$ for the Gaussian distribution
  - a cell-by-gene matrix of 1 for the Poisson distribution
- a zero matrix which is:
  - a cell-by-gene matrix of zero probabilities for the zero-inflated negative binomial and zero-inflated Poisson distributions
  - a zero matrix for the negative binomial, Gaussian, and Poisson distributions
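
To give a feel for what the negative binomial sigma entries mean, the sketch below relates a mean and a sigma value to a variance. It assumes that $\phi$ is the size parameter of the usual mean/size parameterization, so that the variance is $\mu + \mu^2/\phi = \mu + \sigma\mu^2$; the tutorial does not spell this out, so treat it as an assumption. `nb_variance` is a hypothetical helper, not part of pyscDesign3.

```python
def nb_variance(mu, sigma):
    """Variance of a negative binomial with mean mu and sigma = 1/phi.

    Assumption: phi is the NB size parameter, so Var = mu + sigma * mu**2.
    """
    return mu + sigma * mu ** 2


mu = 2.0
for sigma in (0.0, 0.5, 2.0):
    # sigma = 0 recovers the Poisson case, where the variance equals the mean
    print(f"sigma={sigma}: variance={nb_variance(mu, sigma)}")
```

Under this assumption, increasing an entry of the sigma matrix increases the overdispersion of the corresponding gene in that cell, while the mean stays unchanged.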


In [None]:
para = test.extract_para(
    marginal_dict=marginal,
    data=const_data["dat"],
    new_covariate=None,
)

The output matrices can be modified using standard `pandas.DataFrame` syntax.

In [None]:
para["mean_mat"].iloc[0:6,0:5]
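
As a minimal sketch of such a modification, the example below edits a small toy DataFrame that stands in for `para["mean_mat"]`. The cell names, gene names, and values are made up for illustration; only ordinary pandas operations are used.

```python
import pandas as pd

# Toy stand-in for para["mean_mat"]: rows are cells, columns are genes
mean_mat = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0], "gene_b": [0.5, 0.5, 0.5]},
    index=["cell_1", "cell_2", "cell_3"],
)

# Work on a copy so the original estimates stay available for comparison
modified = mean_mat.copy()
modified["gene_a"] = modified["gene_a"] * 2  # double the mean of one gene
modified.loc["cell_1", "gene_b"] = 4.0       # overwrite a single entry

print(modified)
```

A matrix modified this way can then be passed as the `mean_mat` argument of `simu_new` in Step 5 in place of the original one.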

(step-5-simulate-new-data)=
## Step 5: Simulate new data

In [None]:
simu_new = test.simu_new(
    mean_mat=para["mean_mat"],
    sigma_mat=para["sigma_mat"],
    zero_mat=para["zero_mat"],
    copula_dict=copula["copula_list"],
    input_data=const_data["dat"],
    new_covariate=const_data["newCovariate"],
    important_feature=copula["important_feature"],
    filtered_gene=const_data["filtered_gene"],
)

The final simulated result is also a `pandas.DataFrame` object.

In [None]:
simu_new.iloc[0:6,0:6]