# Sampling subclones

This notebook shows how to use the utilities for sampling CNV profiles.

## Loading some data
First, let's load some data:

In [None]:
import anndata as ad
import pathlib
import scvi

DATA_PATH = pathlib.Path("../data")


adata = ad.read_h5ad(DATA_PATH / "non_malignant.h5ad")
# Keep only T cells and B cells.
adata = adata[adata.obs["celltype"].isin(["Tcells", "Bcells"])].copy()

## Generating subclones

Now, we'll import the subpackage dedicated to this:

In [None]:
import sys
sys.path.insert(0, "../")

import simul.cna.api as cna

Next, we need to load the human genome (or at least the genes of interest) into a convenient object:

In [None]:
genome = cna.Genome(genes_df=adata.var, chromosome_column="chromosome", start_column="start")

We can use it to generate CNV profiles corresponding to different subclones. Here we generate 10 subclones.

In [None]:
profile_generator = cna.CNVProfileGenerator(
    genome=genome,
    chromosomes_gain=["chr1", "chr2", "chr5"],
    chromosomes_loss=["chr4", "chr12", "chr11"],
    seed=2022,
)

subclones = [
    profile_generator.generate_subclone() for _ in range(10)
]

Let's see how many genes in each subclone are gained/lost:

**TODO: Consider adding visualisations, like these in inferCNVpy?**

In [None]:
for i, subclone in enumerate(subclones, 1):
    n_gains = sum(subclone == 1)
    n_losses = sum(subclone == -1)

    print(f"Subclone {i}: {n_gains} gains and {n_losses} losses.")

Now we will try to learn anchor genes:

**TODO: Do we want to learn them on synthetic data or something from inferCNVpy?
We can use the estimator in both cases, if the gene order doesn't change.**

In [None]:
anchor_estimator = cna.MostFrequentGainLossAnchorsEstimator(gene_names=genome)
anchor_estimator.fit(subclones)

print(f"As the gain anchor gene we will use {anchor_estimator.gene_gain}.")
print(f"As the loss anchor gene we will use {anchor_estimator.gene_loss}.")

Let's see what the anchors of the different subclones look like:

In [None]:
anchor_estimator.predict(subclones)
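
To make the idea of anchors more concrete, here is a minimal sketch on toy data. It assumes that the anchors are the most frequently gained and the most frequently lost gene across subclones, and that each subclone is then described by a pair `(gain anchor is gained, loss anchor is lost)`. This is inferred from the estimator's name and from the encoding used above (`1` for gain, `-1` for loss); it is not the library's implementation, and all `toy_*` names are hypothetical.

```python
import numpy as np

# Toy CNV profiles: rows are subclones, columns are genes.
# Encoding as in the cells above: 1 = gain, -1 = loss, 0 = neutral.
toy_genes = np.array(["geneA", "geneB", "geneC", "geneD"])  # hypothetical gene names
toy_profiles = np.array([
    [1, 1, 0, -1],
    [0, 1, 0, -1],
    [0, 1, -1, 0],
])

# Count, for every gene, the number of subclones in which it is gained or lost.
gain_counts = (toy_profiles == 1).sum(axis=0)
loss_counts = (toy_profiles == -1).sum(axis=0)

# Assumption: the anchors are the most frequently gained and lost genes.
gain_idx = int(np.argmax(gain_counts))
loss_idx = int(np.argmax(loss_counts))
gain_anchor = toy_genes[gain_idx]
loss_anchor = toy_genes[loss_idx]

# For each subclone, record whether the gain anchor is gained
# and whether the loss anchor is lost.
toy_anchors = [
    (bool(profile[gain_idx] == 1), bool(profile[loss_idx] == -1))
    for profile in toy_profiles
]

print(gain_anchor, loss_anchor)
print(toy_anchors)
```

In this toy example every subclone carries the gain anchor, but only the first two carry the loss anchor, so the subclones fall into two anchor groups.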

## Sampling programs

For each batch and anchor we want to have a distribution over different programs. Consider a case with three different programs and two patients.

We will need to specify how the probability of each program depends on the anchors. (The sampling procedure is in fact more complex; we will discuss it later.)

In [None]:
anchors_to_alphas = {
    (True, True): [100, 5, 6],
    (False, True): [10, 10, 100],
    (True, False): [10, 100, 10],
    (False, False): [10, 10, 10],
}

batches = ["patient1", "patient2"]
program_names = ["program1", "program2", "program3"]

As mentioned above, the sampling procedure is more complex. For each patient we will have a _sample_ from the Dirichlet distribution parametrized by the values above.

Moreover, we will randomly drop some programs in some patients. Let's set the dropout probability to a high value, but require at least two different programs in each patient.

In [None]:
distribution = cna.generate_probabilities(
    anchors_to_alphas=anchors_to_alphas,
    batches=batches,
    min_programs=2,
    prob_dropout=0.8,
    program_names=program_names,
)
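
The two ingredients described above (a Dirichlet sample per patient and random program dropout with a minimum number of surviving programs) can be sketched in plain NumPy. This is only an illustration under stated assumptions, not how `cna.generate_probabilities` is implemented: the helper `toy_patient_probabilities` is hypothetical, the dropout mask is assumed to be shared across all anchor configurations of a patient, and the minimum is enforced by simply redrawing the mask.

```python
import numpy as np


def toy_patient_probabilities(anchors_to_alphas, prob_dropout, min_programs, rng):
    """Illustrative only: program probabilities for a single patient."""
    n_programs = len(next(iter(anchors_to_alphas.values())))
    if not 1 <= min_programs <= n_programs:
        raise ValueError("min_programs must be between 1 and the number of programs")
    if not 0 <= prob_dropout < 1:
        raise ValueError("prob_dropout must be in [0, 1)")

    # Assumption: draw one dropout mask per patient, redrawing until
    # at least `min_programs` programs survive.
    while True:
        keep = rng.random(n_programs) >= prob_dropout
        if keep.sum() >= min_programs:
            break

    probabilities = {}
    for anchors, alphas in anchors_to_alphas.items():
        # Sample from the Dirichlet distribution, zero out the dropped
        # programs, and renormalise so the probabilities sum to one.
        probs = rng.dirichlet(alphas) * keep
        probabilities[anchors] = probs / probs.sum()
    return probabilities


toy_alphas = {(True, True): [100, 5, 6], (False, False): [10, 10, 10]}
toy_probs = toy_patient_probabilities(
    toy_alphas, prob_dropout=0.8, min_programs=2, rng=np.random.default_rng(0)
)
for anchors, probs in toy_probs.items():
    print(anchors, np.round(probs, 2).tolist())
```

Large concentration parameters such as `100` pull the sample towards the corresponding program, while equal values such as `[10, 10, 10]` give samples scattered around the uniform distribution.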

We have a distribution object, which can be used to sample from $P(\text{program} \mid \text{anchors}, \text{batch})$.

Let's see how this works in practice:

In [None]:
import numpy as np
from itertools import product

for batch in batches:
    print(f"Batch {batch}")
    for anchors in product([True, False], [True, False]):
        prob = distribution.probabilities(anchors=anchors, batch=batch)
        print(f"Anchors {anchors}:\t{np.round(prob, 2).tolist()}")
    
    print("\n\n")
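
Given such a probability vector, drawing programs for individual cells is a categorical sampling step. The sketch below uses a hypothetical vector `toy_prob` (in which the second program has been dropped) instead of the output of the distribution object, and relies only on NumPy's `Generator.choice`; whether the library samples in exactly this way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

toy_program_names = ["program1", "program2", "program3"]
# Hypothetical probabilities for one (anchors, batch) pair; program2 is dropped.
toy_prob = np.array([0.7, 0.0, 0.3])

# Draw a program for each of 1000 cells.
sampled = rng.choice(toy_program_names, size=1000, p=toy_prob)

# Empirical frequencies should be close to the probabilities above.
names, counts = np.unique(sampled, return_counts=True)
toy_frequencies = dict(zip(names.tolist(), (counts / counts.sum()).tolist()))
print(toy_frequencies)
```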

## Modifying gene expression

See the `simul.cna.gene_expression` submodule. The key functions are `sample_gain_vector`, `sample_loss_vector`, `perturb`, and, most importantly, `change_expression`.
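
As a rough intuition for what modifying expression by a CNV profile can mean, here is a toy sketch in plain NumPy. It does not use or reproduce the signatures of the functions listed above. It assumes, purely for illustration, that every gene has a multiplicative gain factor above one and a loss factor below one, and that a cell's expression is rescaled according to the CNV profile of its subclone. All `toy_*` names and the factor ranges are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix (cells x genes) and one CNV profile over the genes.
toy_expression = rng.poisson(lam=5.0, size=(4, 6)).astype(float)
toy_cnv = np.array([1, 1, 0, 0, -1, 0])  # 1 = gain, -1 = loss, 0 = neutral

# Assumed per-gene effect sizes: gains scale expression up, losses scale it down.
toy_gain_factors = rng.uniform(1.5, 2.5, size=toy_cnv.shape)
toy_loss_factors = rng.uniform(0.3, 0.7, size=toy_cnv.shape)

# Build one multiplicative factor per gene; neutral genes keep a factor of 1.
factors = np.ones(toy_cnv.shape)
factors[toy_cnv == 1] = toy_gain_factors[toy_cnv == 1]
factors[toy_cnv == -1] = toy_loss_factors[toy_cnv == -1]

# Broadcasting applies the per-gene factor to every cell.
toy_modified = toy_expression * factors

print(np.round(factors, 2))
```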