# Noise in gene expression
<!-- Generally speaking, we call deviations from what we might expect from a deterministic view of gene expression _stochasticity_, or **noise**.

To quantitatively define noise, it helps to perform a thought experiment. Say we have many exactly identical cells. In the future, as gene expression, responses to the environment (which we assume is the same for each cell), etc. proceed, the cells will no longer be identical due to stochasticity in all of these processes involving small numbers of molecules. Consider one gene product of interest in these cells. We can define the **total noise**, $\eta_\mathrm{tot}$, as the **coefficient of variation** of the copy numbers of the gene product. The coefficient of variation is the standard deviation of the copy number divided by the mean copy number. If the standard deviation is small compared to the mean, as we would expect in the case of large copy numbers, we have low noise, but if it is large compared to the mean, we have high noise.

We might have fluctuations in environmental conditions that change the expression level of the gene of interest, either directly or indirectly through expression of other genes in any given cell. We might also have fluctuations in gene expression due to the inherent stochasticity present in cellular processes, such as those involved in the central dogma. This leads us to categorize the noise as **intrinsic** or **extrinsic**. -->

> $CV$: coefficient of variation of the copy number of the gene product, $CV = \sigma / \mu$ (standard deviation divided by mean).

- **Intrinsic noise** ($\eta_\mathrm{int}$): Transcription and translation can occur at different times and rates in otherwise identical systems. This results in fluctuations in copy numbers. The fluctuations in the copy number of the protein of interest are due to fluctuations that affect _only_ the gene of interest.

- **Extrinsic noise** ($\eta_\mathrm{ext}$): Other molecular species, such as RNA polymerase, ribosomes, and chemical species in the cell's environment, vary over time and affect the gene of interest. The fluctuations in the copy number of the protein of interest are due to fluctuations that affect _all_ genes in a cell.

**Issue**: just measuring cell-cell variation does not separate intrinsic noise in the process of gene expression from extrinsic cell-cell variation in key cellular components. Thus, it was critical to actually measure both intrinsic and extrinsic noise and see how both behave.

**Solution**: we can put two (nearly) identical _genes_ in the _same_ cell. This allows us to think of these two genes as two independent stochastic samples of the same underlying process. If expression were deterministic and influenced only by extrinsic noise, the two genes would vary together from cell to cell, and their levels would be strongly correlated. If, however, expression is itself stochastic (intrinsic noise), each gene also fluctuates independently, and that part of the variation is uncorrelated between the two genes.
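The split into correlated and uncorrelated variation can be sketched numerically. The block below is a minimal illustration on synthetic cells, not the Elowitz data. It uses the two-reporter estimators as they are commonly written for this experiment, and assumes both reporters have already been put on the same scale. The function name `noise_decomposition` and all numbers are made up for the example.

```python
import random

def noise_decomposition(cfp, yfp):
    """Squared intrinsic, extrinsic and total noise from paired reporter levels.

    Assumes both reporters are expressed on the same scale (equal means).
    """
    n = len(cfp)
    mean_c = sum(cfp) / n
    mean_y = sum(yfp) / n
    norm = mean_c * mean_y
    mean_sq_diff = sum((c - y) ** 2 for c, y in zip(cfp, yfp)) / n
    mean_cy = sum(c * y for c, y in zip(cfp, yfp)) / n
    mean_sq_sum = sum(c * c + y * y for c, y in zip(cfp, yfp)) / n

    int_sq = mean_sq_diff / (2 * norm)              # uncorrelated part
    ext_sq = (mean_cy - norm) / norm                # correlated part
    tot_sq = (mean_sq_sum - 2 * norm) / (2 * norm)  # equals int_sq + ext_sq
    return int_sq, ext_sq, tot_sq

# Synthetic cells (illustrative numbers only)
rng = random.Random(0)
cfp, yfp = [], []
for _ in range(5000):
    shared = rng.gauss(1000.0, 80.0)     # extrinsic: common to both reporters
    cfp.append(rng.gauss(shared, 50.0))  # intrinsic: independent per reporter
    yfp.append(rng.gauss(shared, 50.0))

int_sq, ext_sq, tot_sq = noise_decomposition(cfp, yfp)
print(f"intrinsic {int_sq ** 0.5:.3f}, extrinsic {ext_sq ** 0.5:.3f}, total {tot_sq ** 0.5:.3f}")
```

With these settings the intrinsic noise should come out near 50/1000 and the extrinsic noise near 80/1000, and the squared intrinsic and extrinsic terms add up to the squared total noise by construction.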

<!-- To perform this experiment, [Elowitz and coworkers (_Science_, 2002)](https://doi.org/10.1126/science.1070919) developed a system in which identical promoters were used to express fluorescent proteins of different colors, cyan fluorescent protein (CFP) and yellow fluorescent protein (YFP). The promoters were repressed by the LacI protein. They could tune the repression by changing the concentration of IPTG, since IPTG inhibits LacI's ability to act as a repressor. They could then measure the expression level for identical promoters under identical conditions, since the promoters are in the same cell.

Your job in this problem is to do Bayesian modeling to get parameter estimates for the intrinsic and extrinsic noise. This problem has been approached several times, first, of course, in the original [Elowitz paper](https://doi.org/10.1126/science.1070919).

> **intrinsic noise** is the noise inherent in the expression level of the gene of interest in each individual cell, and<br>**extrinsic noise** results from variation from cell to cell.
 -->
Build a hierarchical model for the Elowitz experiment to get parameter estimates for the intrinsic and extrinsic noise. 

### In your modeling, you can make the following assumptions.

1. The copy number of a fluorophore is linearly related to the measured fluorescence with zero intercept, which also means there is no background fluorescence.
2. If CFP and YFP have the same copy number in a cell, their measured fluorescence differs by a constant factor.
3. All noise is inherent to the genetic machinery of the bacteria and environmental fluctuations; there is no measurement error.
4. The fluorescent intensity of each cell is independent of all other cells and also identically distributed (i.i.d.).
5. Although under these assumptions the measured fluorescence should take on discrete values, we nonetheless model the fluorescence values as continuous.
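Under these assumptions, one way to write the generative model is the following, which mirrors the Stan code used in the answer. Here $i$ indexes cells, $\theta$ is the population-level mean YFP signal, $\theta_{1,i}$ is the mean for cell $i$, $\tau$ and $\sigma$ are the between-cell and within-cell spreads, and $r$ is the constant factor relating CFP to YFP fluorescence. The Normal likelihood and the specific priors are modeling choices, not consequences of the assumptions.

$$
\begin{aligned}
\log_{10}\theta &\sim \mathrm{Normal}(3.0,\ 0.6378),\\
\tau &\sim \mathrm{HalfNormal}(0,\ 2000),\\
\sigma &\sim \mathrm{HalfNormal}(0,\ 1000),\\
r &\sim \mathrm{Gamma}(1.5,\ 1.5),\\
\theta_{1,i} &\sim \mathrm{Normal}(\theta,\ \tau),\\
F_{y,i} &\sim \mathrm{Normal}(\theta_{1,i},\ \sigma),\\
F_{c,i} &\sim \mathrm{Normal}(r\,\theta_{1,i},\ r\,\sigma).
\end{aligned}
$$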

__________________________
# Answer:
> Note for Justin and the TAs: Initially the "regular" model (centered parameters) failed, and I debugged it a lot. I then wrote another version of the Stan file with non-centered parameters, and it worked well. Later, lengthening the warmup and increasing the maximum tree depth of the centered script fixed it too. The results are almost identical, and so are the expected log pointwise predictive densities (estimated by either `WAIC` or `LOO`), but the non-centered model runs faster. I decided to keep both and compare them, so I run both scripts for each strain.
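The two parameterizations describe the same distribution for the cell-level means and differ only in what the sampler explores. The sketch below illustrates this with plain random draws. The values of `theta` and `tau` are illustrative placeholders, not fitted results.

```python
import random
import statistics

rng = random.Random(42)
theta, tau = 1400.0, 75.0  # illustrative values only
n_draws = 20000

# Centered: draw theta_1 directly from Normal(theta, tau)
centered = [rng.gauss(theta, tau) for _ in range(n_draws)]

# Non-centered: draw a standard normal, then shift and scale it
theta_1_tilde = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
noncentered = [theta + tau * z for z in theta_1_tilde]

print(statistics.mean(centered), statistics.stdev(centered))
print(statistics.mean(noncentered), statistics.stdev(noncentered))
```

Both sets of draws have mean close to `theta` and standard deviation close to `tau`. In the non-centered form the sampler works with `theta_1_tilde`, whose prior scale does not depend on `tau`, which is why it tends to be easier to sample when `tau` is poorly constrained.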

## Steps:
1. Light exploration of the data
2. Prior predictive checks
3. Stan file creation (both files)
4. Run on m22 strain
    1. Run MCMC with "**centered** parameters"
       1. Run MCMC (elongate warmup and increase tree depth)
       2. Check diagnostics
       3. Plot the parameter estimations (corner)
       4. Plot posterior predictive checks (ECDF and Diff-ECDF)
       5. Get log-likelihood score (LOO)
    2. Run MCMC with "**non-centered** parameters"
        1. \- 5.   _same steps as for centered parameters_
5. Run on d22 strain
    1. \- 3.  _same steps as m22 strain_

## Conclusion
- When CFP and YFP have the same copy number in cells of strains $\substack{\text{m22} \\ \text{d22}}$, their measured fluorescence differs by factors of $\substack{\text{1.84} \\ \text{1.41}}$.

- The non-centered and the centered (with extra tree depth and warmup) approaches produce similar estimates. This indicates that the groups (cells) are not similar (extrinsic noise); otherwise the centered approach would have encountered the issues associated with "the funnel of hell".

- The extrinsic noise was calculated as $\tau/\theta$ (the spread between cells) and the intrinsic noise as $\sigma/\theta_1$ (expression variation within a cell):<br>*Values in the table are based on the results of the non-centered models.

| Strain | Noise type | Mean parameter | Mean value<br>(median [95% HPD]) | Std parameter | Std value<br>(median [95% HPD]) | CV |
|--------|------------|----------------|----------------------------------|---------------|---------------------------------|---------|
| m22    | Extrinsic  | $\theta$       | 1408.7 [1395.2, 1422.2]          | $\tau$        | 76.3 [65.2, 88.0]               | $0.054$ |
| m22    | Intrinsic  | $\theta_1$     | 1409.0 [1260.8, 1555.3]          | $\sigma$      | 78.6 [72.0, 86.0]               | $0.056$ |
| d22    | Extrinsic  | $\theta$       | 1897.62 [1871.03, 1923.69]       | $\tau$        | 153.8 [133.1, 174.0]            | $0.081$ |
| d22    | Intrinsic  | $\theta_1$     | 1893.7 [1600.3, 2207.6]          | $\sigma$      | 155.3 [143.8, 169.5]            | $0.082$ |

Surprisingly, the coefficients of variation for the extrinsic and intrinsic noise are nearly identical in both strains.
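The CV values above are ratios of posterior medians. An alternative, sketched below, is to compute the ratio for every posterior draw so that the CV gets its own uncertainty interval. The draws here are synthetic stand-ins, not the fitted posterior. In practice they would be the flattened `tau` and `theta` samples. The interval shown is a central quantile interval, not an HPD.

```python
import random
import statistics

rng = random.Random(1)
n_draws = 4000

# Stand-in posterior draws (synthetic; replace with the flattened
# samples of `tau` and `theta` from the fitted model).
theta_draws = [rng.gauss(1400.0, 7.0) for _ in range(n_draws)]
tau_draws = [abs(rng.gauss(76.0, 6.0)) for _ in range(n_draws)]

# One CV value per draw keeps tau and theta paired as they were sampled.
cv_draws = sorted(t / th for t, th in zip(tau_draws, theta_draws))

cv_median = statistics.median(cv_draws)
# Central 95% interval from the empirical quantiles
cv_lo = cv_draws[int(0.025 * n_draws)]
cv_hi = cv_draws[int(0.975 * n_draws) - 1]
print(f"CV = {cv_median:.3f} [{cv_lo:.3f}, {cv_hi:.3f}]")
```

Reporting the interval makes it easier to judge whether two CVs that look nearly identical are distinguishable given the posterior uncertainty.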



In [None]:
#%% import
import os

import numpy as np
rng = np.random.default_rng()

import polars as pl
# import pandas as pd

import cmdstanpy
import arviz as az

import bebi103
import iqplot

import bokeh.io
from bokeh.layouts import row as bokeh_row
from bokeh.layouts import column as bokeh_column

bokeh.io.output_notebook()

## Load and explore data

In [2]:
data_path = '../data/'
df = pl.read_csv(data_path + 'elowitz_et_al_2002_fig_3a.csv')
print(df.head())

df_long = df.unpivot(index=['strain'], on=['yfp','cfp'], variable_name='gene', value_name='expression')
bokeh.io.show(
    iqplot.strip(
        df_long,
        q="expression",
        cats=["strain",'gene'],
        color_column="gene",
        marker_kwargs=dict(alpha=0.6),
    )
)

shape: (5, 3)
┌────────┬─────────────┬─────────────┐
│ strain ┆ cfp         ┆ yfp         │
│ ---    ┆ ---         ┆ ---         │
│ str    ┆ f64         ┆ f64         │
╞════════╪═════════════╪═════════════╡
│ m22    ┆ 2438.345791 ┆ 1408.98312  │
│ m22    ┆ 2315.822957 ┆ 1391.341618 │
│ m22    ┆ 2521.433006 ┆ 1510.704169 │
│ m22    ┆ 2646.205984 ┆ 1460.46272  │
│ m22    ┆ 2830.095578 ┆ 1637.701793 │
└────────┴─────────────┴─────────────┘


In [3]:
#%% Split by strain
df_m22 = df.filter(pl.col('strain') == 'm22').rename({"yfp":"Fy","cfp":"Fc"})
df_m22 = df_m22.with_columns(pl.arange(0, df_m22.height).alias('cell')) 
print(df_m22.head())

df_d22 = df.filter(pl.col('strain') == 'd22').rename({"yfp":"Fy","cfp":"Fc"})
df_d22 = df_d22.with_columns(pl.arange(0, df_d22.height).alias('cell'))
print(df_d22.head())



shape: (5, 4)
┌────────┬─────────────┬─────────────┬──────┐
│ strain ┆ Fc          ┆ Fy          ┆ cell │
│ ---    ┆ ---         ┆ ---         ┆ ---  │
│ str    ┆ f64         ┆ f64         ┆ i64  │
╞════════╪═════════════╪═════════════╪══════╡
│ m22    ┆ 2438.345791 ┆ 1408.98312  ┆ 0    │
│ m22    ┆ 2315.822957 ┆ 1391.341618 ┆ 1    │
│ m22    ┆ 2521.433006 ┆ 1510.704169 ┆ 2    │
│ m22    ┆ 2646.205984 ┆ 1460.46272  ┆ 3    │
│ m22    ┆ 2830.095578 ┆ 1637.701793 ┆ 4    │
└────────┴─────────────┴─────────────┴──────┘
shape: (5, 4)
┌────────┬─────────────┬─────────────┬──────┐
│ strain ┆ Fc          ┆ Fy          ┆ cell │
│ ---    ┆ ---         ┆ ---         ┆ ---  │
│ str    ┆ f64         ┆ f64         ┆ i64  │
╞════════╪═════════════╪═════════════╪══════╡
│ d22    ┆ 3080.197178 ┆ 2308.663544 ┆ 0    │
│ d22    ┆ 3082.424396 ┆ 2394.410207 ┆ 1    │
│ d22    ┆ 2893.26981  ┆ 2144.537762 ┆ 2    │
│ d22    ┆ 3052.736615 ┆ 2340.273146 ┆ 3    │
│ d22    ┆ 2890.798622 ┆ 2244.935215 ┆ 4    │
└─────

## Prior predictive checks

In [None]:
#%% priors
theta_ = 10**rng.normal(3.0, 0.6378, size=1000) 
tau_ = np.abs(rng.normal(0, 2000, size=1000))
sigma_ = np.abs(rng.normal(0, 1000, size=1000))
mult_factor_ = rng.gamma(1.5, 1 / 1.5, size=1000)  # NumPy takes (shape, scale); Stan's gamma(1.5, 1.5) is (shape, rate)

# centered parameterization:
theta1_ = rng.normal(theta_, tau_)
# non-centered parameterization:
# theta1_tilde = rng.normal(0, 1, size=1000)
# theta1_ = theta_ + tau_ * theta1_tilde

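# Match the Stan model: YFP is the reference channel, and CFP is r times YFP in both mean and spread
yfp_syn_events = np.array([rng.normal(t1, s, size=1000) for t1, s in zip(theta1_, sigma_)])
cfp_syn_events = np.array([rng.normal(mf * t1, mf * s, size=1000) for t1, s, mf in zip(theta1_, sigma_, mult_factor_)])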

p1 = bebi103.viz.predictive_ecdf(
        cfp_syn_events, 
        x_axis_label="Fluorescence (a.u.)", 
        # x_axis_type='log',
        name="CFP",
        title="Prior Predictive Check - CFP",
)
# bokeh.io.show(p1)
p2 = bebi103.viz.predictive_ecdf(
        yfp_syn_events,
        # p=p,
        x_axis_label="Fluorescence (a.u.)", 
        # x_axis_type='log',
        color='orange',
        name="YFP",
        title="Prior Predictive Check - YFP",
)
bokeh.io.show(bokeh_row(p1, p2))


In [None]:
#%% Create data dicts for Stan
data_m22, df_m22_ = bebi103.stan.df_to_datadict_hier(
    df=df_m22,
    level_cols=["cell"],   
    data_cols=["Fy","Fc"]
)
data_d22, df_d22_ = bebi103.stan.df_to_datadict_hier(
    df_d22, level_cols=["cell"], data_cols=["Fy","Fc"]
)

__________________________
## Create stan files:

In [None]:
#%% Stan code for centered parameterization
centered_parameterization_stan_code="""
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;
  // int J_2;
  
  //Index arrays to keep track of hierarchical structure
  // array[J_2] int index_1;
  array[N] int index_1;
  
  // The measurements
  array[N] real Fy;
  array[N] real Fc;

}
parameters {
  // Log-scale hyperparameters
  real log10_theta;
  // non-Log-scale hyperparameters
  real<lower=0> tau;
  real<lower=0> sigma;
  
  // parameters for hierarchical levels
  vector[J_1] theta_1;
  
  // Multiplicative factor
  real<lower=0> r;
}
transformed parameters {
  // Transform to natural scale
  real<lower=0> theta = 10^log10_theta;
  real<lower=0> theta_1_mu = theta;
}
model {
  // Priors on log-scale parameters
  log10_theta ~ normal(3.0, 0.6378);
  tau ~ normal(0, 2000);
  sigma ~ normal(0, 1000);
  r ~ gamma(1.5, 1.5);
  
  // Hierarchical structure
  theta_1 ~ normal(theta_1_mu, tau);
  
  // Likelihood
  Fy ~ normal(theta_1[index_1], sigma);
  Fc ~ normal(r*theta_1[index_1], r*sigma);
}

generated quantities {
    // Posterior predictive samples
    array[N] real Fy_pred;
    array[N] real Fc_pred;

    // Generate one predicted observation per each real observation
      for (n in 1:N) {
        real temp_theta_1 = theta_1[index_1[n]];
        Fy_pred[n] = normal_rng(temp_theta_1, sigma);
        Fc_pred[n] = normal_rng(r*temp_theta_1, r*sigma);
    }
    // Log-likelihood
    array[N] real log_lik_Fy;
    array[N] real log_lik_Fc;
    array[N] real log_lik;
    for (n in 1:N) {
      real temp_theta_1 = theta_1[index_1[n]];
      log_lik_Fy[n] = normal_lpdf(Fy[n] | temp_theta_1, sigma);
      log_lik_Fc[n] = normal_lpdf(Fc[n] | r*temp_theta_1, r*sigma);
      log_lik[n] = log_lik_Fy[n] + log_lik_Fc[n];
    }

}
"""
#%% Write stan code to file
centered_parameterization_stan_name = 'TomerAntman-87-noise_gene_hierarchical_centered.stan'

if not os.path.exists(centered_parameterization_stan_name):
    with open(centered_parameterization_stan_name, 'w') as f:
        f.write(centered_parameterization_stan_code)


In [None]:
#%% Stan code for non-centered parameterization
noncentered_parameterization_stan_code="""
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;
  
  //Index arrays to keep track of hierarchical structure
  array[N] int index_1;
  
  // The measurements
  array[N] real Fy;
  array[N] real Fc;

}
parameters {
  // Log-scale hyperparameters
  real log10_theta;
  
  // non-Log-scale hyperparameters
  real<lower=0> tau;
  real<lower=0> sigma;
  
  // parameters for hierarchical levels
  vector[J_1] theta_1_tilde;
  // vector[J_1] theta_1; // now defined in transformed parameters
  
  // Multiplicative factor
  real<lower=0> r;
}
transformed parameters {
  // Transform to natural scale
  real<lower=0> theta = 10^log10_theta;
  vector[J_1] theta_1 = theta + tau * theta_1_tilde;  

}
model {
  // Priors on log-scale parameters
  log10_theta ~ normal(3.0, 0.6378);
  theta_1_tilde ~ normal(0, 1);
  tau ~ normal(0, 2000);
  sigma ~ normal(0, 1000);
  r ~ gamma(1.5, 1.5);
  
  // Hierarchical structure
  // theta_1 ~ normal(theta, tau); // now defined in transformed parameters
  
  // Likelihood
  Fy ~ normal(theta_1[index_1], sigma);
  Fc ~ normal(r*theta_1[index_1], r*sigma);
  
}

generated quantities {
    // Posterior predictive samples
    array[N] real Fy_pred;
    array[N] real Fc_pred;

    // Generate one predicted observation per each real observation
      for (n in 1:N) {
        real temp_theta_1 = theta_1[index_1[n]];
        Fy_pred[n] = normal_rng(temp_theta_1, sigma);
        Fc_pred[n] = normal_rng(r*temp_theta_1, r*sigma);
    }
    // Log-likelihood
    array[N] real log_lik_Fy;
    array[N] real log_lik_Fc;
    array[N] real log_lik;
    for (n in 1:N) {
      real temp_theta_1 = theta_1[index_1[n]];
      log_lik_Fy[n] = normal_lpdf(Fy[n] | temp_theta_1, sigma);
      log_lik_Fc[n] = normal_lpdf(Fc[n] | r*temp_theta_1, r*sigma);
      log_lik[n] = log_lik_Fy[n] + log_lik_Fc[n];
    }

}
"""
#%% Write stan code to file
noncentered_parameterization_stan_name = 'TomerAntman-87-noise_gene_hierarchical_noncentered.stan'

if not os.path.exists(noncentered_parameterization_stan_name):
    with open(noncentered_parameterization_stan_name, 'w') as f:
        f.write(noncentered_parameterization_stan_code)


____________________________________
## m22
### Centered Parameterization

In [None]:
#%% run centered parameterization for m22
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file=centered_parameterization_stan_name)
    samples = sm.sample(data=data_m22, iter_warmup=2000, max_treedepth=12)
    samples_m22_cent = az.from_cmdstanpy(
        samples, 
        posterior_predictive=["Fy_pred","Fc_pred"], 
        log_likelihood="log_lik")


#%% Diagnostics message is very long...
bebi103.stan.check_all_diagnostics(samples_m22_cent)

#%% viz parameters
# corner plot (doesn't recognize theta_1)
bokeh.io.show(
    bebi103.viz.corner(
    samples_m22_cent,
    show_contours=True, 
    parameters=['theta','tau','sigma','r'],
    frame_width=250,
    frame_height=250
    )
)
# plot predicted_theta_1:
theta_1_p = iqplot.histogram(
    samples_m22_cent.posterior['theta_1'].values.flatten(),
    x_axis_label='theta_1',
    rug=False,
    line_kwargs={'color':'black'}
)
bokeh.io.show(theta_1_p)
#%% posterior predictive check
Fy_pred_m22 = (
    samples_m22_cent.posterior_predictive['Fy_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fy_pred_dim_0")
)
p_m22_fy = bebi103.viz.predictive_ecdf(Fy_pred_m22, 
                                    data_m22['Fy'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="YFP - m22"
                                    )
p_diff_m22_fy = bebi103.viz.predictive_ecdf(Fy_pred_m22, 
                                    data_m22['Fy'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="YFP - m22"
                                    )
Fc_pred_m22 = (
    samples_m22_cent.posterior_predictive['Fc_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fc_pred_dim_0")
)
p_m22_fc = bebi103.viz.predictive_ecdf(Fc_pred_m22, 
                                    data_m22['Fc'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="CFP - m22"
                                    )
p_diff_m22_fc = bebi103.viz.predictive_ecdf(Fc_pred_m22, 
                                    data_m22['Fc'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="CFP - m22"
                                    )
layout = bokeh_row(bokeh_column(p_m22_fy, p_m22_fc), bokeh_column(p_diff_m22_fy, p_diff_m22_fc))
print("Posterior Predictive Check - m22 strain")
bokeh.io.show(layout)

#%% LOO log likelihood
az.loo(samples_m22_cent, scale="deviance")

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


Posterior Predictive Check - m22 strain




Computed from 4000 posterior samples and 250 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  6314.95    29.18
p_loo          151.08        -

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      133   53.2%
   (0.70, 1]   (bad)       110   44.0%
   (1, Inf)   (very bad)    7    2.8%

### Non-Centered Parameterization

In [9]:
#%% Now run the non-centered parameterization for m22
with bebi103.stan.disable_logging():
    sm_noncent = cmdstanpy.CmdStanModel(stan_file=noncentered_parameterization_stan_name)
    samples_noncent = sm_noncent.sample(data=data_m22)
    samples_m22_noncent = az.from_cmdstanpy(samples_noncent, posterior_predictive=["Fy_pred","Fc_pred"], log_likelihood="log_lik")

#%% run diagnostics
bebi103.stan.check_all_diagnostics(samples_m22_noncent)

#%% viz parameters
# corner plot (doesn't recognize theta_1)
bokeh.io.show(
    bebi103.viz.corner(
    samples_m22_noncent,
    show_contours=True, 
    parameters=['theta','tau','sigma','r'],
    frame_width=250,
    frame_height=250
    )
)

# plot predicted_theta_1:
theta_1_p = iqplot.histogram(
    samples_m22_noncent.posterior['theta_1'].values.flatten(),
    x_axis_label='theta_1',

    rug=False,
    line_kwargs={'color':'black'}
)
bokeh.io.show(theta_1_p)

#%% posterior predictive check
Fy_pred_m22_noncent = (
    samples_m22_noncent.posterior_predictive['Fy_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fy_pred_dim_0")
)
p_m22_noncent_fy = bebi103.viz.predictive_ecdf(Fy_pred_m22_noncent, 
                                    data_m22['Fy'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="YFP - m22 - noncentered"
                                    )
p_diff_m22_noncent_fy = bebi103.viz.predictive_ecdf(Fy_pred_m22_noncent, 
                                    data_m22['Fy'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="YFP - m22 - noncentered"
                                    )
Fc_pred_m22_noncent = (
    samples_m22_noncent.posterior_predictive['Fc_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fc_pred_dim_0")
)
p_m22_noncent_fc = bebi103.viz.predictive_ecdf(Fc_pred_m22_noncent, 
                                    data_m22['Fc'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="CFP - m22 - noncentered"
                                    )
p_diff_m22_noncent_fc = bebi103.viz.predictive_ecdf(Fc_pred_m22_noncent, 
                                    data_m22['Fc'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="CFP - m22 - noncentered"
                                    )
layout = bokeh_row(bokeh_column(p_m22_noncent_fy, p_m22_noncent_fc), bokeh_column(p_diff_m22_noncent_fy, p_diff_m22_noncent_fc))
print("Posterior Predictive Check - m22 strain - noncentered:")
bokeh.io.show(layout)

#%% LOO log likelihood
az.loo(samples_m22_noncent, scale="deviance")

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


Posterior Predictive Check - m22 strain - noncentered:




Computed from 4000 posterior samples and 250 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  6313.19    28.79
p_loo          149.91        -

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      138   55.2%
   (0.70, 1]   (bad)       109   43.6%
   (1, Inf)   (very bad)    3    1.2%

### HPD and Coefficient of variation of the noise

In [68]:
theta_HPD = az.hdi(samples_m22_noncent.posterior['theta'], hdi_prob=0.95).to_array().values[0]
median_theta = np.median(samples_m22_noncent.posterior['theta'].values)
tau_HPD = az.hdi(samples_m22_noncent.posterior['tau'], hdi_prob=0.95).to_array().values[0]
median_tau = np.median(samples_m22_noncent.posterior['tau'].values)
extrinsic_noise = median_tau / median_theta

theta_1_HPD = az.hdi(samples_m22_noncent.posterior['theta_1'].values.flatten(), hdi_prob=0.95)
median_theta_1 = np.median(samples_m22_noncent.posterior['theta_1'].values)
sigma_HPD = az.hdi(samples_m22_noncent.posterior['sigma'], hdi_prob=0.95).to_array().values[0]
median_sigma = np.median(samples_m22_noncent.posterior['sigma'].values)
intrinsic_noise = median_sigma / median_theta_1

print(
    f"Analyzing Noise for m22 cells:\n Extrinsic (CV = tau/theta):\n"
    f"\ttheta:\t{median_theta:.2f} [{theta_HPD[0]:.2f}, {theta_HPD[1]:.2f}] (median and 95% HPD)\n"
    f"\ttau:\t{median_tau:.2f} [{tau_HPD[0]:.2f}, {tau_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tCV:\t{extrinsic_noise:.2}\n"
    f" Intrinsic (CV = sigma/theta1):\n"
    f"\ttheta1:\t{median_theta_1:.2f} [{theta_1_HPD[0]:.2f}, {theta_1_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tsigma:\t{median_sigma:.2f} [{sigma_HPD[0]:.2f}, {sigma_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tCV:\t{intrinsic_noise:.2}"
    )

Analyzing Noise for m22 cells:
 Extrinsic (CV = tau/theta):
	theta:	1408.70 [1395.16, 1422.16] (median and 95% HPD)
	tau:	76.26 [65.17, 87.97] (median and 95% HPD)
	CV:	0.054
 Intrinsic (CV = sigma/theta1):
	theta1:	1409.01 [1260.77, 1555.31] (median and 95% HPD)
	sigma:	78.58 [72.02, 85.98] (median and 95% HPD)
	CV:	0.056
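For readers unfamiliar with the HPD intervals reported above, the idea can be sketched in a few lines: among all intervals that contain 95% of the samples, take the shortest. The function `hpd_interval` below is a simplified illustration written for this note, not ArviZ's implementation of `az.hdi`, and the samples are synthetic.

```python
import math
import random

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing a fraction `prob` of the samples."""
    x = sorted(samples)
    n = len(x)
    k = math.ceil(prob * n)  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: x[i + k - 1] - x[i])
    return x[best], x[best + k - 1]

rng = random.Random(3)
# Skewed synthetic sample (exponential with mean 1) to show the asymmetry
draws = [rng.expovariate(1.0) for _ in range(10000)]
hpd_lo, hpd_hi = hpd_interval(draws)
print(f"95% HPD: [{hpd_lo:.3f}, {hpd_hi:.3f}]")
```

For a skewed distribution like this one, the HPD hugs the mode near zero instead of cutting equal tail mass from both sides, which is the main difference from a central quantile interval.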


________________________________
## d22
### Centered Parameterization

In [None]:
#%% Run the model without noncentering the parameters
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file=centered_parameterization_stan_name)
    # sm = cmdstanpy.CmdStanModel(stan_file='noise_gene_hierarchical_centered.stan')
    samples = sm.sample(data=data_d22, iter_warmup=2000, max_treedepth=12)
    samples_d22_cent = az.from_cmdstanpy(samples, posterior_predictive=["Fy_pred","Fc_pred"], log_likelihood="log_lik")

#%% Diagnostics message is very long...
bebi103.stan.check_all_diagnostics(samples_d22_cent)

#%% viz parameters
# corner plot (doesn't recognize theta_1)
bokeh.io.show(
    bebi103.viz.corner(
    samples_d22_cent,
    show_contours=True, 
    parameters=['theta','tau','sigma','r'],
    frame_width=250,
    frame_height=250
    )
)
# plot predicted_theta_1:
theta_1_p = iqplot.histogram(
    samples_d22_cent.posterior['theta_1'].values.flatten(),
    x_axis_label='theta_1',
    rug=False,
    line_kwargs={'color':'black'}
)
bokeh.io.show(theta_1_p)
#%% posterior predictive check
Fy_pred_d22 = (
    samples_d22_cent.posterior_predictive['Fy_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fy_pred_dim_0")
)
p_d22_fy = bebi103.viz.predictive_ecdf(Fy_pred_d22, 
                                    data_d22['Fy'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="YFP - d22"
                                    )
p_diff_d22_fy = bebi103.viz.predictive_ecdf(Fy_pred_d22, 
                                    data_d22['Fy'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="YFP - d22"
                                    )
Fc_pred_d22 = (
    samples_d22_cent.posterior_predictive['Fc_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fc_pred_dim_0")
)
p_d22_fc = bebi103.viz.predictive_ecdf(Fc_pred_d22, 
                                    data_d22['Fc'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="CFP - d22"
                                    )
p_diff_d22_fc = bebi103.viz.predictive_ecdf(Fc_pred_d22, 
                                    data_d22['Fc'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    title="CFP - d22"
                                    )
layout = bokeh_row(bokeh_column(p_d22_fy, p_d22_fc), bokeh_column(p_diff_d22_fy, p_diff_d22_fc))
print("Posterior Predictive Check - d22 strain")
bokeh.io.show(layout)

#%% LOO log likelihood
az.loo(samples_d22_cent, scale="deviance")

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


Posterior Predictive Check - d22 strain




Computed from 4000 posterior samples and 284 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  7798.37    37.94
p_loo          173.45        -

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      156   54.9%
   (0.70, 1]   (bad)       113   39.8%
   (1, Inf)   (very bad)   15    5.3%

### Non-Centered Parameterization

In [12]:
#%% Now run the non-centered parameterization for d22
with bebi103.stan.disable_logging():
    sm_noncent = cmdstanpy.CmdStanModel(stan_file=noncentered_parameterization_stan_name)
    samples_noncent = sm_noncent.sample(data=data_d22)
    samples_d22_noncent = az.from_cmdstanpy(samples_noncent, posterior_predictive=["Fy_pred","Fc_pred"], log_likelihood="log_lik")

#%% run diagnostics
bebi103.stan.check_all_diagnostics(samples_d22_noncent)

#%% viz parameters
# corner plot (doesn't recognize theta_1)
bokeh.io.show(
    bebi103.viz.corner(
    samples_d22_noncent,
    show_contours=True, 
    parameters=['theta','tau','sigma','r'],
    frame_width=250,
    frame_height=250
    )
)

# plot predicted_theta_1:
theta_1_p = iqplot.histogram(
    samples_d22_noncent.posterior['theta_1'].values.flatten(),
    x_axis_label='theta_1',

    rug=False,
    line_kwargs={'color':'black'}
)
bokeh.io.show(theta_1_p)

#%% posterior predictive check
Fy_pred_d22_noncent = (
    samples_d22_noncent.posterior_predictive['Fy_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fy_pred_dim_0")
)
p_d22_noncent_fy = bebi103.viz.predictive_ecdf(Fy_pred_d22_noncent, 
                                    data_d22['Fy'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="YFP - d22 - noncentered"
                                    )
p_diff_d22_noncent_fy = bebi103.viz.predictive_ecdf(Fy_pred_d22_noncent, 
                                    data_d22['Fy'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="YFP - d22 - noncentered"
                                    )
Fc_pred_d22_noncent = (
    samples_d22_noncent.posterior_predictive['Fc_pred']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "Fc_pred_dim_0")
)
p_d22_noncent_fc = bebi103.viz.predictive_ecdf(Fc_pred_d22_noncent, 
                                    data_d22['Fc'], 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="CFP - d22 - noncentered"
                                    )
p_diff_d22_noncent_fc = bebi103.viz.predictive_ecdf(Fc_pred_d22_noncent, 
                                    data_d22['Fc'], 
                                    diff='ecdf', 
                                    x_axis_label="Fluorescence (a.u.)",
                                    #height=350, width=350,
                                    title="CFP - d22 - noncentered"
                                    )
layout = bokeh_row(bokeh_column(p_d22_noncent_fy, p_d22_noncent_fc), bokeh_column(p_diff_d22_noncent_fy, p_diff_d22_noncent_fc))
print("Posterior Predictive Check - d22 strain - noncentered:")
bokeh.io.show(layout)


#%% LOO log likelihood
az.loo(samples_d22_noncent, scale="deviance")

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


Posterior Predictive Check - d22 strain - noncentered:




Computed from 4000 posterior samples and 284 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  7802.17    38.30
p_loo          175.08        -

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      152   53.5%
   (0.70, 1]   (bad)       110   38.7%
   (1, Inf)   (very bad)   22    7.7%
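
As a rough reading aid for the two `az.loo` summaries above, the sketch below recomputes, from the printed numbers only, the fraction of observations whose Pareto k exceeds 0.7 and the difference between the two `deviance_loo` estimates relative to their standard errors. The label `"previous fit"` is a placeholder for the summary printed just before this section; this is simple arithmetic on the reported values, not a formal model comparison.

```python
# Numbers copied from the two az.loo summaries printed above.
n_obs = 284
summaries = {
    "previous fit": {"deviance_loo": 7798.37, "se": 37.94, "k_counts": (156, 113, 15)},
    "noncentered": {"deviance_loo": 7802.17, "se": 38.30, "k_counts": (152, 110, 22)},
}

for name, s in summaries.items():
    good, bad, very_bad = s["k_counts"]
    # Sanity check: the three bins should account for every observation
    assert good + bad + very_bad == n_obs
    s["frac_k_above_0.7"] = (bad + very_bad) / n_obs
    print(f"{name}: {s['frac_k_above_0.7']:.1%} of observations have Pareto k > 0.7")

# Difference between the two estimates, relative to the larger standard error
diff = summaries["noncentered"]["deviance_loo"] - summaries["previous fit"]["deviance_loo"]
max_se = max(s["se"] for s in summaries.values())
print(f"difference in deviance_loo: {diff:.2f} ({diff / max_se:.2f} of the larger SE)")
```

The two estimates differ by far less than one standard error. With close to half of the Pareto k values in the "bad" or "very bad" bins, both LOO estimates should be read with caution.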

### HPD and Coefficient of Variation of the Noise

In [96]:
# Extrinsic noise: CV = tau / theta, computed from the posterior medians
theta_HPD = az.hdi(samples_d22_noncent.posterior['theta'], hdi_prob=0.95).to_array().values[0]
median_theta = np.median(samples_d22_noncent.posterior['theta'].values)
tau_HPD = az.hdi(samples_d22_noncent.posterior['tau'], hdi_prob=0.95).to_array().values[0]
median_tau = np.median(samples_d22_noncent.posterior['tau'].values)
extrinsic_noise = median_tau / median_theta

# Intrinsic noise: CV = sigma / theta_1, computed from the posterior medians
# (theta_1 is flattened over all of its entries before summarizing)
theta_1_HPD = az.hdi(samples_d22_noncent.posterior['theta_1'].values.flatten(), hdi_prob=0.95)
median_theta_1 = np.median(samples_d22_noncent.posterior['theta_1'].values)
sigma_HPD = az.hdi(samples_d22_noncent.posterior['sigma'], hdi_prob=0.95).to_array().values[0]
median_sigma = np.median(samples_d22_noncent.posterior['sigma'].values)
intrinsic_noise = median_sigma / median_theta_1

print(
    f"Analyzing Noise for d22 cells:\n Extrinsic (CV = tau/theta):\n"
    f"\ttheta:\t{median_theta:.2f} [{theta_HPD[0]:.2f}, {theta_HPD[1]:.2f}] (median and 95% HPD)\n"
    f"\ttau:\t{median_tau:.2f} [{tau_HPD[0]:.2f}, {tau_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tCV:\t{extrinsic_noise:.2}\n"
    f" Intrinsic (CV = sigma/theta1):\n"
    f"\ttheta1:\t{median_theta_1:.2f} [{theta_1_HPD[0]:.2f}, {theta_1_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tsigma:\t{median_sigma:.2f} [{sigma_HPD[0]:.2f}, {sigma_HPD[1]:.2f}] (median and 95% HPD)\n" 
    f"\tCV:\t{intrinsic_noise:.2}"
    )

Analyzing Noise for d22 cells:
 Extrinsic (CV = tau/theta):
	theta:	1897.62 [1871.03, 1923.69] (median and 95% HPD)
	tau:	153.82 [133.05, 174.01] (median and 95% HPD)
	CV:	0.081
 Intrinsic (CV = sigma/theta1):
	theta1:	1893.65 [1600.27, 2207.62] (median and 95% HPD)
	sigma:	155.34 [143.78, 169.53] (median and 95% HPD)
	CV:	0.082
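
The CVs above are ratios of posterior medians, so they carry no interval of their own. One possible alternative, sketched below, is to form the ratio draw by draw and summarize the resulting distribution of CV values. The arrays `theta_draws` and `tau_draws` are synthetic stand-ins generated here for illustration (they are not the fitted posterior), and `hpd_interval` is a small helper written for this sketch rather than a library function.

```python
import numpy as np


def hpd_interval(draws, prob=0.95):
    """Return the shortest interval containing a fraction `prob` of the draws."""
    x = np.sort(np.asarray(draws).ravel())
    n = x.size
    k = int(np.ceil(prob * n))  # number of draws inside the interval
    widths = x[k - 1:] - x[: n - k + 1]  # width of every window of k draws
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


rng = np.random.default_rng(0)

# Synthetic stand-ins for flattened posterior draws (assumed values)
theta_draws = rng.normal(1900.0, 13.0, size=4000)
tau_draws = rng.normal(154.0, 10.0, size=4000)

# One CV value per posterior draw, then summarize
cv_draws = tau_draws / theta_draws
cv_median = np.median(cv_draws)
cv_lo, cv_hi = hpd_interval(cv_draws, prob=0.95)
print(f"CV: {cv_median:.3f} [{cv_lo:.3f}, {cv_hi:.3f}] (median and 95% HPD)")
```

In the notebook, the synthetic arrays would be replaced by the actual draws, for example `samples_d22_noncent.posterior['tau'].values.flatten()` and the corresponding `theta` array, so that each ratio uses `tau` and `theta` from the same draw.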
