In [None]:
%load_ext autoreload
%autoreload 2

import copy

import numpy as np
import pandas as pd
import scipy.stats as st

import re #regex

import cmdstanpy
import arviz as az

import bebi103
import bokeh_catplot

import bokeh.io
import bokeh.plotting
import bokeh.layouts  # needed for bokeh.layouts.gridplot below
# import bokeh.models.mappers
import bokeh.palettes

import holoviews as hv
import holoviews.operation.datashader
hv.extension('bokeh')
bebi103.hv.set_defaults()

import panel as pn
pn.extension()

bokeh.io.output_notebook()


%load_ext blackcellmagic

### Munging
Load data from Brewster, pre-tidied by Manuel, and drop the spurious column that was the index in the CSV.
See `code/exploratory/fish_munging.ipynb` for details. TL;DR: don't use the regulated CSV; the one below has all the FISH data. `mRNA_cell` is the column we want, not `spots_totals`.

In [None]:
df_fish = pd.read_csv("../../data/jones_brewster_2014.csv")
del df_fish['Unnamed: 0']
df_fish.head()

#### Separating out regulated data
The regulated datasets' labels all start with `O` (`Oid`, `O1`, `O2`, or `O3`); no other labels do. Use a regex to pick them out.

In [None]:
raw_expt_labels = df_fish['experiment'].unique()
raw_expt_labels.sort()

# put all strings that start w/ 'O' in one list
regulated_labels = [label for label in raw_expt_labels if re.match('^O', label)]

This leaves behind some data with insufficient metadata, e.g., what good does the aTc concentration do me if I don't know what promoter it was for?

Now that we've got labels we want, let's slice dataframes accordingly.

In [None]:
df_reg = df_fish[df_fish['experiment'].isin(regulated_labels)]
df_UV5 = df_fish[df_fish["experiment"] == "UV5"]

## Analyzing simple repression with Stan

Stan model borrows from JB's tutorial 7a, 2018, and from JB's finch beak tutorial for bebi103b 2020 TAs.

The model here is the same negative binomial model we used for the constitutive case, except with the burst rate multiplied by the fold-change in mean. Can we successfully infer the Bohr parameter?

As in the constitutive case, Stan parametrizes the negative binomial with $\alpha$ and $\zeta$, where $\alpha$ is the burst frequency (dimensionless, nondimensionalized by mRNA lifetime) and $\zeta = 1/b$ where $b$ is the mean burst size.

What is our prior range for fold-change? Consider
\begin{align}
fc &= \left( 1+\frac{R}{N_{NS}} e^{-\beta\Delta\epsilon} \right)^{-1} \\
   &= \left( 1+
        \exp\left(\ln\left(\frac{R}{N_{NS}}\right)-\beta\Delta\epsilon\right)
     \right)^{-1} \\
   &= \left( 1+ e^{-\beta\Delta F} \right)^{-1}
\end{align}
$R/N_{NS}$ should definitely remain between $10^{-7}$ and $10^{-2}$, and we expect $\beta\Delta\epsilon$ to range between about $-18$ and $-9$. Since $\beta\Delta F = \beta\Delta\epsilon - \ln(R/N_{NS})$, $\beta\Delta F$ is unlikely to escape the range between $-14$ and $+7$, corresponding to fold-changes of about $10^{-6}$ and $1-10^{-3}$, which already exceed our detection limits anyway. A prior such as
\begin{align}
\beta\Delta F \sim \text{Normal}(\mu=-3,\sigma=5)
\end{align}
corresponds to this story.
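
A quick numerical check of the arithmetic above. This is only a sketch: the helper `fold_change` and the variable names are mine, and the bounds are the ones quoted in the text. It recomputes the extremes of $\beta\Delta F$, the corresponding fold-changes, and how much of the Normal$(-3, 5)$ prior mass lands inside $[-14, 7]$ (about 96%).

```python
import numpy as np
import scipy.stats as st


def fold_change(bohr):
    """Fold-change in mean expression for a Bohr parameter (beta * Delta F)."""
    return 1.0 / (1.0 + np.exp(-bohr))


# Bounds quoted in the text (energies in units of kT)
ratio_min, ratio_max = 1e-7, 1e-2  # R / N_NS
eps_min, eps_max = -18.0, -9.0  # beta * Delta epsilon

# beta * Delta F = beta * Delta epsilon - ln(R / N_NS)
bohr_min = eps_min - np.log(ratio_max)
bohr_max = eps_max - np.log(ratio_min)

# Fold-changes at the rounded extremes used in the text
fc_low = fold_change(-14.0)
fc_high = fold_change(7.0)

# Fraction of the proposed prior that falls inside [-14, 7]
prior = st.norm(loc=-3.0, scale=5.0)
prior_mass = prior.cdf(7.0) - prior.cdf(-14.0)

print(bohr_min, bohr_max, fc_low, fc_high, prior_mass)
```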

Then our target likelihood for the mRNA counts $m$ is
\begin{align}
m \sim \text{NBinom}(\alpha, \zeta)
\end{align}
for the UV5 data, and simultaneously
\begin{align}
m \sim \text{NBinom}\left(\frac{\alpha}{1+ e^{-\beta\Delta F}}, \zeta\right)
\end{align}
for the regulated data.
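
To keep the parametrization straight, here is a sketch of the same two distributions in SciPy. It is not the Stan model itself: `nbinom_dist` is a hypothetical helper and the parameter values are made up for illustration. `scipy.stats.nbinom(n, p)` matches the $(\alpha, \zeta)$ parametrization with $n = \alpha$ and $p = \zeta/(1+\zeta) = 1/(1+b)$, so the mean is $\alpha b$ and the Fano factor is $1 + b$.

```python
import numpy as np
import scipy.stats as st


def nbinom_dist(alpha, b, bohr=None):
    """Negative binomial for burst frequency alpha and mean burst size b.

    If bohr (beta * Delta F) is given, alpha is rescaled by the fold-change,
    as in the likelihood for the regulated data.
    """
    if bohr is not None:
        alpha = alpha / (1.0 + np.exp(-bohr))
    return st.nbinom(alpha, 1.0 / (1.0 + b))


# Illustrative values only, not inferred from the data
alpha, b, bohr = 5.0, 3.0, -2.0

constitutive = nbinom_dist(alpha, b)
repressed = nbinom_dist(alpha, b, bohr=bohr)

# The ratio of means recovers the fold-change; the Fano factor is unchanged
print(repressed.mean() / constitutive.mean())
print(constitutive.var() / constitutive.mean(), repressed.var() / repressed.mean())
```

Note that in this model repression only rescales the mean; the Fano factor stays at $1 + b$ regardless of the Bohr parameter.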


### Prior predictive checks

In [None]:
sm_prior_predictive = cmdstanpy.CmdStanModel(
    stan_file="stan/simple_rep_means_prior_predictive_v01.stan"
)
# print(sm_prior_predictive.code())

In [None]:
data_prior_pred = dict(
    N=len(df_UV5), 
    log_alpha_loc=0.0, 
    log_alpha_scale=2.0, 
    log_b_loc=0.5,
    log_b_scale=1.5,
    bohr_loc=-3.0,
    bohr_scale=5
)

In [None]:
prior_pred_samples = sm_prior_predictive.sample(
    data=data_prior_pred,
    fixed_param=True,
    sampling_iters=1000,
    output_dir="./stan/stan_samples",
)

In [None]:
# Convert to ArviZ InferenceData
prior_pred_samples = az.from_cmdstanpy(
    posterior=prior_pred_samples,
    prior=prior_pred_samples,
    prior_predictive=['mRNA_counts']
)

In [None]:
p = bebi103.viz.predictive_ecdf(
    prior_pred_samples.prior_predictive['mRNA_counts'],
    frame_height=250,
    frame_width=350,
    discrete=True,
    percentiles=(95, 90, 75, 50),
    x_axis_label='mRNA counts',
    x_axis_type='log'
)
bokeh.io.show(p)
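
For intuition about what the Stan prior predictive model is doing, here is a NumPy-only sketch of the same idea. Loud assumptions: the Stan file isn't shown here, so I'm guessing that the priors on $\alpha$ and $b$ are normal in $\log_{10}$ (the base is a guess) and that the Bohr parameter prior is normal, with the locations and scales from `data_prior_pred`. The number of cells per draw is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_draws, n_cells = 1000, 50  # n_cells is arbitrary for this sketch

# Draw parameters from the (assumed) priors
log_alpha = rng.normal(0.0, 2.0, size=n_draws)  # log_alpha_loc, log_alpha_scale
log_b = rng.normal(0.5, 1.5, size=n_draws)  # log_b_loc, log_b_scale
bohr = rng.normal(-3.0, 5.0, size=n_draws)  # bohr_loc, bohr_scale

alpha = 10.0 ** log_alpha  # assumes log10; swap for np.exp if natural log
b = 10.0 ** log_b

# Burst frequency rescaled by the fold-change for the repressed condition
alpha_rep = alpha / (1.0 + np.exp(-bohr))
p = 1.0 / (1.0 + b)

# One simulated data set of n_cells counts per prior draw
counts = rng.negative_binomial(
    alpha_rep[:, None], p[:, None], size=(n_draws, n_cells)
)

# Spread of the per-data-set mean counts across prior draws
print(np.percentile(counts.mean(axis=1), [2.5, 50, 97.5]))
```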

### Simulation-based calibration

Next up: simulation-based calibration (SBC). Quoting JB, this checks "that the sampler can effectively sample the entire space of parameters covered by the prior." We'll go ahead and set up the data for the posterior, even though we won't be sampling the posterior right now. Then we can set up the model.

In [None]:
data_rep_test = copy.deepcopy(data_prior_pred)
del data_rep_test["N"]
df_rep_test = df_reg[df_reg['experiment'] == 'Oid_2ngmL']
data_rep_test["N_cells_uv5"] = len(df_UV5)
data_rep_test["N_cells_rep"] = len(df_rep_test)
data_rep_test["mRNA_counts_uv5"] = df_UV5["mRNA_cell"].values.astype(int)
data_rep_test["mRNA_counts_rep"] = df_rep_test["mRNA_cell"].values.astype(int)
data_rep_test["ppc"] = 0

sm = cmdstanpy.CmdStanModel(stan_file="stan/simple_rep_means_v01.stan")
# print(sm.code())

_There's currently a bug in my call to `bebi103.stan.sbc` here that I can't track down. I suspect it's just a typo somewhere. The error message suggests the problem is in the Python wrapper rather than the Stan code, which fits with posterior sampling working fine later on._

In [None]:
try:
    sbc_output = pd.read_csv("stan/sbc_simple_rep_means_v01.csv")
except FileNotFoundError:
    sbc_output = bebi103.stan.sbc(
        prior_predictive_model=sm_prior_predictive,
        posterior_model=sm,
        prior_predictive_model_data=data_prior_pred,
        posterior_model_data=data_rep_test,
        measured_data=["mRNA_counts_uv5", "mRNA_counts_rep"],
        parameters=["alpha", "b", "bohr"],
        sampling_kwargs={'thin': 10},
        cores=4,
        N=400,
        progress_bar=True,
    )

    sbc_output.to_csv("stan/sbc_simple_rep_means_v01.csv", index=False)

Plot ECDFs of the rank statistics, which should be approximately uniform. Color according to warning code, which will also let us know if there were any issues with divergences, R-hat, E-BFMI, effective sample size, or tree depth.
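
Why should the ranks be uniform? A toy illustration, deliberately unrelated to the mRNA model: a conjugate normal model where the exact posterior is known, so no sampler is needed. Everything here (model, names, sizes) is made up for the sketch. For each simulated data set, the rank of the ground-truth parameter among posterior draws should be uniform on $0, \ldots, L$ when the posterior draws are correct.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_sims, n_obs, n_post = 1000, 10, 99

# Ground truth from the prior: mu ~ Normal(0, 1)
mu_true = rng.normal(0.0, 1.0, size=n_sims)

# Simulated data: y_i ~ Normal(mu, 1)
y = rng.normal(mu_true[:, None], 1.0, size=(n_sims, n_obs))

# Exact conjugate posterior: Normal(sum(y) / (n + 1), 1 / sqrt(n + 1))
post_mean = y.sum(axis=1) / (n_obs + 1)
post_sd = np.sqrt(1.0 / (n_obs + 1))
post_draws = rng.normal(post_mean[:, None], post_sd, size=(n_sims, n_post))

# Rank statistic: number of posterior draws below the ground truth
ranks = (post_draws < mu_true[:, None]).sum(axis=1)

# Uniform ranks on 0..n_post have a flat histogram
hist, _ = np.histogram(ranks, bins=10, range=(0, n_post + 1))
print(hist)
```

If the "posterior" draws were too narrow or biased, the histogram would pile up at the edges or on one side, which is what the rank ECDF plots are meant to reveal.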

In [None]:
plots = [
    bokeh_catplot.ecdf(
        data=sbc_output.loc[sbc_output["parameter"] == param, :],
        val="rank_statistic",
        cats="warning_code",
        kind="colored",
        frame_width=400,
        frame_height=150,
        title=param,
        conf_int=True,
    )
    for param in sbc_output["parameter"].unique()
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=1))

Warning code 2 means R-hat failure. Let's look at the R-hat values.

In [None]:
bokeh.io.show(
    bokeh_catplot.ecdf(data=sbc_output, val="Rhat", cats="parameter")
)

If a substantial fraction of the R-hat values lie above the bright-line threshold of 1.01, JB suggests that taking more samples from the posterior might be wise.
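
To put a number on "substantial fraction" rather than eyeballing the ECDF, something like the following would work. It's a sketch: `frac_rhat_above` is a hypothetical helper, and the toy dataframe only mimics the `parameter` and `Rhat` columns of `sbc_output` with made-up values.

```python
import pandas as pd


def frac_rhat_above(sbc_df, threshold=1.01):
    """Fraction of SBC trials with R-hat above threshold, per parameter."""
    return (sbc_df["Rhat"] > threshold).groupby(sbc_df["parameter"]).mean()


# Made-up stand-in for sbc_output
toy_sbc = pd.DataFrame(
    {
        "parameter": ["alpha", "alpha", "b", "b", "bohr", "bohr"],
        "Rhat": [1.001, 1.020, 1.003, 1.004, 1.015, 1.030],
    }
)

frac = frac_rhat_above(toy_sbc)
print(frac)
```

With the real data this would be called as `frac_rhat_above(sbc_output)`.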

JB also has a nice helper function to plot the difference from uniform of rank ECDFs, with a confidence interval for the uniform distribution to guide the eye. Let's take a look.

In [None]:
bokeh.io.show(bebi103.viz.sbc_rank_ecdf(sbc_output))

Finally let's check shrinkage and z-scores.
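
As a reminder of what these two quantities are, here is a sketch using the standard SBC definitions (I haven't checked that `bebi103.stan.sbc` computes them in exactly this way, in particular how it estimates the prior standard deviation). The z-score measures how far the posterior mean sits from the ground truth in units of posterior standard deviation; the shrinkage measures how much narrower the posterior is than the prior. The function names and numbers are illustrative.

```python
import numpy as np


def z_score(post_mean, post_sd, ground_truth):
    """Posterior mean's distance from the ground truth, in posterior sd units."""
    return (post_mean - ground_truth) / post_sd


def shrinkage(post_sd, prior_sd):
    """1 - (posterior variance / prior variance); near 1 means well-informed."""
    return 1.0 - (post_sd / prior_sd) ** 2


# Illustrative numbers: Bohr parameter with prior sd 5, posterior sd 0.5
z = z_score(post_mean=-2.5, post_sd=0.5, ground_truth=-3.0)
s = shrinkage(post_sd=0.5, prior_sd=5.0)
print(z, s)
```

A healthy model has shrinkage close to 1 with z-scores mostly within a couple of units of zero.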

In [None]:
hv.Points(
    data=sbc_output,
    kdims=['shrinkage', 'z_score'],
    vdims=['parameter', 'ground_truth', 'mean', 'sd'],
).opts(
    color='parameter',
    alpha=0.3,
    xlim=(0, 1.05),
    tools=['hover']
)

### Sampling the Posterior
We already finished building the model in order to do SBC. Now we just run it.

In [None]:
# We do want posterior predictive checks this time
data_rep_test["ppc"] = 1

posterior_samples = sm.sample(data=data_rep_test, cores=5)
posterior_samples = az.from_cmdstanpy(
    posterior_samples, posterior_predictive=["mRNA_counts_uv5_ppc", "mRNA_counts_rep_ppc"]
)

In [None]:
bebi103.stan.check_all_diagnostics(posterior_samples)

Good, no sampler warnings. Let's visualize the posterior.

In [None]:
bokeh.io.show(
    bebi103.viz.corner(
        posterior_samples,
        pars=["alpha", "b", "bohr"],
        alpha=0.1,
        xtick_label_orientation=np.pi / 4,
    )
)

That looks quite reasonable. The transcripts per burst & burst frequency are both comparable to what we would have inferred from Manuel's MCMC, and also what we inferred from UV5 data alone.

### Posterior predictive checks

Even though the posterior looks ok, the model could still be wrong: it is identifiable, but is it consistent with the data? Posterior predictive checks address this by asking whether the model could plausibly generate the observed data. (Function borrowed from JB's finch beak tutorial for bebi103b 2020 TAs.)
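
To make concrete what a predictive ECDF plot encodes, here is a NumPy-only sketch of the envelope calculation. It is not how `bebi103.viz.predictive_ecdf` is implemented; the "posterior predictive" draws and the "data" are both simulated stand-ins with made-up parameters. For each predictive data set we compute an ECDF on a common grid, then take percentiles across data sets and ask where the observed ECDF falls.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Stand-ins: 400 predictive data sets of 200 cells each, plus one "observed" set
ppc = rng.negative_binomial(5.0, 0.25, size=(400, 200))
data = rng.negative_binomial(5.0, 0.25, size=200)

# Common grid of mRNA counts
x = np.arange(0, max(ppc.max(), data.max()) + 1)

# ECDF of each predictive data set evaluated on the grid: shape (400, len(x))
ecdfs = (ppc[:, :, None] <= x[None, None, :]).mean(axis=1)

# 80% envelope across predictive data sets
low, high = np.percentile(ecdfs, [10, 90], axis=0)

# ECDF of the observed data on the same grid
data_ecdf = (data[:, None] <= x[None, :]).mean(axis=0)

# Fraction of grid points where the data ECDF sits inside the envelope
frac_inside = np.mean((data_ecdf >= low) & (data_ecdf <= high))
print(frac_inside)
```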

In [None]:
def ppc_ecdf_pair(posterior_samples, ppc_var, df, percentiles=(80, 60, 40, 20),
                 x_axis_label="mRNA counts per cell", frame_width=200, frame_height=200):
    """Return [ECDF plot, diff=True plot] of posterior predictive samples vs. data."""
    n_samples = (
        posterior_samples.posterior_predictive.dims["chain"]
        * posterior_samples.posterior_predictive.dims["draw"]
    )

    p1 = bebi103.viz.predictive_ecdf(
        posterior_samples.posterior_predictive[ppc_var].values.reshape(
            (n_samples, len(df))
        ),
        data=df["mRNA_cell"],
        percentiles=percentiles,
        discrete=True,
        x_axis_label=x_axis_label,
        frame_width=frame_width,
        frame_height=frame_height
    )

    p2 = bebi103.viz.predictive_ecdf(
        posterior_samples.posterior_predictive[ppc_var].values.reshape(
            (n_samples, len(df))
        ),
        data=df["mRNA_cell"],
        percentiles=percentiles,
        discrete=True,
        x_axis_label=x_axis_label,
        frame_width=frame_width,
        frame_height=frame_height,
        diff=True,
    )
    p1.x_range = p2.x_range
    
    return [p1, p2]

In [None]:
ppc_plots = (ppc_ecdf_pair(posterior_samples, "mRNA_counts_uv5_ppc", df_UV5),
             ppc_ecdf_pair(posterior_samples, "mRNA_counts_rep_ppc", df_rep_test))

# flatten list of lists w/ list comp, then plot
ppc_plots = [subplot for sublist in ppc_plots for subplot in sublist]
bokeh.io.show(bokeh.layouts.gridplot(ppc_plots, ncols=2))

Depending on which lac operator and aTc concentration we analyze, I'd draw rather different conclusions. Let's just analyze everything.

### Sampling all the data
Let's repeat for all the operators and repressor copy numbers! Since we have so many, do separate loops to generate the samples and generate viz (so we can tweak viz without pointlessly rerunning the sampling). Put all the samples in a dict for easy access; I'm not sure if ArviZ will keep them in RAM, but the model is small enough that even if so, it'll be fine.

In [None]:
# sample the posterior for each regulated dataset jointly with the UV5 data
# (cmdstanpy prints many lines of output while sampling)
all_samples = {}
for trial in df_reg["experiment"].unique():
    temp_df = df_reg[df_reg["experiment"] == trial]
    data = copy.deepcopy(data_prior_pred)
    data["N_cells_uv5"] = len(df_UV5)
    data["N_cells_rep"] = len(temp_df)
    data["mRNA_counts_uv5"] = df_UV5["mRNA_cell"].values.astype(int)
    data["mRNA_counts_rep"] = temp_df["mRNA_cell"].values.astype(int)
    data["ppc"] = 1

    posterior_samples = sm.sample(data=data, cores=6)
    all_samples[trial] = az.from_cmdstanpy(
        posterior_samples,
        posterior_predictive=["mRNA_counts_uv5_ppc", "mRNA_counts_rep_ppc"],
    )

It's always wise to check diagnostics.

In [None]:
# for trial in all_samples:
#     print(trial)
#     bebi103.stan.check_all_diagnostics(all_samples[trial])

There's some run-to-run variation. Sometimes one or two promoters give Rhat warnings a bit over 1.01, but generally this looks ok.

Now that we've sampled we can plot. (I can't figure out how to add titles to the bokeh layouts that `viz.corner` returns, so I'm doing the poor man's version for now.)

In [None]:
for trial in all_samples:
    print(trial)
    # plot posterior as corner plot
    bokeh.io.show(
        bebi103.viz.corner(
            all_samples[trial],
            pars=["alpha", "b", "bohr"],
            alpha=0.1,
            xtick_label_orientation=np.pi / 4,
        )
    )
    # setup post pred ecdfs
    ppc_plots = (
        ppc_ecdf_pair(all_samples[trial], "mRNA_counts_uv5_ppc", df_UV5),
        ppc_ecdf_pair(
            all_samples[trial],
            "mRNA_counts_rep_ppc",
            df_reg[df_reg["experiment"] == trial],
        ),
    )

    # flatten list of lists to prepare for gridplot
    ppc_plots = [subplot for sublist in ppc_plots for subplot in sublist]
    bokeh.io.show(bokeh.layouts.gridplot(ppc_plots, ncols=2))

First off, obviously this model is not "true." But it's not a bad zeroth-order model for simple repression: it provides an interesting way to link the thermodynamic model (Bohr parameter) to observables beyond means.

And once again this model performs better than I'd expect it to, and its failure mode is very enlightening. It is _extremely_ interesting that for the very weakly repressed and the very highly repressed conditions, the joint posteriors on $\alpha$ and $b$ (i.e., after marginalizing away Bohr) are _very_ similar to the posterior we got from the UV5 data alone. In these limits, approximating the distribution as still being negative binomial with a rescaled burst rate seems alright. Another statement of this: in these limits, there is still only one timescale in sight: the burst rate, appropriately scaled.

In the intermediate regime, this is not true: the repressed mRNA distributions are substantially more dispersed than constitutive UV5. The model can't fit both distributions simultaneously, so it contorts itself to try to split the difference. `Oid0p5ngmL` is probably the most extreme example. $\alpha$ gets pulled down and $b$ gets pulled up to increase the variance, resulting in posterior predictive distributions that are too wide for UV5 and still too narrow for the repressed data.
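
The trade-off between $\alpha$ and $b$ follows from the negative binomial moments: the mean is $\alpha b$ and the variance is $\alpha b (1 + b)$, so the Fano factor is $1 + b$. A small sketch with made-up numbers (the helper `nb_mean_fano` is hypothetical) shows that halving $\alpha$ and doubling $b$ leaves the mean alone but widens the distribution.

```python
def nb_mean_fano(alpha, b):
    """Mean and Fano factor of a negative binomial with burst frequency alpha
    and mean burst size b."""
    mean = alpha * b
    var = alpha * b * (1.0 + b)
    return mean, var / mean


# Illustrative values only: same mean, different burstiness
mean_1, fano_1 = nb_mean_fano(alpha=6.0, b=3.0)
mean_2, fano_2 = nb_mean_fano(alpha=3.0, b=6.0)

print(mean_1, fano_1)
print(mean_2, fano_2)
```

Because the model shares one $b$ between UV5 and the repressed data, any increase in $b$ to accommodate the broader repressed distribution also broadens the predicted UV5 distribution.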

This is actually _a good sign_: the addition of repressor leaves a nontrivial imprint on the mRNA distribution. So a kinetic model with repressor dynamics may actually be identifiable. The question is whether, even if we can infer two rates and label them $k_R^+$ and $k_R^-$, they actually correspond to the microscopic repressor kinetics, or whether they are just related to cell cycle or mRNA lifetimes or TetR partitioning imprecision or something else entirely. But we'll cross that bridge when we get there.

#### Inferring repressor copy number
If we take the binding energies of each operator as knowns, we can extract $R/N_{NS}$ for each op/aTc pair and see if they make sense. First collect the approximate MAP of the Bohr parameter (in $kT$) for each trial, tabulated below.

Inferred Bohr parameters:

|     | 0.5 ng/ml | 1 ng/ml | 2 ng/ml | 10 ng/ml |
| --- | --------- | ------- | ------- | -------- |
| Oid | 1.2       | -2.75   | -4.8    | -4.4     |
| O1  | 2.5       | -1.8    | -3.45   | -3.6     |
| O2  | 1.8       | -0.5    | -1.8    | -2.8     |
| O3  | ?         | > 3     | > 2.5   | 2 ?      |

Then using
\begin{align}
\Delta F = \Delta \epsilon - \ln(R/N_{NS})
\end{align}
and the binding energies (in $kT$)
\begin{align}
\Delta \epsilon_{Oid} &= -17.7 \\
\Delta \epsilon_{O1}  &= -15.3 \\
\Delta \epsilon_{O2}  &\in (-13.9, -13.6) \\
\Delta \epsilon_{O3}  &\in (-9.7, -9.4)
\end{align}
we get estimates for $R$.

Inferred $R$:

|     | 0.5 ng/ml | 1 ng/ml | 2 ng/ml | 10 ng/ml |
| --- | --------- | ------- | ------- | -------- |
| Oid | 0.03      | 1.5     | 11      | -        |
| O1  | 0.1       | 6       | 30-35   | -        |
| O2  | 0.5-1     | 6-8     | 25-35   | 70-100   |
| O3  | -         |  -      | > 25    | > 30?    |

We have left blank the entries where the Bohr parameter was clearly inferred poorly (either very strong or very weak repression), which would result in obviously inaccurate estimates of $R$. Even the values listed should be taken with a grain of salt, but this gives us an order-of-magnitude sense of things.
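
For reference, the conversion from Bohr parameter to $R$ can be sketched as below. The big assumption is the value of $N_{NS}$, which isn't stated above: I'm taking $N_{NS} = 4.6 \times 10^6$, roughly the length of the _E. coli_ genome in base pairs, which reproduces the tabulated values. The function name is hypothetical.

```python
import numpy as np

N_NS = 4.6e6  # ASSUMPTION: number of nonspecific binding sites (~E. coli genome length)


def repressor_copy_number(bohr, binding_energy, n_ns=N_NS):
    """Invert Delta F = Delta epsilon - ln(R / N_NS) for R (energies in kT)."""
    return n_ns * np.exp(binding_energy - bohr)


# Two entries from the tables above
R_Oid_1ngml = repressor_copy_number(bohr=-2.75, binding_energy=-17.7)
R_O1_2ngml = repressor_copy_number(bohr=-3.45, binding_energy=-15.3)

print(R_Oid_1ngml, R_O1_2ngml)
```

Since $R$ depends exponentially on both energies, an error of $1\,kT$ in either the inferred Bohr parameter or the assumed binding energy changes $R$ by a factor of $e \approx 2.7$, which is why these are order-of-magnitude estimates at best.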

Strangely, though, Brewster and Daniel were using HG203 as a base just as I am, with _tetR_ integrated at the _gspI_ locus and _lacI_ at _ybcN_. The only difference is the reporter at _galK_ and the mNeonGreen fusion. So that means our aTc induction levels ought to be approximately apples to apples, no? Why do I seem to get $\sim1$ order of magnitude stronger induction?? Or maybe my estimates here are just that sloppy?

#### Next step

So we can fit each new repressor copy number & operator combo with its own _de novo_ Bohr parameter, and that is an easily identifiable model. Good.

Can we be more principled? For the next iteration of the model, let's define a binding energy for each operator and a single fit parameter that globally converts from aTc to lacI copy number (assume they're linearly proportional). Fit all the data at once. Now there's only one global $\alpha$ and $b$; how will they adapt? And can we recover, at least approximately, the known operator binding energies?

For coding the full true likelihood, JB suggests referring to a JavaScript library of special functions, by paulmasson.