In [109]:
import numpy as np
import pandas as pd
import scipy.stats as st

import re #regex

import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting
# import bokeh.models.mappers
import bokeh.palettes
import bokeh_catplot

import holoviews as hv

bokeh.io.output_notebook()
hv.extension("bokeh")


%load_ext blackcellmagic

The blackcellmagic extension is already loaded. To reload it, use:
  %reload_ext blackcellmagic


### Munging
Load data from Brewster, pre-tidied by Manuel, and drop the spurious column that was the index in csv.
See `code/exploratory/fish_munging.ipynb` for details. TL;DR: don't use the regulated csv, the one below has all the FISH data. mRNA_cell is the data we want, not spots_totals (some of the repressed strains have higher spots_totals than UV5, so that's clearly not the readout we want).

In [2]:
df_fish = pd.read_csv("../../data/jones_brewster_2014.csv")
del df_fish['Unnamed: 0']
df_fish.head()

Unnamed: 0,area_cells,date,experiment,mRNA_cell,num_intens_totals,spots_totals
0,402,20111220,UV5,27,4.544086,21
1,288,20111220,UV5,19,3.196886,14
2,358,20111220,UV5,25,4.24925,19
3,310,20111220,UV5,30,5.075867,22
4,300,20111220,UV5,31,5.361156,24


Next, let's get the energies from the supplement of the Brewster/Jones 2012 paper.

In [3]:
df_energies = pd.read_csv("../../data/brewster_jones_2012.csv")
df_energies.head()

Unnamed: 0,Name,Sequence,Energy (AU),Energy (kT)
0,UV5,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG,41.796231,-6.992058
1,WT,CAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG,53.446117,-5.346594
2,WTDL10,CAGGCATTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG,57.831389,-4.727205
3,WTDL20,CAGGCTTAAGACTTTATGCTTCCGGCTCGTATGTTGTGTGG,69.025484,-3.146118
4,WTDL20v2,CAGGCCTTAGACTTTATGCTTCCGGCTCGTATGTTGTGTGG,69.933345,-3.017889


All the promoters in the 2012 dataset are in the 2014 FISH dataset (verified in `code/exploratory/fish_munging.ipynb`). These are the only constitutive promoters I'm interested in (this only excludes a couple, and they are useless without more metadata).

#### Splitting into regulated & constitutive data
Some of these datasets are not of interest right now, so let's split the data into multiple dataframes for easier downstream handling. The regulated datasets start with O1, O2, or O3; everything else is unregulated. From the unregulated labels, grab the ones that we have energies for, and set aside the rest. Use regex to parse.

In [4]:
raw_expt_labels = df_fish['experiment'].unique()
raw_expt_labels.sort()

# regulated datasets: every label that starts with 'O' (O1, O2, O3)
regulated_labels = [label for label in raw_expt_labels if re.match('^O', label)]
# constitutive datasets: every label we have a binding energy for
constitutive_labels = [label for label in raw_expt_labels if label in tuple(df_energies.Name)]

Without more metadata, I don't really know what to do with the leftover labels. E.g., what good does the aTc concentration do me if I don't know which promoter it was for?

Now that we've got labels we want, let's slice dataframes accordingly.

In [5]:
df_reg = df_fish[df_fish['experiment'].isin(regulated_labels)]
df_unreg = df_fish[df_fish['experiment'].isin(constitutive_labels)]

## Sampling with Stan

Code below copies from JB's tutorial 7a, 2018. Stan parametrizes the negative binomial with $\alpha$ and $\beta$, where $\alpha$ is the burst frequency (dimensionless, nondimensionalized by mRNA lifetime) and $\beta = 1/b$ where $b$ is the mean burst size.
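To keep the parametrization straight, here is a minimal sketch mapping $(\alpha, b)$ onto `scipy.stats.nbinom`, which uses a success probability $p$ instead of $\beta$. With $\beta = 1/b$ the correspondence is $p = \beta/(1+\beta) = 1/(1+b)$, so the mean is $\alpha b$ and the variance is $\alpha b (1 + b)$. The helper name `stan_negbinom_to_scipy` and the numbers are illustrative only, not fitted values.

```python
import numpy as np
import scipy.stats as st


def stan_negbinom_to_scipy(alpha, b):
    """Frozen scipy negative binomial matching Stan's neg_binomial(alpha, 1/b)."""
    beta = 1.0 / b
    p = beta / (1.0 + beta)  # identical to 1 / (1 + b)
    return st.nbinom(alpha, p)


# Illustrative parameter values (assumed for the example, not inferred from data)
dist = stan_negbinom_to_scipy(alpha=5.0, b=3.0)
mean_count = dist.mean()  # alpha * b
var_count = dist.var()    # alpha * b * (1 + b)
```

The mean being $\alpha b$ is the usual "bursts per lifetime times transcripts per burst" reading of the bursty model.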

### Prior predictive checks

Look into ArviZ!!

In [6]:
model_code_prior_pred = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha = lognormal_rng(0.0, 2.0);
  real b = lognormal_rng(2.0, 3.0);
  real beta = 1.0 / b;
  
  for (i in 1:N) {
    n[i] = neg_binomial_rng(alpha, beta);
  }
}
"""

In [7]:
sm_gen = bebi103.stan.StanModel(model_code=model_code_prior_pred)

Using cached StanModel.


In [8]:
data = dict(N=279)
samples_gen = sm_gen.sampling(data=data,
                              algorithm='Fixed_param',
                              warmup=0,
                              chains=1,
                              iter=300)

Something is wrong with `extract_array`: the df it returns doesn't have all the columns it claims it does, so plotting the ECDFs below doesn't work. It's missing `chain_idx`, so we can't plot the samples grouped by the model parameters that generated them. At least I think that's the problem? This'll take more debugging and reference to the docs, which I don't have now while on a plane!

_Maybe look into ArviZ instead, sounds like that's gonna supersede bebi103 utilities very soon._
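As a stopgap that avoids the Stan extraction entirely, the same prior predictive draws can be sketched directly in NumPy. This is not the bebi103 or ArviZ route, just a rough equivalent of the `generated quantities` block above; the function name `prior_predictive` and the `draw` column (standing in for the missing `chain_idx`) are my own hypothetical names.

```python
import numpy as np
import pandas as pd


def prior_predictive(n_draws=300, n_cells=279, seed=0):
    """Tidy dataframe of prior predictive counts, tagged by parameter draw."""
    rng = np.random.default_rng(seed)
    # Same priors as the Stan model: alpha ~ LogNormal(0, 2), b ~ LogNormal(2, 3)
    alpha = rng.lognormal(mean=0.0, sigma=2.0, size=n_draws)
    b = rng.lognormal(mean=2.0, sigma=3.0, size=n_draws)
    p = 1.0 / (1.0 + b)  # NumPy's success probability; equals beta / (1 + beta)
    # One row per parameter draw, one column per simulated cell
    counts = rng.negative_binomial(
        alpha[:, None], p[:, None], size=(n_draws, n_cells)
    )
    return pd.DataFrame(
        {
            "draw": np.repeat(np.arange(n_draws), n_cells),
            "n": counts.ravel(),
        }
    )


df_prior_pred = prior_predictive(n_draws=20, n_cells=10)
```

Grouping by `draw` then gives one ECDF per set of prior parameters, which is what the commented-out `ecdf_collection` call was after.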

In [9]:
df_samples = bebi103.stan.extract_array(samples_gen, name="n")

In [10]:
bokeh.io.show(
    bokeh_catplot.ecdf(
        data=df_samples,
        val="n",
        show_legend=False,
        style='staircase',
#         alpha=0.1,
#         x_scale="log",
    )
)
# bokeh.io.show(bebi103.viz.ecdf_collection(data=df_samples,
#                                           val='n',
#                                           cats='chain_idx',
#                                           color='#4e79a7',
#                                           alpha=0.1,
#                                           show_legend=False,
#                                           val_axis_type='log'))

### Sampling the Posterior
Since JB used essentially the same model, I'm not too worried about the prior predictive checks passing. Let's just run the full sampling to get posteriors! (The prior definitely extends up to mRNA counts that'd be reasonable for mammalian cells but absurd for bacteria, but that's ok. I think it does include enough mass at low counts that we should still be fine. This will exaggerate the shrinkage if we do the full pipeline w/ SBC and everything, but oh well. JB's refined, tighter prior for mammalian cells is _definitely_ too tight for us, so let's stick with this for now.)

In [11]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""

In [12]:
sm = bebi103.stan.StanModel(model_code=model_code)

Using cached StanModel.


In [13]:
# Pull out the UV5 rows once instead of filtering twice
df_uv5 = df_unreg[df_unreg['experiment'] == 'UV5']
data = dict(N=len(df_uv5),
            n=df_uv5['mRNA_cell'].values.astype(int))

samples = sm.sampling(data=data)

In [14]:
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False, inc_warmup=False)

# Take a look
df_mcmc.head()

Unnamed: 0,chain,draw,warmup,alpha,b,beta_,lp__
0,0,0,0,5.500826,3.345588,0.298901,-9471.689413
1,0,1,0,5.530088,3.350096,0.298499,-9470.801406
2,0,2,0,5.385429,3.478141,0.28751,-9469.555721
3,0,3,0,5.104134,3.646759,0.274216,-9470.224885
4,0,4,0,5.082517,3.698882,0.270352,-9470.250175


In [15]:
p = bokeh.plotting.figure(width=450, height=400, 
                          x_axis_label='α (bursts per mRNA lifetime)', 
                          y_axis_label='b (transcripts per burst)')
p.circle(df_mcmc['alpha'], df_mcmc['b'], alpha=0.05)
bokeh.io.show(p)

That looks quite reasonable. The transcripts per burst & burst frequency are both comparable to what we would have inferred from Manuel's MCMC, but now both parameters are actually identifiable!

### Sampling all the data
Let's repeat for all the constitutive promoters! Since we have so many, do separate loops to generate the samples and generate viz (so we can tweak viz without resampling).

In [16]:
all_samples = {}
for gene in df_unreg['experiment'].unique():
    temp_df = df_unreg[df_unreg['experiment'] == gene]
    data = dict(N=len(temp_df),
                n=temp_df['mRNA_cell'].values.astype(int))

    samples = sm.sampling(data=data)
    
    all_samples[gene] = bebi103.stan.to_dataframe(samples)

Now plot all the samples, overlaid with a contour enclosing 95% of the samples. (Default smoothing in the contour calculator occasionally breaks and totally misses the HPD, so I increased it slightly.)

In [224]:
plots = []
for gene in all_samples:
    p = bokeh.plotting.figure(
        width=300,
        height=250,
        title=gene,
        x_axis_label="α (bursts per mRNA lifetime)",
        y_axis_label="b (transcripts per burst)",
    )
    alpha_samples = all_samples[gene]["alpha"]
    b_samples = all_samples[gene]["b"]
    p.circle(alpha_samples, b_samples, alpha=0.025)
    x_contour, y_contour = bebi103.viz.contour_lines_from_samples(
        alpha_samples.values, b_samples.values, smooth=0.025, levels=0.95
    )
    p.line(x_contour[0], y_contour[0])

    plots.append(p)

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Very interesting. Seems like most of the burst sizes are between $\sim1$ and $\sim6$, with a few extreme cases as low as $0.5$. By contrast the burst frequencies vary over a wider range, from $\sim5$ down to $\sim0.03$. Weakly suggests burst freq is the more dynamic variable? But need more exploration.
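Rather than reading those ranges off the plots, a small table of posterior medians and central 95% intervals per promoter makes the comparison concrete. This is a sketch: `summarize_posteriors` is a hypothetical helper, and the demo runs on made-up draws shaped like the entries of `all_samples`, not on the real posteriors.

```python
import numpy as np
import pandas as pd


def summarize_posteriors(all_samples, params=("alpha", "b")):
    """Median and central 95% interval of each parameter, one row per promoter."""
    rows = []
    for gene, df in all_samples.items():
        row = {"promoter": gene}
        for param in params:
            lo, med, hi = np.quantile(df[param].to_numpy(), [0.025, 0.5, 0.975])
            row[f"{param}_lo"] = lo
            row[f"{param}_median"] = med
            row[f"{param}_hi"] = hi
        rows.append(row)
    return pd.DataFrame(rows).set_index("promoter")


# Synthetic stand-in for `all_samples` (made-up draws, NOT real posteriors)
rng = np.random.default_rng(0)
fake_samples = {
    name: pd.DataFrame(
        {
            "alpha": rng.lognormal(np.log(a), 0.05, size=1000),
            "b": rng.lognormal(np.log(b), 0.05, size=1000),
        }
    )
    for name, a, b in [("promoter_A", 5.0, 3.0), ("promoter_B", 0.1, 2.0)]
}
summary = summarize_posteriors(fake_samples)
```

On the real data the call would presumably be `summarize_posteriors(all_samples)`, and the spread of the `alpha_median` and `b_median` columns puts numbers on which parameter varies more.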

First compute all the contour coordinates with JB's utility. (Note that `hv.Contours` prefers its input as a dictionary: `"x"` and `"y"` keys must be labeled as such and provide the coords of the contour, and a 3rd key provides a scalar for the contour level, here the promoter binding energy, which we look up.) Then we can overlay all the contours, colored by their corresponding promoter binding energy (fat lines make it easier to perceive the color).

In [246]:
contour_list = []
for gene in all_samples:
    alpha_samples = all_samples[gene]["alpha"]
    b_samples = all_samples[gene]["b"]
    x_contour, y_contour = bebi103.viz.contour_lines_from_samples(
        alpha_samples.values, b_samples.values, smooth=0.025, levels=0.95
    )
    contour_list.append(
        {
            "x": x_contour[0],
            "y": y_contour[0],
            "Energy (kT)": df_energies.loc[
                df_energies["Name"] == gene, "Energy (kT)"
            ].values[0],
            "Promoter": gene
        }
    )
p = (
    hv.Contours(contour_list, vdims=["Energy (kT)", "Promoter"])
    .opts(logx=True, logy=True)
    .opts(line_width=2)
)
p.opts(
    hv.opts.Contours(
        cmap="viridis",
        colorbar=True,
        tools=["hover"],
        width=500,
        height=500,
        xlabel="α (bursts per mRNA lifetime)",
        ylabel="b (transcripts per burst)",
        padding=0.03,
    )
)

Very interesting. By eye I'd say there's little to no correlation between burst size $b$ and binding energy, and maybe linear scaling between the log of burst frequency $\alpha$ and binding energy, though that correlation is quite noisy. Not immediately obvious how to proceed from a theory/modeling viewpoint.
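One crude way to go beyond "by eye" is an ordinary least-squares fit of $\ln \alpha$ against binding energy, using a point estimate such as the posterior median of $\alpha$ per promoter. This is only a sketch: it ignores posterior uncertainty, `fit_log_alpha_vs_energy` is a hypothetical helper, and the demo values are synthetic and noise-free, chosen just to show the call.

```python
import numpy as np
import scipy.stats as st


def fit_log_alpha_vs_energy(energies_kT, alpha_medians):
    """Least-squares fit of ln(alpha) against binding energy in kT."""
    energies_kT = np.asarray(energies_kT, dtype=float)
    log_alpha = np.log(np.asarray(alpha_medians, dtype=float))
    fit = st.linregress(energies_kT, log_alpha)
    return fit.slope, fit.intercept, fit.rvalue


# Synthetic, noise-free demo (made-up numbers, NOT the inferred burst frequencies)
demo_energies = np.array([-7.0, -5.5, -4.5, -3.0])
demo_alphas = np.exp(-1.0 * demo_energies - 5.0)
slope, intercept, rvalue = fit_log_alpha_vs_energy(demo_energies, demo_alphas)
```

With the real posteriors, the inputs would be the `Energy (kT)` column of `df_energies` and the matching per-promoter medians of `alpha`. The same call on the medians of `b` would test the apparent lack of correlation with burst size.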

Probably the next thing to do is posterior predictive checks: the posteriors are nice and tight, but that doesn't necessarily mean the model is sufficiently complex to explain the data.
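A minimal NumPy sketch of what such a check could look like: resample joint $(\alpha, b)$ draws from the posterior, simulate a replicated dataset for each, and ask how often a summary statistic of the replicates exceeds the observed one. The function names are hypothetical, and the demo uses made-up draws and counts, not the FISH data. Doing this inside Stan's `generated quantities` block would be the more standard route.

```python
import numpy as np


def posterior_predictive(alpha_draws, b_draws, n_cells, n_rep=200, seed=0):
    """Simulate n_rep datasets of n_cells counts, one posterior draw per dataset."""
    rng = np.random.default_rng(seed)
    alpha_draws = np.asarray(alpha_draws, dtype=float)
    b_draws = np.asarray(b_draws, dtype=float)
    # Resample joint (alpha, b) pairs so their posterior correlation is kept
    idx = rng.integers(len(alpha_draws), size=n_rep)
    p = 1.0 / (1.0 + b_draws[idx])
    return rng.negative_binomial(
        alpha_draws[idx][:, None], p[:, None], size=(n_rep, n_cells)
    )


def tail_fraction(observed, replicated, stat):
    """Fraction of replicated datasets whose statistic is >= the observed one."""
    obs = stat(np.asarray(observed))
    rep = np.array([stat(row) for row in replicated])
    return float(np.mean(rep >= obs))


# Synthetic demo (made-up numbers, NOT the FISH data or real posterior draws)
rng = np.random.default_rng(1)
observed = rng.negative_binomial(5.0, 0.25, size=300)
alpha_draws = rng.normal(5.0, 0.3, size=1000)
b_draws = rng.normal(3.0, 0.2, size=1000)
replicated = posterior_predictive(alpha_draws, b_draws, n_cells=observed.size)
frac_mean = tail_fraction(observed, replicated, np.mean)
frac_var = tail_fraction(observed, replicated, np.var)
```

Fractions near 0 or 1 would flag a statistic the model fails to reproduce. Comparing the ECDFs of the replicates against the observed ECDF is the more visual version of the same check.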

Depending on how the PPCs look, one possible next step would be to repeat this analysis accounting for gene copy number changes across the cell cycle, as Manuel did. It'd be interesting to do posterior predictive comparisons between the two models (that is, bursty one-state with or without copy number changes). Perhaps that might address the variability in burst size?