In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st

import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting
import bokeh_catplot

bokeh.io.output_notebook()

%load_ext blackcellmagic

Load the data from Brewster, pre-tidied by Manuel, and drop the spurious column left over from the index of the CSV.

In [None]:
raw_data = pd.read_csv("../../data/jones_brewster_2014.csv")
del raw_data['Unnamed: 0']

In [None]:
raw_data.head()

What are all the experiment labels in the dataset?

In [None]:
raw_expt_labels = raw_data['experiment'].unique()
raw_expt_labels

Whoa, that's a lot to wrestle with! Let's take a glance at all of it and then zoom in on a subset to test a pipeline.

In [None]:
plot_kwargs = {
    "x_axis_label": "counts",
    "y_axis_label": "expt",
    "width": 500,
    "height": 1000,
    "horizontal": True,
}
p = bokeh_catplot.box(data=raw_data, cats="experiment", val="mRNA_cell", **plot_kwargs)
bokeh.io.show(p)

UV5, 5DL10, and 5DL20 look like good candidates for a closer look; all have decent non-zero expression, and they look different from each other.

In [None]:
df_slice = raw_data[raw_data["experiment"].isin(["UV5", "5DL10", "5DL20"])]

df_slice['experiment'].unique()

Now that we've got a more manageable set, let's make ECDFs and chi-by-eye some negative binomial CDFs onto them. The `scipy.stats` convention is `nbinom.cdf(k, n, p, loc=0)`, where $n$ is the number of successes we're waiting for and $p$ is the probability of success on each trial.

In [None]:
p = bokeh_catplot.ecdf(data=df_slice, cats='experiment', val='mRNA_cell', style='staircase')
# compute upper bound for theoretical CDF plots
u_bound = max(df_slice['mRNA_cell'])
x = np.arange(u_bound+1)
p.line(x, st.nbinom.cdf(x, 5, 0.2))
p.line(x, st.nbinom.cdf(x, 3, 0.4), color='orange')
p.line(x, st.nbinom.cdf(x, .3, 0.26), color='green')
bokeh.io.show(p)

Ok, good start.
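
As an aside, the chi-by-eye guesses could be seeded from the sample moments rather than picked by hand. The sketch below is not part of the original analysis: `nbinom_moment_guess` is a hypothetical helper, and the counts are synthetic stand-ins for one experiment's `mRNA_cell` column. It uses the `scipy.stats` moments, mean $= n(1-p)/p$ and variance $= n(1-p)/p^2$, which give $p = \text{mean}/\text{var}$ and $n = \text{mean}^2/(\text{var} - \text{mean})$.

```python
import numpy as np


def nbinom_moment_guess(counts):
    """Method-of-moments guess for (n, p) in the scipy.stats.nbinom convention.

    Hypothetical helper, meant only for starting values, not inference.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    var = counts.var(ddof=1)
    if var <= mean:
        # The negative binomial needs var > mean; otherwise the guess is undefined
        raise ValueError("No overdispersion (var <= mean); cannot form a guess.")
    p = mean / var
    n = mean**2 / (var - mean)
    return n, p


# Synthetic counts standing in for one experiment's mRNA_cell values
rng = np.random.default_rng(0)
fake_counts = rng.negative_binomial(5, 0.2, size=5000)

n_guess, p_guess = nbinom_moment_guess(fake_counts)
print(n_guess, p_guess)
```

The resulting pair can be passed straight to `st.nbinom.cdf(x, n_guess, p_guess)` and overlaid on the ECDF as a first guess before tweaking by eye.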

## Sampling with Stan

The code below is adapted from JB's tutorial 7a, 2018. Stan parametrizes the negative binomial with $\alpha$ and $\beta$, so that the mean is $\alpha/\beta$. Here $\alpha$ is the burst frequency (nondimensionalized by the mRNA lifetime, so dimensionless) and $\beta = 1/b$, where $b$ is the mean burst size.
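
To keep the two parametrizations straight, here is a small sketch of the mapping from $(\alpha, b)$ to the `scipy.stats` pair $(n, p)$. Matching the means and variances gives $n = \alpha$ and $p = \beta/(1+\beta) = 1/(1+b)$. The helper name `burst_to_scipy` and the parameter values are made up for illustration.

```python
import numpy as np
import scipy.stats as st


def burst_to_scipy(alpha, b):
    """Map burst frequency alpha and mean burst size b to scipy's (n, p).

    Hypothetical helper; assumes Stan's neg_binomial(alpha, beta) with beta = 1 / b.
    """
    beta = 1.0 / b
    return alpha, beta / (1.0 + beta)


# Illustrative values only, not fitted to any data
alpha, b = 3.0, 1.5
n, p = burst_to_scipy(alpha, b)

# Check against the moments implied by the burst picture
mean, var = st.nbinom.stats(n, p, moments="mv")
print(n, p, mean, var)
```

With these values, $(\alpha, b) = (3, 1.5)$ maps to $(n, p) = (3, 0.4)$, the same pair used for the orange chi-by-eye curve above, so that curve corresponds to a mean of $\alpha b = 4.5$ transcripts per cell.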

### Prior predictive checks

In [None]:
model_code_prior_pred = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha = lognormal_rng(0.0, 2.0);
  real b = lognormal_rng(2.0, 3.0);
  real beta = 1.0 / b;
  
  for (i in 1:N) {
    n[i] = neg_binomial_rng(alpha, beta);
  }
}
"""

In [None]:
sm_gen = bebi103.stan.StanModel(model_code=model_code_prior_pred)

In [None]:
df_UV5 = df_slice[df_slice["experiment"] == "UV5"]

In [None]:
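# N is the number of cells in each simulated data set
data = dict(N=279)
# Fixed_param: no MCMC, each iteration just draws from the generated quantities block
samples_gen = sm_gen.sampling(data=data,
                              algorithm='Fixed_param',
                              warmup=0,
                              chains=1,
                              iter=300)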

Something is wrong with `extract_array`: the df it returns doesn't have all the columns it claims to. In particular it's missing `chain_idx`, so we can't plot the samples grouped by the model parameters that generated them, and plotting the ECDFs below doesn't work. At least I think that's the problem? This'll take more debugging and a look at the docs, which I don't have now while on a plane!

In [None]:
df_samples = bebi103.stan.extract_array(samples_gen, name="n")

In [None]:
bokeh.io.show(
    bokeh_catplot.ecdf(
        data=df_samples,
        val="n",
        show_legend=False,
        style='staircase',
#         alpha=0.1,
#         x_scale="log",
    )
)
# bokeh.io.show(bebi103.viz.ecdf_collection(data=df_samples,
#                                           val='n',
#                                           cats='chain_idx',
#                                           color='#4e79a7',
#                                           alpha=0.1,
#                                           show_legend=False,
#                                           val_axis_type='log'))
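
Until the `extract_array` issue is sorted out, one possible workaround is to draw the prior predictive samples directly in NumPy and build the tidy data frame by hand. This is only a sketch under the assumption that it mirrors the Stan program above (same lognormal priors, with $n = \alpha$ and $p = 1/(1+b)$ in NumPy's convention). The function name and the `draw` column are hypothetical, with `draw` playing the role of the missing grouping index.

```python
import numpy as np
import pandas as pd


def prior_predictive_tidy(n_draws, n_cells, seed=None):
    """Draw prior predictive counts; return one row per simulated cell.

    Hypothetical stand-in for the Stan generated quantities block above.
    """
    rng = np.random.default_rng(seed)

    # One (alpha, b) pair per draw, from the same priors as the Stan model
    alpha = rng.lognormal(mean=0.0, sigma=2.0, size=n_draws)
    b = rng.lognormal(mean=2.0, sigma=3.0, size=n_draws)
    p = 1.0 / (1.0 + b)

    # Each row of `counts` is one simulated data set of n_cells cells
    counts = rng.negative_binomial(
        alpha[:, None], p[:, None], size=(n_draws, n_cells)
    )

    return pd.DataFrame(
        {
            "draw": np.repeat(np.arange(n_draws), n_cells),
            "alpha": np.repeat(alpha, n_cells),
            "b": np.repeat(b, n_cells),
            "n": counts.ravel(),
        }
    )


df_prior = prior_predictive_tidy(n_draws=50, n_cells=20, seed=1)
print(df_prior.head())
```

A frame like this could then be passed to the same `bokeh_catplot.ecdf` call used earlier with `cats="draw"` and `val="n"`, which should give one ECDF per parameter draw.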

### Sampling the posterior
Since JB used essentially the same model, I'm not too worried about the prior predictive checks passing. Let's just run the full sampling to get posteriors! (The prior definitely extends up to mRNA counts that'd be reasonable for mammalian cells but absurd for bacteria, but that's ok. I think it does include enough mass at low counts that we should still be fine. This will exaggerate the shrinkage if we do the full pipeline w/ SBC and everything, but oh well. JB's refined, tighter prior for mammalian cells is _definitely_ too tight for us, so let's stick with this for now.)
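
The "enough mass at low counts" claim can be roughly quantified without any sampling. The mean mRNA count under the model is $\alpha/\beta = \alpha b$, and since $\alpha$ and $b$ have independent lognormal priors, $\ln(\alpha b) \sim \text{Normal}(0 + 2, \sqrt{2^2 + 3^2})$. The sketch below evaluates the prior probability that the mean count falls below a few thresholds. The thresholds are arbitrary illustrations, and this is the prior on the mean expression level, not on individual cell counts.

```python
import math
from statistics import NormalDist

# log(alpha * b) is the sum of two independent normals:
# log(alpha) ~ Normal(0, 2) and log(b) ~ Normal(2, 3)
mu = 0.0 + 2.0
sigma = math.sqrt(2.0**2 + 3.0**2)
log_mean_prior = NormalDist(mu, sigma)


def prior_mass_below(mean_count):
    """Prior probability that the mean mRNA count alpha * b is below mean_count."""
    return log_mean_prior.cdf(math.log(mean_count))


# Thresholds chosen only for illustration
for threshold in (1, 10, 100, 1000):
    print(threshold, round(prior_mass_below(threshold), 3))
```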

In [None]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""

In [None]:
sm = bebi103.stan.StanModel(model_code=model_code)

In [None]:
data = dict(N=len(df_UV5),
            n=df_UV5['mRNA_cell'].values.astype(int))

samples = sm.sampling(data=data)

In [None]:
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False, inc_warmup=False)

# Take a look
df_mcmc.head()

In [None]:
p = bokeh.plotting.figure(width=450, height=400, 
                          x_axis_label='α (bursts per mRNA lifetime)', 
                          y_axis_label='b (transcripts per burst)')
p.circle(df_mcmc['alpha'], df_mcmc['b'], alpha=0.05)
bokeh.io.show(p)

That looks quite reasonable. The transcripts per burst & burst frequency are both comparable to what we would have inferred from Manuel's MCMC, but now both parameters are actually identifiable!

### Sampling all the data
Let's repeat for all the genes!

In [None]:
plots = []
for gene in raw_expt_labels:
    temp_df = raw_data[raw_data['experiment'] == gene]
    data = dict(N=len(temp_df),
                n=temp_df['mRNA_cell'].values.astype(int))

    samples = sm.sampling(data=data)
    
    df_mcmc = bebi103.stan.to_dataframe(samples)

    p = bokeh.plotting.figure(width=300, height=250, title=gene,
                              x_axis_label='α (bursts per mRNA lifetime)', 
                              y_axis_label='b (transcripts per burst)')
    p.circle(df_mcmc['alpha'], df_mcmc['b'], alpha=0.025)

    plots.append(p)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

Very interesting. Seems like most of the burst sizes are between $\sim1$ and $\sim6$, with a few extreme cases as low as $0.5$ or as high as $15$. By contrast the burst frequencies vary over a wider range, from $\sim5$ down to $\sim0.03$. Weakly suggests burst freq is the more dynamic variable? But need more exploration.
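
Reading ranges off 30-odd scatter plots is error-prone, so a numerical summary per gene would help firm this up. Below is a minimal sketch of such a summary. The helper `summarize_posterior` is hypothetical, and the samples are synthetic lognormal draws standing in for a real `df_mcmc`; the only assumption about the real data frame is that it has `alpha` and `b` columns, as used in the plots above.

```python
import numpy as np
import pandas as pd


def summarize_posterior(df_mcmc, params=("alpha", "b")):
    """Median and central 95% interval for each parameter column.

    Hypothetical helper; expects one column of samples per parameter.
    """
    rows = {}
    for param in params:
        low, median, high = np.percentile(df_mcmc[param], [2.5, 50, 97.5])
        rows[param] = {"median": median, "low_95": low, "high_95": high}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic samples standing in for one gene's posterior
rng = np.random.default_rng(2)
fake_mcmc = pd.DataFrame(
    {
        "alpha": rng.lognormal(mean=0.5, sigma=0.1, size=4000),
        "b": rng.lognormal(mean=1.0, sigma=0.2, size=4000),
    }
)

summary = summarize_posterior(fake_mcmc)
print(summary)
```

Inside the loop over genes, one option would be to store each summary in a dict keyed by gene and combine them with `pd.concat` afterwards, giving a single table to sort by median $\alpha$ or $b$.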

### Next steps
Next let's plot these all together to look for trends. Does Stan or bebi103.viz have a way to extract contour credible regions? If not, maybe just construct a 2D Gaussian from the mean & covariance matrix of the samples for each condition, and then plot a 95% credible region or something for each from that and overlay all those. Let's also exclude the regulated data and just look for trends in the unregulated promoters first.
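
For the 2D Gaussian fallback, a minimal sketch might look like the following. It assumes nothing about Stan or `bebi103.viz`; the function name is hypothetical and the samples are synthetic. For a bivariate Gaussian, the squared Mahalanobis radius enclosing a fraction `level` of the mass is $-2\ln(1 - \text{level})$, so the ellipse is a scaled unit circle mapped through the eigendecomposition of the sample covariance.

```python
import numpy as np


def gaussian_credible_ellipse(x, y, level=0.95, n_points=200):
    """Outline of the `level` region of a 2D Gaussian fit to samples (x, y).

    Hypothetical helper; a Gaussian approximation, not a true HPD contour.
    """
    samples = np.column_stack([x, y])
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)

    # Radius in Mahalanobis units (chi-square quantile with 2 degrees of freedom)
    radius = np.sqrt(-2.0 * np.log(1.0 - level))

    # Principal axes of the covariance
    evals, evecs = np.linalg.eigh(cov)

    # Unit circle, stretched along the principal axes and shifted to the mean
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.stack([np.cos(theta), np.sin(theta)])
    ellipse = mean[:, None] + radius * evecs @ (np.sqrt(evals)[:, None] * circle)
    return ellipse[0], ellipse[1]


# Synthetic correlated samples standing in for (alpha, b) posterior draws
rng = np.random.default_rng(3)
xs, ys = rng.multivariate_normal(
    [1.0, 4.0], [[0.04, 0.03], [0.03, 0.25]], size=5000
).T

ex, ey = gaussian_credible_ellipse(xs, ys)
```

The outline can be drawn with `p.line(ex, ey)` on a single shared Bokeh figure, one ellipse per condition. Since $\alpha$ and $b$ are positive and their posteriors may be skewed, fitting the Gaussian to $\ln\alpha$ and $\ln b$ instead and plotting on log axes may be a better approximation.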