# A Principled Bayesian Inference of a Fluorescence Calibration Factor

© 2019 Griffin Chure. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

--- 

In [2]:
import numpy as np
import pandas as pd
import pystan 
import mwc.viz
import mwc.stats
import mwc.bayes
import scipy.stats
import altair as alt
%load_ext stanmagic

In this notebook, we lay out a principled workflow for estimating a fluorescence calibration factor. This notebook covers the inference of the calibration factor from a single measurement. A principled analysis of a hierarchical approach will be written at a later date. 

## Writing A Physical Model

It has become trivial in the era of molecular biology to label your favorite protein with a reporter that fluoresces at your favorite wavelength. Similarly, single-cell and single-molecule microscopy have become commonplace, making precise measurement of total cell fluorescence and/or localization a relatively painless procedure. However, it is still remarkably difficult to translate that precise measurement of fluorescence into the absolute copy number of your protein. 

By measuring the fluctuations in intensity between the two daughter cells produced by a division, we can back-calculate how bright a single molecule of interest is in arbitrary units, permitting the relatively easy calculation of protein copy number. We operate under the assumption that protein degradation is negligible and that protein production ceases immediately before division. Using these assumptions, we can say that the total intensity of a given cell $I_\text{tot}$ is related to the number of fluorescent proteins and their relative brightness,

$$
I_\text{tot} = \alpha N_\text{tot}, \tag{1}
$$

where $N_\text{tot}$ is the total number of proteins and $\alpha$ is the brightness of a single fluorophore. Assuming there is no further production or degradation, we can relate the intensity of the mother cell to that of the daughter cells by requiring that the fluorescence be conserved,

$$
I_\text{tot} = I_1 + I_2 = \alpha (N_1 + N_2), \tag{2}
$$

where we have used $I_1$ and $I_2$ to represent the total intensity of daughter cells 1 and 2. By looking at the fluctuations in intensity between the two daughter cells, followed by invocation of the mean and variance of the Binomial distribution, we arrive at the simple relation that

$$
\langle (I_1 - I_2)^2 \rangle = \alpha I_\text{tot}. \tag{3}
$$
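
The step from the Binomial moments to Eq. (3) can be spelled out as follows. This is a sketch that assumes each protein is partitioned independently into either daughter with probability $1/2$ (the same assumption the simulation in this notebook makes with `p=0.5`), and the average is taken at fixed $N_\text{tot}$:

$$
\begin{aligned}
N_1 &\sim \text{Binomial}(N_\text{tot}, 1/2), \qquad \langle N_1 \rangle = \frac{N_\text{tot}}{2}, \qquad \text{Var}(N_1) = \frac{N_\text{tot}}{4},\\
\langle (I_1 - I_2)^2 \rangle &= \alpha^2 \langle (2 N_1 - N_\text{tot})^2 \rangle = 4 \alpha^2\, \text{Var}(N_1) = \alpha^2 N_\text{tot} = \alpha I_\text{tot}.
\end{aligned}
$$

The second line uses $N_2 = N_\text{tot} - N_1$ and $2\langle N_1 \rangle = N_\text{tot}$, so the squared difference is four times the squared deviation of $N_1$ from its mean.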

Of course, lumped in with $\alpha$ are all of the features of the detector, the fluorophore quantum efficiency, and other minutiae of measurement. While incorporating these details into the generative model is the proper thing to do, I know from experience with these experiments, and with quantitative biological microscopy in general, that the noise in these measurements is much smaller than the noise of the biological system. To this end, we will neglect them for simplicity. 

## Building a Generative Statistical Model

The posterior probability distribution for our parameter of interest $\alpha$ is given by Bayes' theorem as
$$
g(\alpha\,\vert\,[I_1, I_2]) \propto f([I_1, I_2]\,\vert\, \alpha)\,g(\alpha),
\tag{4}
$$
where I have used $g$ and $f$ to denote probability density functions over parameters and data, respectively.
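
The likelihood itself is not written out in the text, but it can be read off the `GammaApproxBinom_lpdf` function defined in the Stan model in this notebook. As a sketch, assuming the $N$ division events are independent and indexing them by $i$, that function corresponds to

$$
f([I_1, I_2]\,\vert\,\alpha) = \prod_{i=1}^{N} \frac{1}{\alpha}\, \frac{\Gamma\left(\frac{I_{1,i} + I_{2,i}}{\alpha} + 1\right)}{\Gamma\left(\frac{I_{1,i}}{\alpha} + 1\right)\Gamma\left(\frac{I_{2,i}}{\alpha} + 1\right)}\, 2^{-\frac{I_{1,i} + I_{2,i}}{\alpha}}.
$$

This is the Binomial probability mass function with partitioning probability $1/2$, with the factorials replaced by Gamma functions so that the non-integer values $I/\alpha$ are allowed. The leading $1/\alpha$ is presumably the Jacobian of the change of variables from copy number to intensity.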

In [19]:
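n_div = 500  # Number of simulated division events
# Total protein copy number in each mother cell
n_tot = np.random.gamma(10, 10, n_div).astype(int)
# Partition the proteins binomially between the two daughters
n1 = np.random.binomial(n_tot, p=0.5)
n2 = n_tot - n1
# Draw a calibration factor (a.u. per molecule) for each division
alpha_mu = 150
alpha_sig = 20
alpha_rand = np.random.normal(alpha_mu, alpha_sig, n_div)
# Convert copy numbers to fluorescence intensities
I1 = n1 * alpha_rand
I2 = n2 * alpha_rand
# Squared difference and sum of the daughter intensities, as in Eq. (3)
sq_diff = (I1 - I2)**2
summed = I1 + I2
df = pd.DataFrame(np.array([n_tot, n1, n2, I1, I2, sq_diff, summed, alpha_rand]).T,
                 columns=['n_tot', 'n_1', 'n_2', 'I1', 'I2', 'sq_diff', 'summed', 'alpha'])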
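
Before any sampling, Eq. (3) suggests a quick moment-based sanity check on simulated data: the ratio of the mean squared difference to the mean summed intensity should land near $\alpha$. The sketch below is self-contained and makes two simplifying assumptions that differ from the cell above: $\alpha$ is held fixed at a single hypothetical value (`true_alpha`), and more division events are simulated to reduce scatter. The estimator `alpha_hat` is an illustration, not the inference method used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

true_alpha = 150.0  # assumed fixed calibration factor (a.u. / molecule)
n_div = 20000       # more divisions than the notebook cell, to reduce scatter

# Simulate copy numbers and binomial partitioning, as in the cell above
n_tot = rng.gamma(10, 10, size=n_div).astype(int)
n1 = rng.binomial(n_tot, 0.5)
n2 = n_tot - n1

# Convert to intensities with the fixed calibration factor
I1 = true_alpha * n1
I2 = true_alpha * n2

# Eq. (3): <(I1 - I2)^2> = alpha * I_tot, so a ratio of means estimates alpha
sq_diff = (I1 - I2) ** 2
summed = I1 + I2
alpha_hat = sq_diff.mean() / summed.mean()

print(f"moment estimate of alpha: {alpha_hat:.1f} a.u. / molecule")
```

A point estimate like this carries no uncertainty, which is one reason to go on to the full posterior.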

In [35]:
%%stan -v hier
functions{
    /** 
    * Approximate the Binomial distribution for continuous variables 
    * as a ratio of Gamma functions 
    * 
    * @param I1: Observed fluorescence of daughter cell 1. 
    * @param I2: Observed fluorescence of daughter cell 2.
    * @param alpha: Fluorescence calibration factor in units of a.u. / molecule
    * @return Log probability density summed over all division events
    **/
    real GammaApproxBinom_lpdf(vector I1, vector I2, real alpha) {
        return sum(-log(alpha) + lgamma(((I1 + I2) ./ alpha) + 1) - lgamma((I1 ./ alpha) + 1) - lgamma((I2 ./ alpha) + 1) - ((I1 + I2) ./ alpha) * log(2));
    } 
}
     
data {
    int<lower=0> N; // Number of data points
    vector<lower=0>[N] I1; // Observed fluorescence of daughter cell 1
    vector<lower=0>[N] I2; // Observed fluorescence of daughter cell 2
}

parameters {
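    // Non-centered parameterization of the calibration factor
    real<lower=0> alpha_mu; // Location of alpha
    real alpha_raw; // Unscaled offset of alpha from alpha_mu
    real<lower=0> tau; // Scale of the offset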
}

transformed parameters{
    real alpha = alpha_mu + tau * alpha_raw;
}


model {    
    alpha_raw ~ normal(0, 2);
    alpha_mu ~ lognormal(2, 2);
    tau ~ normal(0, 1);
    I1 ~ GammaApproxBinom(I2, alpha);  
}


Using pystan.stanc compiler..
-------------------------------------------------------------------------------
Model compiled successfully. Output stored in hier object.
Type hier in a cell to see a nicely formatted code output in a notebook
     ^^^^
Access model compile output properties
hier.model_file -> Name of stan_file [None]
hier.model_name -> Name of stan model [None]
hier.model_code -> Model code [functions{     /**   ....]
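
It can help to see what values of `alpha` the priors in the `model` block imply before any data are seen. The following is a minimal prior predictive sketch in NumPy, not part of the original analysis. It draws from the three priors as written in the Stan code (with the half-normal for `tau` following from its `lower=0` bound) and applies the same transformation `alpha = alpha_mu + tau * alpha_raw`. The number of draws and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10000  # arbitrary number of prior draws

# Priors as written in the Stan model block
alpha_mu = rng.lognormal(mean=2, sigma=2, size=n_draws)  # lognormal(2, 2)
tau = np.abs(rng.normal(0, 1, size=n_draws))             # normal(0, 1) with lower=0
alpha_raw = rng.normal(0, 2, size=n_draws)               # normal(0, 2)

# Same transformation as the transformed parameters block
alpha = alpha_mu + tau * alpha_raw

# Summaries of the implied prior on alpha
prior_percentiles = np.percentile(alpha, [2.5, 50, 97.5])
frac_nonpositive = np.mean(alpha <= 0)

print("2.5 / 50 / 97.5 percentiles of alpha:", prior_percentiles)
print("fraction of prior draws with alpha <= 0:", frac_nonpositive)
```

Two things are worth noting from the definitions alone. The prior median of `alpha_mu` is $e^2 \approx 7.4$, well below the value of 150 used to simulate the data, although the wide log-scale of 2 leaves substantial mass at larger values. And because `alpha` itself has no lower bound, some prior draws are non-positive, where the log density in `GammaApproxBinom_lpdf` is undefined.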


In [36]:
model = pystan.StanModel(model_code=hier.model_code)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_8279ca5d8b472d53461b0ef41ed5fb07 NOW.
  tree = Parsing.p_module(s, pxd, full_module_name)


In [37]:
data_dict = {'N': len(df), 'I1': df['I1'], 'I2': df['I2']}
samples = model.sampling(data_dict, iter=1000)

In [38]:
samples

Inference for Stan model: anon_model_8279ca5d8b472d53461b0ef41ed5fb07.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

            mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha_mu  141.52    0.28   9.34 123.24  135.4 141.35 147.46 160.41   1121    1.0
alpha_raw   0.08    0.05   1.98  -3.65  -1.29   0.06   1.39   4.11   1406    1.0
tau         0.82    0.02   0.59   0.04   0.35   0.71   1.17   2.24   1502    1.0
alpha     141.59    0.26   9.17 123.83 135.35 141.43 147.41 159.95   1227    1.0
lp__       -3990    0.05   1.31  -3993  -3990  -3989  -3989  -3988    650    1.0

Samples were drawn using NUTS at Fri Feb 15 16:15:28 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
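
As a cross-check that does not depend on Stan, the log density in `GammaApproxBinom_lpdf` can be re-implemented with `scipy.special.gammaln` and evaluated on a grid of $\alpha$ values. The sketch below is a stand-in under stated assumptions rather than the notebook's own analysis: it simulates fresh data with a single fixed, hypothetical `true_alpha`, ignores the priors entirely, and the function name `gamma_approx_binom_logp` and the grid bounds are illustrative choices.

```python
import numpy as np
from scipy.special import gammaln


def gamma_approx_binom_logp(I1, I2, alpha):
    """Summed log density of the Gamma-function approximation to the
    Binomial, mirroring the expression in the Stan functions block."""
    I1 = np.asarray(I1, dtype=float)
    I2 = np.asarray(I2, dtype=float)
    n1 = I1 / alpha
    n2 = I2 / alpha
    n_tot = n1 + n2
    return np.sum(-np.log(alpha) + gammaln(n_tot + 1) - gammaln(n1 + 1)
                  - gammaln(n2 + 1) - n_tot * np.log(2))


# Simulate data with a fixed calibration factor (an assumption of this sketch)
rng = np.random.default_rng(0)
true_alpha = 150.0
n_tot = rng.gamma(10, 10, size=20000).astype(int)
n1 = rng.binomial(n_tot, 0.5)
I1 = true_alpha * n1
I2 = true_alpha * (n_tot - n1)

# Evaluate the log likelihood on a grid and locate its maximum
alpha_grid = np.linspace(100, 200, 401)
logp = np.array([gamma_approx_binom_logp(I1, I2, a) for a in alpha_grid])
alpha_mle = alpha_grid[np.argmax(logp)]

print(f"grid maximum of the log likelihood: alpha = {alpha_mle:.2f}")
```

Passing `df['I1']` and `df['I2']` from the simulation cell to the same function gives a likelihood curve that can be compared against the posterior summary for `alpha` printed above, keeping in mind that the grid evaluation omits the priors.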