# Bayesian model selection

(c) 2017 the authors. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT). 

In [1]:
import os
import glob
import pickle
import datetime
# Our numerical workhorses
import numpy as np
import pandas as pd
import scipy.special

# Useful plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

# Set the plotting style.
import sys
sys.path.insert(0, '../')
import mwc_mutants_utils as mwc
mwc.set_plotting_style()

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'

# Model selection between single parameter change and multiple parameters.

In this notebook we will explore the use of Bayesian model selection to distinguish between changes in the parameters that a single point mutation in the transcription factor can cause. Specifically we will be comparing two models:
1. $M_1$: A single family of parameters ($\Delta\varepsilon_{RA}$ for the DNA binding domain mutants and $K_A$ and $K_I$ for inducer binding pocket mutants) change when a single amino-acid substitution takes place on a specific part of the repressor.
2. $M_2$: All mutations in the transcription factor change all parameters having to do with the protein (except for $\Delta\varepsilon_{AI}$), i.e. $\Delta\varepsilon_{RA}$, $K_A$, and $K_I$.

An advantage of Bayesian model selection is that it inherently compares three features of the models:
1. Prior information on how likely each model is to be true.
2. Goodness of fit of the model to the data.
3. Complexity of the model.

The framework therefore weighs how well each model describes the data against how complicated the model is.

To see how these features are naturally compared within the Bayesian framework, we can write Bayes' theorem for a model $M_i$ being true as
$$
P(M_i \mid D) = \frac{P(D \mid M_i) P(M_i)}{P(D)},
\tag{1}
$$
where $D$ is the data. In principle the denominator can be computed as
$$
P(D) = \sum_j P(D \mid M_j)P(M_j),
\tag{2}
$$
where we would have to sum over all possible models, which makes it impractical to compute the absolute probability of a specific model being true. We can instead compare two models $M_1$ and $M_2$. Since the denominator is the same for both, it cancels in the ratio and we can write
$$
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1) P(M_1)}{P(D \mid M_2) P(M_2)}.
\tag{3}
$$

For a given model $M_i$ with parameters $\mathbf{a}_i$ we have that
$$
P(D \mid M_i) = \int d\mathbf{a}_i P(D \mid \mathbf{a}_i, M_i) P(\mathbf{a}_i \mid M_i)
\tag{4}
$$

In general this integral is not easy to compute analytically, and one has to resort to parallel tempering MCMC (to be discussed later) to evaluate it numerically. But for cases where the posterior distribution of the parameters is single peaked and more or less symmetric (i.e. Gaussian-like), one can approximate the integral as the height of the peak times its width. This simplification is known as the Laplace approximation and it is written as
$$
P(D \mid M_i) \approx P(\mathbf{a}_i^* \mid M_i) \underbrace{P(D \mid \mathbf{a}_i^*, M_i)}_\text{height} \underbrace{(2\pi)^{\vert \mathbf{a}_i \vert / 2} \sqrt{\det \boldsymbol{\sigma}_i^2}}_\text{width},
\tag{5}
$$
where $\mathbf{a}_i^*$ represents the most likely parameter values, $\vert \mathbf{a}_i \vert$ represents the number of parameters in the model, and $\boldsymbol{\sigma}_i^2$ is the covariance matrix, which can be computed from the Gaussian approximation of the parameter posterior probability.
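As a rough illustration of Eq. (5), the sketch below applies the Laplace approximation to a toy one-parameter problem and compares it against direct numerical integration of Eq. (4). Everything in it is an assumption made for illustration only: the simulated data, the known noise level `noise`, and the uniform prior bounds `mu_min` and `mu_max` are not part of the analysis in this notebook.

```python
import numpy as np
import scipy.integrate
import scipy.optimize

# Toy problem (hypothetical): noisy measurements of an unknown mean mu,
# with known Gaussian noise and a uniform prior on mu.
rng = np.random.default_rng(42)
noise = 0.5
data = rng.normal(loc=1.0, scale=noise, size=20)
mu_min, mu_max = -10.0, 10.0
log_prior = -np.log(mu_max - mu_min)


def log_likelihood(mu):
    """Gaussian log likelihood of the data for a given mean mu."""
    return np.sum(-0.5 * np.log(2 * np.pi * noise**2)
                  - (data - mu)**2 / (2 * noise**2))


# Most likely parameter value (the peak of the posterior).
fit = scipy.optimize.minimize_scalar(
    lambda mu: -log_likelihood(mu), bounds=(mu_min, mu_max), method="bounded")
mu_star = fit.x

# Posterior variance from the curvature of -log posterior at the peak
# (central finite difference; the flat prior does not contribute).
h = 1e-3
curvature = -(log_likelihood(mu_star + h) - 2 * log_likelihood(mu_star)
              + log_likelihood(mu_star - h)) / h**2
variance = 1 / curvature

# Laplace approximation: log prior + log height + log width.
log_evidence_laplace = (log_prior + log_likelihood(mu_star)
                        + 0.5 * np.log(2 * np.pi) + 0.5 * np.log(variance))

# Brute-force check. The integrand is rescaled by its peak to avoid underflow.
integral, _ = scipy.integrate.quad(
    lambda mu: np.exp(log_likelihood(mu) - log_likelihood(mu_star)),
    mu_min, mu_max, points=[mu_star])
log_evidence_quad = log_prior + log_likelihood(mu_star) + np.log(integral)

print(log_evidence_laplace, log_evidence_quad)
```

In this toy case the posterior is exactly Gaussian in `mu`, so the two numbers should agree closely; for skewed or multimodal posteriors the approximation can be poor.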

With this approximation we can write the odds ratio, i.e. the probability ratio between two models as
$$
O_{ij} = \underbrace{\left[ \frac{P(M_i)}{P(M_j)} \right]}_\text{prior on model}
\underbrace{\left[ \frac{P(D \mid \mathbf{a}_i^*, M_i)}{P(D \mid \mathbf{a}_j^*, M_j)} \right]}_\text{goodness of fit}
\underbrace{\left[ \frac{P(\mathbf{a}_i^* \mid M_i) (2\pi)^{\vert \mathbf{a}_i^* \vert / 2 } \sqrt{\det \boldsymbol{\sigma}_i^2}}{P(\mathbf{a}_j^* \mid M_j) (2\pi)^{\vert \mathbf{a}_j^* \vert / 2 } \sqrt{\det \boldsymbol{\sigma}_j^2}} \right]}_\text{Occam factor}.
\tag{6}
$$

This form of the odds ratio explicitly compares the three features mentioned above. The Occam factor accounts for the volume of parameter space that the parameters can occupy. This naturally penalizes more complex models with many more parameters, giving a natural origin to [Occam's razor](https://en.wikipedia.org/wiki/Occam%27s_razor).

As usual when dealing with probabilities, it is easier to take the log. For the case of the odds ratio we have
\begin{align}
\log O_{ij} &= \log \left[ \frac{P(M_i)}{P(M_j)} \right]\\
&+ \log P(D \mid \mathbf{a}_i^*, M_i) - \log P(D \mid \mathbf{a}_j^*, M_j)\\
&+ \log P(\mathbf{a}_i^* \mid M_i) - \log P(\mathbf{a}_j^* \mid M_j)\\
&+ \frac{\vert \mathbf{a}_i^* \vert  - \vert \mathbf{a}_j^* \vert}{2} \log 2\pi\\
&+ \frac{1}{2}\left( \log \det \boldsymbol{\sigma}_i^2 - \log \det \boldsymbol{\sigma}_j^2 \right).
\tag{7}
\end{align}
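The terms of the log odds ratio under the Laplace approximation can be collected into a small helper. This is only a sketch: the function name `log_odds_laplace` and its argument names are hypothetical, and the inputs (log likelihoods and log priors evaluated at the most likely parameters, plus the covariance matrices) are assumed to have been computed elsewhere.

```python
import numpy as np


def log_odds_laplace(log_like_i, log_like_j, log_prior_i, log_prior_j,
                     cov_i, cov_j, log_model_prior_ratio=0.0):
    """Log odds ratio log O_ij under the Laplace approximation.

    log_like_*  : log likelihood at the most likely parameters
    log_prior_* : log parameter prior at the most likely parameters
    cov_*       : parameter covariance matrix of each model
    log_model_prior_ratio : log[P(M_i) / P(M_j)], zero for equal priors
    """
    cov_i = np.atleast_2d(np.asarray(cov_i, dtype=float))
    cov_j = np.atleast_2d(np.asarray(cov_j, dtype=float))
    n_i, n_j = cov_i.shape[0], cov_j.shape[0]  # number of parameters

    # slogdet returns (sign, log|det|) and is stable for tiny determinants.
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)

    return (log_model_prior_ratio
            + log_like_i - log_like_j
            + log_prior_i - log_prior_j
            + 0.5 * (n_i - n_j) * np.log(2 * np.pi)
            + 0.5 * (logdet_i - logdet_j))


# Made-up numbers: a one-parameter model against a two-parameter model
# that fit the data equally well.
example = log_odds_laplace(-10.0, -10.0, 0.0, 0.0,
                           np.array([[0.01]]), np.diag([0.01, 0.04]))
```

Working with `np.linalg.slogdet` rather than taking the log of `np.linalg.det` avoids underflow when the covariance matrix has very small entries.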

## DNA-binding domain mutants.

Let's work out the specific case for DNA-binding domain mutants in which
1. $M_1$ : only the $\Delta\varepsilon_{RA}$ is changed.
2. $M_2$ : $\Delta\varepsilon_{RA}$ along with $K_A$ and $K_I$ are changed.

We will assume that a priori both models are equally likely, such that the first term of Eq. (7) is zero. We will also assume that the data points are independent of one another, such that
$$
\log P(D \mid \mathbf{a}_i^*, M_i) = \sum_{d \in D} \log P(d \mid \mathbf{a}_i^*, M_i),
\tag{8}
$$
where $d$ is an individual data point of the dataset $D$.

We will assign a Gaussian likelihood with constant error across IPTG concentrations such that
\begin{align}
\sum_{d \in D} \log P(d \mid \mathbf{a}_i^*, {\sigma_i^*}, M_i) &= 
-\frac{n}{2} \log \left( 2 \pi  {\sigma_i^*}^2 \right) \\
&- \sum_{d \in D} \frac{\left( \text{fold-change}_{exp}^{(d)} - \text{fold-change}_{thry}^{(d)}(\mathbf{a}_i^*)\right)^2}{2 {\sigma_i^*}^2},
\tag{9}
\end{align}
where $n = \vert D \vert$ is the number of data points, ${\sigma_i^*}$ is the most likely error associated with the Gaussian likelihood, $\text{fold-change}_{exp}^{(d)}$ is the experimental fold-change of the $d^{\text{th}}$ data point, and $\text{fold-change}_{thry}^{(d)}$ is the theoretical prediction for the same datum.
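A minimal sketch of this Gaussian log likelihood, i.e. $-\frac{n}{2}\log(2\pi\sigma^2)$ minus the sum of squared residuals over $2\sigma^2$, is given below. The function name `gaussian_log_likelihood` and the fold-change values are made up for illustration; the theoretical fold-changes are assumed to come from whatever model function is being fit. The result is cross-checked against `scipy.stats.norm.logpdf`.

```python
import numpy as np
import scipy.stats


def gaussian_log_likelihood(fc_exp, fc_thry, sigma):
    """Log likelihood of independent data with constant Gaussian error."""
    fc_exp = np.asarray(fc_exp, dtype=float)
    fc_thry = np.asarray(fc_thry, dtype=float)
    n = fc_exp.size
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum((fc_exp - fc_thry)**2) / (2 * sigma**2))


# Made-up fold-change values, for illustration only.
fc_exp = np.array([0.10, 0.50, 0.90])
fc_thry = np.array([0.12, 0.48, 0.93])
sigma = 0.05

log_like = gaussian_log_likelihood(fc_exp, fc_thry, sigma)

# Cross-check: sum of the pointwise Gaussian log densities.
log_like_check = scipy.stats.norm.logpdf(fc_exp, loc=fc_thry, scale=sigma).sum()
```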

Finally, for the prior on the parameters we will assume uniform priors for $\Delta\varepsilon_{RA}$, $\tilde{k}_A \equiv -\log \left( K_A / 1\,\text{M} \right)$, and $\tilde{k}_I \equiv -\log \left( K_I / 1\,\text{M} \right)$, and a Jeffreys prior for the $\sigma_i$ parameter associated with the Gaussian likelihood, obtaining
$$
\log P(\mathbf{a}_1^*, \sigma_1 \mid M_1) - \log P(\mathbf{a}_2^*, \sigma_2 \mid M_2) = \left[ \log P(\Delta\varepsilon_{RA} \mid M_1)  + \log P(\sigma_1 \mid M_1) \right]
- \left[ \log P(\Delta\varepsilon_{RA} \mid M_2) + \log P(\tilde{k}_A \mid M_2) + \log P(\tilde{k}_I \mid M_2) + \log P(\sigma_2 \mid M_2) \right].
\tag{10}
$$
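Since for both models $\Delta\varepsilon_{RA}$ and $\sigma$ represent the same thing, the priors on these parameters should be the same, so we take those terms to cancel. (The uniform prior on $\Delta\varepsilon_{RA}$ cancels exactly; the Jeffreys prior, $P(\sigma) \propto 1/\sigma$, cancels up to a term $\log \left( \sigma_2^* / \sigma_1^* \right)$, which vanishes when both fits give the same $\sigma^*$.) A uniform prior on $\tilde{k}$ has density $1 / \left( \tilde{k}^{\max} - \tilde{k}^{\min} \right)$, so we can write this as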
$$
\log P(\mathbf{a}_1^*, \sigma_1 \mid M_1) - \log P(\mathbf{a}_2^*, \sigma_2 \mid M_2) = \left[ \log \left( \tilde{k}_A^{\max} - \tilde{k}_A^{\min} \right) + \log \left( \tilde{k}_I^{\max} - \tilde{k}_I^{\min} \right) \right]
\tag{11}
$$
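To get a feeling for the size of this prior term, the sketch below evaluates it for a pair of prior ranges. The helper `log_uniform_prior` and the bounds `k_a_bounds` and `k_i_bounds` are hypothetical and chosen only for illustration; they are not the ranges used in the analysis.

```python
import numpy as np


def log_uniform_prior(lower, upper):
    """Log density of a uniform prior on the interval [lower, upper]."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return -np.log(upper - lower)


# Hypothetical prior ranges for the log dissociation constants.
k_a_bounds = (0.0, 20.0)
k_i_bounds = (0.0, 20.0)

# Model 2 carries two extra uniform priors, so the difference in log
# priors (model 1 minus model 2) is minus the sum of their log densities.
log_prior_diff = -(log_uniform_prior(*k_a_bounds)
                   + log_uniform_prior(*k_i_bounds))
```

The wider the allowed range for the extra parameters, the larger this term, and the more the prior favors the simpler model.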

Putting all these terms together gives a log odds ratio
\begin{align}
\log O_{12} &= -\frac{n}{2} \log \left( 2 \pi  {\sigma_1^*}^2 \right)
- \sum_{d \in D} \frac{\left( \text{fold-change}_{exp}^{(d)} - \text{fold-change}_{thry}^{(d)}({\Delta\varepsilon_{RA}}_1^*)\right)^2}{2 {\sigma_1^*}^2}\\
&+ \frac{n}{2} \log \left( 2 \pi  {\sigma_2^*}^2 \right)
+ \sum_{d \in D} \frac{\left( \text{fold-change}_{exp}^{(d)} - \text{fold-change}_{thry}^{(d)}({\Delta\varepsilon_{RA}}_2^*, {\tilde{k}_A}_2^*, {\tilde{k}_I}_2^*) \right)^2}{2 {\sigma_2^*}^2}\\
&+ \log \left( \tilde{k}_A^{\max} - \tilde{k}_A^{\min} \right) + \log \left( \tilde{k}_I^{\max} - \tilde{k}_I^{\min} \right) \\
&- \log 2\pi \\
&+ \frac{1}{2} \left( \log \det \boldsymbol{\sigma}_1^2 - \log \det \boldsymbol{\sigma}_2^2 \right),
\tag{12}
\end{align}
where $\boldsymbol{\sigma}_1^2$ is the covariance matrix for the two parameters of model $M_1$ ($\Delta\varepsilon_{RA}$ and $\sigma_1$) and $\boldsymbol{\sigma}_2^2$ is the covariance matrix for the four parameters of model $M_2$ ($\Delta\varepsilon_{RA}$, $\tilde{k}_A$, $\tilde{k}_I$, and $\sigma_2$). The log-likelihood terms of $M_2$ enter with the opposite sign to those of $M_1$ because they are subtracted in Eq. (7), and the $-\log 2\pi$ term follows from $\left( \vert \mathbf{a}_1^* \vert - \vert \mathbf{a}_2^* \vert \right) / 2 = (2 - 4)/2 = -1$.
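The pieces of the log odds ratio for the DNA-binding domain mutants can be assembled term by term from Eq. (7), with two parameters for $M_1$ and four for $M_2$. The sketch below is one possible way to do so; the function names, the argument layout, and the numbers in the usage example are all hypothetical. It takes the residuals (experimental minus theoretical fold-change at the most likely parameters) as inputs, so it makes no assumption about how the theoretical fold-change is computed.

```python
import numpy as np


def gaussian_log_likelihood(residuals, sigma):
    """Gaussian log likelihood given residuals and a constant error sigma."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))


def log_odds_12(resid_1, sigma_1, cov_1, resid_2, sigma_2, cov_2,
                k_a_bounds, k_i_bounds):
    """Log odds ratio of M1 (2 parameters) over M2 (4 parameters)."""
    cov_1 = np.asarray(cov_1, dtype=float)
    cov_2 = np.asarray(cov_2, dtype=float)
    if cov_1.shape != (2, 2) or cov_2.shape != (4, 4):
        raise ValueError("expected a 2x2 covariance for M1 and a 4x4 for M2")

    # Goodness of fit: log likelihood of M1 minus log likelihood of M2.
    goodness_of_fit = (gaussian_log_likelihood(resid_1, sigma_1)
                       - gaussian_log_likelihood(resid_2, sigma_2))

    # Extra uniform priors carried by M2 on the log dissociation constants.
    prior = (np.log(k_a_bounds[1] - k_a_bounds[0])
             + np.log(k_i_bounds[1] - k_i_bounds[0]))

    # Occam factor: (2 - 4) / 2 * log(2 pi) plus the determinant term.
    _, logdet_1 = np.linalg.slogdet(cov_1)
    _, logdet_2 = np.linalg.slogdet(cov_2)
    occam = -np.log(2 * np.pi) + 0.5 * (logdet_1 - logdet_2)

    return goodness_of_fit + prior + occam


# Made-up inputs: both models fit equally well, so only the prior and
# Occam terms contribute.
resid = np.array([0.02, -0.01, 0.03, -0.02])
log_o12 = log_odds_12(resid, 0.02, np.diag([0.01, 1e-4]),
                      resid, 0.02, np.diag([0.01, 0.25, 0.25, 1e-4]),
                      k_a_bounds=(0.0, 20.0), k_i_bounds=(0.0, 20.0))
```

With this sign convention, positive values of `log_o12` favor $M_1$ and negative values favor $M_2$.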