In [None]:
import pymc3 as pm
import pandas as pd
import matplotlib 

from sklearn.preprocessing import LabelEncoder

%matplotlib inline

## Problem Type

The Bayesian estimation model applies to a wide range of scenarios. The classic one is an experimental design with a control and a treatment, where we want to know the difference between the two. Here, "estimation" refers to estimating the "true" value for the control and the "true" value for the treatment, and "Bayesian" refers to computing the uncertainty surrounding those parameters.

Bayesian estimation's advantages over the classical t-test were first described by John Kruschke (2013).

In this notebook, I provide a concise implementation suitable for two-sample and multi-sample inference.

## Data structure

To be used with this model, the data should be structured as follows:

- Each row is one measurement.
- The columns should indicate, at the minimum:
    - What treatment group the sample belonged to.
    - The measured value.
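
As a rough illustration of this layout, the sketch below reshapes a small wide-format table (one column per group) into one row per measurement. The column names (`replicate`, `control`, `treatment_A`, `group`, `measurement`) and the values are made up for the example; they are not taken from the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical wide-format table: one column per treatment group,
# one row per replicate. Names and values are invented for illustration.
wide = pd.DataFrame({
    "replicate": [1, 2, 3],
    "control": [1.00, 0.95, 1.08],
    "treatment_A": [1.42, 1.35, 1.51],
})

# Reshape so that each row holds exactly one measurement, with one column
# naming the treatment group and one column holding the measured value.
tidy = wide.melt(
    id_vars="replicate",
    var_name="group",
    value_name="measurement",
)
print(tidy)
```

If the data already arrive with one row per measurement, no reshaping is needed.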

## Extensions to the model

As of now, the model only samples posterior distributions of measured values. It may be extended to compute differences in means (treatment vs. control) or effect sizes, complete with the uncertainty around them. Wrap those statistics in `pm.Deterministic(...)` so that their posterior distributions, i.e. their uncertainty, are recorded in the trace as well.
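
To make the idea of a derived posterior concrete, here is a minimal NumPy sketch that applies the transformation draw by draw, which is what a deterministic node amounts to. The draws are synthetic stand-ins, not output from this notebook's model, and the effect size formula (difference in means divided by the root mean square of the two scales) is one common definition, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 4000

# Synthetic stand-ins for posterior draws (NOT produced by the model in
# this notebook): group means and group scales for control and treatment.
mu_control = rng.normal(1.0, 0.05, size=n_draws)
mu_treat = rng.normal(1.4, 0.05, size=n_draws)
sd_control = np.abs(rng.normal(0.10, 0.01, size=n_draws))
sd_treat = np.abs(rng.normal(0.12, 0.01, size=n_draws))

# Applying a transformation to each draw yields draws from the posterior
# of the transformed quantity, so the uncertainty carries through.
diff = mu_treat - mu_control
pooled_sd = np.sqrt((sd_control ** 2 + sd_treat ** 2) / 2)
effect_size = diff / pooled_sd  # assumed definition, see lead-in

print(diff.mean(), effect_size.mean())
```

Declaring the same expressions inside the model with `pm.Deterministic` has the advantage that they are stored in the trace and can be plotted alongside the other variables.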

## Reporting summarized findings

Here are examples of how to summarize the findings.

> Treatment group A was greater than control by x units (95% HPD: [`lower`, `upper`]). 

> Treatment group A was higher than control (effect size 95% HPD: [`lower`, `upper`]). 
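
The `lower` and `upper` bounds in these sentences can be read off the posterior draws. The sketch below is one way to do that by hand, assuming a unimodal posterior: it takes the narrowest window that contains 95% of the sorted draws. The helper name `hpd_interval` and the synthetic draws are hypothetical and only serve the example.

```python
import numpy as np


def hpd_interval(samples, prob=0.95):
    """Return the narrowest interval containing `prob` of the samples.

    Hypothetical helper for illustration; assumes a unimodal distribution.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))  # number of samples the interval must hold
    if k < 1 or k > n:
        raise ValueError("prob must be in (0, 1] and samples non-empty")
    # Width of every window of k consecutive sorted samples.
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


# Synthetic draws standing in for a posterior of "treatment minus control".
rng = np.random.default_rng(42)
diff_draws = rng.normal(0.4, 0.07, size=4000)

lower, upper = hpd_interval(diff_draws)
print(
    f"Treatment group A was greater than control by "
    f"{diff_draws.mean():.2f} units (95% HPD: [{lower:.2f}, {upper:.2f}])."
)
```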

## Other notes

Here, we make a few modelling choices.

1. We care only about the `normalized_measurement` column. We model it with a t-distribution because we don't have a good "mechanistic" model that incorporates the measurement error of `OD600` and `measurement`.

In [None]:
df = pd.read_csv('../datasets/biofilm.csv')
continuous_cols = ['OD600', 'ST', 'replicate', 'measurement', 'normalized_measurement']
for c in continuous_cols:
    df[c] = pm.floatX(df[c])
df.head()

In [None]:
df.dtypes

In [None]:

le = LabelEncoder()
le.fit(df['isolate'])
df['indices'] = le.transform(df['isolate']).astype('int32')

In [None]:
le.classes_

In [None]:
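with pm.Model() as best:
    # Degrees of freedom: exponential prior with mean 30, shifted by 1
    # so that nu >= 1.
    nu = pm.Exponential('nu_minus_one', lam=1 / 30.0) + 1

    # One location parameter per isolate, with an improper flat prior.
    fold = pm.Flat('fold', shape=len(le.classes_))

    # One scale parameter per isolate. Despite its name, `var` is passed
    # below as the StudentT scale (`sd`), not as a variance.
    var = pm.HalfCauchy('var', beta=1, shape=len(le.classes_))

    # Map each row of the data to the parameters of its isolate.
    mu = fold[df['indices'].values]
    sd = var[df['indices'].values]

    like = pm.StudentT('like', mu=mu, sd=sd, nu=nu,
                       observed=df['normalized_measurement'])

    # Difference of each isolate from the first class in le.classes_.
    diffs = pm.Deterministic('differences', fold - fold[0])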

In [None]:
with best:
    trace = pm.sample(draws=2000)

In [None]:
pm.forestplot(trace, varnames=['fold'], ylabels=le.classes_)

In [None]:
pm.forestplot(trace, varnames=['differences'], ylabels=le.classes_)