In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st

import bokeh.io
import bokeh.plotting
import holoviews as hv
import bokeh_catplot

bokeh.io.output_notebook()
hv.extension('bokeh')

%load_ext blackcellmagic

Load the data from Brewster (pre-tidied by Manuel) and drop the spurious column left over from the index of the CSV file.

In [None]:
raw_data = pd.read_csv("../../data/jones_brewster_2014.csv")
del raw_data['Unnamed: 0']
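As an aside, here is a minimal sketch of where an `Unnamed: 0` column typically comes from and one alternative way to avoid it. The CSV text below is made up for illustration (it is not the real dataset); the assumption is that the file was written with the DataFrame index included, which leaves a first column with an empty header.

```python
import io

import pandas as pd

# Made-up CSV text (not the real data): the header of the first column is
# empty, as happens when a DataFrame is saved with its index included.
csv_text = ",experiment,mRNA_cell\n0,UV5,12\n1,5DL10,3\n"

# Default read: pandas labels the header-less column 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# Alternative: tell pandas that the first column is the index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either approach (deleting the column after loading or passing `index_col=0`) should give the same set of data columns.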

In [None]:
raw_data.head()

What are all the experiment labels in the dataset?

In [None]:
raw_expt_labels = raw_data['experiment'].unique()
raw_expt_labels

Whoa, that's a lot to wrestle with! Let's take a glance at all of it and then zoom in on a subset to test a pipeline.

In [None]:
plot_kwargs = {
    "x_axis_label": "counts",
    "y_axis_label": "expt",
    "width": 500,
    "height": 1000,
    "horizontal": True,
}
p = bokeh_catplot.box(data=raw_data, cats="experiment", val="mRNA_cell", **plot_kwargs)
bokeh.io.show(p)

UV5, 5DL10, and 5DL20 look like good candidates for a closer look; all have decent non-zero expression, and they look different from each other.

In [None]:
# keep only the three experiments picked out above
df_slice = raw_data.query("experiment in ['UV5', '5DL10', '5DL20']")

df_slice['experiment'].unique()

Now that we've got a more manageable set, let's make ECDFs and chi-by-eye a negative binomial to each. The `scipy.stats` convention is `nbinom.cdf(k, n, p, loc=0)`, where $k$ is the number of failures seen before the $n$th success, $n$ is the number of successes we're waiting for, and $p$ is the probability of success on each trial.
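To help pick starting guesses for the chi-by-eye, here is a small sketch of how $(n, p)$ map to the mean and variance under the convention described above: mean $= n(1-p)/p$ and variance $= n(1-p)/p^2$. The helper names `nbinom_moments` and `nbinom_from_moments` are hypothetical, written just for this illustration, and moment matching is only a rough way to get in the right ballpark, not a proper fit.

```python
def nbinom_moments(n, p):
    """Mean and variance of a negative binomial in the (n, p) convention."""
    mean = n * (1 - p) / p
    var = n * (1 - p) / p**2
    return mean, var


def nbinom_from_moments(mean, var):
    """Invert the relations above; requires var > mean (overdispersion)."""
    if var <= mean:
        raise ValueError("negative binomial needs variance > mean")
    p = mean / var
    n = mean**2 / (var - mean)
    return n, p


# Round trip with one of the hand-picked parameter pairs.
mean, var = nbinom_moments(5, 0.2)
n_back, p_back = nbinom_from_moments(mean, var)
print(mean, var, n_back, p_back)
```

The same moments can be cross-checked against `st.nbinom.stats(n, p, moments="mv")`. Note that $n$ need not be an integer here, so a value like 0.3 is allowed.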

In [None]:
p = bokeh_catplot.ecdf(data=df_slice, cats='experiment', val='mRNA_cell', style='staircase')
# upper bound of the x-range for the theoretical CDF plots
u_bound = df_slice['mRNA_cell'].max()
x = np.arange(u_bound + 1)
# hand-picked (n, p) guesses, tuned by eye against the ECDFs
p.line(x, st.nbinom.cdf(x, 5, 0.2))
p.line(x, st.nbinom.cdf(x, 3, 0.4), color='orange')
p.line(x, st.nbinom.cdf(x, 0.3, 0.26), color='green')
bokeh.io.show(p)
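For reference, here is a rough sketch of what the staircase ECDF above represents: sort the values, then the ECDF at the $i$th sorted value is $i/N$. The function name `ecdf_points` and the counts are made up for illustration, and this is not necessarily how `bokeh_catplot` computes it internally.

```python
import numpy as np


def ecdf_points(values):
    """Return sorted values and the fraction of data <= each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up mRNA counts, just to show the shape of the output.
counts = [3, 0, 1, 3, 7]
x_ecdf, y_ecdf = ecdf_points(counts)
print(x_ecdf)
print(y_ecdf)
```

Comparing these points against `st.nbinom.cdf` evaluated at the same `x` values is the numerical counterpart of the by-eye comparison in the plot.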

OK, good start. Next, let's sample.