# Homework 1
(c) 2017 Justin Bois. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [1]:
import numpy as np
import pandas as pd

import bokeh.io
import bokeh.plotting

from bokeh.models import Legend
from bokeh.plotting import figure, show, output_file

bokeh.io.output_notebook()

## Problem 1.3 (Microtubule catastrophes)
For this exercise, we will analyze data from the paper by Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, *Cell*, 147, 1092-1103, 2011.
<br>
<br>
In this paper, the authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.
<br>
<br>
We will use their data to generate a plot similar to Fig. 2a of their paper.
<br>
<br>
First, let's load the data into a Pandas `DataFrame`.

In [7]:
# Read the data from the data file into a DataFrame
df = pd.read_csv('data/gardner_et_al_2011_time_to_catastrophe_dic.csv', comment='#')

# Let's take a look
df

Unnamed: 0,time to catastrophe with labeled tubulin (s),time to catastrophe with unlabeled tubulin (s)
0,470,355.0
1,1415,425.0
2,130,540.0
3,280,265.0
4,550,1815.0
5,65,160.0
6,330,370.0
7,325,460.0
8,340,190.0
9,95,130.0


The data above are not tidy. Recall the three rules of tidy data:
* Each variable forms a column.
* Each observation forms a separate row.
* Each type of observational unit forms a separate table.

To tidy these data, we should drop the `NaN`s. We could not use these values for analysis anyway, as we need to match the control measurements (unlabeled) with the measurements of interest (labeled).
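As a point of comparison with the first two tidy-data rules, here is a minimal sketch of an alternative reshaping, in which each row holds a single measurement. The toy values and the column names `tubulin` and `tc` are assumptions made up for illustration; this is not the approach taken in the cells below.

```python
import numpy as np
import pandas as pd

# Toy wide-format data, made up for illustration (not the real data set)
wide = pd.DataFrame({'tc_lab': [470, 1415, 130],
                     'tc_unlab': [355.0, 425.0, np.nan]})

# Melt to long format: one column says which tubulin was used,
# the other holds the time to catastrophe
tidy = wide.melt(var_name='tubulin', value_name='tc')

# Dropping NaN now removes only the missing measurement itself
tidy = tidy.dropna().reset_index(drop=True)
```

In this layout, `dropna()` discards only the row with the missing entry, so no measurement from the other column is lost with it.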

In [28]:
# Drop NaN values
df = df.dropna()

# Let's look at it
df.head()

Unnamed: 0,tc_lab,tc_unlab
0,470,355.0
1,1415,425.0
2,130,540.0
3,280,265.0
4,550,1815.0


The rows containing `NaN` have now been removed.
<br>
<br>
In Fig. 2a of their paper, Gardner, Zanic, et al. show the empirical cumulative distribution function (ECDF). We will try to reconstruct this plot. First, we need to write a function `ecdf_vals(data)` which takes a one-dimensional NumPy array (or Pandas `Series`; the same construction will work) of data and returns the `x` and `y` values for plotting the ECDF. The definition of the ECDF is
\begin{align}
\mathrm{ECDF}(x) = \text{fraction of data points} \leq x
\end{align}
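To make the definition concrete, here is a minimal sketch that evaluates it directly at a single point. The helper name `ecdf_at` and the toy values are assumptions for illustration only; they are not part of the homework solution.

```python
import numpy as np

def ecdf_at(data, x):
    """Fraction of data points less than or equal to x (hypothetical helper)."""
    data = np.asarray(data)
    # The mean of a boolean array is the fraction of True entries
    return np.mean(data <= x)

# Toy data, made up for illustration
toy = np.array([30, 10, 20, 20])

# Three of the four points are <= 20, so the ECDF at 20 is 3/4
frac = ecdf_at(toy, 20)
```

Evaluating the ECDF this way at every data point is wasteful, which is why sorting the data once is the more convenient route for plotting.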

In [10]:
def ecdf_vals(data):
    """Return the x and y values for plotting an ECDF.

    Input: data (NumPy array or Pandas Series)
    Output: a pair of NumPy arrays (x-axis data and y-axis data).
    """
    # The sorted data are the x values
    x = np.sort(data)
    # The i-th sorted point (counting from 1) gets the value i/n
    y = np.arange(1, len(data)+1) / len(data)

    return x, y

In [29]:
# First let's rename the columns to have shorter names
rename_dict = {'time to catastrophe with labeled tubulin (s)' : 'tc_lab',
               'time to catastrophe with unlabeled tubulin (s)' : 'tc_unlab'}

df = df.rename(columns=rename_dict)

# Let's look at it
df.head()

Unnamed: 0,tc_lab,tc_unlab
0,470,355.0
1,1415,425.0
2,130,540.0
3,280,265.0
4,550,1815.0


In [38]:
# Select labeled tubulin data
d_lab = df['tc_lab']

# Select unlabeled tubulin data
d_unlab = df['tc_unlab']

# Apply the ECDF functions to both labeled and unlabeled data
xlab, ylab = ecdf_vals(d_lab)
xunlab, yunlab = ecdf_vals(d_unlab)

# Now use Bokeh to plot both ECDFs on a single graph
# Set up the plot
f = bokeh.plotting.figure(plot_height=300,
                          plot_width=500,
                          x_axis_label='time to catastrophe (s)',
                          y_axis_label='ECDF')

# Add a scatter plot
f1 = f.circle(xlab, ylab, color='red')
f2 = f.circle(xunlab, yunlab, color='green')

# Make a legend object
legend = Legend(items=[
            ('labeled tubulin', [f1]),
            ('unlabeled tubulin', [f2])
            ])
# Add legend
f.add_layout(legend)
f.legend.location = 'bottom_right'

bokeh.io.show(f)