# Parsing `clist` files

In [301]:
import numpy as np
import bokeh.io
import bokeh.plotting
import pboc.plotting
import pboc.mcmc
import pandas as pd
import scipy.io
import glob
import theano.tensor as tt
import pymc3 as pm
import tqdm
bokeh.io.output_notebook()

##  The problem

In all of the previous dilution experiments, I have been reading in and parsing the `cell.mat` files produced by SuperSegger. This was necessary when I was taking many images throughout a large number of division events, as I needed to count how many exposures each cell received. Now that I am doing the experiment by taking only one bleaching mCherry image at the end of the movie, I no longer need to load in all of the `cell.mat` files, as the `clist` contains all of the necessary information. In this notebook, I'll write some functions that are necessary for parsing these files and will write corresponding test suites.

##  Loading an example `clist` file.  

I'll use the successful experiment from 20180202 taken on Hermes. 

In [273]:
# Define the data directory and grab the clist files for each position. 
data_dir = 'data/images/20180202_hermes_37C_glucose_O2_dilution/'
clists = glob.glob('{0}/*growth*/xy*/clist.mat'.format(data_dir))

# Load an example clist using scipy. 
ex_mat = scipy.io.loadmat(clists[-1], squeeze_me=True)

The structure of this file is a little weird. It's not really a dictionary or even a struct. It's an odd combination of arrays. For each cell in the position, there is a vector in the `clist`. Each of these vectors holds 99 different quantities. Rather than being able to index them via a string (e.g. `'Mother ID'`), I have to index them in the `data` vector with the same index that `'Mother ID'` has in the description vector (`def`). To avoid any confusion, I'll load this list of quantities as a dictionary that maps each name to its index. 
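As a toy illustration of this layout, the lookup can be sketched as below. The names and numbers are made up and far smaller than a real `clist`; plain lists stand in for the arrays that `scipy.io.loadmat` returns.

```python
# Toy stand-ins for the two arrays in a clist (values are invented).
definitions = ['Cell ID', 'Mother ID', 'Area birth']
data = [
    [1, 0, 210.0],   # one row per cell, one entry per quantity
    [2, 1, 105.5],
    [3, 1, 98.0],
]

# Map each quantity name to its column index in every row of `data`.
index_of = {name: i for i, name in enumerate(definitions)}

# Look up a quantity by name instead of by a bare integer index.
mother_ids = [row[index_of['Mother ID']] for row in data]
print(mother_ids)  # [0, 1, 1]
```

The real file works the same way, only with `mat['def']` in place of `definitions` and `mat['data']` in place of `data`.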

In [274]:
# Define the properties of interest.
desired_props = ['Area birth', 'Area death', 'Cell ID', 'Cell birth time', 'Cell death time',
                'Daughter1 ID', 'Daughter2 ID', 'Fluor1 mean death',  'Fluor2 mean death',
                'Mother ID']

# Get the name of the definitions, excluding the 'Focus' information since I do not quantify those
defs = {j:i for i, j in enumerate(ex_mat['def']) if j in desired_props}

# Iterate through each cell in the clist file and extract these properties. 
df = pd.DataFrame([], columns=desired_props)
for i, c in enumerate(ex_mat['data']):
    
    # Define an empty dictionary for the cell and fill it with the desired properties. 
    cell_dict = {}
    for key, value in defs.items():
        cell_dict[key] = c[value]
    
    df = df.append(cell_dict, ignore_index=True)
    
# Change the names of the columns for easy parsing.
renamed_cols = {n: '_'.join(n.lower().split(' ')) for n in df.keys()}
df.rename(columns=renamed_cols, inplace=True)

That seems to do the trick! As a bonus, it seems to be *much, much* faster than reading in all of those pesky cell files. Below, I define two functions: one that parses a single `clist` file into a DataFrame, and one that iterates over a list of `clist` files and concatenates the results.

In [275]:
def clist_to_df(clist_file, desired_props='default', added_props={}, excluded_props=None):
    """
    Reads in a SuperSegger `clist` file and extracts the desired properties.
     
    Parameters
    ----------
    clist_file : str
        Path to clist file of interest.
    desired_props: list of str
        A list of the desired properties. Default selection is ['Area birth', 'Area death', 
        'Cell ID', 'Cell birth time', 'Cell death time', 'Daughter1 ID', 'Daughter2 ID',
        'Fluor1 mean death', 'Fluor2 mean death', 'Mother ID']
    added_props : dict
        A dict of additional props to include in the DataFrame.
    excluded_props : list or tuple of str
        Properties from the default included properties to ignore. These should match 
        the case exactly as defined in the SuperSegger documentation.
    
    Returns
    -------
    df : pandas DataFrame
        A tidy pandas DataFrame with extracted properties for all cells in the clist file.
    """
    # Ensure that the clist file is a string. 
    if not isinstance(clist_file, str):
        raise TypeError('clist_file must be a string')
   
    # Wrap a single excluded property in a list. Calling list() on a string
    # would split it into individual characters.
    if isinstance(excluded_props, str):
        excluded_props = [excluded_props]
        
    # Load the clist file using scipy.
    mat = scipy.io.loadmat(clist_file, squeeze_me=True)
    
    # Assemble a dictionary of the indices and key values. 
    if desired_props == 'default':
        desired_props = ['Area birth', 'Area death', 'Cell ID', 'Cell birth time', 
                         'Cell death time', 'Daughter1 ID', 'Daughter2 ID',
                         'Fluor1 mean death', 'Fluor2 mean death', 'Mother ID']
    defs = {key: value for value, key in enumerate(mat['def']) if key in desired_props}
    
    # Generate an empty DataFrame with the desired columns. Build a new list
    # so that a list passed in by the caller is not modified.
    columns = list(desired_props) + list(added_props.keys())
    df = pd.DataFrame([], columns=columns)
    
    # Iterate through the clist and extract the properties.
    for cell in mat['data']:  
        # Extract the properties and add to DataFrame
        cell_dict = {key: cell[value] for key, value in defs.items()}
        
        # Add any additional properties. 
        cell_dict.update(added_props)
        df = df.append(cell_dict, ignore_index=True)
         
    # Drop any excluded properties.
    if excluded_props is not None:
        for x in excluded_props:
            df.drop(x, axis=1, inplace=True)

    # Rename the columns to accommodate PEP 8 style.
    new_cols = {nom: '_'.join(nom.split(' ')).lower() for nom in df.keys()}
    df.rename(columns=new_cols, inplace=True)
    return df
        

def parse_clists(clists, parse_position=True, added_props={}, 
                 verbose=False, **kwargs):
    """
    A helper function to iterate over a list of clist files and concatenate the results. 
    See `clist_to_df` for documentation of the per-file parsing.
    
    Parameters
    ----------
    clists: list of str
        A list of pathnames for clist files.
    parse_position: bool
        If True, the position of the item will be parsed from the file name by splitting at 
        'xy'.
    added_props: dict
        A dictionary of additional props to add to the DataFrame. Default is an empty dict.
        If `parse_position` is True, the position will be passed as an added property.
    verbose: bool
        If True, a progress bar will be displayed for the clist iteration.
    **kwargs
        Additional keyword arguments passed on to `clist_to_df`.
    
    Returns
    -------
    df : pandas DataFrame
        A pandas DataFrame containing all cell properties for each item in the provided 
        clist file.
    
    """
 
    # Iterate through each item in the clists.
    dfs = []
    if verbose:
        iterator = tqdm.tqdm(clists)
    else:
        iterator = clists
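    for c in iterator:
        # Copy the added properties so that neither the caller's dict nor the
        # shared default dict is modified between files or calls.
        props = dict(added_props)

        # Parse the position from the 'xy' directory name only when requested.
        if parse_position:
            props['position'] = int(c.split('xy')[-1].split('/')[0])
        
        # Pass the file to the parser.
        df = clist_to_df(c, added_props=props, **kwargs)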
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True) 

Now that we have a function in place, we can test it out on our list of clist files. 

In [276]:
# Generate a dataframe from all experimental positions
dilution_df = parse_clists(clists, excluded_props=['Fluor2 mean death'])

For a sanity check, we should look at the DataFrame to make sure things pass the smell test.  

In [277]:
dilution_df.cell_birth_time.unique()


array([  1.,   3.,   4.,   8.,   9.,  10.,  11.,  12.,  14.,  15.,  17.,
        18.,  19.,   2.,   5.,   6.,   7.,  13.,  16.])

Everything looks fine: the birth times are all small positive frame numbers, as expected. I'll develop test functions for this later.  
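As a rough idea of what such tests might look like, here is a minimal sketch covering the two pure-Python steps of the parser: the column renaming and the position parsing. The helper names `tidy_column_name` and `position_from_path` are hypothetical stand-ins written for this sketch; they mirror the logic inside `clist_to_df` and `parse_clists` but are not part of the functions above, and the paths are made up.

```python
import re


def tidy_column_name(name):
    """Hypothetical helper: 'Cell birth time' -> 'cell_birth_time'."""
    return '_'.join(name.split(' ')).lower()


def position_from_path(path):
    """Hypothetical helper: return the integer after the last 'xy' in a path."""
    matches = re.findall(r'xy(\d+)', path)
    if not matches:
        raise ValueError('no xy position found in {0}'.format(path))
    return int(matches[-1])


def test_tidy_column_name():
    assert tidy_column_name('Fluor1 mean death') == 'fluor1_mean_death'
    assert tidy_column_name('Cell ID') == 'cell_id'


def test_position_from_path():
    # Made-up paths that follow the xy<number>/clist.mat layout.
    assert position_from_path('data/growth/xy03/clist.mat') == 3
    assert position_from_path('data/xy_growth/xy12/clist.mat') == 12


test_tidy_column_name()
test_position_from_path()
```

Testing the full parser would additionally need a small synthetic `clist.mat` file, which is left out of this sketch.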

## Computing the calibration factor.

One thing that is slightly different about parsing the `clist` file rather than the individual `cell` files is that it isn't clear whether the background fluorescence has been subtracted from each cell intensity. To check, I will calculate the calibration factor and make sure I get the same results. First, let's parse the clist files from all of the snapshots.

In [354]:
# Grab all of the snapshot files. 
snaps = glob.glob('{0}/*snaps*'.format(data_dir))
snap_dfs = []
for i, s in enumerate(snaps):
    # Get the strain information.
    strain = s.split('_')[-3]
    atc_conc = float(s.split('_')[-2].split('ngmL')[0])
    
    # Grab all of the clist files.
    clists = glob.glob('{0}/xy*/clist.mat'.format(s))
    
    # Pass them into the function.
    df = parse_clists(clists, added_props={'strain': strain, 'atc_conc': atc_conc})
    snap_dfs.append(df)
snap_df = pd.concat(snap_dfs)

We'll first compute the mean pixel value of the autofluorescence in both channels.

In [279]:
auto_strain = snap_df[snap_df['strain']=='autofluorescence']
mean_auto_yfp = auto_strain['fluor2_mean_death'].mean()
mean_auto_cherry = auto_strain['fluor1_mean_death'].mean()

To keep only the cells whose mCherry intensity was measured at the end of their growth, we can select the cells with a nonzero `fluor1_mean_death`.

In [284]:
# Get only the measured cells.
measured_cells = dilution_df[dilution_df['fluor1_mean_death'] > 0]

# Make a new data frame for the intensity pairings.
cal_df = pd.DataFrame([], columns=['I_1', 'I_2', 'summed', 'fluctuation'])

# Group the measured cells by position and mother ID.
grouped = measured_cells.groupby(['position', 'mother_id'])
for g, d in grouped:
    # Ensure we are looking at a daughter pair. 
    if len(d) == 2:
        # Subtract the autofluorescence.
        sub_int = d['area_death'].values * (d['fluor1_mean_death'].values - mean_auto_cherry)
        
        # Ensure we are only dealing with positive values. 
        if (sub_int >= 0).all(): 
            I_1, I_2 = sub_int
            summed = I_1 + I_2
            fluctuation = (I_1 - I_2)**2
            div_dict = dict(I_1=I_1, I_2=I_2, summed=summed, fluctuation=fluctuation)
            cal_df = cal_df.append(div_dict, ignore_index=True)

Now we can generate the scatter plot to make sure things qualitatively look okay.  

In [288]:
p = pboc.plotting.boilerplate(x_axis_label='summed fluorescence', y_axis_label='squared fluctuation',
                             x_axis_type='log', y_axis_type='log', plot_width=800)
p.circle('summed', 'fluctuation', source=cal_df, color='slategray')
bokeh.io.show(p)

Now we can estimate the calibration factor.  

In [302]:
class DeterministicLogPosterior(pm.Continuous):
    def __init__(self, *args, **kwargs):
        super(DeterministicLogPosterior, self).__init__(*args, **kwargs)
    def logp(self, value, *args):
        n1 = cal_df['I_1'].values / value
        n2 = cal_df['I_2'].values / value
        ntot = n1 + n2
        k = len(cal_df['I_1'])
        binom = tt.sum(tt.gammaln(ntot+1)) - tt.sum(tt.gammaln(n1+1)) -tt.sum(tt.gammaln(n2+1))
        return -k * tt.log(value) + binom - tt.sum(ntot) * tt.log(2)
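
For reference, the `logp` method above appears to correspond to the following log posterior for the calibration factor $\alpha$. This is written term by term from the code, assuming $n_{1,i} = I_{1,i}/\alpha$, $n_{2,i} = I_{2,i}/\alpha$, $n_{\mathrm{tot},i} = n_{1,i} + n_{2,i}$, and $k$ division events; the symbols are notation introduced here, not taken from the notebook.

```latex
\ln P(\alpha \mid \{I_1, I_2\}) = -k \ln \alpha
  + \sum_{i=1}^{k} \left[ \ln\Gamma(n_{\mathrm{tot},i} + 1)
  - \ln\Gamma(n_{1,i} + 1) - \ln\Gamma(n_{2,i} + 1) \right]
  - \ln 2 \sum_{i=1}^{k} n_{\mathrm{tot},i} + \text{const.}
```

The bracketed sum is the log of the binomial coefficient with the factorials replaced by gamma functions, which lets the protein counts take non-integer values.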

In [303]:
with pm.Model() as model:
    alpha = DeterministicLogPosterior('alpha', testval=100)
    trace = pm.sample(draws=10000, tune=10000)
    trace_df = pboc.mcmc.trace_to_dataframe(trace, model)
    stats = pboc.mcmc.compute_statistics(trace_df) 

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]





Now we can visualize the samples. 

In [307]:
p = pboc.plotting.traceplot(trace, dist_type='kde')

That certainly looks like a well-behaved unimodal posterior! Using the mode and HPD of the posterior, we can plot our fit over the data. We will also bin the data in groups of 65 events (an arbitrary choice) as a sanity check.  

In [351]:
# Bin the data by number of events.
bin_size = 65 
bins = np.arange(0, len(cal_df) + bin_size, bin_size)
sorted_vals = cal_df.sort_values('summed')
binned_summed = [] 
binned_fluct = [] 
for i in range(1, len(bins)):
    slc = sorted_vals.iloc[bins[i-1]:bins[i]].mean()
    binned_summed.append(slc['summed'])
    binned_fluct.append(slc['fluctuation'])

In [352]:
# Generate the fit line.
summed_range = np.logspace(2, 6, 500)
fit = stats['mode'].values * summed_range
upper = stats['hpd_max'].values * summed_range
lower = stats['hpd_min'].values * summed_range

# Generate the plot
p = pboc.plotting.boilerplate(x_axis_label='summed intensity [a.u.]', y_axis_label='squared fluctuation [a.u.]',
                             x_axis_type='log', y_axis_type='log', plot_width=800)

# Plot the data. 
p.circle('summed', 'fluctuation', source=cal_df, color='slategray', alpha=0.5)
p.circle(binned_summed, binned_fluct, color='dodgerblue', size=8, legend='binned data')

# Plot the fit and the HPD credible region.
p.line(summed_range, fit, color='tomato', legend='best fit, α = {0:0.0f} a.u. / molecule'.format(stats['mode'].values[0]))
pboc.plotting.fill_between(p, summed_range, lower, upper, color='tomato', alpha=0.3)

# Position the legend and show the plot.
p.legend.location = 'bottom_right'
bokeh.io.show(p)

That looks like a good fit and has an understandable error. 
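
As a rough check on the logic behind this calibration, here is a minimal simulation sketch. It assumes proteins partition binomially with probability 1/2 at each division, and it uses a simple moment estimate (the sum of squared fluctuations divided by the sum of summed intensities) rather than the Bayesian inference used above. The value of `alpha_true` and the range of protein counts are arbitrary choices for illustration.

```python
import random

random.seed(42)
alpha_true = 150.0    # assumed calibration factor (a.u. per molecule)
n_divisions = 5000    # number of simulated division events

sum_sq_fluct = 0.0
sum_summed = 0.0
for _ in range(n_divisions):
    # Draw a total protein count and split it binomially with p = 1/2.
    n_tot = random.randint(10, 500)
    n_1 = sum(random.random() < 0.5 for _ in range(n_tot))
    n_2 = n_tot - n_1

    # Convert protein counts to intensities.
    I_1, I_2 = alpha_true * n_1, alpha_true * n_2
    sum_sq_fluct += (I_1 - I_2) ** 2
    sum_summed += I_1 + I_2

# On average (I_1 - I_2)^2 = alpha * (I_1 + I_2), so the ratio estimates alpha.
alpha_est = sum_sq_fluct / sum_summed
print('true alpha: {0:.1f}, estimated alpha: {1:.1f}'.format(alpha_true, alpha_est))
```

The estimate should land within a few percent of `alpha_true`, which is the linear relationship the fit line in the plot above relies on.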

##  Computing the fold-change

As a final sanity check, we can compute the fold-change in gene expression from the clist extracted data. 

In [373]:
# Subtract the autofluorescence.
snap_df['fluor1_sub'] = snap_df['area_death'] * (snap_df['fluor1_mean_death'] - mean_auto_cherry)
snap_df['fluor2_sub'] = snap_df['area_death'] * (snap_df['fluor2_mean_death'] - mean_auto_yfp)

# Compute the mean constitutive expression.
mean_delta = snap_df[snap_df['strain'] == 'deltaLacI']['fluor2_sub'].mean()

# Group the dilution data by the aTc concentration.
grouped = snap_df[snap_df['strain']=='dilution'].groupby('atc_conc')

# Set up lists to store the repressor copy number and the fold change. 
mean_repressors = []
repressor_extrema = []
fold_change = []

for g, d in grouped:
    # Compute the fold-change.
    fold_change.append(d['fluor2_sub'].values.mean() / mean_delta)
    
    # Compute the repressor copy number. For a positive mean intensity, a larger
    # calibration factor gives a smaller copy number, so hpd_max sets the lower
    # bound and hpd_min sets the upper bound.
    mean_int = d['fluor1_sub'].values.mean()
    mean_repressors.append((mean_int / stats['mode'].values)[0])
    min_rep = mean_int / stats['hpd_max'].values
    max_rep = mean_int / stats['hpd_min'].values
    repressor_extrema.append((min_rep[0], max_rep[0])) 

In [381]:
# Compute the theoretical fold-change.
ep_R = -13.9
N_ns = 4.6E6
R_range = np.logspace(0, 3, 500)
theo = (1 + (R_range / N_ns) * np.exp(-ep_R))**-1

# Set up the plotting canvas.
p = pboc.plotting.boilerplate(x_axis_label='repressors per cell', y_axis_label='fold-change',
                             x_axis_type='log', y_axis_type='log', plot_width=800)

# Plot the theory.
p.line(R_range, theo, color='slategray', legend='theory')

# Plot the datapoints.
p.circle(mean_repressors, fold_change, color='dodgerblue', size=5, legend='experiment')

# Plot the x error as a horizontal segment at each fold-change value.
for i, ex in enumerate(repressor_extrema):
    p.line(list(ex), [fold_change[i]] * 2, color='dodgerblue')
bokeh.io.show(p)

Looks good!
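
For reference, the theory curve plotted above can be wrapped in a small function. This is a minimal sketch using only the standard library; the function name `fold_change_theory` is introduced here for illustration, and the default values are the `ep_R` and `N_ns` used in the cell above, with `ep_R` taken to be in units of $k_BT$.

```python
import math


def fold_change_theory(R, ep_R=-13.9, N_ns=4.6e6):
    """Fold-change for R repressors: (1 + (R / N_ns) * exp(-ep_R))**-1."""
    return 1.0 / (1.0 + (R / N_ns) * math.exp(-ep_R))


# With no repressors there is no repression.
print(fold_change_theory(0))  # 1.0

# The fold-change falls to one half when R = N_ns * exp(ep_R).
R_half = 4.6e6 * math.exp(-13.9)
print(R_half, fold_change_theory(R_half))
```

For these parameters `R_half` is only a few repressors per cell, so most of the `R_range` plotted above sits in the regime where the fold-change falls off roughly as 1/R.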

## The End

It looks like I can do the experiment by parsing all of the information directly from the clists, meaning I don't have to waste space storing all of the images and `cell.mat` files. 