# Tracking Lineages with SuperSegger 

In [12]:
import numpy as np
import pandas as pd
import mscl_utils as mscl
import bokeh.io
import glob
import bokeh.plotting
import matlab.engine as matlab
import scipy.io
bokeh.io.output_notebook()

# Start a MATLAB session; `eng` is used below to load `.mat` files.
eng = matlab.start_matlab()

This notebook demonstrates how the output from a SuperSegger session is parsed and read into Python. Note that it requires Python 3.6+ and MATLAB R2017b or later, along with the MATLAB Engine API for Python.

## The `Cell` File 

While SuperSegger outputs several useful things, the files we are most interested in are the `c/Cell.mat` files in each position. The structure of these outputs is described in great detail [on the SuperSegger Wiki page](). Each individual cell has a corresponding `.mat` file which contains information about its time of birth, time of death, fluorescence, etc. Most importantly, it contains the cell's assigned ID as well as the IDs of its mother and sister. While these files exist as MATLAB `.mat` files, we can easily load them into Python using the `scipy.io.loadmat` function.
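As a minimal sketch of what `scipy.io.loadmat` returns, the snippet below writes a toy `.mat` file and reads it back. The file name `toy_cell.mat` and the values are hypothetical stand-ins for a real SuperSegger cell file; only the field names are taken from the cell files described here.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Hypothetical stand-in for a cell file; the values are for illustration only.
toy_cell = {'ID': 117, 'motherID': 66, 'sisterID': 116,
            'birth': 25, 'death': 26, 'daughterID': np.array([])}

# Write the toy file to a temporary directory.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_cell.mat')
scipy.io.savemat(toy_path, toy_cell)

# `squeeze_me=True` collapses MATLAB's 1x1 matrices to scalars.
loaded = scipy.io.loadmat(toy_path, squeeze_me=True)

# loadmat adds metadata keys such as `__header__`; drop them.
toy_fields = {k: v for k, v in loaded.items() if not k.startswith('__')}
print(sorted(toy_fields.keys()))
print(int(toy_fields['motherID']), len(toy_fields['daughterID']))
```

The returned object is a plain dictionary keyed by variable name, which is why the cell properties can later be looked up by name.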

We can load an example `cell.mat` file to get the key information. 

In [93]:
# Define the data directory for a single image position. 
data_dir = '../data/images/20171017_sfGFP_10ngmL_dilution/growth/xy00/cell/'

# Grab all of the mat files. Note that these exist as cell and Cell. 
mat_files = glob.glob('{0}*.mat'.format(data_dir))
example_mat = mat_files[0]

# Define `example_mat` as a variable in Matlab workspace. 
eng.workspace['f'] = example_mat

# Load the matfile in Matlab and assign it to a Python variable. 
mat_file = eng.eval('load(f)')

# Print out the keys of the mat_file.
mat_file.keys()

dict_keys(['CellA', 'death', 'birth', 'divide', 'sisterID', 'motherID', 'daughterID', 'ID', 'neighbors', 'stat0', 'ehist', 'contactHist'])

While this looks like a very short list, the `CellA` value is also a dictionary containing information about the segmentation mask, fluorescence, area, etc. for every frame in the time-lapse. We would like to have all cell information for a single position in a tidy pandas DataFrame. Below are two functions which do exactly that. 
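To make the per-frame layout of `CellA` concrete, here is a small sketch using a made-up two-frame file. It assumes SciPy 1.5 or later, where `scipy.io.loadmat` accepts `simplify_cells=True`; the field names `fl1`, `sum`, and `bg` follow the cell file, while the file name and values are hypothetical.

```python
import os
import tempfile

import scipy.io

# Hypothetical two-frame stand-in for `CellA`; values are made up.
toy_frames = [{'fl1': {'sum': 10.0, 'bg': 1.0}},
              {'fl1': {'sum': 25.0, 'bg': 1.5}}]

toy_path = os.path.join(tempfile.mkdtemp(), 'toy_cellA.mat')
scipy.io.savemat(toy_path, {'CellA': toy_frames})

# `simplify_cells=True` converts structs to dictionaries and cell arrays
# to lists, so each frame can be indexed by field name.
toy_mat = scipy.io.loadmat(toy_path, simplify_cells=True)
frames = toy_mat['CellA']

# The first frame corresponds to birth and the last frame to death.
birth_sum = frames[0]['fl1']['sum']
death_sum = frames[-1]['fl1']['sum']
print(birth_sum, death_sum)
```

With real cell files the nesting may differ from this toy layout, so it is worth inspecting one file by hand before relying on any particular indexing.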

In [137]:
def cell_to_dict(file, add_props=None, excluded_props=None):
    """
    Reads a single cell file and produces a dictionary containing
    the properties of interest. 
    
    The returned properties are 
    * birth - frame number at which the cell was born.
    * death - frame number at which the cell died.
    * divide - bool for an observed cell division.
    * ID - integer ID number of the cell.
    * motherID - integer ID number of the mother cell.
    * sisterID - integer ID number of the sister cell.
    * birth_fluo - fluorescence value at the cell's birth.
    * death_fluo - fluorescence value at the cell's death.
    * daughter_1_ID - integer ID number of the first daughter.
    * daughter_2_ID - integer ID number of the second daughter.
    
    
    Parameters
    ----------
    file: str
        Path of the cell file. This must be in a `.mat` format.
    add_props : dict, default None
        Dictionary of additional properties (not found in the mat file)
        to be included in the returned dictionary.
    excluded_props : list of str, default None
        Properties of the cell.mat file to be ignored. These must be 
        spelled exactly as they are defined in the cell file.
    
    Returns
    -------
    cell_dict : dictionary
        Dictionary of all extracted properties from the cell files. 
    """
    
    # Ensure the supplied file is actually a .mat and other types are correct. 
    if file.split('.')[-1] != 'mat':
        raise TypeError("supplied file {0} is not a `.mat` file.".format(file))
    if add_props is not None and type(add_props) is not dict:
        raise TypeError("add_props is {0} and not dict.".format(type(add_props)))                  
    if excluded_props is not None and type(excluded_props) is not list:
        raise TypeError("excluded_props must be list. Type is currently {0}.".format(type(excluded_props)))
                        
    # Define the values of interest.
    vals = ['birth', 'death', 'divide', 'ID', 'motherID', 'sisterID', 
             'daughter_1_ID', 'daughter_2_ID', 'birth_fluo', 'death_fluo'] 
    
    # Load the mat file using SciPy.
    mat = scipy.io.loadmat(file, squeeze_me=True,
                      chars_as_strings=True,
                      struct_as_record=True)
    
    # Assemble the dictionary for constant properties. 
    cell_dict = {v: mat[v] for v in vals[:-4]}
    daughters = mat['daughterID']
    
    # Determine if daughters were produced. If not, set both IDs to None.
    if len(daughters) == 0:
        daughter_1, daughter_2 = None,  None
    else:
        daughter_1, daughter_2 = daughters 
    cell_dict['daughter_1_ID'] = daughter_1
    cell_dict['daughter_2_ID'] = daughter_2
     
    # Extract fluorescence information -- This is a bit gross but checked.
    cell_dict['birth_fluo'] = mat['CellA'][0]['fl1'].flatten()[0].flatten()[0][0]
    cell_dict['death_fluo'] = mat['CellA'][-1]['fl1'].flatten()[0].flatten()[0][0]
    
    # Deal with exclusion and addition of props. 
    if excluded_props is not None:
        # Keep only the properties that were not explicitly excluded.
        cell_dict = {key: value for key, value in cell_dict.items()
                     if key not in excluded_props}
    if add_props is not None:
        for key in add_props.keys(): 
            cell_dict[key] = add_props[key]
                        
    # Return the cell dictionary.
    return cell_dict

def parse_cell_files(files, add_props=None, excluded_props=None):
    """
    Executes cell_to_dict across a list of files and returns a Pandas DataFrame.
    """
    if type(files) is not list:
        raise TypeError("'files' is type {0} not list.".format(type(files)))

    # Parse each file into a dictionary, then build the DataFrame in one step.
    cell_dicts = [cell_to_dict(f, add_props=add_props,
                               excluded_props=excluded_props) for f in files]
    return pd.DataFrame(cell_dicts)

We can test this function out on an example mat file. 



In [138]:
mat_df = parse_cell_files([example_mat])
mat_df.head()

| | birth | death | divide | ID | motherID | sisterID | daughter_1_ID | daughter_2_ID | birth_fluo | death_fluo |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 26.0 | 0.0 | 117.0 | 66.0 | 116.0 | | | 0.0 | 103215.0 |


## Mapping the lineages 

To ensure that the experiment is working as advertised, it's important to show that the total fluorescence is conserved across a lineage. This means that the total fluorescence of a mother cell should be equivalent to the sum of the total fluorescence of all of its daughters. Unfortunately, this also means that we

In [69]:
vals[:-4]

['birth_frame', 'death_frame', 'division', 'ID', 'mother_ID', 'sister_ID']

In [111]:
test_mat = scipy.io.loadmat(example_mat, squeeze_me=True,
                      chars_as_strings=True,
                      struct_as_record=True)
(test_mat['CellA'][0]['fl1'].flatten()[0].flatten()[0][0])


0

In [128]:
# Scalar values need an index supplied as a collection; `index=(0)` is just the integer 0.
pd.DataFrame(dict(a=1, b=2), index=[0])

In [113]:
mat['CellA'][0]['fl1']

{'Ixx': 1.730041132072942,
 'Ixy': -0.11338367054350042,
 'Iyy': 6.108859374474545,
 'bg': 881.8148870548939,
 'r': matlab.double([[78.95350518818468,474.6947286597491]]),
 'sum': 55318.0}

## Computing the calibration factor

As a reminder, we predict that the intensity of a single cell $I$ should be proportional to the number of fluorescent proteins per cell $N$ multiplied by some calibration factor $\alpha$,

$$
I = \alpha N.
$$

To estimate the value of $\alpha$, we can look at how the intensity fluctuates between any two daughter cells, revealing information about the partitioning of proteins during a division event. Once the mathematical dust settles, we find that relationship to be

$$
\langle (I_1 - I_2)^2 \rangle = \alpha I_\text{tot}
$$

where $I_1$ and $I_2$ are the intensities of the two daughter cells and $I_\text{tot}$ is the sum $I_1 + I_2$.  
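The relationship can be checked numerically with a short simulation. This is a sketch under a stated assumption that is not spelled out above: each of the $N$ proteins is partitioned independently into either daughter with probability 1/2, so that $N_1 - N_2$ has variance $N$ and $\langle (I_1 - I_2)^2 \rangle = \alpha^2 N = \alpha I_\text{tot}$. The value of `alpha_true` and the range of protein counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_true = 150.0   # hypothetical calibration factor (a.u. per protein)
n_divisions = 20000  # number of simulated division events

# Number of proteins in each mother cell just before division.
n_tot = rng.integers(50, 500, size=n_divisions)

# Assumption: each protein goes to daughter 1 with probability 1/2.
n_1 = rng.binomial(n_tot, 0.5)
n_2 = n_tot - n_1

# Convert protein counts to intensities.
I_1 = alpha_true * n_1
I_2 = alpha_true * n_2
I_tot = I_1 + I_2
sq_diff = (I_1 - I_2)**2

# Since <(I_1 - I_2)^2> = alpha * I_tot, the ratio of the summed squared
# differences to the summed total intensities estimates alpha.
alpha_est = sq_diff.sum() / I_tot.sum()
print('true alpha: {0:.1f}, estimated alpha: {1:.1f}'.format(alpha_true, alpha_est))
```

The estimate only approaches the true value when many division events are pooled, because the squared difference from any single division is very noisy.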

To extract the data (keeping in mind that we have not subtracted the autofluorescence **TODO**), we can group the tidy DataFrame generated above by the mother ID and compute the total intensity and the squared intensity difference for each pair of daughters.

In [65]:
# Parse every cell file for this position (assumes `df` is the tidy
# DataFrame produced by `parse_cell_files` above).
df = parse_cell_files(mat_files)

# Only look at cells that died on the final frame (frame 26 here).
final_position = df[df['death'] == 26]
grouped = final_position.groupby('motherID')

# Store the summed intensity and squared difference of each daughter pair.
records = []
for g, d in grouped:
    if len(d) == 2:
        daughter_fluo = d['death_fluo'].values
        I_tot = daughter_fluo.sum()
        sq_diff = np.diff(daughter_fluo)[0]**2
        records.append(dict(I_tot=I_tot, sq_diff=sq_diff))
int_df = pd.DataFrame(records, columns=['I_tot', 'sq_diff'])

In [66]:
# Compute the log of the values. 
log_I_tot = np.log10(int_df['I_tot'].values)
log_sq_diff = np.log10(int_df['sq_diff'].values)
# Plot the results. 
p = mscl.bokeh_boiler(x_axis_label='log\u2081\u2080 (I\u2081 + I\u2082)',
                     y_axis_label='log\u2081\u2080 (I\u2081 - I\u2082)\u00B2')

p.circle(x=log_I_tot, y=log_sq_diff, color='slategray', alpha=0.5)
bokeh.io.show(p)