# Unsupervised gating of the data via forward and side scattering

In [1]:
import os
import glob
# Our numerical workhorses
import numpy as np
import pandas as pd
import scipy

# Import matplotlib stuff for plotting
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# Seaborn, useful for graphics
import seaborn as sns

# Import the project utils
import sys
sys.path.insert(0, '../../analysis/')
import mwc_induction_utils as mwc


# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.mpl
import bokeh.plotting

# favorite Seaborn settings for notebooks
rc={'lines.linewidth': 2, 
    'axes.labelsize' : 16, 
    'axes.titlesize' : 18,
    'axes.facecolor' : 'F4F3F6',
    'axes.edgecolor' : '000000',
    'axes.linewidth' : 1.2,
    'xtick.labelsize' : 13,
    'ytick.labelsize' : 13,
    'grid.linestyle' : ':',
    'grid.color' : 'a6a6a6'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)
sns.set_palette("deep", color_codes=True)

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'

# Datashader to plot lots of datapoints
import datashader as ds
from datashader.bokeh_ext import InteractiveImage
from datashader.utils import export_image
from IPython.core.display import HTML, display

# Set up Bokeh for inline viewing
bokeh.io.output_notebook()
bokeh.plotting.output_notebook()

# Testing the automatic gating

While trying to implement the automatic bivariate Gaussian gating I ran into the following warning:
`/Users/razo/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)`
Let's explore a little further. One of the files that triggered the warning was `20160725_wt_O3_auto_0mMIPTG_00.csv`.

In [15]:
# list the directory with the data
date = 20160725
datadir = '../../../data/flow/csv/'
files = np.array(os.listdir(datadir))
csv_bool = np.array([str(date) in f and 'csv' in f for f in files])
files = files[np.array(csv_bool)]

df = pd.read_csv(datadir + files[48])
print('file : ' + files[48])

file : 20160725_wt_O3_auto_0mMIPTG_00.csv


Now let's plot the channels we've been using for the thresholding, i.e. the forward and side scattering areas (`FSC-A` and `SSC-A`).

In [3]:
p, pipeline = mwc.ds_plot(df, 'FSC-A', 'SSC-A', log=True)
InteractiveImage(p, pipeline)

I can't detect any pathological pattern in this distribution by eye. Let's see how it looks on a linear scale.

In [5]:
p, pipeline = mwc.ds_plot(df, 'FSC-A', 'SSC-A', log=False)
InteractiveImage(p, pipeline)

### Non-pathological case
Let's quickly plot a non-pathological case to see if there is any difference.

In [6]:
df_normal = pd.read_csv(datadir + files[0])
print('file : ' + files[0])

file : 20160725_wt_O1_auto_0mMIPTG_00.csv


In [71]:
p, pipeline = mwc.ds_plot(df_normal, 'FSC-A', 'SSC-A', log=True)
InteractiveImage(p, pipeline)

In [7]:
p, pipeline = mwc.ds_plot(df_normal, 'FSC-A', 'SSC-A', log=False)
InteractiveImage(p, pipeline)

From these plots I can't really tell the difference between the pathological and the non-pathological case.

# Apply the gate to the log scattering

Let's try applying the automatic gate to the log of the scattering. This is the step that generated problems on my first attempt.

In [9]:
alpha = 0.5
df_thresh = mwc.auto_gauss_gate(df, alpha, log=True)



This is the warning I keep running into. It seems that when the `astroML` `fit_bivariate_normal` function calls `np.percentile` on this data set, the warning is raised for a reason I don't yet understand.

Let's explore the functions that the `astroML` package uses to perform the fit. All of this code was copied directly from [here](https://github.com/astroML/astroML/blob/5fc967ce1fa009c4a924cc33790743c0f64f6722/astroML/stats/_point_statistics.py) for the sole purpose of debugging this issue.

In [31]:
sigmaG_factor = 0.74130110925280102

def median_sigmaG(a, axis=None, overwrite_input=False, keepdims=False):
    """Compute median and rank-based estimate of the standard deviation
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : int, optional
        Axis along which the means are computed. The default is to compute
        the mean of the flattened array.
    overwrite_input : bool, optional
       If True, then allow use of memory of input array `a` for
       calculations. The input array will be modified by the call to
       median. This will save memory when you do not need to preserve
       the contents of the input array. Treat the input as undefined,
       but it will probably be fully or partially sorted.
       Default is False. Note that, if `overwrite_input` is True and the
       input is not already an array, an error will be raised.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the original `arr`.
    Returns
    -------
    median : ndarray, see dtype parameter above
        array containing the median values
    sigmaG : ndarray, see dtype parameter above.
        array containing the robust estimator of the standard deviation
    See Also
    --------
    mean_sigma : non-robust version of this calculation
    sigmaG : robust rank-based estimate of standard deviation
    Notes
    -----
    This routine uses a single call to ``np.percentile`` to find the
    quartiles along the given axis, and uses these to compute the
    median and sigmaG:
    median = q50
    sigmaG = (q75 - q25) * 0.7413
    where 0.7413 ~ 1 / (2 sqrt(2) erf^-1(0.5))
    """
    q25, median, q75 = np.percentile(a, [25, 50, 75],
                                     axis=axis,
                                     overwrite_input=overwrite_input)
    sigmaG = sigmaG_factor * (q75 - q25)

    if keepdims:
        if axis is None:
            newshape = a.ndim * (1,)
        else:
            newshape = np.asarray(a.shape)
            newshape[axis] = 1

        median = median.reshape(newshape)
        sigmaG = sigmaG.reshape(newshape)

    return median, sigmaG

In [28]:
log_df = np.log10(df)
log_df['SSC-A'].notnull().sum()

99999
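
The count above only says how many `SSC-A` values are non-null after the log transform, not which rows are affected. The sketch below shows one way to locate such rows. It uses a small synthetic DataFrame (the name `toy` and its values are made up, not taken from the flow data) and assumes, as a guess, that any missing values come from non-positive raw readings, for which `np.log10` returns `NaN` (negative input) or `-inf` (zero input).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flow-cytometry data (values are made up).
toy = pd.DataFrame({'FSC-A': [8500.0, 9100.0, 7900.0, 8800.0],
                    'SSC-A': [1200.0, -35.0, 1500.0, 0.0]})

# log10 of a negative number is NaN and log10 of zero is -inf;
# silence the NumPy warnings raised by the transform.
with np.errstate(invalid='ignore', divide='ignore'):
    toy_log = np.log10(toy)

# Flag the rows that are not finite in at least one channel.
bad_rows = ~np.isfinite(toy_log).all(axis=1)

# Show the raw readings behind the flagged rows.
print(toy[bad_rows])
```

Note that `notnull()` counts `-inf` as a valid entry, so a check based on `np.isfinite` catches zeros as well as negative readings.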

In [32]:
print(median_sigmaG(log_df['FSC-A']))
print(median_sigmaG(log_df['SSC-A']))

(3.9307831816990983, 0.13063327045391412)
(nan, nan)




Aha! I found the problem. One of the log-transformed `SSC-A` values is `NaN`, and `median_sigmaG` calls `np.percentile` rather than `np.nanpercentile`, so the `NaN` propagates into both the median and `sigmaG`.
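
To illustrate the difference, here is a minimal sketch of a NaN-aware variant of the estimator. The function name `nan_median_sigmaG` is hypothetical and is not part of `astroML`; it simply swaps `np.percentile` for `np.nanpercentile` and drops the `overwrite_input` and `keepdims` options for brevity. The input array is synthetic.

```python
import numpy as np

# Same constant as sigmaG_factor in the astroML code.
SIGMAG_FACTOR = 0.74130110925280102

def nan_median_sigmaG(a, axis=None):
    """Median and rank-based sigmaG, ignoring NaN entries."""
    q25, median, q75 = np.nanpercentile(a, [25, 50, 75], axis=axis)
    return median, SIGMAG_FACTOR * (q75 - q25)

# Synthetic data with a single NaN, mimicking the problematic channel.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# np.percentile propagates the NaN into the result ...
print(np.percentile(x, 50))

# ... while the NaN-aware variant computes the quartiles from the finite values.
print(nan_median_sigmaG(x))
```

This only patches the copied helper. Since the gating itself goes through `astroML`, dropping the offending rows from the DataFrame before fitting may be the simpler fix.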