# Distributions and Curves

Distributions and curves are some of the most common statistical topics that you will use in electrophysiology analysis. Distributions and curves help us model data, and modeling data helps us describe the data we have and simplify complex data. While this may sound complex or intimidating, it is not. A simple linear regression is a model. Distributions of our data are models. Curve fitting is modeling. We have used these models in both the current clamp and mini analysis chapters, but we will go more in-depth here to describe some of the underlying properties. Modeling can be divided into "what", "how" and "why" models (see https://compneuro.neuromatch.io/tutorials/intro.html for more). We will primarily focus on "what" models since these are the first step to using the other types of models. I rarely see "what" models discussed in patch clamp electrophysiology analysis; however, I believe they are needed to analyze our data well.

## Distributions
Distributions of data are one of the first things you should look at, especially if you have large sample sizes. Think mini amplitude, IEI, and rise rate. Most of the statistical analyses we do make assumptions about the distribution of our data, yet papers rarely show the distributions of their data. Most distributions other than the gaussian have an equivalent of the mean, but its value can be quite different from the arithmetic mean. When you take the mean of a non-gaussian distribution you do not get the true central tendency of your data, and you may be measuring the effect of outliers rather than a true shift in the distribution of the data. In the case of something like a beta distribution the mean may not accurately describe the data.
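
To make the point about means and outliers concrete, here is a minimal sketch using simulated data. The inter-event intervals are hypothetical (drawn from an exponential distribution with an arbitrary scale of 100 ms), and the 25 outliers at 5000 ms are made up for illustration; none of these numbers come from real recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inter-event intervals (ms), simulated from a skewed distribution.
iei = rng.exponential(scale=100.0, size=5000)

mean_iei = iei.mean()
median_iei = np.median(iei)

# Add a handful of made-up extreme values to mimic outliers.
iei_outliers = np.concatenate([iei, np.full(25, 5000.0)])
mean_shift = iei_outliers.mean() - mean_iei
median_shift = np.median(iei_outliers) - median_iei

print(f"mean = {mean_iei:.1f} ms, median = {median_iei:.1f} ms")
print(f"shift from outliers: mean {mean_shift:.1f} ms, median {median_shift:.1f} ms")
```

For skewed data like this the mean sits well above the median, and a few extreme values move the mean much more than the median.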

Distributions can show counts or probabilities. When you make a histogram or kernel density estimate (KDE) you are creating a distribution of your data (technically a non-parametric distribution). Most often in statistical textbooks you will see something called probability density functions (PDFs) for continuous variables or probability mass functions (PMFs) for discrete variables. These are parametric distributions because they have parameters that describe the distribution. The important thing about these functions is that they integrate to 1 (i.e., the area under the curve is 1) when you take the distribution over its whole range, from the lower limit to the upper limit. For a PMF the values sum to 1 instead. When you put in a single number with some parameters you get a number out called a likelihood. For a PDF this number is a density rather than a probability, so it can be larger than 1.
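
The properties above can be checked numerically. This is a small sketch using `scipy.stats`; the choice of a standard normal for the PDF and a Poisson with rate 3 for the PMF is arbitrary and only for illustration.

```python
import numpy as np
from scipy import integrate, stats

# Continuous variable: the area under a PDF over its whole range is 1.
area, _ = integrate.quad(stats.norm.pdf, -np.inf, np.inf)

# Discrete variable: the PMF values sum to 1
# (the Poisson tail beyond k = 99 is negligible for a rate of 3).
k = np.arange(0, 100)
total = stats.poisson.pmf(k, mu=3).sum()

# Putting in a single value plus parameters returns a likelihood (a density).
likelihood = stats.norm.pdf(1.0, loc=0.0, scale=1.0)

# A density is not a probability: with a small standard deviation it exceeds 1.
narrow_peak = stats.norm.pdf(0.0, loc=0.0, scale=0.1)

print(area, total, likelihood, narrow_peak)
```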

The last thing to note is that finding the distribution that fits your data describes your data but does not tell you how or why it was generated that way. These how and why questions are not something we will get at here.

You will also often see cumulative distribution functions (CDFs). The CDF is just the integral of the PDF, and the PDF is the derivative of the CDF. Below we will plot the PDF and CDF for each of the distributions.
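
The relationship between the PDF and the CDF can be sketched numerically. The example below uses a standard normal on an arbitrary grid from -6 to 6; the cumulative sum times the grid spacing is only a simple Riemann-sum approximation of the integral, not an exact one.

```python
import numpy as np
from scipy import stats

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

pdf = stats.norm.pdf(x)
cdf_exact = stats.norm.cdf(x)

# Integrating the PDF (running sum times the spacing) approximates the CDF.
cdf_numeric = np.cumsum(pdf) * dx

# Differentiating the CDF recovers the PDF.
pdf_from_cdf = np.gradient(cdf_exact, dx)

max_cdf_err = np.max(np.abs(cdf_numeric - cdf_exact))
max_pdf_err = np.max(np.abs(pdf_from_cdf - pdf))
print(max_cdf_err, max_pdf_err)
```

Both errors shrink as the grid gets finer, which is why the spacing of `x` matters when a CDF is built from a cumulative sum.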

If you want to learn a little more about PDFs and PMFs I suggest watching Very Normal on YouTube.

First we are going to look over some of the data we tend to collect in electrophysiology experiments and see what the distribution of data looks like. Then we are going to go over some common distributions that you will see in electrophysiology, along with their properties, and see which distributions look most like our data.

In [None]:
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import Checkbox, ColumnDataSource, CustomJS, Select, Slider, Spinner
from bokeh.plotting import figure
from scipy import stats

output_notebook()

### Gaussian distribution
In terms of data that we collect in electrophysiology the gaussian distribution is actually not that common, but most of the time we assume our data follows a gaussian distribution. The gaussian distribution has two parameters, the mean and the standard deviation. The mean is what is called a location parameter and shifts the distribution around. The standard deviation describes the spread of the data symmetrically around the mean. Technically the gaussian distribution is unbounded. This means that you can get any value from $-\infty$ to $+\infty$. However, due to limits that we have on computers we generally don't show all the values up to $\pm\infty$, but only up to a couple of standard deviations past the mean on each side. The equation for the gaussian PDF is: $$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Below you can see how changing the mean and standard deviation changes the magenta distribution relative to the grey reference distribution. We plot both the PDF and the CDF (the integral of the PDF). There are a couple of things to note. The area under the curve of the PDF will always equal 1. Increasing the standard deviation lowers the peak of the PDF, so values near the mean become less likely, but it increases the range of likely values you can get.
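
As a quick check of the gaussian PDF and of the effect of the standard deviation, here is a minimal sketch. `gaussian_pdf` is a helper name made up for this example, and the parameter values are arbitrary; the comparison against `scipy.stats.norm.pdf` is only there to confirm the hand-written formula.

```python
import numpy as np
from scipy import stats


def gaussian_pdf(x, mu, sigma):
    """Gaussian PDF written out by hand (note the negative exponent)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)


x = np.linspace(-5, 5, 201)
matches_scipy = np.allclose(
    gaussian_pdf(x, 0.5, 2.0), stats.norm.pdf(x, loc=0.5, scale=2.0)
)

# Doubling the standard deviation halves the height of the peak.
peak_narrow = gaussian_pdf(0.0, 0.0, 1.0)
peak_wide = gaussian_pdf(0.0, 0.0, 2.0)

print(matches_scipy, peak_narrow, peak_wide)
```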

In [None]:
mu = 0
std = 1
x = np.linspace(mu - std * 4, mu + 4 * std, num=400)
y = np.exp(-((x - mu) ** 2) / (2 * std**2)) / np.sqrt(2 * np.pi * std**2)
source = ColumnDataSource(
    {
        "x": x,
        "y": y,
        # Approximate the CDF as a running sum of the PDF times the x spacing
        "yc": np.cumsum(y) * (x[1] - x[0]),
    }
)
pdf = figure(height=250, width=350, title="PDF")
pline = pdf.line("x", "y", source=source, color="magenta")
pline1 = pdf.line(x, y, color="grey")
cdf = figure(height=250, width=350, title="CDF")
cline = cdf.line("x", "yc", source=source, color="magenta")
cline1 = cdf.line(source.data["x"], source.data["yc"], color="grey")

mu = Slider(start=-10, end=10, value=0, step=0.5, title="Mu (mean)")
std = Slider(start=0.1, end=10, value=1, step=0.5, title="Sigma (std)")
callback = CustomJS(
    args=dict(
        source=source,
        mu=mu,
        std=std,
    ),
    code="""
    const arr = [];
    const start = mu.value-std.value*4
    const end = mu.value+std.value*4
    const step = (end - start) / (400 - 1);

    for (let i = 0; i < 400; i++) {
        arr.push(start + step * i);
    }
    const temp_y = arr.map(x => {
        const coefficient = 1 / Math.sqrt(2 * Math.PI * Math.pow(std.value, 2));
        const exponent = -Math.pow((x - mu.value), 2) / (2 * Math.pow(std.value, 2));
        return coefficient * Math.exp(exponent);
    })
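    // Approximate the CDF as a running sum of the PDF times the x spacing
    const cumsum = [temp_y[0] * step];
    for (let i = 1; i < 400; i++) {
        cumsum.push(cumsum[i-1] + temp_y[i] * step);
    }
    source.data.y = temp_y;
    source.data.x = arr;
    source.data.yc = cumsum;
    source.change.emit();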
""",
)

mu.js_on_change("value", callback)
std.js_on_change("value", callback)
show(column(row(mu, std), row(pdf, cdf)))

First we will start by looking at some of the data we previously collected.

### Lognormal distribution
The lognormal distribution is a distribution where, if you log transform the data, you get a normal (gaussian) distribution. You may ask, why not just log transform the data? Log transforming means your data will no longer be on the same scale, which makes downstream interpretations more complicated. The lognormal distribution is very common in the biological sciences. Things like rates, lengths, concentrations and energies often follow a lognormal distribution. The lognormal distribution bounds are (0,+$\infty$). The () brackets are exclusive, which means that you can never have a 0 in the distribution since the log of 0 is undefined. Note that $\mu$ and $\sigma$ are the mean and standard deviation of the log-transformed data, not of the data itself. The PDF of the lognormal distribution is: $$\frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)$$
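
The link between the lognormal and the gaussian can be sketched with simulated data. `lognormal_pdf` is a helper name made up for this example, written with the normalizing constant $1/(x\sigma\sqrt{2\pi})$, and the parameter values are arbitrary. Note that `scipy.stats.lognorm` uses `s` for $\sigma$ and `scale` for $e^{\mu}$.

```python
import numpy as np
from scipy import stats


def lognormal_pdf(x, mu, sigma):
    """Lognormal PDF written out by hand."""
    return np.exp(-((np.log(x) - mu) ** 2) / (2 * sigma**2)) / (
        x * sigma * np.sqrt(2 * np.pi)
    )


mu, sigma = 1.0, 0.5
x = np.linspace(0.01, 20, 500)
matches_scipy = np.allclose(
    lognormal_pdf(x, mu, sigma), stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
)

# Simulated lognormal samples are right skewed; their logs are roughly gaussian.
rng = np.random.default_rng(1)
samples = rng.lognormal(mean=mu, sigma=sigma, size=20000)
log_samples = np.log(samples)

skew_raw = stats.skew(samples)
skew_log = stats.skew(log_samples)
print(matches_scipy, log_samples.mean(), log_samples.std(), skew_raw, skew_log)
```

The mean and standard deviation of the log-transformed samples recover `mu` and `sigma`, which is why those parameters describe the log of the data rather than the data itself.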

In [None]:
mu = 0
std = 1
x = np.linspace(0.000001, np.exp(mu) + 4 * np.exp(std), num=400)
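# The normalizing constant is 1 / (x * sigma * sqrt(2 * pi))
y = np.exp(-((np.log(x) - mu) ** 2) / (2 * std**2)) / (
    x * np.sqrt(2 * np.pi * std**2)
)
source = ColumnDataSource(
    {
        "x": x,
        "y": y,
        # Approximate the CDF as a running sum of the PDF times the x spacing
        "yc": np.cumsum(y) * (x[1] - x[0]),
    }
)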
pdf = figure(height=250, width=350, title="PDF")
pline = pdf.line("x", "y", source=source, color="magenta")
pline1 = pdf.line(x, y, color="grey")
cdf = figure(height=250, width=350, title="CDF")
cline = cdf.line("x", "yc", source=source, color="magenta")
cline1 = cdf.line(source.data["x"], source.data["yc"], color="grey")

mu = Slider(start=0, end=10, value=0, step=0.5, title="Mu (mean)")
std = Slider(start=0.25, end=10, value=1, step=0.25, title="Sigma (std)")
callback = CustomJS(
    args=dict(
        source=source,
        mu=mu,
        std=std,
    ),
    code="""
    const arr = [];
    const start = 0.00001;
    const end = Math.exp(mu.value)+Math.exp(std.value)*4;
    const step = (end - start) / (400 - 1);

    for (let i = 0; i < 400; i++) {
        arr.push(start + step * i);
    }
    const temp_y = arr.map(x => {
        const coefficient = 1 / (Math.sqrt(2 * Math.PI * Math.pow(std.value, 2)) * x);
        const exponent = -Math.pow((Math.log(x) - mu.value), 2) / (2 * Math.pow(std.value, 2));
        return coefficient * Math.exp(exponent);
    })
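    // Approximate the CDF as a running sum of the PDF times the x spacing
    const cumsum = [temp_y[0] * step];
    for (let i = 1; i < 400; i++) {
        cumsum.push(cumsum[i-1] + temp_y[i] * step);
    }
    source.data.y = temp_y;
    source.data.x = arr;
    source.data.yc = cumsum;
    source.change.emit();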
""",
)

mu.js_on_change("value", callback)
std.js_on_change("value", callback)
show(column(row(mu, std), row(pdf, cdf)))

### Gamma distribution
The gamma distribution is probably one of the most important distributions for neuroscience data. A lot of the data we collect is based on rates, such as mini or spike frequency. Rates are often gamma distributed. The gamma distribution is a generalization of the exponential, Erlang and chi-squared distributions. The gamma distribution is defined by a shape parameter, $\alpha$, and a scale parameter, $\theta$. Similar to the lognormal distribution, the gamma distribution bounds are (0,+$\infty$). The PDF of the gamma distribution is: $$\frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}$$
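
The special cases mentioned above can be checked with `scipy.stats`. This is a small sketch with arbitrary parameter values: a shape of 1 gives the exponential distribution, a shape of $k/2$ with a scale of 2 gives the chi-squared distribution with $k$ degrees of freedom, and an integer shape gives the Erlang distribution.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 20, 400)

# Shape 1: exponential distribution with the same scale.
exp_match = np.allclose(
    stats.gamma.pdf(x, 1.0, scale=2.0), stats.expon.pdf(x, scale=2.0)
)

# Shape k/2 and scale 2: chi-squared distribution with k degrees of freedom.
k = 4
chi2_match = np.allclose(
    stats.gamma.pdf(x, k / 2, scale=2.0), stats.chi2.pdf(x, df=k)
)

# Integer shape: Erlang distribution.
erlang_match = np.allclose(
    stats.gamma.pdf(x, 3, scale=1.5), stats.erlang.pdf(x, 3, scale=1.5)
)

# The mean of a gamma distribution is shape * scale.
gamma_mean = stats.gamma.mean(3.0, scale=1.5)

print(exp_match, chi2_match, erlang_match, gamma_mean)
```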

In [None]:
from bokeh.plotting import figure, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, CustomJS, Select, Checkbox
import numpy as np
from bokeh.io import output_notebook
from scipy import stats

shapes = [1.0, 2.0, 3.0, 5.0, 9.0, 7.5, 0.5]
scales = [2.0, 2.0, 2.0, 1.0, 0.5, 1.0, 1.0]
x = np.linspace(0, 20, num=400)
dsource = {"x": x}
keys = []
glines = {}
gamma_fig = figure()

for i, j in zip(shapes, scales):
    pdf = stats.gamma.pdf(x, i, scale=j)
    k = f"a={i} theta={j}"
    keys.append(k)
    glines[k] = gamma_fig.line(x, pdf, color="grey")
source = ColumnDataSource(dsource)
select = Select(
    title="Option:", value="a=1.0 theta=2.0", options=keys
)
glines["a=1.0 theta=2.0"].glyph.line_color = "magenta"
callback = CustomJS(
    args=dict(
        source=source,
        glines=glines,
        select=select,
    ),
    code="""
    for (let key in glines) {
        if (select.value == key) {
            glines[key].glyph.line_color = 'magenta';
        }
        else {
            glines[key].glyph.line_color = 'grey';
        }
    }
""",
)

select.js_on_change("value", callback)
show(column(select, gamma_fig))

In [None]:
dsource.keys()