# Distributions

Distributions and curves are some of the most common statistical topics that you will use in electrophysiology analysis. Distributions and curves help us model data, and modeling data helps us describe the data we have and simplify complex data. While this may sound complex or intimidating, it is not. A simple linear regression is a model. Distributions of our data are a model. Curve fitting is modeling. We have used these models in both the current clamp and mini analysis chapters, but we will go more in-depth here to describe some of the underlying properties. Modeling can be divided into what, how and why models (see: https://compneuro.neuromatch.io/tutorials/intro.html for more). We will primarily focus on what models since these are the first step to using the other types of models. I rarely see what models discussed in patch clamp electrophysiology analysis; however, I believe they are needed to analyze our data well.

## Distributions
Distributions of data are one of the first things you should look at, especially if you have large sample sizes. Think mini amplitude, IEI, and rise rate. Most of the statistical analyses we do make assumptions about the distribution of our data, yet papers rarely show the distributions of their data. Most non-gaussian distributions have an equivalent of the mean, but its value can be quite different from the arithmetic mean. When you take the mean of non-gaussian data you may not get the true central tendency of your data and may be measuring the effect of outliers rather than a true shift in the distribution. In the case of something like a beta distribution, the mean may not accurately describe the data.

Distributions can show counts or probabilities. When you make a histogram or kernel density estimate (KDE) you are creating a distribution of your data (technically a non-parametric distribution). Most often in statistics textbooks you will see probability density functions (PDFs) for continuous variables or probability mass functions (PMFs) for discrete variables. These are parametric distributions because they have parameters that describe the distribution. The important thing about these "functions" is that the distribution of values you get from them integrates to 1 (i.e. the area under the curve is 1) when you integrate from the lower bound to the upper bound of the distribution. When you put a single number into a PDF with some parameters you get a number out called a likelihood (strictly speaking a probability density, not a probability).
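As a quick check of the "integrates to 1" property, here is a minimal sketch using the Scipy stats module. The parameter values (mean 0, standard deviation 1, rate 3) and the grid limits are arbitrary choices for illustration, not values taken from our data.

```python
import numpy as np
from scipy import integrate, stats

# Continuous case: evaluate a Gaussian PDF on a wide grid and integrate it.
x = np.linspace(-10, 10, 2001)
pdf = stats.norm.pdf(x, loc=0, scale=1)
pdf_area = integrate.trapezoid(pdf, x)

# Discrete case: PMF values are probabilities, so they simply sum to 1.
k = np.arange(0, 50)
pmf = stats.poisson.pmf(k, mu=3)
pmf_total = pmf.sum()

# A single input returns a single density value, not a probability.
density_at_zero = stats.norm.pdf(0.0, loc=0, scale=1)

print(pdf_area, pmf_total, density_at_zero)
```

Both totals should come out very close to 1, while the single density value is just the height of the curve at that point.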

The last thing to note is that finding the distribution that fits your data describes your data but does not tell you how or why it was generated that way. These how and why questions are not something we will get at here and are what computational neuroscientists study.

You will also often see cumulative distribution functions (CDFs). The CDF is just the integral of the PDF, and the PDF is the derivative of the CDF. When taking the integral of a continuous function with evenly spaced samples, you can just use Numpy's [cumsum](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html) and multiply by your delta X to get the CDF. For the area under the curve you can use a variety of functions provided by Scipy such as [trapezoid](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.trapezoid.html#scipy.integrate.trapezoid), [simpson](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simpson.html#scipy.integrate.simpson), or Romberg integration with [romb](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.romb.html#scipy.integrate.romb). Below we will plot the PDF and CDF for each of the distributions.
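The relationship between the PDF and the CDF can be sketched numerically. This is a minimal example that assumes a standard normal distribution on an evenly spaced grid. The grid limits and number of points are arbitrary choices.

```python
import numpy as np
from scipy import integrate, stats

x = np.linspace(-4, 4, 400)
dx = x[1] - x[0]  # sample spacing (delta X)
pdf = stats.norm.pdf(x)

# Running sum times the sample spacing approximates the CDF.
cdf_cumsum = np.cumsum(pdf) * dx
# Cumulative trapezoid rule; initial=0 keeps the output the same length as x.
cdf_trapz = integrate.cumulative_trapezoid(pdf, x, initial=0)
# Reference CDF from Scipy.
cdf_exact = stats.norm.cdf(x)

# Differentiating the CDF recovers the PDF.
pdf_from_cdf = np.gradient(cdf_exact, dx)

print(np.abs(cdf_cumsum - cdf_exact).max())
print(np.abs(cdf_trapz - cdf_exact).max())
```

The cumsum version is the simplest, but the trapezoid version is usually a bit closer to the reference CDF for the same grid.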

For this tutorial, there are interactive examples. Not all of these examples use the preferred way of working with distributions in Python. The recommended way is to use the Scipy [stats](https://docs.scipy.org/doc/scipy/reference/stats.html) module. The stats module distributions provide a lot of useful features but can be a little bit intimidating to start with. Scipy stats distributions can provide PDFs/PMFs, CDFs and PPFs (percent point functions, the inverse of the CDF), and can fit distributions to your data using maximum likelihood.
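To make the Scipy stats workflow less intimidating, here is a minimal sketch of the main methods on a normal distribution. The data are simulated and stand in for a measured variable. The mean of 5 and standard deviation of 2 are assumptions made only for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated data standing in for a measured variable (hypothetical values).
samples = rng.normal(loc=5.0, scale=2.0, size=5000)

# Maximum likelihood fit; for the normal distribution this returns (loc, scale).
loc_hat, scale_hat = stats.norm.fit(samples)

# "Freeze" the distribution so the fitted parameters travel with the object.
fitted = stats.norm(loc=loc_hat, scale=scale_hat)

density = fitted.pdf(5.0)     # PDF value at x = 5
prob_below = fitted.cdf(5.0)  # P(X <= 5)
median = fitted.ppf(0.5)      # PPF is the inverse of the CDF

print(loc_hat, scale_hat, density, prob_below, median)
```

Freezing the distribution is optional. You can also pass the parameters to `stats.norm.pdf`, `stats.norm.cdf` and `stats.norm.ppf` on every call.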

If you want to learn a little more about PDFs and PMFs I suggest watching Very Normal on YouTube.

First we are going to go over some common distributions that you will see in electrophysiology and their properties. Then we are going to look at some of the data we tend to collect in electrophysiology experiments and see which distributions look most like our data.

In [None]:
import math
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, CustomJS, Select, Slider
from bokeh.plotting import figure
from scipy import stats

warnings.filterwarnings("ignore", category=RuntimeWarning)

cwd = Path.cwd().parent / "data"

# output_notebook()

### Gaussian distribution
In terms of data that we collect in electrophysiology the gaussian distribution is actually not that common, but most of the time we assume our data follows a gaussian distribution. Some non-gaussian distributions, such as the beta and von Mises (radians) distributions, can converge to a normal distribution with certain parameters. There are problems when you assume that non-gaussian data follows a gaussian distribution AND you do not check your statistical models. We will cover some of this later in this chapter. The gaussian distribution has two parameters, the mean and standard deviation. The mean is what is called a location parameter and shifts the distribution around. The standard deviation describes the spread of the data symmetrically around the mean. Technically the gaussian distribution is unbounded. This means that you can get any value from -$\infty$ to +$\infty$. However, due to limits that we have on computers we generally don't show all the values up to +/- $\infty$, but only up to a couple of standard deviations past the mean on each side. The equation for the gaussian PDF is: $$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Below you can see how changing the mean and standard deviation changes the magenta distribution relative to the grey reference distribution. We plot both the PDF and the CDF (the integral of the PDF). There are a couple of things to note. The area under the curve of the PDF will always equal 1. Increasing the standard deviation decreases the likelihood of getting any single value but increases the range of likely values you can get.

In [None]:
mu = 0
std = 1
x = np.linspace(mu - std * 4, mu + 4 * std, num=400)
y = np.exp(-((x - mu) ** 2) / (2 * std**2)) / np.sqrt(2 * np.pi * std**2)
source = ColumnDataSource(
    {
        "x": x,
        "y": y,
        # Running sum times the sample spacing approximates the CDF.
        "yc": np.cumsum(y) * (x[1] - x[0]),
    }
)
pdf = figure(height=250, width=350, title="PDF")
pline = pdf.line("x", "y", source=source, color="magenta")
pline1 = pdf.line(x, y, color="grey")
cdf = figure(height=250, width=350, title="CDF")
cline = cdf.line("x", "yc", source=source, color="magenta")
cline1 = cdf.line(source.data["x"], source.data["yc"], color="grey")

mu = Slider(start=-10, end=10, value=0, step=0.5, title="Mu (mean)")
std = Slider(start=0.1, end=10, value=1, step=0.5, title="Sigma (std)")
callback = CustomJS(
    args=dict(
        source=source,
        mu=mu,
        std=std,
    ),
    code="""
    const arr = [];
    const start = mu.value-std.value*4
    const end = mu.value+std.value*4
    const step = (end - start) / (400 - 1);

    for (let i = 0; i < 400; i++) {
        arr.push(start + step * i);
    }
    const temp_y = arr.map(x => {
        const coefficient = 1 / Math.sqrt(2 * Math.PI * Math.pow(std.value, 2));
        const exponent = -Math.pow((x - mu.value), 2) / (2 * Math.pow(std.value, 2));
        return coefficient * Math.exp(exponent);
    })
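    // Running sum scaled by the grid spacing approximates the CDF.
    const cumsum = [temp_y[0] * step]
    for (let i = 1; i < 400; i++) {
        cumsum.push(cumsum[i-1] + temp_y[i] * step);
    }
    source.data.y = temp_y;
    source.data.x = arr;
    source.data.yc = cumsum;
    source.change.emit();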
""",
)

mu.js_on_change("value", callback)
std.js_on_change("value", callback)
show(column(row(mu, std), row(pdf, cdf)))

First we will start by looking at some of the data we previously collected.

### Lognormal distribution
The lognormal distribution is a distribution where if you log transform the data you will get the normal distribution. If your data has a right skew (long tail to the right) your data may follow a lognormal distribution. You may ask why not just log transform the data? Log transforming means your data will no longer be on the same scale, which makes downstream interpretations more complicated. The lognormal distribution is very common in biological sciences. Things like rates, lengths, concentrations and energies often follow a lognormal distribution. The lognormal distribution bounds are (0,+$\infty$). The () brackets are exclusive, which means that you can never have a 0 in the distribution since the log of 0 is undefined. The PDF of the lognormal distribution, where $\mu$ and $\sigma$ are the mean and standard deviation of $\ln(x)$, is: $$\frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)$$
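Scipy parameterizes the lognormal distribution a little differently from the equation, so here is a minimal sketch of how the two line up. It assumes `s` corresponds to $\sigma$ and `scale` corresponds to $e^{\mu}$ with the location left at 0. The values of `mu` and `sigma` are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.5, 0.75  # mean and std of log(x); arbitrary illustrative values
x = np.linspace(0.01, 10, 500)

# Lognormal PDF written out by hand.
manual_pdf = np.exp(-((np.log(x) - mu) ** 2) / (2 * sigma**2)) / (
    x * sigma * np.sqrt(2 * np.pi)
)
# Scipy uses s = sigma and scale = exp(mu).
scipy_pdf = stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))

# Log transforming lognormal samples gives normally distributed values.
rng = np.random.default_rng(1)
samples = stats.lognorm.rvs(s=sigma, scale=np.exp(mu), size=5000, random_state=rng)
log_samples = np.log(samples)

print(np.allclose(manual_pdf, scipy_pdf), log_samples.mean(), log_samples.std())
```

The mean and standard deviation of the log transformed samples should land close to `mu` and `sigma`.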

In [None]:
mu = 0
std = 1
x = np.linspace(0.000001, np.exp(mu) + 4 * np.exp(std), num=400)
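# Lognormal PDF; the normalization is x * sigma * sqrt(2 * pi).
y = np.exp(-((np.log(x) - mu) ** 2) / (2 * std**2)) / (
    x * std * np.sqrt(2 * np.pi)
)
source = ColumnDataSource(
    {
        "x": x,
        "y": y,
        # Running sum times the sample spacing approximates the CDF.
        "yc": np.cumsum(y) * (x[1] - x[0]),
    }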
)
pdf = figure(height=250, width=350, title="PDF")
pline = pdf.line("x", "y", source=source, color="magenta")
pline1 = pdf.line(x, y, color="grey")
cdf = figure(height=250, width=350, title="CDF")
cline = cdf.line("x", "yc", source=source, color="magenta")
cline1 = cdf.line(source.data["x"], source.data["yc"], color="grey")

mu = Slider(start=0, end=10, value=0, step=0.5, title="Mu (mean)")
std = Slider(start=0.25, end=10, value=1, step=0.25, title="Sigma (std)")
callback = CustomJS(
    args=dict(
        source=source,
        mu=mu,
        std=std,
    ),
    code="""
    const arr = [];
    const start = 0.00001;
    const end = Math.exp(mu.value)+Math.exp(std.value)*4;
    const step = (end - start) / (400 - 1);

    for (let i = 0; i < 400; i++) {
        arr.push(start + step * i);
    }
    const temp_y = arr.map(x => {
        const coefficient = 1 / (Math.sqrt(2 * Math.PI) * x * std.value);
        const exponent = -Math.pow((Math.log(x) - mu.value), 2) / (2 * Math.pow(std.value, 2));
        return coefficient * Math.exp(exponent);
    })
    // Running sum scaled by the grid spacing approximates the CDF.
    const cumsum = [temp_y[0] * step]
    for (let i = 1; i < 400; i++) {
        cumsum.push(cumsum[i-1] + temp_y[i] * step);
    }
    source.data.y = temp_y;
    source.data.x = arr;
    source.data.yc = cumsum;
    source.change.emit();
""",
)

mu.js_on_change("value", callback)
std.js_on_change("value", callback)
show(column(row(mu, std), row(pdf, cdf)))

### Gamma distribution
The gamma distribution is another important distribution for neuroscience data. The gamma distribution is used to model waiting times and rates. A lot of the data we collect are rates, such as mini or spike frequency. If your data has a right skew (long tail to the right) your data may follow a gamma distribution. Interestingly, gamma distributed variables often maximize the information content of a signal, which is useful in the brain. The gamma distribution is a generalization of the exponential, Erlang and chi-squared distributions. The gamma distribution is defined by a shape parameter, $\alpha$, and a scale parameter, $\theta$. Similar to the lognormal distribution, the gamma distribution bounds are (0,+$\infty$). The PDF of the gamma distribution is: $$\frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}$$

$\Gamma$ is the Gamma function, which is the factorial function generalized to complex numbers (except the non-positive integers).

You can also get the mean: $\mu=\alpha\theta$ and variance: $\sigma^{2}=\alpha\theta^2$ of the distribution.
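The mean and variance formulas can be checked against Scipy, along with the claim that the gamma distribution generalizes the exponential distribution. This is a minimal sketch. The shape and scale values are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

alpha, theta = 3.0, 2.0  # shape and scale; arbitrary illustrative values
dist = stats.gamma(a=alpha, scale=theta)

# Compare Scipy's moments with the closed-form expressions.
mean, var = dist.mean(), dist.var()
mean_formula = alpha * theta
var_formula = alpha * theta**2

# With shape = 1 the gamma distribution reduces to the exponential distribution.
x = np.linspace(0.01, 20, 400)
gamma_shape1 = stats.gamma.pdf(x, a=1.0, scale=theta)
exponential = stats.expon.pdf(x, scale=theta)

print(mean, mean_formula, var, var_formula)
print(np.allclose(gamma_shape1, exponential))
```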

In [None]:
shapes = [1.0, 1.25, 1.25, 2.0, 3.0, 5.0, 7.5]
scales = [2.0, 2.0, 3.0, 2.0, 2.0, 1.0, 1.0]
x = np.linspace(0, 20, num=400)
keys = []
glines = {}
cdflines = {}
gamma_fig = figure(height=250, width=350)
gamma_cdf = figure(height=250, width=350)

for i, j in zip(shapes, scales):
    pdf = stats.gamma.pdf(x, i, scale=j)
    cdf = stats.gamma.cdf(x, i, scale=j)
    k = f"a={i} theta={j}"
    keys.append(k)
    glines[k] = gamma_fig.line(x, pdf, color="grey")
    cdflines[k] = gamma_cdf.line(x, cdf, color="grey")
select = Select(title="Option:", value="a=1.0 theta=2.0", options=keys)
glines["a=1.0 theta=2.0"].glyph.line_color = "magenta"
glines["a=1.0 theta=2.0"].glyph.line_width = 3
cdflines["a=1.0 theta=2.0"].glyph.line_color = "magenta"
cdflines["a=1.0 theta=2.0"].glyph.line_width = 3
callback = CustomJS(
    args=dict(
        glines=glines,
        cdflines=cdflines,
        select=select,
    ),
    code="""
    for (let key in glines) {
        if (select.value == key) {
            glines[key].glyph.line_color = 'magenta';
            cdflines[key].glyph.line_color = 'magenta';
            glines[key].glyph.line_width = 3;
            cdflines[key].glyph.line_width = 3;
        }
        else {
            glines[key].glyph.line_color = 'grey';
            cdflines[key].glyph.line_color = 'grey';
            glines[key].glyph.line_width = 1;
            cdflines[key].glyph.line_width = 1;
        }
    }
""",
)

select.js_on_change("value", callback)
show(column(select, row(gamma_fig, gamma_cdf)))

### Beta distribution
The beta distribution is used to model data that lie on a finite interval, such as proportions and percentages. Some bounded values you might see in neuroscience are correlations (-1,1) and PPR (if you use the percentage formula and not the ratio). The beta distribution's bounds are (0,1), so if you have bounded data that is not between those you can transform it. The beta distribution is defined by two shape parameters, $\alpha$ and $\beta$. The PDF of the beta distribution is: $$\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}, \qquad \mathrm{B}(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
$\Gamma$ is the Gamma function, which is the factorial function generalized to complex numbers (except the non-positive integers). The denominator of the PDF is a normalization factor to ensure the distribution has a total probability of 1.

One thing that you will notice is that the beta distribution can take on a wide range of shapes. This makes it very powerful for oddly shaped distributions. When $\alpha=\beta$ the distribution is symmetric: it is uniform when $\alpha=\beta=1$ and looks approximately normal when $\alpha=\beta$ is large.

You can also get the mean: $\mu=\frac{\alpha}{\alpha+\beta}$ and variance: $\sigma^{2}=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ of the distribution.
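Here is a minimal sketch that checks the mean and variance formulas and shows one way to rescale bounded data onto (0,1). The correlation values are hypothetical. The linear rescaling is just one simple option, not the only possible transform.

```python
import numpy as np
from scipy import stats

alpha, beta = 2.0, 5.0  # arbitrary illustrative shape parameters
dist = stats.beta(alpha, beta)
mean_formula = alpha / (alpha + beta)
var_formula = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Hypothetical correlation values in (-1, 1), rescaled to (0, 1).
r = np.array([-0.8, -0.2, 0.0, 0.35, 0.9])
lower, upper = -1.0, 1.0
r_scaled = (r - lower) / (upper - lower)

# Map back to the original scale after modeling.
r_back = r_scaled * (upper - lower) + lower

print(dist.mean(), mean_formula, dist.var(), var_formula)
print(r_scaled)
```

If your data contain values exactly at the bounds you will need to nudge them inside the interval first, since the bounds of the beta distribution are exclusive.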

In [None]:
shapes = [0.5, 5, 1, 2, 5, 1.25, 1, 5]
scales = [0.5, 1, 3, 2, 2, 5, 1, 5]
x = np.linspace(0, 1, num=400)
keys = []
blines = {}
cdflines = {}
beta_fig = figure(height=250, width=350)
beta_cdf = figure(height=250, width=350)

for i, j in zip(shapes, scales):
    pdf = stats.beta.pdf(x, i, j)
    cdf = stats.beta.cdf(x, i, j)
    k = f"a={i} b={j}"
    keys.append(k)
    blines[k] = beta_fig.line(x, pdf, color="grey")
    cdflines[k] = beta_cdf.line(x, cdf, color="grey")
select = Select(title="Option:", value="a=0.5 b=0.5", options=keys)
blines["a=0.5 b=0.5"].glyph.line_color = "magenta"
blines["a=0.5 b=0.5"].glyph.line_width = 3
cdflines["a=0.5 b=0.5"].glyph.line_color = "magenta"
cdflines["a=0.5 b=0.5"].glyph.line_width = 3
callback = CustomJS(
    args=dict(
        blines=blines,
        cdflines=cdflines,
        select=select,
    ),
    code="""
    for (let key in blines) {
        if (select.value == key) {
            blines[key].glyph.line_color = 'magenta';
            cdflines[key].glyph.line_color = 'magenta';
            blines[key].glyph.line_width = 3;
            cdflines[key].glyph.line_width = 3;
        }
        else {
            blines[key].glyph.line_color = 'grey';
            cdflines[key].glyph.line_color = 'grey';
            blines[key].glyph.line_width = 1;
            cdflines[key].glyph.line_width = 1;
        }
    }
""",
)

select.js_on_change("value", callback)
show(column(select, row(beta_fig, beta_cdf)))

### Poisson distribution
The Poisson distribution is a discrete distribution used to model the probability of a certain number of events occurring within a time frame, so it is related to the rate at which events occur. The Poisson distribution is kind of like the discrete cousin of the gamma distribution. The Poisson distribution has an important related feature called dispersion or the Fano factor: $\frac{\sigma^2}{\mu} = 1$, which is also related to the coefficient of variation: $\frac{\sigma}{\mu} = \lambda^{-\frac{1}{2}}$. The Fano factor is often used to check the variability of a spike train or even minis. The closer the Fano factor is to 0, the more predictable the process is. The farther you go above 1, the more clustered your events will be, suggesting that there is some reason your events cluster (which makes sense for neurons that can fire spike trains). The Poisson PMF is defined by $\lambda$, the rate, and $k$, the integer number of events. The Poisson PMF (not PDF since it is discrete) is: $${\frac {\lambda ^{k}e^{-\lambda }}{k!}}$$ Since the Poisson distribution is a discrete distribution you actually get probabilities out instead of likelihoods.

With the Poisson distribution you could model how many minis are likely to occur in a time frame based on a rate but you cannot model when the minis occur. To do that you would need to use the continuous exponential distribution (which is related to the gamma distribution). One thing to note is that the plot uses scatter instead of a line. This is to indicate that the Poisson distribution can only take integer values.
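The link between event counts and event timing can be sketched with a short simulation. This is a minimal example on simulated data. The rate, recording duration and 1 s counting window are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
rate = 5.0         # events per second (hypothetical mini frequency)
duration = 2000.0  # seconds of simulated recording

# Exponential inter-event intervals describe *when* events occur.
n_draw = int(rate * duration * 1.5)  # draw extra intervals to cover the recording
iei = rng.exponential(scale=1 / rate, size=n_draw)
event_times = np.cumsum(iei)
event_times = event_times[event_times < duration]

# Counting events in 1 s windows describes *how many* events occur.
counts, _ = np.histogram(event_times, bins=np.arange(0, duration + 1, 1.0))

# For a Poisson process the Fano factor of the counts is close to 1,
# and the coefficient of variation of the intervals is close to 1.
fano = counts.var() / counts.mean()
cv_iei = iei.std() / iei.mean()

print(counts.mean(), fano, cv_iei)
```

Running the same calculation on real event times is one way to see whether your events are more regular (Fano factor below 1) or more clustered (Fano factor above 1) than a Poisson process.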

In [None]:
lam = 1
k = np.arange(0, 21)
y = (lam**k * np.exp(-lam)) / np.array([math.factorial(i) for i in k])
source = ColumnDataSource({"k": k, "y": y, "yc": np.cumsum(y)})
pdf = figure(height=250, width=350, title="PMF")
pline = pdf.scatter("k", "y", source=source, color="magenta")
pline1 = pdf.scatter(k, y, color="grey")
cdf = figure(height=250, width=350, title="CDF")
cline = cdf.scatter("k", "yc", source=source, color="magenta")
cline1 = cdf.scatter(source.data["k"], source.data["yc"], color="grey")

lam = Slider(start=0.5, end=10, value=1, step=0.5, title="Lambda")
callback = CustomJS(
    args=dict(
        source=source,
        lam=lam,
    ),
    code="""
    const arr = [];
    for (let i = 0; i < 21; i++) {
        let result = 1;
        for (let k = 2; k <= i; k++) {
            result *= k;
        }
        const val = ((lam.value**i)*(Math.exp(-1*lam.value)))/result;
        arr.push(val);
    }
    const cumsum = [arr[0]]
    for (let i = 1; i < 21; i++) {
        cumsum.push((cumsum[i-1] + arr[i]));
    }
    source.data.y = arr;
    source.data.yc = cumsum
    source.change.emit();
""",
)

lam.js_on_change("value", callback)
show(column(row(lam), row(pdf, cdf)))

## Examining distributions in your data.
So far we have covered five distributions that are heavily used (or should be) in neuroscience. The next part is to look at some data that we get in patch clamp recordings, fit each of these distributions (when possible) to the data and see how the fit looks.

The most common method to look at distributions of data is to use a histogram. My preferred method is to use a kernel density estimate (KDE). Both of these methods are non-parametric in that they technically do not have any parameters other than your data to create a distribution. The reason I prefer KDEs to histograms is that you can interpolate where you do not have data. KDEs are essentially the non-parametric version of the probability density functions we already covered. I will show you how to create a histogram and KDE in Python and then we will use the KDE to compare our data to the distributions above. One thing to note is that we will only use the Poisson for the IEI data, since IEI can easily be converted to integers if you know the sample rate of the data (in our case 10000 Hz). To do this we will use the data we collected in the Miniature/spontaneous postsynaptic currents chapter.


In [None]:
df = pd.read_csv(cwd/"mini_data.csv")

### How to Create a Histogram
To create a histogram in Python the easiest way is to use Numpy's [histogram](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html). For this example we will use 50 bins; however, Numpy has a pretty good algorithm for automatically selecting bins. One setting we will use is `density=True`. This ensures that each bin shows a density rather than a count so that it matches the KDE. If you just want to plot a histogram, other popular plotting packages such as Matplotlib or Seaborn have functions where you can just put in your data.
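To see what the density setting does, here is a minimal sketch on simulated data. The lognormal samples are a hypothetical stand-in for something like mini amplitudes and are not the data used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated right-skewed data standing in for mini amplitudes (hypothetical).
data = rng.lognormal(mean=2.5, sigma=0.4, size=2000)

# Default histogram: each bin holds a count.
counts, edges = np.histogram(data, bins=50)
# density=True: each bin holds a density instead of a count.
density, _ = np.histogram(data, bins=50, density=True)
widths = np.diff(edges)

# With density=True, bar height times bar width sums to 1.
area = np.sum(density * widths)

# bins="auto" lets Numpy choose the number of bins.
auto_density, auto_edges = np.histogram(data, bins="auto", density=True)

print(counts.sum(), area, len(auto_density))
```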

In [None]:
hist_dict = {}
for i in df:
    hist, edges = np.histogram(df[i], range=(df[i].min()-1e-6, df[i].max()+1e-6), density=True, bins=50)
    hist_dict[f"{i}_hist"] = hist
    hist_dict[f"{i}_left"] = edges[:-1]
    hist_dict[f"{i}_right"] = edges[1:]
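# Initialize the plotted columns with the variable selected by default ("iei").
hist_dict["left"] = hist_dict["iei_left"]
hist_dict["right"] = hist_dict["iei_right"]
hist_dict["hist_y"] = hist_dict["iei_hist"]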


### How to Create a KDE
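A KDE places a small kernel (here a Gaussian) on every data point and sums the kernels to produce a smooth density estimate. The sketch below is a minimal, self-contained illustration using Scipy's `gaussian_kde` on simulated data. The lognormal samples and the padding of three standard deviations are assumptions made only for this example.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(4)
# Simulated right-skewed data (hypothetical stand-in for real measurements).
data = rng.lognormal(mean=2.5, sigma=0.4, size=2000)

# The bandwidth is chosen by Scott's rule by default.
kernel = stats.gaussian_kde(data)

# Evaluate the KDE on a grid padded past the data so the tails are included.
pad = 3 * data.std()
grid = np.linspace(data.min() - pad, data.max() + pad, 500)
kde_y = kernel(grid)

# Like a PDF, the KDE integrates to roughly 1.
area = integrate.trapezoid(kde_y, grid)
# The peak of the KDE is an estimate of the mode.
mode_estimate = grid[kde_y.argmax()]

print(area, mode_estimate)
```

Because the KDE can be evaluated at any position, you get values between and slightly beyond your data points, which is the interpolation advantage over a histogram.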

In [None]:
kde_dict = {}
for i in df:
    data = df[i]
    if i == "iei":
        data = data[data > 0]
    min_val = min(data)
    max_val = max(data)
    padding = (max_val - min_val) * 0.1  # Add 10% padding
    grid_min = min_val - padding
    grid_max = max_val + padding
    grid_min = max(grid_min, 0)
    positions = np.linspace(grid_min, grid_max, num=248)
    kernel = stats.gaussian_kde(data)
    y = kernel(positions)
    kde_dict[f"{i}_x"] = positions
    kde_dict[f"{i}_y"] = y

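# Initialize the plotted columns with the variable selected by default ("iei").
kde_dict["kde_y"] = kde_dict["iei_y"]
kde_dict["x"] = kde_dict["iei_x"]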

### Examining the data
Below are some plots of our data. There are three common transforms: sqrt, log10 and the negative inverse. These transforms are used to correct for right skew in your data. The functions correcting for skewness go from light correction to heavy correction in this order: sqrt → log10 → negative inverse. Because the histogram bins are computed on the untransformed data and the bin edges are transformed afterwards, the histogram can look pretty weird. Part of the reason for not transforming the data before computing the histogram is due to limitations of how this book is written and published. However, you can see how the transform "resizes" the bins, putting greater emphasis (larger bin size) on smaller values. All these transforms decrease the effects of outliers. Additionally, I plot the mean, median and mode. These are all measures of central tendency. You can see how the transform shifts the mean and median towards the mode. However, the mean, median and mode almost never fully align. This is because we have truncated distributions in addition to skew.
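The strength of each transform can be quantified with the sample skewness. This is a minimal sketch on simulated lognormal data, which is an assumption chosen because its behavior under these transforms is well understood. Your own data will give different numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Simulated right-skewed data (hypothetical stand-in for real measurements).
data = rng.lognormal(mean=0.0, sigma=0.75, size=5000)

transforms = {
    "identity": lambda v: v,
    "sqrt": np.sqrt,
    "log10": np.log10,
    "neg_inverse": lambda v: -1 / v,
}

# Skewness near 0 means the distribution is roughly symmetric.
skews = {name: float(stats.skew(func(data))) for name, func in transforms.items()}
for name, value in skews.items():
    print(f"{name}: skew = {value:.2f}")
```

For this simulated example the log transform brings the skewness close to 0 and the negative inverse overcorrects into a left skew, which is why it helps to check the result rather than always reaching for the strongest transform.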

In [None]:
menu = Select(
    title="Variables",
    value="iei",
    options=df.columns.to_list()
)
kde_source = ColumnDataSource(kde_dict)
hist_source = ColumnDataSource(hist_dict)
transform = Select(title="Transform", value="Identity", options=["Identity", "sqrt", "log", "neg_inverse"])

hist_figure = figure(height=250, width=400)
hist_data = hist_figure.quad(top="hist_y", bottom=0, left="left", right="right", line_color="white", alpha=0.5, source=hist_source)

kde_figure = figure(height=250, width=400)
line = kde_figure.line(x="x", y="kde_y", source=kde_source, line_color="black")

central_tendency = {}
for i in df:
    data = df[i]
    if i == "iei":
        data = data[data > 0]
    index = kde_source.data[f"{i}_y"].argmax()
    central_tendency[f"{i}_Identity"] = [np.mean(data), np.median(data), kde_source.data[f"{i}_x"][index]]
    sqrt_data = np.sqrt(data)
    central_tendency[f"{i}_sqrt"] = [np.mean(sqrt_data), np.median(sqrt_data), np.sqrt(kde_source.data[f"{i}_x"][index])]
    log_data = np.log10(data)
    central_tendency[f"{i}_log"] = [np.mean(log_data), np.median(log_data), np.log10(kde_source.data[f"{i}_x"][index])]
    ninv_data = -1/data
    central_tendency[f"{i}_neg_inverse"] = [np.mean(ninv_data), np.median(ninv_data), -1/kde_source.data[f"{i}_x"][index]]
central_tendency["x"] = central_tendency["iei_Identity"]
central_tendency["color"] = ["orange", "blue", "magenta"]
central_tendency = ColumnDataSource(central_tendency)

vspans = kde_figure.vspan(x="x", color="color", source=central_tendency)

callback = CustomJS(
    args=dict(
        hist_source=hist_source,
        kde_source=kde_source,
        central_tendency=central_tendency,
        menu=menu,
        transform=transform,
    ),
    code="""
    if (transform.value == "Identity") {
        var left = hist_source.data[`${menu.value}_left`];
        var right = hist_source.data[`${menu.value}_right`];
        var x = kde_source.data[`${menu.value}_x`];
        var mx = central_tendency.data[`${menu.value}_Identity`];
    } else if (transform.value == "log") {
        var left = hist_source.data[`${menu.value}_left`].map(num => Math.log10(num));
        var right = hist_source.data[`${menu.value}_right`].map(num => Math.log10(num));
        var x = kde_source.data[`${menu.value}_x`].map(num => Math.log10(num));
        var mx = central_tendency.data[`${menu.value}_log`];
    } else if (transform.value == "sqrt") {
        var left = hist_source.data[`${menu.value}_left`].map(num => Math.sqrt(num));
        var right = hist_source.data[`${menu.value}_right`].map(num => Math.sqrt(num));
        var x = kde_source.data[`${menu.value}_x`].map(num => Math.sqrt(num));
        var mx = central_tendency.data[`${menu.value}_sqrt`];
    }  else if (transform.value == "neg_inverse") {
        var left = hist_source.data[`${menu.value}_left`].map(num => -1/num);
        var right = hist_source.data[`${menu.value}_right`].map(num => -1/num);
        var x = kde_source.data[`${menu.value}_x`].map(num => -1/num);
        var mx = central_tendency.data[`${menu.value}_neg_inverse`];
    }
    const hist_y = hist_source.data[`${menu.value}_hist`];
    const kde_y = kde_source.data[`${menu.value}_y`];
    hist_source.data.hist_y = hist_y; 
    hist_source.data.left = left;
    hist_source.data.right = right;
    kde_source.data.kde_y = kde_y
    kde_source.data.x = x;
    central_tendency.data.x = mx;
    central_tendency.change.emit();
    kde_source.change.emit();
    hist_source.change.emit();
""",
)

menu.js_on_change("value", callback)
transform.js_on_change("value", callback)

show(column(row(menu, transform), row(hist_figure, kde_figure)))

### Fitting your data to a distribution
In this next section we will look at how data fits to different distributions. Due to limitations of the format we will stick with IEI data; however, I encourage you to modify the code below to use either the other data we have or, better yet, your own data. You should note that it is quite hard to accurately describe your data with distributions, as you will see. Part of this is the limitation of the tools we have to fit a distribution to our data, and part is that our data are messy due to measurement noise, recording conditions, etc.

You will notice that the gamma distribution does not fit very well with the untransformed data. You will also notice how the normal, lognormal and gamma distributions converge as the transform gets stronger. The biggest problem with IEI data is that it is tightly clustered around the mode. The mode can be estimated by looking at the peak of the KDE. Many models assume some relationship between the mean and the variance of the data, so the functions have limited types of shapes that they can support. Of all the models that we used, the lognormal distribution fits best. This is probably because it is parameterized the same way as the normal distribution.

Lastly, when Scipy fits data to a distribution it also fits the location of the distribution. This usually does not happen when fitting regression models unless the model explicitly contains a location parameter (the normal distribution is one of the few that explicitly contains one). So if you want to run a regression model (t-tests and ANOVAs included) with a gamma, beta or lognormal distribution, you may need to shift your data to just above 0 to improve the fit.
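If you do not want Scipy to move the location, you can pin it with the `floc` keyword of `fit`. The sketch below is a minimal example on simulated gamma data. The shape and scale used to simulate the data are assumptions. Comparing fits by total log-likelihood is one simple option rather than the method used elsewhere in this chapter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Simulated IEI-like data in seconds (hypothetical stand-in for real IEIs).
data = stats.gamma.rvs(a=2.0, scale=0.1, size=3000, random_state=rng)

# Default fit: shape, loc and scale are all free parameters.
a_free, loc_free, scale_free = stats.gamma.fit(data)
# floc=0 pins the location at 0 so only shape and scale are estimated.
a_fixed, loc_fixed, scale_fixed = stats.gamma.fit(data, floc=0)

# Compare candidate distributions by total log-likelihood (higher is better).
loglik = {
    "gamma": stats.gamma.logpdf(data, a_fixed, loc=0, scale=scale_fixed).sum(),
    "norm": stats.norm.logpdf(data, *stats.norm.fit(data)).sum(),
}

print(a_free, loc_free, scale_free)
print(a_fixed, loc_fixed, scale_fixed)
print(loglik)
```

With the location pinned at 0 the fitted shape and scale are directly comparable to the parameters of the gamma PDF shown earlier.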

In [None]:
dist_data = df.loc[df["iei"] > 0, "iei"]

def fit_distributions(data):
    dist_dict = {}

    min_val = min(data)
    max_val = max(data)
    padding = (max_val - min_val) * 0.1  # Add 10% padding
    grid_min = min_val - padding
    grid_max = max_val + padding
    grid_min = max(grid_min, 0)
    positions = np.linspace(grid_min, grid_max, num=248)
    kernel = stats.gaussian_kde(data)
    y = kernel(positions)
    dist_dict["kde_x"] = positions
    dist_dict["kde_y"] = y
    

    # Gaussian distribution fit
    norm_vals = stats.norm.fit(data)
    dist_dict["norm_x"] = np.linspace(1e-10, data.max()+1e-10, 248)
    dist_dict["norm_y"] = stats.norm.pdf(dist_dict["norm_x"], loc=norm_vals[0], scale=norm_vals[1])

    # Lognormal distribution fit
    logvals = stats.lognorm.fit(data)
    dist_dict["lognorm_x"] = np.linspace(1e-10, data.max()+1e-10, 248)
    dist_dict["lognorm_y"] = stats.lognorm.pdf(dist_dict["lognorm_x"], logvals[0], loc=logvals[1], scale=logvals[2])

    # Gamma distribution fit
    gammavals = stats.gamma.fit(data)
    dist_dict["gamma_x"] = np.linspace(1e-10, data.max()+1e-10, 248)
    dist_dict["gamma_y"] = stats.gamma.pdf(dist_dict["gamma_x"], gammavals[0], loc=gammavals[1], scale=gammavals[2])
    
    return dist_dict

def create_fig(source, title):
    fig = figure(height=250, width=400, title=title)
    fig.line("lognorm_x", "lognorm_y", source=source, line_color="magenta")
    fig.line("norm_x", "norm_y", source=source, line_color="orange")
    fig.line("gamma_x", "gamma_y", source=source, line_color="blue")
    fig.line("kde_x", "kde_y", source=source, line_color="black", line_width=3)
    return fig

source = ColumnDataSource(fit_distributions(dist_data))
fig = create_fig(source, title="No transform")

sqrt_source = ColumnDataSource(fit_distributions(np.sqrt(dist_data)))
sqrt_fig = create_fig(sqrt_source, title="Sqrt")

log10_source = ColumnDataSource(fit_distributions(np.log10(dist_data)))
log10_fig = create_fig(log10_source, title="Log10")

legend = figure(y_range=(5,5), height=250, width=400, title="Legend")
legend.line([0, 1], [0, 0], line_color="magenta", line_width=3, legend_label="lognormal")
legend.line([0, 1], [1, 1], line_color="orange", line_width=3, legend_label="normal")
legend.line([0, 1], [3, 3], line_color="blue", line_width=3, legend_label="gamma")
legend.line([0, 1], [4, 4], line_color="black", line_width=3, legend_label="kde")
legend.legend.location = "center"

show(column(row(fig, sqrt_fig), row(log10_fig, legend)))

This concludes the chapter on distributions.