# Overview

As discussed [here](https://github.com/PennNGG/Quantitative-Neuroscience/blob/master/Concepts/Parametric%20Versus%20Nonparametric%20Statistics.ipynb), parametric tests are useful because they are typically based on exact, analytic solutions to inference problems about samples from known distributions. These solutions often provide more power (i.e., the ability to identify real signals when they are present) than other approaches. However, in practice these tests can be tricky to work with (even putting aside questions of whether it is reasonable to assume a particular form of distribution for your data), because they are applied to [samples, not populations](https://github.com/PennNGG/Quantitative-Neuroscience/blob/master/Concepts/Samples%20and%20Populations.ipynb). As such, they rely on using the samples to estimate the parameters of the appropriate distribution(s).

Here we focus on the commonly used case of sample means and show some of the key approaches that have been developed over the years to help mitigate problems associated with using samples to estimate properties of the population. These approaches include several components that show up over and over in parametric testing and thus are good to have some intuitions for, including ...

Note that most parametric hypothesis tests are of the form: what is the probability that I would have obtained my test statistic, given its expected distribution under the null hypothesis? Answering this question requires: 1) defining a test statistic, and 2) knowing (or assuming) how it is distributed under the null hypothesis. 
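To make this question concrete, here is a minimal sketch that approximates the null distribution of a test statistic by simulation and then asks how often a value at least as extreme as the observed one would occur. Everything specific here is an assumption made for illustration: the null hypothesis (normal data with mean 0 and standard deviation 1), the sample size, the choice of the sample mean as the test statistic, and the "observed" value of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed null hypothesis (illustrative values, not from any real data set):
# the data are normally distributed with this mean and standard deviation
null_mean, null_sd, n = 0.0, 1.0, 20

# 1) The test statistic: here, a hypothetical observed sample mean
observed_stat = 0.5

# 2) Its distribution under the null hypothesis, approximated by simulating
#    many experiments in which the null hypothesis is true
num_sims = 100_000
null_stats = rng.normal(null_mean, null_sd, size=(num_sims, n)).mean(axis=1)

# Two-sided p-value: fraction of null statistics at least as far from the
# null mean as the observed statistic
p_value = np.mean(np.abs(null_stats - null_mean) >= abs(observed_stat - null_mean))
print(f"Simulated two-sided p-value: {p_value:.4f}")
```

Parametric tests replace the simulation in step 2 with an analytic expression for the null distribution, which is what the rest of this section builds toward.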

In the cases discussed here, the test statistic is the sample mean. The sample mean is quite straightforward to compute. As everyone has learned, for *n* data points the sample mean $\bar{X}$ is just:

$\quad\bar{X}=\frac{1}{n}\sum^n_{i=1}x_i$

An advantage of using the sample mean is that it is typically reasonable to assume how it is distributed. As the equation above shows, the sample mean is computed by adding together a bunch of individual samples (which themselves may come from nearly any distribution). This case is exactly what is described by the [Central Limit Theorem](https://statisticsbyjim.com/basics/central-limit-theorem/), which says that, for independent measurements with finite variance, their sum (and therefore mean) will be (with enough data!) approximately normally distributed, regardless of the distribution of the measurements themselves.
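One way to build intuition for this claim is to simulate it. The sketch below assumes, purely for illustration, that individual measurements come from an exponential distribution (which is strongly skewed) and tracks the skewness of the sample means as *n* grows; a normal distribution has zero skewness. The function names and parameter values are hypothetical choices, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_means(n, num_experiments=20_000):
    """Return the means of num_experiments samples, each of size n,
    drawn from an exponential distribution (an assumed, non-normal source)."""
    return rng.exponential(scale=1.0, size=(num_experiments, n)).mean(axis=1)

def skewness(x):
    """Sample skewness; close to 0 for a symmetric (e.g., normal) distribution."""
    standardized = (x - x.mean()) / x.std()
    return np.mean(standardized**3)

# Skewness of the distribution of sample means for increasing sample sizes
skew_by_n = {n: skewness(sample_means(n)) for n in (1, 10, 100)}
for n, s in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {s:.2f}")
```

The skewness should shrink toward zero as *n* increases, consistent with the distribution of sample means becoming more nearly normal even though the individual measurements are not.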

As we know, [a normal distribution is defined by two parameters](https://github.com/PennNGG/Quantitative-Neuroscience/blob/master/Probability%20Distributions/Python/Gaussian%20(Normal).ipynb): the mean and variance (or standard deviation). Our estimate of the mean parameter (typically denoted as $\mu$) is just the sample mean ($\bar{X}$), defined above.

The standard deviation goes by a special name when it is applied to the distribution of mean values: the standard error of the mean (*sem*). To picture it, imagine doing 1000 separate experiments and computing a separate mean for each one; the *sem* describes the spread of the distribution of those values. There is an analytic solution for this term, given the standard deviation $\sigma$ of the individual measurements and the number of measurements *n*:

$\quad sem=\frac{\sigma}{\sqrt{n}}$
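The thought experiment above can be run directly. The sketch below simulates 1000 experiments and compares the spread of the resulting means to $\sigma/\sqrt{n}$. The values of `mu` and `sigma` match those used in the code cell later in this section; the sample size `n = 25` is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 5.0, 10.0    # population mean and standard deviation
n = 25                   # data points per experiment (assumed for illustration)
num_experiments = 1000   # number of simulated experiments

# Each row is one experiment; compute one sample mean per experiment
means = rng.normal(mu, sigma, size=(num_experiments, n)).mean(axis=1)

# Spread of the sample means across experiments vs. the analytic value
empirical_sem = means.std(ddof=1)
analytic_sem = sigma / np.sqrt(n)
print(f"empirical sem = {empirical_sem:.3f}, analytic sem = {analytic_sem:.3f}")
```

The two values should agree closely, and the agreement should tighten as `num_experiments` grows.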



A major challenge in statistics is that we are working with [samples, not populations](https://github.com/PennNGG/Quantitative-Neuroscience/blob/master/Concepts/Samples%20and%20Populations.ipynb). Thus, when using parametric tests, we almost always must depend on estimates of the parameters of the given distribution derived from the samples, not their "true" (in quotes because we typically cannot actually ever know these values and instead just assume their existence and move on) values from the population. In the case of the normal distribution, these estimates are the sample mean and the sample variance (or standard deviation).



The Student's *t* distribution, discovered by a [beer guy](https://en.wikipedia.org/wiki/William_Sealy_Gosset), describes how the standardized sample mean is distributed when the samples come from a normal distribution whose variance is unknown and must be estimated from those same samples. It is therefore the basis for confidence intervals on the mean in that case. More specifically, if you take a sample of *n* observations from a normal distribution, then the z-score of the difference between the sample mean and the population mean, computed using the estimated standard deviation of the normal distribution, follows a *t* distribution with *n*–1 degrees of freedom.

Let's unpack that:

1\. Start with a normally distributed random variable *𝑋* with parameters $\mu$ (mean) and $\sigma^2$ (variance):

$\quad X \sim N(\mu, \sigma^2)$

2\. For *n* data points, the sample mean $\bar{X}$ is just:

$\quad\bar{X}=\frac{1}{n}\sum^n_{i=1}X_i$

3\. If you know the variance ($\sigma^2$), then the standard error of the mean (*sem*) is:

$\quad sem=\frac{\sigma}{\sqrt{n}}$

In this case, a good test statistic that quantifies the signal-to-noise ratio is the difference between the sample mean and the actual mean (the signal, in the numerator), standardized by the standard error of the mean (the noise, in the denominator):

$\quad z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$

This quantity (which is a z-score and thus called *z*) has a standard normal distribution (i.e., mean=0, variance=1).

4\. However, if you do not know the variance ($\sigma^2$), then you need to use the [Bessel-corrected sample variance](https://mathworld.wolfram.com/BesselsCorrection.html) ($S^2$), or more precisely its square root *S*, to compute the standard error of the mean in the test statistic, now called *t*:

$\quad t=\frac{\bar{X}-\mu}{S/\sqrt{n}},\quad\text{where}\quad S=\sqrt{\frac{1}{n-1}\sum^n_{i=1}{(X_i-\bar{X})^2}}$

Note the (*n*–1) term in *S*, which is the Bessel correction and also gives the **degrees of freedom** of the *t* distribution. Because *S* is itself estimated from the sample, the distribution of *t* is slightly different from the distribution of *z*. Specifically, *t* has "heavy tails"; i.e., a higher probability of extreme values. Note that as *n* increases, *𝑡*⟶*𝑧* (they become more and more similar).
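The four steps above can be sketched as a simulation that computes both *z* (using the known $\sigma$) and *t* (using *S*) for many simulated experiments, then compares how often each statistic lands far from zero. The threshold of 3, the sample sizes, and the function name `simulate_z_and_t` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 5.0, 10.0      # parameters of the normal distribution
num_experiments = 50_000   # simulated experiments per sample size

def simulate_z_and_t(n):
    """Simulate experiments with n data points each and return the z and t
    statistics computed from every experiment."""
    x = rng.normal(mu, sigma, size=(num_experiments, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)               # Bessel-corrected (divides by n-1)
    z = (xbar - mu) / (sigma / np.sqrt(n))  # uses the known sigma
    t = (xbar - mu) / (s / np.sqrt(n))      # uses the estimate S instead
    return z, t

# Fraction of experiments with an extreme statistic (|value| > 3)
tail_fractions = {}
for n in (5, 50):
    z, t = simulate_z_and_t(n)
    tail_fractions[n] = (np.mean(np.abs(z) > 3), np.mean(np.abs(t) > 3))
    print(f"n = {n:2d}: P(|z| > 3) = {tail_fractions[n][0]:.4f}, "
          f"P(|t| > 3) = {tail_fractions[n][1]:.4f}")
```

For small *n*, extreme values of *t* should be much more common than extreme values of *z* (the heavy tails), and the gap should narrow as *n* grows.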




In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Let's compare simulated and theoretical Gaussians
mu = 5
sigma = 10
N = 10000
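
The cell above stops after defining its parameters. As a sketch of how the comparison it describes might continue (this is an assumed continuation, not the original code), the block below draws `N` samples, builds a normalized histogram, and compares it numerically to the theoretical Gaussian density. The bin layout and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Same parameters as the cell above
mu = 5
sigma = 10
N = 10000

# Simulated Gaussian: N random samples
samples = rng.normal(mu, sigma, N)

# Empirical density from a normalized histogram (40 bins spanning +/- 4 sigma)
edges = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 41)
density, _ = np.histogram(samples, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Theoretical Gaussian density evaluated at the bin centers
pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Largest disagreement between simulated and theoretical densities
max_abs_error = np.max(np.abs(density - pdf))
print(f"sample mean = {samples.mean():.2f}, sample std = {samples.std(ddof=1):.2f}")
print(f"max |simulated - theoretical| density = {max_abs_error:.4f}")
```

With matplotlib imported as in the cell above, plotting `density` and `pdf` against `centers` would show the same comparison visually.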


# Additional Resources

Working with the *t* distribution in [Matlab](https://www.mathworks.com/help/stats/students-t-distribution.html), [R](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/TDist.html), and [Python](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html).

# Credits

Copyright 2021 by Joshua I. Gold, University of Pennsylvania