# Mean, Median, and Mode

The mean of a distribution is given by:

$\mu = \frac{1}{N} \sum_i x_i$

The expected value of a distribution is given by:

$E[x] = \sum\limits_{i} x_i p_i$

The median of a distribution is the middle element of the sorted samples.  For an even number of samples,
the median is the average of the middle two elements.

The mode of a distribution is the most frequent element.

Note that the median is more resistant to outliers and skewness than the mean, but the mean is otherwise a better measure of the central tendency of the distribution.
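As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module on a made-up sample containing one large outlier. The data values are assumptions chosen only to show how the outlier pulls the mean but not the median.

```python
import statistics

# Hypothetical sample with one large outlier (values chosen for illustration).
data = [2, 3, 3, 4, 5, 100]

mean = statistics.mean(data)      # sum of the values divided by N
median = statistics.median(data)  # N is even, so the two middle values are averaged
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

The outlier drags the mean well above most of the data, while the median stays near the bulk of the values.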

# Variance and Standard Deviation

The variance of a random variable X is the expected value of the squared deviation from the mean of X:

$\text{Var}(X) = E[(X-\mu)^2] = E[X^2]-E[X]^2$

The mnemonic for the expression above is the "mean of the square minus square of the mean."

If X is a discrete variable, then:

$\text{Var}(X) = \sum\limits_{i=1}^{n} p_i (x_i - \mu)^2$

For n equally weighted values, this becomes

$\text{Var}(X) = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \mu)^2$

The standard deviation $\sigma$ is equal to the square root of the variance:

$\text{Var}(X) = \sigma^2$

Scaling a random variable by a constant $C$ scales its variance by $C^2$:

$\text{Var}(Cx) = C^2 \text{Var}(x)$
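The identities in this section can be checked numerically. This is a minimal sketch on made-up, equally weighted values; the variable names are illustrative only.

```python
import statistics

x = [1.0, 2.0, 4.0, 7.0]  # illustrative, equally weighted values
n = len(x)
mu = sum(x) / n

# Definition: mean of the squared deviations from the mean.
var_def = sum((xi - mu) ** 2 for xi in x) / n

# Shortcut: mean of the square minus square of the mean.
var_short = sum(xi ** 2 for xi in x) / n - mu ** 2

# Scaling every value by C scales the variance by C**2.
C = 3.0
var_scaled = statistics.pvariance([C * xi for xi in x])

print(var_def, var_short, var_scaled)
```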

# Sample Variance versus Population Variance

After taking a sample of a population, the **biased sample variance** is given by the standard deviation of the sample, squared:

$\sigma_{samp}^2 = \sum_{i} \frac{ (x_i - \mu_x)^2}{N}$

Since we are randomly sampling from the population, there is a chance that we are sampling points that are very close together.  In these cases, the mean of the sample can be higher or lower than the true population mean, and because deviations are measured from the sample mean rather than the true mean, the variance of the sample tends to be lower than the population variance.  This is why we say the above equation is the *biased* sample variance.

A better estimate of the population variance is given by the **unbiased sample variance**:

$\sigma_{samp}^2 = \sum_{i} \frac{ (x_i - \mu_x)^2}{N-1}$

This is a better estimate of the population variance because the expected value of the biased sample variance is $\frac{N-1}{N} \sigma_{pop}^2$ rather than $\sigma_{pop}^2$; dividing by $N-1$ instead of $N$ removes this bias.
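A short simulation makes the bias visible. The sketch below assumes a standard normal population (variance 1) and a small sample size of 5; the helper names, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random

def biased_variance(xs):
    """Sum of squared deviations from the sample mean, divided by N."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_variance(xs):
    """Same sum of squares, divided by N - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

random.seed(0)   # fixed seed so the run is repeatable
N = 5            # a small sample size makes the bias easy to see
trials = 20000

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]  # population variance is 1
    biased_total += biased_variance(sample)
    unbiased_total += unbiased_variance(sample)

mean_biased = biased_total / trials      # should land near (N - 1) / N = 0.8
mean_unbiased = unbiased_total / trials  # should land near 1.0
print(mean_biased, mean_unbiased)
```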

# Combinatorics

To count the number of possible outcomes, create a tree diagram.  For instance, if we have 4 plants that can be sold in 3 different types of pots, the diagram would have 4 initial branches, with 3 branches each, for a total of 3 times 4 (12) possibilities.

For $n$ options, the number of ordered arrangements (permutations) of $k$ items is:
$_nP_k= \frac{n!}{(n-k)!}$

If we assume that the order doesn't matter, then the number of ways to
choose $k$ items from $n$ options is given by the number of possible permutations of length $k$ divided by the number of ways to order those $k$ items ($k!$).  That is:
${n \choose k}  = \frac{n!}{k!(n-k)!}$
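The counting rules above can be sketched with Python's `math` and `itertools` modules. The plant and pot labels are made up for illustration, and $n = 5$, $k = 3$ are arbitrary.

```python
import itertools
import math

# Tree diagram: every plant paired with every pot (labels are hypothetical).
plants = ["fern", "ivy", "cactus", "orchid"]
pots = ["clay", "plastic", "ceramic"]
outcomes = list(itertools.product(plants, pots))  # 4 * 3 pairs

n, k = 5, 3
n_perm = math.perm(n, k)  # n! / (n - k)!
n_comb = math.comb(n, k)  # n! / (k! (n - k)!)

# Brute-force check: enumerate the ordered and unordered selections directly.
brute_perm = len(list(itertools.permutations(range(n), k)))
brute_comb = len(list(itertools.combinations(range(n), k)))

print(len(outcomes), n_perm, n_comb)
```

Each combination corresponds to $k!$ permutations, so `n_perm` equals `n_comb * math.factorial(k)`.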

# Binomial distribution

The binomial distribution, with parameters $n$ and $p$, describes the discrete probability distribution of the number of successes in a sequence of $n$ independent yes/no experiments, each with a probability of success given by $p$.

The probability of having exactly $k$ successes out of $n$ events is given by:
$P( x = k | n,p) = {n \choose k}p^k (1-p)^{n-k}$

In words, this is the probability of having exactly $k$ successes and $n-k$ failures happen in one particular order, multiplied by the number of ways to arrange $k$ successes within $n$ events.

The expectation value of a binomial distribution is the number of trials times the probability of success:
$E = np$

The variance of a binomial distribution is given by:

$\sigma^2 = \text{var}(\sum_i X_i) = n \; \text{var}(X_i)=np(1-p)$

where we have used the fact that a binomial random variable is the sum of $n$ independent Bernoulli trials, so the variances add.
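The formulas for the binomial probability, mean, and variance can be checked by summing over the whole distribution. This is a minimal sketch; `binomial_pmf` is a helper written here for illustration, and $n = 10$, $p = 0.3$ are arbitrary.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                            # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))              # should equal n * p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf)) # should equal n * p * (1 - p)

print(total, mean, var)
```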


# Bernoulli distribution

The Bernoulli distribution is a special case of the binomial distribution, where n = 1.  In this case, $k$ can equal 0 or 1, since we either have one success or no successes.

The mean of a Bernoulli distribution is:

$\mu = p$

The variance of a Bernoulli distribution is:

$\sigma^2 = E[x^2] - E[x]^2 = p - p^2 = p(1-p)$


# Beta distribution 

Given a binomial process, where we know the number of successes and trials, the beta distribution describes the probability density of the true success rate being equal to $p$.  In general, for $0\le p \le 1$, with shape parameters $\alpha$ and $\beta$ (starting from a uniform prior, $\alpha$ is the number of successes plus one and $\beta$ is the number of failures plus one):

$P(p|\alpha,\beta) = N p^{\alpha-1} (1-p)^{\beta-1}$

where $N$ is a normalization constant to make the area of the beta distribution equal to one.

The mean and variance of a beta distribution are:

$\mu = \frac{\alpha}{\alpha + \beta}$

$\text{Var}(p) = \frac{ \alpha \beta }{(\alpha + \beta)^2(\alpha + \beta +1)}$
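The closed-form mean and variance can be compared against a numerical integration of the density. The sketch below assumes the standard normalization constant $N = \Gamma(\alpha+\beta) / (\Gamma(\alpha)\Gamma(\beta))$, which the text above leaves unspecified, and uses arbitrary shape parameters.

```python
import math

def beta_pdf(p, a, b):
    """Beta density; the normalization uses the gamma function (an assumption
    about N, which is not spelled out in the notes above)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

a, b = 8.0, 4.0  # illustrative shape parameters

mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))

# Midpoint-rule integration over [0, 1] as an independent check.
steps = 20000
grid = [(i + 0.5) / steps for i in range(steps)]
density = [beta_pdf(p, a, b) for p in grid]

area = sum(density) / steps
numeric_mean = sum(p * d for p, d in zip(grid, density)) / steps
numeric_var = sum((p - numeric_mean) ** 2 * d for p, d in zip(grid, density)) / steps

print(area, mean, numeric_mean, var, numeric_var)
```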


# Poisson distribution

A Poisson distribution describes the probability that $k$ events will occur in a fixed interval of time or space, assuming we have a known average number of occurrences $\mu$, and that the events are independent.

$P(k|\mu) = \frac{\mu^k e^{-\mu}}{k!}$

The variance of a Poisson distribution is equal to the mean:

$\text{Var}(k) = \mu$

Hence, in counting experiments the standard deviation is equal to the square root of the number of counts.

Note that the Poisson distribution is different from the binomial distribution in a few ways:

1) The binomial distribution has a finite number of trials, while the Poisson distribution has no upper limit on the number of events (i.e., there is no limit to how many times a flood could happen)

2) The binomial distribution has two possible outcomes per trial (a success or a failure), while the Poisson distribution counts an unlimited number of possible outcomes (0, 1, 2, 3, ...)

Essentially, a Poisson distribution models counting experiments, and a binomial distribution models hit/miss experiments.

The Poisson distribution is the limit of the binomial distribution when $p$ goes to zero and $n$ goes to infinity, with the mean $\mu = np$ held fixed.
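The limiting relationship can be seen numerically by holding $\mu = np$ fixed while $n$ grows. This is a sketch with helper functions written for illustration; $\mu = 3$ and $k = 2$ are arbitrary.

```python
import math

def poisson_pmf(k, mu):
    """Probability of k events when the average count is mu."""
    return mu ** k * math.exp(-mu) / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

mu = 3.0
k = 2
target = poisson_pmf(k, mu)

# Keep n * p = mu fixed while n grows and p shrinks.
gaps = []
for n in (10, 100, 10000):
    p = mu / n
    gap = abs(binomial_pmf(k, n, p) - target)
    gaps.append(gap)
    print(n, binomial_pmf(k, n, p), target, gap)
```

The gap between the two probabilities shrinks as $n$ increases.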




# Normal distribution

A normal distribution is described by 

$P(x | \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{ \frac{-(x-\mu)^2}{2\sigma^2}}$

where $\mu$ is the mean, and $\sigma$ is the standard deviation of the distribution.

A normal distribution follows the 68-95-99.7 rule.  That is, 68% of the data will fall within one $\sigma$ of the mean, 95% will fall within two $\sigma$, and 99.7% will fall within three $\sigma$.
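The 68-95-99.7 rule can be recovered from the normal cdf. The sketch below uses `statistics.NormalDist` from the Python standard library with a standard normal distribution; any other $\mu$ and $\sigma$ would give the same fractions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of the distribution within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, fraction in within.items():
    print(f"within {k} sigma: {fraction:.4f}")
```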

# The central limit theorem

Suppose we have a sample of independent, identically distributed (i.i.d.) random variables $\{X_1, \ldots, X_n\}$.  By the law of large numbers, the sample average

$S_n = \frac{X_1 + ... + X_n}{n}$

converges to the expected value $\mu$ as $n$ increases.  The central limit theorem states that the distribution of sample means (from samples of size $N$) will be approximately normal, with mean equal to the population mean, and standard deviation equal to the population standard deviation divided by the square root of the sample size:

$\mu_{\text{dist}} = \mu_{\text{pop}}$

$\sigma_{\text{dist}} = \frac{\sigma_{\text{pop}}}{\sqrt{N}}$
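A simulation illustrates both results. The sketch below assumes a uniform population on $[0, 1]$ (mean $0.5$, standard deviation $\sqrt{1/12}$), which is clearly not normal; the sample size, number of repetitions, and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(0)    # fixed seed so the run is repeatable
N = 30            # size of each sample
n_samples = 5000  # number of samples drawn

pop_mean = 0.5                 # mean of a uniform(0, 1) population
pop_sd = math.sqrt(1.0 / 12.0) # standard deviation of a uniform(0, 1) population

sample_means = []
for _ in range(n_samples):
    sample = [random.random() for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

dist_mean = statistics.fmean(sample_means)  # should be close to pop_mean
dist_sd = statistics.stdev(sample_means)    # should be close to pop_sd / sqrt(N)
predicted_sd = pop_sd / math.sqrt(N)

print(dist_mean, dist_sd, predicted_sd)
```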


# Skewness of a distribution

The skew of a histogram can be measured by the mean minus the median ($\nu$), divided by the standard deviation:

Skew = $\frac{\mu-\nu}{\sigma}$

A positive skew indicates a right-tailed distribution, and a negative skew indicates a left-tailed distribution.
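The sign convention can be checked on small made-up samples. The sketch below takes $\nu$ in the formula to be the median and uses the population standard deviation; the function name and data are illustrative.

```python
import statistics

def nonparametric_skew(data):
    """(mean - median) / standard deviation, taking nu to be the median."""
    mu = statistics.fmean(data)
    nu = statistics.median(data)
    sigma = statistics.pstdev(data)
    return (mu - nu) / sigma

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]     # long tail toward large values
left_tailed = [-x for x in right_tailed]     # mirror image: tail toward small values

print(nonparametric_skew(right_tailed), nonparametric_skew(left_tailed))
```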

# Probability

For independent events:

$P(A \cap B) = P(A) P(B)$

For any two events (independent or not):

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$


Bayes' Theorem:

$P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B | A)P(A)}{P(B)}$

The Law of Total Probability:

$P(A) = \sum\limits_{i=1}^n P(A \cap B_i) = \sum\limits_{i=1}^n P(A|B_i)P(B_i)$

where the events $B_i$ partition the sample space, for instance $B$ and not $B$.
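As a worked example, Bayes' theorem and the law of total probability can be combined for a diagnostic test. All the probabilities below are made-up numbers for illustration: $A$ is a positive test result, and the partition is "has the condition" versus "does not".

```python
# Hypothetical inputs (not from any real test).
p_condition = 0.01            # P(B): prior probability of the condition
p_pos_given_condition = 0.95  # P(A | B)
p_pos_given_healthy = 0.05    # P(A | not B)

# Law of total probability over the partition {B, not B}.
p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1.0 - p_condition))

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A).
p_condition_given_pos = p_pos_given_condition * p_condition / p_positive

print(p_positive, p_condition_given_pos)
```

Even with an accurate test, the low prior keeps the posterior probability small in this example.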

# Types of Errors

Type 1 error: Incorrectly rejecting a true null hypothesis

Type 2 error: Incorrectly retaining a false null hypothesis

# Resampling Methods

In statistics, we want to ask a question of a population by taking a small sample from the population. Since we are randomly sampling the population, it's unlikely that the sample statistics will be exactly equal to the true population statistics.  If we could take multiple samples from the population, we could measure the variance of the sample statistics to produce confidence intervals that will likely contain the true population mean. In practice, it's often not possible to take multiple samples, so instead we either make some assumptions about the underlying distribution of the population, or use the information in the sample we actually have to learn about it.

If we decide to make assumptions, we are creating a parametric model to represent the population.  For instance, we might assume that the population is Normal or Binomial. We can use the parameters of these models to estimate the variance of the sample means, and produce confidence intervals that likely contain the true population mean.

If we don't want to, or can't, make assumptions about the underlying distribution, we can try to learn about it from the sample we collected instead. Many of these non-parametric techniques focus on resampling the sample that we collected to learn about the variance of the sample statistics.  This is possible by treating the sample itself as a population that approximates the original population.  The two most popular resampling methods are:

**Bootstrapping Resampling**:

In the case of bootstrapping:

1) Collect a sample of size N from the population

2) Resample your original sample with replacement, to produce a new sample of size N

3) Repeat this process many times, and create a distribution of the means of the bootstrapped samples

The variance of the population statistic can be inferred from the bootstrap statistics using:

$\text{Var} = \frac{1}{n-1} \sum\limits_{i=1}^{n} (\bar{x_i} - \bar{x})^2$

where $\bar{x_i}$ is the statistic from the $i$-th bootstrap sample, $\bar{x}$ is the average of the bootstrap statistics, and $n$ is the number of bootstrap samples. Note that for the mean, the variance of the bootstrap means converges to the biased sample variance divided by the sample size $N$ as $n \rightarrow \infty$, which estimates the variance of the sample mean.

The disadvantage of bootstrapping is that it is computationally intensive, and relies on your sample being an accurate representation of the population.
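The bootstrap steps above can be sketched as follows. The "collected" sample here is simulated from a normal distribution purely as a stand-in for real data, and the number of bootstrap samples and the seed are arbitrary.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Step 1: a stand-in for the sample collected from the population.
sample = [random.gauss(10.0, 2.0) for _ in range(50)]
N = len(sample)

# Steps 2 and 3: resample with replacement many times, recording each mean.
n_boot = 2000
boot_means = []
for _ in range(n_boot):
    resample = random.choices(sample, k=N)  # sampling with replacement
    boot_means.append(statistics.fmean(resample))

# Variance of the bootstrap means (uses the 1 / (n - 1) normalization).
boot_var = statistics.variance(boot_means)

# Textbook estimate of the variance of the sample mean, for comparison.
analytic_var = statistics.variance(sample) / N

print(boot_var, analytic_var)
```

The two estimates should be of similar size; they are not expected to match exactly because the bootstrap is itself a random procedure.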

**Jackknife Resampling**:

Jackknife is another form of resampling in which we remove one of the samples, calculate the statistic of interest, and then replace the sample.  This process is repeated until we have removed each of the samples once.  The variance of the statistic is then given by:

$\text{Var} = \frac{n-1}{n} \sum\limits_{i=1}^{n} (\bar{x_i} - \bar{x})^2$

Where $\bar{x}$ is the average of the jackknife statistics.  This process is less computationally intensive than bootstrapping, but can only estimate the variance of a statistic, rather than estimate the entire distribution of the sample means.
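For the sample mean, the jackknife variance can be compared with the usual estimate $s^2/n$. This is a minimal sketch on made-up data; the variable names are illustrative.

```python
import statistics

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # illustrative data
n = len(sample)
total = sum(sample)

# Leave-one-out means: remove each point once and recompute the mean.
loo_means = [(total - x) / (n - 1) for x in sample]

jack_mean = sum(loo_means) / n
jack_var = (n - 1) / n * sum((m - jack_mean) ** 2 for m in loo_means)

# For the mean, this reproduces the unbiased sample variance divided by n.
analytic_var = statistics.variance(sample) / n

print(jack_var, analytic_var)
```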

# Hypothesis testing

When hypothesis testing, we first define the null and alternative hypotheses.  Then, assuming the null hypothesis is true, we calculate the probability (the p-value) of observing a test statistic at least as extreme as ours in the direction of the alternative hypothesis. If the p-value is small, typically less than 0.05, then we reject the null hypothesis.

## Bonferroni Correction
If we make multiple comparisons by performing multiple hypothesis tests, then we will eventually observe a low p-value by random chance.  To account for this, we divide the threshold for significance by the number of hypothesis tests being performed.  For instance, in A/B/C testing, we perform 3 tests (A/B, A/C, B/C), so we would want a p-value lower than 0.05/3 for significance.
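The correction amounts to a one-line change in the threshold. In the sketch below the three p-values are hypothetical numbers invented for the A/B/C example.

```python
alpha = 0.05

# Hypothetical p-values for the three pairwise comparisons.
p_values = {"A/B": 0.030, "A/C": 0.004, "B/C": 0.200}

# Bonferroni: divide the significance threshold by the number of tests.
threshold = alpha / len(p_values)

significant = {name: p < threshold for name, p in p_values.items()}
print(threshold, significant)
```

Note that the A/B comparison would pass an uncorrected 0.05 threshold but fails the corrected one.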

## One Sample Z-test

If we know the true population standard deviation, we can assume that our sample mean was drawn from a normal sampling distribution with standard error equal to $\frac{\sigma_{pop}}{\sqrt{N}}$.  In this case, the number of standard errors that our sample mean is from the true population mean $\mu_{pop}$ is given by the z-statistic:

$z = \frac{\mu_{samp} - \mu_{pop}}{\frac{\sigma_{pop}}{\sqrt{N}}}$

We can use a z-table or the 68-95-99.7 normal distribution rule to convert the number of standard errors to a p-value to calculate how likely we would be to observe this sample mean from a population with a known mean, or to produce confidence intervals on an unknown mean.
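In code, the z-table lookup can be replaced with the normal cdf from `statistics.NormalDist`. The population and sample values below are invented for illustration, and the two-sided p-value is one possible choice of alternative hypothesis.

```python
from statistics import NormalDist

# Hypothetical inputs.
mu_pop = 100.0     # population mean under the null hypothesis
sigma_pop = 15.0   # known population standard deviation
mu_samp = 103.0    # observed sample mean
N = 100            # sample size

standard_error = sigma_pop / N ** 0.5
z = (mu_samp - mu_pop) / standard_error

# Two-sided p-value: probability of a sample mean at least this far from mu_pop.
p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))

print(z, p_two_sided)
```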

## One Sample T-test

If we don't know the true population variance, we can approximate it with the unbiased sample variance:

$\sigma_{samp}^2 = \frac{1}{N-1} \sum\limits_{i=1}^N (\mu_{samp} - x_i)^2$

Because the population variance is estimated rather than known, we assume that the sampling distribution is a t-distribution with $N-1$ degrees of freedom, and calculate a t-score instead of a z-score:

$t = \frac{\mu_{samp} - \mu_{pop}}{\frac{\sigma_{samp}}{\sqrt{N}}}$

To extract p-values from the t-distribution, we use a t-table instead of a z-table.
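The t-score itself is straightforward to compute. The sketch below uses made-up measurements and a hypothetical population mean; it stops at the t-score and degrees of freedom, since the Python standard library has no t-distribution and the p-value would come from a t-table or a statistics package.

```python
import statistics

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]  # illustrative measurements
mu_pop = 5.0                              # hypothesized population mean
N = len(sample)

mu_samp = statistics.fmean(sample)
sigma_samp = statistics.stdev(sample)  # square root of the unbiased sample variance

standard_error = sigma_samp / N ** 0.5
t = (mu_samp - mu_pop) / standard_error
dof = N - 1  # degrees of freedom for the t-table lookup

print(t, dof)
```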

## Two Sample T-test

If we have two samples, and want to determine if they are statistically different, then we can use a two sample t-test.  This involves using the CLT to estimate the sampling distributions that each sample mean originated from, and then producing a sampling distribution for the difference of the sample means.  For samples of size $N$ and $M$, the resulting distribution has a variance of

$\sigma^2 = \frac{\sigma_1^2}{N} + \frac{\sigma_2^2}{M}$

and a mean equal to the difference of the two population means.  If we are testing the hypothesis that the distributions are different, we assume a null hypothesis of the difference in the means being zero, and calculate the probability of observing a difference as large or larger than the difference of our two sample means.
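The test statistic for the difference of means can be sketched as below. The two samples are made up, the population variances are replaced by unbiased sample variances (an assumption, giving the unequal-variance form of the statistic), and the p-value lookup is left to a t-table or a statistics package.

```python
import statistics

# Illustrative samples of sizes N and M.
sample_1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
sample_2 = [4.8, 5.0, 4.7, 5.3, 4.9]
N, M = len(sample_1), len(sample_2)

mean_diff = statistics.fmean(sample_1) - statistics.fmean(sample_2)

# Variance of the difference of sample means: sigma_1^2 / N + sigma_2^2 / M,
# with each sigma^2 estimated by the unbiased sample variance.
var_diff = statistics.variance(sample_1) / N + statistics.variance(sample_2) / M

# Under the null hypothesis the true difference is zero.
t = (mean_diff - 0.0) / var_diff ** 0.5

print(mean_diff, var_diff, t)
```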


## $\chi^2$ Test

We use Pearson's $\chi^2$ test to determine if two sets of data come from the same distribution.  The $\chi^2$ test statistic is given by:

$\chi^2 = \sum\limits_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$

where $E_i$ is the expected value, and $O_i$ is the observed value.  We can take the expected value to be a theoretical distribution, or to be the distribution assuming a null hypothesis is true.  It's also common to use this as a test of independence.  For instance, we assume that A and B are independent so that P(A|B) = P(A).  We take these values as the expected values, and compare to the observed values to see if the p-value is consistent with our assumption of independence.

A $\chi^2$ distribution with $k$ degrees of freedom describes the distribution of the sum of the squares of $k$ independent standard normal random variables (with $\mu = 0$ and $\sigma = 1$).  Since Pearson's $\chi^2$ statistic subtracts the expected value, and normalizes to the magnitude of the value, the statistic approaches a $\chi^2$ distribution with $N-1$ degrees of freedom as $N$ increases.  To convert the $\chi^2$ statistic to a p-value, we use a $\chi^2$ table.
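As an example of the statistic, the sketch below tests made-up counts from 120 rolls of a die against the expectation for a fair die. It assumes the degrees of freedom are the number of categories minus one, and leaves the p-value lookup to a $\chi^2$ table.

```python
# Hypothetical counts for faces 1 through 6 from 120 rolls.
observed = [18, 22, 16, 25, 19, 20]

# A fair die predicts the same expected count for every face.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1  # number of categories minus one

print(chi2, dof)
```

The resulting statistic is then compared with the tabulated critical value for `dof` degrees of freedom at the chosen significance level.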

Note that, for counting experiments, the mean of a Poisson distribution is equal to the variance.  Therefore, the $\chi^2$ statistic can be written:

$\chi^2 = \sum\limits_{i=1}^n \frac{(O_i - E_i)^2}{\sigma_{pop}^2} = \frac{(N-1)\sigma_{samp}^2}{\sigma_{pop}^2}$


## ANOVA Tables

An ANOVA table tests for statistical significance between multiple groups.  The ANOVA table breaks the total variance of the data into two parts -- the variance within the groups of samples, and the variance between the groups of samples:

SST = SSB + SSW
 
The total sum of squares is given by the sum of squares of all of our data combined:

SST $= \sum\limits_{i=1}^k \sum\limits_{j=1}^{n_i} (y_{ij}-\bar{y})^2$

Where the first sum runs over the groups, and the second sum runs over the samples within each group.  The average of all of our samples is denoted by $\bar{y}$.

The sum of squares between groups is given by the squared difference of each group's average with respect to the sample average, weighted by the number of samples within each group:

SSB $= \sum\limits_{i=1}^k n_i(\bar{y}_i - \bar{y})^2$

The sum of squares within groups is given by the squared difference of each group's samples with respect to the group's mean:

SSW $= \sum\limits_{i=1}^k \sum\limits_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

To test for statistical difference between the groups, we calculate the F-statistic, which is the ratio of explained variance to unexplained variance, each divided by its degrees of freedom (dof):

$F = \frac{SSB}{\text{dof groups}} \cdot \frac{\text{dof observations}}{SSW} = \frac{SSB}{\text{groups} - 1} \cdot \frac{\text{observations} - \text{groups}}{SSW}$
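The decomposition and the F-statistic can be worked through on a small example. The three groups below are made-up values chosen so the sums of squares come out to round numbers.

```python
# Three illustrative groups of three observations each.
groups = [
    [4.0, 5.0, 6.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]

all_values = [y for group in groups for y in group]
n_obs = len(all_values)
k = len(groups)

grand_mean = sum(all_values) / n_obs
group_means = [sum(group) / len(group) for group in groups]

# Total, between-group, and within-group sums of squares.
sst = sum((y - grand_mean) ** 2 for y in all_values)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

# F = (SSB / (groups - 1)) / (SSW / (observations - groups)).
f_stat = (ssb / (k - 1)) / (ssw / (n_obs - k))

print(sst, ssb, ssw, f_stat)
```

The F-statistic would then be looked up in an F-distribution table with `k - 1` and `n_obs - k` degrees of freedom.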

## ANOVA in regression tables

Note that in regression problems, we can treat each of the individual points as its own group (so $n_i$ = 1 for all groups) with an average equal to the predicted value.  This allows us to calculate the p-value of our regression by looking up the F-score in an F-distribution table.  This also allows us to calculate the coefficient of determination ($R^2$), which tells us the fraction of variance which is explained by our model:

$R^2 = 1 - \frac{SSW}{SST}$
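As an illustration, $R^2$ can be computed for a simple least-squares line. The data points are made up, and the closed-form slope and intercept used here are the standard ordinary least squares estimates for a single predictor, which the notes above do not cover.

```python
# Illustrative data that is close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares fit for a single predictor.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

predicted = [intercept + slope * xi for xi in x]

# SSW: squared residuals about the predictions; SST: squared deviations about the mean.
ssw = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
sst = sum((yi - y_mean) ** 2 for yi in y)

r_squared = 1.0 - ssw / sst
print(slope, intercept, r_squared)
```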

## Power Analysis

In a power analysis, we are trying to calculate the probability of accepting the alternative hypothesis, given that it is true.  This is equivalent to the probability of not committing a Type II error.  To calculate this, we do the following:

1) Create the sampling distribution under the null hypothesis, and determine the critical value for which we would reject the null hypothesis.

2) Create the sampling distribution under the alternative hypothesis.  Calculate the fraction of the distribution which lies above the critical value.  This area is the power: the probability of rejecting the null hypothesis given that the alternative hypothesis is true.

We typically want a power greater than 0.8 in our analysis.
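The two steps above can be sketched with `statistics.NormalDist`. The example assumes a one-sided z-test with a known population standard deviation; all the numbers are hypothetical.

```python
from statistics import NormalDist

# Hypothetical setup for a one-sided test (alternative mean above null mean).
mu_null = 100.0
mu_alt = 106.0
sigma_pop = 15.0
N = 36
alpha = 0.05

standard_error = sigma_pop / N ** 0.5

# Step 1: sampling distribution under the null, and its critical value.
null_dist = NormalDist(mu_null, standard_error)
critical_value = null_dist.inv_cdf(1.0 - alpha)

# Step 2: fraction of the alternative's sampling distribution above the critical value.
alt_dist = NormalDist(mu_alt, standard_error)
power = 1.0 - alt_dist.cdf(critical_value)
type_2_error = 1.0 - power  # probability of a Type II error

print(critical_value, power, type_2_error)
```

In this made-up setup the power falls slightly short of 0.8, which would suggest collecting a larger sample.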


# Tests of Normality

The Kolmogorov-Smirnov test compares the empirical cdf (the sum of the probability distribution up to $x_i$) of our data to a theoretical cdf.  The maximum distance between the two distributions is used as the KS-statistic, which can be converted using a table into a p-value for the null hypothesis that the data were drawn from the theoretical distribution.

The QQ Plot plots the z-score versus the sorted sample data.  For instance, if we have 10 ordered samples, we divide the 0-100% probabilities into 10 bins.  We look up the corresponding z-scores for those probabilities, and plot the z-score versus the ordered samples.  A straight diagonal line indicates that the sample distribution is approximately normal.
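Both checks can be sketched without plotting. The data below are simulated from a standard normal distribution as a stand-in for real measurements, the reference distribution is assumed to have known parameters, and the plotting positions $(i - 0.5)/n$ for the QQ points are one common convention rather than the only choice.

```python
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so the run is repeatable
data = sorted(random.gauss(0.0, 1.0) for _ in range(200))
n = len(data)
reference = NormalDist(0.0, 1.0)  # theoretical distribution being tested against

# KS statistic: largest gap between the empirical cdf and the theoretical cdf.
# The empirical cdf jumps at each data point, so both sides of the jump are checked.
ks_stat = 0.0
for i, x in enumerate(data, start=1):
    cdf = reference.cdf(x)
    ks_stat = max(ks_stat, i / n - cdf, cdf - (i - 1) / n)

# QQ points: theoretical z-scores paired with the ordered samples.
theoretical_z = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
qq_points = list(zip(theoretical_z, data))

print(ks_stat, qq_points[0], qq_points[-1])
```

Plotting `qq_points` with any plotting library should give a roughly straight diagonal line for data like these.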