# Probability and Statistics for Engineering and the Sciences

I've taken courses in business statistics (bleh), probability (much better!), and stochastic processes (fascinating and difficult and fascinating); now I'm reading this book to start stapling all of my knowledge together. Jupyter notebooks are ideal because they're laptop-on-my-lap-on-the-train compatible.

Basically this is me self-studying Statistics II. After this I'm reading a Bayesian book.

* A **trimmed mean** removes some lowest and highest percentile of the data (outlier correction). The median can be thought of as a completely trimmed mean.
* The mean includes the full weight of outliers, the median takes zero stock of them. Sometimes you want a measure somewhere in between, which is when a trimmed mean is useful (more analytical than simply "discard these five values as being too outlier-y").
* **Box plot** bits:
 * The **lower fourth** and **upper fourth** (roughly the first and third quartiles) are the lower-half and upper-half medians, respectively; these are the bottom and top of the box. The distance between the two is the **fourth spread**, $f_4$ (essentially the interquartile range), a statistic that is resistant to outliers.
 * Line down the middle is the median (not mode!).
 * Whiskers extend on either side to the most extreme observation still within $1.5 f_4$ of the nearest fourth (the box edge, not the median).
 * **Outliers** lie more than $1.5 f_4$ beyond the nearest fourth; those between $1.5 f_4$ and $3 f_4$ away are mild, and **extreme outliers** are more than $3 f_4$ away.
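
A minimal sketch of the statistics above, using only the standard library and a made-up nine-point sample with one wild value. The helper names (`trimmed_mean`, `fourths`) are mine, and the fourths include the median in both halves when $n$ is odd, which is an assumption about the convention in use.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of points from each end."""
    x = sorted(values)
    k = int(proportion * len(x))  # number of points dropped per side
    return mean(x[k:len(x) - k])

def fourths(values):
    """Lower and upper fourths: medians of the lower and upper halves."""
    x = sorted(values)
    half = (len(x) + 1) // 2  # for odd n the median sits in both halves
    return median(x[:half]), median(x[len(x) - half:])

sample = [2, 4, 5, 6, 7, 8, 9, 10, 40]  # hypothetical data, 40 is the wild one
lower, upper = fourths(sample)
f4 = upper - lower

print("mean:", mean(sample))
print("20% trimmed mean:", trimmed_mean(sample, 0.2))
print("median:", median(sample))
print("fourths:", lower, upper, "fourth spread:", f4)

# Flag points by their distance beyond the nearest fourth.
flags = {}
for v in sample:
    distance = max(lower - v, v - upper, 0)
    if distance > 3 * f4:
        flags[v] = "extreme outlier"
    elif distance > 1.5 * f4:
        flags[v] = "mild outlier"
print(flags)
```

The wild value drags the plain mean well above the median, while the trimmed mean lands in between the two.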

**Box plot in Plotly in Jupyter**

In [1]:
import json
import plotly.tools as tls

# Set my plotly credentials.
data = json.load(open('plotly_credentials.json'))['credentials']
tls.set_credentials_file(username=data['username'], api_key=data['key'])

In [2]:
import string
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

# Populate a pandas DataFrame with randomized letter values.
# It'd be more sensible to get letter frequency -grams but /\o/\ this is an example direct from the source.
N = 100
y_vals = {}
for letter in list(string.ascii_uppercase):
    # np.random.randn(N) returns N standard normal samples; called with no arguments it returns a single float.
    # Note, randn doesn't take sigma and mu parameters, so you do the shift/scale yourself (np.random.normal does take them). Cute!
    y_vals[letter] = np.random.randn(N)+(3*np.random.randn())

df = pd.DataFrame(y_vals)
# df.head returns the first five rows of the DataFrame.
df.head()

data = []

# I prefer to use iteritems, but to each his own.
for col in df.columns:
    data.append(  go.Box( y=df[col], name=col, showlegend=False ) )

data.append( go.Scatter( x = df.columns, y = df.mean(), mode='lines', name='mean' ) )

# This both creates and displays the graph and sends it to and saves it on their server (no non-public data!).
py.iplot(data, filename='pandas-box-plot')

# cf. https://plot.ly/python/histograms-and-box-plots-tutorial/

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~ResidentMario/0 or inside your plot.ly account where it is named 'pandas-box-plot'


* **Pairwise independence** is not full (mutual) independence: every pair of events can be independent while some larger subset of them is not.
* A **Bernoulli random variable** (or Bernoulli trial) takes on a value of 0 or 1, e.g. success or failure.
* A **family of probability distributions** is organized around a **parameter**.
* **Discrete CDF**: $F(x) = P(X \leq x) = \sum_{y: y \leq x} p(y)$.
* Since CDFs are right-continuous, technically $P(X \in [a, b]) = F(b) - F(a-)$, where $F(a-)$ is the left-hand limit of $F$ at $a$ (for an integer-valued $X$ that is just $F(a - 1)$).
* **Discrete expectation**: $E(X) = \mu_x = \sum_{x \in \Omega} x \cdot p(x)$.
* **Drunk statistician's rule** (formally the law of the unconscious statistician): $E[h(X)] = \sum_{x \in \Omega} h(x) \cdot p(x)$.
* $\sigma^2(X) = \sum_{x \in \Omega}(x-\mu)^2 \cdot p(x) = E[(X-\mu)^2] = E(X^2) - [E(X)]^2$
* $s$ and $s^2$ in samples correspond with $\sigma$ and $\sigma^2$ in distributions.
* If $X \sim bin[n, p](x)$ then $E(X) = np$ and $\sigma^2(X) = npq$, where $q = 1 - p$.
* The **hypergeometric distribution** is the **binomial distribution** without replacement, i.e. without the simplifying assumption that the population is so large that the distortion of non-replacement is negligible. It combinatorially corrects for the effects of non-replacement.
* Hypergeometric facts, if $X \sim hyper[N, n_p, n](x)$ ($N$ is population, $n_p$ is successes in the population, $n$ is the draw size) then:
 * $P(X=x) = \frac{{n_p \choose x}{N - n_p \choose n - x}}{N \choose n}$
 * $E(X) = n \cdot \frac{n_p}{N}$
 * $\sigma^2(X) = \left(\frac{N - n}{N - 1}\right) \cdot npq$, where $p = n_p/N$ and $q = 1 - p$. The fractional term is known as the "finite population correction term". Is it related to the population sample correction term? Seems suspiciously similar. Also note the mixture of absolute and percentage terms...
* The **negative binomial distribution** describes the number of failures $x$ that occur before the $n$th success.
 * $P(X=x) = {x + n - 1 \choose n - 1}p^n q^x$
 * $E(X) = \frac{nq}{p}$
 * $\sigma^2(X) = \frac{nq}{p^2}$
* The $n = 1$ case, the number of failures before the first success, is the **geometric distribution**, so called because its probabilities $p q^x$ form a geometric sequence. Confusingly, some sources use the same name for the number of trials up to and including the first success, and the name gets applied loosely to the negative binomial as a whole. It's kind of confusing. But the point is that the terms of both decay geometrically (see the $q^x$ term).
* The **Poisson distribution** is what the binomial becomes in the limit: if $n \to \infty$ and $p \to 0$ while $\lambda = np$ stays constant, then $binom[n,\:p](x) \to poisson[\lambda](x)$.
* $\lambda$ is both mean and variance for a Poisson!
* Appendix I from my stochastics textbook has some good computational summaries of the calculations of the characteristics of these functions, including moment generating function calculations, which I'll leave for later.
* Interestingly the book goes off then to describe **Poisson arrival processes**, including introducing but not really explaining $o(\Delta t)$ notation. See my stochastics textbook instead for details.
* General rule of thumb for using the Poisson distribution for modeling: $n \geq 100$, $p \leq .01$, and $np \leq 20$.
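
The formulas in this list can be sanity-checked by brute force. The sketch below writes each pmf out with `math.comb` and compares the summed mean and variance against the closed forms; the parameter values are arbitrary picks of mine, and the negative binomial's infinite support is truncated at 500 terms on the assumption that the remaining tail is negligible.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def hypergeom_pmf(x, N, n_p, n):
    # x successes in n draws without replacement; n_p successes among N items
    return comb(n_p, x) * comb(N - n_p, n - x) / comb(N, n)

def negbinom_pmf(x, n, p):
    # x failures before the n-th success
    return comb(x + n - 1, n - 1) * p**n * (1 - p)**x

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

def mean_and_variance(pmf, support):
    mu = sum(x * pmf(x) for x in support)
    var = sum((x - mu)**2 * pmf(x) for x in support)
    return mu, var

# Hypergeometric: N = 20 items, n_p = 8 successes, n = 5 draws.
N, n_p, n = 20, 8, 5
p = n_p / N
hyper_mu, hyper_var = mean_and_variance(lambda x: hypergeom_pmf(x, N, n_p, n), range(n + 1))
print(hyper_mu, n * p)
print(hyper_var, (N - n) / (N - 1) * n * p * (1 - p))

# Negative binomial: failures before the 3rd success with p = 0.4.
nb_mu, nb_var = mean_and_variance(lambda x: negbinom_pmf(x, 3, 0.4), range(500))
print(nb_mu, 3 * 0.6 / 0.4)
print(nb_var, 3 * 0.6 / 0.4**2)

# Poisson approximation inside the rule of thumb: n = 1000, p = 0.005, lambda = 5.
worst_gap = max(abs(binom_pmf(x, 1000, 0.005) - poisson_pmf(x, 5.0)) for x in range(21))
print(worst_gap)
```

Each printed pair should agree to many decimal places, and the largest pointwise gap between the binomial and Poisson pmfs should be tiny.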

**Hypergeometric correction for the binomial**

Imagine that amongst $N$ balls exactly half are red. What is the probability of, in drawing two balls, getting exactly one red one, as we vary the population size?

Since we are not replacing the balls the "correct" distribution is the hypergeometric one, but as $N \to \infty$, $hypergeom[N, \frac{N}{2}, 2](1) \to binom[2, \frac{1}{2}](1) = \frac{1}{2}$.

In [3]:
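import scipy.stats  # a bare `import scipy` does not reliably load the stats subpackage

# Probability of exactly one red ball in two draws with replacement: always 0.5.
binom_probs = [scipy.stats.binom.pmf(1, 2, .5)] * len(range(2, 10000, 2)) # 0.5
# Same event without replacement, for each population size n.
# Integer division: the number of red balls has to be an integer.
hyper_probs = [scipy.stats.hypergeom.pmf(1, n, n // 2, 2) for n in range(2, 10000, 2)]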

# Create traces
trace1 = go.Scatter(
    x = list(range(2, 10000, 2)),
    y = binom_probs,
    mode = 'lines',
    name = 'binomial'
)
trace2 = go.Scatter(
    x = list(range(2, 10000, 2)),
    y = hyper_probs,
    mode = 'lines',
    name = 'hypergeometric'
)

layout = go.Layout(
    xaxis=dict(
        type='log',
        autorange=True
    ),
    yaxis=dict(
        type='linear',
        autorange=True
    )
)

data = [trace1, trace2]

# Plot and embed in ipython notebook!
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='hypergeometric-approach-to-binomial')

**Poisson approximation**

Why does the Poisson approximation work? It's a minor cousin of the central limit theorem (still in the to-do), sometimes called the law of rare events. If you explicitly write the binomial pmf out with $p = \lambda/n$ and use the limit $(1 - \lambda/n)^n \to e^{-\lambda}$ you will get the result. The math is worked out [here](http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf). I'm not sure if we did this work in stochastics, but we did something similar using Stirling's formula for Poisson arrival processes (that was harder, of course!). I'm satisfied that I get the intuition; I think it'll make more sense once I dive into moment-generating functions and revisit stuff with $e$ in it.
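
For reference, the limit can be sketched in one line. Fix $x$, set $p = \lambda/n$, and split the binomial pmf into pieces whose limits are easy to read off (this is the standard textbook argument, not necessarily the route the linked notes take):

$$\binom{n}{x}p^x(1-p)^{n-x} = \underbrace{\frac{n(n-1)\cdots(n-x+1)}{n^x}}_{\to\, 1} \cdot \frac{\lambda^x}{x!} \cdot \underbrace{\left(1-\frac{\lambda}{n}\right)^{n}}_{\to\, e^{-\lambda}} \cdot \underbrace{\left(1-\frac{\lambda}{n}\right)^{-x}}_{\to\, 1} \;\longrightarrow\; \frac{\lambda^x e^{-\lambda}}{x!}$$

The first and last factors go to 1 because $x$ is held fixed while $n \to \infty$.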

* The book then jumps into CDFs and PDFs, all pretty standard except for the inclusion of percentiles!
* $\eta(p)$ is the **percentile function** for a CDF: the value $x$ with $F(x) = p$. For a continuous, strictly increasing CDF it is the inverse map, $\eta: (0, 1) \to \mathbb{R}$ with $\eta(p) = F^{-1}(p)$.
* So to solve for it just do the $F^{-1}(p)$ inverse transform that we used to transform the uniform random variable in stochastics.
* Drunk statistician's rule is the same for continuous distributions.
* The PDF of the normal (Gaussian) distribution, in case you forgot it:
$$f[\mu, \sigma](x) = \dfrac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2 / 2\sigma^2}$$
* What a right beaut.
* The CDF of the *standard* normal ($\mu = 0$, $\sigma = 1$) is symbolized $\Phi(z)$, where $z$ is the **z-score** (really the inverse percentile).
* The standardization process is $P(X \leq x) = \Phi[(x-\mu)/\sigma]$
* The 68%-95%-99.7% rule-of-thumb for one, two, and three standard deviations.
* The sum (or mean) of a large enough number of independent samples from almost any distribution (finite variance required) is approximately normally distributed (not necessarily standard normal). The proof for why is absent in stats textbooks but there's a chapter dedicated to it in my stochastics textbook that we skipped in class which offers a partial proof. A complete proof requires analysis, it says...which I should be able to do after next semester. Ah well.
* When approximating a discrete random variable with the continuous normal it is useful to make a **continuity correction**. Remember we're approximating a bunch of histogram boxes, each one unit wide and centered on an integer, with a smooth curve, so cutting the curve off exactly at an integer loses half of the end box. The usual method is to extend the interval by .5 at each end: $P(a \leq X \leq b) \approx \Phi[(b + .5 - \mu)/\sigma] - \Phi[(a - .5 - \mu)/\sigma]$. This still contains error of course but it's nearer to the mark, and so is a good general strategy. /\o/\
* For large $n$, $binom[n, p](x) \to N(\mu = np, \sigma = \sqrt{npq})$
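
A quick numerical check of the normal facts above, sketched with the standard library's `statistics.NormalDist`. The binomial example ($n = 25$, $p = 0.4$, estimating $P(X \leq 10)$) is a choice of mine for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mu = 0, sigma = 1

# Mass within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, std_normal.cdf(k) - std_normal.cdf(-k))

# Percentile (inverse CDF): the z-score below which 97.5% of the mass lies.
z_975 = std_normal.inv_cdf(0.975)
print(z_975)

# Continuity correction for X ~ bin(n = 25, p = 0.4), estimating P(X <= 10).
n, p = 25, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(11))
uncorrected = std_normal.cdf((10 - mu) / sigma)
corrected = std_normal.cdf((10 + 0.5 - mu) / sigma)
print(exact, uncorrected, corrected)
```

The corrected estimate should sit much closer to the exact binomial sum than the uncorrected one, which cuts the box at 10 in half.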

**Other continuous distributions**

* The **gamma distribution** is an important non-symmetric distribution.
* $\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1}e^{-x} dx$
 * $\Gamma(\frac{1}{2}) = \sqrt{\pi}$
 * $\alpha > 1 \implies \Gamma(\alpha) = (\alpha - 1) \cdot \Gamma(\alpha - 1)$
 * $n \in \mathbb{N} \implies \Gamma(n) = (n - 1)!$
* $f[\alpha, \beta](x) = \frac{1}{\beta^\alpha \Gamma(\alpha)}x^{\alpha - 1}e^{-x/\beta}$, $x \geq 0$.
* With $X \sim \Gamma[\alpha, \beta](x)$, $E(X) = \alpha\beta$ and $\sigma^2 = \alpha \beta^2$

![A](https://upload.wikimedia.org/wikipedia/commons/e/e6/Gamma_distribution_pdf.svg "Gamma distribution")

* **Exponential distribution**: $f[\lambda](x) = \lambda e^{-\lambda x}$, $x \geq 0$, yada yada yada. Exponentials are **memoryless** because $P(X > t + s|X > s) = P(X > t)$. Note that it's a special case of the gamma distribution ($\alpha = 1$, $\beta = 1/\lambda$).

![A](https://upload.wikimedia.org/wikipedia/commons/e/ec/Exponential_pdf.svg "Exponential distribution")

* The book continues to valiantly talk about the Poisson arrival process without doing the math. :) Refer to stochastics.

* **Chi-squared distribution**: $f[\nu](x) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}x^{(\nu/2) - 1}e^{-x/2}$, $x \geq 0$
* $\nu$ is the **number of degrees of freedom**, the chi-squared distribution is important for probability stuff later on involving inference.
* The **Weibull distribution** is a sort of less smooth function that kind of operates like the gamma one.
* $f[\alpha, \beta](x) = \frac{\alpha}{\beta^\alpha}x^{\alpha - 1}e^{-(x/\beta)^\alpha}$, $x \geq 0$, and then so on.
* $\alpha$ is an instance of a **shape parameter**, one which distorts the shape of the distribution. $\beta$ is a **scale parameter**, one which acts principally on spread.
* In reliability engineering and other applications it is useful for its ability to model failure rates.
* The **lognormal** distribution is the distribution of $X$ when $\ln(X)$ is normal, i.e. the $\exp(\cdot)$ of a normal random variable. Genius! $f[\mu, \sigma](x) = \frac{1}{\sqrt{2\pi}\sigma x}e^{-[\ln(x)-\mu]^2/(2\sigma^2)}$, $x > 0$
* The **beta distribution** is a crazy looking thing that basically gets squeezed onto the interval $[A, B]$ (with two further shape parameters); that's enough, no need to reproduce the PDF.
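
A small sketch that checks several of the relationships in this section numerically, with the densities written out in their standard forms using only the `math` module. The evaluation point and parameter values are arbitrary choices of mine, and the Weibull CDF $1 - e^{-(x/\beta)^\alpha}$ is the standard closed form rather than something quoted from these notes.

```python
from math import exp, gamma, isclose, pi, sqrt

# Gamma function identities.
print(gamma(0.5), sqrt(pi))          # Gamma(1/2) = sqrt(pi)
print(gamma(5), 4 * 3 * 2 * 1)       # Gamma(n) = (n - 1)!
print(gamma(3.5), 2.5 * gamma(2.5))  # Gamma(a) = (a - 1) * Gamma(a - 1)

def gamma_pdf(x, alpha, beta):
    return x**(alpha - 1) * exp(-x / beta) / (beta**alpha * gamma(alpha))

def chi2_pdf(x, nu):
    return x**(nu / 2 - 1) * exp(-x / 2) / (2**(nu / 2) * gamma(nu / 2))

def expon_pdf(x, lam):
    return lam * exp(-lam * x)

def weibull_pdf(x, alpha, beta):
    return alpha / beta**alpha * x**(alpha - 1) * exp(-(x / beta)**alpha)

def weibull_cdf(x, alpha, beta):
    return 1 - exp(-(x / beta)**alpha)

x = 1.7
# Exponential(lam) is gamma with alpha = 1, beta = 1 / lam.
print(expon_pdf(x, 2.0), gamma_pdf(x, 1, 1 / 2.0))
# Chi-squared(nu) is gamma with alpha = nu / 2, beta = 2.
print(chi2_pdf(x, 5), gamma_pdf(x, 5 / 2, 2))

# Memorylessness: P(X > t + s | X > s) = P(X > t), using P(X > x) = exp(-lam * x).
lam, s, t = 2.0, 0.7, 0.4
conditional = exp(-lam * (t + s)) / exp(-lam * s)
print(conditional, exp(-lam * t))

# The Weibull pdf should be the slope of its CDF (central difference check).
h = 1e-6
slope = (weibull_cdf(x + h, 2.0, 3.0) - weibull_cdf(x - h, 2.0, 3.0)) / (2 * h)
print(slope, weibull_pdf(x, 2.0, 3.0))
```

Every printed pair should match, which is a cheap way to catch a misplaced exponent in any of the densities.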

**Probability plotting**
* Oh! These are new, and cool. You're given a bunch of data, how do you check to see that the distribution that you have in mind is an appropriate one for modeling the dataset?
* A **probability plot** (better termed the **P-P plot**) is a tool for visually checking the fit of your data.
* Arrange your $n$ sample observations from smallest to largest, taking the $i$th observation to be the $[100(i-.5)/n]$th sample percentile. In Python code, you create `[((i - .5) / n, dist.cdf(observations[i - 1])) for i in range(1, n + 1)]`, where `observations` is the sorted data and `n = len(observations)`. Now plot the tuples!
* This is a neat application of `plotly`, let's make some plots!

In [4]:
# The following data is on the compressive strength of concrete (mhm), and the supposition is that it's normal.
data = [1400, 1932, 2000, 2200, 2200, 2530, 2630, 2665, 2735, 2735, 2800, 2935, 3000, 3000,
        3030, 3065, 3065, 3065, 3170, 3200, 3235, 3260, 3335, 3365, 3465, 3500, 3600, 3600,
        3835, 4460]

# My (overfitted?) guess at the distribution parameters.
dist = scipy.stats.norm(loc=np.mean(data), scale=650)
print(dist.cdf(data[0]))
print(dist.cdf(data[len(data) - 1]))

0.00795722077215
0.989185494557


In [5]:
from plotly import tools

# Let's plot it to see if we're right or not!
percentiles = [(p - .5)/len(data) for p in range(1, len(data) + 1)]

# Create traces
trace1 = go.Scatter(
    x = list(range(0, len(data))),
    y = data,
    mode = 'markers',
    name = 'raw data'
)

trace2 = go.Scatter(
    x = percentiles,
    y = [dist.cdf(d) for d in data],
    mode = 'markers',
    name = 'normal plot'
)

trace3 = go.Scatter(
    x = percentiles,
    y = percentiles,
    mode = 'lines',
    name = 'p = p'
)

fig = tools.make_subplots(rows=1, cols=2)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)

# Plot and embed in ipython notebook!
py.iplot(fig, filename='normal-pp')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



* The **Q-Q plot** is more commonly used for this, however.

In [6]:
percentiles = [(p - .5)/len(data) for p in range(1, len(data) + 1)]

# Create traces
trace1 = go.Scatter(
    x = list(range(0, len(data))),
    y = data,
    mode = 'markers',
    name = 'raw data'
)

trace2 = go.Scatter(
    x = [dist.ppf(p) for p in percentiles],
    y = data,
    mode = 'markers',
    name = 'q - q'
)

trace3 = go.Scatter(
    x = [dist.ppf(p) for p in percentiles],
    y = [dist.ppf(p) for p in percentiles],
    mode = 'lines',
    name = 'q = q'
)

fig = tools.make_subplots(rows=1, cols=2)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)

# Plot and embed in ipython notebook!
py.iplot(fig, filename='normal-qq')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



* A population sample which is non-normal is usually one of:
 * Symmetric with lighter-than-Gaussian tails.
 * Symmetric with heavier-than-Gaussian tails.
 * Skewed.
* You can read a lighter-tailed or heavier-tailed distribution off of the against-the-normal Q-Q plot. With theoretical quantiles on the horizontal axis and observed values on the vertical axis, a lighter-tailed sample is S-curved with its ends flattening toward the horizontal (the extremes are less extreme than the normal predicts), and a heavier-tailed sample is S-curved with its ends steepening toward the vertical.
* The book gives a sample size of about 30 as a rule of thumb: above that, large deviations from the straight-line pattern are indicative of the distribution not actually being normal. Below it they could just be noise from the sample.
* You can also Q-Q plot a Weibull distribution, apparently in that case plotting $[\eta(x), ln(x)]$ works well?
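
If you want a number to go with the picture, one rough option is the correlation between the theoretical and observed quantiles: the closer to 1, the straighter the Q-Q plot. The sketch below does this for the concrete strength data from the cells above using only the standard library. Fitting with the sample mean and sample standard deviation is my assumption (the cell above eyeballs the scale instead), and the correlation is an informal stand-in for "the points hug the line", not a formal test.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Compressive strength sample from the notebook cell above.
data = [1400, 1932, 2000, 2200, 2200, 2530, 2630, 2665, 2735, 2735, 2800, 2935, 3000, 3000,
        3030, 3065, 3065, 3065, 3170, 3200, 3235, 3260, 3335, 3365, 3465, 3500, 3600, 3600,
        3835, 4460]

observed = sorted(data)
n = len(observed)

# Normal fitted by sample mean and sample standard deviation (an assumption).
fitted = NormalDist(mean(observed), stdev(observed))
theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def correlation(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx)**2 for x in xs)
    syy = sum((y - my)**2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(theoretical, observed)
print("Q-Q correlation:", r)

# Eyeball the tails: theoretical quantile next to the observed value.
pairs = list(zip(theoretical, observed))
for q, x in pairs[:3] + pairs[-3:]:
    print(round(q), x)
```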

**Joint and multivariate probabilities**

* $P[(X,Y) \in A] = \sum_{(x,y) \in A} f(x,y)$ (discrete)
* $p_X(x) = \sum_{y \in Y} f(x,y)$
* $P[(X, Y) \in A] = \iint_A f(x,y)dx dy$ (continuous)
* $f_X(x) = \int_{-\infty}^\infty f(x, y) dy$
* $f_Y(y) = \int_{-\infty}^\infty f(x, y) dx$
* $X \bot Y \leftrightarrow f(x,y) = f_X(x)f_Y(y)$
* $f_{Y|X}(y|x) = \frac{f(x,y)}{f_X(x)}$
* Drunk statistician's rule holds in the multivariate case.
* $Cov(X,Y) = E[(X-E[X])(Y-E[Y])]$
 * $\sum_{x \in X} \sum_{y \in Y} (x - E[X]) (y - E[Y]) \cdot f_{X,Y}(x,y)$, discrete
 * $\int_{-\infty}^\infty \int_{-\infty}^\infty (x - E[X])(y - E[Y])f_{X,Y}(x,y)dx dy$, continuous
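* Covariance works because if $X$ and $Y$ are positively correlated, the deviations $(X - E[X])$ and $(Y - E[Y])$ tend to share a sign, so their products will be $(+)(+)$ or $(-)(-)$, both positive; if instead the products are mostly $(+)(-)$ or $(-)(+)$, that indicates negative correlation.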
* Zero or near-zero covariance (when variables are **uncorrelated**) is not indicative of independence, just of this particular relationship canceling itself out.
* Shortcut we used all the time in stochastics, $Cov(X,Y) = E[XY] - E[X]E[Y]$
* $\rho_{X,Y} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y}$
* Covariance depends on the unit of measurement, correlation does not. Thus correlation standardizes the covariance measure to $[-1, 1]$, which is necessary for comparative purposes.
* Independent random variables have a covariance/correlation of 0, but the latter does not imply the former (as stated above).
* Interestingly, $\rho = 1$ or $\rho = -1$ iff $Y$ is a linear transform of $X$, e.g. $Y = aX + b$ with $a \neq 0$.
* The book defines **iid rvs**. :)
* The expectation of a sum is the sum of the expectations.
* The variance of a sum is the sum of the variances only when the random variables are uncorrelated (a weaker condition than independence).
* At this point the book veers into avoiding moment generating functions. This would be a good time to learn these.
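
The "uncorrelated does not mean independent" point is easy to see on a tiny joint pmf. In the sketch below (a textbook-style example of my choosing, not one from the book) $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so $Y$ is completely determined by $X$ and yet the covariance is exactly zero. Fractions keep the arithmetic exact.

```python
from fractions import Fraction

# Joint pmf of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expect(h):
    """E[h(X, Y)] by summing h over the joint pmf."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

def marginal_x(x0):
    return sum(p for (x, _), p in joint.items() if x == x0)

def marginal_y(y0):
    return sum(p for (_, y), p in joint.items() if y == y0)

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)

cov_definition = expect(lambda x, y: (x - ex) * (y - ey))
cov_shortcut = expect(lambda x, y: x * y) - ex * ey
print(cov_definition, cov_shortcut)

# Independence would need f(x, y) = f_X(x) * f_Y(y) for every pair.
print(joint[(0, 0)], marginal_x(0) * marginal_y(0))
```

Both covariance formulas give zero, while the joint probability at $(0, 0)$ differs from the product of the marginals, so the variables are dependent.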

**Moment-generating functions**

* Their purpose is to act as an alternative description of a probability distribution, one with its own set of operations which are surprisingly useful for understanding probability.
* $M_X (t) := E[e^{tX}]$
* Sometimes this function does not exist.
* I will need to go back through and review linear algebra and apply it to moment-generating to figure them out. Which is fine, since it's a higher-level theoretical result and I'm not there yet.
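
To make the definition a little less abstract: the name comes from the standard fact that derivatives of $M_X(t)$ at $t = 0$ give the moments, $M_X'(0) = E[X]$ and $M_X''(0) = E[X^2]$. The sketch below checks this numerically for the binomial, whose MGF is $(q + pe^t)^n$, using finite differences. The parameter values and step size are arbitrary choices of mine.

```python
from math import exp

def binomial_mgf(t, n, p):
    """MGF of bin(n, p): E[e^{tX}] = (q + p * e^t)^n."""
    return (1 - p + p * exp(t))**n

n, p = 10, 0.3
h = 1e-4  # finite-difference step

def mgf(t):
    return binomial_mgf(t, n, p)

# Central differences approximate the first and second derivatives at t = 0.
first_moment = (mgf(h) - mgf(-h)) / (2 * h)
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2
variance = second_moment - first_moment**2

print(first_moment, n * p)            # E[X] = np
print(variance, n * p * (1 - p))      # Var(X) = npq
```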

**Central limit theorem**

* Moment-generating functions (and analysis, as previously mentioned) are necessary for providing an analytical proof of the central limit theorem.
* Look into convolution and inverse Laplace transforms also at that time?
* A linear combination $Y = \sum_i a_i X_i$ of independent normal random variables is itself normal, with mean $\mu_Y = \sum_i a_i \mu_i$ and variance $\sigma^2_Y = \sum_i a_i^2 \sigma_i^2$
* $binom[n, p](x) \to N(\mu = np, \sigma = \sqrt{npq})$ (from earlier) is a subcase of the CLT, as, for sufficiently large $n$, taking $n$ samples $X_1, ..., X_n$ from $X$ with mean $\mu_X$ and variance $\sigma^2_X$:
 * $\bar{X} \sim N(\mu=\mu_X, \sigma_{\bar{X}}^2 = \sigma_X^2/n)$, approximately
 * $T \sim N(\mu = n\mu_X, \sigma_T^2 = n\sigma_X^2)$, approximately, where $T = \sum X_i$ is the sample total (this random second part is included in the book?)
* Rule of thumb for the application of the CLT is $n > 30$.
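
A simulation sketch of the claim about $\bar{X}$, using a decidedly non-normal parent: the exponential with $\lambda = 1$, which has $\mu_X = 1$ and $\sigma_X = 1$. The sample size of 30 matches the rule of thumb; the number of replications and the seed are arbitrary choices of mine.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)
n = 30               # observations per sample
replications = 5000  # number of sample means to collect

sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(replications)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
print(mean_of_means, 1.0)          # should be near mu_X = 1
print(sd_of_means, 1.0 / sqrt(n))  # should be near sigma_X / sqrt(n)

# If the sample means are roughly normal, about 95% fall within 1.96 standard errors of mu_X.
within = sum(abs(m - 1.0) <= 1.96 / sqrt(n) for m in sample_means) / replications
print(within)
```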

**Point estimation**

* Some technical definitions...
* A **parameter** is a characteristic of the population or of the underlying distribution; the word **statistic** is reserved for a quantity computed from a sample, to keep the two apart.
* A **point estimate** is a single number, the value of a statistic (the **point estimator**) computed from the sample, used as a guess for some underlying unknown parameter $\theta$.
* $\hat{\theta}$ is the usual notation for this sort of thing.
* A point estimator $\hat{\theta}$ is **unbiased** if for all possible $\theta$, $E[\hat{\theta}] = \theta$. Otherwise $E[\hat{\theta}] - \theta$ is known as the **bias** of $\hat{\theta}$.
* A confusing fact from statistics that was never really explained is the $n - 1$ in the following fact:
* $\hat{\sigma}^2 = \frac{\sum(X_i - \bar{X})^2}{n - 1}$
* As Howard explained, "it turns out that changing $n$ to $n-1$ exactly corrects for the bias in the sample estimate". He was less responsive when I asked if he could show us the proof...but here it is! Well, there it is. On page 235 in the book. I went through it, very succinct!
* Taking this even further, the best estimator for a parameter is not just one that is unbiased, it is the one which is the **minimum variance unbiased estimator**, shorthanded **MVUE**. The MVUE is the unbiased estimator which has the smallest variance.
* Interesting sidenote, in machine learning there is a **bias-variance tradeoff** between your underlying assumption being wrong and not being able to adjust quickly to fix this (bias), and overmodeling and trying to fit statistical noise instead of actual data (variance).
* If you accept the underlying notion that the best estimation comes from a statistic which is unbiased and has the lowest possible variance (a good general assumption but perhaps not the only one you can make, at more advanced levels), then the MVUE is a best-case estimator: it does not incur any "trade-off" versus flatly worse statistics. The trade-off only comes into play in those cases where different kinds of distributions have different "best" statistics, i.e. there is no "clear winner" MVUE.
* The variance in an estimator comes from the variance in our $n$ samples, fyi. So does the bias of the statistic.
* The 10% or 20% trimmed mean is never the MVUE, but it works reasonably well in almost all scenarios: it is a **robust estimator**.
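
The $n - 1$ business is easy to see in simulation. The sketch below draws many small samples from a normal distribution with known variance and averages both versions of the variance estimate; the distribution, sample size, replication count, and seed are arbitrary choices of mine.

```python
import random

random.seed(1)
sigma = 2.0          # true standard deviation, so the true variance is 4
n = 5                # a small sample makes the bias visible
replications = 20000

def sum_squared_deviations(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar)**2 for x in xs)

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(replications):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    ss = sum_squared_deviations(xs)
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / replications
avg_div_n_minus_1 = total_div_n_minus_1 / replications

print("divide by n:    ", avg_div_n, "expected", (n - 1) / n * sigma**2)
print("divide by n - 1:", avg_div_n_minus_1, "expected", sigma**2)
```

Dividing by $n$ undershoots by the factor $(n-1)/n$ on average, which is exactly what dividing by $n - 1$ undoes.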

**Constructing point estimators**

* You can construct a point estimator using moments via the **method of moments**.
* The $k$th moment of a distribution $X$ is $E[X^k]$. The $k$th sample moment is $\frac{1}{n}\sum_{i=1}^n X_i^k$
* The method of moments works thusly: compute the formula for some number of ascending moments (starting from the first). Then, reverse-solve this system of equations to obtain formulas for the parameters ($\alpha$, $\beta$, etc.) in terms of the first and second and so on moments. Then replace $\alpha$ with $\hat{\alpha}$ and $\beta$ with $\hat{\beta}$ and $E[X]$ with $\frac{1}{n} \sum X_i$ and $E[X^2]$ with $\frac{1}{n}\sum X_i^2$ and solve for the parameter estimators.
* Solving this system of equations is not always possible (since stat distributions can get pretty \*\*\*\*ing crazy brah).
* The **maximum likelihood estimator** (MLE) is another point estimation technique. For this one you take your sample values $x_1, ..., x_n$ and treat the joint density $f[\theta_1, ..., \theta_m](x_1, ..., x_n)$ as a function of the parameters (the **likelihood function**). For instance, for an iid sequence of exponential random variables you get $f[\lambda](x_1, ..., x_n) = \lambda^n e^{-\lambda \sum x_i}$. You then set the derivative (or, in the case of multiple parameters, the partial derivatives) of this function, or more conveniently of its logarithm, equal to zero and check via the elementary first or second derivative test in order to find the maximizing parameter values.
* Under very broad conditions and sufficiently large sample sizes the MLE is approximately the MVUE, i.e. it's approximately ideal. MLEs are therefore used in general in statistics because they are both almost always derivable (though sometimes numerical methods are required) and broadly accurate.
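
Both recipes can be sketched on a toy sample. For the method of moments I use the gamma distribution, where $E[X] = \alpha\beta$ and $E[X^2] = \alpha\beta^2 + (\alpha\beta)^2$ solve to $\hat{\beta} = (m_2 - m_1^2)/m_1$ and $\hat{\alpha} = m_1^2/(m_2 - m_1^2)$. For the MLE I use the exponential, where setting the derivative of the log-likelihood $n\ln\lambda - \lambda\sum x_i$ to zero gives $\hat{\lambda} = n/\sum x_i$. The five data points are made up.

```python
from math import log

sample = [0.5, 1.2, 0.3, 2.0, 1.0]  # hypothetical observations
n = len(sample)

# First and second sample moments.
m1 = sum(sample) / n
m2 = sum(x**2 for x in sample) / n

# Method of moments for gamma(alpha, beta).
beta_hat = (m2 - m1**2) / m1
alpha_hat = m1**2 / (m2 - m1**2)
print("gamma MoM:", alpha_hat, beta_hat)

# Maximum likelihood for exponential(lambda).
def log_likelihood(lam):
    return n * log(lam) - lam * sum(sample)

lambda_hat = n / sum(sample)
print("exponential MLE:", lambda_hat)

# Nudging lambda either way should only lower the log-likelihood.
for lam in (0.9 * lambda_hat, lambda_hat, 1.1 * lambda_hat):
    print(lam, log_likelihood(lam))
```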

**Confidence intervals**

* It follows from the old $Var(\bar{X})$ calculations that the confidence interval for the mean, via the standard normal transform, is $[\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}]$ (remember the old two-sided versus one-sided stuff).
* An old calculation that I performed in stats that impressed my professor of the time is $n = \left(2z_{\alpha/2} \cdot \frac{\sigma}{L}\right)^2$, where $L$ is the desired interval length.
* Remember the point from business statistics that narrowing the interval by increasing the sample size is expensive, since the interval width is proportional to $\frac{1}{\sqrt{n}}$. But this is the only way to increase both confidence and precision at once.
* 30 is the rule of thumb for all of this CLT stuff.
* Since $\sigma$ is almost never known, usually you will be using the sample statistic $s$ instead, which inserts a second source of variability, no biggie though.
* $\alpha$ is the **significance level**, the possible leftover error (the confidence level is $1 - \alpha$); using it we get the expression:
$P[-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}] \approx 1 - \alpha$
* You can generalize this to any estimator, not just mean: $P[-z_{\alpha/2} < \frac{\hat{\theta}-\theta}{\sigma_{\hat{\theta}}} < z_{\alpha / 2}] \approx 1 - \alpha$. Here $\sigma_{\hat{\theta}}$ is the standard deviation of $\hat{\theta}$, our estimator.
* The above formulas are general ones that take advantage of normal approximation via the CLT. However, if you know more about the underlying distribution you can improve the interval by inserting that distribution's known expressions for the estimator's mean and standard deviation into the equations. This is the purpose of the $\sigma_{\hat{\theta}}$ stuff above! This is done in particular for the case of binomial trials.
* Continuing specifically to binomial trials, using the fact that $binom[n, p](x) \to N[\mu = np, \sigma = \sqrt{npq}]$ in distribution, you can obtain (from $\mu:\:[\hat{\mu} \pm z_{\alpha/2}\hat{\sigma}]$) that $p: [\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}]$
* Using $L = 2z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$ you also get $n=4z_{\alpha/2}^2\frac{\hat{p}(1-\hat{p})}{L^2}$
* Recall that in sampling from a normal distribution, since we know neither $\mu$ nor $\sigma$ there is variability in both the numerator and denominator when we produce the standardized variable $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ (since it's both $\bar{X}$ not $\mu$ and $S$ not $\sigma$). Because of this double variability, in small samples ($n < 30$) $T$ follows not the standard normal but a member of what is known as the **family of t-distributions**.
* So for small samples, $T$ is distributed as the *t* distribution with $n-1$ degrees of freedom. At $n > 30$ these t-distributions become effectively indistinguishably normal; 30 is used as the rule of thumb for just declaring that it be so.
* It's $n-1$ degrees of freedom because $n-1$ is the number of variables in the statistic that are free to vary: "The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it, is called number of degrees of freedom. In other words, the number of degrees of freedom can be defined as the minimum number of independent coordinates that can specify the position of the system completely."
* If you are estimating multiple statistics then you have $n-k$ degrees of freedom (this happens in multiple regression).
* t curves are more spread out to start (since small sample sizes have more variability) and converge to the standard normal.
* With some chosen confidence level $1 - \alpha$, the interval is $\bar{x} \pm t_{\alpha/2,n-1}\cdot s/\sqrt{n}$
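
The large-sample formulas in this section translate almost directly into code. A sketch using the standard library's `NormalDist` for the critical value; the function names and the example numbers are mine.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(alpha):
    """Upper alpha/2 critical value of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided z interval for the mean with known sigma."""
    half = z_critical(alpha) * sigma / sqrt(n)
    return xbar - half, xbar + half

def sample_size_for_length(sigma, length, alpha=0.05):
    """Smallest n whose z interval is no longer than the requested length."""
    return ceil((2 * z_critical(alpha) * sigma / length)**2)

def proportion_interval(p_hat, n, alpha=0.05):
    """Large-sample interval for a binomial proportion."""
    half = z_critical(alpha) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(mean_interval(80.0, 2.0, 25))      # hypothetical xbar = 80, sigma = 2, n = 25
print(sample_size_for_length(2.0, 1.0))  # n needed for an interval of length 1
print(proportion_interval(0.5, 100))     # hypothetical 50 successes in 100 trials
```

The small-sample t interval has the same shape with $t_{\alpha/2, n-1}$ and $s$ swapped in; the critical value would come from a table or, if SciPy is available, from something like `scipy.stats.t.ppf(1 - alpha / 2, n - 1)`.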

**Variance confidence intervals**
* It can be shown that given that $X_1, ..., X_n$ is a random sample from $N(\mu, \sigma^2)$, $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2[\nu = n- 1](x)$.
* With some algebraic manipulation the confidence interval for $\sigma^2$ can thus be represented as: $\frac{(n-1)s^2}{\chi^2_{\alpha/2, n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}$.
* You read the $\chi^2$ off a table (usually).

![A](https://upload.wikimedia.org/wikipedia/commons/3/35/Chi-square_pdf.svg "Chi-squared distribution")

* If $Z_1, ..., Z_k$ are independent, standard normal random variables, then the sum of their squares, $\sum_{i=1}^k Z_i^2$, is distributed $\chi^2$ with $k$ degrees of freedom, i.e. $\sum_{i=1}^k Z_i^2 \sim \chi^2[\nu = k]$. This is where this all comes from.
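
A simulation sketch of the claim that $(n-1)S^2/\sigma^2$ behaves like a chi-squared variable with $n - 1$ degrees of freedom. It leans on the standard facts that a $\chi^2[\nu]$ variable has mean $\nu$ and variance $2\nu$; the normal parameters, sample size, replication count, and seed are arbitrary choices of mine.

```python
import random
from statistics import mean, variance

random.seed(2)
mu, sigma, n = 10.0, 3.0, 5
replications = 20000

scaled = []
for _ in range(replications):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    # statistics.variance is the sample variance S^2 (divides by n - 1).
    scaled.append((n - 1) * variance(xs) / sigma**2)

sim_mean = mean(scaled)
sim_var = variance(scaled)
print(sim_mean, n - 1)       # chi-squared mean is nu = n - 1
print(sim_var, 2 * (n - 1))  # chi-squared variance is 2 * nu
```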

**Hypothesis testing**
* The claim favored to be true is the **null hypothesis** ($H_0$), the alternative is the **alternative hypothesis** ($H_a$).
* A **test procedure** is a rule by which a hypothesis test is to be conducted. It defines a **test statistic** and a **rejection region**: the null hypothesis is to be rejected (at least outright) only if the observed test statistic falls in the rejection region.
* A **type I error** (**$\alpha$ error**) is rejecting the null hypothesis when it is true, a **type II error** (**$\beta$ error**) is failing to reject it when it is false.
* $\alpha$ error is single-valued. Since $\beta$ error depends on the true underlying $\mu$, and we don't know it, there are actually multiple beta values: one for each *actual* value of $\mu$.
* For example, suppose we hypothesize $H_0: \mu = 75$ and reject when $\bar{x} \leq 70$. Then a type I error occurs when $\bar{x} \leq 70$ even though $\mu = 75$: its probability is the lower tail mass, at or below 70, of the sampling distribution of $\bar{X}$ centered at 75. A type II error occurs when $\bar{x} > 70$ even though $\mu < 75$: its probability is the mass above 70 of the sampling distribution centered at the true mean, the exact amount of which depends on what $\mu$ actually is after all (the further the real mean sits below 75, the better the chance that $\bar{x}$ lands at or below 70 and we correctly reject).
* The book has an adequate illustration on pg. 289.
* Focusing on a boundary value is a worst case analysis, usually instead we want to test e.g. $H_0:\:\mu\geq75$.
* Type I and Type II error inherently trade off with one another.
* The level of type I error is usually referred to as the **significance level** of the test.
* The tradeoff between more type I or more type II error is a real-world issue. In scientific experimentation a false negative is generally considered better than a false positive (but also respect the file drawer effect...).
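
To put numbers on the $\mu = 75$, reject-when-$\bar{x} \leq 70$ example, the sketch below computes $\alpha$ and a few values of $\beta$. The population standard deviation ($\sigma = 9$) and sample size ($n = 25$) are assumptions of mine, not values from the notes, and the sampling distribution of $\bar{X}$ is taken to be normal.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 75.0    # hypothesized mean under H0
cutoff = 70.0  # reject H0 when xbar <= cutoff
sigma = 9.0    # assumed population standard deviation (not from the notes)
n = 25         # assumed sample size (not from the notes)
standard_error = sigma / sqrt(n)

def reject_probability(true_mu):
    """P(xbar <= cutoff) when the true mean is true_mu."""
    return NormalDist(true_mu, standard_error).cdf(cutoff)

def beta(true_mu):
    """Type II error probability: failing to reject when the mean is really true_mu."""
    return 1 - reject_probability(true_mu)

# Type I error: rejecting even though H0 is true.
alpha = reject_probability(mu_0)
print("alpha:", alpha)

# One beta per candidate true mean; it shrinks as the truth moves away from 75.
for true_mu in (74, 72, 70, 68, 66):
    print(true_mu, beta(true_mu))
```

Moving the cutoff closer to 75 would raise $\alpha$ and lower every $\beta$, which is the trade-off described above.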