# Week 6: Inferential Statistics

In [1]:
# Loading the libraries
import numpy as np
import sympy as sy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

## Day 1: The Central Limit Theorem & Confidence Intervals
This week we discuss how to draw conclusions (make inferences) about data using random variables and their probability distributions.

The setting is as follows: let $X \sim \mathcal{N}(\mu, \sigma)$ be a random variable. Assume we draw a random sample of size $n$ from the distribution of $X$. Next we calculate the sample mean $\bar{x}_1$, where the 1 denotes that it is the "first" sample we draw. Now, repeat the process many times, generating a **sequence of averages of samples with size $n$**:
\begin{equation} \bar{x}_1, \bar{x}_2, \dots, \bar{x}_m \end{equation}
Question: what are the mean and the standard deviation of the averages; what shape does this distribution have?

Let us run a simulation to find out.

### Example 1
Construct a simulation that generates $m$ samples from the random variable $X \sim \mathcal{N}(100, 12)$, each with size $n$. Then calculate *the mean of the sample means* and the *standard deviation of the sample means* and plot the sample means on a histogram. Think about the shape of the distribution.

In [None]:
# Define X
mu = 100
sigma = 12
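
One possible way to set up this simulation is sketched below. The values of `n`, `m` and the seed are assumptions chosen only for illustration; the exercise leaves them open.

```python
import numpy as np

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility

mu, sigma = 100, 12   # parameters of X from the exercise
n, m = 25, 10_000     # sample size and number of samples (assumed values)

# Each row is one sample of size n; averaging along each row gives m sample means
samples = rng.normal(loc=mu, scale=sigma, size=(m, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
sd_of_means = sample_means.std(ddof=1)
print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f}")
```

With the libraries loaded at the top of the notebook, `plt.hist(sample_means, bins=40)` draws the histogram. Try a few different values of `n` and watch how the spread changes.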


### Example 2
Run a simulation similar to the one in **Example 1**, but now use a random variable $Q$ which follows a $\chi^2$-distribution with 10 degrees of freedom. Calculate *the mean of the sample means* and the *standard deviation of the sample means* and plot the sample means on a histogram. Think about the shape of the distribution.

In [None]:
# Define Q
df = 10
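
A sketch of the same simulation for $Q$, run for two sample sizes so that the change in shape can be quantified. The sample sizes, `m`, and the use of skewness as a summary of shape are assumptions made for this illustration, not part of the exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility

df = 10      # degrees of freedom of Q, from the exercise
m = 10_000   # number of samples (assumed value)

results = {}
for n in (5, 50):  # two assumed sample sizes, to see the effect of n
    sample_means = rng.chisquare(df, size=(m, n)).mean(axis=1)
    results[n] = {
        "mean": sample_means.mean(),
        "sd": sample_means.std(ddof=1),
        "skew": stats.skew(sample_means),  # 0 for a perfectly symmetric shape
    }
    r = results[n]
    print(f"n={n:3d}: mean={r['mean']:.3f}  sd={r['sd']:.3f}  skew={r['skew']:.3f}")
```

The $\chi^2$ distribution itself is right-skewed, so the interesting comparison is how the skewness of the sample means behaves as `n` grows.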


### The Central Limit Theorem
The Central Limit Theorem (CLT) is one of the most (ab)used facts in statistics. It is sometimes called **the fundamental theorem of statistics** (by analogy with similar theorems in arithmetic, algebra, and calculus).

In simple terms, the CLT states that if you have a random variable $X$ with mean $\mu_X$ and standard deviation $\sigma_X$, and you draw many samples of size $n$ from $X$, then the averages of those samples are described by a random variable, labeled as $\overline{X}$, for which:
* $\mu_{\overline{X}} = \mu_X$ (the mean of the sample means equals the mean of the original distribution)
* $\sigma_{\overline{X}} = \displaystyle\frac{\sigma_X}{\sqrt{n}}$ (the standard deviation of the sample means equals the standard deviation of the original distribution scaled down by a factor of $\sqrt{n}$)
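
Both formulas follow from the linearity of the expected value and from the independence of the observations. A short derivation, writing $X_1, \dots, X_n$ for the independent draws that make up one sample (this notation is introduced here only for the derivation):
\begin{equation} \mu_{\overline{X}} = E\left[ \frac{1}{n}\sum_{i=1}^{n} X_i \right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu_X = \mu_X \end{equation}
\begin{equation} \sigma_{\overline{X}}^2 = \mathrm{Var}\left( \frac{1}{n}\sum_{i=1}^{n} X_i \right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma_X^2}{n^2} = \frac{\sigma_X^2}{n} \quad \Longrightarrow \quad \sigma_{\overline{X}} = \frac{\sigma_X}{\sqrt{n}} \end{equation}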

Moreover, when $n$ is sufficiently large, the shape of the distribution is **approximately** Normal; in other words, $\overline{X} \sim \mathcal{N}\left( \mu_{\overline{X}}, \sigma_{\overline{X}} \right) = \mathcal{N}\left( \mu_X, \displaystyle\frac{\sigma_X}{\sqrt{n}} \right)$, approximately. This distribution is called the **sampling distribution of the mean**.

If $X$ is distributed normally, then $\overline{X}$ is also distributed normally, and this is the only case when the sampling distribution of the mean is exactly normal.

### Example 3
Illustrate the CLT on the random variable $X \sim \mathcal{N}(160, 10)$ with sample size $n = 16$.

In [None]:
# Define X
mu = 160
sigma = 10
n = 16
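
One way to illustrate the theorem here is to compare the simulated sample means with what the CLT predicts, including the share of sample means that fall within one and two standard errors of $\mu$. The number of samples `m` and the seed are assumed values.

```python
import numpy as np

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility

mu, sigma, n = 160, 10, 16   # from the exercise
m = 20_000                   # number of samples (assumed value)

sample_means = rng.normal(mu, sigma, size=(m, n)).mean(axis=1)

se_theory = sigma / np.sqrt(n)  # CLT prediction: 10 / 4 = 2.5
print(f"mean of sample means: {sample_means.mean():.3f}  (theory: {mu})")
print(f"sd of sample means:   {sample_means.std(ddof=1):.3f}  (theory: {se_theory})")

# Share of sample means within 1 and 2 standard errors of mu
within_1se = np.mean(np.abs(sample_means - mu) <= 1 * se_theory)
within_2se = np.mean(np.abs(sample_means - mu) <= 2 * se_theory)
print(f"within 1 SE: {within_1se:.3f}, within 2 SE: {within_2se:.3f}")
```

If the sampling distribution is Normal, the last two numbers should be close to the 68% and 95% of the 68-95-99.7 Rule.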


### Example 4
Illustrate the CLT on the random variable $G$ which follows a **geometric distribution** with parameter $p=0.2$, i.e. $G \sim \mathrm{Geom}(0.2)$, with sample size $n = 17$.

In [None]:
# Define G
p = 0.2
n = 17
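
A sketch for the geometric case. It assumes the convention in which $G$ counts the number of trials up to and including the first success (values $1, 2, \dots$), which is what NumPy's generator uses; under that convention $\mu_G = 1/p$ and $\sigma_G = \sqrt{1-p}/p$. The number of samples `m` and the seed are assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility

p, n = 0.2, 17   # from the exercise
m = 20_000       # number of samples (assumed value)

# Assumed convention: support 1, 2, ... (trials until the first success)
mu_G = 1 / p                    # 5.0
sigma_G = np.sqrt(1 - p) / p    # about 4.47
se_theory = sigma_G / np.sqrt(n)

draws = rng.geometric(p, size=(m, n))
sample_means = draws.mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}  (theory: {mu_G:.3f})")
print(f"sd of sample means:   {sample_means.std(ddof=1):.3f}  (theory: {se_theory:.3f})")

# Skewness of the raw draws versus skewness of the sample means
skew_raw = stats.skew(draws.ravel())
skew_means = stats.skew(sample_means)
print(f"skewness: raw draws {skew_raw:.3f}, sample means {skew_means:.3f}")
```

The geometric distribution is strongly right-skewed, so with $n = 17$ the sample means are noticeably more symmetric than the raw draws but still not perfectly Normal.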


### How does this help us with data?
The situation is as follows: we have one sample of size $n$ from a population with unknown population mean $\mu$ and population standard deviation $\sigma$. Obviously, the *sample statistics* $\bar{x}$ and $s$ can estimate $\mu$ and $\sigma$, but they can also "miss" them by a lot. These sample statistics are called **point estimates** of the population parameters.

Question: how do we take advantage of the facts stated in the CLT to give an **interval estimate** of the population mean $\mu$?

In practice, if you work with a *large sample* ($n \geqslant 30$ or $40$), then you can consider the sampling distribution to be close enough to the normal distribution. However, in all other cases, the Student *t*-distribution with $n-1$ degrees of freedom describes the sampling distribution much better than the normal distribution.
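
To get a feeling for how much the choice between the two distributions matters, the sketch below compares the multipliers (critical values) that mark off the central 95% of the normal distribution and of the *t*-distribution. The 95% level and the list of sample sizes are arbitrary choices for illustration.

```python
from scipy import stats

c_level = 0.95
alpha = 1 - c_level

# Two-sided critical value of the standard normal distribution
z_star = stats.norm.ppf(1 - alpha / 2)
print(f"normal: z* = {z_star:.3f}")

# The same quantile of the t-distribution with n - 1 degrees of freedom
t_stars = {}
for n in (5, 10, 30, 100):
    t_stars[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:3d}: t* = {t_stars[n]:.3f}")
```

The *t* multipliers are always larger than the normal one and shrink toward it as $n$ grows, which is why the distinction matters mostly for small samples.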

Recall the **68-95-99.7 Rule** for the Normal distribution. We can construct something similar for any distribution. Since in reality we have only one sample, we use the sample statistics as estimates of the population parameters. If $\bar{x}$ and $s$ are the sample mean and sample standard deviation of a sample of size $n$, then the **$100(1-\alpha)\%$ confidence interval** estimate of the population mean is given by:
\begin{equation}\displaystyle \bar{x} \pm t_{n-1}^* \cdot \frac{s}{\sqrt{n}}\end{equation}
where $t_{n-1}^*$ is a scaling coefficient that depends on the sample size $n$ and the confidence level; it is the $1-\alpha/2$ quantile of the $t$-distribution with $n-1$ degrees of freedom. The number $df = n-1$ is the number of **degrees of freedom**, while the quantity $SE = \frac{s}{\sqrt{n}}$ is called the **standard error**. The number $100(1-\alpha)\%$ is called the **confidence level** and is usually at least 90%. Let's see some examples.
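
As a quick worked illustration with made-up numbers (not taken from any data set in this notebook), suppose a sample of size $n = 9$ has $\bar{x} = 102$ and $s = 12$, and we want a 95% confidence interval. Then $df = 8$, $t_{8}^* \approx 2.306$, and
\begin{equation} SE = \frac{12}{\sqrt{9}} = 4, \qquad \bar{x} \pm t_{8}^* \cdot SE \approx 102 \pm 2.306 \cdot 4 = 102 \pm 9.22 \end{equation}
so the interval estimate is approximately $(92.78, 111.22)$.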

### Example 5
In this example, we use a confidence interval to estimate the mean of a known population (i.e. this is just a *controlled* experiment). Let $X \sim \mathcal{N}(100, 12)$. Generate 5 different 90% CIs and 5 more 95% CIs for the mean of the population based on samples of size $n=9$. Comment on what you get.

In [None]:
# Define X
mu = 100
sigma = 12
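
A possible sketch for this experiment. The helper `t_interval` is a hypothetical name introduced here; it simply applies the formula $\bar{x} \pm t_{n-1}^* \cdot s/\sqrt{n}$. The seed is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility
mu, sigma, n = 100, 12, 9            # from the exercise


def t_interval(sample, c_level):
    """Return (lower, upper) of the t-based confidence interval for the mean."""
    size = sample.size
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(size)               # standard error
    t_star = stats.t.ppf((1 + c_level) / 2, df=size - 1)  # 1 - alpha/2 quantile
    return x_bar - t_star * se, x_bar + t_star * se


intervals = {}
for c_level in (0.90, 0.95):
    intervals[c_level] = []
    for _ in range(5):
        sample = rng.normal(mu, sigma, size=n)  # a fresh sample for every interval
        low, high = t_interval(sample, c_level)
        intervals[c_level].append((low, high))
        print(f"{c_level:.0%} CI: ({low:7.2f}, {high:7.2f})  contains mu: {low <= mu <= high}")
```

When commenting on the output, look at how the centers and widths of the intervals change from sample to sample, and how often the true mean $\mu = 100$ ends up inside.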


### Example 6
Let $Y \sim \mathrm{Po}(13)$. Generate 5 different 95% CIs and 5 more 99% CIs for the mean of the population based on samples of size $n=25$. Comment on what you get.

In [None]:
# Define Y
lmbd = 13
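
A sketch for the Poisson case, this time using SciPy's built-in `stats.t.interval` and `stats.sem`. The second half goes beyond what the exercise asks: it repeats the experiment many times to estimate how often the intervals contain the true mean. The number of repetitions `reps` and the seed are assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)  # arbitrary seed, for reproducibility
lmbd, n = 13, 25                     # from the exercise

# Five intervals per confidence level
for c_level in (0.95, 0.99):
    for _ in range(5):
        sample = rng.poisson(lmbd, size=n)
        low, high = stats.t.interval(c_level, n - 1,
                                     loc=sample.mean(), scale=stats.sem(sample))
        print(f"{c_level:.0%} CI: ({low:.2f}, {high:.2f})  contains mean: {low <= lmbd <= high}")

# Long-run check (not part of the exercise): share of intervals containing lmbd
reps = 5_000
samples = rng.poisson(lmbd, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

coverage = {}
for c_level in (0.95, 0.99):
    t_star = stats.t.ppf((1 + c_level) / 2, df=n - 1)
    coverage[c_level] = np.mean(np.abs(means - lmbd) <= t_star * ses)
    print(f"{c_level:.0%} intervals covering the mean: {coverage[c_level]:.3f}")
```

The long-run share is what the confidence level refers to: it describes the procedure over many samples, not any single interval.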


### Example 7
The sample given in the next cell comes from a distribution whose mean you do not know, but need to estimate. Use a 95% CI to estimate the population mean.

In [None]:
sample = np.array([3.23545161, 3.77542568, 1.01742999, 1.95137322, 3.70661749,
       1.54115566, 3.97507688, 3.74119874, 2.99544951, 1.20815545,
       4.43162589, 1.68634582, 0.97010408, 3.81707371, 1.37509011,
       1.46900854, 5.10493947, 8.15556455, 6.73199071, 2.29971986])

n = sample.size
df = n-1
c_level = 0.95
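
One way to finish this cell is sketched below. It repeats the data so that the block runs on its own, applies the formula $\bar{x} \pm t_{n-1}^* \cdot s/\sqrt{n}$ step by step, and then cross-checks the result against `stats.t.interval`. The names `x_bar`, `se`, `t_star`, `low` and `high` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sample = np.array([3.23545161, 3.77542568, 1.01742999, 1.95137322, 3.70661749,
                   1.54115566, 3.97507688, 3.74119874, 2.99544951, 1.20815545,
                   4.43162589, 1.68634582, 0.97010408, 3.81707371, 1.37509011,
                   1.46900854, 5.10493947, 8.15556455, 6.73199071, 2.29971986])

n = sample.size
df = n - 1
c_level = 0.95

x_bar = sample.mean()                        # point estimate of the mean
s = sample.std(ddof=1)                       # sample standard deviation (divides by n - 1)
se = s / np.sqrt(n)                          # standard error
t_star = stats.t.ppf((1 + c_level) / 2, df)  # 1 - alpha/2 quantile of t with df degrees of freedom

low, high = x_bar - t_star * se, x_bar + t_star * se
print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, SE = {se:.3f}, t* = {t_star:.3f}")
print(f"{c_level:.0%} CI for the mean: ({low:.3f}, {high:.3f})")

# Cross-check with SciPy's built-in interval
low_sp, high_sp = stats.t.interval(c_level, df, loc=x_bar, scale=se)
print(f"scipy check: ({low_sp:.3f}, {high_sp:.3f})")
```

Note that `np.std` divides by $n$ unless `ddof=1` is passed, so the argument matters here. The sample also looks somewhat right-skewed, so with $n = 20$ the interval should be read as an approximation.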
