# Statistics

## 3.1 Introduction
- Probability theory (Chapter 2) is all about modeling a distribution over observed data outcomes $\mathcal{D}$ (knowing the parameters $\boldsymbol{\theta}$) by computing $p(\mathcal{D}| \boldsymbol{\theta})$
- Statistics is the inverse problem. We want to compute $p(\boldsymbol{\theta}| \mathcal{D})$ (so we want to infer the parameters $\boldsymbol{\theta}$) given observations. There are two approaches:
    - **Frequentist**
    - **Bayesian** ($\leftarrow$ this is king)

## 3.2 Bayesian statistics
- Observed data $\mathcal{D}$ is known and fixed, while the parameters $\boldsymbol{\theta}$ are unknown (this is the opposite of the frequentist approach (Sec. 3.3))
- We represent our beliefs about the parameters after seeing the data as a **posterior distribution** (eq. 3.1): $p(\boldsymbol{\theta}| \mathcal{D}) = \frac{p(\boldsymbol{\theta})p(\mathcal{D}|\boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\boldsymbol{\theta})p(\mathcal{D}|\boldsymbol{\theta})}{\int p(\boldsymbol{\theta}^\prime)p(\mathcal{D}|\boldsymbol{\theta}^\prime)d\boldsymbol{\theta}^\prime}$
    - **posterior dist**:  $p(\boldsymbol{\theta}| \mathcal{D})$
    - **prior dist**: $p(\boldsymbol{\theta})$
    - **likelihood**: $p(\mathcal{D}|\boldsymbol{\theta})$
    - **marginal likelihood** (evidence): $p(\mathcal{D})$
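
For a scalar parameter, eq. 3.1 can be approximated numerically by evaluating prior times likelihood on a grid and dividing by a Riemann-sum estimate of $p(\mathcal{D})$. The following is only a minimal sketch: the function name `grid_posterior`, the flat prior, and the coin-style likelihood (7 heads in 10 tosses) are illustrative assumptions, not something taken from the text.

```python
def grid_posterior(prior, likelihood, grid):
    """Approximate p(theta | D) on a uniform grid of scalar theta values."""
    # Unnormalized posterior: prior times likelihood at each grid point.
    unnorm = [prior(t) * likelihood(t) for t in grid]
    # Riemann-sum approximation of the marginal likelihood p(D).
    step = grid[1] - grid[0]
    evidence = sum(unnorm) * step
    return [u / evidence for u in unnorm], evidence


# Hypothetical example: flat prior on [0, 1], 7 heads and 3 tails observed.
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 bins
prior = lambda t: 1.0
likelihood = lambda t: t**7 * (1 - t) ** 3
posterior, evidence = grid_posterior(prior, likelihood, grid)

# The posterior density integrates to (approximately) 1 over the grid.
total_mass = sum(posterior) * (grid[1] - grid[0])
```

Grid approximation only scales to very low-dimensional $\boldsymbol{\theta}$, but it makes the role of the denominator in eq. 3.1 concrete: it is just the normalizing constant.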
    
### 3.2.1 Tossing coins
- this is the 'atom' example of probabilities
- We record the outcomes of observed data as $\mathcal{D}=\{y_n\in\{0,1\}:n=1:N \}$

#### 3.2.1.1 Likelihood
- In a simple coin toss, the data are iid, and thus the **sufficient statistics** are $(N_1, N_0=N-N_1)$ (the numbers of heads and tails) in the likelihood:
$p(\mathcal{D}| \theta) = \prod_{n=1}^N\theta^{y_n}(1-\theta)^{1-y_n}=\theta^{N_1}(1-\theta)^{N_0}$
    - $y_n$ is the outcome of toss number $n$ ($y_n=1$ if it lands heads, $0$ otherwise) and $\theta$ is the prob of seeing heads
    - *note that sufficient statistics $\neq\boldsymbol{\theta}$: sufficient stats refer to the quantities that capture enough info about the data to be able to estimate the parameters*
- if we only record the number of heads $y=N_1$, the simple coin toss likelihood can also be written using a *Binomial dist*: $p(y|N,\theta) = \operatorname{Bin}(y|N, \theta)=\binom{N}{y}\theta^{y}(1-\theta)^{N-y}$, which differs from the expression above only by the constant factor $\binom{N}{y}$
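
As a rough illustration of why the counts are sufficient, the sketch below evaluates the coin-toss likelihood toss by toss and again from $(N_1, N_0)$ alone, and compares both with the Binomial pmf. The function names and the toss sequence are made up for the example.

```python
import math


def sufficient_stats(tosses):
    """Return (N1, N0): the counts of heads (1) and tails (0)."""
    n1 = sum(tosses)
    return n1, len(tosses) - n1


def bernoulli_likelihood(tosses, theta):
    """Likelihood of the full toss sequence, as a product over tosses."""
    return math.prod(theta if y == 1 else 1 - theta for y in tosses)


def likelihood_from_counts(n1, n0, theta):
    """The same likelihood, computed from the sufficient statistics only."""
    return theta**n1 * (1 - theta) ** n0


def binomial_pmf(n1, n, theta):
    """Bin(n1 | n, theta): probability of n1 heads in n tosses."""
    return math.comb(n, n1) * theta**n1 * (1 - theta) ** (n - n1)


tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # made-up data: 7 heads, 3 tails
n1, n0 = sufficient_stats(tosses)
theta = 0.6
per_toss = bernoulli_likelihood(tosses, theta)
from_counts = likelihood_from_counts(n1, n0, theta)
binom = binomial_pmf(n1, n1 + n0, theta)
```

Here `per_toss` and `from_counts` agree, and `binom` equals `math.comb(10, 7) * from_counts`; the extra factor does not depend on $\theta$, so it does not affect inference about $\theta$.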

#### 3.2.1.2 Prior
- We can write an **uninformative prior** using a *uniform dist*, but the *beta dist* is more general: $p(\theta)=\operatorname{Beta}(\theta|\breve{\alpha},\breve{\beta}) \propto \theta^{\breve\alpha-1}(1-\theta)^{\breve\beta-1}$
    - $\breve{\alpha},\breve{\beta}$ are **hyperparameters** (params of the prior that determine our belief about $\boldsymbol{\theta}$), if $\breve{\alpha}=\breve{\beta}=1$ we recover the uniform dist
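
The beta density, including its normalizing constant $\frac{\Gamma(\breve\alpha+\breve\beta)}{\Gamma(\breve\alpha)\Gamma(\breve\beta)}$, can be sketched with the standard library alone. The function name `beta_pdf` is an assumption made for this example; a library routine such as SciPy's beta distribution would normally be used instead.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(theta | a, b) for 0 < theta < 1 and a, b > 0."""
    # Log of the normalizer Gamma(a + b) / (Gamma(a) * Gamma(b)),
    # computed with log-gamma to avoid overflow for large a, b.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_kernel = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_norm + log_kernel)


uniform_density = beta_pdf(0.3, 1, 1)  # Beta(1, 1) is flat on (0, 1)
peaked_density = beta_pdf(0.5, 2, 2)   # Beta(2, 2) puts more mass near 0.5
```

With $\breve\alpha=\breve\beta=1$ the density is 1 everywhere on $(0,1)$, which is the uniform case mentioned above.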


#### 3.2.1.3 Posterior
- $\text{posterior}\propto\text{likelihood}\times\text{prior}$
- continuing the example of a beta prior, we have a **conjugate prior** because the posterior has the same functional form as the prior:
    $p(\theta \mid \mathcal{D}) \propto \theta^{N_1}(1-\theta)^{N_0} \theta^{\breve{\alpha}-1}(1-\theta)^{\breve{\beta}-1} \propto \operatorname{Beta}\left(\theta \mid \breve{\alpha}+N_1, \breve{\beta}+N_0\right)=\operatorname{Beta}(\theta \mid \widehat{\alpha}, \widehat{\beta})$
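
Because of conjugacy, the posterior update reduces to adding the counts to the prior hyperparameters. A minimal sketch, with a hypothetical helper `beta_update` and made-up counts:

```python
def beta_update(alpha, beta, n1, n0):
    """Posterior hyperparameters for a Beta(alpha, beta) prior and coin counts."""
    return alpha + n1, beta + n0


# Made-up data: uniform Beta(1, 1) prior, then 7 heads and 3 tails.
alpha_hat, beta_hat = beta_update(1, 1, n1=7, n0=3)

# Updating in two batches gives the same posterior as a single batch.
first = beta_update(1, 1, n1=4, n0=1)
sequential = beta_update(*first, n1=3, n0=2)
```

Both routes end at $\operatorname{Beta}(\theta\mid 8, 4)$, which shows that the posterior from one batch can serve as the prior for the next.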

#### 3.2.1.4 Posterior mode (MAP estimate)
- In Bayesian statistics, the MAP estimate is the mode of the posterior dist. It gives the most probable value of the parameter: $\hat{\theta}_{\text{map}}=\arg\max_\theta p(\theta\mid\mathcal{D})=\arg\max_\theta \left[\log p(\theta)+\log p(\mathcal{D}|\theta)\right]$
    - for a beta dist prior $\hat{\theta}_{\text {map }}=\frac{\breve{\alpha}+N_1-1}{\breve{\alpha}+N_1-1+\breve{\beta}+N_0-1}$
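    - if the prior is a uniform dist we get the MLE $\hat{\theta}_{\text {mle}}=\frac{N_1}{N_1+N_0}$, because $p(\theta)\propto 1$, so $\log p(\theta)$ is a constant and drops out of the $\arg\max$
    - if the sample size is small, we can use a slightly stronger prior (a more peaked beta dist), e.g. $\operatorname{Beta}(\theta|2,2)$, which gives $\hat{\theta}_{\text{map}}=\frac{N_1+1}{N_1+1+N_0+1}$ (**add-one smoothing**)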
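
The three estimates can be compared on a small made-up sample. The sketch below assumes a beta prior and uses hypothetical function names; it applies the mode formula only where the beta posterior has a single well-defined mode.

```python
def mle(n1, n0):
    """Maximum likelihood estimate of the heads probability."""
    return n1 / (n1 + n0)


def map_estimate(n1, n0, alpha=1.0, beta=1.0):
    """Mode of the Beta(alpha + n1, beta + n0) posterior."""
    a_post, b_post = alpha + n1, beta + n0
    # The mode formula needs both parameters >= 1 and not both equal to 1
    # (Beta(1, 1) is flat, so it has no unique mode).
    if a_post < 1 or b_post < 1 or a_post + b_post <= 2:
        raise ValueError("posterior has no unique mode for these parameters")
    return (a_post - 1) / (a_post + b_post - 2)


# Made-up data: 3 tosses, all heads.
n1, n0 = 3, 0
theta_mle = mle(n1, n0)                    # 3 / 3
theta_map_uniform = map_estimate(n1, n0)   # uniform prior: same as the MLE
theta_map_add_one = map_estimate(n1, n0, alpha=2, beta=2)  # (3+1) / (3+2)
```

With only three tosses, the MLE claims tails is impossible, while the $\operatorname{Beta}(2,2)$ prior pulls the estimate back to $4/5$.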
    
#### 3.2.1.5 Posterior mean
- MAP is the mode, thus it is a weak summary of the posterior (it reflects a single point rather than the whole distribution)

#### 3.2.1.6 Posterior variance


