# Statistics

## 3.1 Introduction
- Probability theory (Chapter 2) is all about modeling a distribution over observed data outcomes $\mathcal{D}$ (knowing the parameters $\boldsymbol{\theta}$) by computing $p(\mathcal{D}| \boldsymbol{\theta})$
- Statistics is the inverse problem. We want to compute $p(\boldsymbol{\theta}| \mathcal{D})$ (so we want to infer the parameters $\boldsymbol{\theta}$) given observations. There are two approaches:
    - **Frequentist**
    - **Bayesian** ($\leftarrow$ this is king)

## 3.2 Bayesian statistics
- Observed data $\mathcal{D}$ is known and fixed; the parameters $\boldsymbol{\theta}$ are unknown and treated as random (this is the opposite of the frequentist approach (Sec.3.3))
- We represent our beliefs about the parameters after seeing data as a **posterior distribution** (eq.3.1): $p(\boldsymbol{\theta} \mid  \mathcal{D}) = \frac{p(\boldsymbol{\theta})p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\boldsymbol{\theta})p(\mathcal{D} \mid \boldsymbol{\theta})}{\int p(\boldsymbol{\theta}^\prime)p(\mathcal{D} \mid \boldsymbol{\theta}^\prime)d\boldsymbol{\theta}^\prime}$
    - **posterior dist**:  $p(\boldsymbol{\theta} \mid  \mathcal{D})$
    - **prior dist**: $p(\boldsymbol{\theta})$
    - **likelihood**: $p(\mathcal{D} \mid \boldsymbol{\theta})$
    - **marginal likelihood** (evidence): $p(\mathcal{D})$
    
### 3.2.1 Tossing coins
- this is the 'atom' example of probability
- this whole section uses the example of a Bayesian approach to coin tosses (Bernoulli likelihood) w/ a beta prior
- We record the outcomes of observed data as $\mathcal{D}=\{y_n\in\{0,1\}:n=1:N \}$

#### 3.2.1.1 Likelihood
- In a simple coin toss, the data are iid, so the likelihood is:
$p(\mathcal{D} \mid \theta) = \prod_{n=1}^N\theta^{y_n}(1-\theta)^{1-y_n}=\theta^{N_1}(1-\theta)^{N_0}$
    - $y_n$ is the outcome (heads $=1$, tails $=0$) at toss number $n$
    - $N_1$ is the number of heads and $N_0=N-N_1$ the number of tails; $(N_1, N_0)$ are the **sufficient statistics**
    - *note that sufficient statistics $\neq\boldsymbol{\theta}$*: sufficient stats are the quantities that capture enough info about the data to be able to estimate the parameters
- if we only record the number of heads $y=N_1$ in $N$ tosses, the likelihood is a *Binomial dist*: $p(\mathcal{D} \mid \theta) = \operatorname{Bin}(y \mid N, \theta)$, which differs from the Bernoulli likelihood only by a factor that does not depend on $\theta$
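
A minimal sketch of the likelihood computation above, assuming a made-up record of 10 tosses (`tosses` and the function names are illustrative, not from the source). It checks that the per-toss product and the sufficient-statistics form agree, and that the Binomial form only adds a constant factor:

```python
import math

# Hypothetical record of N = 10 tosses (1 = heads, 0 = tails)
tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def bernoulli_likelihood(theta, data):
    """Product of the per-toss Bernoulli probabilities."""
    p = 1.0
    for y in data:
        p *= theta**y * (1 - theta) ** (1 - y)
    return p

def counts_likelihood(theta, n1, n0):
    """Same likelihood written with the sufficient statistics only."""
    return theta**n1 * (1 - theta) ** n0

n1 = sum(tosses)           # number of heads
n0 = len(tosses) - n1      # number of tails
theta = 0.6                # an arbitrary parameter value to evaluate at

lik_full = bernoulli_likelihood(theta, tosses)
lik_counts = counts_likelihood(theta, n1, n0)
# The Binomial likelihood multiplies by "N choose N1", which is constant in theta
lik_binom = math.comb(len(tosses), n1) * lik_counts

print(n1, n0, lik_full, lik_counts, lik_binom)
```

Because only `n1` and `n0` enter `counts_likelihood`, the order of the tosses carries no extra information about `theta`.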

#### 3.2.1.2 Prior
- We can write an **uninformative prior** using a *uniform dist*, but the *beta dist* is more general: $p(\theta)=\operatorname{Beta}(\theta \mid \breve{\alpha},\breve{\beta}) \propto \theta^{\breve\alpha-1}(1-\theta)^{\breve\beta-1}$ 
    - $\breve{\alpha},\breve{\beta}$ are **hyperparameters** (params of the prior that determine our belief about $\boldsymbol{\theta}$); if $\breve{\alpha}=\breve{\beta}=1$ we recover the uniform dist
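
As a quick check of the claim that $\breve{\alpha}=\breve{\beta}=1$ recovers the uniform dist, here is a small sketch of the beta density using only the standard library (the helper names `log_beta_fn` and `beta_pdf` are illustrative assumptions):

```python
import math

def log_beta_fn(a, b):
    """log B(a, b), computed with log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf(theta, a, b):
    """Density of Beta(theta | a, b) for 0 < theta < 1."""
    log_p = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_p - log_beta_fn(a, b))

grid = (0.1, 0.5, 0.9)
uniform_values = [beta_pdf(t, 1.0, 1.0) for t in grid]  # flat prior
peaked_values = [beta_pdf(t, 2.0, 2.0) for t in grid]   # mild preference for 0.5

print(uniform_values)
print(peaked_values)
```

Larger hyperparameters concentrate the prior more strongly, which corresponds to a stronger belief about $\theta$ before seeing data.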


#### 3.2.1.3 Posterior
- $\text{posterior}\propto\text{likelihood}\times\text{prior}$
- continuing the example of a beta prior, we have a **conjugate prior** because the posterior has the same functional form as the prior:
    $p(\theta \mid \mathcal{D}) \propto \theta^{N_1}(1-\theta)^{N_0} \theta^{\breve{\alpha}-1}(1-\theta)^{\breve{\beta}-1} \propto \operatorname{Beta}\left(\theta \mid \breve{\alpha}+N_1, \breve{\beta}+N_0\right)=\operatorname{Beta}(\theta \mid \widehat{\alpha}, \widehat{\beta})$
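
The conjugate update amounts to adding counts to the hyperparameters. A minimal sketch, assuming a hypothetical $\operatorname{Beta}(2,2)$ prior and made-up tosses, which also shows that updating one toss at a time gives the same posterior as a single batch update:

```python
def update_beta(alpha, beta, data):
    """Return the posterior (alpha_hat, beta_hat) after observing 0/1 data."""
    n1 = sum(data)
    n0 = len(data) - n1
    return alpha + n1, beta + n0

prior = (2.0, 2.0)                       # assumed prior hyperparameters
tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data

batch = update_beta(*prior, tosses)

# Sequential updating: yesterday's posterior is today's prior
sequential = prior
for y in tosses:
    sequential = update_beta(*sequential, [y])

print(batch, sequential)
```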

#### 3.2.1.4 Posterior mode (MAP estimate)
- In Bayesian statistics, the MAP estimate is the mode of the posterior dist, ie. the most probable value of the parameter: $\hat{\theta}_{\text{map}}=\arg\max_\theta p(\theta\mid\mathcal{D})=\arg\max_\theta\left[\log p(\theta)+\log p(\mathcal{D} \mid \theta)\right]$
    - if prior is beta dist: $\hat{\theta}_{\text {map}}=\frac{\breve{\alpha}+N_1-1}{\breve{\alpha}+N_1-1+\breve{\beta}+N_0-1}$
    - if prior is uniform dist, we get the MLE $\hat{\theta}_{\text {mle}}=N_1/N$, because $p(\theta)\propto 1\rightarrow \log p(\theta)=\text{const}$
    - if sample size is low, the MLE can overfit, so we can use a slightly stronger prior: $\operatorname{Beta}(2,2)$ gives **add-one smoothing**, $\hat{\theta}_{\text {map}}=\frac{N_1+1}{N+2}$
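
A small sketch comparing the MLE with MAP estimates under a uniform prior and under a $\operatorname{Beta}(2,2)$ prior, for the hypothetical case of 3 heads in 3 tosses (function names are illustrative):

```python
def mle(n1, n0):
    """Maximum likelihood estimate of the heads probability."""
    return n1 / (n1 + n0)

def map_estimate(n1, n0, alpha, beta):
    """Mode of Beta(alpha + n1, beta + n0), using the formula from the text.
    Assumes the denominator is positive."""
    return (alpha + n1 - 1) / (alpha + n1 - 1 + beta + n0 - 1)

n1, n0 = 3, 0  # three heads, no tails

theta_mle = mle(n1, n0)
theta_map_uniform = map_estimate(n1, n0, 1.0, 1.0)   # Beta(1,1) prior
theta_map_smoothed = map_estimate(n1, n0, 2.0, 2.0)  # Beta(2,2) prior

print(theta_mle, theta_map_uniform, theta_map_smoothed)
```

With the uniform prior the MAP estimate coincides with the MLE and claims tails are impossible; the $\operatorname{Beta}(2,2)$ prior pulls the estimate back toward $0.5$.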
    
#### 3.2.1.5 Posterior mean
- MAP is only the mode of the posterior, a single point, so it can be a poor summary of the dist (eg. when the posterior is skewed)
- The posterior mean is a more robust estimate; it's an integral $\rightarrow \bar{\theta}=\int\theta\, p(\theta \mid \mathcal{D})d\theta$
    - if posterior is beta $\rightarrow \bar{\theta}=\mathbb{E}(\theta \mid \mathcal{D})=\frac{\hat\alpha}{\hat{N}}$, where the *strength of the posterior* is $\hat{N}=\hat\alpha+\hat\beta$ (equivalent sample size)
        - the posterior mean is a convex combination of the MLE $\hat{\theta}_{\text {mle}}=N_1/ N$ and the prior mean $m=\breve\alpha/\breve{N}$, where $\breve{N}=\breve\alpha+\breve\beta$ is the strength of the prior:
        - $\bar\theta = \lambda m + (1-\lambda)\hat\theta_{\text{mle}}=\frac{\hat\alpha}{\hat{N}}$, where $\lambda=\breve{N} / \hat{N}$
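
The convex-combination identity can be verified numerically. A minimal sketch, assuming a hypothetical $\operatorname{Beta}(2,2)$ prior and counts $N_1=7$, $N_0=3$:

```python
def posterior_mean(n1, n0, alpha, beta):
    """alpha_hat / N_hat for a Beta(alpha + n1, beta + n0) posterior."""
    return (alpha + n1) / (alpha + beta + n1 + n0)

def shrinkage_form(n1, n0, alpha, beta):
    """lambda * prior_mean + (1 - lambda) * MLE, and lambda itself."""
    prior_strength = alpha + beta
    post_strength = prior_strength + n1 + n0
    lam = prior_strength / post_strength
    prior_mean = alpha / prior_strength
    theta_mle = n1 / (n1 + n0)
    return lam * prior_mean + (1 - lam) * theta_mle, lam

n1, n0, alpha, beta = 7, 3, 2.0, 2.0  # assumed counts and hyperparameters
direct = posterior_mean(n1, n0, alpha, beta)
combo, lam = shrinkage_form(n1, n0, alpha, beta)

print(direct, combo, lam)
```

As $N$ grows, `lam` shrinks toward zero, so the posterior mean moves from the prior mean toward the MLE.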
        
#### 3.2.1.6 Posterior variance
- The uncertainty that comes with an estimate is given by the **standard error** (posterior standard deviation): $\operatorname{se}=\sqrt{\mathbb{V}(\theta \mid \mathcal{D})}$
- where the **variance** $\mathbb{V}$ is:
    - for a beta posterior (we non-strictly showed that the posterior of a Bernoulli problem w/ a beta prior is a beta dist): $\mathbb{V}(\theta \mid \mathcal{D}) = \frac{\hat\alpha\hat\beta}{(\hat\alpha+\hat\beta)^2(\hat\alpha+\hat\beta+1)}$
    - when $N\gg \breve\alpha+\breve\beta$ this simplifies to: $\mathbb{V}(\theta \mid \mathcal{D}) \approx \frac{\hat{\theta}_{\text {mle}}(1-\hat{\theta}_{\text {mle}})}{N}$
    - $\rightarrow$ $\operatorname{se}\approx \sqrt{\frac{\hat{\theta}_{\text {mle}}(1-\hat{\theta}_{\text {mle}})}{N}}$
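
To see how good the large-$N$ approximation of the standard error is, the sketch below compares it with the exact variance of a beta dist, assuming a uniform prior and hypothetical counts of 700 heads and 300 tails:

```python
import math

def beta_variance(a, b):
    """Exact variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def approx_variance(n1, n0):
    """Large-sample approximation theta_mle * (1 - theta_mle) / N."""
    n = n1 + n0
    theta_mle = n1 / n
    return theta_mle * (1 - theta_mle) / n

alpha, beta = 1.0, 1.0  # assumed uniform prior
n1, n0 = 700, 300       # hypothetical counts, N much larger than alpha + beta

exact_se = math.sqrt(beta_variance(alpha + n1, beta + n0))
approx_se = math.sqrt(approx_variance(n1, n0))

print(exact_se, approx_se)
```

Both values shrink at rate $1/\sqrt{N}$, and the approximation is largest (most uncertain) when $\hat{\theta}_{\text{mle}}=0.5$.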

#### 3.2.1.7 Credible intervals
- Since posterior dists can be incredibly complex functions, we usually work w/ single point estimates ie. mode, mean
- Typically we quantify the uncertainty of these using $100\times(1-\alpha)\%$ **credible intervals**: $\mathcal{C}_\alpha(\mathcal{D}) = \{\theta : F\}$ (predicate function $F$ determines how we define the members of the set in the interval)
    - **central interval** - interval bounded by $(\text{lower},\text{upper})=(l,u)$ that leaves $\alpha/2$ of the posterior mass in each tail: $l=F^{-1}(\alpha/2), u=F^{-1}(1-\alpha/2)$, such that $\mathcal{C}_\alpha(\mathcal{D}) = \{\theta : l\leq\theta\leq u\}$, where here $F$ is the posterior cdf
    - **Highest posterior density (HPD)** - a central interval can exclude values that are more probable than some values inside it; the HPD region fixes this by keeping all points whose density is above a threshold $p^*$, chosen so that $1-\alpha=\int_{\theta:p(\theta \mid \mathcal{D})\geq p^*}p(\theta \mid \mathcal{D})d\theta$, such that $\mathcal{C}_\alpha(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D})\geq p^*\}$
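
Both interval types can be approximated on a grid. The sketch below is a simple numerical illustration, not an exact method: it assumes a hypothetical skewed $\operatorname{Beta}(3,9)$ posterior, discretizes it, and reports the HPD region as a single interval (which is only valid for a unimodal posterior). All names are illustrative.

```python
import math

def beta_pdf(theta, a, b):
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_p = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_p - log_norm)

def grid_posterior(a, b, num=20001):
    """Midpoint grid on (0, 1) with normalized probability mass per cell."""
    width = 1.0 / num
    thetas = [(i + 0.5) * width for i in range(num)]
    mass = [beta_pdf(t, a, b) * width for t in thetas]
    total = sum(mass)
    return thetas, [m / total for m in mass]

def central_interval(thetas, mass, alpha=0.05):
    """Cut alpha/2 of the mass from each tail using the cumulative sum."""
    cum, lower, upper = 0.0, None, None
    for t, m in zip(thetas, mass):
        cum += m
        if lower is None and cum >= alpha / 2:
            lower = t
        if upper is None and cum >= 1 - alpha / 2:
            upper = t
    return lower, upper

def hpd_interval(thetas, mass, alpha=0.05):
    """Keep the highest-density cells until they hold 1 - alpha of the mass."""
    order = sorted(range(len(mass)), key=lambda i: mass[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += mass[i]
        if cum >= 1 - alpha:
            break
    return thetas[min(kept)], thetas[max(kept)]

thetas, mass = grid_posterior(3.0, 9.0)
central = central_interval(thetas, mass)
hpd = hpd_interval(thetas, mass)

print("central:", central)
print("HPD:    ", hpd)
```

For a skewed posterior the HPD interval is shifted toward the mode and is narrower than the central interval with the same coverage.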

#### 3.2.1.8 Posterior predictive distribution
- We want to predict a future observation. To achieve this we can
    - use Bayesian inference to obtain the posterior dist of the model parameters $p(\boldsymbol{\theta} \mid \mathcal{D})$ 
    - define the likelihood of observing a new data point $\boldsymbol{y}$ given $\boldsymbol{\theta}$: $p(\boldsymbol{y} \mid \boldsymbol{\theta})$
    - then we use the **posterior predictive dist**, which marginalizes OUT all the unknown params:   
    $p(\boldsymbol{y} \mid \mathcal{D})=\int \text{likelihood}\times\text{bayes-post} \;d\boldsymbol{\theta} = \int p(\boldsymbol{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta} \mid \mathcal{D})d\boldsymbol{\theta}$
    
- `frequentist` the most common approximation is the **plug-in approx**: $p(\boldsymbol{y} \mid \mathcal{D})\approx p(\boldsymbol{y}\mid\boldsymbol{\hat\theta})$, ie. plugging in a point estimate of the params $\hat\theta=\delta(\mathcal{D})$ (an estimator $\delta$ applied to the data, eg. MLE, MAP):
    - this is equivalent to approximating the posterior w/ a Dirac delta, $p(\boldsymbol{\theta} \mid \mathcal{D})\approx\delta(\theta-\hat\theta)$, whose **sifting** property gives $\rightarrow$ $p(\boldsymbol{y} \mid \mathcal{D})\approx\int p(\boldsymbol{y} \mid \boldsymbol{\theta})\delta(\theta-\hat\theta)d\boldsymbol{\theta} = p(\boldsymbol{y} \mid \boldsymbol{\hat\theta})$
    - A problem w/ the plug-in approximation is that it can overfit and is weak against fat tails: eg. after 3 heads in 3 tosses the MLE is $\hat\theta=1$, so it predicts that tails are impossible
- `bayesian` Alternatively we can marginalize over all the values of each parameter $\boldsymbol{\theta}=(\ldots,\theta,\ldots)$, to compute the exact posterior predictive:
    - $p(y=1 \mid \mathcal{D})=\int_0^1p(y=1 \mid \theta)p(\theta \mid \mathcal{D})d\theta$
        - if beta posterior: $p(y=1 \mid \mathcal{D})=\int_0^1\theta\operatorname{Beta}(\theta \mid \hat\alpha, \hat\beta)d\theta = \mathbb{E}(\theta \mid \mathcal{D}) = \hat\alpha / \hat{N}$
    - this Bayesian approach of marginalizing accounts for uncertainty
        - if the prior is uniform, $\operatorname{Beta}(1,1)$, this gives $p(y=1 \mid \mathcal{D})=\frac{N_1+1}{N+2}$, a.k.a. **Laplace's rule of succession**
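
A minimal sketch contrasting the plug-in and Bayesian predictions for the hypothetical case of 3 heads in 3 tosses, assuming a uniform $\operatorname{Beta}(1,1)$ prior. The closed-form Bayesian answer is also checked against a simple midpoint-rule integration (function names are illustrative):

```python
def plug_in_predictive(n1, n0):
    """p(y=1 | theta_mle): plugs the MLE into the Bernoulli likelihood."""
    return n1 / (n1 + n0)

def bayes_predictive(n1, n0, alpha=1.0, beta=1.0):
    """p(y=1 | D): posterior mean of theta under a Beta(alpha, beta) prior."""
    return (alpha + n1) / (alpha + beta + n1 + n0)

def numeric_predictive(n1, n0, alpha=1.0, beta=1.0, num=100000):
    """Midpoint-rule estimate of the integral of theta * p(theta | D)."""
    width = 1.0 / num
    numerator = denominator = 0.0
    for i in range(num):
        theta = (i + 0.5) * width
        unnorm_post = theta ** (alpha + n1 - 1) * (1 - theta) ** (beta + n0 - 1)
        numerator += theta * unnorm_post
        denominator += unnorm_post
    return numerator / denominator

n1, n0 = 3, 0
p_plug_in = plug_in_predictive(n1, n0)
p_bayes = bayes_predictive(n1, n0)
p_numeric = numeric_predictive(n1, n0)

print(p_plug_in, p_bayes, p_numeric)
```

The plug-in prediction assigns zero probability to tails, while marginalizing over $\theta$ keeps some probability for an outcome that has not been observed yet.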

#### 3.2.1.9 Marginal likelihood
- The **marginal likelihood** for a model $\mathcal{M}$ is: $p(\mathcal{D} \mid \mathcal{M})=\int p(\boldsymbol{\theta} \mid \mathcal{M})p(\mathcal{D} \mid \boldsymbol{\theta},\mathcal{M})d\boldsymbol{\theta}$
    - We can ignore it when performing inference of the params, because it is constant wrt $\boldsymbol{\theta}$
    - However, it is extremely important for 
        - empirical Bayes, estimating hyperparams from data (Sec.3.7)
        - choosing models (Sec.3.8.1)
- It is normally hard to compute, except for conjugate models such as the Bernoulli-beta model $\mathcal{M}$, where the marginal likelihood equals the ratio of the normalization constants of the posterior and the prior: $p(\mathcal{D})=\frac{\operatorname{B}(\hat\alpha,\hat\beta)}{\operatorname{B}(\breve\alpha,\breve\beta)}$, where $\operatorname{B}$ is the beta function
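
The beta-function ratio can be evaluated stably in log space. A small sketch, assuming a uniform prior and hypothetical counts $N_1=7$, $N_0=3$, with a brute-force numerical integral as a sanity check (all names are illustrative):

```python
import math

def log_beta_fn(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(n1, n0, alpha, beta):
    """log p(D) = log B(alpha + n1, beta + n0) - log B(alpha, beta)."""
    return log_beta_fn(alpha + n1, beta + n0) - log_beta_fn(alpha, beta)

def numeric_marginal_likelihood(n1, n0, alpha, beta, num=100000):
    """Midpoint-rule integral of likelihood * prior over theta in (0, 1)."""
    width = 1.0 / num
    total = 0.0
    for i in range(num):
        t = (i + 0.5) * width
        log_prior = ((alpha - 1) * math.log(t) + (beta - 1) * math.log(1 - t)
                     - log_beta_fn(alpha, beta))
        total += t**n1 * (1 - t) ** n0 * math.exp(log_prior) * width
    return total

n1, n0 = 7, 3
closed_form = math.exp(log_marginal_likelihood(n1, n0, 1.0, 1.0))
numeric = numeric_marginal_likelihood(n1, n0, 1.0, 1.0)
# For comparison: a model with no free parameter, a fair coin with theta = 0.5
fair_coin = 0.5 ** (n1 + n0)

print(closed_form, numeric, fair_coin)
```

Comparing `closed_form` with `fair_coin` is a tiny instance of using marginal likelihoods to choose between models.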

### 3.2.2 Modeling more complex data
- In ML we can predict more complex phenomena than Bernoulli coin tosses (Sec.3.2.1)
- We can predict outcomes $\boldsymbol{y}$ given input features $\boldsymbol{x}$, so now we have conditional prob dists of the form: $p(\boldsymbol{y}\mid \boldsymbol{x}, \boldsymbol{\theta})$ (basis of Generalized Linear models (Sec.15) and Neural Nets (Chapter.16) )
- A key quantity is the **posterior predictive dist**: $p(\boldsymbol{y}\mid \boldsymbol{x}, \mathcal{D}) = \int p(\boldsymbol{y}\mid \boldsymbol{x}, \boldsymbol{\theta}) p(\boldsymbol{\theta}\mid\mathcal{D}) d\boldsymbol{\theta}$
    - `Frequentist` approach $\rightarrow$ *plug-in approximation* + MLE / MAP; its downside is that it is sensitive to overfitting and fat tails, because it estimates a constant uncertainty ($\hat\sigma$) for all predictions
        - The nature of uncertainty can be decomposed into **aleatoric/stochastic** (intrinsic, can't be reduced) & **epistemic** uncertainty (can be reduced w/ more data)
    - Heart of `Bayesian` approach is $\rightarrow$ integrating/marginalizing OUT unknown parameters, effectively computing a weighted average of predictions, so the predictive uncertainty varies w/ the input (as opposed to the constant uncertainty of the frequentist approach)
        - Bayes approach accounts for *epistemic uncertainty*, useful for Bayesian lin reg (Sec.15.2), optimization (Sec.6.6), risk-sensitive decision making (Sec.34.1.3) & active learning (Sec.34.7) 
        - moreover Bayesian methods are great for non-linear models such as NNs (Sec.17.1) & Generative models (Part-IV)
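
To illustrate "averaging predictions over the posterior" for a conditional model, here is a toy sketch. Everything in it is an assumption made for illustration: a one-weight logistic model $p(y=1\mid x, w)=\sigma(wx)$, a Gaussian stand-in for the posterior over $w$, and a Monte Carlo average in place of the exact integral.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def plug_in_prediction(x, w_hat):
    """p(y=1 | x, w_hat): uses a single point estimate of the weight."""
    return sigmoid(w_hat * x)

def bayes_prediction(x, w_mean, w_std, num_samples=20000, seed=0):
    """Monte Carlo estimate of the posterior predictive p(y=1 | x, D),
    averaging sigmoid(w * x) over samples w ~ N(w_mean, w_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(w_mean, w_std)
        total += sigmoid(w * x)
    return total / num_samples

w_mean, w_std = 1.0, 1.0  # hypothetical posterior over the weight

# Input close to the origin vs. input far from it
near = (plug_in_prediction(0.1, w_mean), bayes_prediction(0.1, w_mean, w_std))
far = (plug_in_prediction(5.0, w_mean), bayes_prediction(5.0, w_mean, w_std))

print("near:", near)
print("far: ", far)
```

Near the origin the two predictions almost agree, while far from it the averaged prediction is noticeably less confident than the plug-in one, because uncertainty about $w$ matters more as $|x|$ grows.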

### 3.2.3 Selecting the prior
- A challenge w/ Bayes approach is that it requires us to choose the prior $\rightarrow$ can be difficult in large models, eg NNs
- In later sections we'll discuss prior selection
    - *conjugate priors* (Sec.3.4)
    - *uninformative priors* (Sec.3.5)
    - *hierarchical priors* (Sec.3.6)
    - *empirical priors* (Sec.3.7)

### 3.2.4 Computational issues
- Computing posteriors / predictives is expensive in the Bayesian approach. A full discussion is in (Part-II), and a good historical review is [[MFR20](https://arxiv.org/pdf/2004.06425)]

### 3.2.5 Exchangeability and de Finetti's Theorem
- De Finetti's Theorem addresses the philosophical question: where do priors come from? Priors $p(\boldsymbol{\theta})$ are abstract quantities that are not directly measurable. 
    - De Finetti formalized the concept of an **infinitely exchangeable** sequence of rand vars (which is more general than iid): the joint prob of any finite subsequence is invariant under permutation of the indices
    - *Theorem 3.2.1 (de Finetti’s theorem).* A sequence of rvs $(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots)$ is infinitely exchangeable iff, for all $n$, there EXIST a param $\boldsymbol{\theta}$, a likelihood $p(\boldsymbol{x}_i\mid\boldsymbol{\theta})$ & a prior $p(\boldsymbol{\theta})$ such that: 
        - $p(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = \int \prod_{i=1}^n p(\boldsymbol{x}_i\mid\boldsymbol{\theta})p(\boldsymbol{\theta})d\boldsymbol{\theta}$, ie. the rvs are conditionally iid given $\boldsymbol{\theta}$
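
The beta-Bernoulli model from Sec.3.2.1 gives a concrete exchangeable sequence. The sketch below, assuming a uniform prior and a made-up sequence, computes the joint prob by the chain rule of one-step predictives, checks that it is the same for every ordering and equals the mixture integral, and shows that the tosses are not independent (function names are illustrative):

```python
import math
from itertools import permutations

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def joint_prob_chain_rule(seq, alpha=1.0, beta=1.0):
    """p(x_1, ..., x_n) as a product of one-step posterior predictives."""
    a, b, p = alpha, beta, 1.0
    for x in seq:
        p_heads = a / (a + b)
        p *= p_heads if x == 1 else 1 - p_heads
        a, b = a + x, b + (1 - x)  # conjugate update after each toss
    return p

def joint_prob_integral(seq, alpha=1.0, beta=1.0):
    """Closed form of the de Finetti mixture integral for this model."""
    n1 = sum(seq)
    n0 = len(seq) - n1
    return math.exp(log_beta_fn(alpha + n1, beta + n0) - log_beta_fn(alpha, beta))

seq = (1, 1, 0, 1)  # hypothetical sequence of tosses
orderings = sorted(set(permutations(seq)))
probs = [joint_prob_chain_rule(order) for order in orderings]
mixture = joint_prob_integral(seq)

# Exchangeable but not independent: seeing heads raises the prob of heads
p_first_heads = joint_prob_chain_rule((1,))
p_second_heads_given_first = joint_prob_chain_rule((1, 1)) / p_first_heads

print(probs, mixture)
print(p_first_heads, p_second_heads_given_first)
```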


## 3.3 Frequentist statistics
- 

### 3.3.1 Sampling distributions


### 3.3.2 Bootstrap approximation of the sampling distribution


### 3.3.3 Asymptotic normality of the sampling distribution of the MLE


### 3.3.4 Fisher Information Matrix (FIM)


#### 3.3.4.1 FIM Definition


#### 3.3.4.2 Equivalence between FIM and the Hessian of the NLL


#### 3.3.4.3 Example: FIM for the Binomial


#### 3.3.4.4 Example: FIM for the univariate Gaussian


#### 3.3.4.5 Example: FIM for Logistic Regression


#### 3.3.4.6 Example: FIM for the Exponential family


### 3.3.5 Counterintuitive properties of frequentist statistics


#### 3.3.5.1 Confidence intervals


#### 3.3.5.2 p-values


#### 3.3.5.3 Discussion


### 3.3.6 Why isn't everyone a Bayesian?




## 3.4 Conjugate priors

