# Probability & Statistics — Structured Foundations (Clean Terminology Map)

This is a coherent roadmap of the concepts you listed, organized so each layer builds the next.  
For each layer: (i) core objects, (ii) what is fixed vs variable, (iii) the governing equations.

---

## 1. Foundations of Probability (What randomness means)

### Core objects
- **Random variable** (discrete vs continuous): a measurable function
  $$
  X:\Omega \to \mathbb{R}
  $$
- **Sample space**: $\Omega$ (all possible outcomes)
- **Event**: $A \subseteq \Omega$
- **Probability measure**: $P:\mathcal{F}\to[0,1]$ on a $\sigma$-algebra $\mathcal{F}$

### Kolmogorov axioms
For events $A, A_1, A_2, \dots \in \mathcal{F}$:
1. Non-negativity:  
   $$
   P(A)\ge 0
   $$
2. Normalization:  
   $$
   P(\Omega)=1
   $$
3. Countable additivity (pairwise disjoint $A_i$):  
   $$
   P\Big(\bigcup_{i=1}^\infty A_i\Big)=\sum_{i=1}^\infty P(A_i)
   $$

### Distribution functions
- **PMF** (discrete):
  $$
  p(x)=P(X=x), \quad \sum_x p(x)=1
  $$
- **PDF** (continuous):
  $$
  f(x)\ge 0,\quad \int_{-\infty}^{\infty} f(x)\,dx=1,\quad
  P(a\le X\le b)=\int_a^b f(x)\,dx
  $$
- **CDF** (both):
  $$
  F(x)=P(X\le x)
  $$
- **Support**:
  - discrete: $\{x: p(x)>0\}$
  - continuous: region where $f(x)>0$ (defined only up to sets of measure zero)

**Goal**: probability assigns consistent numbers to events, and distributions are how that assignment appears on the real line via $X$.
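
As a small illustration of the PMF, CDF, and support definitions above, here is a minimal sketch for a fair six-sided die. The die and the helper names `cdf` and `prob_between` are assumptions chosen for the example; exact fractions are used so the normalization check is not affected by rounding.

```python
from fractions import Fraction

# PMF of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x), obtained by summing the PMF over values <= x."""
    return sum(p for value, p in pmf.items() if value <= x)

def prob_between(a, b):
    """P(a <= X <= b) for the discrete variable X."""
    return sum(p for value, p in pmf.items() if a <= value <= b)

total = sum(pmf.values())                        # normalization: sum_x p(x)
support = sorted(x for x, p in pmf.items() if p > 0)

print(total, cdf(3), prob_between(2, 4), support)
```

For a continuous variable the sums would become integrals of the density $f$, and `prob_between` would be computed as $F(b)-F(a)$.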

---

## 2. Probability Distributions (Generative models)

A **distribution family** is a set of possible worlds indexed by parameters:
$$
\{P_\theta : \theta \in \Theta\}
\quad\text{or}\quad
\{p(x\mid \theta)\}
$$

### Core distributions (typical parameterizations)
- **Bernoulli**: $X\in\{0,1\}$  
  $$
  P(X=1)=p,\quad P(X=0)=1-p
  $$
- **Binomial**: $X\in\{0,\dots,n\}$  
  $$
  P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}
  $$
- **Multinomial**: counts $x_1,\dots,x_K$ with $\sum x_k=n$  
  $$
  P(\mathbf{x})=\frac{n!}{\prod_{k=1}^K x_k!}\prod_{k=1}^K p_k^{x_k}
  $$
- **Poisson**: $X\in\{0,1,2,\dots\}$  
  $$
  P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}
  $$
- **Uniform** on $[a,b]$  
  $$
  f(x)=\frac{1}{b-a},\quad a\le x\le b
  $$
- **Normal (Gaussian)**  
  $$
  f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
  $$
- **Exponential**  
  $$
  f(x)=\lambda e^{-\lambda x},\quad x\ge 0
  $$
- **Gamma** (shape $\alpha$, rate $\beta$)  
  $$
  f(x)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x},\quad x\ge 0
  $$
- **Beta** (on $[0,1]$)  
  $$
  f(x)=\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}
  $$

### Structural concepts
- **Parametric distribution**: $p(x\mid\theta)$ with finite-dimensional $\theta$
- **Parameters**:
  - location (e.g., $\mu$), scale (e.g., $\sigma$), shape (e.g., $\alpha$)
- **Moments**:
  $$
  \mathbb{E}[X],\quad \mathrm{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2
  $$
  Higher: skewness, kurtosis
- **MGF** (if it exists, i.e. the expectation is finite for $t$ in a neighborhood of $0$):
  $$
  M_X(t)=\mathbb{E}[e^{tX}]
  $$
- **Characteristic function** (always exists):
  $$
  \varphi_X(t)=\mathbb{E}[e^{itX}]
  $$

**Goal**: distributions are families of data-generating worlds, not just formulas.
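
To make the link between a parameterized PMF and its moments concrete, the sketch below evaluates the binomial PMF from the list above and computes the mean and variance directly from the definitions. The values `n = 10` and `p = 0.3` are arbitrary illustrative choices.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative parameter values
ks = range(n + 1)
probs = [binomial_pmf(k, n, p) for k in ks]

total = sum(probs)                                    # should be 1
mean = sum(k * q for k, q in zip(ks, probs))          # E[X]
second_moment = sum(k**2 * q for k, q in zip(ks, probs))
variance = second_moment - mean**2                    # E[X^2] - (E[X])^2

print(f"sum={total:.6f} mean={mean:.6f} var={variance:.6f}")
```

The computed moments can be compared with the known closed forms $\mathbb{E}[X]=np$ and $\mathrm{Var}(X)=np(1-p)$.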

---

## 3. Conditioning & Dependence (Information flow)

### Core definitions
- **Joint distribution**: $p(x,y)$ or $f(x,y)$
- **Marginal distribution**:
  $$
  p(x)=\sum_y p(x,y)
  \quad\text{or}\quad
  f(x)=\int f(x,y)\,dy
  $$
- **Conditional probability**:
  $$
  P(A\mid B)=\frac{P(A\cap B)}{P(B)},\quad P(B)>0
  $$
- **Conditional density/mass**:
  $$
  p(x\mid y)=\frac{p(x,y)}{p(y)}
  $$

### Laws
- **Chain rule**:
  $$
  p(x_1,\dots,x_n)=\prod_{i=1}^n p(x_i\mid x_1,\dots,x_{i-1})
  $$
- **Independence**:
  $$
  X\perp Y \iff p(x,y)=p(x)p(y)
  $$
- **Conditional independence**:
  $$
  X\perp Y \mid Z \iff p(x,y\mid z)=p(x\mid z)p(y\mid z)
  $$
- **Bayes’ theorem**:
  $$
  p(\theta\mid x)=\frac{p(x\mid \theta)p(\theta)}{p(x)}
  $$
- **Law of total probability**:
  $$
  p(x)=\sum_\theta p(x\mid \theta)p(\theta)
  \quad\text{or}\quad
  p(x)=\int p(x\mid \theta)p(\theta)\,d\theta
  $$

**Goal**: conditioning is how information updates probabilities.
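
A minimal sketch of Bayes' theorem and the law of total probability over a finite set of hypotheses. The diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are made up for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of hypotheses.

    prior[h] = p(h); likelihood[h] = p(x | h) for the observed x.
    Returns the posterior p(h | x) for each h, plus the evidence p(x).
    """
    # Law of total probability: p(x) = sum_h p(x | h) p(h)
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    post = {h: prior[h] * likelihood[h] / evidence for h in prior}
    return post, evidence

prior = {"disease": 0.01, "healthy": 0.99}
likelihood_positive = {"disease": 0.95, "healthy": 0.05}  # p(test positive | h)

post, evidence = posterior(prior, likelihood_positive)
print(f"p(positive) = {evidence:.4f}")
print(f"p(disease | positive) = {post['disease']:.4f}")
```

Even with an accurate test, the posterior probability of disease stays well below one half here because the prior is small, which is the kind of update the formulas above describe.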

---

## 4. Likelihood Theory (Inference mindset)

Given data $x$ and model $p(x\mid\theta)$:

- **Likelihood function**:
  $$
  L(\theta\mid x)=p(x\mid\theta)
  $$
- **Log-likelihood**:
  $$
  \ell(\theta\mid x)=\log L(\theta\mid x)=\log p(x\mid\theta)
  $$
- **Likelihood surface**: $\theta \mapsto L(\theta\mid x)$
- **Relative likelihood**:
  $$
  \frac{L(\theta_1\mid x)}{L(\theta_2\mid x)}
  $$
- **Likelihood ratio** (for testing $H_0:\theta\in\Theta_0$ within the full parameter space $\Theta$):
  $$
  \Lambda(x)=\frac{\sup_{\theta\in\Theta_0} L(\theta\mid x)}{\sup_{\theta\in\Theta} L(\theta\mid x)}
  $$
- **Identifiability**:
  $$
  p(x\mid\theta_1)=p(x\mid\theta_2)\ \forall x \ \Rightarrow\ \theta_1=\theta_2
  $$
- **Sufficient statistic** (Fisher–Neyman factorization): $T(x)$ is sufficient for $\theta$ if and only if
  $$
  p(x\mid\theta)=g(T(x),\theta)\,h(x)
  $$

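**Critical insight**: the formula is the same as probability, but the *question* is reversed: the data $x$ are held fixed and the parameters $\theta$ vary. As a function of $\theta$, $L(\theta\mid x)$ is not a probability density and need not integrate to 1.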
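
The sketch below evaluates a Bernoulli log-likelihood over a grid of $\theta$ values with the data held fixed. The dataset and the grid resolution are assumptions made for the example. It also illustrates relative likelihood and the sufficiency idea: reordering the observations leaves $T(x)=\sum_i x_i$, and therefore the likelihood, unchanged.

```python
import math

def log_likelihood(theta, data):
    """Bernoulli log-likelihood l(theta | x) = sum_i log p(x_i | theta)."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]          # hypothetical observations, held fixed
grid = [i / 100 for i in range(1, 100)]  # candidate theta values in (0, 1)

# Likelihood surface: theta -> l(theta | x), evaluated on the grid.
surface = {theta: log_likelihood(theta, data) for theta in grid}
theta_best = max(surface, key=surface.get)

# Relative likelihood L(theta_1 | x) / L(theta_2 | x) for two candidates.
relative = math.exp(surface[0.75] - surface[0.5])

# Sufficiency: reordering the data keeps T(x) = sum(x), hence the likelihood.
reordered = sorted(data)
same_likelihood = math.isclose(
    log_likelihood(0.3, data), log_likelihood(0.3, reordered)
)

print(theta_best, round(relative, 3), same_likelihood)
```

The grid maximizer coincides with the sample proportion of ones, which is the closed-form maximizer for this model.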

---

## 5. Parameter Estimation (From data → model)

### Maximum Likelihood Estimation (MLE)
$$
\hat\theta_{\text{MLE}}=\arg\max_{\theta} L(\theta\mid x)
=\arg\max_{\theta}\ \ell(\theta\mid x)
$$

### Score function (gradient of log-likelihood)
$$
s(\theta)=\nabla_\theta \ell(\theta\mid x)
$$

### Fisher Information
Observed (data-dependent):
$$
\mathcal{I}_{\text{obs}}(\theta)=-\nabla_\theta^2 \ell(\theta\mid x)
$$
Expected:
$$
\mathcal{I}(\theta)=\mathbb{E}\big[s(\theta)s(\theta)^\top\big]
= -\mathbb{E}\big[\nabla_\theta^2 \ell(\theta\mid X)\big]
$$

### Cramér–Rao lower bound (unbiased estimators)
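$$
\mathrm{Var}(\hat\theta)\ \succeq\ \mathcal{I}(\theta)^{-1}
$$
For scalar $\theta$ this reads $\mathrm{Var}(\hat\theta)\ge 1/\mathcal{I}(\theta)$; for vector $\theta$, $\mathrm{Var}(\hat\theta)$ is the covariance matrix and the inequality holds in the positive semidefinite order.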

### Bias, variance, consistency, asymptotics
- Bias:
  $$
  \mathrm{Bias}(\hat\theta)=\mathbb{E}[\hat\theta]-\theta
  $$
- Consistency:
  $$
  \hat\theta_n \xrightarrow[]{P} \theta
  $$
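- Asymptotic normality (typical MLE result for $n$ i.i.d. observations under regularity conditions, with $\mathcal{I}_1(\theta)$ the Fisher information of a single observation):
  $$
  \sqrt{n}\,(\hat\theta_n-\theta)\ \Rightarrow\ \mathcal{N}\big(0,\mathcal{I}_1(\theta)^{-1}\big)
  $$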

**Goal**: understand why MLE works (geometry/curvature via Fisher information), not only how to compute it.
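
As a worked instance of the score, observed information, and MLE, here is a sketch for the exponential model $f(x)=\lambda e^{-\lambda x}$, where $\ell(\lambda\mid x)=n\log\lambda-\lambda\sum_i x_i$. The simulated data, the seed, and the true rate `true_lam = 2.0` are assumptions for the example.

```python
import math
import random

def log_likelihood(lam, xs):
    """l(lam | x) = n log(lam) - lam * sum(x) for i.i.d. Exponential(lam) data."""
    return len(xs) * math.log(lam) - lam * sum(xs)

def score(lam, xs):
    """Derivative of the log-likelihood with respect to lam."""
    return len(xs) / lam - sum(xs)

def observed_information(lam, xs):
    """Negative second derivative of the log-likelihood."""
    return len(xs) / lam**2

random.seed(0)
true_lam = 2.0  # assumed data-generating rate
xs = [random.expovariate(true_lam) for _ in range(5000)]

# Setting the score to zero gives the closed-form MLE: lam_hat = n / sum(x).
lam_hat = len(xs) / sum(xs)

# Curvature-based standard error: 1 / sqrt(observed information at the MLE).
std_error = 1 / math.sqrt(observed_information(lam_hat, xs))

print(f"lam_hat={lam_hat:.4f} std_error={std_error:.4f}")
```

The standard error shrinks like $1/\sqrt{n}$, which matches the asymptotic normality statement above: sharper curvature of the log-likelihood means a more tightly determined estimate.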

---

## 6. Bayesian Perspective (Probability over models)

Bayesian inference treats parameters as random variables.

- **Prior**: $p(\theta)$  
- **Likelihood**: $p(x\mid\theta)$  
- **Posterior**:
  $$
  p(\theta\mid x)=\frac{p(x\mid\theta)p(\theta)}{p(x)}
  $$
- **Evidence / marginal likelihood**:
  $$
  p(x)=\int p(x\mid\theta)p(\theta)\,d\theta
  $$
- **Posterior predictive**:
  $$
  p(x_{\text{new}}\mid x)=\int p(x_{\text{new}}\mid\theta)\,p(\theta\mid x)\,d\theta
  $$
- **Bayes factor** (model comparison):
  $$
  BF_{12}=\frac{p(x\mid M_1)}{p(x\mid M_2)}
  $$
- **Conjugate priors**: priors for which the posterior stays in the same family as the prior (e.g., a Beta prior with a Bernoulli or binomial likelihood)
- **MAP estimation**:
  $$
  \hat\theta_{\text{MAP}}=\arg\max_{\theta}\ p(\theta\mid x)
  =\arg\max_{\theta}\ \big[\log p(x\mid\theta)+\log p(\theta)\big]
  $$

**Conceptual leap**: parameters become random variables; inference becomes probability calculus.
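
A minimal sketch of a conjugate update, assuming a Beta prior on the Bernoulli parameter. With prior $\mathrm{Beta}(\alpha,\beta)$ and $k$ successes in $n$ trials, the posterior is $\mathrm{Beta}(\alpha+k,\ \beta+n-k)$. The prior values and the data are illustrative, and the helper names are hypothetical.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Posterior Beta parameters after observing 0/1 data under a Beta prior."""
    k = sum(data)
    n = len(data)
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); this formula requires alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

data = [1, 0, 1, 1, 0, 1, 1, 1]    # hypothetical observations: 6 successes in 8
prior = (2, 2)                     # illustrative Beta(2, 2) prior

post = beta_bernoulli_update(*prior, data)
posterior_mean = beta_mean(*post)  # also p(x_new = 1 | x), the posterior predictive
theta_map = beta_mode(*post)       # MAP estimate
theta_mle = sum(data) / len(data)  # MLE, for comparison

print(post, round(posterior_mean, 4), theta_map, theta_mle)
```

The MAP estimate sits between the prior mode (0.5) and the MLE, showing how the $\log p(\theta)$ term in the MAP objective pulls the estimate toward the prior.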

---

## 7. Information-Theoretic View (Geometry of probability)

- **Entropy**:
  $$
  H(P)=-\sum_x p(x)\log p(x)
  \quad\text{or}\quad
  h(P)=-\int f(x)\log f(x)\,dx
  $$
- **Cross-entropy**:
  $$
  H(P,Q)=-\sum_x p(x)\log q(x)
  $$
- **KL divergence**:
  $$
  D_{\mathrm{KL}}(P\|Q)=\sum_x p(x)\log\frac{p(x)}{q(x)}
  \quad\text{or}\quad
  \int f(x)\log\frac{f(x)}{g(x)}\,dx
  $$
- **Mutual information**:
  $$
  I(X;Y)=D_{\mathrm{KL}}(p(x,y)\ \|\ p(x)p(y))
  $$
- **Maximum entropy principle**: choose the distribution with maximal entropy under constraints
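- **Log-loss**: negative log-likelihood per sample, $-\frac{1}{N}\sum_{i=1}^N \log q(x_i)$, which is an empirical estimate of the cross-entropy $H(P,Q)$ (typical learning objective)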

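**Unification**: likelihood, learning, compression, and generalization often reduce to minimizing cross-entropy / KL. Since $H(P,Q)=H(P)+D_{\mathrm{KL}}(P\|Q)$ and $H(P)$ does not depend on $Q$, minimizing cross-entropy over the model $Q$ is equivalent to minimizing $D_{\mathrm{KL}}(P\|Q)$.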
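
The discrete definitions above translate directly into code. This sketch uses two made-up distributions `p` and `q` on three outcomes, assumes $q(x)>0$ wherever $p(x)>0$, and checks numerically that cross-entropy equals entropy plus KL divergence. Natural logarithms are used, so values are in nats.

```python
import math

def entropy(p):
    """H(P) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) log(p(x) / q(x)); same assumption on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # illustrative "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # illustrative model distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(f"H(P)={h_p:.4f} H(P,Q)={h_pq:.4f} KL={kl:.4f} gap={h_pq - (h_p + kl):.2e}")
```

KL divergence is non-negative and zero only when the two distributions agree, and it is not symmetric in its arguments.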

---

## 8. Monte Carlo & Sampling (Probability as a generator)

- **Monte Carlo estimation** (for $\mu=\mathbb{E}[f(X)]$):
  $$
  \hat\mu=\frac{1}{N}\sum_{i=1}^N f(X_i),\quad X_i\sim p
  $$
- **Law of Large Numbers** (i.i.d. samples with $\mathbb{E}|f(X)|<\infty$, as $N\to\infty$):
  $$
  \hat\mu \xrightarrow[]{a.s.} \mu
  $$
- **Central Limit Theorem** (i.i.d. samples with $\sigma^2=\mathrm{Var}(f(X))<\infty$):
  $$
  \sqrt{N}(\hat\mu-\mu)\ \Rightarrow\ \mathcal{N}(0,\sigma^2)
  $$
- **Importance sampling** (proposal $q$ with $q(x)>0$ wherever $f(x)p(x)\neq 0$):
  $$
  \mathbb{E}_p[f(X)]=\mathbb{E}_q\!\left[f(X)\frac{p(X)}{q(X)}\right]
  \approx \frac{1}{N}\sum_{i=1}^N f(X_i)\frac{p(X_i)}{q(X_i)},\quad X_i\sim q
  $$
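- **Rejection sampling**: with proposal $q$ and envelope constant $M$ such that $p(x)\le M\,q(x)$ for all $x$, draw $x\sim q$ and $u\sim\mathrm{Uniform}(0,1)$, and accept $x$ if $u\le p(x)/(M\,q(x))$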
- **MCMC**: build a Markov chain with stationary distribution $p$
  - **Metropolis–Hastings**: propose $x'\sim q(\cdot\mid x)$ and accept with probability
    $$
    \alpha=\min\!\left(1,\frac{p(x')q(x\mid x')}{p(x)q(x'\mid x)}\right)
    $$
  - **Gibbs sampling**: sample from conditionals $p(x_i\mid x_{-i})$
- **Langevin dynamics** (score-driven sampling idea):
  $$
  x_{t+1}=x_t+\frac{\epsilon}{2}\nabla_x \log p(x_t)+\sqrt{\epsilon}\,z_t,\quad z_t\sim \mathcal{N}(0,I)
  $$

**Goal**: probability becomes computation: turning distributions into samples.
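
The estimators above can be sketched with the standard library alone. This example estimates $\mathbb{E}[X^2]=1$ for $X\sim\mathcal{N}(0,1)$ three ways: plain Monte Carlo, importance sampling with a wider normal proposal, and a random-walk Metropolis chain (a special case of Metropolis–Hastings in which the proposal is symmetric, so $q$ cancels in $\alpha$). The target, proposal widths, sample sizes, and seed are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)
N = 20000

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# 1) Plain Monte Carlo: average f(X_i) = X_i^2 with X_i ~ p = N(0, 1).
mc_estimate = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N)) / N

# 2) Importance sampling: X_i ~ q = N(0, 2^2), reweighted by p(X_i) / q(X_i).
q_samples = [random.gauss(0.0, 2.0) for _ in range(N)]
is_estimate = sum(
    x * x * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in q_samples
) / N

# 3) Random-walk Metropolis targeting p, known only up to a constant.
def log_target(x):
    """Unnormalized log-density of N(0, 1)."""
    return -0.5 * x * x

def metropolis(n_steps, step_size=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step_size)
        log_alpha = min(0.0, log_target(proposal) - log_target(x))
        if random.random() < math.exp(log_alpha):  # accept with probability alpha
            x = proposal
        chain.append(x)  # on rejection the current state is repeated
    return chain

chain = metropolis(N)
mh_estimate = sum(x * x for x in chain) / len(chain)

print(f"MC={mc_estimate:.3f} IS={is_estimate:.3f} MH={mh_estimate:.3f}")
```

All three estimates should land near 1. The Metropolis samples are correlated, so for the same number of draws its estimate is typically noisier than the i.i.d. ones.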

---

## 9. Statistical Modeling View (Putting it all together)

- **Generative vs discriminative**
  $$
  \text{generative: } p(x,y)\ \text{ or } p(x\mid\theta)
  \qquad
  \text{discriminative: } p(y\mid x)
  $$
- **Latent variable models**: introduce unobserved $z$:
  $$
  p(x)=\int p(x,z)\,dz
  $$
- **Mixture models**:
  $$
  p(x)=\sum_{k=1}^K \pi_k\,p(x\mid \theta_k)
  $$
- **Expectation–Maximization (EM)** (maximize likelihood with latent variables):
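  - E-step: compute the posterior over the latent variables, $p(z\mid x,\theta^{old})$
  - M-step: maximize the expected complete log-likelihood, $\theta^{new}=\arg\max_\theta\ \mathbb{E}_{p(z\mid x,\theta^{old})}\big[\log p(x,z\mid\theta)\big]$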
- **Exponential family**:
  $$
  p(x\mid\eta)=h(x)\exp\big(\eta^\top T(x)-A(\eta)\big)
  $$
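- **Identifiability and overfitting**: a model is non-identifiable when distinct parameters give the same distribution (e.g., relabeling the components of a mixture); a too-flexible model can fit noise in the training data and generalize poorly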
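
The mixture-model and EM items above can be combined into one small sketch: EM for a two-component, one-dimensional Gaussian mixture. The synthetic data, the initial values, and the fixed 50 iterations are assumptions for illustration. The code is not numerically hardened; a practical implementation would typically work in log space.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(xs, weights, mus, sigmas):
    """Observed-data log-likelihood: sum_i log sum_k pi_k p(x_i | theta_k)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)))
        for x in xs
    )

def em_step(xs, weights, mus, sigmas):
    K = len(weights)
    # E-step: responsibilities r[i][k] = p(z_i = k | x_i, theta_old).
    resp = []
    for x in xs:
        joint = [weights[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(K)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: closed-form maximizers of the expected complete log-likelihood.
    new_w, new_mu, new_sigma = [], [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)
        mu_k = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var_k = sum(r[k] * (x - mu_k) ** 2 for r, x in zip(resp, xs)) / nk
        new_w.append(nk / len(xs))
        new_mu.append(mu_k)
        new_sigma.append(math.sqrt(var_k))
    return new_w, new_mu, new_sigma

random.seed(0)
# Synthetic data: 30% from N(-2, 1) and 70% from N(3, 1).
xs = [random.gauss(-2.0, 1.0) for _ in range(300)]
xs += [random.gauss(3.0, 1.0) for _ in range(700)]

weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # rough initial guess
history = [log_likelihood(xs, weights, mus, sigmas)]
for _ in range(50):
    weights, mus, sigmas = em_step(xs, weights, mus, sigmas)
    history.append(log_likelihood(xs, weights, mus, sigmas))

print([round(w, 3) for w in weights], [round(m, 3) for m in mus])
```

Each EM iteration does not decrease the observed-data log-likelihood, which `history` lets you check. Swapping the two components' labels gives the same fitted distribution, a concrete case of the identifiability issue noted above.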

---

## 10. Conceptual Meta-Ideas (Expert-level understanding)

- **Forward vs inverse problems**:
  $$
  \theta \to x \quad\text{(forward)}
  \qquad\text{vs}\qquad
  x \to \theta \quad\text{(inverse)}
  $$
- **Data-generating process**: the (unknown) mechanism producing observations
- **Model misspecification**: true process not in $\{p(\cdot\mid\theta)\}$
- **Epistemic vs aleatoric uncertainty**:
  - epistemic: uncertainty about the model/parameters
  - aleatoric: inherent randomness in outcomes
- **Frequentist vs Bayesian**: parameters fixed vs random
- **Inference vs prediction**: estimate and interpret $\theta$ vs forecast new $x$
- **Probability as belief vs frequency**: interpretation layer that changes what “probability” *means* philosophically

---

## Minimal “Must-Master” Core (Sharp Path)

If reduced to a non-negotiable core:

1. Random variables and distributions  
2. Conditional probability and Bayes’ theorem  
3. Likelihood and log-likelihood  
4. Maximum Likelihood Estimation  
5. Entropy and KL divergence  
6. Monte Carlo sampling  

Master these deeply, and most of the remaining structure becomes a consequence of them.
