### What are we missing?

<div style="font-size:14px">

| **Aspect** | **MLE (Maximum Likelihood Estimation)** | **MAP (Maximum A Posteriori Estimation)** |
|------------|------------------------------------------|--------------------------------------------|
| **Objective** | Estimate parameter $\theta$ that maximizes the likelihood of the observed data. | Estimate parameter $\theta$ that maximizes the posterior probability given the data. |
| **Optimization Goal** | $\displaystyle \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D \mid \theta)$ | $\displaystyle \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D)$ |
| **Formula Derivation** | Maximize the likelihood: <br> $\displaystyle \mathcal{L}(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$ <br> Take the log: <br> $\displaystyle \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$ <br> Then: <br> $\displaystyle \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \log P(D \mid \theta)$ | Use Bayes’ Theorem: <br> $\displaystyle P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}$ <br> Ignore constant $P(D)$: <br> $\displaystyle \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(D \mid \theta) P(\theta)$ <br> or log-form: <br> $\displaystyle \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$ |
| **Includes Prior?** | ❌ No | ✅ Yes |
| **Sensitivity to Prior** | Not sensitive (no prior used) | Sensitive to prior choice |
| **Overfitting Risk** | Higher, especially for small data | Lower, prior acts as regularizer |
| **Asymptotic Behavior** | As $n \to \infty$, MLE is consistent | As $n \to \infty$, MAP $\to$ MLE |
| **Computational Complexity** | Lower (no prior term) | Higher (includes prior) |
| **Interpretation** | Frequentist — parameters are fixed | Bayesian — parameters are random variables |
| **Uniform Prior Case** | Coincides with MAP | Reduces to MLE when $P(\theta)$ is uniform ($\log P(\theta)$ is constant) |
| **Regularization View** | No regularization | Prior acts like regularization <br> Gaussian prior $\Rightarrow L_2$ <br> Laplace prior $\Rightarrow L_1$ |
| **Example: Gaussian Likelihood** | $x_i \sim \mathcal{N}(\mu, \sigma^2)$ <br> $\displaystyle \hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum x_i$ | Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$ <br> $\displaystyle \hat{\mu}_{\text{MAP}} = \frac{n\sigma^{-2}}{n\sigma^{-2} + \tau^{-2}} \bar{x} + \frac{\tau^{-2}}{n\sigma^{-2} + \tau^{-2}} \mu_0$ |

</div>
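The Gaussian example in the last row of the table can be checked numerically. The sketch below is a minimal illustration, assuming a known noise scale `sigma` and made-up values for `mu0` and `tau`; the function names are hypothetical. It shows the MAP estimate shrinking toward the prior mean for small samples and approaching the MLE as $n$ grows.

```python
import numpy as np

def gaussian_mean_mle(x):
    """MLE of the mean of N(mu, sigma^2): the sample average."""
    return float(np.mean(x))

def gaussian_mean_map(x, sigma, mu0, tau):
    """MAP of mu under the prior mu ~ N(mu0, tau^2), with sigma known."""
    n = len(x)
    data_precision = n / sigma**2      # n * sigma^{-2}
    prior_precision = 1.0 / tau**2     # tau^{-2}
    weight = data_precision / (data_precision + prior_precision)
    return float(weight * np.mean(x) + (1.0 - weight) * mu0)

rng = np.random.default_rng(0)
sigma, mu0, tau = 2.0, 0.0, 1.0        # illustrative values, not from the text
small = rng.normal(5.0, sigma, size=5)
large = rng.normal(5.0, sigma, size=50_000)

# Distance between MAP and MLE for a small and a large sample.
gap_small = abs(gaussian_mean_map(small, sigma, mu0, tau) - gaussian_mean_mle(small))
gap_large = abs(gaussian_mean_map(large, sigma, mu0, tau) - gaussian_mean_mle(large))
print(gap_small, gap_large)
```

With only five points the prior pulls the estimate noticeably toward `mu0`; with 50,000 points the two estimates are nearly identical.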


<div style="font-size:14px">
<p>Modeling uncertainty is key to capturing sparse signals in low-SNR environments.<br>
Alternatives such as the voting/agreement rate in Random-Forest–like models are not good enough.<br>
Probabilistic programming models provide more principled uncertainty estimation (❌ not perfect though).</p>

$
\underbrace{P(w \mid D)}_{\text{posterior}} 
= \frac{
    \overbrace{P(D \mid w)}^{\text{likelihood}} \cdot 
    \overbrace{P(w)}^{\text{prior}}
}{
    \underbrace{P(D)}_{\text{evidence}}
}
$

<p>General Form of Prior P(w):</p>

$
P(w) := \mathbb{E}_{x, t, \theta, \varepsilon} \left[ P(w \mid x, t, \theta, \varepsilon) \right] = \int P(w \mid x, t, \theta, \varepsilon) \, P(x) P(t) P(\theta) P(\varepsilon) \, dx \, dt \, d\theta \, d\varepsilon \ \text{(weighted average)}\\
P(w \mid x, t, \theta, \varepsilon) = \mathcal{F}(x, t, \theta, \varepsilon) \approx \underbrace{P(w \mid \theta)}_{
\begin{array}{c}
    \text{usually in (Deep)ProbProg models (e.g. BNN)}\\
    \text{assume static, noise-free and feature-independent}
\end{array}
}
$

<p>Conditioning terms:</p>

- $x$: input/context — adapts prior to input
- $t$: time — allows temporal dynamics
- $\theta$: hyperparameters — controls prior structure
- $\varepsilon$: noise — models stochasticity


## Bayesian Neural Network (BNN) in Deep Probabilistic Programming as an Approximate Implementation of MAP Estimation

<p>Assuming:</p>

- Likelihood $P(D \mid w) = P(y_{1:N} \mid x_{1:N}, w) \overbrace{=}^{
    \begin{array}{c}
        \text{autoregressive}\\
        \text{decomposition}
    \end{array}
}\\
\prod_{i=1}^N P(y_i \mid y_{<i}, x_{\le i}, w) \overbrace{\approx}^{\text{model}}\\
\prod_{i=1}^N \mathcal{N}(y_i; f_w(y_{<i}, x_{\le i}), \sigma^2)\\
\Rightarrow \log P(D \mid w) = \sum_{i=1}^N \log P(y_i \mid y_{<i}, x_{\le i}, w) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f_w(y_{<i}, x_{\le i}))^2 + \text{const}$

- Prior $P(w) = \mathbb{E}_{x,t,\varepsilon}[P(w \mid x, t, \theta, \varepsilon)] \overbrace{\approx}^{\text{model}} P(w \mid \theta) = \mathcal{N}(w; 0, \tau^2 I)\\
\Rightarrow \log P(w) = -\frac{1}{2\tau^2} \|w\|^2 + \text{const}$


<p>The MAP objective becomes:</p>

$
w^* = \arg\max_w P(w \mid D) = \arg\max_w P(D \mid w) \cdot P(w) = \arg\max_w \left( \log P(D \mid w) + \log P(w) \right)
= \arg\max_w \left( \sum_{i=1}^N \log P(y_i \mid y_{<i}, x_{\le i}, w) + \log P(w) \right)\\
= \arg\min_w \left( - \sum_{i=1}^N \log P(y_i \mid y_{<i}, x_{\le i}, w) - \log P(w) \right)
$

- Because **MLE** and **MAP** are defined through argmax/argmin, both are **point estimates**; a MAP estimate can nevertheless serve as the starting point for approximating the full posterior (e.g. the Laplace approximation, a Gaussian centered at the MAP)

<p>with Gaussian assumption:</p>

$
= \arg\min_w \left(\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f_w(y_{<i}, x_{\le i}))^2 + \frac{1}{2\tau^2} \|w\|^2 \right)
$

<p>with i.i.d. assumption:</p>

$
= \arg\min_w \left(\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f_w(x_i))^2 + \frac{1}{2\tau^2} \|w\|^2 \right)
$

<p>Multiplying the objective by $2\sigma^2$ (which does not change the minimizer) gives the equivalent form:</p>

$
\text{Loss}(w) = \text{MSE loss} + \text{L2 regularization} = \sum_{i=1}^N \left( y_i - f_w(x_i) \right)^2 + \lambda \cdot \|w\|^2, \quad \text{with } \lambda = \frac{\sigma^2}{\tau^2}
$
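The equivalence between the Gaussian MAP objective and L2-regularized least squares can be verified on a linear model, where $f_w(x) = w^\top x$. This is a sketch under assumed values for `sigma`, `tau`, and synthetic data; it checks that the ridge solution with $\lambda = \sigma^2 / \tau^2$ is a stationary point of the negative log-posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
sigma, tau = 0.5, 2.0                 # noise std and prior std (assumed known)
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])   # synthetic ground truth
y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_posterior_grad(w):
    """Gradient of (1/(2 sigma^2)) * sum (y - Xw)^2 + (1/(2 tau^2)) * ||w||^2."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# Ridge solution of sum (y - Xw)^2 + lam * ||w||^2 with lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# The MAP gradient vanishes at the ridge solution (up to round-off).
grad_at_ridge = neg_log_posterior_grad(w_ridge)
print(np.abs(grad_at_ridge).max())
```

For a nonlinear $f_w$ there is no closed form, but the same loss can be minimized with any gradient-based optimizer using weight decay $\lambda$.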

</div>

<div style="font-size:14px">

## All BNN Training Methods (forward → loss compute → backward → weight update):

| **Method**            | **Posterior Type**                           | **Inference Type**     | **Assumptions Made**                                                                                                                                          | **Uncertainty Quality** | **Scalability** | **Compute**    | **Packages**                           | **References**                                       | **Assumptions Handled By Model?**                                        | **Exact Posterior in Limit?**         | **Overfitting Risk**                                                         |
|----------------------|-----------------------------------------------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|------------------|----------------|----------------------------------------|------------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------|--------------------------------------------------------------------------------|
| **Bayes by Backprop** | Mean-field Gaussian                          | Variational              | Weights are independent; posterior is fully factorized Gaussian                                                                                               | Medium                   | High             | Low            | Pyro, Blitz-BNN, Bayesian-Torch        | Blundell et al., 2015                               | ❌ Posterior factorization not modeled explicitly                           | ❌ No                                 | **High** — limited posterior capacity encourages under-regularized solutions |
| **Flipout**           | Factorized Gaussian (decorrelated noise)     | Variational              | Weights are independent; Gaussian posterior; noise decorrelated across examples                                                                               | Medium+                  | High             | Low–Medium     | TensorFlow Probability                 | Wen et al., 2018                                  | ⚠️ Decorrelated noise reduces gradient variance, not structural assumptions  | ❌ No                                 | **Medium–High** — slightly improved over BBP                                 |
| **SGLD**              | Sampled Posterior                            | MCMC                     | Langevin dynamics without MH; uses minibatches; assumes step size ε→0; assumes stochastic gradient noise does not dominate Langevin noise    | High                     | High             | Medium         | PyTorch-Bayes, Emukit                  | Welling & Teh, 2011                                | ⚠️ Approximate posterior unless step size decays and noise is unbiased      | ⚠️ Only asymptotically (ε→0)           | **Medium** — implicit noise helps, but bias may hurt posterior accuracy     |
| **pSGLD**             | Preconditioned Posterior Samples             | MCMC                     | Same as SGLD; assumes curvature can be estimated online to scale gradients; assumes stability in adaptive noise statistics                                     | High+                    | High             | Medium+        | TFP (custom), Pyro (custom)            | Li et al., 2016                                    | ⚠️ Curvature modeled adaptively; still asymptotic correctness only           | ⚠️ Only asymptotically (ε→0)           | **Low–Medium** — better exploration helps avoid local minima                |
| **HMC**               | Exact Posterior                              | MCMC                     | No approximation; assumes smooth potential energy; full data gradients; no subsampling allowed                                                                | Very High                | Low              | Very High      | Stan, PyMC3, TF Probability            | Neal, 2011                                         | ✅ Fully nonparametric; assumptions hold for smooth models                  | ✅ Yes (only asymptotically on sample -> ∞)    | **Low** — proper posterior prevents overfitting                             |
| **Laplace Approx.**   | Gaussian around MAP                          | Deterministic            | Posterior is Gaussian near MAP; assumes curvature (Hessian) captures uncertainty                                                                              | Low–Medium               | High             | Low (Post-hoc) | LaplaceTorch, GPyTorch                 | MacKay, 1992                                       | ❌ Strong Gaussianity assumption near MAP                                   | ❌ No                                 | **High** — narrow posterior underestimates uncertainty                      |
| **Expectation Prop.** | Moment-matched Gaussian                      | Deterministic Approx.    | Approximates each likelihood term with Gaussian; moment-matching used to update posterior                                                                     | Medium–High              | Low              | High           | Edward1, GPy                           | Minka, 2001                                        | ❌ Still assumes Gaussian factors; better fit than mean-field                | ❌ No                                 | **Medium** — improved fit, but still approximate                            |
| **Functional BNN**    | Posterior over Functions (not weights)       | Hybrid (VI + GP-style)   | Prior and posterior over output functions; architecture defines function space distribution                                                                   | Very High                | Low              | Very High      | Neural Processes, GPJax, Functorch     | Garnelo et al., 2018; Rasmussen & Williams, 2006  | ✅ Function-space inference avoids parametric assumptions in weights         | ✅ Yes                                | **Low** — function-level regularization is strong                           |

</div>


In [None]:
# https://www.youtube.com/watch?v=LlzVlqVzeD8&list=PLHSMzCAQRltMGNQ9MxE7YBV87N0btrlUo&ab_channel=PyData
# https://www.youtube.com/watch?v=KhAUfqhLakw&list=PLHSMzCAQRltMGNQ9MxE7YBV87N0btrlUo&index=2&ab_channel=Enthought
# https://www.youtube.com/watch?v=i5PEMt21dO8&list=PLBjSxdPpAJGz-zSjO1Lpkc-0ibLTcz2o9&ab_channel=SMILES-SummerSchoolofMachineLearningatSK

# | Feature / Library                | **Pyro**                  | **Blitz-Bayesian-PyTorch**  | **Bayesian-Torch**            | **PyMC**                       | **NumPyro**               | **TensorFlow Probability (TFP)**        |
# | -------------------------------- | ------------------------- | --------------------------- | ----------------------------- | ------------------------------ | ------------------------- | --------------------------------------- |
# | **Backend**                      | PyTorch                   | PyTorch                     | PyTorch                       | Aesara / JAX                   | JAX                       | TensorFlow                              |
# | **Type**                         | Probabilistic Programming | Lightweight BNN             | Modular Bayesian Layers       | Probabilistic Programming      | Probabilistic Programming | Probabilistic Programming + Layers      |
# | **Inference Methods**            | SVI, HMC, NUTS            | Variational Inference       | VI, MC Dropout                | NUTS, HMC, ADVI                | NUTS, HMC, SVI            | HMC, VI, EM                             |
# | **BNN Support**                  | ✔️ Custom BNNs            | ✔️ Easy BNNs via decorators| ✔️ Deep BNNs & drop-in layers | ⚠️ Basic BNN support          | ⚠️ Some support (manual)  | ✔️ Keras BNN Layers                    |
# | **Ease of Use**                  | Medium                    | Easy                        | Medium                        | Easy                           | Medium                    | Medium                                  |
# | **Deep Learning Scale**          | ✔️ Yes                    | ✔️ Yes                     | ✔️ Yes                        | ❌ Limited                    | ⚠️ Limited GPU support    | ✔️ Yes (via TensorFlow)                |
# | **GPU Acceleration**             | ✔️ Yes                    | ✔️ Yes                     | ✔️ Yes                        | ⚠️ Limited (JAX backend only) | ✔️ JAX (fast!)            | ✔️ TensorFlow                          |
# | **Good for Probabilistic Logic** | ✔️ Yes                    | ❌                         | ❌                            | ✔️ Yes                        | ✔️ Yes                    | ✔️ Yes                                 |
# | **Learning Curve**               | Steep                     | Low                         | Medium                        | Medium                         | Medium                    | Medium                                  |
# | **Community & Maturity**         | Large (Uber, academic)    | Small                       | Medium                        | Large & mature                 | Growing fast (Pyro team)  | Large (Google)                          |
# | **Best Use Case**                | Custom probabilistic BNNs | Quick, practical BNNs       | Plug-and-play BNNs            | Statistical models, small BNNs | Fast HMC/VI for research  | Keras-style probabilistic deep learning |

<div style="font-size:14px">

---
### MCMC-pSGLD: (Markov Chain Monte Carlo - preconditioned Stochastic Gradient Langevin Dynamics)
#### Forward Pass:
- **Monte Carlo**: like most models built on the MAP formulation, this one is generative (it models an approximation of the real joint distribution),<br>
  so we cannot simply plug the mean of each parameter node into the network to compute the output.<br>
  A single prediction requires multiple forward passes, one per parameter sample drawn from the trained posterior $p(\theta \mid \mathcal{D}_{\text{train}})$,<br>
  using a sampling method whose draws follow the true posterior distribution (high-dimensional, intractable, unnormalized).<br>

$$
\begin{aligned}
\underbrace{p(y_{\text{test}} \mid X_{\text{test}}, \mathcal{D}_{\text{train}})}_{\textbf{Bayesian Prediction}}
&= \int 
\underbrace{p(y_{\text{test}} \mid X_{\text{test}}, \theta)}_{\textbf{Likelihood (model output sample)}}
\cdot 
\underbrace{p(\theta \mid \mathcal{D}_{\text{train}})}_{\textbf{True Posterior}}
\, d\theta 
\\[1.2em]
&= \underbrace{\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D}_{\text{train}})} \left[ p(y_{\text{test}} \mid X_{\text{test}}, \theta) \right]}_{\textbf{Expectation over Posterior}}
\\[1.2em]
&\approx \underbrace{\frac{1}{T} \sum_{i=1}^T p(y_{\text{test}} \mid X_{\text{test}}, \theta^{(i)})}_{\textbf{Monte Carlo Estimate}}
\quad \text{where } \theta^{(i)} \sim p(\theta \mid \mathcal{D}_{\text{train}})
\end{aligned}
$$
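The Monte Carlo estimate above can be sanity-checked on a model whose predictive density is known in closed form. The sketch below assumes a conjugate toy model ($y \sim \mathcal{N}(\theta, \sigma^2)$, $\theta \sim \mathcal{N}(0, \tau^2)$) chosen only for illustration; in a BNN the posterior samples $\theta^{(i)}$ would come from the sampler instead of `rng.normal`.

```python
import numpy as np

def normal_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0                      # assumed noise and prior scales
y_train = rng.normal(1.5, sigma, size=10)  # synthetic training data

# Closed-form posterior of theta for this conjugate model.
post_var = 1.0 / (len(y_train) / sigma**2 + 1.0 / tau**2)
post_mean = post_var * y_train.sum() / sigma**2

# Monte Carlo estimate: average the likelihood over T posterior samples.
T = 100_000
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=T)
y_test = 2.0
mc_predictive = normal_pdf(y_test, theta_samples, sigma).mean()

# Exact predictive density for comparison: N(post_mean, sigma^2 + post_var).
exact_predictive = normal_pdf(y_test, post_mean, np.sqrt(sigma**2 + post_var))
print(mc_predictive, exact_predictive)
```

The two numbers agree to within Monte Carlo error, which shrinks as $1/\sqrt{T}$.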

#### Loss Computation:
- See the previous section (without the Gaussian/i.i.d. assumptions)

- The MAP point estimate already contains the loss definition: <br>
    $ w^* = \arg\min_w \left( - \sum_{i=1}^N \log P(y_i \mid y_{<i}, x_{\le i}, w) - \log P(w) \right) $

#### Backward Pass:
- We evaluate the weights through the full joint posterior distribution, not just the MAP point estimate
- Instead of the gradient of a loss (negative log-likelihood), we compute the gradient of the log-posterior
- The "training" (posterior sampling) happens after all data is available: an algorithm explores the parameter space (the state space of the Markov chain, which is also the support of the posterior) over many iterations, so that its samples approximate the joint posterior distribution. Note that an iteration (a time step of the Markov chain in latent space) is not a sample (a time step in the sequential data)
- **Markov Chain**: the true posterior $p(\theta \mid \mathcal{D}_{\text{train}})$ can be approximated by the stationary distribution of a discrete-time Markov chain as the number of steps goes to infinity (the state space of this Markov process is also the support of the true posterior, $\theta \in \mathbb{R}^D$),<br>
    if we assume:
    - **Ergodicity**: The chain forgets its starting point.
        - $\forall \theta, \theta', \exists t \in \mathbb{N} \text{ such that } K^t(\theta' \mid \theta) > 0$
        - Ergodicity ⇐ Aperiodicity + Irreducibility
            - **Aperiodicity**: No cyclic pattern in transitions
            - **Irreducibility**: Every state is reachable from every other state in finite steps
    - **Time-homogeneity**: Transition probabilities $K$ are fixed over time
    - Target Invariance via **Detailed Balance**:
        - $p(\theta \mid \mathcal{D}) K(\theta' \mid \theta) = p(\theta' \mid \mathcal{D}) K(\theta \mid \theta') \quad \text{for all } \theta, \theta'$
        - this is a microscopic symmetry in parameter space: forward probability flow = backward probability flow
        - implies **Stationarity** (integrate both sides over $\theta$): once the chain is distributed as $p(\theta \mid \mathcal{D})$, it stays that way
            - $p(\theta' \mid \mathcal{D}) = \int_\Theta K(\theta' \mid \theta) p(\theta \mid \mathcal{D}) \, d\theta \quad \text{for all } \theta'$
    - ✅ for all of the assumptions above:
        - a kernel $K$ that satisfies them can always be constructed (i.e. exists), regardless of the training data $\mathcal{D}$
        - however, detailed balance and stationarity involve $p(\theta \mid \mathcal{D})$, so $K$ itself depends on $\mathcal{D}$ and needs to be constructed carefully

- Let $\{\theta_t\}_{t=0}^\infty$ be a Markov chain
- Let $A$ be a measurable region of parameter space (e.g. the $i$-th component of $\theta$ is $> 0.5$, accuracy$(\theta) > 90\%$, etc.)

$$
\begin{aligned}
\lim_{t \to \infty} \mathbb{P}(\theta_t \in A)
&= \lim_{t \to \infty}
\underbrace{
\int_{\Theta} \cdots \int_{\Theta}
}_{t \text{ nested integrals}} \;
\underbrace{K(\theta_t \mid \theta_{t-1})}_{\text{transition kernel}} \cdots K(\theta_1 \mid \theta_0)
\underbrace{\mu_0(\theta_0)}_{\text{initial distribution}} \,
\mathbf{1}_A(\theta_t)
\; d\theta_0 \cdots d\theta_t
\\[2ex]
&\quad \textcolor{gray}{\text{// Expand marginal probability of } \theta_t \in A \text{ via the full joint chain law: } \mu_0 \cdot K \cdots K}
\\[2ex]

&= \lim_{t \to \infty}
\int_A
\left(
\int_{\Theta} \cdots \int_{\Theta}
K(\theta_t \mid \theta_{t-1}) \cdots K(\theta_1 \mid \theta_0)
\mu_0(\theta_0)
\; d\theta_0 \cdots d\theta_{t-1}
\right)
d\theta_t
\\[2ex]
&\quad \textcolor{gray}{\text{// Pull indicator } \mathbf{1}_A(\theta_t) \text{ outside as domain of outermost integral becomes } A}
\\[2ex]

&=
\int_A
\left(
\lim_{t \to \infty}
(K^t \mu_0)(\theta_t)
\right)
d\theta_t
\\[2ex]
&\quad \textcolor{gray}{\text{// Recognize the nested integral as repeated application of the Markov operator, } K^t \mu_0 \text{, and exchange limit and integral}}
\\[2ex]

&=
\int_A p(\theta \mid \mathcal{D}) \, d\theta
\\[2ex]
&\quad \textcolor{gray}{
\text{// By ergodic theorem: if } K \text{ is ergodic and satisfies detailed balance w.r.t. } p(\theta \mid \mathcal{D})
\Rightarrow \lim_{t \to \infty} K^t \mu_0 = p(\theta \mid \mathcal{D}) \text{ in distribution}
}
\\[2ex]
\end{aligned}
$$
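The chain of arguments above (detailed balance, stationarity, and convergence of $K^t \mu_0$) can be made concrete on a finite state space, where $K$ is just a matrix. The sketch below uses a hypothetical 4-state target `pi` as a stand-in for $p(\theta \mid \mathcal{D})$ and builds a Metropolis kernel with a uniform proposal; both choices are assumptions made for illustration.

```python
import numpy as np

# Target distribution over 4 states (stand-in for p(theta | D)).
pi = np.array([0.1, 0.2, 0.3, 0.4])
n = len(pi)

# Metropolis kernel: propose one of the other states uniformly,
# accept with probability min(1, pi_j / pi_i).
K = np.zeros((n, n))                 # K[i, j] = probability of moving i -> j
for i in range(n):
    for j in range(n):
        if i != j:
            K[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    K[i, i] = 1.0 - K[i].sum()       # rejected proposals stay at state i

# Detailed balance: pi_i K(i -> j) == pi_j K(j -> i) for all i, j.
flow = pi[:, None] * K
detailed_balance = np.allclose(flow, flow.T)

# Stationarity: applying K to pi returns pi.
stationary = np.allclose(pi @ K, pi)

# Convergence: start from a point mass mu_0 and apply K repeatedly.
mu = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    mu = mu @ K
print(detailed_balance, stationary, mu)
```

After enough steps `mu` matches `pi` regardless of the starting distribution, which is the discrete analogue of $\lim_{t \to \infty} K^t \mu_0 = p(\theta \mid \mathcal{D})$.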

$$
\begin{aligned}
p(\theta \mid \mathcal{D}) 
&= \frac{p(\mathcal{D}, \theta)}{p(\mathcal{D})}
= \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int_{\mathbb{R}^D} p(\mathcal{D} \mid \vartheta)\, p(\vartheta)\, d\vartheta}
= \frac{1}{\int_{\mathbb{R}^D} p(\mathcal{D} \mid \vartheta)\, p(\vartheta)\, d\vartheta} \cdot p(\mathcal{D} \mid \theta)\, p(\theta) \\[10pt]

&= \frac{1}{Z} \cdot p(\mathcal{D} \mid \theta)\, p(\theta)
= \frac{1}{Z} \cdot \exp\left( \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right)
= \frac{1}{Z} \cdot \exp\left( -[-\log p(\mathcal{D} \mid \theta) - \log p(\theta)] \right) \\[10pt]

&= \underbrace{\frac{1}{Z} \cdot \exp\left( -U(\theta) \right)}_{\text{Gibbs (Boltzmann) form}}
\qquad \text{where:} \quad
\begin{cases}
\text{Potential Energy}: U(\theta) := -\log p(\mathcal{D} \mid \theta) - \log p(\theta) \quad \text{(need not be non-negative: densities can exceed 1)}\\[4pt]
\begin{array}{c}
    \text{Partition Function}\\
    \text{Normalization Constant}
\end{array}
: Z := \int_{\mathbb{R}^D} \exp(-U(\vartheta))\, d\vartheta
\end{cases}
\end{aligned}
$$
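A practical consequence of the Gibbs form is that ratios of posterior densities depend only on differences in $U$, so a sampler never needs $Z$. The sketch below illustrates this in one dimension with a hypothetical double-well potential (not a real posterior), where $Z$ is cheap to approximate on a grid.

```python
import numpy as np

def U(theta):
    """Hypothetical double-well potential, standing in for a negative log-posterior."""
    return theta**4 / 4.0 - theta**2 / 2.0

# Approximate the partition function Z with a Riemann sum on a fine grid.
grid = np.linspace(-6.0, 6.0, 20_001)
dx = grid[1] - grid[0]
unnormalized = np.exp(-U(grid))
Z = unnormalized.sum() * dx
density = unnormalized / Z            # normalized Gibbs density on the grid

# The density ratio between two points equals exp(U(b) - U(a)); Z cancels.
a, b = 1.0, 0.0
ratio_from_density = np.interp(a, grid, density) / np.interp(b, grid, density)
ratio_from_energy = np.exp(U(b) - U(a))
print(Z, ratio_from_density, ratio_from_energy)
```

In high dimensions the grid is infeasible, which is why the methods in this section work with $U$ and $\nabla U$ only.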

- Many models in physics (statistical mechanics, thermodynamics, quantum dynamics) relate a potential energy field to a probability distribution
- the posterior distribution can likewise be written as a potential-energy-based particle-diffusion model in latent space
    - this only helps us understand the distribution intuitively; the physical analogy is not necessary
    - we need to find a Markov chain that:
        - has a transition kernel for which this Gibbs distribution is a stationary solution
        - satisfies the previous assumptions
    - some properties that we want or have observed:
        - the lower the energy, the higher the probability (posterior in latent space)
        - the shape of the posterior is complex (multiple modes/peaks) in latent space
        - the kernel needs to work as a compass that guides transitions towards the nearest (depending on the space type) local maximum of probability, i.e. local minimum of potential energy
            - only then can the Markov chain spend more time in the more probable regions and form the correct distribution
        - it also needs some random (stochastic, diffusive) behavior to help explore the whole latent space (which guarantees some of the previous assumptions)

$$
\begin{aligned}
&\underbrace{d\theta_t = -\nabla U(\theta_t)\,dt + \sqrt{2}\,dW_t}_{\substack{\text{Over-Damped Langevin dynamics SDE:} \\ \text{gradient drift + Gaussian noise}}}
\\[1.2em]
&\Rightarrow 
\underbrace{
\frac{\partial \rho(\theta, t)}{\partial t} = \nabla \cdot \left( \nabla U(\theta)\, \rho(\theta, t) + \nabla \rho(\theta, t) \right)
}_{\substack{\text{Fokker–Planck equation:} \\ \text{evolution of density}}}
\\[1.5em]
&\Rightarrow 
\nabla \cdot \left( \nabla U(\theta)\, \rho(\theta) + \nabla \rho(\theta) \right)
= \nabla \cdot \left( \nabla U(\theta) \cdot \tfrac{1}{Z} e^{-U(\theta)} + \nabla \left( \tfrac{1}{Z} e^{-U(\theta)} \right) \right)
= \nabla \cdot \left( \tfrac{1}{Z} e^{-U(\theta)} \nabla U(\theta) - \tfrac{1}{Z} e^{-U(\theta)} \nabla U(\theta) \right)
= \nabla \cdot (0) = 0
\\[1.5em]
&\Leftrightarrow 
\rho(\theta) = \tfrac{1}{Z} \exp(-U(\theta)) \;\text{is stationary under Langevin dynamics}
\end{aligned}
$$
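The stationarity result can be checked by simulating the Langevin SDE with a small Euler–Maruyama step. The sketch below assumes the quadratic potential $U(\theta) = \theta^2/2$, for which $e^{-U}$ is a standard normal; the step size and chain counts are arbitrary choices. Many independent chains are run in parallel and their final states compared against mean 0 and variance 1.

```python
import numpy as np

def grad_U(theta):
    """Gradient of U(theta) = theta^2 / 2, so exp(-U) is a standard normal."""
    return theta

rng = np.random.default_rng(0)
eps = 0.01                          # discretization step (assumed)
n_chains, n_steps = 20_000, 1_000

theta = np.full(n_chains, 3.0)      # every chain starts far from the mode
for _ in range(n_steps):
    noise = rng.normal(size=n_chains)
    # Gradient drift plus Gaussian noise, as in the SDE above.
    theta = theta - eps * grad_U(theta) + np.sqrt(2.0 * eps) * noise

sample_mean, sample_var = theta.mean(), theta.var()
print(sample_mean, sample_var)
```

The small residual error in the variance comes from the finite step size, not from the SDE itself.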

- Time-homogeneous: $U(\theta)$ is fixed for a given posterior
- Ergodicity: (the Fokker-Planck equation)
    - covers full support $\theta \in \mathbb{R}^D$
    - each state is reachable due to diffusion (Brownian term)
    - No periodicity due to stochasticity
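- Detailed Balance (Microscopic Reversibility), writing $p^*(\theta) := p(\theta \mid \mathcal{D})$:
    - The generator of Langevin dynamics, $\mathcal{L} = -\nabla U \cdot \nabla + \Delta$ (the adjoint of the Fokker–Planck operator), is self-adjoint in the weighted space $L^2(p^*)$
    - Equivalently, Langevin dynamics is reversible with respect to $p^*(\theta)$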

- Alternatively:
    - instead of Over-Damped Langevin: Potential + Diffusion (Noise), where friction is high enough to remove momentum
    - we can have:
        - Under-Damped Langevin: Potential + Diffusion + Kinetic (Momentum)
        - Hamiltonian dynamics (deterministic, no noise term): Potential + Kinetic (energy perfectly conserved)
        - Noisy Hamiltonian SDE: Potential + Kinetic + Diffusion (energy not perfectly conserved)
    - Kinetic term => energy is better preserved => inertial motion => better long-range exploration

$$
H(\theta, p) = \underbrace{-\log p(\theta \mid \mathcal{D})}_{\text{Potential } U(\theta)} + \underbrace{\frac{1}{2} p^T M^{-1} p}_{\text{Kinetic } K(p)}
$$
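To see how the kinetic term enables long moves at nearly constant energy, the Hamiltonian above can be integrated numerically. The sketch below uses the leapfrog scheme, which is the usual choice in HMC but is an assumption here since the text does not name an integrator; the potential $U(\theta) = \theta^2/2$, unit mass, and step settings are likewise illustrative.

```python
def U(theta):
    return 0.5 * theta**2            # hypothetical potential

def grad_U(theta):
    return theta

def hamiltonian(theta, p, mass=1.0):
    """H(theta, p) = U(theta) + p^2 / (2 M)."""
    return U(theta) + 0.5 * p**2 / mass

def leapfrog(theta, p, step, n_steps, mass=1.0):
    """Leapfrog integration of d(theta)/dt = p / M, dp/dt = -grad U(theta)."""
    p = p - 0.5 * step * grad_U(theta)          # initial half step in momentum
    for i in range(n_steps):
        theta = theta + step * p / mass         # full step in position
        if i < n_steps - 1:
            p = p - step * grad_U(theta)        # full step in momentum
    p = p - 0.5 * step * grad_U(theta)          # final half step in momentum
    return theta, p

theta0, p0 = 1.0, 0.5
theta1, p1 = leapfrog(theta0, p0, step=0.05, n_steps=200)
energy_drift = abs(hamiltonian(theta1, p1) - hamiltonian(theta0, p0))
print(theta1, p1, energy_drift)
```

The state travels far from its starting point while $H$ changes only slightly, which is what makes such trajectories useful as MCMC proposals.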


#### Weight Update:

$$
\begin{aligned}
&\underbrace{d\theta_t = -\nabla_\theta U(\theta_t)\,dt + \sqrt{2}\,dW_t}_{\text{Langevin SDE (Itô)}} 
\Rightarrow
\underbrace{\theta_{t+1} = \theta_t - \epsilon \nabla_\theta U(\theta_t) + \sqrt{2\epsilon} \, \xi_t}_{\text{Euler–Maruyama discretization} \quad \xi_t \sim \mathcal{N}(0, I)} 
\Rightarrow
\underbrace{q(\theta'|\theta_t) = \mathcal{N}\left(\theta' \mid \theta_t - \epsilon \nabla_\theta U(\theta_t), 2\epsilon I\right)}_{\text{Proposal distribution}} 
\\
&\Rightarrow 
\underbrace{
\alpha(\theta_t, \theta') = \min\left(1, 
\frac{
e^{-U(\theta')} \cdot 
\exp\left(-\frac{1}{4\epsilon} \|\theta_t - \theta' + \epsilon \nabla_\theta U(\theta')\|^2 \right)
}{
e^{-U(\theta_t)} \cdot 
\exp\left(-\frac{1}{4\epsilon} \|\theta' - \theta_t + \epsilon \nabla_\theta U(\theta_t)\|^2 \right)
}
\right)}_{\text{Metropolis–Hastings acceptance prob. (Detailed Balance)}} 
\\
&\Rightarrow
\theta_{t+1} =
\underbrace{
\begin{cases}
\theta', & \text{with probability } \alpha(\theta_t, \theta') \\
\theta_t, & \text{otherwise}
\end{cases}
}_{\text{MALA (Metropolis-Adjusted Langevin Algorithm)}} 
= 
\underbrace{
\begin{cases}
\theta_t - \epsilon \nabla_\theta U(\theta_t) + \sqrt{2\epsilon}\,\xi_t, & \text{if accepted} \\
\theta_t, & \text{otherwise}
\end{cases}
}_{\text{Langevin step with MH correction: samples from } \pi(\theta) \propto e^{-U(\theta)}}
\end{aligned}
$$
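The MALA update can be written in a few lines. The sketch below is a minimal version assuming a 1-D standard-normal target ($U(\theta) = \theta^2/2$) and a fixed step size; the helper names `log_q` and `mala_step` are hypothetical. It runs many independent scalar chains at once so the final states can be compared against the target.

```python
import numpy as np

def U(theta):
    return 0.5 * theta**2

def grad_U(theta):
    return theta

def log_q(to, frm, eps):
    """Log proposal density q(to | frm), up to a constant shared by both directions."""
    return -((to - frm + eps * grad_U(frm)) ** 2) / (4.0 * eps)

def mala_step(theta, eps, rng):
    """One MALA step applied elementwise to an array of independent 1-D chains."""
    proposal = theta - eps * grad_U(theta) + np.sqrt(2.0 * eps) * rng.normal(size=theta.shape)
    # log of [pi(theta') q(theta | theta')] - log of [pi(theta) q(theta' | theta)]
    log_alpha = (-U(proposal) + log_q(theta, proposal, eps)) - (-U(theta) + log_q(proposal, theta, eps))
    accept = np.log(rng.uniform(size=theta.shape)) < log_alpha
    return np.where(accept, proposal, theta), accept.mean()

rng = np.random.default_rng(0)
theta = np.zeros(5_000)              # 5000 independent chains, all started at 0
for _ in range(300):
    theta, accept_rate = mala_step(theta, eps=0.5, rng=rng)
print(theta.mean(), theta.var(), accept_rate)
```

Because of the accept/reject step, a fairly large `eps` can be used without biasing the stationary distribution; the cost is one full evaluation of $U$ and $\nabla U$ per proposal.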

- in pSGLD:
    - we use mini-batch gradients and a preconditioner, and we skip the Metropolis–Hastings correction that would remove the bias introduced by the Euler–Maruyama discretization
        - as long as the step size $\epsilon_t \to 0$ slowly and the preconditioner stabilizes, the sampling bias from ignoring MH can be made small (Li et al., 2016)
    - pSGLD can estimate uncertainty better than SGLD because it incorporates local curvature information (via preconditioning)
        - the injected noise and the gradient step are both scaled according to local curvature, instead of using the isotropic noise of standard SGLD

$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \epsilon \left\{ \underbrace{ - \sum_{i=1}^N \nabla_\theta \log p(y_i|x_i, \theta_t) - \nabla_\theta \log p(\theta_t) }_{ \nabla_\theta U(\theta_t) } \right\} + \sqrt{2\epsilon} \, \xi_t \\
&\xRightarrow{\text{minibatch approx.}} \theta_{t+1} = \theta_t - \epsilon \left\{ \underbrace{ - \frac{N}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_\theta \log p(y_i|x_i, \theta_t) - \nabla_\theta \log p(\theta_t) }_{ \hat{\nabla}_\theta U(\theta_t) \quad \text{stochastic gradient estimate from mini-batch}} \right\} + \sqrt{2\epsilon} \, \xi_t \\
&\xRightarrow{\text{preconditioned}} \theta_{t+1} = \theta_t - \epsilon \underbrace{G(\theta_t)}_{\text{diagonal preconditioning matrix (RMSprop-style)}} \hat{\nabla}_\theta U(\theta_t) + \underbrace{\eta_t}_{\text{noise}}, \quad \eta_t \sim \mathcal{N}(0, 2\epsilon G(\theta_t)) \\
&\quad \text{(Li et al., 2016 write the same update with } \epsilon \to \epsilon/2 \text{: step } \tfrac{\epsilon}{2} G(\theta_t) \hat{\nabla}_\theta U(\theta_t) \text{ and noise } \mathcal{N}(0, \epsilon G(\theta_t)) \text{)}
\end{aligned}
$$
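The pSGLD update can be sketched on a toy problem where the exact posterior is known. The code below assumes a 1-D Bayesian linear regression with known noise scale, an RMSprop-style second-moment estimate `V` for the preconditioner, and hand-picked hyperparameters (`eps`, `alpha`, `lam`, `batch`); all of these are illustrative choices. It uses a step of $\epsilon G$ with noise variance $2 \epsilon G$ and omits the curvature correction term $\Gamma(\theta)$ from Li et al. (2016), as is commonly done in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D Bayesian linear regression: y = theta * x + noise, theta ~ N(0, tau^2).
N, sigma, tau = 200, 0.5, 10.0
x = rng.normal(size=N)
y = 2.0 * x + sigma * rng.normal(size=N)

def grad_U_hat(theta, idx):
    """Mini-batch estimate of grad U, with the likelihood term rescaled by N / |B|."""
    resid = y[idx] - theta * x[idx]
    grad_log_lik = (x[idx] * resid).sum() / sigma**2
    grad_log_prior = -theta / tau**2
    return -(N / len(idx)) * grad_log_lik - grad_log_prior

eps, alpha, lam, batch = 0.01, 0.99, 1e-5, 20   # assumed hyperparameters
theta, V, samples = 0.0, 0.0, []
for t in range(3_000):
    idx = rng.choice(N, size=batch, replace=False)
    g = grad_U_hat(theta, idx)
    V = alpha * V + (1.0 - alpha) * g**2         # RMSprop second-moment estimate
    G = 1.0 / (lam + np.sqrt(V))                 # preconditioner (a scalar in 1-D)
    theta = theta - eps * G * g + np.sqrt(2.0 * eps * G) * rng.normal()
    if t >= 1_000:                               # discard burn-in
        samples.append(theta)

# Exact posterior mean of this conjugate model, for comparison.
exact_mean = (x @ y / sigma**2) / (x @ x / sigma**2 + 1.0 / tau**2)
sample_mean = float(np.mean(samples))
print(sample_mean, exact_mean, float(np.std(samples)))
```

With a fixed step size the samples are only approximately distributed as the posterior; decaying `eps` over time trades speed for lower bias.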


</div>


<div style="font-size:14px">

---
### mean-field VI:

Warning:

- in basic mean-field Variational Inference (VI), e.g. Bayes by Backprop or Flipout:
    - there is no covariance between parameters:
        - so the joint posterior is a product of independent per-weight distributions
        - so it cannot capture the posterior correlations induced jointly by prior × likelihood
    - with Gaussian factors, this usually causes problems such as:
        - No multi-modal behavior
        - No skewness
        - No heavy tails
        - No nonlinear dependencies between weights
        - ❌ which can fall short of the bare minimum needed to model the true posterior (high-dimensional, highly complex, unnormalized, intractable)
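The cost of dropping covariances shows up even when the true posterior is Gaussian. The sketch below assumes a toy 2-D Gaussian "posterior" with correlation `rho` and relies on the standard result that the factorized Gaussian minimizing $\mathrm{KL}(q \,\|\, p)$ has variances $1/\Lambda_{ii}$, where $\Lambda$ is the precision matrix; the numbers are illustrative.

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])          # true posterior covariance (toy)
Lambda = np.linalg.inv(Sigma)           # precision matrix

# For a Gaussian target, the factorized Gaussian q minimizing KL(q || p)
# has per-coordinate variance 1 / Lambda_ii.
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)

# Correlation is lost entirely: q's covariance is diagonal by construction.
mean_field_cov = np.diag(mean_field_var)
print(mean_field_var, true_marginal_var)
```

Here each mean-field variance equals $1 - \rho^2 = 0.19$ instead of the true marginal variance 1, so uncertainty is underestimated in addition to the correlation being dropped.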

- more advanced VI variants do support multi-modal posteriors (at least):
    - Mixture of Gaussians VI
    - Normalizing Flows
    - Adversarial VI (AVB)

- The mean-field assumption has far-reaching consequences beyond posterior approximation; it leaks into the model’s functional behavior:
    - by imposing strong geometric constraints on the latent space, it may group together very different functions (far apart in function/feature space)
        -  ❌ this is a serious problem when sampling from it as a generative model (unnatural in-between regions)
    - in some scenarios, it can act as a regularizer that improves generalization, even though it is fundamentally incorrect

- MCMC preserves probability (the true posterior), but it does not preserve geometry either (local latent space <-> local feature space):
    - interpolation: it likewise cannot guarantee that nearby points in latent space correspond to similar outputs in data/feature space
    - geometry-aware models include:
        - Autoencoder-based models
        - Normalizing Flows
        - Diffusion Models
        - Energy-Based Models + Score-Based Learning
        - Neural ODEs & Continuous Normalizing Flows
        - Metric Learning / Contrastive Learning
    - However, even without a mathematical guarantee, in practice a basic NN architecture still offers some protection: features remain mostly "continuous" in latent space, although with high variance:
        - gradient descent (or pSGLD) keeps weight updates small, most of the time
        - you stay in a "smooth" part of the network function space, most of the time
        - smooth activations reduce the chance of sharp transitions
        - gradient-based optimization encourages local stability
        - you shouldn't take these for granted, though
    - MCMC can work for both parametric (e.g. BNN) and non-parametric (e.g. directly sampling X) models:
        - in parametric models, the prior must be defined (e.g. placing a Gaussian on each neural network parameter), either from a model or empirically<br>
$
\text{Parametric: } U(\theta) = -\log p(\mathcal{D} \mid \theta) - \log p(\theta)\\
\text{Non-Parametric: } U(x) = - \log p(x)\\
$
        - even if we force the prior to be uni-modal, a complex likelihood function can still make the final posterior complex and multi-modal
        - even with a predefined prior, MCMC (e.g. HMC) still has an advantage over other methods, as it faithfully approximates the true posterior given all information
        - we only need to make sure that the ML architecture with the current predefined prior yields a relatively continuous, MCMC-friendly latent space (e.g. connected, non-fragmented, and continuous posterior geometry)
            - in practice, a BNN with a Gaussian, Student-t, Cauchy, ... prior often suffices for this
        - there is ⚠️ no real/faithful way to empirically determine the true prior in parametric models:
            - ⚠️ Empirical Bayes (not fully Bayesian): estimates a "prior" by fitting it to the data, often by maximizing the marginal likelihood:
                - this yields a point estimate of the hyperparameters, effectively collapsing their uncertainty and turning the process into MAP estimation
                - to be Bayesian, the prior parameters must be treated as random variables whose posteriors are approximated as well (which MCMC can do)
            - how to determine the distribution shape of $P(\theta)$? Still empirically:
                - ⚠️ take the weight distribution from a pre-trained non-Bayesian model:
                    - using parts of the training data (k-fold/bootstrapping with noise)
                    - or simply collect statistics over the same kind of weight within one layer (this is worse)
                - ⚠️ run MCMC once with a simple prior, then analyze the posterior marginals over the weights (MCMC trajectory)
            - after empirically defining the shape, use hierarchical priors with inferred hyperparameters to make it Bayesian
                - level 0 (Empirical Bayes): stationary, data-derived $P(\theta)$
                - level 1 (Bayesian model): $P(\theta)$ with prior uncertainty
                - level ≥ 2: hyperprior, as in meta-learning
$$
\phi \sim p(\phi), \quad \theta \sim p(\theta \mid \phi), \quad \mathcal{D} \sim p(\mathcal{D} \mid \theta)
$$
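The hierarchical structure above reads directly as a sampling recipe: draw the hyperparameter, then the weights given it, then the data given the weights. The sketch below assumes a half-normal hyperprior on the prior scale, Gaussian weights, and a linear-Gaussian likelihood; these distributions and the function name `sample_hierarchy` are illustrative choices only.

```python
import numpy as np

def sample_hierarchy(n_weights, n_data, rng):
    """Ancestral sampling: phi ~ p(phi), theta ~ p(theta | phi), D ~ p(D | theta)."""
    # Hyperprior (assumed): prior scale phi ~ HalfNormal(1).
    phi = abs(rng.normal(0.0, 1.0))
    # Prior: weights theta ~ N(0, phi^2 I).
    theta = rng.normal(0.0, phi, size=n_weights)
    # Likelihood (assumed linear-Gaussian): y = X theta + noise.
    X = rng.normal(size=(n_data, n_weights))
    y = X @ theta + 0.1 * rng.normal(size=n_data)
    return phi, theta, (X, y)

rng = np.random.default_rng(0)
phi, theta, (X, y) = sample_hierarchy(n_weights=4, n_data=10, rng=rng)
print(phi, theta.shape, y.shape)
```

Inference runs in the opposite direction: given $\mathcal{D}$, a sampler such as HMC would target the joint posterior over $(\phi, \theta)$ rather than fixing $\phi$ to a point estimate.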
