# PHASE 9 — Mathematical Physics of Deep Learning (100 Problems)

---

## Module 1 — Energy Landscapes & Thermodynamics (1–15)

1. Show that gradient descent with a sufficiently small step size monotonically decreases an energy functional $$E(\theta).$$
2. Derive the energy dissipation identity along the gradient flow $$\dot{\theta} = -\nabla E$$: $$\frac{dE}{dt} = -\|\nabla E\|^2.$$
3. Compute stationary points of an energy landscape $$E(x,y)=x^4-3x^2+y^2.$$
4. Define basins of attraction for a multi-well potential.
5. Show connection between potential wells and local minima.
6. Derive Gibbs distribution $$p(x) \propto e^{-E(x)/T}.$$
7. Evaluate partition function $$Z = \int e^{-E(x)}\, dx$$ (taking $$T=1$$) for $$E(x)=x^2.$$
8. Compute free energy $$F = -T\log Z$$ for Gaussian potential.
9. Derive entropy $$S = -\int p(x)\log p(x)\,dx.$$
10. Compute free energy gradient for variational distribution $$q(x).$$
11. Illustrate high-dimensional “needle in a haystack” phenomenon.
12. Show that deep networks’ loss surface has many flat minima.
13. Prove that sharp minima correspond to large Hessian eigenvalues.
14. Compare flat vs sharp minima using Hessian spectrum.
15. Plot energy landscape evolution during training.
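
A minimal sketch for Problems 1 to 3, assuming plain gradient descent with a fixed step size on the landscape $$E(x,y)=x^4-3x^2+y^2.$$ The step size, the starting point, and the function names are illustrative choices, not part of the problem statements. Setting $$\nabla E = 0$$ gives stationary points at $$(0,0)$$ and $$(\pm\sqrt{3/2},\,0),$$ and the code checks that the iterates settle in one of the two wells while the energy never increases.

```python
import math


def energy(x, y):
    """Energy landscape from Problem 3."""
    return x**4 - 3 * x**2 + y**2


def grad(x, y):
    """Gradient of the energy: (dE/dx, dE/dy)."""
    return 4 * x**3 - 6 * x, 2 * y


def gradient_descent(x, y, step=0.01, n_steps=2000):
    """Fixed-step gradient descent; records the energy after every update."""
    energies = [energy(x, y)]
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
        energies.append(energy(x, y))
    return x, y, energies


# Illustrative starting point in the basin of the right-hand well.
x_final, y_final, energies = gradient_descent(0.5, 1.0)

# The right-hand minimum sits at x = sqrt(3/2), y = 0 with E = -9/4.
print("final point:", x_final, y_final)
print("target x:", math.sqrt(1.5), "final energy:", energies[-1])
```

Starting from a negative `x` instead lands in the mirror-image well, which is one way to probe the basins of attraction asked for in Problem 4.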

---

## Module 2 — Statistical Mechanics of Learning (16–30)

16. Define microscopic (parameters) vs macroscopic (loss/accuracy) variables.
17. Derive Boltzmann distribution for model weights under noise.
18. Show SGD $$\approx$$ Langevin dynamics: $$d\theta = -\nabla L\, dt + \sqrt{2T}\, dW_t.$$
19. Derive Fokker–Planck equation for weight distribution.
20. Solve Fokker–Planck for quadratic loss.
21. Compute stationary density for stochastic gradient Langevin dynamics.
22. Compare deterministic GD vs stochastic thermally driven dynamics.
23. Analyze escape time from local minimum using Kramers’ rate.
24. Derive large deviations rate for escape from energy well.
25. Analyze SGD noise scaling with batch size.
26. Derive effective temperature $$T \propto \eta \frac{\sigma^2}{B}.$$
27. Show temperature controls exploration vs convergence.
28. Analyze phase transition at critical batch size.
29. Compute equilibrium distribution for 1D learning rule.
30. Compare SGD trajectory to diffusion in potential wells.
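
A rough numerical sketch for Problems 18 to 21, assuming the quadratic loss $$L(\theta)=\tfrac{1}{2}k\theta^2$$ and an Euler–Maruyama discretization of $$d\theta = -\nabla L\, dt + \sqrt{2T}\, dW_t.$$ The values of `k`, `temperature`, `dt`, and the seed are illustrative assumptions. The stationary density $$p(\theta)\propto e^{-L(\theta)/T}$$ is then Gaussian with variance $$T/k,$$ which the long-run sample variance should approach up to discretization and sampling error.

```python
import math
import random


def simulate_langevin(k=1.0, temperature=0.5, dt=0.02,
                      n_steps=400_000, burn_in=1_000, seed=0):
    """Euler-Maruyama for d(theta) = -k*theta dt + sqrt(2T) dW_t.

    Returns the time-averaged mean and variance after a burn-in period.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    theta = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(n_steps):
        # Drift from the quadratic loss plus a Gaussian noise increment.
        theta += -k * theta * dt + noise_scale * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += theta
            total_sq += theta * theta
            count += 1
    mean = total / count
    return mean, total_sq / count - mean * mean


k, temperature = 1.0, 0.5
mean, variance = simulate_langevin(k=k, temperature=temperature)

# Variance of the Gibbs density proportional to exp(-L(theta)/T).
gibbs_variance = temperature / k
print("sample mean:", mean)
print("sample variance:", variance, "Gibbs prediction:", gibbs_variance)
```

Raising `temperature` widens the stationary density, which is the exploration versus convergence trade-off in Problem 27. A finite `dt` slightly inflates the sampled variance relative to $$T/k.$$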

---

## Module 3 — Mean-Field Theory & Neural Tangent Kernel (31–45)

31. Derive mean-field limit of neural network output as width $$\to \infty.$$
32. Compute covariance kernel of random initialization.
33. Define NTK $$K(x,x')$$ for 2-layer network.
34. Show NTK remains constant during training in infinite width.
35. Compute prediction dynamics $$\frac{df_t}{dt} = -K(t)\,\nabla_{f} L(f_t).$$
36. Derive linearized training dynamics around initialization.
37. Compute NTK Gram matrix for sample dataset.
38. Show convergence under positive-definite NTK.
39. Compare NTK vs GP view of infinite-width networks.
40. Compute signal propagation through infinite-depth network.
41. Show that the pre-activation distribution becomes Gaussian via the CLT as width grows.
42. Derive recurrence for variance preservation across layers.
43. Analyze exploding/vanishing signal via Jacobian.
44. Show critical initialization condition for stable propagation.
45. Compute mean-field dynamics of gradient descent in a one-hidden-layer network.
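
A small sketch for Problems 35 and 38, assuming a fixed (infinite-width style) kernel Gram matrix and squared loss $$L=\tfrac{1}{2}\|f-y\|^2$$ on the training points, so that the function-space update is $$f_{t+1} = f_t - \eta K (f_t - y).$$ The 2×2 matrix `kernel`, the targets, and the learning rate are made-up illustrative values, not an NTK computed from a real network. With $$K$$ positive definite and $$\eta$$ below $$2/\lambda_{\max}(K),$$ every residual mode contracts and the predictions converge to the targets.

```python
def mat_vec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def kernel_gradient_descent(kernel, targets, f_init, learning_rate, n_steps):
    """Iterate f <- f - lr * K (f - y) and record the residual norm per step."""
    f = list(f_init)
    residual_norms = []
    for _ in range(n_steps):
        residual = [f_i - y_i for f_i, y_i in zip(f, targets)]
        residual_norms.append(sum(r * r for r in residual) ** 0.5)
        update = mat_vec(kernel, residual)
        f = [f_i - learning_rate * u_i for f_i, u_i in zip(f, update)]
    return f, residual_norms


# Hypothetical positive-definite Gram matrix with eigenvalues 1 and 3.
kernel = [[2.0, 1.0], [1.0, 2.0]]
targets = [1.0, 0.0]

f_final, residual_norms = kernel_gradient_descent(
    kernel, targets, f_init=[0.0, 0.0], learning_rate=0.1, n_steps=200
)
print("final predictions:", f_final)
print("first and last residual norm:", residual_norms[0], residual_norms[-1])
```

The slowest mode decays by a factor $$1-\eta\lambda_{\min}(K)$$ per step, so a nearly singular Gram matrix shows up directly as slow convergence.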

---

## Module 4 — Information Theory & Generalization (46–60)

46. Compute mutual information $$I(X;Y)$$ for binary channel.
47. Derive entropy of Gaussian $$X \sim \mathcal{N}(0,\sigma^2).$$
48. Compute KL divergence $$D_{KL}(p\|q)$$ for two Gaussians.
49. Derive PAC-Bayes generalization bound.
50. Compute PAC-Bayes bound for small variance prior.
51. Analyze information bottleneck Lagrangian.
52. Derive IB objective: $$L = I(X;Z) - \beta\, I(Z;Y).$$
53. Compute gradient of the IB objective.
54. Estimate mutual information via Monte Carlo.
55. Show flat minima correspond to low-information weights.
56. Relate Hessian trace to effective information capacity.
57. Compute complexity penalty via log-det Hessian.
58. Derive “sharpness” metric for a model.
59. Compare generalization in flat vs sharp minima.
60. Argue how compression (implicit SGD regularization) can improve generalization, and state the assumptions required.
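
A short sketch for Problems 47 and 48, using the standard closed forms for one-dimensional Gaussians, $$H = \tfrac{1}{2}\log(2\pi e\sigma^2)$$ and $$D_{KL}(p\|q) = \log\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p-\mu_q)^2}{2\sigma_q^2} - \frac{1}{2},$$ and checking them against a simple trapezoid-rule integral. The parameter values, the integration range, and the helper names are illustrative assumptions.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy (in nats) of N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) for two 1D Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q) ** 2) / (2 * sigma_q**2)
            - 0.5)


def gaussian_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def trapezoid(f, lo, hi, n):
    """Composite trapezoid rule with n equal subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


mu_p, sigma_p, mu_q, sigma_q = 0.0, 1.0, 1.0, 2.0


def kl_integrand(x):
    log_p = gaussian_log_pdf(x, mu_p, sigma_p)
    return math.exp(log_p) * (log_p - gaussian_log_pdf(x, mu_q, sigma_q))


def entropy_integrand(x):
    log_p = gaussian_log_pdf(x, mu_p, sigma_p)
    return -math.exp(log_p) * log_p


kl_closed = gaussian_kl(mu_p, sigma_p, mu_q, sigma_q)
kl_numeric = trapezoid(kl_integrand, -12.0, 12.0, 4000)
entropy_closed = gaussian_entropy(sigma_p)
entropy_numeric = trapezoid(entropy_integrand, -12.0, 12.0, 4000)

print("KL closed form vs numeric:", kl_closed, kl_numeric)
print("entropy closed form vs numeric:", entropy_closed, entropy_numeric)
```

The same closed-form KL term is the complexity penalty that appears in Gaussian PAC-Bayes bounds (Problems 49 and 50).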

---

## Module 5 — Neural Dynamics as Differential Equations (61–75)

61. Show continuous-time gradient flow ODE: $$\dot{\theta}=-\nabla L.$$
62. Solve gradient flow for quadratic loss.
63. Derive stability condition for linear ODE $$\dot{x}=Ax.$$
64. Apply stability analysis to weight updates.
65. Model deep network as composition of flows.
66. Show residual network approximates ODE: $$\dot{x}=f(x,t).$$
67. Derive backpropagation as adjoint differential equation.
68. Compute adjoint dynamics for scalar ODE.
69. Derive adjoint equation for neural ODE block.
70. Compute numerical solution using Euler discretization.
71. Compare Euler vs RK4 for feature dynamics.
72. Compute Lipschitz constant of a residual block.
73. Analyze stability of deep networks under Lipschitz constraint.
74. Derive continuous depth parameterization in Neural ODEs.
75. Compute flow map $$x(T)=\Phi_T(x(0))$$ for simple vector field.
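
A minimal sketch for Problems 70, 71, and 75, assuming the scalar vector field $$\dot{x} = -x,$$ whose flow map is $$\Phi_T(x(0)) = e^{-T}x(0).$$ The step count and function names are illustrative. The Euler update $$x_{k+1} = x_k + h\,f(x_k, t_k)$$ has the same form as a residual block, which is the link used in Problem 66.

```python
import math


def euler_step(f, x, t, h):
    """One explicit Euler step: x + h * f(x, t)."""
    return x + h * f(x, t)


def rk4_step(f, x, t, h):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, x0, t_end, n_steps):
    """Apply a one-step method n_steps times on [0, t_end]."""
    h = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = step(f, x, t, h)
        t += h
    return x


def decay(x, t):
    """Vector field of the test ODE x' = -x."""
    return -x


exact = math.exp(-1.0)  # flow map at T = 1 applied to x(0) = 1
x_euler = integrate(euler_step, decay, 1.0, 1.0, 10)
x_rk4 = integrate(rk4_step, decay, 1.0, 1.0, 10)

euler_error = abs(x_euler - exact)
rk4_error = abs(x_rk4 - exact)
print("Euler error:", euler_error)
print("RK4 error:", rk4_error)
```

With the same ten steps, the RK4 error is several orders of magnitude smaller than the Euler error, at the cost of four vector-field evaluations per step.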

---

## Module 6 — Diffusion Models & Stochastic Processes (76–90)

76. Define forward diffusion SDE: $$dx = \sqrt{\beta_t}\, dW_t.$$
77. Derive reverse-time SDE from score function.
78. Compute score $$\nabla_x \log p_t(x)$$ for Gaussian.
79. Derive denoising score matching objective.
80. Compute KL divergence between diffusion transitions.
81. Derive the DDPM training loss implied by its forward process.
82. Compute analytic reverse kernel for Gaussian forward noise.
83. Show connection between score matching and denoising.
84. Simulate 1D diffusion forward process.
85. Simulate reverse denoising step.
86. Derive continuous-time diffusion ODE.
87. Show the connection between the diffusion ODE and the probability flow ODE.
88. Compute the Jacobian trace for a CNF using the Hutchinson estimator.
89. Derive ELBO for diffusion model.
90. Compare diffusion vs energy-based vs flow models.
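
A rough sketch for Problems 76, 78, and 84, assuming a constant noise rate $$\beta_t = \beta$$ and a Gaussian initial condition $$x_0 \sim \mathcal{N}(0, s^2).$$ Both are simplifying assumptions made here, as are the particle count and seed. Under $$dx = \sqrt{\beta}\, dW_t$$ the marginal is $$p_t = \mathcal{N}(0,\, s^2 + \beta t),$$ so the score is $$\nabla_x \log p_t(x) = -x/(s^2+\beta t).$$ The code simulates the forward process and checks the score against a finite difference of the log density.

```python
import math
import random


def simulate_forward(x0_std, beta, t_end, n_steps, n_particles, seed=0):
    """Simulate dx = sqrt(beta) dW_t for independent particles in 1D."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    step_std = math.sqrt(beta * dt)
    samples = []
    for _particle in range(n_particles):
        x = rng.gauss(0.0, x0_std)  # draw x_0 from the initial Gaussian
        for _step in range(n_steps):
            x += step_std * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def gaussian_log_density(x, variance):
    return -0.5 * math.log(2 * math.pi * variance) - x * x / (2 * variance)


def gaussian_score(x, variance):
    """Score of N(0, variance): gradient of the log density in x."""
    return -x / variance


x0_std, beta, t_end = 1.0, 1.0, 1.0
variance_t = x0_std**2 + beta * t_end  # analytic marginal variance at t_end

samples = simulate_forward(x0_std, beta, t_end, n_steps=50, n_particles=5000)
sample_mean = sum(samples) / len(samples)
sample_variance = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

# Central finite difference of log p_t as a check on the analytic score.
x, eps = 0.7, 1e-5
fd_score = (gaussian_log_density(x + eps, variance_t)
            - gaussian_log_density(x - eps, variance_t)) / (2 * eps)

print("sample variance:", sample_variance, "analytic:", variance_t)
print("finite-difference score:", fd_score, "analytic:", gaussian_score(x, variance_t))
```

A variance-preserving forward process, as used in DDPM, would add a drift term to the SDE; the score formula keeps the same Gaussian form with a different mean and variance.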

---

## Module 7 — Geometry of Attention & Transformer Dynamics (91–100)

91. Compute Jacobian of attention output w.r.t. query vector.
92. Analyze stability of multi-head attention via spectral norm.
93. Derive Lipschitz condition for attention layer.
94. Compute Fisher information for attention weights.
95. Analyze curvature (Hessian) of attention logits.
96. Derive capacity scaling law for attention depth.
97. Show residual stream forms a dynamical system.
98. Compute energy function for attention update.
99. Show how layer normalization stabilizes gradient dynamics.
100. Interpret a Transformer in the limit depth $$\to \infty$$ as a continuous dynamical flow model.
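
A minimal sketch for Problem 91, assuming single-head scaled dot-product attention with one query, no masking, and output $$o = \sum_i p_i v_i$$ where $$p = \mathrm{softmax}(Kq/\sqrt{d}).$$ Differentiating through the softmax gives $$\frac{\partial o}{\partial q} = \frac{1}{\sqrt{d}} \sum_i p_i\,(v_i - o)\,k_i^\top.$$ The toy keys, values, and query below are made-up numbers, and the code compares the analytic Jacobian with central finite differences.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    d = len(q)
    scores = [sum(k_b * q_b for k_b, q_b in zip(k, q)) / math.sqrt(d) for k in keys]
    p = softmax(scores)
    out = [sum(p_i * v[a] for p_i, v in zip(p, values)) for a in range(len(values[0]))]
    return out, p


def attention_jacobian(q, keys, values):
    """Analytic Jacobian d(out)/d(q): (1/sqrt(d)) * sum_i p_i (v_i - out) k_i^T."""
    d = len(q)
    out, p = attention(q, keys, values)
    jac = [[0.0] * d for _ in range(len(out))]
    for p_i, k, v in zip(p, keys, values):
        for a in range(len(out)):
            for b in range(d):
                jac[a][b] += p_i * (v[a] - out[a]) * k[b] / math.sqrt(d)
    return jac


def finite_difference_jacobian(q, keys, values, eps=1e-6):
    """Central-difference estimate of the same Jacobian."""
    d = len(q)
    n_out = len(values[0])
    jac = [[0.0] * d for _ in range(n_out)]
    for b in range(d):
        q_plus, q_minus = list(q), list(q)
        q_plus[b] += eps
        q_minus[b] -= eps
        out_plus, _ = attention(q_plus, keys, values)
        out_minus, _ = attention(q_minus, keys, values)
        for a in range(n_out):
            jac[a][b] = (out_plus[a] - out_minus[a]) / (2 * eps)
    return jac


q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

analytic = attention_jacobian(q, keys, values)
numeric = finite_difference_jacobian(q, keys, values)
max_err = max(abs(analytic[a][b] - numeric[a][b]) for a in range(2) for b in range(2))
print("max abs difference:", max_err)
```

The Jacobian is a weighted covariance between values and keys under the attention distribution, so its spectral norm, and with it the local Lipschitz behaviour in Problems 92 and 93, grows with the spread of the keys and values.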
