
# Phase 10 — Mathematical Foundations of Generalization, Scaling Laws, and Large Language Models (100 Problems)

---

## Module 1 — Classical Generalization Theory (VC/Rademacher/Uniform Convergence) (1–10)

1. Prove a VC generalization bound for binary classifiers with VC-dimension $$d.$$
2. Compute VC-dimension of intervals on $$\mathbb{R}.$$
3. Compute VC-dimension of linear separators in $$\mathbb{R}^d.$$
4. Derive empirical Rademacher complexity for linear class with $$\|w\|\le B.$$
5. Bound generalization gap using Rademacher complexity for Lipschitz loss.
6. Show uniform convergence bound for finite hypothesis class with $$|H|=N.$$
7. Prove Massart’s finite class lemma and apply it to linear separators restricted to a finite sample.
8. Derive margin-based generalization bound for SVM with hinge loss.
9. Show relation between covering numbers and Rademacher complexity.
10. Compare bounds from VC vs Rademacher on a synthetic dataset.
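
For Problem 2, a brute-force shattering check can serve as a sanity test before writing the proof. The sketch below assumes the hypothesis class is indicators of closed intervals $$[a,b]$$ and that the sample points are distinct; the function names `interval_labelings` and `is_shattered` are illustrative, not from any library.

```python
def interval_labelings(points):
    """Return every 0/1 labeling of `points` realizable by a closed interval [a, b].

    Assumption: it suffices to try endpoints at the sample points themselves,
    plus the empty interval, since only membership of the sample matters.
    """
    pts = sorted(points)
    labelings = {tuple(0 for _ in points)}  # empty interval labels everything 0
    for a in pts:
        for b in pts:
            if a <= b:
                labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings


def is_shattered(points):
    """A set is shattered when all 2^m labelings are realizable."""
    return len(interval_labelings(points)) == 2 ** len(points)


print(is_shattered([0.0, 1.0]))       # True: two points can be shattered
print(is_shattered([0.0, 1.0, 2.0]))  # False: labeling (1, 0, 1) is unreachable
```

The check is consistent with a VC-dimension of 2 for intervals, but it only tests specific point sets; the proof still has to argue that no three points can be shattered.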

---

## Module 2 — PAC-Bayes & Algorithmic Stability (11–20)

11. Derive a PAC-Bayes bound with KL divergence $$KL(Q\|P).$$
12. Apply PAC-Bayes bound to Gaussian prior/posterior over linear models.
13. Optimize PAC-Bayes objective over posterior variance.
14. Prove algorithmic stability bound for ERM with strongly convex loss.
15. Derive stability of ridge regression and its generalization bound.
16. Show stability of SGD under Lipschitz-smooth, strongly convex objective.
17. Compare PAC-Bayes vs stability bounds on the same model.
18. Relate flat minima (Hessian trace) to PAC-Bayes via $$\log\det$$ terms.
19. Compute a data-dependent PAC-Bayes bound with empirical Fisher.
20. Tighten bounds using localized Rademacher complexity.
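
For Problems 11 and 12, it can help to evaluate a bound numerically. The sketch below assumes isotropic Gaussians, prior $$P=\mathcal{N}(0,\sigma^2 I)$$ and posterior $$Q=\mathcal{N}(w,s^2 I)$$, and uses the McAllester-style form $$\hat R(Q)+\sqrt{(KL(Q\|P)+\ln(2\sqrt{n}/\delta))/(2n)}$$, which is one common variant rather than the only one. The function names and the example numbers are illustrative.

```python
import math


def kl_isotropic_gaussians(w, s2, sigma2):
    """KL( N(w, s2*I) || N(0, sigma2*I) ) in nats, for a mean vector w."""
    d = len(w)
    sq_norm = sum(x * x for x in w)
    return 0.5 * (d * s2 / sigma2 + sq_norm / sigma2 - d + d * math.log(sigma2 / s2))


def pac_bayes_bound(emp_risk, kl, n, delta):
    """McAllester-style bound on the expected risk of the Gibbs classifier.

    Assumption: loss in [0, 1], i.i.d. sample of size n, confidence 1 - delta.
    """
    return emp_risk + math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))


# Hypothetical setting: 10-dimensional linear model, 10,000 samples.
w = [0.5] * 10
kl = kl_isotropic_gaussians(w, s2=0.1, sigma2=1.0)
print(kl, pac_bayes_bound(emp_risk=0.1, kl=kl, n=10_000, delta=0.05))
```

Sweeping `s2` in this sketch is one way to start on Problem 13, although the empirical risk of the posterior also changes with the variance and has to be re-estimated for each value.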

---

## Module 3 — Double Descent, Implicit Bias & Overparameterization (21–30)

21. Simulate risk vs width to exhibit double descent.
22. Show the interpolation threshold in least squares: zero training error becomes attainable once $$n\le d$$ (full-rank design).
23. Prove minimum-norm interpolating solution for underdetermined linear regression.
24. Derive gradient flow solution and its implicit regularization.
25. Show GD converges in direction to the max-margin classifier in separable logistic regression.
26. Analyze effect of label noise on double descent.
27. Compute bias–variance decomposition pre/post interpolation.
28. Compare early stopping vs explicit $$\ell_2$$ regularization.
29. Relate flatness (small Hessian eigenvalues) to generalization empirically.
30. Show role of weight decay as Tikhonov regularization.
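
For Problem 23, a numerical check of the minimum-norm interpolator can guide the proof. The sketch below assumes a random Gaussian design with $$n<d$$ (so $$X$$ has full row rank almost surely) and uses the pseudoinverse; `min_norm_interpolator` is an illustrative helper name.

```python
import numpy as np


def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X w = y via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X) @ y


rng = np.random.default_rng(0)
n, d = 5, 20                      # underdetermined: fewer samples than features
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_min = min_norm_interpolator(X, y)

# Build a second interpolator by adding a component from the null space of X.
v = rng.normal(size=d)
null_part = v - np.linalg.pinv(X) @ (X @ v)   # projection of v onto null(X)
w_other = w_min + null_part

print(np.linalg.norm(X @ w_min - y))    # close to 0: w_min interpolates
print(np.linalg.norm(X @ w_other - y))  # close to 0: w_other interpolates too
print(np.linalg.norm(w_min), np.linalg.norm(w_other))  # w_min has the smaller norm
```

Because `w_min` lies in the row space of `X` and `null_part` is orthogonal to it, the squared norms add, which is the geometric core of the argument.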

---

## Module 4 — Scaling Laws & Data–Model–Compute Tradeoffs (31–40)

31. Fit power-law $$L(N)=aN^{-b}+c$$ to loss vs data size $$N.$$
32. Estimate scaling exponent for loss vs parameter count $$P.$$
33. Joint scaling: fit loss vs $$(P,D,C)$$ with log–log regression.
34. Optimize compute allocation between data and parameters under a budget.
35. Derive optimal training tokens for fixed model size.
36. Compare scaling exponents across architectures (MLP vs Transformer).
37. Extrapolate loss at $$10\times$$ compute using fitted scaling law.
38. Quantify irreducible loss floor $$c$$ and its effect on returns to scale.
39. Analyze sensitivity of exponents to tokenizer vocabulary size.
40. Compute Pareto frontier of (loss, compute) for several models.
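
For Problem 31, one simple way to fit $$L(N)=aN^{-b}+c$$ without a nonlinear optimizer is to scan candidate values of $$c$$ and, for each, run a log-log linear regression for $$(a,b)$$. The sketch below takes that route on synthetic, noise-free data; the grid, the data, and the name `fit_power_law` are assumptions for illustration, and a nonlinear least-squares fit would be a reasonable alternative.

```python
import numpy as np


def fit_power_law(N, L, c_grid):
    """Fit L ~ a * N**(-b) + c by scanning c and regressing log(L - c) on log(N)."""
    N = np.asarray(N, dtype=float)
    L = np.asarray(L, dtype=float)
    best = None
    for c in c_grid:
        if np.any(L - c <= 0):          # log undefined, skip this candidate floor
            continue
        slope, intercept = np.polyfit(np.log(N), np.log(L - c), 1)
        a, b = np.exp(intercept), -slope
        sse = np.sum((a * N ** (-b) + c - L) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic losses generated from known parameters (a=50, b=0.5, c=1.5).
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
L = 50.0 * N ** (-0.5) + 1.5
a_hat, b_hat, c_hat = fit_power_law(N, L, np.linspace(0.0, 1.59, 160))
print(a_hat, b_hat, c_hat)
```

With noisy measurements the recovered floor $$c$$ is typically the least stable of the three parameters, which is relevant to Problems 37 and 38.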

---

## Module 5 — Information Theory of Sequence Modeling (41–50)

41. Show $$\text{Perplexity} = \exp(H)$$ where $$H$$ is the cross-entropy per token in nats.
42. Estimate entropy rate of a text source from samples.
43. Compute mutual information $$I(X_{1:t-1};X_t)$$ in a Markov model.
44. Derive cross-entropy gap between model and data distribution.
45. Connect KL divergence to excess risk in language modeling.
46. Work through the bits-back argument for compression with generative models.
47. Analyze effect of context length on conditional entropy.
48. Show diminishing returns of longer context via mutual information decay.
49. Compare tokenization schemes by induced entropy per subword.
50. Derive relation between calibration error and log-likelihood.
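
For Problem 41, the identity can be checked on a toy sequence. The sketch below assumes the model assigns the listed probabilities to the observed tokens and measures cross-entropy in nats; the probabilities are made up for illustration.

```python
import math


def cross_entropy_nats(token_probs):
    """Average negative log-probability (nats per token) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity defined as exp of the per-token cross-entropy in nats."""
    return math.exp(cross_entropy_nats(token_probs))


probs = [0.5, 0.25, 0.125, 0.125]   # hypothetical model probabilities
H = cross_entropy_nats(probs)
ppl = perplexity(probs)

# Equivalent view: geometric mean of the inverse token probabilities.
inv_geo_mean = math.prod(1.0 / p for p in probs) ** (1.0 / len(probs))
print(H, ppl, inv_geo_mean)          # ppl and inv_geo_mean agree (2 ** 2.25)
```

If the cross-entropy is reported in bits instead, the same quantity is $$2^{H}$$, so the base of the logarithm has to match the base of the exponential.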

---

## Module 6 — Transformers: Capacity, Expressivity, and Training Dynamics (51–65)

51. Prove a universal approximation result for attention on finite sequences.
52. Bound Lipschitz constant of a Transformer block (attn + MLP + residual).
53. Compute spectral norm bound for multi-head attention weight matrices.
54. Derive gradient of attention logits and analyze saturation.
55. Show softmax temperature’s effect on gradient scale.
56. Analyze depth vs width tradeoff for expressivity with residual connections.
57. Derive conditions preventing attention collapse (all mass to one token).
58. Compute Jacobian conditioning across stacked layers.
59. Analyze layer norm’s effect on gradient flow.
60. Derive learning rate scaling with model width.
61. Show linearized training dynamics (NTK) for a shallow Transformer.
62. Compare pre-LN vs post-LN blocks for stability (theoretical criterion).
63. Prove convergence of training under smoothness/Lipschitz assumptions.
64. Evaluate curvature (HVP) along training and relate to step size.
65. Analyze gradient noise scale vs batch size during pretraining.
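
For Problems 54 and 55, the Jacobian of $$\mathrm{softmax}(z/T)$$ with respect to the logits is $$\tfrac{1}{T}\left(\mathrm{diag}(p)-pp^\top\right)$$. The sketch below checks that formula against central finite differences and prints the Jacobian norm at a few temperatures; the logits and function names are illustrative assumptions.

```python
import numpy as np


def softmax(z, T=1.0):
    """Numerically stable softmax of z / T."""
    s = (z - np.max(z)) / T
    e = np.exp(s)
    return e / e.sum()


def softmax_jacobian(z, T=1.0):
    """Analytic Jacobian d softmax(z/T) / dz = (diag(p) - p p^T) / T."""
    p = softmax(z, T)
    return (np.diag(p) - np.outer(p, p)) / T


def numerical_jacobian(z, T=1.0, eps=1e-6):
    """Central finite-difference Jacobian, used only as a check."""
    J = np.zeros((z.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (softmax(z + dz, T) - softmax(z - dz, T)) / (2 * eps)
    return J


z = np.array([2.0, 0.5, -1.0])      # hypothetical attention logits
for T in (0.05, 1.0, 100.0):
    print(T, np.linalg.norm(softmax_jacobian(z, T)))
```

In this example the norm is small at both extremes: at low temperature the distribution saturates toward one-hot, and at high temperature the $$1/T$$ factor dominates.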

---

## Module 7 — Pretraining, Fine-Tuning, In-Context Learning & Emergence (66–80)

66. Model pretraining as risk minimization on mixture distributions.
67. Quantify transfer improvement from pretraining via linear probe accuracy.
68. Derive effective learning rate change under LoRA adaptation.
69. Analyze rank-constrained updates and generalization.
70. Compute forgetting in continual fine-tuning (Fisher overlap).
71. Derive meta-gradient for in-context learning in linear models.
72. Show emergent linear regression in attention with key–value design.
73. Quantify emergence thresholds for new capabilities via scaling.
74. Analyze grokking as phase transition in train vs test loss dynamics.
75. Derive conditions for few-shot generalization with tokenizer bias.
76. Measure calibration shift after instruction tuning.
77. Compare RLHF objective with KL-regularized policy optimization.
78. Derive optimal KL penalty to preserve pretraining distribution.
79. Analyze safety–helpfulness tradeoff via multi-objective optimization.
80. Quantify in-context learning capacity vs context window length.
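
For Problems 77 and 78, the KL-regularized objective $$\mathbb{E}_{\pi}[r]-\beta\,KL(\pi\|\pi_{\text{ref}})$$ over a finite set of outputs has the closed-form maximizer $$\pi^*(y)\propto \pi_{\text{ref}}(y)\exp(r(y)/\beta)$$ with optimal value $$\beta\log Z$$. The sketch below verifies this on a three-outcome toy problem; the reference policy, rewards, and $$\beta$$ are made-up values, and the helper names are illustrative.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_objective(policy, ref, rewards, beta):
    """Expected reward minus beta times KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, ref)


def optimal_policy(ref, rewards, beta):
    """Closed-form maximizer: ref reweighted by exp(reward / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    Z = sum(weights)
    return [w / Z for w in weights], Z


def random_policy(k):
    """Draw an arbitrary distribution over k outcomes for comparison."""
    raw = [random.random() for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


ref = [0.5, 0.3, 0.2]        # hypothetical reference (pretrained) policy
rewards = [0.0, 1.0, 2.0]    # hypothetical reward per outcome
beta = 0.5
pi_star, Z = optimal_policy(ref, rewards, beta)
best_value = kl_regularized_objective(pi_star, ref, rewards, beta)

random.seed(0)
trial_values = [
    kl_regularized_objective(random_policy(3), ref, rewards, beta) for _ in range(200)
]
print(best_value, beta * math.log(Z), max(trial_values))
```

Increasing `beta` pulls `pi_star` back toward `ref`, which is the tradeoff Problem 78 asks to optimize.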

---

## Module 8 — Evaluation, Calibration, Uncertainty & Robustness (81–100)

81. Compute Brier score and relate to negative log-likelihood.
82. Estimate expected calibration error (ECE) with binning.
83. Optimize temperature scaling to minimize NLL on a validation set.
84. Derive ensemble log-likelihood improvement vs single model.
85. Compute epistemic vs aleatoric uncertainty decomposition.
86. Analyze OOD detection via likelihood ratio tests.
87. Derive conformal prediction intervals for sequence probabilities.
88. Show robustness curve under adversarial perturbations of tokens.
89. Evaluate coverage vs sharpness for uncertainty estimates.
90. Calibrate beam search scores to match true sequence probabilities.
91. Compare perplexity with downstream task risk via Bayes decision rule.
92. Derive regret bounds for next-token prediction under misspecification.
93. Quantify distribution shift via population stability index on tokens.
94. Evaluate effect of quantization on perplexity and calibration.
95. Compute generalization gap across domains (books → code → dialogue).
96. Derive optimal early-stopping rule from validation log-loss trend.
97. Bound catastrophic forgetting using Fisher-based quadratic penalty.
98. Analyze RLHF-induced distribution shift on token-level statistics.
99. Evaluate selective prediction (abstention) under risk constraints.
100. Build a composite score aggregating perplexity, ECE, robustness, and cost.
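
For Problem 82, a minimal ECE estimator can look like the sketch below. It assumes equal-width confidence bins on $$[0,1]$$, uses the top-class probability as the confidence, and weights each bin by its share of samples; other binning schemes (for example equal-mass bins) give different estimates. The function name and the toy data are illustrative.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (bin weight) * |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index for [lo, hi) bins; a confidence of exactly 1.0 goes in the last bin.
    idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


# Toy predictions: confidence of the predicted class and whether it was right.
conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hit = [1, 1, 1, 0, 1, 0]
print(expected_calibration_error(conf, hit))
```

Here the high-confidence group is right 75% of the time at confidence 0.9 and the other group 50% of the time at confidence 0.6, so the estimate is $$\tfrac{4}{6}\cdot 0.15+\tfrac{2}{6}\cdot 0.10$$.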

