### Convergence Theorems in Machine Learning
Convergence theorems in ML state the conditions under which learning algorithms stabilize as they iterate, and under which learned models approach the true underlying function or distribution as the amount of data grows.

### Key Convergence Theorems in ML Context
| **Theorem / Principle**                           | **Description**                                                                                                             | **Relevance in ML**                                                    |
| ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
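| **Empirical Risk Minimization (ERM) Convergence** | Under i.i.d. data, the empirical risk of a fixed hypothesis converges to its expected risk as sample size grows; for the ERM solution itself to be consistent, this must hold uniformly over the hypothesis class. | Justifies using the training loss on a finite dataset as a proxy for generalization error. |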
| **Uniform Convergence Theorem**                   | With high probability, the worst-case deviation between empirical and expected risk over a class of functions is small.     | Foundation for VC dimension, generalization bounds.                    |
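| **Gradient Descent Convergence Theorem**          | For convex functions with Lipschitz-continuous gradients and a suitably small step size, gradient descent converges to the global minimum value at rate O(1/k). | Guarantees convergence of training loss for convex models (e.g., linear or logistic regression); deep networks are generally non-convex, so the guarantee does not apply to them directly. |
| **Stochastic Gradient Descent (SGD) Convergence** | Under bounded gradient variance and a diminishing learning rate, SGD converges in expectation to the global minimum for convex objectives, and to a stationary point for smooth non-convex ones. | Core to online learning and training large models.                     |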
| **Law of Large Numbers (LLN)**                    | Sample averages converge to expected values as the number of samples increases.                                             | Justifies model training via empirical loss minimization.              |
| **Central Limit Theorem (CLT)**                   | For i.i.d. samples with finite variance, the distribution of the standardized sample mean approaches a normal distribution as sample size increases. | Underpins confidence intervals and hypothesis tests in ML evaluations. |
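
The LLN and ERM rows can be illustrated with a small simulation. This is only a sketch under stated assumptions: a single fixed hypothesis whose 0-1 loss is 1 with probability `TRUE_RISK` on each i.i.d. sample. The value 0.3 and the names `TRUE_RISK` and `empirical_risk` are hypothetical and chosen for illustration.

```python
import random

TRUE_RISK = 0.3  # assumed expected 0-1 loss of one fixed hypothesis (illustrative value)

def empirical_risk(n, rng):
    """Average 0-1 loss over n i.i.d. samples, each misclassified with probability TRUE_RISK."""
    errors = sum(1 for _ in range(n) if rng.random() < TRUE_RISK)
    return errors / n

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (10, 1_000, 100_000):
    risk = empirical_risk(n, rng)
    print(f"n = {n:>7}: empirical risk = {risk:.4f}, gap = {abs(risk - TRUE_RISK):.4f}")
```

The gap to the true risk should typically shrink on the order of 1/sqrt(n). Note that this covers one fixed hypothesis only; a guarantee over a whole hypothesis class needs the uniform convergence result from the table.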


In [1]:
import numpy as np

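# Objective: minimize f(x) = (x - 3)^2, a convex function with minimizer x = 3
def f(x):
    return (x - 3) ** 2

def grad_f(x):
    return 2 * (x - 3)  # exact derivative of f

x = 0.0   # initial point
lr = 0.1  # learning rate (step size)
for i in range(100):
    grad = grad_f(x)
    x = x - lr * grad  # full gradient descent step (deterministic, not stochastic)
    if i % 10 == 0:
        # Printed after the update, so "Iter i" reports the (i + 1)-th iterate
        print(f"Iter {i}: x = {x:.4f}, f(x) = {f(x):.6f}")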


Iter 0: x = 0.6000, f(x) = 5.760000
Iter 10: x = 2.7423, f(x) = 0.066408
Iter 20: x = 2.9723, f(x) = 0.000766
Iter 30: x = 2.9970, f(x) = 0.000009
Iter 40: x = 2.9997, f(x) = 0.000000
Iter 50: x = 3.0000, f(x) = 0.000000
Iter 60: x = 3.0000, f(x) = 0.000000
Iter 70: x = 3.0000, f(x) = 0.000000
Iter 80: x = 3.0000, f(x) = 0.000000
Iter 90: x = 3.0000, f(x) = 0.000000
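
The geometric decay in the output above can be checked by hand. The following is a worked sketch for this specific quadratic only; the symbols $\eta$ (the learning rate `lr`) and $x_k$ (the iterate after $k$ updates) are notation introduced here for illustration.

$$
x_{k+1} - 3 = x_k - 2\eta\,(x_k - 3) - 3 = (1 - 2\eta)\,(x_k - 3)
$$

$$
x_k - 3 = (1 - 2\eta)^k\,(x_0 - 3) = -3 \cdot 0.8^k, \qquad f(x_k) = 9 \cdot 0.64^k \quad (\eta = 0.1,\ x_0 = 0)
$$

Because the loop prints after each update, the line `Iter 10` reports $x_{11} = 3 - 3 \cdot 0.8^{11} \approx 2.7423$, which matches the output. The iteration contracts whenever $|1 - 2\eta| < 1$, that is, for $0 < \eta < 1$. This fast linear rate relies on $f$ being strongly convex; for functions that are merely convex with Lipschitz-continuous gradients, the standard guarantee is the slower $O(1/k)$ rate.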

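The loop above uses exact gradients, so it is plain gradient descent. To illustrate the SGD row of the table, here is a minimal sketch on the same objective, assuming the gradient is observed with zero-mean Gaussian noise and the step size decays as `c / (k + 1)`. The noise model, the constant `c = 0.5`, and the names `noisy_grad` and `sgd` are assumptions made for this example, not part of the original notebook.

```python
import random

def noisy_grad(x, rng, noise_std=1.0):
    """Gradient of (x - 3)^2 plus zero-mean Gaussian noise (assumed noise model)."""
    return 2 * (x - 3) + rng.gauss(0.0, noise_std)

def sgd(steps, x0=0.0, c=0.5, seed=0):
    """Run SGD with the diminishing step size lr_k = c / (k + 1)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x = x0
    for k in range(steps):
        lr = c / (k + 1)  # steps sum to infinity, squared steps have a finite sum
        x -= lr * noisy_grad(x, rng)
    return x

for steps in (10, 1_000, 100_000):
    x = sgd(steps)
    print(f"steps = {steps:>7}: x = {x:.4f}, |x - 3| = {abs(x - 3):.4f}")
```

With a constant step size the iterates would keep fluctuating around the minimizer at a noise-dependent scale; the diminishing schedule averages the noise out, which is why the error should shrink as the number of steps grows.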