## Ordinary Least Squares Linear Regression

Regression is the task of estimating the relationship between a dependent variable $Y$ and one or more independent variables $X$.

The simplest linear regression model involves the linear combination of input variables:

$y(X,W) = w_{0} + w_{1}x_{1} + ... + w_{D}x_{D}$

where $X = (x_{1},...,x_{D})^{T}$ and $W=(w_{0},...,w_{D})^{T}$.

Note that $w_{0}$ has no associated $x$ term, and thus is the bias term.
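As a minimal sketch of the model above, the prediction for a batch of inputs can be written in a few lines of JAX. The names `predict`, `w0` and `w`, and the example values, are illustrative assumptions rather than anything defined in this notebook; the bias is kept separate from the remaining weights.

```python
import jax.numpy as jnp

def predict(X, w0, w):
    """Evaluate y(X, W) = w0 + w1*x1 + ... + wD*xD for every row of X."""
    # X has shape (N, D), w has shape (D,), and w0 is the scalar bias term.
    return w0 + X @ w

# Two observations with D = 2 input variables (made-up values).
X = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
w0 = 0.5
w = jnp.array([2.0, -1.0])

y = predict(X, w0, w)  # y is [0.5, 2.5]
```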

The goal is to select values for $W$ such that some loss function $E()$ is minimised. In OLS linear regression, we take the loss function to be:

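$E_{D}(W) = \frac{1}{2}\sum^{N}_{n=1} (t_{n} - W^{T}X_{n})^{2}$

where $n$ indexes the observations in the training dataset, each with input variables $X_{n}$ and target $t_{n}$, and $X_{n}$ is understood to carry a leading constant entry $x_{0} = 1$ so that $W^{T}X_{n}$ includes the bias $w_{0}$. We assume that $t_{n}$ is given by $y(X_{n}, W) + \epsilon$, where $\epsilon$ is additive Gaussian noise, $\epsilon \sim \mathcal{N}(\mu, \sigma^{2})$ (usually taken to be zero-mean, $\mu = 0$, since any constant offset is absorbed by $w_{0}$).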
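The loss and the noise assumption can be illustrated with a small synthetic dataset. This is only a sketch: the "true" parameters, the noise scale of 0.1 and the function name `sum_of_squares_loss` are hypothetical choices made for the example.

```python
import jax.numpy as jnp
from jax import random

def sum_of_squares_loss(w0, w, X, t):
    """E_D(W) = 1/2 * sum_n (t_n - y(X_n, W))^2, with the bias w0 kept separate."""
    residuals = t - (w0 + X @ w)
    return 0.5 * jnp.sum(residuals ** 2)

# Hypothetical ground-truth parameters used only to generate example data.
true_w0 = 1.0
true_w = jnp.array([2.0, -3.0])

key_x, key_eps = random.split(random.PRNGKey(0))
X = random.normal(key_x, (100, 2))             # N = 100 observations, D = 2
noise = 0.1 * random.normal(key_eps, (100,))   # zero-mean Gaussian noise
t = true_w0 + X @ true_w + noise               # t_n = y(X_n, W) + epsilon

loss_true = sum_of_squares_loss(true_w0, true_w, X, t)
loss_zero = sum_of_squares_loss(0.0, jnp.zeros(2), X, t)
```

At the generating parameters the loss reduces to half the sum of squared noise terms, so it should be far smaller than the loss at an arbitrary guess such as all zeros.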




A key property of this model is that it is a linear function of the parameters $W$, as well as a linear function of the input variables $X$. Being linear in $W$ makes $E_{D}$ a quadratic (and therefore convex) function of $W$, so there is a single set of parameters that minimises $E_{D}$ whenever the inputs are not collinear. However, being linear in the input variables $X$ is constricting and limits the expressivity of the model. Therefore, we extend the class of models by considering linear combinations of fixed nonlinear functions $\phi$ of the input variables, which gives the loss:



$E_{D}(W) = \frac{1}{2}\sum^{N}_{n=1} (t_{n} - W^{T}\phi (X_{n}))^{2}$

In other words, we may apply some nonlinear function to the input variables, such as $\sin(X)$ or $X^{2}$, provided that these functions are known and fixed (i.e. we are still only estimating the parameters $W$, and the model remains linear in them).
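To make the idea concrete, here is a small sketch with one possible choice of fixed basis functions, $\phi(x) = (x, x^{2}, \sin(x))$, for a scalar input. The choice of basis and the names `basis_expand` and `predict` are assumptions for illustration only.

```python
import jax.numpy as jnp

def basis_expand(x):
    """Map scalar inputs x of shape (N,) to features [x, x^2, sin(x)] of shape (N, 3)."""
    return jnp.stack([x, x ** 2, jnp.sin(x)], axis=1)

def predict(x, w0, w):
    # Nonlinear in the input x, but still linear in the parameters (w0, w).
    return w0 + basis_expand(x) @ w

x = jnp.array([0.0, 1.0, 2.0])
Phi = basis_expand(x)            # one row phi(X_n) per observation
w0 = 1.0
w = jnp.array([0.5, -1.0, 2.0])  # made-up weights

y = predict(x, w0, w)
# Linearity in the parameters: scaling (w0, w) by 2 scales the prediction by 2.
y_scaled = predict(x, 2.0 * w0, 2.0 * w)
```

Because the basis functions are fixed, the feature matrix `Phi` can be computed once up front, and everything that follows treats it exactly like a matrix of raw inputs.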



Now all that is left to do is to find the $W$ that minimises $E_{D}(W)$.

A minimum of a differentiable function is a stationary point, so we can set the derivative of $E_{D}$ with respect to $W$ to 0 to find the optimal solution:

$\frac{\partial E_{D}}{\partial W} = \frac{\partial}{\partial W}\left[\frac{1}{2}\sum^{N}_{n=1} (t_{n} - W^{T}\phi (X_{n}))^{2}\right] = 0$

Before differentiating, it is easier to pull the bias term $w_{0}$ out and treat it separately from the remaining parameters, so that from here on $W = (w_{1},...,w_{J})^{T}$:

$\frac{\partial E_{D}}{\partial W} = \frac{\partial}{\partial W}\left[\frac{1}{2}\sum^{N}_{n=1} (t_{n} - w_{0} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n}))^{2}\right] = 0$

Next, apply the chain rule: differentiate the outer square, then multiply by the derivative of the inner term with respect to $W$, which is $-\phi (X_{n})^{T}$:

$\frac{\partial E_{D}}{\partial W} = \frac{1}{2}\sum^{N}_{n=1} 2(t_{n} - w_{0} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n})) \cdot \frac{\partial}{\partial W}\left(t_{n} - w_{0} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n})\right) = 0$

$\frac{\partial E_{D}}{\partial W} = -\sum^{N}_{n=1} (t_{n} - w_{0} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n})) \phi (X_{n})^{T} = 0$

$\therefore$
$0 = \sum^{N}_{n=1} (t_{n} - w_{0} - W^{T}\phi (X_{n}))\phi (X_{n})^{T}$
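As a sanity check on this result, the gradient of $E_{D}$ with respect to $W$ is the negative of the sum in the stationarity condition above, and that can be compared against automatic differentiation. The sketch below uses `jax.grad` on random data; the function names, array shapes and parameter values are hypothetical.

```python
import jax.numpy as jnp
from jax import grad, random

def loss(w, w0, Phi, t):
    """E_D = 1/2 * sum_n (t_n - w0 - w^T phi(X_n))^2, with rows of Phi holding phi(X_n)."""
    residuals = t - w0 - Phi @ w
    return 0.5 * jnp.sum(residuals ** 2)

def analytic_grad(w, w0, Phi, t):
    # -sum_n (t_n - w0 - w^T phi(X_n)) * phi(X_n); the minus sign comes from the chain rule.
    residuals = t - w0 - Phi @ w
    return -Phi.T @ residuals

key_phi, key_t = random.split(random.PRNGKey(0))
Phi = random.normal(key_phi, (50, 3))   # 50 observations, 3 basis functions
t = random.normal(key_t, (50,))
w = jnp.array([0.1, -0.2, 0.3])
w0 = 0.5

auto = grad(loss)(w, w0, Phi, t)        # differentiates with respect to the first argument, w
manual = analytic_grad(w, w0, Phi, t)
```

The two gradients should agree up to floating-point error, and setting either one to zero gives the same condition on $W$.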

To eliminate $w_{0}$, apply the same stationarity condition to the bias: setting the derivative of $E_{D}$ with respect to $w_{0}$ to 0 gives

$0 = \sum^{N}_{n=1} (t_{n} - w_{0} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n}))$

Since $w_{0}$ does not depend on $n$, we can collect it on the left-hand side:

$\sum^{N}_{n=1}w_{0} = \sum^{N}_{n=1}(t_{n} - \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n}))$

Let $\rho_{n} = \sum^{J}_{j=1}w_{j}\phi_{j} (X_{n}) = W^{T}\phi (X_{n})$. Dividing both sides by $N$, and writing $\bar{t}$ and $\bar{\rho}$ for the averages of $t_{n}$ and $\rho_{n}$ over the dataset:

$\therefore w_{0} = \bar{t} - \bar{\rho}$

Substituting this expression for $w_{0}$ back into the stationarity condition for $W$:

$0 = \sum^{N}_{n=1} (t_{n} - (\bar{t} - \bar{\rho}) - W^{T}\phi (X_{n}))\phi (X_{n})^{T}$

$0 = \sum^{N}_{n=1} (t_{n} - \bar{t})\phi (X_{n})^{T} + \sum^{N}_{n=1}(\bar{\rho} - W^{T}\phi (X_{n}))\phi (X_{n})^{T}$

$\sum^{N}_{n=1}(W^{T}\phi (X_{n}) - \bar{\rho})\phi (X_{n})^{T} = \sum^{N}_{n=1} (t_{n} - \bar{t})\phi (X_{n})^{T}$
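One way to finish from here, sketched under the assumption that the centred feature matrix has full column rank: since $\bar{\rho} = W^{T}\bar{\phi}$, where $\bar{\phi}$ is the mean feature vector, the last line is a linear system in $W$ with solution $W = (\Phi_{c}^{T}\Phi_{c})^{-1}\Phi_{c}^{T}(t - \bar{t})$, where the rows of $\Phi_{c}$ are $\phi(X_{n}) - \bar{\phi}$. The bias then follows from $w_{0} = \bar{t} - \bar{\rho}$. The helper name `fit_ols` and the synthetic data are hypothetical; the result is cross-checked against `jnp.linalg.lstsq` on a design matrix with an explicit column of ones.

```python
import jax.numpy as jnp
from jax import random

def fit_ols(Phi, t):
    """Solve the centred normal equations for w, then recover the bias w0."""
    phi_bar = Phi.mean(axis=0)   # mean feature vector
    t_bar = t.mean()             # mean target
    Phi_c = Phi - phi_bar        # centred features
    t_c = t - t_bar              # centred targets
    # (Phi_c^T Phi_c) w = Phi_c^T t_c; assumes Phi_c has full column rank.
    w = jnp.linalg.solve(Phi_c.T @ Phi_c, Phi_c.T @ t_c)
    w0 = t_bar - phi_bar @ w     # w0 = t_bar - rho_bar
    return w0, w

# Synthetic data from hypothetical ground-truth parameters.
key_phi, key_eps = random.split(random.PRNGKey(0))
Phi = random.normal(key_phi, (200, 3))
true_w0 = 1.0
true_w = jnp.array([2.0, -3.0, 0.5])
t = true_w0 + Phi @ true_w + 0.1 * random.normal(key_eps, (200,))

w0_hat, w_hat = fit_ols(Phi, t)

# Cross-check: least squares on [1, phi(X_n)] estimates the bias and weights jointly.
design = jnp.column_stack([jnp.ones(len(t)), Phi])
reference, *_ = jnp.linalg.lstsq(design, t)
```

Centring the features and targets decouples the bias from the other weights, which is why the two-step solve should match the joint least-squares fit up to floating-point error.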





```python
import jax.numpy as jnp
from jax import random, grad, jit, vmap, device_put
import matplotlib.pyplot as plt
from sklearn import datasets
```