### **The power of Linear Regression**

**Linear regression** has consistently been the main introductory concept in any *Machine Learning, Analytics or Statistics* course. It is therefore unsurprising that I have encountered it countless times already.

However, instead of following the traditional path of reading about it in a book and then jumping straight into the *scikit-learn* library, I decided to take the time to derive the models myself.

My exploration includes the mathematical derivations of the fundamental linear regression model as well as some basic Python functions that implement it.

### **Synthetic data**

Throughout this repo, all data was synthetically generated.
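
As an illustration only, here is a minimal sketch of how such data could be generated, assuming a linear model with an intercept and Gaussian noise. The function name `make_synthetic_data`, its parameters, and the default values are hypothetical and not taken from the repo.

```python
import numpy as np

def make_synthetic_data(n=100, theta=(1.0, 2.0), sigma=0.5, seed=0):
    """Draw n points from y = theta[0] + theta[1] * x + eps, eps ~ N(0, sigma^2).

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    x = rng.uniform(-3.0, 3.0, size=n)
    # Design matrix: a column of ones (intercept) followed by the raw input
    X = np.column_stack([np.ones(n), x])
    y = X @ theta + rng.normal(0.0, sigma, size=n)
    return X, y

X, y = make_synthetic_data()
print(X.shape, y.shape)
```

Fixing the seed keeps the sample reproducible, which makes it easier to compare fitted parameters against the known `theta`.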

### **Mathematical derivation: OLS linear regression**

We start by formulating the most basic scenario: a regression with additive Gaussian noise, described by the likelihood
$$p(y|x)=\mathcal{N}(y|f(x), \sigma^2) \iff y=f(x) + \epsilon ,\ \epsilon \sim \mathcal{N}(0,\sigma^2)$$
For now we can assume that the variance is known and focus on finding regression parameters (denoted by $\theta$). To do so, note that two different observations are conditionally independent, such that:
$$p(y_1,\dots,y_n|x_1,\dots,x_n,\theta)=\prod_{i=1}^{n} p(y_i|x_i,\theta)=\prod_{i=1}^{n} \mathcal{N}(y_i|x_i^T\theta, \sigma^2),\ y_i\in \mathbb{R} \: \: \text{and} \: \: x_i,\theta \in \mathbb{R}^m$$
Since a closed-form solution exists for this problem, gradient descent is unnecessary. We can instead maximize the likelihood by minimizing its negative logarithm:
$$ -\log\left(\prod_{i=1}^{n} \mathcal{N}(y_i|x_i^T\theta, \sigma^2)\right) = -\sum_{i=1}^{n} \log \mathcal{N}(y_i|x_i^T\theta, \sigma^2) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T\theta)^2 + \frac{n}{2}\log(2\pi\sigma^2) $$
For convenience, we can then ignore constant terms and define the negative log-likelihood function as
$$ \mathcal{L}(\theta) := \frac{1}{2\sigma^2} (y-X\theta)^T(y-X\theta), \ X := [x_1,...,x_n]^T, \ y := [y_1,...,y_n]^T $$
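
Dropping the constant is harmless because it does not depend on $\theta$. The sketch below checks this numerically under assumed values (random data, $\sigma = 1$): the exact Gaussian negative log-likelihood and the quadratic loss $\mathcal{L}(\theta)$ should differ by the same amount, $\frac{n}{2}\log(2\pi\sigma^2)$, for every $\theta$. The function names are hypothetical.

```python
import numpy as np

def gaussian_nll(theta, X, y, sigma):
    """Exact negative log-likelihood: minus the sum of log Gaussian densities."""
    resid = y - X @ theta
    density = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.sum(np.log(density))

def quadratic_loss(theta, X, y, sigma):
    """L(theta) = (y - X theta)^T (y - X theta) / (2 sigma^2), constants dropped."""
    resid = y - X @ theta
    return resid @ resid / (2 * sigma**2)

rng = np.random.default_rng(0)
n, m, sigma = 50, 3, 1.0
theta_true = np.array([1.0, -2.0, 0.5])  # assumed values for illustration
X = rng.normal(size=(n, m))
y = X @ theta_true + rng.normal(0.0, sigma, size=n)

# Evaluate both losses at a few candidate parameter vectors near theta_true
candidates = theta_true + 0.5 * rng.normal(size=(5, m))
gaps = np.array([gaussian_nll(t, X, y, sigma) - quadratic_loss(t, X, y, sigma)
                 for t in candidates])
expected_gap = 0.5 * n * np.log(2 * np.pi * sigma**2)
print(gaps, expected_gap)
```

Since the gap is the same for every candidate, both functions share the same minimizer.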
Differentiating with respect to $\theta$ and setting the result equal to zero therefore yields
$$ \frac{d\mathcal{L}}{d\theta} = \frac{1}{2\sigma^2} \frac{d}{d\theta}(y^Ty - 2y^TX\theta + \theta^TX^TX\theta) = \frac{1}{\sigma^2}(-y^TX+\theta^TX^TX) = 0^T $$
$$ \theta^TX^TX = y^TX \iff \theta^T = y^TX(X^TX)^{-1} \iff \hat{\theta} = (X^TX)^{-1}X^Ty $$
The inverse exists provided $X^TX$ is invertible, i.e. $X$ has full column rank. From this result it follows that $ \ \bold{X\hat{\theta} = X(X^TX)^{-1}X^Ty} \ $ is the orthogonal projection of $y$ onto the column space of $X$.
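
The closed-form estimator translates almost directly into NumPy. The following is a minimal sketch, not necessarily the implementation used in this repo; `ols_fit` is a hypothetical name, and the data are generated under assumed values. It solves the normal equations $X^TX\theta = X^Ty$ rather than forming the inverse explicitly, and compares the result against `np.linalg.lstsq`.

```python
import numpy as np

def ols_fit(X, y):
    """Return theta_hat solving the normal equations X^T X theta = X^T y.

    Assumes X has full column rank, so X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n = 200
theta_true = np.array([1.0, 2.0])  # assumed intercept and slope
X = np.column_stack([np.ones(n), rng.uniform(-3.0, 3.0, size=n)])
y = X @ theta_true + rng.normal(0.0, 0.5, size=n)

theta_hat = ols_fit(X, y)
y_hat = X @ theta_hat  # fitted values: projection of y onto the column space of X

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat, theta_lstsq)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a common choice for numerical stability; for badly conditioned $X$, `np.linalg.lstsq` is usually the safer option.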


We can extend this notion further to cases where we apply a non-linear polynomial feature map to a scalar input $x$,
$$ \phi (x) = \begin{bmatrix} 1 \\ x \\ \vdots \\ x^p \end{bmatrix} \in \mathbb{R}^{p+1} $$
with the corresponding feature matrix
$$ \Phi := \begin{bmatrix} \phi^T (x_1) \\ \vdots \\ \phi^T (x_n) \end{bmatrix} = \begin{bmatrix} 1 & x_1 & \cdots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^p \end{bmatrix} \in \mathbb{R}^{n \times (p+1)} $$

The model is still linear in $\theta$, so the derivation is the same as before with $X$ replaced by $\Phi$, and we find the result $ \ \bold{\Phi\hat{\theta} = \Phi(\Phi^T\Phi)^{-1}\Phi^Ty} \ $
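
A minimal sketch of the polynomial case, assuming a cubic ground truth with Gaussian noise. The names `poly_features` and `poly_fit` are hypothetical. `np.vander` with `increasing=True` builds exactly the matrix $\Phi$ above, with columns $1, x, \dots, x^p$.

```python
import numpy as np

def poly_features(x, p):
    """Feature matrix Phi with rows [1, x_i, ..., x_i^p]."""
    return np.vander(x, N=p + 1, increasing=True)

def poly_fit(x, y, p):
    """Closed-form least-squares fit of a degree-p polynomial."""
    Phi = poly_features(x, p)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(0)
n, p = 200, 3
theta_true = np.array([1.0, -2.0, 0.0, 0.5])  # assumed: y = 1 - 2x + 0.5x^3
x = rng.uniform(-2.0, 2.0, size=n)
Phi = poly_features(x, p)
y = Phi @ theta_true + rng.normal(0.0, 0.3, size=n)

theta_hat = poly_fit(x, y, p)
y_hat = Phi @ theta_hat
print(theta_hat)
```

For high degrees $\Phi^T\Phi$ becomes ill-conditioned, so this direct approach is only a reasonable choice for small $p$.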



Note that a geometric/linear algebra approach is also possible. Suppose we want to solve the equation $ \ X\theta=y \ $ where $y$ is not necessarily in the image of $X$. Then, to minimize the distance from $y$ to $X\theta$, we use the fact that
$$ \| y - X\hat{\theta} \| \leq \| y - X\theta \| \ \text{for all } \theta, \quad X\hat{\theta}=\operatorname{proj}_{\operatorname{im}(X)}(y) $$
which follows from the Pythagorean theorem. We also know that $ (y - X\hat{\theta}) \perp \operatorname{im}(X) $ so it must be that  $ (y - X\hat{\theta}) \in \operatorname{ker}(X^T) $. Then, 
$$ X^T(y - X\hat{\theta})=0 \iff X^Ty=X^TX\hat{\theta} \iff \hat{\theta}=(X^TX)^{-1}X^Ty $$
from which we once again find the same result for $\bold{X\hat{\theta}}$.
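
The geometric picture can be checked numerically. The sketch below uses assumed random data and illustrative variable names. It verifies that the residual is orthogonal to the columns of $X$, that $P = X(X^TX)^{-1}X^T$ behaves like a projection ($P^2 = P$), and that no other $\theta$ gives a smaller residual norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
y = rng.normal(size=n)  # generic y, not in the image of X

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_hat

# Orthogonality: X^T (y - X theta_hat) should vanish up to rounding error
orthogonality = X.T @ residual

# Projection matrix onto im(X); idempotent and maps y to the fitted values
P = X @ np.linalg.solve(X.T @ X, X.T)

# Any other theta should give a residual at least as long
other_thetas = theta_hat + rng.normal(size=(10, m))
other_norms = np.array([np.linalg.norm(y - X @ t) for t in other_thetas])

print(np.abs(orthogonality).max(), np.linalg.norm(residual), other_norms.min())
```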