# Normal Equation Proof

### Introduction: 

Many of us have encountered and employed a seemingly magical equation known as the <a href="https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)"><i>Normal Equation</i></a>.  

$$ \huge{ \theta = (A^T A)^{-1} A^T y } $$

This equation gives the model parameters ($\theta$) that best fit a linear model with system matrix $A$ to data $y$, in the least-squares sense. However, after long and repeated use, one may acquire the madness to wonder how it is derived. Thankfully, the derivation is fairly simple and only requires basic linear algebra and calculus. The following is a proof that attempts to properly show the derivation of the normal equation. To start, suppose that one has linear model parameters $\theta$ such that, when multiplied by a system matrix $A$, the result is the estimated output $\hat y$.

In equation form, this is simply: 

$$ \hat y = A \theta $$
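
As a concrete illustration, here is a minimal NumPy sketch of the model equation for a straight-line fit. The inputs `x`, the parameter values, and the choice of a line model are made up for this example; the only thing taken from the text is $\hat y = A \theta$.

```python
import numpy as np

# Hypothetical example: a straight-line model y_hat = theta_0 + theta_1 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])          # sample inputs (made up)
A = np.column_stack([np.ones_like(x), x])   # system matrix, one row per sample
theta = np.array([1.0, 2.0])                # intercept 1, slope 2 (made up)

y_hat = A @ theta                           # estimated output: y_hat = A theta
print(y_hat)                                # [1. 3. 5. 7.]
```

Each row of `A` holds the features for one sample, so the column of ones carries the intercept and the second column carries the slope.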

Naturally, the goal is to minimize the model error. More specifically, the objective is to minimize the sum of squared errors, which has the same minimizer as the mean squared error because the two differ only by a constant factor. Consider a vector $r$ that represents the model error. In optimization, this error vector is often called the residual vector and is defined: 

$$ r = y - \hat y $$ 

The goal then is to pick $\theta$ such that the error is minimized:

$$ \large{ \min_\theta \, || r ||^2 } $$

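From calculus, it is known that a minimum of a differentiable function occurs where its derivative is zero. Because $ || r ||^2 $ is a convex quadratic in $ \theta $, any point where the derivative vanishes is a global minimum. 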

$$ \large{ \frac{ \partial || r ||^2 }{ \partial \theta }}  \large{ = 0}$$

In summary, the goal is to find $ \theta $ such that $ \frac{ \partial || r ||^2 }{ \partial \theta } = 0 $.

### Derivation: 

Recall that the squared Euclidean norm of the residual vector is as follows:

$$ || r || ^2 = r^T r $$
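
A quick numerical check of this identity, as a sketch with a made-up residual vector: the inner product of `r` with itself should match the square of its Euclidean norm.

```python
import numpy as np

r = np.array([1.0, -2.0, 2.0])           # made-up residual vector

sq_norm_dot = r @ r                      # r^T r = 1 + 4 + 4
sq_norm_linalg = np.linalg.norm(r) ** 2  # ||r||^2 via the 2-norm

print(sq_norm_dot)                       # 9.0
print(np.isclose(sq_norm_dot, sq_norm_linalg))
```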

Next, substitute the definition of $ r $ and the linear model $ \hat y = A \theta $ to arrive at:

$$ || r || ^2 = ( y - A \theta )^T ( y - A  \theta ) $$ 

The next step is to distribute the transpose, using the property $ (A \theta)^T = \theta^T A^T $:

$$ \dots  = ( y^T - \theta^T A ^T ) ( y - A \theta ) $$ 

Expand the product (FOIL), keeping the order of the matrix factors: 

$$ \dots  = \theta ^T A^T A \theta  -  \theta^T A^T y  -  y^T A \theta + y^T y $$  

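The two middle terms are scalars, and a scalar equals its own transpose, so $ \theta^T A^T y = (y^T A \theta)^T = y^T A \theta $. Group them into a single term: 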

$$ \large{ \dots  = \theta ^T A^T A \theta - 2 y^T A \theta + y^T  y } $$  
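
The expansion can be sanity-checked numerically. The sketch below uses random, made-up values for `A`, `y`, and `theta` (the shapes and seed are arbitrary) and compares the direct computation of $r^T r$ with the expanded three-term form.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed, for repeatability
A = rng.normal(size=(5, 2))          # made-up system matrix
y = rng.normal(size=5)               # made-up data
theta = rng.normal(size=2)           # made-up parameters

r = y - A @ theta
direct = r @ r                       # ||r||^2 computed directly

# theta^T A^T A theta - 2 y^T A theta + y^T y
expanded = theta @ A.T @ A @ theta - 2 * y @ A @ theta + y @ y

print(np.isclose(direct, expanded))
```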

Now, to minimize this function, we know that we can take the derivative and set it equal to zero. In other words, we wish to: 

$$ \large{ \frac{ \partial || r ||^2 }{ \partial \theta }}  \large{ = 0}$$

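Before one can do this, one must know the value of $ \frac{ \partial || r ||^2 }{ \partial \theta } $.  [Matrix calculus](https://en.wikipedia.org/wiki/Matrix_calculus) is required in order to differentiate a scalar function with respect to a vector. Two identities are needed. For a constant vector $ b $ and a constant square matrix $ M $:

$$  \large{ \frac{ \partial b^T x }{ \partial x }} = b \qquad \large{ \frac{ \partial x^T M x }{ \partial x }} = (M + M^T) x $$ 

When $ M $ is symmetric, the second identity reduces to $ 2 M x $. Here $ M = A^T A $ is symmetric, and $ y^T A \theta = (A^T y)^T \theta $, so $ b = A^T y $. Applying these identities term by term (the constant $ y^T y $ has zero derivative), one ends up with: 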

$$ \large{ \frac{ \partial || r ||^2 }{ \partial \theta } =  2 A^T A \theta - 2 A^T y = 0 } $$
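
One way to gain confidence in this gradient is to compare it against a finite-difference approximation. The following is a sketch under assumed, randomly generated `A`, `y`, and `theta`; the helper `cost` and the step size `eps` are choices made for this example, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, for repeatability
A = rng.normal(size=(6, 3))          # made-up system matrix
y = rng.normal(size=6)               # made-up data
theta = rng.normal(size=3)           # point at which to check the gradient

def cost(t):
    """Squared norm of the residual, ||y - A t||^2."""
    r = y - A @ t
    return r @ r

# Analytic gradient from the derivation: 2 A^T A theta - 2 A^T y
analytic = 2 * A.T @ A @ theta - 2 * A.T @ y

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(theta.size)
])

print(np.allclose(analytic, numeric, atol=1e-5))
```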

Divide both sides by 2 and move the $ A^T y $ term to the right-hand side: 

$$ \large{ A^T A \theta = A^T y }  $$

Next, assuming $ A^T A $ is invertible (which holds when the columns of $ A $ are linearly independent), left multiply both sides by $ (A^T A)^{-1} $ to end up with:

$$ \large{ \theta = (A^T A)^{-1} A^T y } $$

And presto! We have derived the normal equation :) 
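
To see the result in action, here is a minimal NumPy sketch that fits a line to synthetic data. The data, noise level, and "true" parameters are invented for the example. Rather than forming $(A^T A)^{-1}$ explicitly, it solves the linear system $A^T A \theta = A^T y$ with `np.linalg.solve`, which is generally the better-behaved choice numerically, and compares the answer with NumPy's own least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
x = np.linspace(0.0, 1.0, 20)                   # made-up inputs
A = np.column_stack([np.ones_like(x), x])       # system matrix for a line
true_theta = np.array([1.0, 2.0])               # made-up "true" parameters
y = A @ true_theta + 0.1 * rng.normal(size=x.size)  # noisy synthetic data

# Normal equation: solve (A^T A) theta = A^T y
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_normal)
print(np.allclose(theta_normal, theta_lstsq))
```

The two estimates should agree to within floating-point error for a well-conditioned problem like this one. When the columns of `A` are nearly dependent, $A^T A$ becomes ill-conditioned and a solver such as `np.linalg.lstsq` is usually the safer option.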








