# Bonus notebook - derivatives and closed form for multiple linear model

##  1) Error function:

Remember our error function definition, from the learning notebook, which is the same as in the simple model: 

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 $$ 

Expanding with our linear model, we get:

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \beta_0 -  \sum_{j=1}^K \beta_j x_{j_i})^2 $$
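As a quick illustration, the error function can be evaluated with NumPy. This is only a sketch: the helper names `predict` and `mse` and the toy data are assumptions made for this example, not part of the learning notebook.

```python
import numpy as np

def predict(X, beta0, beta):
    """Linear model: y_hat_i = beta0 + sum_j beta_j * x_{j_i}."""
    return beta0 + X @ beta

def mse(y, y_hat):
    """Mean squared error J(y, y_hat)."""
    return np.mean((y - y_hat) ** 2)

# Hypothetical toy data: N = 4 observations, K = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([3.0, 2.0, 4.0, 7.0])

# Arbitrary parameter values chosen for the example.
beta0, beta = 1.0, np.array([0.5, 1.0])

y_hat = predict(X, beta0, beta)
J = mse(y, y_hat)
print(J)
```

Each row of `X` is one observation $i$ and each column is one feature $j$, so `X @ beta` computes the inner sum for every observation at once.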


<br>
<br>

## 2) Derivatives of error function for multiple linear regression

The derivatives are not required for the closed form solution; however, they are quite useful for other methods, such as gradient descent, which you learn about at the end of the learning notebook.
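To make that usefulness concrete, here is a minimal batch gradient descent sketch. It uses the two results this section arrives at in the subsections below, namely the derivative with respect to $\beta_0$ and the derivative with respect to each $\beta_k$. The function name `gradient_descent`, the learning rate, the iteration count, and the noise-free toy data are all assumptions chosen for illustration, and may differ from what the learning notebook does.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iter=2000):
    """Plain batch gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    beta0, beta = 0.0, np.zeros(n_features)
    for _ in range(n_iter):
        residual = y - (beta0 + X @ beta)                 # y_i - y_hat_i
        grad_beta0 = -2.0 * residual.mean()               # dJ/d beta_0
        grad_beta = -2.0 * (X.T @ residual) / n_samples   # dJ/d beta_k for all k
        beta0 -= learning_rate * grad_beta0
        beta -= learning_rate * grad_beta
    return beta0, beta

# Hypothetical noise-free data generated from known parameters.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.5 + X @ np.array([2.0, -1.0])

beta0_hat, beta_hat = gradient_descent(X, y)
print(beta0_hat, beta_hat)
```

Because the data has no noise, the estimates should approach the parameters used to generate it; with real data the learning rate and number of iterations usually need tuning.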


### 2.1 ) Intercept derivative:

The intercept derivative has the same form as in the simple model. Let's develop it from our equation above:

$$\frac{d J}{d \beta_0} = \frac{d}{d \beta_0} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{j_i})^2) $$

We can expand the square, without unrolling the sum:

$$\frac{d J}{d \beta_0} = \frac{d}{d \beta_0} (\frac{1}{N}\sum_{i=1}^N (y_i^2 - 2 y_i \beta_0 - 2 y_i \sum_{j=1}^K \beta_j x_{j_i} + 2 \beta_0\sum_{j=1}^K \beta_j x_{j_i}+ \beta_0^2  + (\sum_{j=1}^K \beta_j x_{j_i})^2 )) $$

This makes it easier to drop all the terms that do not depend on $\beta_0$, since the whole sum $\sum_{j=1}^K \beta_j x_{j_i}$ is independent of $\beta_0$:

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N (0 - 2 y_i - 0 + 2 \sum_{j=1}^K \beta_j x_{j_i} + 2\beta_0 + 0)  \\
\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N (-2 y_i + 2 \sum_{j=1}^K \beta_j x_{j_i} + 2\beta_0) $$

Finally, we'll rearrange the interior of the sum and get to:

$$\frac{d J}{d \beta_0} = -\frac{1}{N} \sum_{i=1}^N [2 (y_i - \sum_{j=1}^K \beta_j x_{j_i} - \beta_0)] $$

$$\frac{d J}{d \beta_0} = -\frac{1}{N} \sum_{i=1}^N [2 (y_i - \hat{y_i})] $$
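One way to gain confidence in this result is to compare it with a central finite difference of $J$. The sketch below assumes random toy data and hypothetical helper names (`mse`, `dJ_dbeta0`); it is a sanity check, not a proof.

```python
import numpy as np

def mse(X, y, beta0, beta):
    """J = (1/N) * sum_i (y_i - beta0 - sum_j beta_j x_{j_i})^2."""
    return np.mean((y - beta0 - X @ beta) ** 2)

def dJ_dbeta0(X, y, beta0, beta):
    """Analytic intercept derivative: -(1/N) * sum_i 2 (y_i - y_hat_i)."""
    y_hat = beta0 + X @ beta
    return -np.mean(2.0 * (y - y_hat))

# Hypothetical random data and arbitrary parameter values.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta0, beta = 0.3, np.array([1.0, -2.0, 0.5])

# Central finite difference in beta0 only.
eps = 1e-6
numeric = (mse(X, y, beta0 + eps, beta) - mse(X, y, beta0 - eps, beta)) / (2 * eps)
analytic = dJ_dbeta0(X, y, beta0, beta)
print(numeric, analytic)
```

The two numbers should agree up to floating point error, since $J$ is quadratic in $\beta_0$.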



### 2.2 ) Coefficient derivative:

We'll now differentiate with respect to each $\beta_k$. This might seem trickier, since $\sum_{j=1}^K \beta_j x_{j_i}$ cannot be considered independent of $\beta_k$; however, you'll see that it simplifies considerably, because only one term of the sum matters for each $\beta_k$. Assume from here on that $k \in \{1, ..., K\}$, where $K$ is the number of features of the model. Let's start from the basic expression:

$$\frac{d J}{d \beta_k} = \frac{d}{d \beta_k} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{j_i})^2) $$

We'll compute the square still without unrolling the linear model sum:

$$\frac{d J}{d \beta_k} = \frac{d}{d \beta_k} (\frac{1}{N}\sum_{i=1}^N (y_i^2 - 2 y_i \beta_0 - 2 y_i \sum_{j=1}^K \beta_j x_{j_i} + 2 \beta_0\sum_{j=1}^K \beta_j x_{j_i}+ \beta_0^2  + (\sum_{j=1}^K \beta_j x_{j_i})^2 )) $$

Now let's drop the terms that do not depend on $\beta_k$:

$$\frac{d J}{d \beta_k} = \frac{d}{d \beta_k} (\frac{1}{N}\sum_{i=1}^N (0 - 0 - 2 y_i \sum_{j=1}^K \beta_j x_{j_i} + 2 \beta_0\sum_{j=1}^K \beta_j x_{j_i}+ 0  + (\sum_{j=1}^K \beta_j x_{j_i})^2 )) $$

And now we'll propagate the derivative inside the sum:

$$\frac{d J}{d \beta_k} = (\frac{1}{N}\sum_{i=1}^N (0 - 0 - 2 y_i \frac{d}{d \beta_k}\sum_{j=1}^K \beta_j x_{j_i} + 2 \beta_0\frac{d}{d \beta_k}\sum_{j=1}^K \beta_j x_{j_i}+ 0  + \frac{d}{d \beta_k}(\sum_{j=1}^K \beta_j x_{j_i})^2 )) $$
$$\frac{d J}{d \beta_k} = (\frac{1}{N}\sum_{i=1}^N (-2 y_i \frac{d}{d \beta_k}\sum_{j=1}^K \beta_j x_{j_i} + 2 \beta_0\frac{d}{d \beta_k}\sum_{j=1}^K \beta_j x_{j_i}+ \frac{d}{d \beta_k}(\sum_{j=1}^K \beta_j x_{j_i})^2 )) $$

#### Derivative of the sum

The first two terms can be computed quite easily, since we know that:

$$\sum_{j=1}^K \beta_j x_{j_i} =  \beta_1 x_{1_i} + \beta_2 x_{2_i} + ... + \beta_K x_{K_i} $$

So the derivative with respect to $\beta_k$ will concern only one of the terms in the sum, namely $\beta_k x_{k_i}$:

$$\frac{d}{d \beta_k}\sum_{j=1}^K \beta_j x_{j_i} = x_{k_i}$$


#### Derivative of the square of sum

To solve the derivative of the square of the sum, since we know we are only interested in terms depending on $\beta_k$, we can rewrite it as follows:

$$(\sum_{j=1}^K \beta_j x_{j_i})^2 = (\beta_k x_{k_i} + \sum_{j=1, j \neq k}^K \beta_j x_{j_i})^2$$

This can be developed to:

$$\beta_k^2 x_{k_i}^2 + 2 \beta_k x_{k_i} \sum_{j=1, j \neq k}^K \beta_j x_{j_i} + (\sum_{j=1, j \neq k}^K \beta_j x_{j_i})^2$$

And so the derivative with respect to $\beta_k$ becomes:

$$ \frac{d}{d \beta_k} (\sum_{j=1}^K \beta_j x_{j_i})^2 = 2 \beta_k x_{k_i}^2 + 2 x_{k_i} \sum_{j=1, j \neq k}^K \beta_j x_{j_i} $$

Which simplifies to:

$$ \frac{d}{d \beta_k} (\sum_{j=1}^K \beta_j x_{j_i})^2 = x_{k_i} (2 \beta_k x_{k_i} + 2 \sum_{j=1, j \neq k}^K \beta_j x_{j_i}) = x_{k_i} (2 \sum_{j=1}^K \beta_j x_{j_i})$$
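The last identity can be checked numerically for a single observation $i$. In the sketch below, the vector `x_i`, the values in `beta`, and the chosen index `k` are arbitrary assumptions for illustration; note that `k` is zero-based in the code.

```python
import numpy as np

# Hypothetical single observation with K = 3 features, and arbitrary coefficients.
x_i = np.array([2.0, -1.0, 0.5])
beta = np.array([1.0, 3.0, -2.0])
k = 1  # zero-based index of the coefficient we differentiate with respect to

def squared_sum(b):
    """(sum_j b_j * x_{j_i})^2 for the single observation x_i."""
    return (b @ x_i) ** 2

# Analytic result from the derivation: x_{k_i} * 2 * sum_j beta_j x_{j_i}.
analytic = 2.0 * x_i[k] * (beta @ x_i)

# Central finite difference, perturbing only beta_k.
eps = 1e-6
step = np.zeros_like(beta)
step[k] = eps
numeric = (squared_sum(beta + step) - squared_sum(beta - step)) / (2 * eps)
print(analytic, numeric)
```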


#### Putting everything together

And finally, our expression becomes: 

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N (-2 y_i x_{k_i} + 2 \beta_0 x_{k_i}+ 2 x_{k_i} \sum_{j=1}^K \beta_j x_{j_i}) $$

Which we simplify to:

$$ \frac{d J}{d \beta_k} = -\frac{1}{N}\sum_{i=1}^N [2( y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{j_i})x_{k_i}] \\
 \frac{d J}{d \beta_k} = -\frac{1}{N}\sum_{i=1}^N [2( y_i - \hat{y_i})x_{k_i}] $$
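As a sanity check, the coefficient derivative can be computed for all $k$ at once and compared against finite differences. This is a sketch under assumed toy data; `mse` and `dJ_dbeta` are hypothetical helper names introduced here.

```python
import numpy as np

def mse(X, y, beta0, beta):
    """J = (1/N) * sum_i (y_i - beta0 - sum_j beta_j x_{j_i})^2."""
    return np.mean((y - beta0 - X @ beta) ** 2)

def dJ_dbeta(X, y, beta0, beta):
    """Vector whose k-th entry is -(1/N) * sum_i 2 (y_i - y_hat_i) x_{k_i}."""
    residual = y - (beta0 + X @ beta)
    return -2.0 * (X.T @ residual) / len(y)

# Hypothetical random data and arbitrary parameter values.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
beta0, beta = -0.2, np.array([0.7, 1.5, -1.0])

analytic = dJ_dbeta(X, y, beta0, beta)

# Central finite difference, one coefficient at a time.
eps = 1e-6
numeric = np.zeros_like(beta)
for k in range(len(beta)):
    step = np.zeros_like(beta)
    step[k] = eps
    numeric[k] = (mse(X, y, beta0, beta + step) - mse(X, y, beta0, beta - step)) / (2 * eps)
print(analytic, numeric)
```

The product `X.T @ residual` evaluates $\sum_i (y_i - \hat{y_i}) x_{k_i}$ for every feature $k$ in one step, which is why no explicit loop over observations is needed.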


<br>
<br>

## 3) Closed form solution for multiple linear regression

The closed form solution for multiple linear regression does not require you to work through the element-wise derivatives above, although it relies on the same gradient. This is because it is derived in matrix form, which provides some handy rules that simplify the process.

First we define our model in matrix notation, where we replace the vector notation $\vec{\beta}$ with $\boldsymbol{\beta}$:

$$\hat{y} = X\boldsymbol{\beta} $$

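where $X$ is the matrix of inputs with an additional first column of ones, so that the first entry of $\boldsymbol{\beta}$ plays the role of the intercept $\beta_0$. Writing the original matrix of inputs as $X'$:

$$ X = [\vec{1} | X'] $$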
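In NumPy, the column of ones can be prepended with `np.hstack`. The names `X_inputs` (the raw inputs) and `X_aug` (the matrix with the extra column) and the toy values are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical raw inputs: N = 3 observations, K = 2 features.
X_inputs = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0]])

# Prepend a column of ones so the intercept becomes part of the coefficient vector.
ones = np.ones((X_inputs.shape[0], 1))
X_aug = np.hstack([ones, X_inputs])

# Coefficient vector [beta_0, beta_1, beta_2], arbitrary values for the example.
beta = np.array([1.0, 0.5, 1.0])

# The whole model is now a single matrix-vector product.
y_hat = X_aug @ beta
print(X_aug)
print(y_hat)
```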

Using the same error function, we can write its gradient with respect to $\boldsymbol{\beta}$ as follows. We drop the constant factor $\frac{1}{N}$, since it does not change where the gradient vanishes:

$$ \nabla_{\boldsymbol{\beta}} J  = \nabla_{\boldsymbol{\beta}} (y - X\boldsymbol{\beta})^2  $$ 

First, we'll develop the square, keeping in mind that for a column vector $A$ the square is understood as $A^2 = A^TA$, and that $(AB)^T = B^TA^T$:

$$ \nabla_{\boldsymbol{\beta}} J  = \nabla_{\boldsymbol{\beta}} (y - X\boldsymbol{\beta})^T(y - X\boldsymbol{\beta})  $$ 

$$ \nabla_{\boldsymbol{\beta}} J  = \nabla_{\boldsymbol{\beta}} (y^T - \boldsymbol{\beta}^TX^T)(y - X\boldsymbol{\beta})  $$ 

$$ \nabla_{\boldsymbol{\beta}} J  = \nabla_{\boldsymbol{\beta}} (y^Ty - y^T X \boldsymbol{\beta} + \boldsymbol{\beta} ^T  X^T X \boldsymbol{\beta}   - \boldsymbol{\beta}^T X^T y ) $$ 

We can drop the first term, since it does not depend on $\boldsymbol{\beta}$, and take the following rules into consideration:

* $ \nabla_{\boldsymbol{\theta}} (a^T \boldsymbol{\theta}) = a$
* $ \nabla_{\boldsymbol{\theta}} (\boldsymbol{\theta}^T a) = a$
* $ \nabla_{\boldsymbol{\theta}} (\boldsymbol{\theta}^T A \boldsymbol{\theta}) = 2A\boldsymbol{\theta}$ for a symmetric matrix $A$, which is the case for $A = X^TX$

And so we get:

$$ \nabla_{\boldsymbol{\beta}} J  = (-(y^T X)^T  + 2  X^T X \boldsymbol{\beta} - X^T y ) $$ 

$$ \nabla_{\boldsymbol{\beta}} J  = (-2 X^T y  + 2  X^T X \boldsymbol{\beta} ) $$ 
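The gradient expression above can be checked against finite differences of the unnormalised squared error $(y - X\boldsymbol{\beta})^T(y - X\boldsymbol{\beta})$. The sketch below uses assumed random data, and `sse` and `grad_sse` are hypothetical helper names.

```python
import numpy as np

def sse(X, y, beta):
    """(y - X beta)^T (y - X beta), without the 1/N factor."""
    r = y - X @ beta
    return r @ r

def grad_sse(X, y, beta):
    """Matrix-form gradient: -2 X^T y + 2 X^T X beta."""
    return -2.0 * X.T @ y + 2.0 * X.T @ X @ beta

# Hypothetical random data; X already includes the column of ones.
rng = np.random.default_rng(2)
N, K = 30, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, K))])
y = rng.normal(size=N)
beta = rng.normal(size=K + 1)

analytic = grad_sse(X, y, beta)

# Central finite difference, one entry of beta at a time.
eps = 1e-6
numeric = np.zeros_like(beta)
for k in range(len(beta)):
    step = np.zeros_like(beta)
    step[k] = eps
    numeric[k] = (sse(X, y, beta + step) - sse(X, y, beta - step)) / (2 * eps)
print(analytic, numeric)
```

Dividing this gradient by $N$ gives, entry by entry, the same values as the element-wise derivatives from section 2, with the first entry corresponding to $\beta_0$.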


Finally, setting the gradient to zero, we get a very clean solution:

$$ 0 = (-2 X^T y  + 2  X^T X \boldsymbol{\beta} ) $$ 

$$ X^T X \boldsymbol{\beta} = X^T y $$ 

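$$ \boldsymbol{\beta} = (X^T X)^{-1} X^T y $$ 

The last step requires $X^T X$ to be invertible, which holds when the columns of $X$ are linearly independent.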
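A minimal NumPy sketch of the closed form follows. The synthetic data, the noise level, and the variable names are assumptions for illustration. Rather than forming the inverse explicitly, the sketch solves the linear system $X^T X \boldsymbol{\beta} = X^T y$ with `np.linalg.solve`, which is a common numerical choice, and compares the result with NumPy's least-squares solver `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical synthetic data generated from known coefficients plus small noise.
rng = np.random.default_rng(3)
N, K = 200, 3
X_inputs = rng.normal(size=(N, K))
true_beta = np.array([2.0, 1.0, -0.5, 3.0])  # [beta_0, beta_1, beta_2, beta_3]

# Matrix of inputs with an additional first column of ones.
X = np.hstack([np.ones((N, 1)), X_inputs])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Closed form: solve X^T X beta = X^T y.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_closed)
print(beta_lstsq)
```

Both estimates should agree closely with each other and lie near the coefficients used to generate the data.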