## 1. Derivatives and closed form solution for the simple linear model

###  1.1 Cost function

Recall our cost function definition from the learning notebook

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 $$ 

Expanding with the linear model $\hat{y}_i = \beta_0 + \beta_1 x_i$ we get

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2 $$
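As a minimal sketch, the cost function can be written in NumPy as below. The function name `cost` and the toy data are assumptions made for illustration only; the mean takes care of the $\frac{1}{N}$ factor from the definition.

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Mean squared error J of the simple linear model y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x
    return np.mean((y - y_hat) ** 2)

# Hypothetical toy data, generated exactly from y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

print(cost(1.0, 2.0, x, y))  # perfect fit, so the cost is 0.0
print(cost(0.0, 2.0, x, y))  # every residual is 1, so the cost is 1.0
```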

### 1.2 Derivatives of the cost function for simple linear regression

#### 1.2.1 Derivative with respect to the intercept

We'll now differentiate with respect to $\beta_0$, starting from the initial formulation

$$\frac{d J}{d \beta_0} = \frac{d}{d \beta_0} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2) $$

As the only term depending on $\beta_0$ is inside the sum and the derivative of a sum is a sum of the derivatives, we can rewrite as

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N \frac{d}{d \beta_0} ((y_i - \beta_0 - \beta_1 x_i)^2) $$

The derivative of the squared term is 2 times that term multiplied by the derivative of that term

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \beta_1 x_i) \frac{d}{d \beta_0} (y_i - \beta_0 - \beta_1 x_i) $$

Proceeding with the derivative of $(y_i - \beta_0 - \beta_1 x_i)$, only the second term depends on $\beta_0$ and its derivative is -1

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \beta_1 x_i) (-1) $$

Simplifying the expression, we get

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (\beta_0 + \beta_1 x_i - y_i) $$

Finally, because $\beta_0 + \beta_1 x_i$ is just $\hat{y}_i$, we get

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i - y_i) $$

#### 1.2.2 Derivative with respect to the coefficient

Using the same principles as above, we'll now differentiate with respect to $\beta_1$, starting from the initial formulation

$$\frac{d J}{d \beta_1} = \frac{d}{d \beta_1} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2) $$

As the only term depending on $\beta_1$ is inside the sum and the derivative of a sum is a sum of the derivatives, we can rewrite as

$$\frac{d J}{d \beta_1} = \frac{1}{N}\sum_{i=1}^N \frac{d}{d \beta_1} ((y_i - \beta_0 - \beta_1 x_i)^2) $$

The derivative of the squared term is 2 times that term multiplied by the derivative of that term

$$\frac{d J}{d \beta_1} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \beta_1 x_i) \frac{d}{d \beta_1} (y_i - \beta_0 - \beta_1 x_i) $$

Proceeding with the derivative of $(y_i - \beta_0 - \beta_1 x_i)$, only the third term depends on $\beta_1$ and its derivative is $-x_i$

$$\frac{d J}{d \beta_1} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \beta_1 x_i) (-x_i) $$

Finally, because $\beta_0 + \beta_1 x_i$ is just $\hat{y}_i$, we get

$$\frac{d J}{d \beta_1} = \frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i - y_i) x_i $$
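One way to gain confidence in the two derivatives is to compare them with central finite differences. The sketch below does that on randomly generated data; the function names, the data and the evaluation point are all assumptions chosen for illustration.

```python
import numpy as np

def cost(beta0, beta1, x, y):
    # J = (1/N) * sum((y_i - beta0 - beta1 * x_i)^2)
    return np.mean((y - beta0 - beta1 * x) ** 2)

def gradient(beta0, beta1, x, y):
    residual = (beta0 + beta1 * x) - y       # y_hat - y
    d0 = np.mean(2 * residual)               # dJ/dbeta0
    d1 = np.mean(2 * residual * x)           # dJ/dbeta1
    return d0, d1

# Hypothetical data: a noisy line.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 + 0.7 * x + rng.normal(scale=0.1, size=50)

# Arbitrary point at which the derivatives are evaluated.
beta0, beta1, eps = 0.3, -0.2, 1e-6

# Central finite differences approximate each partial derivative.
num_d0 = (cost(beta0 + eps, beta1, x, y) - cost(beta0 - eps, beta1, x, y)) / (2 * eps)
num_d1 = (cost(beta0, beta1 + eps, x, y) - cost(beta0, beta1 - eps, x, y)) / (2 * eps)

d0, d1 = gradient(beta0, beta1, x, y)
print(np.isclose(d0, num_d0), np.isclose(d1, num_d1))
```

Because $J$ is quadratic in both parameters, the central difference agrees with the analytic derivative up to floating point rounding.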

### 1.3 Closed form solution for simple linear regression

To get to the closed form solution, we need to find the minimum of the cost function. This is achieved by setting the derivatives to zero.

#### 1.3.1 Finding the intercept

To find the intercept, we'll set its derivative equal to zero

$$\frac{d J}{d \beta_0} = 0$$

$$\frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i - y_i) = 0$$

We can start by dropping the constant factors multiplying the whole equation ($2$ and $\frac{1}{N}$), since the right-hand side is zero. We also substitute $\hat{y}_i = \beta_0 + \beta_1 x_i$

$$\sum_{i=1}^N (\beta_0 + \beta_1 x_i - y_i) = 0$$

We will then split the sum so we can isolate the $\beta_0$ term

$$\sum_{i=1}^N \beta_0  +  \sum_{i=1}^N \beta_1 x_i - \sum_{i=1}^N y_i = 0$$

Rearranging the terms, we get

$$\sum_{i=1}^N \beta_0 = \sum_{i=1}^N y_i - \sum_{i=1}^N \beta_1 x_i $$

We can move the betas outside of the sums because they are the same in every term

$$\beta_0 \sum_{i=1}^N 1 = \sum_{i=1}^N y_i - \beta_1 \sum_{i=1}^N x_i $$

Now we can evaluate the sums. The sum of $N$ ones is $N$, and the sums over $x_i$ and $y_i$ are just their averages times $N$

$$N \beta_0 = N \bar{y} - N \beta_1 \bar{x} $$

Finally, we divide by N

$$\beta_0 = \bar{y} - \beta_1 \bar{x} $$

Notice that the resulting expression depends on the value of the coefficient, so let's proceed to compute the solution for that.
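Before moving on, here is a small sketch of what the intercept formula guarantees: for any value of the coefficient, choosing $\beta_0 = \bar{y} - \beta_1 \bar{x}$ makes the derivative with respect to the intercept vanish, which means the residuals average to zero. The data and the tried values of `beta1` are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: a noisy line.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 - 1.0 * x + rng.normal(scale=0.5, size=30)

derivatives = []
for beta1 in (-3.0, 0.0, 1.0):
    # Intercept from the closed form, for this particular beta1.
    beta0 = y.mean() - beta1 * x.mean()
    residual = (beta0 + beta1 * x) - y       # y_hat - y
    d_beta0 = np.mean(2 * residual)          # dJ/dbeta0 at (beta0, beta1)
    derivatives.append(d_beta0)
    print(beta1, np.isclose(d_beta0, 0.0))
```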

#### 1.3.2 Finding the coefficient

To find the coefficient, we start in the same way, by setting the derivative to zero and substituting for $\hat{y}$

$$\frac{d J}{d \beta_1} = 0$$

$$\frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i - y_i) x_i = 0$$

$$\frac{1}{N}\sum_{i=1}^N 2 (\beta_0 + \beta_1 x_i - y_i) x_i = 0$$

We can then replace $\beta_0$ by the result we got before, rearrange the terms and remove any multiplication factors

$$\frac{1}{N} \sum_{i=1}^N 2 (\bar{y} - \beta_1 \bar{x} + \beta_1 x_i - y_i) x_i = 0$$

$$\sum_{i=1}^N [\bar{y} - y_i - \beta_1 (\bar{x} - x_i )] x_i = 0$$

Finally, we isolate $\beta_1$

$$\sum_{i=1}^N [(\bar{y} - y_i) x_i - \beta_1 (\bar{x} - x_i ) x_i] = 0$$

$$\beta_1 = \frac {\sum_{i=1}^N (\bar{y} - y_i) x_i}{\sum_{i=1}^N (\bar{x} - x_i ) x_i}$$

Now we have two options for arranging the expression into an easy-to-use form. The first option is to split the sums, then divide the numerator and denominator by $N$, remembering that the sum of $x_i$ is just its average times $N$

$$\beta_1 = \frac {\bar{y} \sum_{i=1}^N x_i - \sum_{i=1}^N y_i x_i}{\bar{x} \sum_{i=1}^N x_i - \sum_{i=1}^N {x_i}^2}$$

$$\beta_1 = \frac {\bar{y} \bar{x} - \frac{1}{N} \sum_{i=1}^N y_i x_i}{{\bar{x}}^2 - \frac{1}{N} \sum_{i=1}^N {x_i}^2}$$

The second option is to rearrange into the formula we've seen in the learning notebook. We will use the fact that the following two expressions are equal to 0, so we can add them to or subtract them from the sums without changing the result

$$\sum_{i=1}^N (\bar{x}^2 - x_i\bar{x}) = 0$$

$$\sum_{i=1}^N (\bar{x}\bar{y} - y_i\bar{x}) = 0$$
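As a short check of why both sums vanish, pull the constants out of the sums and use $\sum_{i=1}^N x_i = N\bar{x}$ and $\sum_{i=1}^N y_i = N\bar{y}$

$$\sum_{i=1}^N (\bar{x}^2 - x_i\bar{x}) = N\bar{x}^2 - \bar{x}\sum_{i=1}^N x_i = N\bar{x}^2 - N\bar{x}^2 = 0$$

$$\sum_{i=1}^N (\bar{x}\bar{y} - y_i\bar{x}) = N\bar{x}\bar{y} - \bar{x}\sum_{i=1}^N y_i = N\bar{x}\bar{y} - N\bar{x}\bar{y} = 0$$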

Continuing from above, we subtract the zero expressions from the numerator and the denominator (subtracting rather than adding is what lets the terms factor nicely)

$$\beta_1 = \frac {\sum_{i=1}^N (\bar{y} - y_i) x_i}{\sum_{i=1}^N (\bar{x} - x_i ) x_i} = \frac {\sum_{i=1}^N (\bar{y} x_i - y_i x_i)}{\sum_{i=1}^N (\bar{x} x_i - {x_i}^2 )}$$

$$\beta_1 = \frac {\sum_{i=1}^N (\bar{y} x_i - y_i x_i) - \sum_{i=1}^N (\bar{x}\bar{y} - y_i\bar{x})}{\sum_{i=1}^N (\bar{x} x_i - {x_i}^2 ) - \sum_{i=1}^N (\bar{x}^2 - x_i\bar{x})}$$

$$\beta_1 = \frac {\sum_{i=1}^N (\bar{y} x_i - y_i x_i - \bar{x}\bar{y} + y_i\bar{x})}{\sum_{i=1}^N (\bar{x} x_i - {x_i}^2 - \bar{x}^2 + x_i\bar{x})}$$

Finally, the numerator is $-N$ times the covariance and the denominator is $-N$ times the variance. The common factor cancels, minus signs included, and we arrive at the expression we already know from the learning notebook

$$\beta_1 = \frac{-\sum_{i=1}^{N}{(x_i - \bar{x})(y_i - \bar{y})}}{-\sum_{i=1}^{N}{(x_i - \bar{x})^2}} = \frac{\sum_{i=1}^{N}{(x_i - \bar{x})(y_i - \bar{y})}}{\sum_{i=1}^{N}{(x_i - \bar{x})^2}} = \frac{cov(x, y)}{var(x)}$$
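The two arrangements of $\beta_1$ can be compared numerically. The sketch below computes both on randomly generated data and cross-checks them against `np.polyfit` with degree 1, which performs a least-squares line fit. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: a noisy line with intercept 4 and slope 2.5.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=100)

x_bar, y_bar = x.mean(), y.mean()

# First option: the form built from averages.
beta1_avg = (y_bar * x_bar - np.mean(y * x)) / (x_bar ** 2 - np.mean(x ** 2))

# Second option: covariance over variance (the 1/N factors cancel).
beta1_cov = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept from the closed form of section 1.3.1.
beta0 = y_bar - beta1_cov * x_bar

# np.polyfit returns coefficients from the highest degree down: slope, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(np.allclose([beta1_avg, beta1_cov, beta0], [slope, slope, intercept]))
```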

## 2. Derivatives and closed form for the multiple linear model

###  2.1 Cost function

Recall the cost function definition from the learning notebook, which is the same as in the simple model

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 $$ 

Expanding with a multiple linear model with K features, we get

$$J(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji})^2 $$

### 2.2 Derivatives of the cost function for multiple linear regression

The derivatives are not required for the closed form solution. However, they are quite useful for other methods, such as gradient descent, which you will meet at the end of the learning notebook.

#### 2.2.1 Derivative with respect to the intercept

The derivative with respect to the intercept has the same form as in the simple model. Let's develop it from the equation above

$$\frac{d J}{d \beta_0} = \frac{d}{d \beta_0} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji})^2) $$

As the only term depending on $\beta_0$ is inside the sum and the derivative of a sum is a sum of the derivatives, we can rewrite as

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N \frac{d}{d \beta_0} ((y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji})^2) $$

The derivative of the squared term is 2 times that term multiplied by the derivative of that term

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) \frac{d}{d \beta_0} (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) $$

Proceeding with the derivative, only the second term depends on $\beta_0$ and its derivative is -1

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) (-1) $$

Finally, simplifying we get

$$\frac{d J}{d \beta_0} = \frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i -y_i) $$

#### 2.2.2 Derivative with respect to the coefficient

We'll now differentiate with respect to each $\beta_k$. This might seem tricky, but in the end just one term of the sum $\sum_{j=1}^K \beta_j x_{ji}$ depends on $\beta_k$. Assume from here on that $k \in \{1, ..., K\}$, where $K$ is the number of features of the model. Let's start from the basic expression and proceed exactly as for the simple linear model

$$\frac{d J}{d \beta_k} = \frac{d}{d \beta_k} (\frac{1}{N}\sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji})^2) $$

As the only term depending on $\beta_k$ is inside the sum and the derivative of a sum is a sum of the derivatives, we can rewrite as

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N \frac{d}{d \beta_k} ((y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji})^2) $$

The derivative of the squared term is 2 times that term multiplied by the derivative of that term

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) \frac{d}{d \beta_k} (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) $$

Proceeding with the derivative, only the third term, the sum, depends on $\beta_k$

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) \frac{d}{d \beta_k} (-\sum_{j=1}^K \beta_j x_{ji}) $$

Inside the sum, only the kth term contains $\beta_k$ and its derivative is $-x_{ki}$

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N 2 (y_i - \beta_0 - \sum_{j=1}^K \beta_j x_{ji}) (-x_{ki}) $$

Simplifying, and moving the minus sign into the residual as in the simple model, we get

$$\frac{d J}{d \beta_k} = \frac{1}{N}\sum_{i=1}^N 2 (\hat{y}_i - y_i) x_{ki} $$
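The derivatives for all $\beta_k$ can be computed at once and checked against central finite differences. In the sketch below, the feature matrix `X` stores one sample per row and one feature per column, so the text's $x_{ki}$ is `X[i, k]`. The function names, the data and the evaluation point are assumptions for illustration.

```python
import numpy as np

def predict(beta0, beta, X):
    # X has shape (N, K): one row per sample, one column per feature.
    return beta0 + X @ beta

def cost(beta0, beta, X, y):
    return np.mean((y - predict(beta0, beta, X)) ** 2)

def gradient(beta0, beta, X, y):
    residual = predict(beta0, beta, X) - y    # y_hat - y, shape (N,)
    d_beta0 = np.mean(2 * residual)           # dJ/dbeta0
    d_beta = 2 * X.T @ residual / len(y)      # entry k is dJ/dbeta_k
    return d_beta0, d_beta

# Hypothetical data and an arbitrary evaluation point.
rng = np.random.default_rng(0)
N, K = 40, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
beta0, beta = 0.5, np.array([1.0, -2.0, 0.3])

d_beta0, d_beta = gradient(beta0, beta, X, y)

# Central finite differences, one coefficient at a time.
eps = 1e-6
numeric = np.zeros(K)
for k in range(K):
    step = np.zeros(K)
    step[k] = eps
    numeric[k] = (cost(beta0, beta + step, X, y) - cost(beta0, beta - step, X, y)) / (2 * eps)

numeric_beta0 = (cost(beta0 + eps, beta, X, y) - cost(beta0 - eps, beta, X, y)) / (2 * eps)

print(np.allclose(d_beta, numeric), np.isclose(d_beta0, numeric_beta0))
```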

### 2.3 Closed form solution for multiple linear regression

The closed form solution for multiple linear regression makes use of the matrix form of the expressions, which provides some handy rules that simplify the process.

First we define our model in matrix notation

$$\hat{y} = X\vec{\beta} $$

where $X$ is the feature matrix extended with a column of ones (which multiplies the intercept $\beta_0$)

$$ X = [\vec{1} | X'] $$

The cost function in matrix notation, dropping the constant factor $\frac{1}{N}$ (which does not change the location of the minimum), is

$$ J = (\vec{y} - X\vec{\beta})^T (\vec{y} - X\vec{\beta}) $$

where $\vec{y}$ is the vector of true sample values.

The gradient of the cost function in matrix notation is

$$ \nabla_{\vec{\beta}} J  = \nabla_{\vec{\beta}} [(\vec{y} - X\vec{\beta})^T (\vec{y} - X\vec{\beta})]  $$ 

The gradient collects the partial derivatives of $J$ with respect to every $\beta_k$ (including $\beta_0$). The expression in the square brackets is a product of two factors, so its gradient follows the product rule $\nabla(AB) = \nabla(A) B + A \nabla(B)$. We also need the transpose rule $(AB)^T = B^T A^T$, which gives $(X\vec{\beta})^T = \vec{\beta}^T X^T$

$$ \nabla_{\vec{\beta}} J  = \nabla_{\vec{\beta}} (\vec{\beta}^T) (-X)^T (\vec{y} - X\vec{\beta}) + (\vec{y} - X\vec{\beta})^T (-X) \nabla_{\vec{\beta}} (\vec{\beta})$$ 

Now $\nabla_{\vec{\beta}} \vec{\beta}$ is just the identity matrix, so the expression becomes

$$ \nabla_{\vec{\beta}} J  = (-X)^T (\vec{y} - X\vec{\beta}) + (\vec{y} - X\vec{\beta})^T (-X)$$ 

The second term is the transpose of the first. Using the transpose rule again to write it as a column vector and simplifying

$$ \nabla_{\vec{\beta}} J  = -X^T (\vec{y} - X\vec{\beta}) - X^T(\vec{y} - X\vec{\beta})$$ 

$$ \nabla_{\vec{\beta}} J  = -2 X^T (\vec{y} - X\vec{\beta})$$ 

Finally, setting the gradient to zero and assuming that $X^T X$ is invertible, we get a very clean closed form solution

$$  -2 X^T (\vec{y} - X\vec{\beta}) = 0 $$ 

$$  2 X^T X\vec{\beta} = 2 X^T \vec{y}$$ 

$$  \vec{\beta} = (X^T X)^{-1} X^T \vec{y} $$ 
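A minimal NumPy sketch of the closed form solution follows. It solves the linear system $X^T X \vec{\beta} = X^T \vec{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, then checks that the gradient vanishes at the solution and that the result matches `np.linalg.lstsq`. The data, the coefficient values and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data generated from known coefficients (intercept first).
rng = np.random.default_rng(7)
N, K = 200, 3
X_features = rng.normal(size=(N, K))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = true_beta[0] + X_features @ true_beta[1:] + rng.normal(scale=0.1, size=N)

# Extend the feature matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(N), X_features])

# Closed form: solve X^T X beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient -2 X^T (y - X beta) should vanish at the solution.
grad = -2 * X.T @ (y - X @ beta)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(grad, 0.0, atol=1e-8))
print(np.allclose(beta, beta_lstsq))
```

Solving the linear system is generally more accurate numerically than computing $(X^T X)^{-1}$ and multiplying. When $X^T X$ is singular or badly conditioned, for example with duplicated features, a least-squares solver such as `np.linalg.lstsq` is the safer choice.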