## Ordinary Least Squares with Simple Linear Regression


The simple linear regression model is:

 $$\mathbf{y} = \beta_0 +\beta_1\mathbf{x}$$


where we need to estimate two parameters: the intercept ($\beta_0$) and the slope ($\beta_1$).


Recall the Advertising dataset and the simple linear regression fitted to the scatter plot of _sales_ vs. _TV_. With the help of `Scikit-Learn`, we were able to fit the best regression line among all the possibilities. Here is a snapshot:

<figure align="center">
       <img src="fig1.png" height="350" width="600">
       <figcaption>Figure 1: Simple Linear Regression </figcaption>
   </figure>



The line in the figure is the fitted simple linear regression line, with output $\mathbf{y}$ as `sales` and input $\mathbf{x}$ as `TV`. The residual or error, $\epsilon_i$, is the difference between the observed value, $y_i$, and the predicted value, $\hat{y_i}$. The observed values are the actual output data points, shown as blue dots in the figure, and each predicted value is the corresponding point on the regression line. The error for each data point is the vertical distance from the observed point to the predicted point on the regression line.

The predicted output value is:

$$\hat{y_i} = \beta_0 + \beta_1x_i$$

The observed (actual) output value is:

$$y_i = \beta_0 + \beta_1x_i + \epsilon_i$$

Here, $\epsilon_i$ is a random error, not a parameter. The error $\epsilon_i = y_{i}-\hat{y_{i}}$ can be positive, negative, or zero. As the figure shows, the vertical lines fall on both sides of the regression line. To keep positive and negative errors from cancelling when we add them up, we square each error and sum the squares. The result is called the _Residual Sum of Squares (RSS)_ or _Sum of Squared Errors (SSE)_.

$$\text{Sum of Squared Errors (SSE)} = \sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2$$

The summation is indexed from $1$ to $n$ since we have $n$ samples. The Sum of Squared Errors (SSE) is a function of $\beta_0$ and $\beta_1$, and we can treat it as a _loss function_. The main principle of Least Squares is to choose the intercept ($\beta_0$) and slope ($\beta_1$) that make this sum as small as possible.
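To make the idea of SSE as a function of $\beta_0$ and $\beta_1$ concrete, here is a minimal sketch that evaluates it for two candidate lines. The toy data and the helper name `sse` are assumptions made for illustration; they are not the Advertising dataset.

```python
import numpy as np

def sse(beta0, beta1, x, y):
    """Sum of squared errors for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x      # predicted values
    residuals = y - y_hat          # observed minus predicted
    return float(np.sum(residuals ** 2))

# Toy data (illustrative values only, not the Advertising dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Two candidate (intercept, slope) pairs
sse_a = sse(2.2, 0.6, x, y)
sse_b = sse(0.0, 1.0, x, y)

print(f"SSE for line a: {sse_a:.4f}")
print(f"SSE for line b: {sse_b:.4f}")
```

Different choices of intercept and slope give different SSE values, and least squares picks the pair with the smallest one.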




Thus, to estimate the parameters, we minimize the sum of squared errors. The SSE can also be written as:

$$\text{SSE} = \sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2 =\sum_{i=1}^{n}(y_{i}-(\beta_0+\beta_1x_i))^2 $$



Here, $\hat{y_i}$ is replaced with the simple linear regression model equation. Since we minimize $\text{SSE}$, it is also called the objective function. Because $\text{SSE}$ is a sum of squared terms, it is never negative. If we plot the objective function against $\beta_0$ and $\beta_1$, we get a convex, bowl-shaped surface that opens upwards.


<figure align="center">
       <img src="./fig2.png" height="400" width="500">
       <figcaption>Figure 2: Convex cost function </figcaption>
   </figure>
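As a rough numerical illustration of this bowl shape, the sketch below evaluates SSE on a grid of $(\beta_0, \beta_1)$ values and locates the lowest grid point. The toy data and grid ranges are assumptions chosen for illustration. Non-negative second differences along each axis are consistent with convexity, although on their own they do not prove it.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Grid of candidate parameters (ranges are arbitrary choices)
beta0_grid = np.linspace(0.0, 4.0, 41)    # step 0.1
beta1_grid = np.linspace(-1.0, 2.0, 31)   # step 0.1
B0, B1 = np.meshgrid(beta0_grid, beta1_grid, indexing="ij")

# Residuals for every grid point and sample: shape (41, 31, 5)
residuals = y - (B0[..., None] + B1[..., None] * x)
sse_surface = np.sum(residuals ** 2, axis=-1)

# Lowest point of the surface on this grid
i, j = np.unravel_index(np.argmin(sse_surface), sse_surface.shape)
best_beta0, best_beta1 = beta0_grid[i], beta1_grid[j]

# Second differences along each parameter axis
second_diff_b0 = np.diff(sse_surface, n=2, axis=0)
second_diff_b1 = np.diff(sse_surface, n=2, axis=1)

print(f"grid minimum near beta0={best_beta0:.1f}, beta1={best_beta1:.1f}")
print("curves upward along beta0:", bool(np.all(second_diff_b0 >= -1e-9)))
print("curves upward along beta1:", bool(np.all(second_diff_b1 >= -1e-9)))
```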


The parameters at the minimum are obtained from calculus by setting the first derivatives of the objective function to $0$, since the gradient (slope) is $0$ at the minimum. Because the objective function is convex, a point where the gradient is $0$ is the minimum. We have two unknown parameters, the intercept ($\beta_0$) and the slope ($\beta_1$), so we take the partial derivative of _SSE_ with respect to each of them. We then set both partial derivatives to $0$ and solve the resulting equations for $\beta_0$ and $\beta_1$.


Taking the partial derivative with respect to $\beta_0$:

$$\frac{\partial\ \text{SSE}}{\partial \beta_0}  = \frac{\partial }{\partial \beta_0}\sum(y_i-(\beta_0+\beta_1x_i))^2$$

Note that the derivative of the sum is the sum of the derivatives. So, we can take the derivative inside the summation.

$$\frac{\partial }{\partial \beta_0}\sum(y_i-(\beta_0+\beta_1x_i))^2 = \sum\frac{\partial }{\partial \beta_0}(y_i-(\beta_0+\beta_1x_i))^2 $$

Now, applying the power rule and the chain rule, we get:


$$= \sum2(y_i-(\beta_0+\beta_1x_i))(-1) $$

$$=-2\sum(y_i-(\beta_0+\beta_1x_i)) \qquad (1)$$



Now, with respect to $\beta_1$:


$$\frac{\partial\ {\text{SSE}} }{\partial \beta_1} = \frac{\partial }{\partial \beta_1}\sum(y_i-(\beta_0+\beta_1x_i))^2$$

Again, the derivative of the sum is the sum of the derivatives, so we take the derivative inside the summation.

$$\frac{\partial }{\partial \beta_1}\sum(y_i-(\beta_0+\beta_1x_i))^2 = \sum\frac{\partial }{\partial \beta_1}(y_i-(\beta_0+\beta_1x_i))^2 $$


Applying the power rule, the $2$ comes out front and the exponent becomes $1$. We also apply the chain rule to account for the coefficient of $\beta_1$ inside the bracket, which contributes the factor $-x_i$.

$$= \sum2(y_i-(\beta_0+\beta_1x_i))(-x_i) $$

Cleaning up a bit, 

$$= -2\sum x_i(y_i-(\beta_0+\beta_1x_i)) \qquad (2)$$
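The two partial derivatives can be sanity-checked numerically. The sketch below compares the analytic expressions $(1)$ and $(2)$ against central finite differences at an arbitrary point. The toy data, the test point, and the function names `sse` and `grad_sse` are assumptions made for illustration.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def sse(beta0, beta1):
    return float(np.sum((y - (beta0 + beta1 * x)) ** 2))

def grad_sse(beta0, beta1):
    """Analytic partial derivatives from equations (1) and (2)."""
    r = y - (beta0 + beta1 * x)
    d_beta0 = -2.0 * np.sum(r)        # equation (1)
    d_beta1 = -2.0 * np.sum(x * r)    # equation (2)
    return float(d_beta0), float(d_beta1)

# Arbitrary test point and step size
beta0, beta1, h = 1.0, 0.5, 1e-6

# Central finite-difference approximations
num_d0 = (sse(beta0 + h, beta1) - sse(beta0 - h, beta1)) / (2 * h)
num_d1 = (sse(beta0, beta1 + h) - sse(beta0, beta1 - h)) / (2 * h)

ana_d0, ana_d1 = grad_sse(beta0, beta1)
print(f"d/d beta0: analytic={ana_d0:.4f}, numeric={num_d0:.4f}")
print(f"d/d beta1: analytic={ana_d1:.4f}, numeric={num_d1:.4f}")
```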

Now, we set the partial derivatives in equations $(1)$ and $(2)$ equal to $0$.


$$-2\sum(y_i-(\beta_0+\beta_1x_i))  = 0$$

$$-2\sum x_i(y_i-(\beta_0+\beta_1x_i))  = 0$$

Here, we have two equations and two unknowns, which we can solve to find our parameters. But how do we do that?

First, we get an expression for $\beta_0$ from the first equation. That expression involves $\beta_1$, so we substitute it into the second equation and solve for $\beta_1$. Let's solve the first equation.


Solving for $\beta_0$ by setting equation $\text{(1)}$ to $0$,


$$-2\sum(y_i-(\beta_0+\beta_1x_i))  = 0$$

We can divide both sides by $-2$ so that we get,

$$\sum(y_i-(\beta_0+\beta_1x_i))  = 0$$

Distributing the summation over each term inside the brackets, we get:

$$\sum y_i- \sum \beta_0 - \sum \beta_1x_i = 0$$

Note that with respect to the summation, $\beta_0$ and $\beta_1$ are constants: they are unknown, but they take the same value for every sample and do not depend on the index $i$. They can therefore be moved outside the summation:

$$\sum y_i- n\beta_0 - \beta_1\sum x_i = 0$$

The sum of $\beta_0$ over $i = 1$ to $n$ becomes $n\beta_0$, and $\beta_1$ comes out of the summation.

Now, isolating the $n\beta_0$ term, we get:

$$n\beta_0 = \sum y_i- \beta_1\sum x_i$$

Dividing both sides by $n$, we get:

$$\beta_0 = \frac{\sum y_i}{n}- \frac{\beta_1\sum x_i}{n}$$

The sum of all the $y_i$ divided by $n$ is the mean $\overline{y}$, and likewise for the $x_i$. So, we end up with:


$$\beta_0 = \overline{y}- \beta_1\overline{x}$$
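A quick numerical check of this result, as a minimal sketch on made-up toy data: for any trial slope, choosing the intercept as $\overline{y} - \beta_1\overline{x}$ makes the residuals sum to zero, which is exactly the condition obtained from equation $(1)$. The helper name `intercept_for_slope` and the trial slopes are hypothetical.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def intercept_for_slope(beta1):
    """Intercept that zeroes the beta0-derivative for a given slope."""
    return y.mean() - beta1 * x.mean()

trial_slopes = [-1.0, 0.0, 0.6, 2.5]   # arbitrary trial values
residual_sums = []
for beta1 in trial_slopes:
    beta0 = intercept_for_slope(beta1)
    residual_sum = float(np.sum(y - (beta0 + beta1 * x)))
    residual_sums.append(residual_sum)
    print(f"beta1={beta1:5.2f}  beta0={beta0:6.2f}  sum of residuals={residual_sum:.2e}")
```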


However, this expression cannot be evaluated without knowing the value of $\beta_1$. So, we substitute it for $\beta_0$ in the equation where the partial derivative with respect to $\beta_1$ is set to $0$.


Hence, solving for $\beta_1$,

$$-2\sum x_i(y_i-(\beta_0+\beta_1x_i))  = 0$$


We can divide both sides by $-2$ so that we get,

$$\sum x_i(y_i-(\beta_0+\beta_1x_i))  = 0$$

Substituting $\beta_0$ with $\overline{y}- \beta_1\overline{x}$, we get:

$$\sum x_i(y_i-(\overline{y}- \beta_1\overline{x}+\beta_1x_i))  = 0$$

Now we are getting somewhere, since the only unknown in the above expression is $\beta_1$. To isolate $\beta_1$, let's first group similar terms together, i.e., $y_i$ with $\overline{y}$ and $x_i$ with $\overline{x}$:

$$\sum x_i((y_i-\overline{y})- \beta_1(x_i- \overline{x}))  = 0$$

Distributing the summation over each term, we get:

$$\sum x_i(y_i-\overline{y})- \sum\beta_1x_i(x_i- \overline{x}) = 0$$

$\beta_1$ is a constant, so we move it outside the summation:

$$\sum x_i(y_i-\overline{y})- \beta_1\sum x_i(x_i- \overline{x})= 0$$

Moving the term that includes $\beta_1$ to the other side, we get:

$$\sum x_i(y_i-\overline{y})= \beta_1\sum x_i(x_i- \overline{x})$$

Dividing both sides by $\sum x_i(x_i- \overline{x})$, which is nonzero as long as the $x_i$ are not all equal, we get:

$$\beta_1 = \frac{\sum x_i(y_i-\overline{y})}{\sum x_i(x_i- \overline{x})} \qquad (3)$$


This is one way of expressing $\beta_1$, but it is not the form usually quoted. The more common expression is:

$$\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (4)$$




For equations $(3)$ and $(4)$ to agree, we need:

 $$\sum x_i(y_i-\overline{y}) = \sum(x_i - \bar{x})(y_i - \bar{y})$$

 $$\sum x_i(x_i-\overline{x}) = \sum (x_i - \bar{x})^2$$

Let's show that both identities hold.

__For the numerator part:__

$$\sum(x_i - \overline{x}) (y_i - \overline{y})=  \sum x_i(y_i-\overline{y}) - \sum \overline{x}(y_i-\overline{y})$$

$\overline{x}$ is a constant term so we take it out:

$$\sum(x_i - \overline{x}) (y_i - \overline{y})=  \sum x_i(y_i-\overline{y}) - \overline{x}\sum (y_i-\overline{y})$$

Now, look at the factor $\sum (y_i-\overline{y})$ in the second term:

$$\sum (y_i-\overline{y}) = \sum y_i - \sum\overline{y} = \sum y_i - n\overline{y} = 0$$

Since $n\overline{y}$ is equal to $\sum y_i$, the whole second term becomes $0$. Hence:

$$\sum(x_i - \overline{x}) (y_i - \overline{y})=  \sum x_i(y_i-\overline{y})$$

__For the denominator part:__

$$\sum (x_i - \overline{x})^2 = \sum (x_i-\overline{x})(x_i-\overline{x})$$


$$\sum (x_i - \overline{x})^2 = \sum x_i(x_i-\overline{x})- \sum \overline{x}(x_i-\overline{x})$$

Again, $\overline{x}$ is a constant term, so we take it out:

$$\sum (x_i - \overline{x})^2 = \sum x_i(x_i-\overline{x})- \overline{x}\sum (x_i-\overline{x})$$

As before, look at the factor $\sum (x_i-\overline{x})$ in the second term:

$$\sum (x_i-\overline{x}) = \sum x_i - \sum\overline{x} = \sum x_i - n\overline{x} = 0 $$



Since $\overline{x}=\frac{\sum x_i}{n}$, we have $\sum x_i = n\overline{x}$, so the second term becomes $0$. Hence:

$$\sum (x_i - \overline{x})^2 = \sum x_i(x_i-\overline{x})$$

So, we have now shown that the numerators and the denominators of the two expressions for $\beta_1$ are equal.
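The same equivalence can be checked numerically. The following sketch, on made-up toy data, confirms that the deviations from the mean sum to zero and that expressions $(3)$ and $(4)$ give the same slope. The variable names are chosen for illustration only.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
x_bar, y_bar = x.mean(), y.mean()

# Deviations from the mean sum to zero
dev_sum_x = float(np.sum(x - x_bar))
dev_sum_y = float(np.sum(y - y_bar))

# Numerators of equations (3) and (4)
num_eq3 = float(np.sum(x * (y - y_bar)))
num_eq4 = float(np.sum((x - x_bar) * (y - y_bar)))

# Denominators of equations (3) and (4)
den_eq3 = float(np.sum(x * (x - x_bar)))
den_eq4 = float(np.sum((x - x_bar) ** 2))

beta1_eq3 = num_eq3 / den_eq3
beta1_eq4 = num_eq4 / den_eq4

print("sum of x deviations:", dev_sum_x, " sum of y deviations:", dev_sum_y)
print("numerators:  ", num_eq3, num_eq4)
print("denominators:", den_eq3, den_eq4)
print("slopes:      ", beta1_eq3, beta1_eq4)
```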

Since the parameters are estimates, we usually put _hats_ on them. The key equations of the estimated parameters for simple linear regression are:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$

From the samples provided, we first compute $\hat{\beta_1}$ from the first expression and then substitute its value into the second expression to obtain $\hat{\beta_0}$.
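
The two key equations translate directly into a few lines of NumPy. The function below is a minimal sketch, not a replacement for a library implementation: the name `fit_simple_ols` and the toy data are assumptions made for illustration. As a cross-check, it compares the closed-form estimates with the solution of the two normal equations written as a $2 \times 2$ linear system and with `np.polyfit`.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates (beta0_hat, beta1_hat)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx   # slope
    beta0_hat = y_bar - beta1_hat * x_bar                 # intercept
    return float(beta0_hat), float(beta1_hat)

# Toy data (illustrative values only, not the Advertising dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

beta0_hat, beta1_hat = fit_simple_ols(x, y)

# Cross-check 1: the two normal equations as a linear system
#   n*beta0      + beta1*sum(x)   = sum(y)
#   beta0*sum(x) + beta1*sum(x^2) = sum(x*y)
A = np.array([[len(x), x.sum()], [x.sum(), np.sum(x ** 2)]])
b = np.array([y.sum(), np.sum(x * y)])
beta0_ne, beta1_ne = np.linalg.solve(A, b)

# Cross-check 2: degree-1 polynomial fit (returns slope first, then intercept)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"closed form:      beta0={beta0_hat:.4f}, beta1={beta1_hat:.4f}")
print(f"normal equations: beta0={beta0_ne:.4f}, beta1={beta1_ne:.4f}")
print(f"np.polyfit:       beta0={intercept_np:.4f}, beta1={slope_np:.4f}")
```

All three approaches should agree up to floating-point error, since they minimize the same objective function.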
