## Ordinary Least Squares with Simple Linear Regression
The simple linear regression model is:

$y = \beta_0 + \beta_1 x$

where the parameters to be estimated are the intercept ($\beta_0$) and the slope ($\beta_1$).

Let's recall the Advertising dataset and the simple linear regression fitted to the scatter plot of sales vs. TV. With the help of **Scikit-Learn**, we were able to fit the best regression line among all the possibilities. Here is a snapshot:

![image.png](attachment:f71b771a-9521-400f-92f7-77b568610574.png)

### Figure 1: Simple Linear Regression

The line in the figure is the simple linear regression line with output **y** as `sales` and input **x** as `TV`.  
The residual or error, $\varepsilon_i$, is the difference between the observed value, $y_i$, and the predicted value, $\hat{y}_i$.  
The observed values are the actual output data points, shown as blue dots in the figure, and the predicted values are the corresponding points on the regression line.  
The error for each data point is the vertical distance from the observed point to the predicted point on the regression line.

---

### The predicted output value is:

$$
\hat{y}_i = \beta_0 + \beta_1 x_i
$$

---

### The observed (actual) output value is:

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$

---

Here, $\varepsilon_i$ is a random error, not a parameter.  
The error $\varepsilon_i = y_i - \hat{y}_i$ can be positive, negative, or zero.  
As we can see in the figure, the vertical lines fall on either side of the regression line.  
To keep positive and negative errors from cancelling when we add them up, we square each error before summing.  
The result is called the **Residual Sum of Squares (RSS)** or the **Sum of Squared Errors (SSE)**.
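
The cancellation problem can be seen on a tiny example. The sketch below uses five made-up points (not the Advertising dataset) and a hypothetical candidate line that is simply flat at the mean of $y$. The names `x`, `y`, `beta0`, and `beta1` are illustrative choices only.

```python
import numpy as np

# Toy data made up for illustration (not the Advertising dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Hypothetical candidate line: flat at the mean of y (intercept 4, slope 0).
beta0, beta1 = 4.0, 0.0

y_hat = beta0 + beta1 * x      # predicted values
residuals = y - y_hat          # signed errors: [-2, 0, 1, 0, 1]

signed_sum = residuals.sum()   # positive and negative errors cancel
sse = np.sum(residuals ** 2)   # squaring removes the sign before summing

print("residuals:", residuals)
print("sum of residuals:", signed_sum)
print("sum of squared residuals:", sse)
```

The signed sum is 0 even though the flat line clearly misses most points, while the sum of squares is 6, which is why the squared version is the more useful measure of fit.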

---

### Sum of Squared Errors (SSE)

$$
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

---

The summation is indexed from 1 to $n$, since we have $n$ samples.  
The Sum of Squared Errors (SSE) is a function of $\beta_0$ and $\beta_1$.  
We can also treat it as a **loss function**.  
The main principle of Least Squares is to choose the intercept $(\beta_0)$ and slope $(\beta_1)$  
such that this sum is as small as possible.
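
To make the idea of SSE as a function of the two parameters concrete, here is a minimal sketch. The function name `sse`, the toy data, and the three candidate lines are assumptions chosen for illustration and are not taken from the Advertising example.

```python
import numpy as np

def sse(beta0, beta1, x, y):
    """Sum of squared errors of the line beta0 + beta1 * x on the data (x, y)."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# Toy data made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Hypothetical candidate (intercept, slope) pairs.
candidates = [(4.0, 0.0), (1.0, 1.0), (2.2, 0.6)]

for b0, b1 in candidates:
    print(f"beta0={b0}, beta1={b1}, SSE={sse(b0, b1, x, y):.3f}")
```

Each candidate line gives a different SSE, and least squares prefers the candidate with the smallest value. Of these three, that is the last one.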

---

Thus, to estimate the parameters, we minimize the sum of squared errors.  
The Sum of Squared Errors (SSE) can also be written as:

$$
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
= \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2
$$


Here, $\hat{y}_i$ is replaced with the simple linear regression model equation.  
Since we want to minimize the **SSE**, it is also called an **objective function**. Because the **SSE** is a sum of squared terms, it is never negative. If we plot the objective function against $\beta_0$ and $\beta_1$, we get a **convex**, bowl-shaped surface that opens upwards.


![image.png](attachment:a1c9bbe2-39d3-4643-ae23-b66f09b62f6a.png)

### Figure 2: Convex cost function
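
Convexity can also be probed numerically. A convex function satisfies $f\left(\tfrac{a+b}{2}\right) \le \tfrac{1}{2}\big(f(a) + f(b)\big)$ for any two points $a$ and $b$. The sketch below tests this midpoint inequality for the SSE at random parameter pairs. The toy data, the sampling range, and the tolerance are arbitrary assumptions, and a check like this can only fail to find a counterexample. It is not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def sse(beta):
    """SSE for a parameter vector beta = [beta0, beta1]."""
    residuals = y - (beta[0] + beta[1] * x)
    return float(np.sum(residuals ** 2))

violations = 0
for _ in range(1000):
    a = rng.uniform(-10, 10, size=2)   # random (beta0, beta1) pair
    b = rng.uniform(-10, 10, size=2)   # another random pair
    mid = 0.5 * (a + b)
    # Midpoint inequality, with a small tolerance for floating-point rounding.
    if sse(mid) > 0.5 * (sse(a) + sse(b)) + 1e-9:
        violations += 1

print("midpoint-inequality violations:", violations)
```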

The parameters at the minimum point are obtained from calculus by setting the first derivatives of the objective function to 0, because the gradient (slope) is 0 at the minimum.  
We have two unknown parameters, the intercept $(\beta_0)$ and the slope $(\beta_1)$, so we take the partial derivative of the **SSE** with respect to $\beta_0$ and $\beta_1$ separately.  
We then set both partial derivatives to 0 and solve for $\beta_0$ and $\beta_1$.

---

### Taking partial derivatives with respect to $\beta_0$:

$$
\frac{\partial \text{SSE}}{\partial \beta_0}
=
\frac{\partial}{\partial \beta_0}
\sum (y_i - (\beta_0 + \beta_1 x_i))^2
$$

Note that the derivative of the sum is the sum of the derivatives. So, we can take the derivative inside the summation.

$$
\frac{\partial}{\partial \beta_0}
\sum (y_i - (\beta_0 + \beta_1 x_i))^2
=
\sum
\frac{\partial}{\partial \beta_0}
(y_i - (\beta_0 + \beta_1 x_i))^2
$$

Now, applying the power rule and the chain rule, we get:

$$
= \sum 2 (y_i - (\beta_0 + \beta_1 x_i)) (-1)
$$

$$
= -2 \sum (y_i - (\beta_0 + \beta_1 x_i)) \quad ...... (1)
$$

---

### Now, with respect to $\beta_1$:

$$
\frac{\partial \text{SSE}}{\partial \beta_1}
=
\frac{\partial}{\partial \beta_1}
\sum (y_i - (\beta_0 + \beta_1 x_i))^2
$$

Again, the derivative of the sum is the sum of the derivatives. So, we take the derivative inside the summation.

$$
\frac{\partial}{\partial \beta_1}
\sum (y_i - (\beta_0 + \beta_1 x_i))^2
=
\sum
\frac{\partial}{\partial \beta_1}
(y_i - (\beta_0 + \beta_1 x_i))^2
$$

Applying the power rule, the 2 comes out front and the exponent becomes 1. We also apply the chain rule, which brings out $-x_i$, the derivative of the inner term with respect to $\beta_1$.

$$
= \sum 2 (y_i - (\beta_0 + \beta_1 x_i)) (-x_i)
$$

Cleaning up a bit,

$$
= -2 \sum x_i (y_i - (\beta_0 + \beta_1 x_i)) \quad ...... (2)
$$
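
Expressions (1) and (2) can be sanity-checked against numerical derivatives. The sketch below compares them with central finite differences at an arbitrary point. The function names `sse` and `grad_sse`, the toy data, the test point, and the step size `h` are all assumptions made for illustration.

```python
import numpy as np

# Toy data made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def sse(beta0, beta1):
    return float(np.sum((y - (beta0 + beta1 * x)) ** 2))

def grad_sse(beta0, beta1):
    """Analytic partial derivatives from expressions (1) and (2)."""
    r = y - (beta0 + beta1 * x)
    d_beta0 = -2.0 * np.sum(r)        # expression (1)
    d_beta1 = -2.0 * np.sum(x * r)    # expression (2)
    return d_beta0, d_beta1

b0, b1 = 0.5, 1.0   # arbitrary point at which to compare
h = 1e-6            # finite-difference step

# Central differences approximate each partial derivative.
num_d0 = (sse(b0 + h, b1) - sse(b0 - h, b1)) / (2 * h)
num_d1 = (sse(b0, b1 + h) - sse(b0, b1 - h)) / (2 * h)

print("analytic:", grad_sse(b0, b1))
print("numeric: ", (num_d0, num_d1))
```

If the derivation is right, the two printed pairs should agree up to small rounding error.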

Now, we set the partial derivatives in expressions (1) and (2) equal to 0.

$$
-2 \sum (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$

$$
-2 \sum x_i (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$
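
Before solving by hand, it can help to see that these two conditions pin down a single pair of parameters. Dividing each by $-2$ and expanding the sums gives a pair of linear equations, $n\beta_0 + \beta_1 \sum x_i = \sum y_i$ and $\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$. The sketch below solves that system numerically with `numpy.linalg.solve` on made-up toy data. The matrix name `A` and vector name `b` are illustrative.

```python
import numpy as np

# Toy data made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Coefficient matrix and right-hand side of the two linear equations.
A = np.array([[n,        x.sum()],
              [x.sum(),  np.sum(x ** 2)]])
b = np.array([y.sum(), np.sum(x * y)])

beta0, beta1 = np.linalg.solve(A, b)
print("beta0:", beta0)
print("beta1:", beta1)
```

For this toy data the solution is an intercept of about 2.2 and a slope of about 0.6.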

Here, we have two equations and two unknowns, and we are going to solve them to find our parameters. But how do we do that?

First, we will get an expression for $\beta_0$ from the first equation.  
That expression will involve $\beta_1$, so we will substitute it into the second equation and solve for $\beta_1$.  
Let's solve the first equation.

---

### Solving for $\beta_0$ by setting expression (1) to 0:

$$
-2 \sum (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$

We can divide both sides by $-2$ so that we get,

$$
\sum (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$

If we carry the summation through each term inside the bracket, we get:

$$
\sum y_i - \sum \beta_0 - \sum \beta_1 x_i = 0
$$

Note that $\beta_0$ and $\beta_1$ do not depend on the index $i$: whatever values they take, those values are the same for every sample. With respect to the summation over the samples they are therefore constants, so they can be moved outside the summation:

$$
\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0
$$

The sum of $\beta_0$ from 1 to $n$ turns to $n\beta_0$ and $\beta_1$ comes out of the summation term.

Now, isolating the $n\beta_0$ term, we get:

$$
n\beta_0 = \sum y_i - \beta_1 \sum x_i
$$

Dividing both sides by $n$, we get:

$$
\beta_0 = \frac{\sum y_i}{n} - \frac{\beta_1 \sum x_i}{n}
$$

The sum of all $y$'s divided by $n$ is the mean $\bar{y}$, and likewise the sum of all $x$'s divided by $n$ is the mean $\bar{x}$.  
So, we end up with:

$$
\beta_0 = \bar{y} - \beta_1 \bar{x}
$$
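
One consequence of this expression is that, for any slope, choosing the intercept this way makes the line pass through the point $(\bar{x}, \bar{y})$ and makes the residuals sum to zero. The sketch below checks both on made-up toy data for a few arbitrary slopes. The slope values and variable names are assumptions for illustration.

```python
import numpy as np

# Toy data made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

results = []
for beta1 in [-1.0, 0.0, 0.6, 2.0]:           # arbitrary trial slopes
    beta0 = y.mean() - beta1 * x.mean()       # intercept from the formula above
    residual_sum = np.sum(y - (beta0 + beta1 * x))
    value_at_x_bar = beta0 + beta1 * x.mean() # line evaluated at the mean of x
    results.append((beta1, beta0, residual_sum, value_at_x_bar))
    print(f"slope={beta1:5.2f}  intercept={beta0:5.2f}  "
          f"sum of residuals={residual_sum:.2e}  line at x_bar={value_at_x_bar:.2f}")
```

The intercept alone cannot tell us which slope is best, which is why the second equation is still needed.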

But this expression cannot be evaluated without knowing the value of $\beta_1$.  
So, we substitute this expression for $\beta_0$ into the equation where the partial derivative with respect to $\beta_1$ is set to 0.

Hence, solving for $\beta_1$,

$$
-2 \sum x_i (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$

We can divide both sides by $-2$ so that we get,

$$
\sum x_i (y_i - (\beta_0 + \beta_1 x_i)) = 0
$$

Substituting $\beta_0$ with $\bar{y} - \beta_1 \bar{x}$, we get:

$$
\sum x_i \big(y_i - (\bar{y} - \beta_1 \bar{x} + \beta_1 x_i)\big) = 0
$$

Now we are getting somewhere, since the only unknown in the above expression is $\beta_1$.  
Next, we need to isolate $\beta_1$.  
Let's first gather similar terms together, i.e., pair $y_i$ with $\bar{y}$ and $x_i$ with $\bar{x}$:

$$
\sum x_i \big((y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\big) = 0
$$

Carrying the summation through each term, we get:

$$
\sum x_i (y_i - \bar{y}) - \sum \beta_1 x_i (x_i - \bar{x}) = 0
$$

$\beta_1$ is a constant, so we put it outside the summation term.

$$
\sum x_i (y_i - \bar{y}) - \beta_1 \sum x_i (x_i - \bar{x}) = 0
$$

Moving the term including $\beta_1$ to the other side, we get:

$$
\sum x_i (y_i - \bar{y}) = \beta_1 \sum x_i (x_i - \bar{x})
$$

Now, we express $\beta_1$ as:

$$
\beta_1 =
\frac{\sum x_i (y_i - \bar{y})}
{\sum x_i (x_i - \bar{x})}
\quad ...... (3)
$$

This is one way of expressing $\beta_1$, but it is not the form we usually use.  
The more common expression for $\beta_1$ is:

$$
\beta_1 =
\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
{\sum (x_i - \bar{x})^2}
\quad ...... (4)
$$

For equations (3) and (4) to agree, the following two identities must hold:

$$
\sum x_i (y_i - \bar{y}) = \sum (x_i - \bar{x})(y_i - \bar{y})
$$

$$
\sum x_i (x_i - \bar{x}) = \sum (x_i - \bar{x})^2
$$

Now, we will show that both identities hold.

---

### For the numerator part:

$$
\sum (x_i - \bar{x})(y_i - \bar{y})
=
\sum x_i (y_i - \bar{y})
-
\sum \bar{x} (y_i - \bar{y})
$$

$\bar{x}$ is a constant term, so we take it out of the summation:

$$
\sum (x_i - \bar{x})(y_i - \bar{y})
=
\sum x_i (y_i - \bar{y}) - \bar{x}\sum (y_i - \bar{y})
$$

Now, let's look at the sum $\sum (y_i - \bar{y})$ in the second term:

$$
\sum (y_i - \bar{y})
=
\sum y_i - \sum \bar{y}
=
\sum y_i - n\bar{y}
=
0
$$

Since $n\bar{y}$ is equal to $\sum y_i$, the whole second term becomes 0. Hence,

$$
\sum (x_i - \bar{x})(y_i - \bar{y})
=
\sum x_i (y_i - \bar{y})
$$

---

### For the denominator part:

$$
\sum (x_i - \bar{x})^2
=
\sum (x_i - \bar{x})(x_i - \bar{x})
$$

$$
\sum (x_i - \bar{x})^2
=
\sum x_i (x_i - \bar{x})
-
\sum \bar{x}(x_i - \bar{x})
$$

Again, $\bar{x}$ is a constant term, so we take it out:

$$
\sum (x_i - \bar{x})^2
=
\sum x_i (x_i - \bar{x})
-
\bar{x}\sum (x_i - \bar{x})
$$

As before, let's look at the sum $\sum (x_i - \bar{x})$ in the second term:

$$
\sum (x_i - \bar{x})
=
\sum x_i - \sum \bar{x}
=
\sum x_i - n\bar{x}
=
0
$$

Since $\bar{x}=\frac{\sum x_i}{n}$, we have $\sum x_i = n\bar{x}$, so the second term becomes 0. Hence,

$$
\sum (x_i - \bar{x})^2
=
\sum x_i (x_i - \bar{x})
$$

---

So, we have now shown that the numerators and the denominators of the two expressions for $\beta_1$ are equal.
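
The same two identities can be spot-checked numerically. The sketch below draws arbitrary random data (the seed, sample size, and distribution are assumptions with no special meaning) and compares both sides of each identity.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)   # arbitrary random data, for checking only
y = rng.normal(size=50)

x_bar, y_bar = x.mean(), y.mean()

# Numerator identity: sum x_i (y_i - y_bar) == sum (x_i - x_bar)(y_i - y_bar)
num_left = np.sum(x * (y - y_bar))
num_right = np.sum((x - x_bar) * (y - y_bar))

# Denominator identity: sum x_i (x_i - x_bar) == sum (x_i - x_bar)^2
den_left = np.sum(x * (x - x_bar))
den_right = np.sum((x - x_bar) ** 2)

print("numerators equal:  ", np.isclose(num_left, num_right))
print("denominators equal:", np.isclose(den_left, den_right))
```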

Since the parameters are estimated from data, we usually put hats on them. The key equations of the estimated parameters for simple linear regression are:

$$
\hat{\beta}_1
=
\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
{\sum_{i=1}^{n}(x_i-\bar{x})^2}
$$

$$
\hat{\beta}_0
=
\bar{y}-\hat{\beta}_1\bar{x}
$$

From the samples provided, we first compute $\hat{\beta}_1$ from the first expression and then substitute its value into the second expression to obtain $\hat{\beta}_0$.
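
The two formulas translate almost directly into code. The following is a minimal sketch, assuming NumPy is available. The function name `ols_simple` and the toy data are illustrative and not part of the original example. As a cross-check it compares the result with `numpy.polyfit`, which fits a degree-1 polynomial by least squares and returns the slope first.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form least-squares estimates (beta0_hat, beta1_hat) for y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations over sum of squared deviations of x.
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: uses the slope computed above.
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Toy data made up for illustration (not the Advertising dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = ols_simple(x, y)
slope_np, intercept_np = np.polyfit(x, y, deg=1)   # coefficients, highest power first

print("closed form:", beta0_hat, beta1_hat)
print("np.polyfit: ", intercept_np, slope_np)
```

Note that the formula divides by $\sum (x_i - \bar{x})^2$, so it needs at least two distinct $x$ values. The sketch does not guard against that case.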


