# Simple Linear Regression Model and OLS

Linear regression is one of the simplest and most commonly used approaches to regression problems. For fitting such models, one method is by far the most common: ordinary least squares, or "OLS". This technique is so common and canonical that people often refer to it simply as "regression", even though plenty of other techniques and model types qualify as regression. In this checkpoint, we'll cover the basic formulation of the linear regression model and the OLS algorithm.

## Formulating a linear regression model

In the previous checkpoint, we saw that a linear regression model fits a best line that represents the relationship between the features and the target. More precisely, a linear regression model can be formulated like this:

$$ y = \beta_0 + \sum_{i=1}^{n}\beta_ix_i + \epsilon$$

Here, $y$ represents the target variable and the $x_i$ represent the $n$ features. The unknowns in the equation above are the $\beta$ terms. $\beta_0$ is the bias term, or constant (the intercept). All the other $\beta_i$ are called the coefficients. The error term $\epsilon$ represents the information in $y$ that is not explained by the features. If we have two features, the equation above can be written more explicitly like this:

$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$$
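
To make the two-feature equation concrete, here is a minimal sketch that evaluates it for one observation. The coefficient values, feature values, and observed target are made up purely for illustration, and the helper name `linear_prediction` is hypothetical.

```python
def linear_prediction(beta_0, betas, xs):
    """Return beta_0 + sum_i beta_i * x_i (the model without the error term)."""
    return beta_0 + sum(b * x for b, x in zip(betas, xs))


# Made-up values, for illustration only.
beta_0 = 1.0
betas = [2.0, -0.5]   # beta_1, beta_2
xs = [3.0, 4.0]       # x_1, x_2
y_observed = 5.75     # the target we actually observed

y_hat = linear_prediction(beta_0, betas, xs)

# The error term is whatever part of y the features do not explain.
epsilon = y_observed - y_hat

print(y_hat, epsilon)
```

Whatever the model does not capture ends up in `epsilon`, which is exactly the role the error term plays in the equation.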



## Insurance charges dataset

TODO: Description of the insurance cost dataset will come here.

In [2]:
import pandas as pd

In [11]:
# TODO: the data will be loaded from PostgreSQL
insurance_df = pd.read_csv("../datasets/insurance.csv")
insurance_df.head(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [12]:
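# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
insurance_df["is_male"] = pd.get_dummies(insurance_df.sex, dtype=int)["male"]
insurance_df["is_smoker"] = pd.get_dummies(insurance_df.smoker, dtype=int)["yes"]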

In [13]:
insurance_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,is_male,is_smoker
0,19,female,27.9,0,yes,southwest,16884.924,0,1
1,18,male,33.77,1,no,southeast,1725.5523,1,0
2,28,male,33.0,3,no,southeast,4449.462,1,0
3,33,male,22.705,0,no,northwest,21984.47061,1,0
4,32,male,28.88,0,no,northwest,3866.8552,1,0


## Modeling the insurance charges with linear regression

Let's illustrate how we can model insurance charges using a simple linear regression model. Since sex may play an important role in determining insurance charges, let's put it into our model. Smoking is also a critical factor in human health, so including it in our model makes sense too. Hence, our model becomes:

$$ charges = \beta_0 + \beta_1is\_smoker + \beta_2is\_male + \epsilon $$

$\beta_0$ is the constant, and $\beta_1$ and $\beta_2$ are the coefficients of the `is_smoker` and `is_male` dummy variables, respectively.
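
Because both features are 0/1 dummies, this model can only produce four distinct predictions, one per combination of smoker status and sex. The sketch below lists them. The coefficient values are hypothetical placeholders, not estimates from the insurance data, and `predicted_charges` is a helper name introduced only for this example.

```python
def predicted_charges(is_smoker, is_male, beta_0, beta_1, beta_2):
    """Deterministic part of the model: beta_0 + beta_1*is_smoker + beta_2*is_male."""
    return beta_0 + beta_1 * is_smoker + beta_2 * is_male


# Hypothetical coefficient values, chosen only for illustration (not estimates).
beta_0, beta_1, beta_2 = 8000.0, 20000.0, 500.0

predictions = {}
for is_smoker in (0, 1):
    for is_male in (0, 1):
        predictions[(is_smoker, is_male)] = predicted_charges(
            is_smoker, is_male, beta_0, beta_1, beta_2
        )
        print(f"is_smoker={is_smoker}, is_male={is_male}: "
              f"{predictions[(is_smoker, is_male)]}")
```

When both dummies are 0 (a female non-smoker), the prediction is just $\beta_0$. $\beta_1$ is the difference in predicted charges between a smoker and a non-smoker of the same sex, and $\beta_2$ is the difference between a male and a female with the same smoking status.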

## How to find the optimal values for the coefficients? OLS.

Formulating a model is the first step in a regression problem. But we still need a way to find the optimal values for the unknowns (the constant and the coefficients) in the equation above. Recall that finding the optimal values of the unknowns is called **optimization**; hence, we need an optimization algorithm to solve for the optimal coefficient values.

The optimization algorithm used in simple linear regression models is called **Ordinary Least Squares**, or **OLS** for short.


## The machinery of OLS

OLS tries to minimize the sum of the squared error terms ($\epsilon$) in the model. We can write the error term as follows:

$$(y - \beta_0 - \sum_{i=1}^{n}\beta_ix_i ) = \epsilon$$

If we square both sides, it becomes:

$$(y - \beta_0 - \sum_{i=1}^{n}\beta_ix_i )^2 = \epsilon^2$$

Notice that this error term is for a single observation. If we have, say, $m$ observations in our dataset, then the sum of the squared errors can be represented like this:

$$\sum_{j=1}^{m}(y_j - \beta_0 - \sum_{i=1}^{n}\beta_ix_{ij} )^2 = \sum_{j=1}^{m}\epsilon_j^2$$

Recall that the index $i$ runs over the $n$ features in the model. In the equation above, the index $j$ runs over the $m$ observations, so $x_{ij}$ is the value of feature $i$ for observation $j$. Hence, we go over each observation and add up the squared error terms.
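
The sum of squared errors can be sketched in a few lines of Python. The rows below are the five observations printed by `insurance_df.head()` earlier, typed in by hand so the snippet runs on its own. The function name `sum_squared_errors` and both candidate coefficient sets are assumptions made for illustration, not fitted values.

```python
# (charges, is_smoker, is_male) for the five rows shown by insurance_df.head()
rows = [
    (16884.924, 1, 0),
    (1725.5523, 0, 1),
    (4449.462, 0, 1),
    (21984.47061, 0, 1),
    (3866.8552, 0, 1),
]


def sum_squared_errors(beta_0, beta_1, beta_2, rows):
    """Add up the squared error of every observation for the given coefficients."""
    total = 0.0
    for charges, is_smoker, is_male in rows:
        error = charges - beta_0 - beta_1 * is_smoker - beta_2 * is_male
        total += error ** 2
    return total


# Two arbitrary candidates: all zeros versus a rough hand-picked guess.
sse_zero = sum_squared_errors(0.0, 0.0, 0.0, rows)
sse_guess = sum_squared_errors(8000.0, 8884.924, 0.0, rows)

print(sse_zero, sse_guess)
```

The candidate with the smaller sum fits these five rows better. OLS looks for the coefficient values that make this sum as small as possible.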

We will not go deeper into the derivation of the coefficients, but the idea is simple: take the derivative of the sum of squared errors with respect to each coefficient (including $\beta_0$) and set each derivative equal to zero. Solving the resulting equations gives the optimal values of the coefficients. That's it! If you want to learn more about the derivation steps, you can read the [Wikipedia article](https://en.wikipedia.org/wiki/Ordinary_least_squares).
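
As a rough sketch of that recipe, we can switch to matrix notation (an assumption of this example, not notation used above). Stack the observations into a matrix $X$ whose first column is all ones (for $\beta_0$) and collect the targets in a vector $y$. Setting the derivatives to zero then gives the so-called normal equations:

$$ X^TX\hat{\beta} = X^Ty $$

The snippet below solves them with NumPy for the `is_smoker` / `is_male` model, using only the ten rows printed by `insurance_df.head(10)` earlier. The resulting numbers are therefore illustrative and will differ from coefficients estimated on the full dataset.

```python
import numpy as np

# The ten rows shown by insurance_df.head(10), typed in by hand.
charges = np.array([
    16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
    3756.6216, 8240.5896, 7281.5056, 6406.4107, 28923.13692,
])
is_smoker = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
is_male = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0], dtype=float)

# Design matrix: a column of ones for beta_0, then one column per feature.
X = np.column_stack([np.ones_like(charges), is_smoker, is_male])

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ charges)

# At the solution, the derivative of the sum of squared errors is zero.
residuals = charges - X @ beta_hat
gradient = -2 * X.T @ residuals

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, charges, rcond=None)

print("beta_hat:", beta_hat)
print("gradient:", gradient)
```

Solving the normal equations directly keeps the link to the derivation visible. In practice, a least-squares routine such as `np.linalg.lstsq` is usually the more numerically robust choice.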