# Linear regression theory

#### 1. What is regression?
Regression is a family of supervised machine learning methods that predict a numeric value from one or more input variables.

#### 2. What is linear regression?
Linear regression is a statistical technique used to model the relationship between a scalar response and one or more explanatory variables.

A linear regression model takes the following form: $$\textbf{y} \approx \bar{\textbf{y}} = \textbf{X}\textbf{w} + b$$

Each weight in $\textbf{w}$ appears only to the first power, so the equation has the same form as the equation of a line with respect to $\textbf{w}$. That is why the model is called Linear Regression.

We use the following notation:
- $\textbf{y}$ represents the **actual value** of the response variable.
- $\bar{\textbf{y}}$ represents the **predicted value** of the response variable given by the model.
- $\textbf{X}$ represents the explanatory variables.
- $\textbf{w}$ represents the weights or the coefficients of the linear regression.
- $b$ represents the bias of the linear regression.

More explicitly, if $\textbf{X} = [X_1, X_2, \cdots, X_n]$ and $\textbf{w} = [w_1, w_2, \cdots, w_n]^{\mathsf{T}}$, then the above equation can be rewritten as $$\textbf{y} \approx \bar{\textbf{y}} = w_1X_1 + w_2X_2 + \cdots + w_nX_n + b$$

Some references instead define $\textbf{X} = [1, X_1, X_2, \cdots, X_n]$ and $\textbf{w} = [w_0, w_1, w_2, \cdots, w_n]^{\mathsf{T}}$, so that the bias $b$ is absorbed into the weights as $w_0$. The equation then becomes: $$\textbf{y} \approx \bar{\textbf{y}} = w_1X_1 + w_2X_2 + \cdots + w_nX_n + w_0 = \textbf{X}\textbf{w}$$
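To make the two conventions concrete, here is a minimal NumPy sketch. The toy values for `X`, `w` and `b` are made up for illustration; the point is only that prepending a column of ones to $\textbf{X}$ and putting the bias in front of $\textbf{w}$ gives the same predictions.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 explanatory variables.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
w = np.array([0.5, -1.0])  # weights w_1, w_2
b = 2.0                    # bias

# Convention 1: explicit bias term, y_hat = Xw + b.
y_hat_explicit = X @ w + b

# Convention 2: prepend a column of ones to X and store the bias as w_0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w_aug = np.concatenate([[b], w])
y_hat_augmented = X_aug @ w_aug

print(y_hat_explicit)   # predictions with the explicit bias
print(y_hat_augmented)  # same predictions with the bias absorbed into w
```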

#### 3. How do we find the coefficients in the model?
Our goal is to find the coefficients $\textbf{w}$ so that $\bar{\textbf{y}}$ is as close to $\textbf{y}$ as possible. In other words, $\textbf{y} - \bar{\textbf{y}}$ should be as close to $\textbf{0}$ as possible.

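This means $\textbf{y} - \textbf{X}\textbf{w} \approx \textbf{0}$, or equivalently that the squared length $(\textbf{y} - \textbf{X}\textbf{w})^2 = (\textbf{y} - \textbf{X}\textbf{w})^{\mathsf{T}}(\textbf{y} - \textbf{X}\textbf{w})$ is close to $0$. In real applications an exact zero is rarely attainable, so we just need to find the $\textbf{w}$ that minimizes $(\textbf{y} - \textbf{X}\textbf{w})^2$.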

In Linear Regression, we find the coefficient vector $\textbf{w}^*$ that minimizes $\dfrac{1}{2}(\textbf{y} - \textbf{X}\textbf{w})^2$. The factor $\dfrac{1}{2}$ is there only to cancel the factor 2 that appears when we differentiate the square; it does not change where the minimum is.

Differentiating this expression with respect to $\textbf{w}$ and setting the derivative to $\textbf{0}$, we have ${\textbf{X}}^{\mathsf{T}}(\textbf{y} - \textbf{X}\textbf{w}^*) = \textbf{0}$. Therefore, provided ${\textbf{X}}^{\mathsf{T}}\textbf{X}$ is invertible, $\textbf{w}^* = ({\textbf{X}}^{\mathsf{T}}\textbf{X})^{-1}{\textbf{X}}^{\mathsf{T}}\textbf{y}$.
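For readers who want the intermediate steps, the differentiation can be written out as follows. Here $L(\textbf{w})$ is just a name introduced for the objective, the square is read as a squared Euclidean length, and the last line assumes ${\textbf{X}}^{\mathsf{T}}\textbf{X}$ is invertible:

$$
\begin{aligned}
L(\textbf{w}) &= \dfrac{1}{2}(\textbf{y} - \textbf{X}\textbf{w})^{\mathsf{T}}(\textbf{y} - \textbf{X}\textbf{w}) \\
\nabla_{\textbf{w}} L(\textbf{w}) &= -{\textbf{X}}^{\mathsf{T}}(\textbf{y} - \textbf{X}\textbf{w}) \\
\nabla_{\textbf{w}} L(\textbf{w}^*) = \textbf{0} &\iff {\textbf{X}}^{\mathsf{T}}\textbf{X}\textbf{w}^* = {\textbf{X}}^{\mathsf{T}}\textbf{y} \\
\textbf{w}^* &= ({\textbf{X}}^{\mathsf{T}}\textbf{X})^{-1}{\textbf{X}}^{\mathsf{T}}\textbf{y}
\end{aligned}
$$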

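`sklearn.linear_model.LinearRegression()` solves this same ordinary least squares problem, although it computes the solution with a numerical least-squares solver rather than by inverting ${\textbf{X}}^{\mathsf{T}}\textbf{X}$ explicitly. We will use it, combined with other APIs in `sklearn` and libraries such as `numpy`, `pandas` and `matplotlib`, to work through some examples of Linear Regression.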
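As a minimal sketch of the closed-form solution above, the snippet below applies it to synthetic data. The data, the "true" coefficients and the noise level are assumptions made up for illustration. It solves the linear system $({\textbf{X}}^{\mathsf{T}}\textbf{X})\textbf{w} = {\textbf{X}}^{\mathsf{T}}\textbf{y}$ instead of forming the inverse, and checks the result against NumPy's own least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3*x1 - 2*x2 + 1 + small noise.
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_w = np.array([3.0, -2.0])
true_b = 1.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=n_samples)

# Absorb the bias into the weights by prepending a column of ones.
X_aug = np.hstack([np.ones((n_samples, 1)), X])

# Normal equation: solve (X^T X) w = X^T y without forming the inverse.
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Reference solution from NumPy's least-squares solver on the same problem.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(w_normal)  # [w_0 (bias), w_1, w_2], close to [1, 3, -2]
print(w_lstsq)   # should agree with w_normal up to rounding
```

Solving the linear system is usually preferred over computing $({\textbf{X}}^{\mathsf{T}}\textbf{X})^{-1}$ directly, because it is cheaper and numerically more stable.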

#### 4. Sounds great, but how do we know that the model is perfect?
Well, I have to say that no model is perfectly true. Some models are useful in some ways, and we will cover this in the future.

But for now, if we use Linear Regression, we can use some **metrics** to measure how good the model is. Below, $N$ is the number of samples:
- **Mean Squared Error (MSE)**: This is one of the most popular metrics for regression tasks. It averages the squared differences between the actual values $\textbf{y}$ and the predicted values $\bar{\textbf{y}}$ (the squared Euclidean distance divided by $N$): $\text{MSE}(\textbf{y}, \bar{\textbf{y}}) = \displaystyle \dfrac{1}{N}\sum_{i = 1}^{N}(\textbf{y}_i - \bar{\textbf{y}}_i)^2$.
- **Root Mean Squared Error (RMSE)**: Another common metric for regression tasks. We simply take the square root of MSE, which puts the error back in the same units as $\textbf{y}$: $\text{RMSE}(\textbf{y}, \bar{\textbf{y}}) = \sqrt{\text{MSE}(\textbf{y}, \bar{\textbf{y}})}$.
- **Mean Absolute Error (MAE)**: Similar to MSE, but instead of squared differences, MAE averages the absolute differences between the actual values $\textbf{y}$ and the predicted values $\bar{\textbf{y}}$ (the Manhattan distance divided by $N$): $\text{MAE}(\textbf{y}, \bar{\textbf{y}}) = \displaystyle \dfrac{1}{N}\sum_{i = 1}^{N}|\textbf{y}_i - \bar{\textbf{y}}_i|$.
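The three metrics can be computed by hand in a few lines of NumPy, taking the mean over the samples as the metric names suggest. The values in `y_true` and `y_pred` below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_pred = np.array([2.5, 3.0, 4.0, 2.5])

errors = y_true - y_pred        # [0.5, -0.5, 0.0, 1.0]

mse = np.mean(errors ** 2)      # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
rmse = np.sqrt(mse)             # square root of the MSE
mae = np.mean(np.abs(errors))   # (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5

print(mse, rmse, mae)
```

Because the errors are squared, MSE and RMSE penalize the single large error (1.0) more heavily than MAE does.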

There are also metrics for specific tasks that I cannot go into here. If you are interested, I will try to update this notebook to include them.

### Now that we have the basics of Linear Regression, let's get our hands dirty, shall we?

# Let's get our hands dirty!

## First example

Before we start on the examples, we need to import some libraries and APIs for later use.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, root_mean_squared_error

We read the dataset using pandas. This dataset has 2 columns: SAT score (SAT) and GPA score (GPA). We will use Linear Regression to see how the SAT score relates to the GPA score.

In [None]:
sat_gpa_df = pd.read_csv("../datasets/sat_gpa.csv")
sat_gpa_df.head()

We will visualize the data to see the correlation.

In [None]:
plt.scatter(sat_gpa_df['SAT'], sat_gpa_df['GPA'])
plt.xlabel('SAT score')
plt.ylabel('GPA score')
plt.title('SAT and GPA relationship')
plt.show()

Let's fit a Linear Regression model for the task. Now, since we do this in a Machine Learning fashion, we need to do some steps before fitting the model.

First, we need to define which variable is the regressor $\textbf{X}$ and which is the response $\textbf{y}$ that we want to predict. In this case, $\textbf{X}$ is the 'SAT' column, and $\textbf{y}$ is the 'GPA' column.

In [None]:
X = sat_gpa_df.drop(columns='GPA') # Keeps X as a 2-D DataFrame, which scikit-learn expects; sat_gpa_df[['SAT']] would also work.
y = sat_gpa_df['GPA']

Next, we need to split the dataset into 2 subsets: a training subset and a testing subset. We will train the model on the training set, and use the testing set to evaluate the result. We set aside 20% of the dataset for testing. The remaining 80% is used for training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
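To show what an 80/20 split does conceptually, here is a rough NumPy sketch on made-up stand-in data. It is an illustration of the idea (shuffle the row indices, then slice), not `sklearn`'s actual implementation, and the names `indices`, `n_test`, `train_idx` and `test_idx` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 10 samples, 1 feature, y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

test_size = 0.2
n_samples = X.shape[0]
n_test = int(np.ceil(test_size * n_samples))  # 2 of the 10 samples

# Shuffle the row indices once, then slice them into test and train parts.
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

Since the split is random, every run of `train_test_split` produces different subsets unless a seed is fixed. Passing `random_state` (for example `random_state=42`) to `train_test_split` makes the split reproducible.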