# Linear Regression

## Introduction

Linear regression is a supervised machine learning algorithm for predicting a continuous target value. In the simplest case, our data consists of an independent variable $x$ and a dependent variable $y$. The linear regression algorithm finds the equation of the straight line that best models this data.

$$ y = m x + c$$

In this sense, the model consists of the two coefficients $m$ and $c$. Once we know these two coefficients, we will be able to predict the value of $y$ for any $x$.

## Hypothesis

We take the straight line equation as our hypothesis. This simply means that we hypothesize that the relationship between the independent variable and the dependent variable is a straight line. To generalize it, we write our hypothesis as follows.

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where you can see that $\theta_0$ is the constant $c$ and $\theta_1$ is the gradient $m$. The purpose of our learning algorithm is to find $\theta_0$ and $\theta_1$ given the values of $x$ and $y$. 
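
As a minimal sketch, the hypothesis can be written as a small Python function. The function name `hypothesis` and the sample coefficients are illustrative assumptions, not values taken from any data set.

```python
def hypothesis(theta0, theta1, x):
    """Return h_theta(x) = theta0 + theta1 * x for a single input x."""
    return theta0 + theta1 * x


# Assumed example coefficients: constant 1.0 and gradient 2.0.
prediction = hypothesis(1.0, 2.0, 3.0)
print(prediction)  # 7.0
```

Once $\theta_0$ and $\theta_1$ are known, the same function can be called for any new value of $x$.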

## Cost Function

In order to find the values of $\theta_0$ and $\theta_1$, we will apply an optimization algorithm that minimizes the error. The error, that is, the difference between our hypothesis and the actual data $y$, is captured in a *cost function*. Let's find our cost function for linear regression.

We can get the error by taking the difference between our hypothesis and the actual value and squaring it. Squaring keeps every term non-negative, so positive and negative differences do not cancel each other out when we add them up. For one particular data point $i$, the squared error is as follows.

$$e^i = \left(h_\theta(x^i) - y^i\right)^2$$

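Assuming we have $m$ data points, we can then sum over all data points to get the sum of squared errors. Note that from here on $m$ denotes the number of data points, not the gradient of the line used in the introduction.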

$$\Sigma_{i=1}^m\left(h_\theta(x^i) - y^i\right)^2$$

We can then choose the following equation as our cost function.

$$J(\theta_0, \theta_1) = \frac{1}{2m}\Sigma_{i=1}^m\left(h_\theta(x^i) - y^i\right)^2$$

The division by $m$ gives an average over all data points. The constant 2 in the denominator makes the derivative easier to calculate, since it cancels the factor of 2 that comes from differentiating the square.

The learning algorithm will then try to find the constants $\theta_0$ and $\theta_1$ that minimize the cost function.

$$\underset{\theta_0, \theta_1}{\text{minimize}} \quad J(\theta_0, \theta_1)$$
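
The cost function can be sketched directly from its formula. The function name `cost` and the four data points below are made up for illustration; the points were chosen to lie exactly on the line $y = 1 + 2x$.

```python
def cost(theta0, theta1, xs, ys):
    """Compute J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h_theta(x^i) - y^i
        total += error ** 2
    return total / (2 * m)


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # 0.0, the line fits perfectly
print(cost(0.0, 0.0, xs, ys))  # 10.5, a poor guess gives a larger cost
```

The better the constants fit the data, the smaller the value of $J$ becomes.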

## Gradient Descent

One of the algorithms that we can use to find the constants by minimizing the cost function is called *gradient descent*. The algorithm starts with some initial guess of the constants and uses the gradient of the cost function to decide where to go next in order to reach the bottom, or minimum, of the function. In this way, given some initial values of $\theta_0$ and $\theta_1$, we can calculate their next values using the following equation.

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Here $j$ is 0 or 1, and $\alpha$ is the learning rate, a positive constant that controls the size of each step.

We can actually calculate the derivative of the cost function analytically.

$$J(\theta_0, \theta_1) = \frac{1}{2m}\Sigma_{i=1}^m\left(h_\theta(x^i) - y^i\right)^2$$

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m}\Sigma_{i=1}^m\left(h_\theta(x^i) - y^i\right)^2$$

We can substitute our straight line equation to give the following.

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m}\Sigma_{i=1}^m\left(\theta_0 + \theta_1 x^i - y^i\right)^2$$

Now we will differentiate with respect to $\theta_0$ and $\theta_1$. Let's first do it for $\theta_0$. By the chain rule, the square contributes a factor of 2 that cancels the 2 in the denominator, and the derivative of the inner term with respect to $\theta_0$ is 1.

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) =  \frac{1}{m}\Sigma_{i=1}^m\left(\theta_0 + \theta_1 x^i - y^i\right)$$

or

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)$$

Now we do the same by differentiating with respect to $\theta_1$. This time the derivative of the inner term is $x^i$.

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) =  \frac{1}{m}\Sigma_{i=1}^m\left(\theta_0 + \theta_1 x^i - y^i\right) x^i$$

or

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right) x^i$$
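
One way to sanity-check the two derivatives is to compare them against a numerical approximation. The sketch below does this with central finite differences; the function names, the step size `eps`, and the data are all assumptions made for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradients(theta0, theta1, xs, ys):
    """Analytic partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def numerical_gradients(theta0, theta1, xs, ys, eps=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    d0 = (cost(theta0 + eps, theta1, xs, ys)
          - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
    d1 = (cost(theta0, theta1 + eps, xs, ys)
          - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)
    return d0, d1


# Made-up data lying on y = 1 + 2x, evaluated at theta0 = theta1 = 0.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
analytic = gradients(0.0, 0.0, xs, ys)
numeric = numerical_gradients(0.0, 0.0, xs, ys)
print(analytic)  # (-4.0, -8.5)
print(numeric)   # should agree with the analytic values to several decimals
```

Both gradients are negative here, which tells us that increasing $\theta_0$ and $\theta_1$ from zero reduces the cost.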

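Now we have the equations to calculate the next values of $\theta_0$ and $\theta_1$ using gradient descent. Both updates are computed from the current values of $\theta_0$ and $\theta_1$ before either one is overwritten.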

$$\theta_0 = \theta_0 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)$$

$$\theta_1 = \theta_1 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x^i$$
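
The two update equations can be repeated in a loop. The following is a minimal sketch, assuming a fixed learning rate `alpha`, a fixed number of iterations, and an initial guess of zero for both constants; none of these choices comes from the text above, and the data is made up.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit theta0 and theta1 by repeating the two update equations."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # assumed initial guess
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Both gradients are computed before either constant changes,
        # so the two updates use the same current values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # approximately 1.0 2.0
```

If `alpha` is chosen too large, the updates can overshoot the minimum and diverge; if it is too small, many more iterations are needed.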

## Matrix Operations

We can carry out these calculations using matrix operations.

### Hypothesis

Recall that our hypothesis for one data point was written as follows.

$$h_\theta(x^i) = \theta_0 + \theta_1 x^i$$

If we have $m$ data points, we will then have a set of equations.

$$h_\theta(x^1) = \theta_0 + \theta_1 x^1$$
$$h_\theta(x^2) = \theta_0 + \theta_1 x^2$$
$$\ldots$$
$$h_\theta(x^m) = \theta_0 + \theta_1 x^m$$

We can rewrite this in terms of matrix multiplication. First, we write our independent variable $x$ as a column vector.

$$\begin{bmatrix}
x^1\\
x^2\\
\vdots\\
x^m
\end{bmatrix}$$

To write the system of equations, we need to add a column of constant 1s in front of our column vector, which turns it into an $m \times 2$ matrix.

$$\mathbf{X} = \begin{bmatrix}
1 & x^1\\
1 & x^2\\
\vdots & \vdots\\
1 & x^m
\end{bmatrix}$$

We write our constants as a column vector too.

$$\mathbf{\Theta} = \begin{bmatrix}
\theta_0\\
\theta_1
\end{bmatrix}$$



Our system of equations can then be written as

$$\mathbf{H} = \mathbf{X} \times \mathbf{\Theta}$$

The result of this matrix multiplication is a column vector of size $m\times 1$.
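
The matrix form of the hypothesis can be sketched with NumPy, where the `@` operator performs matrix multiplication. The input values and the constants in `Theta` are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])         # m = 4 made-up inputs
X = np.column_stack([np.ones_like(x), x])  # shape (4, 2): column of 1s, then x
Theta = np.array([[1.0], [2.0]])           # shape (2, 1): theta0, theta1

H = X @ Theta                              # matrix multiplication
print(H.shape)    # (4, 1)
print(H.ravel())  # [1. 3. 5. 7.]
```

Each row of `H` is the hypothesis for one data point, so all $m$ predictions are computed in a single operation.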

### Gradient Descent

Recall that our gradient descent update equations were written as follows.

$$\theta_0 = \theta_0 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)$$

$$\theta_1 = \theta_1 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x^i$$

And recall that our independent variable is stored in a matrix with a column of constant 1s as its first column.

$$\mathbf{X} = \begin{bmatrix}
1 & x^1\\
1 & x^2\\
\vdots & \vdots\\
1 & x^m
\end{bmatrix}$$

Transposing this matrix results in

$$\mathbf{X}^T = \begin{bmatrix}
1 & 1 & \ldots & 1\\
x^1 & x^2 & \ldots & x^m\\
\end{bmatrix}$$


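Note that we can also write the summations in the update equations as a matrix operation. The first row of $\mathbf{X}^T$ contains only 1s, so multiplying it with $\mathbf{H} - \mathbf{y}$ gives the sum in the $\theta_0$ update, while the second row contains the $x^i$ values and gives the sum in the $\theta_1$ update.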

$$\mathbf{\Theta} = \mathbf{\Theta} - \alpha \frac{1}{m}\mathbf{X}^T \times (\mathbf{H} - \mathbf{y})$$

Substituting the equation for $\mathbf{H}$, we get the following equation.

$$\mathbf{\Theta} = \mathbf{\Theta} - \alpha \frac{1}{m}\mathbf{X}^T \times (\mathbf{X} \times \mathbf{\Theta} - \mathbf{y})$$

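In this notation, bold capital letters denote matrices and column vectors such as $\mathbf{X}$, $\mathbf{\Theta}$, and $\mathbf{H}$, and bold small letters such as $\mathbf{y}$ denote vectors. Symbols without bold notation, such as $\alpha$ and $m$, are scalars.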
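
The matrix update rule translates almost line by line into NumPy. This is a minimal sketch under the same assumptions as before: a fixed learning rate, a fixed number of iterations, an initial guess of zeros, and made-up data.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Repeat Theta = Theta - (alpha / m) * X^T (X Theta - y)."""
    m = X.shape[0]
    Theta = np.zeros((X.shape[1], 1))  # assumed initial guess of all zeros
    for _ in range(iterations):
        H = X @ Theta                  # predictions, shape (m, 1)
        Theta = Theta - (alpha / m) * (X.T @ (H - y))
    return Theta


x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([[1.0], [3.0], [5.0], [7.0]])  # column vector, shape (4, 1)

Theta = gradient_descent(X, y)
print(Theta.ravel())  # close to [1. 2.]
```

Note that `y` is kept as a column vector of shape `(m, 1)`. With a one-dimensional `y`, NumPy would broadcast `H - y` into an `(m, m)` array and the result would no longer match the equation.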

## Metrics

We use metrics to evaluate our model or hypothesis. Before we apply these metrics, we split the data into two:
- training data set
- test data set

The training data set is used to build the model or the hypothesis. The test data set and the metrics are used to evaluate the model.
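
A split can be sketched in a few lines. The helper name `train_test_split`, the 25% test fraction, the shuffling, and the seed are all assumptions made for this illustration; the text above does not prescribe how the split is done.

```python
import random


def train_test_split(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle the paired data and split it into training and test sets."""
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)  # seeded so the split is repeatable
    n_test = max(1, int(len(pairs) * test_fraction))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test


# Made-up data lying on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
train, test = train_test_split(xs, ys)
print(len(train), len(test))  # 6 2
```

Keeping each $x$ paired with its $y$ during the shuffle ensures that no data point ends up in both sets.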


### Mean Squared Error

One metric we can use here is called the mean squared error. The mean squared error is computed as follows.

$$MSE = \frac{1}{n}\Sigma_{i=1}^n(y^i - \hat{y}^i)^2$$

where $n$ is the number of predicted data points in the *test* data set, $y^i$ is the actual value in the *test* data set, and $\hat{y}^i$ is the predicted value obtained using the hypothesis and the independent variable $x^i$ in the *test* data set.
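
The formula can be sketched as a short function. The name `mean_squared_error` and the three test values are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


y_test = [3.0, 5.0, 7.0]  # actual values in the test data set
y_pred = [2.5, 5.0, 8.0]  # assumed predictions from some hypothesis
print(mean_squared_error(y_test, y_pred))  # (0.25 + 0 + 1) / 3, about 0.4167
```

A value of 0 means every prediction matches the actual value exactly, and larger values mean larger errors.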

### R2 Coefficient of Determination

Another metric is called the $r^2$ coefficient or the coefficient of determination. This is computed as follows.

$$r^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where

$$SS_{res} = \Sigma_{i=1}^n (y^i - \hat{y}^i)^2$$

$$SS_{tot} = \Sigma_{i=1}^n (y^i - \overline{y})^2$$

$$\overline{y} = \frac{1}{n} \Sigma_{i=1}^n y^i$$

Here $y^i$ is the actual target value, $\hat{y}^i$ is the predicted target value, $\overline{y}$ is the mean of the actual target values, and $n$ is the number of target values.

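This coefficient measures how much of the variation in $y$ is explained by the model, that is, how close the data is to the fitted straight line. The closer the data is to the line, the closer the value is to 1.0, which indicates a strong linear relationship between the independent variable and the dependent variable. When the model does no better than always predicting the mean $\overline{y}$, the value is close to 0, and on the test data set it can even be negative when the model does worse than that.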
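
The coefficient can be sketched from the three formulas above. The function name `r_squared` and the sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_test = [3.0, 5.0, 7.0]
print(r_squared(y_test, [3.0, 5.0, 7.0]))  # 1.0, perfect predictions
print(r_squared(y_test, [5.0, 5.0, 5.0]))  # 0.0, always predicting the mean
```

This sketch does not handle the case where all actual values are identical, because $SS_{tot}$ is then zero and the division fails.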