# Implement Linear Regression From Scratch (with Normal Equation)
__Author__ : Mohammad Rouintan , 400222042

__Course__ : Undergraduate Machine Learning Course

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Problem
Implement Linear Regression using the normal equation as the training algorithm from scratch.


### What is the Normal Equation?

The normal equation is a closed-form solution used to find the value of θ that minimizes the cost function. Another way to describe the normal equation is as a one-step algorithm used to analytically find the coefficients that minimize the loss function. Both descriptions work, but what exactly do they mean? We will start with linear regression.

Linear regression makes a prediction, $\hat{y}$, by computing the weighted sum of input features plus a bias term. Mathematically it can be represented as follows:

$$
\begin{align}
\hat{y} &= \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n} \tag{1}
\end{align}
$$

Where $\theta$ represents the parameters, $n$ is the number of features, and $x_{0} = 1$ by convention, so that $\theta_{0}$ acts as the bias term.

Essentially, the above equation is just the dot product of $\theta$ and $x$. Thus, a more concise way to represent it is the vectorized form:

$$
\begin{align}
\hat{y} &= h_{\theta}(x) = \theta^{T}x \tag{2}
\end{align}
$$

$h_{\theta}(x)$ is the hypothesis function.
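
As a minimal sketch of equation (2), assume a hypothetical parameter vector `theta` and a single instance `x` whose first entry is the constant $x_{0} = 1$ (all values below are made up for illustration). The prediction then reduces to a single dot product:

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
theta = np.array([1.0, 2.0, 3.0])   # [theta_0 (bias), theta_1, theta_2]
x = np.array([1.0, 4.0, 5.0])       # [x_0 = 1, x_1, x_2]

# Weighted sum written out term by term, as in the expanded equation.
y_hat_sum = sum(theta[j] * x[j] for j in range(len(theta)))

# The same prediction in vectorized form: theta^T x.
y_hat = theta @ x

print(y_hat_sum, y_hat)
```

Both expressions compute the same number; the vectorized form is simply shorter and lets NumPy do the summation.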

Given this approximate target function, we can use our model to make predictions. To determine if our model has learned well, it’s important we measure the performance of our model on the training data. For this purpose, we compute a loss function. The goal of the training process is to find the values of theta ($\theta$) that minimize the loss function.

Here’s how we can represent our loss function mathematically:

$$
\begin{align}
J(\theta_{0}, \theta_{1}, \theta_{2}, \dots, \theta_{n}) &= \frac{1}{2m} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 \tag{3}
\end{align}
$$

In the above equation, $m$ is the number of training instances, $\theta$ is an $(n + 1)$-dimensional vector, and the loss is a function of that vector. Consequently, the partial derivative of the loss function, $J$, has to be taken with respect to each parameter $\theta_{j}$ in turn, and each of these derivatives must equal zero. Solving the resulting system of equations for $\theta_{0}$ through $\theta_{n}$ yields the values of $\theta$ that minimize the loss function.
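
To make equation (3) concrete, here is a minimal sketch of the loss and its vector of partial derivatives in NumPy. The function names `mse_loss` and `mse_gradient` and the toy data are assumptions made for illustration; `X` is assumed to hold one row per instance with a leading column of ones. The gradient uses the standard matrix form $\frac{1}{m} X^{T}(X\theta - y)$.

```python
import numpy as np

def mse_loss(theta, X, y):
    """J(theta) = 1/(2m) * sum over instances of (prediction - target)^2."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

def mse_gradient(theta, X, y):
    """All partial derivatives dJ/dtheta_j at once: (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

# Hypothetical toy data generated by y = 1 + 2 * x_1 (first column is x_0 = 1).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta_good = np.array([1.0, 2.0])   # the parameters that generated the data
theta_bad = np.array([0.0, 0.0])    # an arbitrary poor guess

loss_good = mse_loss(theta_good, X, y)
loss_bad = mse_loss(theta_bad, X, y)
grad_good = mse_gradient(theta_good, X, y)

print(loss_good, loss_bad, grad_good)
```

At the generating parameters the loss is zero and every partial derivative vanishes, which is exactly the condition the paragraph above describes.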

Solving for each parameter of $\theta$ by hand in this way is an extremely involved derivation. Fortunately, there is a faster solution.
Take a look at the formula for the normal equation:

$$
\begin{align}
\theta &= (X^{T}X)^{-1}X^{T}y \tag{4}
\end{align}
$$

Where:

$\theta$ → The parameters that minimize the loss function<br> 
$X$ → The matrix of input feature values, one row per instance (with a leading column of ones for $x_{0}$)<br> 
$y$ → The vector of output values for each instance
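
The normal equation above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the required solution: the helper names `add_bias_column`, `fit_normal_equation`, and `predict` are hypothetical, and the data is synthetic and noise-free. The sketch solves the linear system $(X^{T}X)\theta = X^{T}y$ with `np.linalg.solve`, which is mathematically equivalent to applying the inverse but is usually more stable numerically. The literal formula is computed as well, for comparison.

```python
import numpy as np

def add_bias_column(features):
    """Prepend a column of ones (x_0 = 1) so theta_0 acts as the bias."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features.reshape(-1, 1)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def fit_normal_equation(features, y):
    """Return theta solving (X^T X) theta = X^T y."""
    X = add_bias_column(features)
    y = np.asarray(y, dtype=float)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(features, theta):
    """Predictions X @ theta for new feature rows."""
    return add_bias_column(features) @ theta

# Synthetic, noise-free data: y = 4 + 3 * x_1 - 2 * x_2 (assumed for the demo).
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(50, 2))
y = 4.0 + 3.0 * features[:, 0] - 2.0 * features[:, 1]

theta = fit_normal_equation(features, y)

# The formula exactly as written: theta = (X^T X)^{-1} X^T y.
X = add_bias_column(features)
theta_literal = np.linalg.inv(X.T @ X) @ X.T @ y

print(theta, theta_literal)
```

If $X^{T}X$ is singular (for example, when two features are perfectly correlated), `np.linalg.solve` raises an error; `np.linalg.pinv` or `np.linalg.lstsq` are common alternatives in that case.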

In [1]:
# Your code for first problem

After each cell, explain your code in full. Keep the code in each cell clean and add comments where needed.

### Part b)
Description and code for the second part

In [None]:
# Your code for the second problem

After each cell, explain your code in full. Keep the code in each cell clean and add comments where needed.

## Conclusion for this problem
Write a conclusion and list the references you used in your homework.