# Multiple Linear Regression

## Introduction

In the previous notes, we had only one independent variable, or feature. In most machine learning problems, we want to include more than one feature, or we want a hypothesis that is not simply a straight line. This section discusses how to include more than one feature and how to model relationships beyond a simple straight line using multiple linear regression.

## Hypothesis

Recall that in linear regression, our hypothesis is written as follows.

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where $x$ is the only independent variable or feature. In multiple linear regression, we have more than one feature. We will write our hypothesis as follows.

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$$

In the above hypothesis, we have $n$ features. Note also that we can define $x_0 = 1$, so that $\theta_0$ is its coefficient.


We can write this in terms of a row vector, where the features are written as

$$\mathbf{X} = \begin{bmatrix}
x_0 & x_1 & \ldots & x_n
\end{bmatrix} \in {\rm I\!R}^{n+1}$$

Note that the dimension of the feature vector is $n+1$ because it includes the constant $x_0 = 1$.

The parameters can be written as a column vector as follows.

$$\mathbf{\Theta} = \begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n
\end{bmatrix} \in {\rm I\!R}^{n+1}$$

Our system of equations for all the data points can be written as follows.

$$h_\theta(x^1) = \theta_0 + \theta_1 x_1^1 + \theta_2 x_2^1 + \ldots + \theta_n x_n^1$$
$$h_\theta(x^2) = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2 + \ldots + \theta_n x_n^2$$
$$\ldots$$
$$h_\theta(x^m) = \theta_0 + \theta_1 x_1^m + \theta_2 x_2^m + \ldots + \theta_n x_n^m$$

In the above equations, the superscript indicates the index of the data point, from 1 to $m$, assuming there are $m$ data points. It is an index, not an exponent.

To write the hypothesis as a matrix equation we first need to write the features as a matrix for all the data points.

$$\mathbf{X} = \begin{bmatrix}
1 & x_1^1 & \ldots & x_n^1 \\
1 & x_1^2 & \ldots & x_n^2 \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^m & \ldots & x_n^m
\end{bmatrix} \in {\rm I\!R}^{m \times (n+1)}$$

With this, we can now write the hypothesis as a matrix multiplication.

$$\mathbf{H} = \mathbf{X} \times \mathbf{\Theta}$$
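
The matrix form of the hypothesis can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy is available; the helper names `add_intercept_column` and `predict`, as well as the data values, are made up for this example and are not part of the notes.

```python
import numpy as np

def add_intercept_column(features):
    """Prepend a column of ones (x_0 = 1) to an (m, n) feature array."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def predict(X, theta):
    """Compute H = X @ Theta for all m data points at once."""
    return X @ theta

# Illustrative data: m = 3 data points, n = 2 features.
raw_features = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
X = add_intercept_column(raw_features)     # shape (m, n + 1) = (3, 3)
theta = np.array([[1.0], [2.0], [3.0]])    # shape (n + 1, 1) = (3, 1)
H = predict(X, theta)                      # shape (m, 1) = (3, 1)
```

Each row of `H` is the hypothesis for one data point, so a single matrix product replaces the $m$ separate equations.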

## Cost Function

Recall that the cost function is written as follows.

$$J(\theta_0, \theta_1) = \frac{1}{2m}\Sigma^m_{i=1}\left(h_{\theta}(x^i)-y^i\right)^2$$

We can rewrite the square as a multiplication instead and make use of matrix multiplication to express it.

$$J(\theta_0, \theta_1) = \frac{1}{2m}\Sigma^m_{i=1}\left(h_{\theta}(x^i)-y^i\right)\times \left(h_{\theta}(x^i)-y^i\right)$$

Writing it as matrix multiplication gives us the following.

$$J(\mathbf{\Theta}) = \frac{1}{2m}(\mathbf{H}-\mathbf{y})^T\times (\mathbf{H}-\mathbf{y})$$
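
As a rough sketch, the matrix form of the cost function maps directly onto NumPy operations. The function name `compute_cost` and the small data set are assumptions made for illustration only.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Return J = (1 / 2m) * (H - y)^T (H - y) as a Python float."""
    m = X.shape[0]                 # number of data points
    error = X @ theta - y          # H - y, shape (m, 1)
    return (error.T @ error).item() / (2 * m)

# Illustrative data lying exactly on the line y = x.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[1.0], [2.0], [3.0]])

cost_at_zero = compute_cost(X, y, np.zeros((2, 1)))         # all parameters zero
cost_at_fit = compute_cost(X, y, np.array([[0.0], [1.0]]))  # exact fit
```

The product `error.T @ error` is a $1 \times 1$ matrix, which is why `.item()` is used to extract the scalar cost.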

## Gradient Descent

Recall that the update function of the gradient descent algorithm for linear regression is given as follows.

$$\theta_0 = \theta_0 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)$$

$$\theta_1 = \theta_1 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x^i$$

In the case of multiple linear regression, we have more than one feature, so we need to differentiate the cost function with respect to each $\theta_j$. Doing this results in the following system of equations.

$$\theta_0 = \theta_0 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_0^i$$

$$\theta_1 = \theta_1 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_1^i$$

$$\theta_2 = \theta_2 - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_2^i$$

$$\ldots$$

$$\theta_n = \theta_n - \alpha \frac{1}{m}\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_n^i$$

Note that $x_0^i = 1$ for all $i$.

We can now write the gradient descent update function using matrix operations.

$$\mathbf{\Theta} = \mathbf{\Theta} - \alpha\frac{1}{m} \mathbf{X}^T \times (\mathbf{H} - \mathbf{y})$$
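
One way to check this matrix form against the system of equations above is to expand the product row by row. Row $j$ of $\mathbf{X}^T$ holds the values of feature $j$ for every data point, so multiplying it with the error vector reproduces the sum in the update for $\theta_j$.

$$\mathbf{X}^T \times (\mathbf{H} - \mathbf{y}) = \begin{bmatrix}
x_0^1 & x_0^2 & \ldots & x_0^m \\
x_1^1 & x_1^2 & \ldots & x_1^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \ldots & x_n^m
\end{bmatrix} \times \begin{bmatrix}
h_\theta(x^1) - y^1 \\
h_\theta(x^2) - y^2 \\
\vdots \\
h_\theta(x^m) - y^m
\end{bmatrix} = \begin{bmatrix}
\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_0^i \\
\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_1^i \\
\vdots \\
\Sigma_{i=1}^m\left(h_{\theta}(x^i) - y^i\right)x_n^i
\end{bmatrix}$$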

Substituting the equation for $\mathbf{H}$ gives us the following.

$$\mathbf{\Theta} = \mathbf{\Theta} - \alpha\frac{1}{m} \mathbf{X}^T \times (\mathbf{X}\times \mathbf{\Theta} - \mathbf{y})$$
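
The update rule can be turned into a short loop. The following is a minimal sketch, assuming NumPy and a fixed number of iterations; the function name `gradient_descent`, the learning rate, the iteration count, and the data are illustrative choices rather than values given in these notes.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat Theta = Theta - (alpha / m) * X^T (X Theta - y) num_iters times."""
    m = X.shape[0]
    theta = np.asarray(theta, dtype=float).copy()   # leave the caller's array untouched
    for _ in range(num_iters):
        error = X @ theta - y                       # H - y, shape (m, 1)
        theta = theta - (alpha / m) * (X.T @ error)
    return theta

# Illustrative data generated from y = 1 + 2x, with the x_0 = 1 column included.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[1.0], [3.0], [5.0], [7.0]])

initial_theta = np.zeros((2, 1))
theta_one_step = gradient_descent(X, y, initial_theta, alpha=0.1, num_iters=1)
theta_fitted = gradient_descent(X, y, initial_theta, alpha=0.1, num_iters=5000)
```

All parameters are updated simultaneously because the whole vector is recomputed from the same `error`. In practice the learning rate usually needs tuning: a value that is too large can make the updates diverge.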