# Multivariate Linear Regression

## **Definition and Usage**

Multivariate linear regression is the same as univariate linear regression, except that instead of one independent variable we use **multiple** independent variables (inputs).

# Notation

$x_j^{(i)}$ = value of the $j^{th}$ feature in the $i^{th}$ training example.<br><br>
  $x^{(i)}$ = the input (features) of the $i^{th}$ training example.<br>
$~~ m =$ the number of training examples.<br>
$~~~ n =$ the number of features.<br>

For example:<br>
$$ x_1~~x_2~~x_3 $$
$$\begin{bmatrix} 1 & 2 & 1 \\ 3 & 0 & 1 \\ 0 & 2 & 4 \end{bmatrix}$$
<br><br>

In the matrix above, each row is one training example and each column is one feature, so here $m = 3$ and $n = 3$.<br><br>
$x^{(2)}$ refers to the row vector $\begin{bmatrix} 3 & 0 & 1 \end{bmatrix}$, which is the input of the second training example.<br><br>
$x_2^{(3)}$ refers to the entry in the third row, second column, which is 2.
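
The indexing convention above can be checked with a short sketch. It assumes NumPy and stores the example matrix in a variable named `X` (a name chosen here for illustration). The notation is 1-based while NumPy is 0-based, so every index is shifted by one.

```python
import numpy as np

# Example matrix: rows are training examples, columns are features.
X = np.array([[1, 2, 1],
              [3, 0, 1],
              [0, 2, 4]])

m, n = X.shape             # m training examples, n features

# x^(2): the second training example (row index 1 in NumPy).
x_2 = X[2 - 1]

# x_2^(3): feature 2 of training example 3 (row index 2, column index 1).
x_3_2 = X[3 - 1, 2 - 1]

print(m, n)    # 3 3
print(x_2)     # [3 0 1]
print(x_3_2)   # 2
```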

# General form of Multivariate Linear Regression

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n $$

Using linear algebra notation:<br>
$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & ... & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix} = \theta^T x $$

where $x^{(i)}_0 = 1$ for all $i \in \{1, \dots, m\}$, so that $\theta_0$ acts as the intercept term.
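
As a rough sketch of the vectorized form, the hypothesis can be evaluated for all training examples at once by prepending a column of ones (the $x_0 = 1$ convention) and taking a matrix-vector product. The helper names `add_bias_column` and `hypothesis`, and the values in `theta`, are illustrative assumptions and not part of the original notes.

```python
import numpy as np

def add_bias_column(X):
    """Prepend x_0 = 1 to every training example."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def hypothesis(theta, X_b):
    """Evaluate h_theta(x) = theta^T x for every row of X_b."""
    return X_b @ theta

X = np.array([[1.0, 2.0, 1.0],
              [3.0, 0.0, 1.0],
              [0.0, 2.0, 4.0]])
theta = np.array([0.5, 1.0, -1.0, 2.0])   # theta_0..theta_3, arbitrary values

X_b = add_bias_column(X)                   # shape (m, n + 1)
predictions = hypothesis(theta, X_b)
print(predictions)                         # [1.5 5.5 6.5]
```

For the first row this is $0.5 + 1 \cdot 1 - 1 \cdot 2 + 2 \cdot 1 = 1.5$, matching the expanded form of $h_\theta(x)$.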

# Gradient Descent for Multiple Variables

This is the same as with one variable, but now there is one parameter to update per feature. Repeat until convergence:
    $$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
    $$\theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m}{[(h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}]}$$
    (simultaneously update $\theta_j$ for $ j = 0, 1, 2, ..., n$)
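
The update rule can be sketched in vectorized form. This is a minimal illustration, not a tuned implementation: it assumes the linear hypothesis $h_\theta(x) = \theta^T x$ (so the derivative of $h_\theta$ with respect to $\theta_j$ is the feature value $x_j^{(i)}$), a fixed number of iterations instead of a convergence test, and toy data made up for the example. The function name `gradient_descent` and its parameters are illustrative.

```python
import numpy as np

def gradient_descent(X_b, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for linear regression.

    X_b is the (m, n + 1) design matrix whose first column is all ones.
    """
    m, num_params = X_b.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        errors = X_b @ theta - y           # h_theta(x^(i)) - y^(i) for all i
        gradient = (X_b.T @ errors) / m    # one partial derivative per theta_j
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta

# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (no noise).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

theta = gradient_descent(X_b, y, alpha=0.5, num_iters=2000)
print(np.round(theta, 4))
```

Computing the whole gradient before assigning to `theta` is what makes the update simultaneous: no $\theta_j$ is changed until every partial derivative has been evaluated with the old values.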

# Feature Scaling

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.<br><br>
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$ -1 \leqslant x_{i} \leqslant 1 $$
or 
$$ -0.5 \leqslant x_{i} \leqslant 0.5 $$

The goal is to get all input variables into roughly one of these ranges. This is called **feature scaling**. In data processing, it is also known as **data normalization** and is generally performed during the **data preprocessing** step.<br><br>
There are two techniques to achieve this:
1. **Max - Min normalization**.<br>
2. **Mean normalization**.

The general formula for standardization is:
    $$ x_i = \frac{x_i - \mu_i}{\sigma} $$
* $\mu_i$ is the average of all the values for the $i^{th}$ feature.
* $\sigma$ is either the standard deviation of those values (defined <a href="https://en.wikipedia.org/wiki/Standard_deviation" target="_blank">here</a>) or their range (Max - Min).
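
A minimal sketch of the standard-deviation variant, assuming NumPy, a design matrix with one feature per column, and no constant feature (a zero standard deviation would cause a division by zero). The function name `standardize` and the housing-style numbers are made up for illustration.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature (population) standard deviation
    return (X - mu) / sigma, mu, sigma

# Hypothetical data: column 0 is a size in square feet, column 1 a room count.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

X_scaled, mu, sigma = standardize(X)
print(mu)                      # [1880.    3.]
print(X_scaled.mean(axis=0))   # approximately [0, 0]
print(X_scaled.std(axis=0))    # approximately [1, 1]
```

The function returns `mu` and `sigma` so that the same values can be reused to scale new inputs at prediction time.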

## **Max - Min normalization**

Max - Min normalization subtracts the minimum value of an input variable and divides by its range (maximum minus minimum), so the rescaled values span a range of width $1$.<br>
**General Formula** for rescaling to $[0, 1]$:
$$x_i = \frac{x_i - x_{min}}{\sigma} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

To rescale to an arbitrary interval $[a, b]$, the formula becomes:
$$x_i = a + \frac{(x_i - x_{min})(b - a)}{\sigma} = a + \frac{(x_i - x_{min})(b - a)}{x_{max} - x_{min}} $$
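
Both formulas can be covered by one small function, since $[0, 1]$ is the special case $a = 0$, $b = 1$. This is a sketch assuming NumPy and that no feature is constant (otherwise $x_{max} - x_{min}$ is zero); the name `min_max_scale` and the sample data are illustrative.

```python
import numpy as np

def min_max_scale(X, a=0.0, b=1.0):
    """Rescale each column of X linearly so its values span [a, b]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return a + (X - x_min) * (b - a) / (x_max - x_min)

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

unit_scaled = min_max_scale(X)               # default target range [0, 1]
sym_scaled = min_max_scale(X, a=-1.0, b=1.0) # target range [-1, 1]

print(unit_scaled)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
print(sym_scaled)
# [[-1. -1.]
#  [ 0.  0.]
#  [ 1.  1.]]
```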


## **Mean normalization**

Mean normalization subtracts the average value of an input variable from each of its values, so that the new average is zero, and then divides by the range of that variable.<br>
**General Formula**:
$$x_i = \frac{x_i - \mu_i}{\sigma} = \frac{x_i - \mu_i}{x_{max} - x_{min}}$$
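
The mean normalization formula can be sketched the same way, again assuming NumPy and non-constant features. The name `mean_normalize` and the sample data are illustrative.

```python
import numpy as np

def mean_normalize(X):
    """Subtract the per-feature mean, then divide by the per-feature range."""
    mu = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / value_range

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_norm = mean_normalize(X)
print(X_norm)
# [[-0.5 -0.5]
#  [ 0.   0. ]
#  [ 0.5  0.5]]
print(X_norm.mean(axis=0))   # [0. 0.]
```

Because the divisor is the range, every scaled value lies within an interval of width 1 centered near zero; for this symmetric sample that is exactly $[-0.5, 0.5]$.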