<a href="https://colab.research.google.com/github/PaulToronto/Stanford-Andrew-Ng-Machine-Learning-Specialization/blob/main/1_2_1_Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1.2.1 Multiple Linear Regression

- **NOTE**: this should not be confused with **multivariate regression**
    - **multiple** refers to more than one predictor variable, whereas **multivariate** refers to more than one dependent variable
    - multivariate regression is not covered in this course

In [None]:
import pandas as pd
import numpy as np

## 1.2.1.1 Multiple features

### Previously

- A single feature, $x$, was used to predict the price of the house, $y$
- $f_{w,b}\left(x\right) = wx + b$

In [None]:
path = 'https://raw.githubusercontent.com/PaulToronto'
path += '/Stanford-Andrew-Ng-Machine-Learning-Specialization/main'
path += '/data/Portland.csv'

portland = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
portland['Price'] = portland['Price'] / 1000.0
portland.shape

(47, 3)

In [None]:
portland = portland.drop('Bedrooms', axis=1)
portland.head()

Unnamed: 0,Size,Price
0,2104,399.9
1,1600,329.9
2,2400,369.0
3,1416,232.0
4,3000,539.9


### Terminology for multiple regression

- Now there are four features: $x_1$, $x_2$, $x_3$, and $x_4$
- $x_j$ represents the $j^{th}$ feature: $j = 1\dots4$
- $n$ denotes the total number of features: $n = 4$
- $\vec{x}^{(i)}$ denotes the $i^{th}$ training example
    - note that for multiple regression this is a vector
        - more specifically, this is a row vector
- $x_j^{(i)}$ is the value of feature $j$ in the $i^{th}$ training example

In [None]:
path = 'https://raw.githubusercontent.com/PaulToronto'
path += '/Stanford-Andrew-Ng-Machine-Learning-Specialization/main'
path += '/data/houses.csv'

houses = pd.read_csv(path,
                     header=None,
                     names=['Size', 'Bedrooms', 'Floors', 'YearsOld', 'Price'],
                     dtype={'Size':'int',
                            'Bedrooms': 'int',
                            'Floors': 'int',
                            'YearsOld':'int',
                            'Price': 'double'})
houses.shape

(100, 5)

In [None]:
houses.head()

Unnamed: 0,Size,Bedrooms,Floors,YearsOld,Price
0,952,2,1,65,271.5
1,1244,3,1,64,300.0
2,1947,3,2,17,509.8
3,1725,3,2,42,394.0
4,1959,3,2,15,540.0
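
The indexing notation from the terminology list can be checked against the five rows shown by `houses.head()`. The sketch below hard-codes those rows into a NumPy array (the name `X_train` is an assumption made for this illustration, not something defined in the notebook). Keep in mind that the math notation counts from 1 while Python indexes from 0.

```python
import numpy as np

# First five rows from houses.head(): Size, Bedrooms, Floors, YearsOld
# (X_train is a hypothetical name used only for this illustration)
X_train = np.array([
    [952, 2, 1, 65],
    [1244, 3, 1, 64],
    [1947, 3, 2, 17],
    [1725, 3, 2, 42],
    [1959, 3, 2, 15],
])

m, n = X_train.shape      # m examples shown here, n features

x_2 = X_train[1]          # x^(2): the 2nd training example (a vector)
x_3_2 = X_train[1, 2]     # x_3^(2): feature 3 (Floors) of the 2nd example

print('n:', n)
print('x^(2):', x_2)
print('x_3^(2):', x_3_2)
```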


### Model

- Previously: $f_{w,b}\left(x\right) = wx + b$
- With multiple regression: $f_{\vec{w},b}\left(\vec{x}\right) = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$

#### In our example:

- $x_1$ denotes the `Size`
- $x_2$ denotes the `Bedrooms`
- $x_3$ denotes the `Floors`
- $x_4$ denotes the `YearsOld`

Suppose we have trained the model and found the parameters:

$f_{\vec{w},b}\left(\vec{x}\right) = 0.1x_1 + 4x_2 + 10x_3 - 2x_4 + 80$

#### How might we interpret these parameters?

- First note that the price is in 1000s of dollars
- $b = 80$ where the unit is 1000s of dollars
    - This can be thought of as the *base price* of a house with 0 square feet, 0 bedrooms, 0 floors and 0 years old
- $0.1x_1$
    - The price increases by $0.1 \times 1000 = 100$ dollars for each additional square foot
- $4x_2$
    - The price increases by $4 \times 1000 = 4000$ dollars for each additional bedroom
- $10x_3$
    - The price increases by $10 \times 1000 = 10000$ dollars for each additional floor
- $-2x_4$
    - The price **decreases** by $2 \times 1000 = 2000$ dollars for each year added to the age of the house
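
As a worked check of this interpretation, the sketch below applies the example parameters to the first house in `houses.head()` (952 square feet, 2 bedrooms, 1 floor, 65 years old). The parameters are the illustrative values from the text, not fitted ones, so the result is not expected to match the listed price.

```python
# Example parameters from the text (price in 1000s of dollars)
w1, w2, w3, w4, b = 0.1, 4, 10, -2, 80

# First row of houses.head()
size, bedrooms, floors, years_old = 952, 2, 1, 65

# Contribution of each term to the predicted price
contributions = {
    'Size': w1 * size,
    'Bedrooms': w2 * bedrooms,
    'Floors': w3 * floors,
    'YearsOld': w4 * years_old,
    'base price (b)': b,
}

prediction = sum(contributions.values())

for name, value in contributions.items():
    print(f'{name}: {value:.1f}')
print(f'prediction: {prediction:.1f} (1000s of dollars)')
```

The age term (65 years at 2 per year) subtracts 130, which outweighs the 95.2 contributed by the size term, leaving a prediction of about 63.2 thousand dollars.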


### Notation

- **The model**: $f_{\vec{w},b}\left(\vec{x}\right) = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$
- $\vec{w} = \left[w_1 \ w_2 \ w_3 \dots w_n\right]$
- $b$ is a scalar
    - $\vec{w}$ together with $b$ are the **parameters of the model**
- $\vec{x} = \left[x_1 \ x_2 \ x_3 \dots x_n\right]$
    - $\vec{x}$ contains the **features of the model**

### Model: more succinctly

- the dot ($\cdot$) in the following formula is for the **dot product**

$$
f_{\vec{w},b}\left(\vec{x}\right) = \vec{w} \cdot \vec{x} + b
$$
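
A minimal sketch of this formula in NumPy, assuming the example parameters from the interpretation section and the second row of `houses.head()`. The helper name `predict` is hypothetical, not something defined in the course.

```python
import numpy as np

def predict(x, w, b):
    """Return f_{w,b}(x) = w . x + b for a single example x."""
    return np.dot(w, x) + b

w = np.array([0.1, 4.0, 10.0, -2.0])   # example parameters from the text
b = 80.0
x = np.array([1244, 3, 1, 64])         # second row of houses.head()

# The dot product gives the same value as writing out every term
expanded = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + w[3]*x[3] + b

print(predict(x, w, b), expanded)
```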

## 1.2.1.2 Vectorization

Benefits of vectorization:

1. Makes your code more compact
2. Makes your code run more efficiently
3. Allows you to make use of modern numerical linear algebra libraries
4. Might even allow you to make use of GPU hardware

### Example

Note that linear algebra notation counts from 1, whereas Python indexes from 0, so $w_1$ is `w[0]`.

- $\vec{w} = \left[w_1 \ w_2 \ w_3\right]$
- $b$ is a scalar
- $\vec{x} = \left[x_1 \ x_2 \ x_3\right]$
- So, $n = 3$

Here it is in Python code:

In [None]:
w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])
n = len(w)

print('w:', w)
print(' w1:', w[0], ' w2:', w[1], ' w3:', w[2], '\n')

print('b:', b, '\n')

print('x:', x)
print(' x1:', x[0], ' x2:', x[1], ' x3:', x[2], '\n')

print('n:', n, '\n')

w: [ 1.   2.5 -3.3]
 w1: 1.0  w2: 2.5  w3: -3.3 

b: 4 

x: [10 20 30]
 x1: 10  x2: 20  x3: 30 

n: 3 



#### Without vectorization

$f_{\vec{w},b}\left(\vec{x}\right) = w_1x_1 + w_2x_2 + w_3x_3 + b$

In [None]:
%%timeit -r7 -n1000000
f = w[0] * x[0] + \
    w[1] * x[1] + \
    w[2] * x[2] + b

1.54 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


#### Without vectorization using a `for` loop

$$
f_{\vec{w},b}\left(\vec{x}\right) = \left(\sum_{j=1}^{n}w_jx_j\right) + b
$$

In [None]:
%%timeit -r7 -n1000000
f = 0
for j in range(0, n):
    f += w[j] * x[j]
f += b

1.7 µs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


#### With vectorization

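- In these runs, this code only seems to beat the two previous code cells consistently when using a GPU runtime
- With just three elements, per-call overhead dominates the timings; the benefit of vectorization grows with the length of the vectors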

$$
f_{\vec{w},b}\left(\vec{x}\right) = \vec{w} \cdot \vec{x} + b
$$

In [None]:
%%timeit -r7 -n1000000
f = w.dot(x) + b

1.22 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [None]:
%%timeit -r7 -n1000000
f = np.dot(w, x) + b

1.79 µs ± 18.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [None]:
%%timeit -r7 -n1000000
f = w @ x + b

1.91 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
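
To see the effect of vectorization more clearly, the comparison can be repeated on longer vectors. The sketch below uses the standard-library `timeit` module instead of the `%%timeit` cell magic. The size of 100,000 elements and the names `w_big`, `x_big`, `loop_dot` and `vec_dot` are arbitrary choices for illustration, and the actual timings depend on the machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n_big = 100_000                      # arbitrary size for illustration
w_big = rng.random(n_big)
x_big = rng.random(n_big)
b = 4

def loop_dot(w, x, b):
    """Non-vectorized: accumulate one term at a time."""
    f = 0.0
    for j in range(len(w)):
        f += w[j] * x[j]
    return f + b

def vec_dot(w, x, b):
    """Vectorized: a single NumPy dot product."""
    return np.dot(w, x) + b

# Time 5 calls of each version on the same data
t_loop = timeit.timeit(lambda: loop_dot(w_big, x_big, b), number=5)
t_vec = timeit.timeit(lambda: vec_dot(w_big, x_big, b), number=5)

print(f'loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s (5 calls each)')
```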


### Careful, don't use `*`

- `*` does element-wise multiplication

In [None]:
# correct
w.dot(x), w @ x

(-39.0, -39.0)

In [None]:
# incorrect
w * x

array([ 10.,  50., -99.])
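
The element-wise product is not wrong in itself. It is the dot product before the terms are summed. A short check, reusing the values of `w` and `x` from the example above:

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
x = np.array([10, 20, 30])

elementwise = w * x              # one product per feature
summed = np.sum(elementwise)     # adding them up gives the dot product

print(elementwise)
print(summed, w @ x)
```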

### How does vectorization help with Gradient Descent?

Let $\vec{w}$ be the vector for the weights and let $\vec{d}$ be the vector for the derivatives. Suppose there are $n = 16$ features. Let $b$ be the intercept.

$$
\begin{align}
\vec{w} &= \begin{pmatrix} w_1 & w_2 & \dots & w_{16}\end{pmatrix} \\
\vec{d} &= \begin{pmatrix} d_1 & d_2 & \dots & d_{16}\end{pmatrix}
\end{align}
$$

Suppose the values for $\vec{w}$ and $\vec{d}$ are in NumPy arrays.

Compute:

$$
w_j = w_j - \alpha d_j \text{ for } j = 1 \dots 16
$$

Code without vectorization (ignoring $b$):

$$
\begin{align}
w_1 &= w_1 - \alpha d_1 \\
w_2 &= w_2 - \alpha d_2 \\
& \vdots \\
w_{16} &= w_{16} - \alpha d_{16}
\end{align}
$$

```python
for j in range(0, 16):
    w[j] = w[j] - alpha * d[j]
```

Code with vectorization (ignoring $b$):

$$
\vec{w} = \vec{w} - \alpha\vec{d}
$$

```python
w = w - alpha * d
```
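
A minimal sketch comparing the two updates, assuming a learning rate of `alpha = 0.1` and randomly generated values standing in for $\vec{w}$ and $\vec{d}$. Real derivatives would come from the cost function, which is not computed here.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1                      # assumed learning rate
w = rng.random(16)               # placeholder weights
d = rng.random(16)               # placeholder derivatives

# Without vectorization: update one weight at a time
w_loop = w.copy()
for j in range(0, 16):
    w_loop[j] = w_loop[j] - alpha * d[j]

# With vectorization: update all 16 weights in one expression
w_vec = w - alpha * d

print(np.allclose(w_loop, w_vec))
```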

## 1.2.1.3 Lab - Python, NumPy and Vectorization

https://colab.research.google.com/drive/1FuQouLAiKi4487ELObVYfBX5r-e2L96V#scrollTo=6qQadVxTW-hn



## 1.2.1.4 Gradient descent for multiple linear regression