# Linear Regression with Multiple Features

$x_j$ = $j^{th}$ feature (a column in the dataset)

$m$ = number of features

$\vec{x}^{(i)}$ = features of $i^{th}$ training example (a row in the dataset)

$x^{(i)}_j$ = value of feature $j$ in $i^{th}$ training example

$y^{(i)}$ = $i^{th}$ house's price in \$1000s

$N$ = number of training examples

Previous Model: $f_{w, b}(x) = wx + b$

<u>Model for Multiple Features:</u> $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$

- Where $\vec{w} = \begin{bmatrix} w_1, w_2, \ldots, w_m \end{bmatrix}$ and $\vec{x} = \begin{bmatrix} x_1, x_2, \ldots, x_m \end{bmatrix}$
- The dot denotes the dot product: $\vec{w} \cdot \vec{x} = w_1x_1 + w_2x_2 + \ldots + w_mx_m$

<u>Parameters of the New Model:</u> $\begin{bmatrix} w_1, w_2, \ldots, w_m \end{bmatrix}$ and $b$

<u>New Cost Function:</u> $J(\vec{w}, b) = \frac{1}{2N}\sum_{i = 1}^{N}((\vec{w} \cdot \vec{x}^{(i)} + b) - y^{(i)})^2$
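
To make the model and cost function concrete, here is a minimal NumPy sketch. The function names `predict` and `compute_cost` and the toy numbers are made up for illustration; only the formulas come from the notes above.

```python
import numpy as np

def predict(X, w, b):
    """Model output f(x) = w . x + b for every row of X (shape N x m)."""
    return X @ w + b

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = 1/(2N) * sum((f(x_i) - y_i)^2)."""
    N = X.shape[0]
    errors = predict(X, w, b) - y
    return np.sum(errors ** 2) / (2 * N)

# Made-up toy data: N = 3 training examples, m = 2 features
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 8.0])
w = np.array([2.0, 1.0])
b = 0.0

print(compute_cost(X, y, w, b))
```

Here the predictions are 4, 4 and 7, so the errors are -1, 0 and -1 and the cost works out to $\frac{2}{2 \cdot 3} = \frac{1}{3}$.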

## Gradient Descent for Multiple Feature Linear Regression

- $w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\vec{w}, b)$ for $j = 1, \ldots, m$
    - We need to update every $w$ parameter from $w_1$ to $w_m$, and all parameters (including $b$) are updated simultaneously using the values from the previous iteration
- $b = b - \alpha\frac{\partial}{\partial b}J(\vec{w}, b)$
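
The update rules above can be sketched in NumPy as follows. The notes do not write out the partial derivatives, so this sketch assumes the standard results for the squared-error cost: $\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})x^{(i)}_j$ and $\frac{\partial J}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})$. The function names, the toy data, and the choices of `alpha` and `num_iters` are all illustrative.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Partial derivatives of J with respect to each w_j and b (assumed standard form)."""
    N = X.shape[0]
    errors = X @ w + b - y        # f(x_i) - y_i for every example, shape (N,)
    dj_dw = X.T @ errors / N      # one entry per feature, shape (m,)
    dj_db = np.sum(errors) / N
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Repeat the simultaneous update of w and b for num_iters iterations."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)  # both use the old w and b
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Made-up toy data generated from y = 2*x1 + 1*x2 + 3
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0]])
y = X @ np.array([2.0, 1.0]) + 3.0

w, b = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.1, num_iters=5000)
print(w, b)
```

On this toy data the learned parameters should end up very close to $w = [2, 1]$ and $b = 3$, the values used to generate $y$.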



# Making Regression Work Well

#### <u>Feature Engineering</u> = using intuition to design new features, by transforming or combining original features

- Instead of only using the features you are given, you can create new features from the existing data that capture the problem better
- For example, let's say you have a feature for the width of a plot and a feature for the length of a plot, and you want to predict the plot's price
- You can use those two features directly, but you can also create a third feature, the area of the plot (width times length), which may be more helpful in predicting the price
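
As a small illustration of the plot example, the sketch below adds an area column to a feature matrix. The width and length values are made up, and the column order (width first, length second) is an assumption.

```python
import numpy as np

# Hypothetical plots: column 0 = width, column 1 = length
X = np.array([[20.0, 30.0],
              [15.0, 40.0],
              [25.0, 25.0]])

# Engineered feature: area = width * length
area = X[:, 0] * X[:, 1]

# Append the new feature as a third column
X_engineered = np.column_stack([X, area])
print(X_engineered)
```

The model then has three features per example, and Gradient Descent is free to put most of the weight on the area if that predicts the price best.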

# Vectorization

- Vectorized code is shorter to write and runs faster
- Run time decreases because a vectorized call hands the whole array operation to optimized, compiled routines that can process many elements in parallel
- A for loop, in contrast, performs the operations one step at a time

In [1]:
import numpy as np

In [9]:
# Without vectorization: accumulate the dot product one element at a time
w = np.array([1, 2, 65, 3, 6, 42, 4, 6, 3])
x = np.array([6, 3, 2, 5, 63, 5, 6, 3, 55])
f = 0
for j in range(len(w)):
    f = f + w[j] * x[j]

print(f)

952


In [13]:
# With vectorization
# np.dot multiplies the paired elements of w and x and sums the products inside optimized, compiled code
# That code can use parallel hardware instead of carrying out one multiply and one addition per loop step
f = np.dot(w, x)
print(f)

952


# How to know Gradient Descent is Working

- Plot J (the cost function) against the number of iterations of Gradient Descent and check that it decreases on every iteration and flattens out at the end
- If J increases after any iteration, alpha was chosen poorly (alpha is too big) or there is a bug in the code
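
The check above can also be done in code by recording J after every iteration. This is a minimal sketch on made-up data; `run_gradient_descent` and `is_decreasing` are hypothetical helper names, and the gradient formulas are the standard ones for the squared-error cost.

```python
import numpy as np

def run_gradient_descent(X, y, alpha, num_iters):
    """Run Gradient Descent from w = 0, b = 0 and return J after each iteration."""
    N, m = X.shape
    w, b = np.zeros(m), 0.0
    history = []
    for _ in range(num_iters):
        errors = X @ w + b - y
        w = w - alpha * (X.T @ errors) / N
        b = b - alpha * np.sum(errors) / N
        history.append(np.sum((X @ w + b - y) ** 2) / (2 * N))
    return history

def is_decreasing(history):
    """True if J never goes up from one iteration to the next."""
    return all(later <= earlier for earlier, later in zip(history, history[1:]))

# Made-up data following y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

print(is_decreasing(run_gradient_descent(X, y, alpha=0.1, num_iters=50)))
print(is_decreasing(run_gradient_descent(X, y, alpha=1.0, num_iters=50)))
```

On this data the first call should report `True` and the second `False`, because alpha = 1.0 is too big and J grows on every iteration. Passing `history` to a plotting library gives the learning curve described above.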

# Making Gradient Descent Run Faster

#### <u>Feature Scaling</u>

When features have very different ranges, Gradient Descent can run slowly because the contours of the cost function become long and skinny, so the updates bounce back and forth across the narrow direction before reaching the minimum. If the features are rescaled to have similar ranges of values, the contour plot becomes more circular and Gradient Descent can take a more direct path, so it runs faster.

How do you do feature scaling?
- Assume one of your features $x_1$ ranges from 300 to 2000
- Your second feature $x_2$ ranges from 0 to 5
- Divide each feature by its maximum value:
    - $x_{1, scaled} = \frac{x_1}{2000}$
    - $x_{2, scaled} = \frac{x_2}{5}$
- By doing this, your ranges for both features are:
    - $0.15 \leq x_{1, scaled} \leq 1$
    - $0 \leq x_{2, scaled} \leq 1$
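
Dividing by the maximum can be done for every feature at once with NumPy. The sample values below are made up, chosen only to span the two ranges in the example.

```python
import numpy as np

# Made-up examples: column 0 is x1 (300 to 2000), column 1 is x2 (0 to 5)
X = np.array([[300.0, 0.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

# Divide each column by its own maximum (broadcast across the rows)
X_scaled = X / X.max(axis=0)

print(X_scaled.min(axis=0))  # smallest scaled value of each feature
print(X_scaled.max(axis=0))  # largest scaled value of each feature
```

The smallest scaled values come out as 0.15 and 0, and the largest as 1 and 1, matching the ranges listed above.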

There are also other types of feature scaling, such as mean normalization and z-score normalization, but I have not included them in this notebook.
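
For reference, here is a rough sketch of those two methods. It assumes the usual definitions, which are not taken from this notebook: mean normalization computes $\frac{x_j - \mu_j}{\max_j - \min_j}$ and z-score normalization computes $\frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$. The data values are made up.

```python
import numpy as np

# Made-up examples: column 0 ranges from 300 to 2000, column 1 from 0 to 5
X = np.array([[300.0, 0.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature (population) standard deviation

# Mean normalization: center on the mean, divide by the range
X_mean_norm = (X - mu) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: center on the mean, divide by the standard deviation
X_zscore = (X - mu) / sigma

print(X_mean_norm)
print(X_zscore)
```

Both versions center each feature around 0. After z-score normalization each feature also has a standard deviation of 1.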

For feature scaling, ranges such as $-3 \leq x_j \leq 3$ or $-0.3 \leq x_j \leq 0.3$ are acceptable, but you should aim for roughly $-1 \leq x_j \leq 1$. Some examples:
- $0 \leq x_j \leq 3$ ---- okay, no rescaling
- $-2 \leq x_j \leq 0.5$ ---- okay, no rescaling
- $-100 \leq x_j \leq 100$ ---- too large, rescale
- $-0.001 \leq x_j \leq 0.001$ ---- too small, rescale

#### <u>Choosing the Learning Rate</u>

- Plot the cost function (J) against the number of iterations of Gradient Descent
- If J ever increases after an iteration of Gradient Descent, then alpha is probably too big and you should use a smaller alpha
    - Note that an increasing J is not always a problem with alpha; your code may be wrong
    - To check if your code is wrong, choose a very small alpha and see if J decreases on every iteration (if it doesn't, your code is wrong)
- If Gradient Descent is taking a very large number of iterations to converge, then alpha may be too small


- Andrew's strategy: try 0.001, then 0.003, then 0.01, then 0.03, and so on, multiplying by about 3 each time
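
One way to try this strategy is to run a fixed number of iterations for each candidate alpha and compare the resulting cost. The sketch below uses made-up data and a hypothetical helper `final_cost`; the list of alphas extends the sequence above up to 1.0, and 100 iterations is an arbitrary choice.

```python
import numpy as np

def final_cost(X, y, alpha, num_iters=100):
    """Run Gradient Descent from w = 0, b = 0 and return J after the last iteration."""
    N, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(num_iters):
        errors = X @ w + b - y
        w = w - alpha * (X.T @ errors) / N
        b = b - alpha * np.sum(errors) / N
    return np.sum((X @ w + b - y) ** 2) / (2 * N)

# Made-up data following y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# Each candidate is roughly 3 times the previous one
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    print(alpha, final_cost(X, y, alpha))
```

On this data the small alphas lower the cost only slightly in 100 iterations, the middle values get much closer to 0, and alpha = 1.0 makes the cost blow up, which is the sign that it is too big.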

# Polynomial Regression

- Not all data can be represented by lines; you may also need to fit curves
- For example, a quadratic curve: $f_{\vec{w}, b}(x) = w_1x + w_2x^2 + b$
- Or a cubic function: $f_{\vec{w}, b}(x) = w_1x + w_2x^2 + w_3x^3 + b$
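
Polynomial regression is still linear regression, just with engineered features $x$ and $x^2$. The sketch below builds those columns for made-up quadratic data. To keep the example short it fits the parameters with NumPy's least-squares solver `np.linalg.lstsq` instead of Gradient Descent; that is a substitution made for this sketch, not the method from these notes.

```python
import numpy as np

# Made-up data that follows y = 1 + x^2
x = np.arange(0.0, 6.0)
y = 1.0 + x ** 2

# Engineered polynomial features: columns x and x^2
X_poly = np.column_stack([x, x ** 2])

# Append a column of ones so that b is fitted together with the weights
A = np.column_stack([X_poly, np.ones_like(x)])

# Least-squares fit (stand-in for Gradient Descent in this sketch)
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]
print(w, b)
```

The fit should recover approximately $w_1 = 0$, $w_2 = 1$ and $b = 1$. If you fit the same features with Gradient Descent, feature scaling matters even more, because $x$, $x^2$ and $x^3$ have very different ranges.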