<a href="https://colab.research.google.com/github/PaulToronto/Stanford-Andrew-Ng-Machine-Learning-Specialization/blob/main/1_2_2_Gradient_descent_in_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1.2.2 Gradient descent in practice

## 1.2.2.1 Feature scaling

### Feature and parameter values

$$
\widehat{price} = w_1x_1 + w_2x_2 + b
$$

- price is in \$1000s
- $x_1$ represents the size of the house in $\text{feet}^{2}$
 - range: 300 to 2,000 square feet
 - relatively **large range** of values
- $x_2$ represents the number of bedrooms
 - range: 0 to 5 bedrooms
 - relatively **small range** of values

#### Example House

- House: $x_1 = 2000$, $x_2 = 5$, $price = 500$
- $b = 50$
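- Possible parameters: $w_1 = 50, w_2 = 0.1$
 - $\widehat{price} = 50 \times 2000 + 0.1 \times 5 + 50 = 100{,}050.5$, i.e. \$100,050,500
 - clearly, not a good choice for $w_1$ and $w_2$
- Possible parameters: $w_1 = 0.1, w_2 = 50$
 - $\widehat{price} = 0.1 \times 2000 + 50 \times 5 + 50 = 500$, i.e. \$500,000
 - a perfect choice for $w_1$ and $w_2$, since it matches the actual price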
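
The arithmetic for the example house can be checked with a few lines of Python. This is only a sketch: the function name `predict_price` and its argument names are made up for illustration, and the result is in \$1000s, like the price in the notes.

```python
def predict_price(size_sqft, bedrooms, w1, w2, b):
    """Linear model w1*x1 + w2*x2 + b; the result is in $1000s."""
    return w1 * size_sqft + w2 * bedrooms + b

# Example house from the notes: 2000 sq ft, 5 bedrooms, b = 50
poor = predict_price(2000, 5, w1=50, w2=0.1, b=50)
good = predict_price(2000, 5, w1=0.1, w2=50, b=50)

print(round(poor, 1))  # 100050.5, i.e. $100,050,500
print(round(good, 1))  # 500.0, i.e. $500,000
```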

#### What does this show?

- When a feature has a relatively large range, there is a good chance that the optimal weight for that feature will be a small value.
- Likewise, when the possible values of a feature are small, there is a good chance that the optimal weight for that feature will be a large value.





### How does this relate to gradient descent?

<img src='https://drive.google.com/uc?export=view&id=198UySZdh8xm9dyfEWXQOzogdz1Gxdbh7'>

- Notice that the contours form ellipses, not circles
- Because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time before it can finally find its way to the global minimum
- **Feature scaling** enables gradient descent to find a much more direct path to the global minimum.

<img src='https://drive.google.com/uc?export=view&id=1zKhYXV1GtOJxBxjUrRgmp1nHyv1kkWn0'>
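
A rough way to see this effect numerically is to run the same gradient descent loop on raw and on scaled features. The sketch below makes several assumptions that are not in the notes: the four data points are hypothetical, the cost is the usual squared-error cost, z-score normalization is used for scaling, and the two learning rates were picked by hand for this data.

```python
import numpy as np

def run_gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for linear regression; returns the final cost J."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y            # prediction errors, shape (m,)
        w -= alpha * (X.T @ err) / m   # step along dJ/dw
        b -= alpha * err.mean()        # step along dJ/db
    err = X @ w + b - y
    return (err @ err) / (2 * m)

# Hypothetical data: size (sq ft) and bedrooms; prices in $1000s
# generated from w = (0.1, 50), b = 50
X = np.array([[300.0, 0.0], [1200.0, 2.0], [1500.0, 3.0], [2000.0, 5.0]])
y = X @ np.array([0.1, 50.0]) + 50.0

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization

# Raw features need a tiny learning rate; on this data a value
# such as 1e-5 already makes the cost blow up.
cost_raw = run_gradient_descent(X, y, alpha=1e-7, num_iters=1000)
# Scaled features tolerate a much larger learning rate.
cost_scaled = run_gradient_descent(X_scaled, y, alpha=0.1, num_iters=1000)

print(f"unscaled: J = {cost_raw:.3f}")
print(f"scaled:   J = {cost_scaled:.3f}")
```

With the same number of iterations, the scaled run should end at a much lower cost, because the learning rate no longer has to be tiny just to cope with the large-range feature.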

### How to implement scaling

#### Method 1: Divide each value by the maximum

$$
\begin{align}
300 \le x_1 \le 2000 &\rightarrow 0.15 \le x_1 \le 1 \\
0 \le x_2 \le 5 &\rightarrow 0 \le x_2 \le 1
\end{align}
$$

#### Method 2: Mean normalization (centres it at 0)

$$
x_j = \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}
$$

#### Method 3: Z-score normalization

$$
x_j = \frac{x_j - \mu_j}{\sigma_j}
$$
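
The three methods can be sketched with NumPy as follows. The function names and the sample data are hypothetical; each function works column by column on a matrix `X` with one row per example, and assumes no feature is constant (and, for method 1, that the maximum is positive).

```python
import numpy as np

def max_scale(X):
    """Method 1: divide each feature (column) by its maximum."""
    return X / X.max(axis=0)

def mean_normalize(X):
    """Method 2: subtract the mean, then divide by the range (max - min)."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalize(X):
    """Method 3: subtract the mean, then divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical data: columns are size (sq ft) and number of bedrooms
X = np.array([[300.0, 0.0], [1200.0, 2.0], [1500.0, 3.0], [2000.0, 5.0]])

print(max_scale(X))
print(mean_normalize(X))
print(zscore_normalize(X))
```

When scaling is used for training, the same $\mu_j$ and $\sigma_j$ (or max and min) would also need to be applied to any new example before making a prediction.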

### Feature Scaling Examples

- aim for about $-1 \le x_j \le 1$ for each feature $j$
- the following are all ok:
 - $-3 \le x_j \le 3$
 - $-0.3 \le x_j \le 0.3$
 - $0 \le x_j \le 3$
 - $-2 \le x_j \le 0.5$
- not ok, should rescale
 - $-100 \le x_3 \le 100$, too large
 - $-0.0001 \le x_4 \le 0.0001$, too small
 - $98.6 \le x_5 \le 105$, too large

## 1.2.2.2 Checking gradient descent for convergence

### Gradient descent

$$
\begin{align}
w_j &= w_j - \alpha \frac{\partial}{\partial w_j}J\left(\vec{w}, b\right) \\
b &= b - \alpha \frac{\partial}{\partial b}J\left(\vec{w}, b\right)
\end{align}
$$

- One of the key choices is the choice of the learning rate, $\alpha$.
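
A single update step might look like the sketch below. It assumes $J$ is the squared-error cost $\frac{1}{2m}\sum_i (f(\vec{x}^{(i)}) - y^{(i)})^2$, which is not restated in this section; the function name `gradient_step` and the tiny data set are hypothetical.

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One simultaneous update of w and b for squared-error linear regression."""
    m = X.shape[0]
    err = X @ w + b - y          # f(x) - y for every example
    dj_dw = (X.T @ err) / m      # partial derivative with respect to each w_j
    dj_db = err.sum() / m        # partial derivative with respect to b
    # Both gradients are computed before either parameter is changed.
    return w - alpha * dj_dw, b - alpha * dj_db

# Hypothetical single-feature data that lies on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

w, b = np.zeros(1), 0.0
w, b = gradient_step(X, y, w, b, alpha=0.1)
print(w, b)
```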

### Make sure that gradient descent is working correctly

objective: $\min_{\vec{w},b}J(\vec{w},b)$

- one strategy is to plot **$J(\vec{w},b)$ vs. # iterations**
- this curve is called a **learning curve**
- there are a few different types of learning curves used in machine learning
- if gradient descent is working properly, $J(\vec{w},b)$ should decrease after each iteration
- if J ever increases after one iteration, that means $\alpha$ is chosen poorly
 - usually means $\alpha$ is too large
 - or there could be a bug in the code
- As the number of iterations increases, J should start to level off
 - that means gradient descent has **converged**
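
To get the data for such a curve, the cost can be recorded after every iteration. The sketch below does this for a single-feature model in plain Python; the function name `cost_history`, the data, and the learning rate are all assumptions made for illustration.

```python
def cost_history(x, y, alpha, num_iters):
    """Run gradient descent on f(x) = w*x + b and record J after every iteration."""
    m = len(x)
    w, b = 0.0, 0.0
    history = []
    for _ in range(num_iters):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        cost = sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)
        history.append(cost)
    return history

# Hypothetical data that lies on y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

history = cost_history(x, y, alpha=0.05, num_iters=200)

# If gradient descent is working, every cost is no larger than the one before it.
decreasing = all(later <= earlier for earlier, later in zip(history, history[1:]))
print(decreasing, history[0], history[-1])
```

Passing `history` to a plotting call such as matplotlib's `plt.plot(history)` would draw the learning curve itself.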



### Automatic convergence test

Let $\epsilon$ be $10^{-3}$ or some other small number.

If $J(\vec{w},b)$ decreases by $\le \epsilon$ in one iteration, declare **convergence**.
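
One possible way to code this test is sketched below for a single-feature model. The function name `fit_until_converged`, the `max_iters` safety cap, and the data are assumptions, not part of the course material.

```python
def fit_until_converged(x, y, alpha, epsilon=1e-3, max_iters=10_000):
    """Stop once J decreases by no more than epsilon in one iteration."""
    m = len(x)
    w, b = 0.0, 0.0

    def cost(w, b):
        return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

    prev_cost = cost(w, b)
    for iteration in range(1, max_iters + 1):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        new_cost = cost(w, b)
        # Require a decrease (>= 0) so that a rising cost is never called convergence.
        if 0 <= prev_cost - new_cost <= epsilon:
            return w, b, iteration
        prev_cost = new_cost
    return w, b, max_iters   # reached the cap without meeting the test

# Hypothetical data that lies on y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

w, b, iters = fit_until_converged(x, y, alpha=0.05)
print(w, b, iters)
```

Choosing $\epsilon$ is a judgment call: a larger value stops earlier but may leave the parameters noticeably short of the minimum, which is one reason to look at the learning curve as well.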

## 1.2.2.3 Choosing the learning rate

- If you plot the learning curve and notice the cost goes up and down, that is a good indication that there is a bug in your code, or that the learning rate is too large
    - Try setting a smaller learning rate
- Sometimes you will see a learning curve where the cost is consistently increasing; this could also be caused by a learning rate that is too large, or by a bug in your code
    - debugging tip: with a small enough learning rate, $J(\vec{w},b)$ should decrease on every iteration, so set $\alpha$ to a very small number for testing
- values of $\alpha$ to try:
 - $\cdots 0.001, 0.01, 0.1, 1, \cdots$
 - or roughly 3 times larger each time: $\cdots 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \cdots$
 - start with the smallest value and run gradient descent for a handful of iterations, then try a larger one, with the goal of finding $\alpha$ such that the cost decreases rapidly, but also consistently
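
The search over learning rates can be sketched as a simple loop. Everything here is illustrative: the function name `learning_curve`, the data, the choice of 20 iterations, and the rule "keep the rates whose cost falls on every iteration, then take the one with the lowest final cost".

```python
def learning_curve(x, y, alpha, num_iters):
    """Costs J before training and after each gradient descent iteration."""
    m = len(x)
    w, b = 0.0, 0.0

    def cost(w, b):
        return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

    history = [cost(w, b)]
    for _ in range(num_iters):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        history.append(cost(w, b))
    return history

# Hypothetical data that lies on y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {}
for alpha in candidates:
    history = learning_curve(x, y, alpha, num_iters=20)
    consistent = all(b < a for a, b in zip(history, history[1:]))
    results[alpha] = (consistent, history[-1])
    print(f"alpha={alpha}: consistently decreasing={consistent}, final J={history[-1]:.4g}")

# Among the rates with a consistently decreasing cost, keep the fastest one.
usable = [a for a, (ok, _) in results.items() if ok]
best_alpha = min(usable, key=lambda a: results[a][1])
print("best alpha:", best_alpha)
```

On this data the two largest candidates make the cost grow, so they are discarded, which matches the advice above to back off to a smaller learning rate when the cost increases.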

## 1.2.2.4 Lab - Feature Scaling and Learning Rate

https://colab.research.google.com/drive/1OH2UCFia4HRa9sXWvin6m84v2wQKCXDb

## 1.2.2.5 Feature Engineering

Say you have two features to predict the price of a house:

1. $x_1$ is the *frontage* or width of the lot
2. $x_2$ is the *depth* of the rectangular lot

$$
f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + b
$$

This model might work well, but there is another option.

The area of the lot can be calculated from the frontage and the depth.

$$
area = frontage \times depth
$$

You may have an intuition that the area of the land is more predictive of the price than the frontage and depth as separate features, so you define a new feature:

$$
x_3 = x_1 \times x_2
$$

This gives us a new possible model:

$$
f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + b
$$
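
In code, the engineered feature is just an extra column. The sketch below uses NumPy with hypothetical lot dimensions; the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical lots: columns are frontage (x1) and depth (x2), in feet
X = np.array([[40.0, 100.0], [60.0, 120.0], [50.0, 80.0]])

area = X[:, 0] * X[:, 1]                    # x3 = x1 * x2
X_engineered = np.column_stack([X, area])   # columns: x1, x2, x3

def predict(X, w, b):
    """Linear model applied to whatever feature columns X contains."""
    return X @ w + b

print(X_engineered)
```

The model is still linear in its parameters, so the same gradient descent procedure applies; only the feature matrix has changed.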

### Definition: Feature engineering

Using **intuition** to design **new features**, by transforming or combining original features.

## 1.2.2.6 Polynomial regression

<img src='https://drive.google.com/uc?export=view&id=1lt8-gP5BkAgBhAy-qfHVrGHxHRnqCKPo'>

- A quadratic model might fit this data, but a quadratic curve eventually comes back down, which doesn't make sense because we would not expect housing prices to fall as the size increases.
 - A cubic function might work better:

 $$
 f_{\vec{w},b}(x) = w_1x + w_2x^2 + w_3x^3 + b
 $$

- NOTE: as you create new features by squaring or cubing existing features, **feature scaling becomes increasingly important**
- Here is another reasonable alternative:

$$
f_{\vec{w},b}(x) = w_1x + w_2\sqrt{x} + b
$$
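
The polynomial and square-root features can be built the same way as any other engineered feature. The sketch below uses hypothetical sizes from 1 to 20 and z-score normalization to show how different the ranges of $x$, $x^2$ and $x^3$ are before scaling.

```python
import numpy as np

x = np.arange(1.0, 21.0)                      # hypothetical sizes 1, 2, ..., 20

X_poly = np.column_stack([x, x**2, x**3])     # features for the cubic model
ranges_before = X_poly.max(axis=0) - X_poly.min(axis=0)

# Z-score normalization brings the three columns onto comparable scales
X_poly_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)
ranges_after = X_poly_scaled.max(axis=0) - X_poly_scaled.min(axis=0)

X_sqrt = np.column_stack([x, np.sqrt(x)])     # features for the square-root model

print("ranges before scaling:", ranges_before)
print("ranges after scaling: ", ranges_after)
```

Before scaling, the range of $x^3$ is hundreds of times larger than the range of $x$, which is why scaling matters more once powers of a feature are added.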

  

## 1.2.2.7 Lab - Feature engineering and Polynomial regression

https://colab.research.google.com/drive/1lnKD6XZNOn1LOii-Iv8xYZcQ1F2oKTYe




## 1.2.2.8 Lab - Linear regression with scikit-learn

https://colab.research.google.com/drive/1a2thuEs25-bG9bD-bc_o-KPoyDm5r8gx