# Gradient Descent in Practice

## Feature Scaling

The convergence rate of gradient descent also depends on the ranges of the feature values. When the features have very different ranges, the contours of the cost function become elongated (tall and narrow, or short and wide), which slows convergence. Rescaling the features so that they take comparable ranges of values improves the rate of convergence. Some common ways to rescale are described below.

### Divide by Maximum

Given a feature range of $a\leq x_1 \leq b$ with $b > 0$, we can simply divide by the maximum $b$ to obtain $\frac{a}{b} \leq \frac{x_1}{b} \leq 1$.
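As a minimal sketch of dividing by the maximum, the snippet below scales each column of a small feature matrix. The function name `scale_by_max` and the sample data (house size and number of bedrooms) are made up for illustration, and the code assumes every column has a positive maximum.

```python
import numpy as np

def scale_by_max(X):
    """Divide each column of X by its maximum (assumes positive maxima)."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=0)

# Hypothetical data: size in square feet, number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [852.0, 2.0]])
X_scaled = scale_by_max(X)
```

After scaling, the largest entry in each column is 1, so both features top out at the same value even though their raw ranges differ by orders of magnitude.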

### Mean Normalization

This centers the range around 0. Given a feature range of $a\leq x_1 \leq b$, let $\bar{x}$ be the mean of the feature's values. Then we have

$$\frac{a-\bar{x}}{b-a} \leq \frac{x_1-\bar{x}}{b-a} \leq \frac{b-\bar{x}}{b-a} $$
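A possible NumPy sketch of mean normalization follows. It takes $a$ and $b$ to be the observed column minimum and maximum, which is an assumption about how the range is estimated; the function name `mean_normalize` and the data are hypothetical.

```python
import numpy as np

def mean_normalize(X):
    """Subtract the column mean, then divide by the column range (max - min)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)  # b - a, estimated from the data
    return (X - mu) / spread

# Hypothetical data: size in square feet, number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [852.0, 2.0]])
X_norm = mean_normalize(X)
```

Each column of `X_norm` has mean 0 and spans an interval of width 1.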

### Z-score Normalization

Given a feature range of $a\leq x_1 \leq b$, let $\bar{x}$ and $\sigma$ be the mean and standard deviation, respectively, of the feature's values. Then we have

$$\frac{a-\bar{x}}{\sigma} \leq \frac{x_1-\bar{x}}{\sigma} \leq \frac{b-\bar{x}}{\sigma} $$
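The z-score transformation might be sketched as follows. The function name `zscore_normalize` is hypothetical, and the code uses the population standard deviation (NumPy's default, `ddof=0`), which is an assumption since the notes do not say which estimator is meant.

```python
import numpy as np

def zscore_normalize(X):
    """Return the z-scored columns of X together with the mean and std used."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical data: size in square feet, number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [852.0, 2.0]])
X_z, mu, sigma = zscore_normalize(X)
```

Returning `mu` and `sigma` lets the same transformation be applied to new inputs at prediction time, for example `(x_new - mu) / sigma`.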

A general rule of thumb for feature rescaling is to aim for each feature to lie roughly between -1 and 1.

## Convergence of Gradient Descent

To analyze the convergence of gradient descent, one may plot the cost function against the iteration number. The resulting curve is called the learning curve. The cost should decrease after every iteration; in other words, it should be monotonically decreasing. If it does not, the learning rate $\alpha$ may be poorly chosen (typically too large) or a bug is present in the code. One way to see the effect of a poorly chosen $\alpha$ is to plot the cost as a function of a single parameter and check whether the updates zigzag around the minimum instead of settling into it.

Another way to analyze convergence is to use an automatic convergence test. Let $\varepsilon>0$ be small. If the cost function decreases by at most $\varepsilon$ in one iteration, we can assume that gradient descent has converged.
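Both checks can be combined in one loop. The sketch below runs batch gradient descent for linear regression with a squared-error cost, records the cost at every step (the data for a learning curve), and stops once the decrease falls to $\varepsilon$ or below. The function names, the default values of `alpha`, `epsilon` and `max_iters`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J = (1 / 2m) * sum((Xw + b - y)^2)."""
    err = X @ w + b - y
    return (err @ err) / (2 * len(y))

def gradient_descent(X, y, alpha=0.1, epsilon=1e-6, max_iters=10_000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = [compute_cost(X, y, w, b)]  # cost per iteration
    for _ in range(max_iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
        history.append(compute_cost(X, y, w, b))
        decrease = history[-2] - history[-1]
        if decrease < 0:
            # Cost went up: alpha is probably too large, or there is a bug
            raise RuntimeError("cost increased; try a smaller alpha")
        if decrease <= epsilon:
            break  # automatic convergence test
    return w, b, history

# Synthetic data generated from y = 3x + 2 (illustrative only)
X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = 3.0 * X[:, 0] + 2.0
w, b, history = gradient_descent(X, y)
```

Plotting `history` against its index gives the learning curve. Note that the test only says the cost has stopped changing much; the parameters can still be slightly away from the minimizer when the loop exits.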

## Feature Engineering

New features can be created by combining or transforming the original features, for example $x_3 = x_1 x_2$. This in turn allows non-linear functions to be fitted.
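A short sketch of the product feature $x_3 = x_1 x_2$, using made-up data in which the two columns are read as the frontage and depth of a plot of land (so the product is its area):

```python
import numpy as np

# Hypothetical data: columns are x1 (frontage) and x2 (depth)
X = np.array([[20.0, 30.0],
              [15.0, 40.0],
              [25.0, 10.0]])

x3 = X[:, 0] * X[:, 1]             # engineered feature x3 = x1 * x2
X_eng = np.column_stack([X, x3])   # columns: x1, x2, x3
```

A model that is linear in the columns of `X_eng` is no longer linear in the original pair $(x_1, x_2)$.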

## Polynomial Regression

This involves regression of the form

$$f_{\vec{w},b}(x_1) = \sum_{i=1}^{n} w_i x_{1}^{i} + b$$

Regression can still be done as ordinary linear regression by treating each power as its own feature: column $i$ of the matrix $X$ holds $x_{1}^{i}$, e.g. $x_{1}^{2}$ in the second column of $X$.
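The construction of $X$ can be sketched as below. To keep the example short, the weights are found with NumPy's least-squares solver `np.linalg.lstsq` rather than gradient descent; that substitution, the helper name `polynomial_features`, and the noise-free cubic data are all assumptions made for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Build X whose i-th column (1-based) is x**i."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**i for i in range(1, degree + 1)])

# Illustrative data from y = 1 + 2x - 0.5x^3 (no noise)
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x - 0.5 * x**3

X = polynomial_features(x, degree=3)       # columns: x, x^2, x^3
A = np.column_stack([X, np.ones_like(x)])  # extra column of ones for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]
```

When gradient descent is used instead, feature scaling becomes especially important here, because the columns $x_1, x_1^2, x_1^3, \dots$ can have very different ranges.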