-----------
# Outline of Notebook
- ### ML Definitions/Intro
- ### Linear Regression with One Feature
- ### Gradient Descent
- ### Properties of Gradient Descent / Why it works?
-----------

# Definitions

ML = field of study that gives computers the ability to learn without being explicitly programmed

Supervised Learning = algorithms that learn x to y or input to output mappings
- Training data includes the "right answers" (the correct output $y$ for each input $x$)
- Regression - predicts a number from infinitely many possible values
- Classification - predicts a label/category

Unsupervised Learning
- Don't give "right answers", only input
- Clustering - assigns input to groups
- Anomaly Detection - finds unusual data points
- Dimensionality Reduction - compresses data

# Linear Regression: Fit Line to Data

![](2022-07-19-14-08-59.png)


Assume that we are trying to fit a line through a dataset which has one feature - size of house in square feet - and the target variable is the price of the house in $1000's. Below is an example of that dataset:

![](2022-07-19-14-37-49.png)

$x^{(i)}$ = i'th house's size in square feet in training data

$y^{(i)}$ = i'th house's price in thousands of dollars in training data

$(x^{(i)}, y^{(i)})$ = a data point in the training dataset / on the graph above

$N$ = # of training examples


<u>Model:</u> $f_{w, b}(x) = wx + b$
- <u>Parameters:</u> $w, b$
- Parameters of a model = variables adjusted during training to improve model
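
As a minimal sketch of the model above, the prediction is a single multiply and add. The function name `predict` and the parameter values are assumptions chosen only for illustration; they are not fitted to the dataset in the figure.

```python
def predict(x, w, b):
    """Linear model f_{w,b}(x) = w*x + b for a single feature x."""
    return w * x + b


# Hypothetical parameters, not learned from any real data
w, b = 0.2, 50.0

# Predicted price (in $1000's) for a house of 1000 square feet
print(predict(1000.0, w, b))
```

Training consists of adjusting `w` and `b` so that these predictions get closer to the prices in the training data.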

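Cost Function = a measure of how well the model fits the training data with its current parameters (lower is better)
- In this case, our cost function is the Least Squares Cost Function, where $\hat{y}^{(i)} = f_{w, b}(x^{(i)})$ is the model's prediction for the i'th example: $$\frac{1}{2N}\sum_{i = 1}^{N}(\hat{y}^{(i)} - y^{(i)})^2$$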
- <u>Final Cost Function:</u> $$J(w, b) = \frac{1}{2N}\sum_{i = 1}^{N}((wx^{(i)} + b) - y^{(i)})^2$$
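
The cost function can be sketched directly from the formula for $J(w, b)$. The name `compute_cost` and the tiny dataset are assumptions made up for illustration; the data follows $y = 2x$ exactly so the result is easy to check by hand.

```python
def compute_cost(xs, ys, w, b):
    """Squared error cost J(w, b) = 1/(2N) * sum((w*x + b - y)^2)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # prediction minus target
        total += error ** 2
    return total / (2 * n)


# Made-up data lying exactly on the line y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(compute_cost(xs, ys, 2.0, 0.0))  # 0.0, since the line fits perfectly
print(compute_cost(xs, ys, 1.0, 0.0))  # larger, since w = 1 underestimates every y
```

A worse choice of parameters gives a larger cost, which is what makes $J$ usable as a target for minimization.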

<u>Our goal is to choose $w, b$ that minimize $J(w, b)$</u>

## Gradient Descent

- Start with some $w, b$
- w is updated: $w = w - \alpha\frac{\partial}{\partial w}J(w, b)$
- b is updated: $b = b - \alpha\frac{\partial}{\partial b}J(w, b)$
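- $\alpha$ = learning rate, a positive number (usually small, often well below 1) that controls how big each step is
- $w, b$ must be updated simultaneously: compute both partial derivatives using the current $w, b$, then update both. If $w$ is updated first, the derivative with respect to $b$ is evaluated at the new $w$, so $b$ is updated incorrectly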
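
For the Least Squares cost $J(w, b) = \frac{1}{2N}\sum_{i = 1}^{N}((wx^{(i)} + b) - y^{(i)})^2$, the two partial derivatives in the update rules can be worked out with the chain rule. The factor of 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost:

$$\frac{\partial}{\partial w}J(w, b) = \frac{1}{2N}\sum_{i = 1}^{N}2((wx^{(i)} + b) - y^{(i)})x^{(i)} = \frac{1}{N}\sum_{i = 1}^{N}((wx^{(i)} + b) - y^{(i)})x^{(i)}$$

$$\frac{\partial}{\partial b}J(w, b) = \frac{1}{2N}\sum_{i = 1}^{N}2((wx^{(i)} + b) - y^{(i)}) = \frac{1}{N}\sum_{i = 1}^{N}((wx^{(i)} + b) - y^{(i)})$$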

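Another way to think of Gradient Descent is as a single vector update: $\begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} w \\ b \end{bmatrix} - \alpha \nabla J(w, b)$, where $\nabla J(w, b) = \begin{bmatrix} \frac{\partial J}{\partial w} \\ \frac{\partial J}{\partial b} \end{bmatrix}$ is the gradient of $J$ (for a scalar-valued function like $J$, the Jacobian is the transpose of the gradient).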

We can use Gradient Descent to find the values of the parameters $w, b$ that minimize $J(w, b)$. This gives us our function $f(x) = wx + b$, with which we can predict house prices based on the size of the house.
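
The whole procedure can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `gradient_descent`, the default `alpha` and `num_iters`, and the tiny dataset (generated from $y = 2x + 1$) are all assumptions. The partial derivatives used are those of the Least Squares cost $J(w, b)$.

```python
def gradient_descent(xs, ys, w=0.0, b=0.0, alpha=0.05, num_iters=5000):
    """Fit f(x) = w*x + b by batch gradient descent on the squared error cost."""
    n = len(xs)
    for _ in range(num_iters):
        # Both partial derivatives are computed from the CURRENT w and b
        dj_dw = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        dj_db = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / n
        # Simultaneous update: neither new value is used to compute the other
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Made-up data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)  # close to 2 and 1
```

With real house sizes in the thousands of square feet, the feature would typically need to be rescaled or `alpha` made much smaller for the updates to stay stable.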

# Properties of Gradient Descent / Why it works?

NOTE: We have used a simplified view of J to understand the concept. We have eliminated the parameter b in this case.

![](2022-07-19-15-26-55.png)

If we start at point a, the derivative of $J$ is positive, so subtracting $\alpha$ times the derivative decreases $w$, moving it towards the minimum (which is what we want)

If we start at point b, the derivative of $J$ is negative, so subtracting $\alpha$ times the derivative increases $w$, moving it towards the minimum (which is what we want)

![](2022-07-19-15-07-09.png)

If $\alpha$ is too small, Gradient Descent will take very small steps so it will take a very long time to converge.

If $\alpha$ is too large, Gradient Descent will take steps that are too big: it can overshoot the minimum and end up farther away after each step, so it may fail to converge or even diverge.
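
Both failure modes are easy to reproduce on a toy cost. The sketch below assumes $J(w) = w^2$ (so the derivative is $2w$ and the minimum is at $w = 0$); the function name `run` and the three learning rates are chosen only for illustration.

```python
def run(alpha, w=1.0, steps=50):
    """Run gradient descent on the toy cost J(w) = w^2 and return the final w."""
    for _ in range(steps):
        w = w - alpha * 2 * w  # dJ/dw = 2w
    return w


print(run(0.001))  # too small: after 50 steps w has barely moved from 1.0
print(run(0.4))    # reasonable: w is essentially at the minimum, 0
print(run(1.1))    # too large: each step overshoots and |w| keeps growing
```

For this particular cost, each update multiplies $w$ by $(1 - 2\alpha)$, so the iterates shrink only when $0 < \alpha < 1$ and blow up when $\alpha > 1$. The safe range depends on the cost function, so it is not a general rule.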

Gradient Descent works even when $\alpha$ is constant because the magnitude of the derivative term decreases as you get closer to the minimum. Therefore, gradient descent automatically takes smaller steps as it approaches a minimum, and the updates stop changing the parameters once the derivative term evaluates to 0, which is the case at the minimum.
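
The shrinking steps can be seen by recording the size of each update. This sketch again assumes the toy cost $J(w) = w^2$; the function name `step_sizes` and the starting point are hypothetical.

```python
def step_sizes(alpha=0.1, w=4.0, steps=5):
    """Return the magnitude of each gradient descent step on J(w) = w^2."""
    sizes = []
    for _ in range(steps):
        step = alpha * 2 * w  # alpha times the derivative dJ/dw = 2w
        sizes.append(abs(step))
        w = w - step
    return sizes


sizes = step_sizes()
print(sizes)  # each entry is smaller than the one before, with alpha held fixed
```

The learning rate never changes here; the steps shrink only because the derivative does.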

<u>PROBLEM:</u> Gradient Descent can converge to a local minimum rather than the global minimum, depending on the starting parameters. This doesn't happen with Linear Regression: its squared error cost function is convex (bowl-shaped), so it has no local minima other than the global minimum.
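
To see the dependence on the starting point, here is a sketch on a non-convex toy cost. The function $J(w) = w^4 - 3w^2 + w$ is an assumption picked for illustration (it is not a Linear Regression cost); it has a global minimum near $w \approx -1.30$ and a shallower local minimum near $w \approx 1.13$.

```python
def descend(w, alpha=0.01, steps=1000):
    """Gradient descent on the non-convex toy cost J(w) = w^4 - 3w^2 + w."""
    for _ in range(steps):
        grad = 4 * w ** 3 - 6 * w + 1  # dJ/dw
        w = w - alpha * grad
    return w


w_left = descend(-2.0)   # starts left of the hump, ends in the global minimum
w_right = descend(2.0)   # starts right of the hump, ends in the local minimum
print(w_left, w_right)
```

Same algorithm, same learning rate, different answers: only the starting value of `w` changed.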