# Gradient Descent
* Let's take a close look at gradient descent
* It is used extensively in deep learning, and is general enough to be used in a wide variety of situations
## Concept
* The idea is as follows: you have a function you want to minimize, and you want to find the inputs that minimize it
* Usually the thing we want to minimize is a cost or error function
* We can maximize things as well, for instance the likelihood of some probability distribution
* To do that, all we need to do is reverse the signs: maximizing a function is the same as minimizing its negative
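
The sign reversal for maximization can be sketched in a few lines. This is a minimal illustration, not part of the original example: the function f(w) = -(w - 3)^2 and the name `f_grad` are assumptions chosen only because the maximum is known to be at w = 3.

```python
# Hypothetical example: maximize f(w) = -(w - 3)**2, whose maximum is at w = 3.
def f_grad(w):
    # Derivative of f(w) = -(w - 3)**2
    return -2.0 * (w - 3.0)

w = 0.0       # arbitrary starting point
alpha = 0.1   # learning rate

for _ in range(100):
    # Minimizing -f means stepping against the gradient of -f,
    # which is the same as stepping along the gradient of f.
    w = w - alpha * (-f_grad(w))

print(w)  # ends up very close to 3
```

The loop body is ordinary gradient descent applied to -f, so no separate "ascent" routine is needed.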

![gradient%20descent.png](attachment:gradient%20descent.png)

# Example
* To demonstrate this, we are going to do a very simple 1-d example
* Although in machine learning we are usually working in more than 1 dimension, this will help us visualize what is going on

So, let's start with a simple function:
### $$J = w^2$$
We know that the minimum is at w = 0, but let's pretend we don't. Our weights are randomly initialized, so let's suppose we start at:
### $$w=20$$
We know the gradient is:
### $$\frac{dJ}{dw} = 2w$$
and we can set our learning rate to: 
### $$\alpha = 0.1$$
So on our first iteration we apply the update rule:
### Iteration 1
### $$w \leftarrow w-\alpha \cdot \frac{dJ}{dw}$$
### $$w = 20 - 0.1 \cdot 2 \cdot 20 = 16$$
So now w = 16 and we repeat.
### Iteration 2
### $$w = 16 - 0.1 \cdot 2 \cdot 16 = 12.8$$
Now w = 12.8 and we repeat.
### Iteration 3
### $$w = 12.8 - 0.1 \cdot 2 \cdot 12.8 = 10.24$$

So, we can see that on each iteration w gets closer and closer to the actual minimum at w = 0, but we take smaller and smaller steps each time. The steps shrink because each step is the learning rate times the gradient, and the gradient 2w gets smaller as w approaches zero.
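
The shrinking steps can also be read off the update rule directly. As a short worked derivation for this specific example (J = w^2, alpha = 0.1, starting at w = 20), write w_t for the value of w after t iterations; the subscript is notation introduced here and is not used elsewhere in this notebook:

### $$w_{t+1} = w_t - \alpha \cdot 2w_t = (1 - 2\alpha)\,w_t = 0.8\,w_t$$
### $$w_t = 20 \cdot 0.8^t$$
### $$|w_{t+1} - w_t| = 2\alpha\,|w_t| = 0.2\,|w_t|$$

So each iteration shrinks w by the same factor of 0.8, and the step length is always 20% of the current w, which is why the steps get smaller as w approaches 0.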

Let's try this in code!

## Code

In [2]:
import numpy as np

w = 20        # initial weight
alpha = 0.1   # learning rate

for i in range(30):
    gradient = 2*w             # dJ/dw for J = w^2
    w = w - alpha*gradient     # step against the gradient
    print(w)

16.0
12.8
10.24
8.192
6.5536
5.24288
4.194304000000001
3.3554432000000007
2.6843545600000005
2.1474836480000006
1.7179869184000005
1.3743895347200004
1.0995116277760002
0.8796093022208001
0.7036874417766401
0.562949953421312
0.45035996273704965
0.3602879701896397
0.28823037615171176
0.23058430092136942
0.18446744073709553
0.14757395258967643
0.11805916207174114
0.09444732965739291
0.07555786372591433
0.06044629098073147
0.048357032784585176
0.03868562622766814
0.030948500982134513
0.02475880078570761


We can see that after 30 steps, w has only reached about 0.0248. So it looks as though 30 steps is not enough! Let's try 100.

In [3]:
w = 20
alpha = 0.1

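# Same update as before, but run for 100 iterations
for i in range(100):
    gradient = 2*w             # dJ/dw for J = w^2
    w = w - alpha*gradient     # step against the gradient
    print(w)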

16.0
12.8
10.24
8.192
6.5536
5.24288
4.194304000000001
3.3554432000000007
2.6843545600000005
2.1474836480000006
1.7179869184000005
1.3743895347200004
1.0995116277760002
0.8796093022208001
0.7036874417766401
0.562949953421312
0.45035996273704965
0.3602879701896397
0.28823037615171176
0.23058430092136942
0.18446744073709553
0.14757395258967643
0.11805916207174114
0.09444732965739291
0.07555786372591433
0.06044629098073147
0.048357032784585176
0.03868562622766814
0.030948500982134513
0.02475880078570761
0.01980704062856609
0.015845632502852872
0.012676506002282298
0.01014120480182584
0.008112963841460671
0.006490371073168537
0.00519229685853483
0.004153837486827864
0.0033230699894622913
0.002658455991569833
0.002126764793255866
0.001701411834604693
0.0013611294676837543
0.0010889035741470034
0.0008711228593176028
0.0006968982874540822
0.0005575186299632657
0.00044601490397061256
0.00035681192317649007
0.00028544953854119207
0.00022835963083295366
0.00018268770466636292
0.00014615016373309

### This brings us very close to 0!
* By taking small steps in the direction opposite to the gradient of a function, we get closer to the minimum of that function
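
Picking the number of iterations by hand (30, then 100) is a guess. One common alternative, sketched here under assumptions, is to stop once the update becomes smaller than some tolerance. The helper name `gradient_descent` and the parameters `tol` and `max_iters` are hypothetical and chosen for illustration only.

```python
def gradient_descent(grad, w0, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Hypothetical helper: repeat the update until it is smaller than tol."""
    w = w0
    steps = 0
    for _ in range(max_iters):
        update = alpha * grad(w)   # size of the step we are about to take
        w = w - update             # move against the gradient
        steps += 1
        if abs(update) < tol:      # the last step was tiny, so stop
            break
    return w, steps

# Same problem as above: J = w^2, so dJ/dw = 2w, starting from w = 20.
w_min, steps = gradient_descent(lambda w: 2 * w, w0=20)
print(w_min, steps)
```

The `max_iters` cap is there so the loop still terminates if the tolerance is never reached, for example with a poorly chosen learning rate.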

# Why is this so important? 
* Well, as we progress into deep learning and machine learning, the functions will get more complicated 
* For regular neural networks with softmax, it might take you a few hours or days to get the derivatives right the first time
* And what about when we go to convolutional neural networks, or recurrent neural networks? It's definitely possible to derive the gradients on paper, but you don't want to!
* Our time is much better spent testing different architectures and hyperparameters, without having to worry about the gradients!
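
When a gradient has been derived by hand, one simple sanity check is to compare it against a finite-difference approximation. The sketch below is an illustration under assumptions, not a method taken from this notebook: `numerical_gradient` and `eps` are hypothetical names, and the check is shown only on the J = w^2 example from above.

```python
def numerical_gradient(f, w, eps=1e-6):
    # Central difference approximation of df/dw at w
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def J(w):
    # The cost function from the example above
    return w ** 2

w = 20.0
approx = numerical_gradient(J, w)   # estimated slope at w = 20
exact = 2 * w                       # hand-derived gradient dJ/dw = 2w
print(approx, exact)
```

If the two numbers disagree by more than a small rounding error, the hand-derived gradient is probably wrong.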