## Gradient Checks
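Avoid the forward difference, whose error is $O(h)$:

$ \dfrac{df(x)}{dx} \approx \dfrac{f(x+h)-f(x)}{h} $

Use the centered difference instead, whose error is $O(h^2)$:

$ \dfrac{df(x)}{dx} \approx \dfrac{f(x+h)-f(x-h)}{2h} $

A typical step size is h = 1e-5.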

### Relative error
$ \dfrac{|f'_{a}-f'_{n}|}{\max(|f'_{a}|,|f'_{n}|)} $

where $f'_{n}$ is the numerical gradient and $f'_{a}$ is the analytic gradient.

relative error > 1e-2: the gradient is probably wrong.

1e-2 > relative error > 1e-4: should make you feel uncomfortable.

1e-4 > relative error: usually okay for objectives with kinks. If there are no kinks (e.g. tanh nonlinearities and softmax), 1e-4 is too high.

relative error < 1e-7: you should be happy.
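
The check above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full gradient checker: the test function `tanh`, the point `x0 = 0.3`, and the helper names are assumptions chosen for the example, and the zero-denominator guard in `relative_error` is an added convention that the notes do not cover.

```python
import math

def forward_difference(f, x, h=1e-5):
    """One-sided estimate of df/dx; error shrinks like O(h)."""
    return (f(x + h) - f(x)) / h

def centered_difference(f, x, h=1e-5):
    """Two-sided estimate of df/dx; error shrinks like O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def relative_error(analytic, numeric):
    """|a - n| / max(|a|, |n|); returns 0 when both gradients are exactly 0 (assumed convention)."""
    denom = max(abs(analytic), abs(numeric))
    return 0.0 if denom == 0.0 else abs(analytic - numeric) / denom

# Hypothetical smooth test function and its known derivative
def f(x):
    return math.tanh(x)

def f_prime(x):
    return 1.0 - math.tanh(x) ** 2

x0 = 0.3
analytic = f_prime(x0)
err_forward = relative_error(analytic, forward_difference(f, x0))
err_centered = relative_error(analytic, centered_difference(f, x0))
print("forward :", err_forward)
print("centered:", err_centered)
```

Because `tanh` has no kinks, the centered estimate should land well inside the "happy" range, while the forward estimate is noticeably worse at the same `h`.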

## Ratio of weights:updates

In [2]:
import numpy as np

# assume parameter vector W and its gradient vector dW
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate*dW # simple SGD update
update_scale = np.linalg.norm(update.ravel())
W += update # the actual update
print(update_scale / param_scale) # want ~1e-3


## Parameter updates

In [3]:
#Vanilla update
x += - learning_rate * dx


In [4]:
# Momentum update: better convergence rates on deep networks
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position


In [5]:
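# Nesterov momentum
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v

# Equivalent form that stores the look-ahead point in x, so dx is the gradient
# at the stored x (use this instead of the block above, not in addition to it)
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form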

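
The update rules in this section can be tried on a toy problem. The sketch below assumes a hypothetical quadratic objective (`curvature`, `loss`, `grad`) and hand-picked values for `learning_rate`, `mu`, and `num_steps`; none of these come from the notes. It runs vanilla SGD, momentum, and both forms of Nesterov momentum, and checks that the rewritten Nesterov form tracks the look-ahead point `x + mu * v` of the first form.

```python
import numpy as np

# Hypothetical objective: f(x) = 0.5 * sum(curvature * x**2), minimized at x = 0
curvature = np.array([1.0, 10.0])

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

learning_rate, mu, num_steps = 0.01, 0.9, 200  # assumed values for the demo
x0 = np.array([1.0, 1.0])

# Vanilla update
x_sgd = x0.copy()
for _ in range(num_steps):
    x_sgd += -learning_rate * grad(x_sgd)

# Momentum update
x_mom, v = x0.copy(), np.zeros_like(x0)
for _ in range(num_steps):
    v = mu * v - learning_rate * grad(x_mom)  # integrate velocity
    x_mom += v                                # integrate position

# Nesterov momentum, run in both forms side by side
x_nag, v1 = x0.copy(), np.zeros_like(x0)  # look-ahead form
y_nag, v2 = x0.copy(), np.zeros_like(x0)  # rewritten form; y stores x + mu * v
for _ in range(num_steps):
    x_ahead = x_nag + mu * v1
    v1 = mu * v1 - learning_rate * grad(x_ahead)
    x_nag += v1

    v_prev = v2
    v2 = mu * v2 - learning_rate * grad(y_nag)
    y_nag += -mu * v_prev + (1 + mu) * v2

print("sgd      :", loss(x_sgd))
print("momentum :", loss(x_mom))
print("nesterov :", loss(x_nag))
print("forms agree:", np.allclose(y_nag, x_nag + mu * v1))
```

On this particular problem the momentum variants reach a much lower loss than the vanilla update in the same number of steps; that is a property of this toy setup, not a general guarantee.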

### Learning rate decay
#### Step decay
Reduce the learning rate by a constant factor every few epochs.
#### Exponential decay
$ \alpha = \alpha_{0}e^{-kt} $, where $\alpha_{0}$ and $k$ are hyperparameters and $t$ is the iteration number.
#### 1/t decay
$ \alpha = \alpha_{0}/(1+kt) $, where $\alpha_{0}$ and $k$ are hyperparameters and $t$ is the iteration number.
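
The three schedules can be written as small functions. This is a sketch: the function names are hypothetical, and the `drop` factor and `every` interval used for step decay are assumed parameters, since the notes give no formula for that schedule.

```python
import math

def step_decay(alpha0, t, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` steps (assumed parameterization)."""
    return alpha0 * drop ** (t // every)

def exponential_decay(alpha0, k, t):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, k, t):
    """alpha = alpha0 / (1 + k * t)"""
    return alpha0 / (1.0 + k * t)

# Compare the schedules at a few values of t
for t in (0, 10, 50, 100):
    print(t,
          step_decay(0.1, t),
          exponential_decay(0.1, 0.01, t),
          inverse_time_decay(0.1, 0.01, t))
```

All three start at `alpha0` when `t = 0` and never increase afterwards; they differ only in how quickly the rate falls.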

### Second order methods
$ x \gets x - [Hf(x)]^{-1}\nabla f(x) $

$Hf(x)$ is the Hessian matrix, the square matrix of second-order partial derivatives of $f$, and $\nabla f(x)$ is the gradient.
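
A single Newton update can be illustrated on a small quadratic, where the Hessian is constant and one step lands on the minimum. The objective below (`A`, `b`) and the starting point are assumptions made for the example. The step is computed with `np.linalg.solve` rather than an explicit inverse, which is a common numerical choice and not something the notes prescribe.

```python
import numpy as np

# Hypothetical objective: f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic

x = np.array([5.0, -4.0])

# Newton update: x <- x - [Hf(x)]^{-1} grad f(x), computed by solving H * step = grad
step = np.linalg.solve(hessian(x), grad(x))
x = x - step

print("x after one step:", x)
print("gradient norm   :", np.linalg.norm(grad(x)))
```

Note that no learning rate appears in the update; the curvature information in the Hessian sets the step size.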

### Per-parameter adaptive learning rate methods

In [6]:
#Adagrad
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps) # eps = 1e-4 to 1e-8


In [7]:
#RMSprop
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)


In [None]:
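# Adam (simplified: without the bias correction of the full algorithm)
eps = 1e-8
beta1 = 0.9
beta2 = 0.999
m = beta1*m + (1-beta1)*dx # smoothed gradient (first moment)
v = beta2*v + (1-beta2)*(dx**2) # smoothed squared gradient (second moment)
x += - learning_rate * m / (np.sqrt(v) + eps)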
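
The three adaptive updates can be run end to end on a toy problem. In the sketch below, the badly scaled quadratic (`curvature`, `loss`, `grad`), the function names, and the default hyperparameter values are all assumptions made for illustration. The Adam loop also includes the bias-correction step of the full algorithm (`m_hat`, `v_hat`), which the simplified snippet above leaves out.

```python
import numpy as np

# Hypothetical, badly scaled objective: f(x) = 0.5 * sum(curvature * x**2)
curvature = np.array([1.0, 100.0])

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

def run_adagrad(x, learning_rate=0.1, eps=1e-8, num_steps=500):
    x, cache = x.copy(), np.zeros_like(x)
    for _ in range(num_steps):
        dx = grad(x)
        cache += dx ** 2  # accumulates forever, so steps keep shrinking
        x += -learning_rate * dx / (np.sqrt(cache) + eps)
    return x

def run_rmsprop(x, learning_rate=0.01, decay_rate=0.99, eps=1e-8, num_steps=500):
    x, cache = x.copy(), np.zeros_like(x)
    for _ in range(num_steps):
        dx = grad(x)
        cache = decay_rate * cache + (1 - decay_rate) * dx ** 2  # leaky average
        x += -learning_rate * dx / (np.sqrt(cache) + eps)
    return x

def run_adam(x, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=500):
    x = x.copy()
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, num_steps + 1):
        dx = grad(x)
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * dx ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x += -learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([1.0, 1.0])
print("initial :", loss(x0))
print("adagrad :", loss(run_adagrad(x0)))
print("rmsprop :", loss(run_rmsprop(x0)))
print("adam    :", loss(run_adam(x0)))
```

Because each update divides by a per-parameter scale, the two coordinates make similar progress even though their curvatures differ by a factor of 100.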