## Updating Weights

In [1]:
"""
The purpose of training a neural network is to minimize the output of the loss function
by identifying the optimal parameters.
The process of identifying the optimal parameters is called 'optimization'.
"""

In [4]:
"""
SGD (stochastic gradient descent) is an
optimization technique in which the parameters are updated
iteratively by stepping in the direction opposite to the gradient of the loss

W ← W - η * ∂L/∂W

W : weights to be updated
∂L/∂W : gradient of the loss function with respect to W
η : learning rate (a constant such as 0.01 or 0.001)

Drawbacks of SGD

SGD is easy to implement, but it can be inefficient depending on the problem.
For instance, take the function:

f(x, y) = 1/20 * x² + y²

The gradient at each point is (x/10, 2y), so the slope is steep along the y-axis and shallow along the x-axis.
The minimum is at (0, 0), but at most points the gradient does not point toward it.
Hence, when SGD is applied to this function, it approaches the minimum along a zig-zag path, which is inefficient.

In summary, SGD is inefficient on anisotropic functions, where the gradient changes depending on the direction.

Other optimizers such as Momentum, AdaGrad, and Adam were introduced to address this weakness of SGD.

"""

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (was `self.lr - lr`, which never stored it)
    
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
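
The zig-zag behaviour described in the notes can be reproduced with a few lines of NumPy. The following is a minimal sketch and not part of the original notebook: the start point (-7.0, 2.0), the learning rate 0.95, and the 30 iterations are assumed values chosen only to make the oscillation visible.

```python
import numpy as np

def f(x, y):
    """The anisotropic bowl from the notes: f(x, y) = x^2 / 20 + y^2."""
    return x ** 2 / 20.0 + y ** 2

def grad_f(x, y):
    """Analytic gradient of f: (x / 10, 2y)."""
    return np.array([x / 10.0, 2.0 * y])

pos = np.array([-7.0, 2.0])   # assumed start point
lr = 0.95                     # assumed learning rate
path = [pos.copy()]

for _ in range(30):
    pos -= lr * grad_f(*pos)  # W <- W - lr * dL/dW
    path.append(pos.copy())

path = np.array(path)
# y flips sign on every step (zig-zag), while x only creeps toward 0
print(path[:5])
print("f at start:", f(*path[0]), "f at end:", f(*path[-1]))
```

With these values each step multiplies y by about -0.9 and x by about 0.905, so the iterate bounces across the x-axis while making slow progress along it.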

### 6.1.4 Momentum

In [2]:
"""
Momentum
v ← αv - η * ∂L/∂W
W ← W + v

W : weights to be updated
∂L/∂W : gradient of the loss function with respect to W
η : learning rate
v : velocity
α : momentum coefficient (e.g. 0.9); it acts like friction, decaying v when the gradient is zero

"""
import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
    
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
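
The role of α as friction can be seen by feeding the velocity update a constant gradient. This is a small sketch under assumed values (a scalar gradient of 1.0 and 200 steps are illustrative choices, not from the notebook): the velocity does not grow without bound but settles at -η * g / (1 - α).

```python
lr, alpha = 0.01, 0.9   # same defaults as the Momentum class
grad = 1.0              # assumed constant gradient
v = 0.0

for _ in range(200):
    v = alpha * v - lr * grad   # v <- αv - η * ∂L/∂W

terminal = -lr * grad / (1.0 - alpha)
print(v, terminal)   # both are close to -0.1
```

With α = 0.9 the terminal velocity is ten times a plain SGD step, which is why momentum speeds up movement along directions where the gradient is consistently small.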

In [3]:
class Nesterov:

    """Nesterov's Accelerated Gradient (http://arxiv.org/abs/1212.0901)"""
    # NAG is a method that goes one step beyond momentum. (http://newsight.tistory.com/224)
    
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
        
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
            
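        for key in params.keys():
            # Update the parameters with the previous velocity first,
            # then advance the velocity for the next step.
            params[key] += self.momentum * self.momentum * self.v[key]
            params[key] -= (1 + self.momentum) * self.lr * grads[key]
            self.v[key] *= self.momentum
            self.v[key] -= self.lr * grads[key]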

### 6.1.5 AdaGrad : Learning rate decay

In [1]:
"""
If the learning rate η is too small, training takes too long,
whereas if it is too large, training may diverge and fail to yield a useful result.

Learning rate decay is one approach to choosing the learning rate effectively.
As training progresses, it gradually reduces the learning rate.

AdaGrad refines learning rate decay: instead of lowering one global learning rate,
it adapts the learning rate for each parameter individually during training.

h ← h + ∂L/∂W ⊙ ∂L/∂W
W ← W - η * 1/√h * ∂L/∂W

h : running sum of the squared gradients (⊙ is element-wise multiplication)

NOTE : because AdaGrad keeps accumulating the squares of past gradients,
the updates become weaker as training progresses and eventually approach zero.

RMSProp was introduced to mitigate this problem.
In RMSProp, gradients from the distant past are gradually forgotten, and recent gradients carry more weight.
This is called an "exponential moving average": the weight of each past gradient
decays exponentially with its age.
"""
class AdaGrad: 
    def __init__(self, lr = 0.01):
        self.lr = lr
        self.h = None
    
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
                
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / np.sqrt(self.h[key] + 1e-7)
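
The weakening of AdaGrad updates noted above can be checked numerically. The sketch below assumes a scalar parameter with a constant gradient of 1.0 over 100 steps (illustrative values only) and records the size of each update.

```python
import numpy as np

lr = 0.01
grad = 1.0      # assumed constant gradient
h = 0.0
steps = []

for _ in range(100):
    h += grad * grad                              # h <- h + g ⊙ g
    steps.append(lr * grad / np.sqrt(h + 1e-7))   # size of this update

print(steps[0], steps[-1])   # roughly lr, then roughly lr / 10
```

Since h equals the step count here, the update shrinks like η / √t even though the gradient itself never changes.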

In [13]:
class RMSprop:

    """RMSprop"""

    def __init__(self, lr=0.01, decay_rate = 0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
            
        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
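
For contrast with AdaGrad's ever-growing sum, the exponential moving average in RMSprop levels off. A minimal sketch, again assuming a scalar constant gradient of 1.0 and 1000 steps as illustrative values:

```python
import numpy as np

lr, decay_rate = 0.01, 0.99   # same defaults as the RMSprop class
grad = 1.0                    # assumed constant gradient
h = 0.0

for _ in range(1000):
    # Old contributions are scaled down by decay_rate on every step
    h = decay_rate * h + (1 - decay_rate) * grad * grad

step = lr * grad / (np.sqrt(h) + 1e-7)
print(h, step)   # h approaches grad**2, so the step stays near lr
```

A gradient seen k steps ago contributes with weight (1 - decay_rate) * decay_rate**k, so h tracks the recent squared gradient instead of the whole history.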

In [14]:
"""
Adam is an optimization method that combines Momentum and AdaGrad.
It was introduced in 2015.
"""
class Adam:
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
                
    def update(self, params, grads):
        if self.m is None:
            # Initialize both moment estimates on the first call
            self.m, self.v = {}, {}

            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
                
        self.iter += 1 
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)
               
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key] + 1e-7))
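
The `lr_t` factor in the class folds two bias corrections into one learning rate. The sketch below writes them out separately in the form commonly given for Adam, as an illustration only: the helper name `first_adam_step`, the `eps` placement outside the square root, and `lr=0.001` are assumptions and differ slightly from the class above.

```python
import numpy as np

def first_adam_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a scalar gradient (hypothetical helper)."""
    m = (1 - beta1) * grad          # first moment after one update from 0
    v = (1 - beta2) * grad ** 2     # second moment after one update from 0
    m_hat = m / (1 - beta1 ** 1)    # bias correction at t = 1
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

for g in (0.01, 1.0, 100.0):
    print(g, first_adam_step(g))    # each step is close to lr
```

Without the corrections, m and v start near zero and the first updates would be distorted; with them, the first step has magnitude close to the learning rate regardless of the gradient's scale.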