[[Neural Networks from Scratch]]

##### Why Move Beyond SGD?
Plain SGD updates blindly: it applies one global learning rate to every parameter and keeps no memory of past gradients. That is a poor fit for high-dimensional, sparse, noisy, or dynamically shifting problems, where different parameters benefit from different step sizes.

### **RMSProp** - Root Mean Square Propagation
##### The Core Idea Behind RMSProp
`RMSProp` tracks an exponentially decaying average of past squared gradients for each parameter. Parameters whose recent gradients have been large take smaller steps, and parameters whose recent gradients have been small take larger ones.
$$
g_t = \beta g_{t-1} + (1-\beta)(\nabla L)^2
$$
$$
\theta = \theta - \frac{\eta}{\sqrt{g_t}+\epsilon} \cdot \nabla L
$$
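where:
- $\beta = 0.9$: decay rate (this is called `rho` in the code)
- $\epsilon = 10^{-7}$: small constant that prevents division by zero
- $g_t$: running average of squared gradients (the `cache` in the code)
- All operations are element-wise, so adaptivity is **per-parameter**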
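
To make the per-parameter scaling concrete, here is a minimal sketch of a single RMSProp step on two parameters whose gradients differ by a factor of 100. The helper `rmsprop_step` and the gradient values are illustrative assumptions, not part of the original notes; $\epsilon$ is added outside the square root, as in the class implementation.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.001, rho=0.9, eps=1e-7):
    """Hypothetical helper: one RMSProp update, returns (new_theta, new_cache)."""
    # exponentially decaying average of squared gradients (g_t)
    cache = rho * cache + (1 - rho) * grad**2
    # each parameter's step is divided by the root of its own cache entry
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta = np.array([1.0, 1.0])
grad = np.array([10.0, 0.1])      # one steep direction, one shallow direction
cache = np.zeros_like(theta)      # the cache starts at zero

new_theta, cache = rmsprop_step(theta, grad, cache)
rmsprop_step_size = theta - new_theta
sgd_step_size = 0.001 * grad      # what plain SGD would do with the same learning rate

print("RMSProp step:", rmsprop_step_size)
print("SGD step:    ", sgd_step_size)
```

With plain SGD the two steps differ by the same factor of 100 as the gradients. With RMSProp, on this first step from a zero cache, both parameters move by roughly `lr / sqrt(1 - rho)`, because each gradient is divided by the root of its own squared-gradient average.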

##### Implementation of `Optimiser_RMSprop`

In [None]:
import numpy as np

class Optimiser_RMSprop:
	def __init__(self, learning_rate=0.001, decay=0.0, epsilon=1e-7, rho=0.9):
		self.learning_rate = learning_rate
		self.current_learning_rate = learning_rate
		self.decay = decay
		self.iterations = 0
		self.epsilon = epsilon
		self.rho = rho

	def pre_update_params(self):
		if self.decay:
			self.current_learning_rate = self.learning_rate * \
				(1. / (1. + self.decay * self.iterations))

	def update_params(self, layer):
		# if the layer object does not yet have an attribute called `weight_cache`
		if not hasattr(layer, 'weight_cache'):
			layer.weight_cache = np.zeros_like(layer.weights)
			layer.bias_cache = np.zeros_like(layer.biases)

		# update the cache: decaying average of squared gradients
		layer.weight_cache = self.rho * layer.weight_cache + \
			(1 - self.rho) * layer.dweights**2
		layer.bias_cache = self.rho * layer.bias_cache + \
			(1 - self.rho) * layer.dbiases**2

		# scale each parameter's step by the root of its own cache entry
		layer.weights += -self.current_learning_rate * \
			layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
		layer.biases += -self.current_learning_rate * \
			layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

	def post_update_params(self):
		self.iterations += 1


### **Adam** - Adaptive Moment Estimation
##### The Core Idea Behind Adam
Adam combines **Momentum** (an exponentially decaying average of past gradients) with **RMSProp** (an exponentially decaying average of past squared gradients).

Both averages start at zero, so Adam adds a bias correction to stabilise early updates:
$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla L
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla L)^2
$$
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$
$$
\theta = \theta - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
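
The effect of the bias correction is easiest to see at the very first step, when both moment estimates have just left zero. The sketch below evaluates the four equations above at $t = 1$ for an illustrative gradient; the gradient values and variable names are assumptions chosen for the example.

```python
import numpy as np

beta_1, beta_2 = 0.9, 0.999
lr, eps = 0.001, 1e-7
grad = np.array([10.0, 0.1])      # illustrative gradient, not from the notes

# t = 1: m_0 and v_0 are zero, so only the (1 - beta) terms survive
m = (1 - beta_1) * grad           # 10x smaller than the gradient
v = (1 - beta_2) * grad**2        # 1000x smaller than the squared gradient

t = 1
m_hat = m / (1 - beta_1**t)       # undoes the shrinkage of m
v_hat = v / (1 - beta_2**t)       # undoes the shrinkage of v

uncorrected_step = lr * m / (np.sqrt(v) + eps)
corrected_step = lr * m_hat / (np.sqrt(v_hat) + eps)

print("m_hat:", m_hat, " v_hat:", v_hat)
print("uncorrected step:", uncorrected_step)
print("corrected step:  ", corrected_step)
```

At $t = 1$ the corrected moments equal the gradient and the squared gradient, so the first step has magnitude close to $\eta$ for every parameter. Without the correction, the mismatch between $1 - \beta_1$ and $\sqrt{1 - \beta_2}$ would make the first step about three times larger for these default values.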

##### Implementation of `Optimiser_Adam`

In [None]:
class Optimiser_Adam:
	def __init__(self, learning_rate=0.001, decay=0.0, epsilon=1e-7, beta_1=0.9, beta_2=0.999):
		self.learning_rate = learning_rate
		self.current_learning_rate = learning_rate
		self.decay = decay
		self.iterations = 0
		self.epsilon = epsilon # for numerical stability
		self.beta_1 = beta_1 # decay rate for first moment
		self.beta_2 = beta_2 # decay rate for second moment

	def pre_update_params(self):
		# Decay learning rate if applicable
		if self.decay:
			self.current_learning_rate = self.learning_rate * \
				(1. / (1. + self.decay * self.iterations))

	def update_params(self, layer):
		# if the layer object does not yet have an attribute called `weight_cache`
		if not hasattr(layer, 'weight_cache'):
			layer.weight_momentums = np.zeros_like(layer.weights)
			layer.weight_cache = np.zeros_like(layer.weights)
			layer.bias_momentums = np.zeros_like(layer.biases)
			layer.bias_cache = np.zeros_like(layer.biases)

		# update biased first moment estimate (momentum)
		layer.weight_momentums = self.beta_1 * layer.weight_momentums + \
								 (1 - self.beta_1) * layer.dweights
		layer.bias_momentums = self.beta_1 * layer.bias_momentums + \
								(1 - self.beta_1) * layer.dbiases

		# correct bias in first moment
		corrected_weight_momentums = layer.weight_momentums / \
									 (1 - self.beta_1 ** (self.iterations + 1))
		corrected_bias_momentums = layer.bias_momentums / \
								   (1 - self.beta_1 ** (self.iterations + 1))

		# update biased second moment estimate (squared gradient cache)
		layer.weight_cache = self.beta_2 * layer.weight_cache + \
							 (1 - self.beta_2) * layer.dweights**2
		layer.bias_cache = self.beta_2 * layer.bias_cache + \
						   (1 - self.beta_2) * layer.dbiases**2

		# correct bias in second moment
		corrected_weight_cache = layer.weight_cache / \
								 (1 - self.beta_2 ** (self.iterations + 1))
		corrected_bias_cache = layer.bias_cache / \
							   (1 - self.beta_2 ** (self.iterations + 1))

		# update weights and biases using Adam Formula
		layer.weights += -self.current_learning_rate * corrected_weight_momentums / \
						 (np.sqrt(corrected_weight_cache) + self.epsilon)
		layer.biases += -self.current_learning_rate * corrected_bias_momentums / \
						(np.sqrt(corrected_bias_cache) + self.epsilon)

	def post_update_params(self):
		# increment iteration count
		self.iterations += 1


##### Usage Example
To use Adam, replace the line that creates the optimiser in the training script:

In [None]:
optimiser = Optimiser_Adam(learning_rate=0.02, decay=5e-5)
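
To get a feel for what `decay=5e-5` does to a starting rate of `0.02`, the sketch below evaluates the same formula that `pre_update_params` uses at a few iteration counts. The standalone function `decayed_learning_rate` is a hypothetical helper written for this illustration, and the iteration counts are arbitrary.

```python
def decayed_learning_rate(learning_rate, decay, iterations):
    """Hypothetical helper mirroring the formula in pre_update_params."""
    return learning_rate * (1.0 / (1.0 + decay * iterations))

# effective learning rate at a few arbitrary points in training
for iterations in (0, 1_000, 10_000, 100_000):
    rate = decayed_learning_rate(0.02, 5e-5, iterations)
    print(f"iteration {iterations:>7}: learning rate = {rate:.6f}")
```

With these settings the rate falls to two thirds of its starting value after 10,000 iterations and to one sixth after 100,000, so the decay is gentle early on and only dominates in long runs.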


##### Next Step:
[[Batch Training]]