[[Neural Networks from Scratch]]

##### What is Optimisation?
The process of adjusting weights and biases to minimise a loss function. This is how we train a model.

##### What Does It Use?
It uses the gradients from backpropagation, which tell us how the loss changes as each parameter increases:
- `dweights`
- `dbiases`

A positive gradient means the loss rises as the parameter rises, so we subtract the gradient, scaled by the learning rate, from each parameter:
$$
\theta = \theta - \eta \times \nabla L
$$
Where:
- $\theta$ = parameter (weight or bias)
- $\eta$ = learning rate
- $\nabla L$ = gradient of loss with respect to parameter
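
As a minimal sketch of this update rule, the snippet below applies a single step to a small array. The values of `weights`, `dweights` and `learning_rate` are made up for illustration and do not come from a real network.

```python
import numpy as np

# Hypothetical parameter values and gradients (illustrative only)
weights = np.array([0.5, -1.0, 2.0])
dweights = np.array([0.2, -0.4, 1.0])  # gradient of the loss w.r.t. each weight
learning_rate = 0.1                     # eta

# theta = theta - eta * grad(L)
new_weights = weights - learning_rate * dweights
print(new_weights)
```

The second weight has a negative gradient, so subtracting it moves that weight upwards: the step always goes in the direction that lowers the loss.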

##### Why Learning Rate is Crucial
- Too large -> overshoots the minimum and can diverge
- Too small -> extremely slow learning
- Must be *tunable* (can **decay** over time)
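
To make these cases concrete, here is a small sketch that runs gradient descent on the toy function $f(x) = x^2$ with three learning rates. The function, the helper name `descend` and the specific rates are assumptions chosen for illustration.

```python
def descend(learning_rate, steps=50, x=1.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        gradient = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * gradient
    return x

too_large = descend(1.1)     # each step multiplies x by -1.2, so |x| grows
too_small = descend(0.001)   # each step multiplies x by 0.998, so x barely moves
reasonable = descend(0.1)    # each step multiplies x by 0.8, so x shrinks towards 0

print(too_large, too_small, reasonable)
```

On this function the minimum is at $x = 0$. Only the middle-sized rate gets close to it within 50 steps.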

##### What is Decay?
As the model learns, the learning rate is decreased so that it takes smaller, more precise steps. This is called **learning rate decay**.

Example:
$$
\text{lr}_{\text{current}} = \frac{\text{lr}_{\text{initial}}}{1 + \text{decay} \times \text{epoch}}
$$
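In Python (the optimiser counts update steps in `self.iterations`; with one update per epoch, as in the training loop further down, this equals the epoch number):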

In [None]:
if self.decay:
    self.current_learning_rate = (
        self.learning_rate * (1. / (1. + self.decay * self.iterations))
    )
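
The same formula can be tried outside the optimiser class. The standalone helper below, `decayed_learning_rate`, is a hypothetical name used only for this sketch. It prints the learning rate at a few step counts, using the initial rate and decay value from the training example in this note.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Inverse-time decay: initial_lr / (1 + decay * step)."""
    return initial_lr * (1.0 / (1.0 + decay * step))

# Learning rate at a few points during training
for step in (0, 100, 1000, 9000):
    print(step, decayed_learning_rate(1.0, 1e-3, step))
```

With an initial rate of 1.0 and a decay of 1e-3, the rate has halved by step 1000 and is down to roughly a tenth of its starting value by step 9000.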

##### What is an Epoch?
A full pass through the dataset.

##### What is Momentum?
Imagine a ball rolling down a hill: it keeps gathering speed without being pushed again. Momentum lets updates build up velocity in a consistent direction, so they roll through small bumps (shallow local minima) instead of getting stuck.
$$
u_t = \gamma \times u_{t-1} - \eta \times \nabla L
$$
$$
\theta = \theta + u_t
$$
Where:
- $\gamma$ = momentum coefficient (0.9 typical)
- $u_t$ = velocity (accumulated gradient)
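
The sketch below shows how the velocity builds up. It assumes a constant gradient of 1.0, which is unrealistic but makes the accumulation easy to follow. All values are illustrative.

```python
momentum = 0.9        # gamma
learning_rate = 0.1   # eta
gradient = 1.0        # assumed constant slope, for illustration only

velocity = 0.0
velocities = []
for _ in range(100):
    # u_t = gamma * u_{t-1} - eta * grad(L)
    velocity = momentum * velocity - learning_rate * gradient
    velocities.append(velocity)

print(velocities[:3])   # the first few steps grow in size
print(velocities[-1])   # after many steps the velocity levels off
```

Under this assumption the step size grows from $\eta \nabla L = 0.1$ towards $\eta \nabla L / (1 - \gamma) = 1.0$. That is why a momentum of 0.9 can make progress along a steady slope about ten times faster than plain gradient descent.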

##### What Is a Training Loop?
It repeats:
1. Forward pass -> compute predictions
2. Loss calculation
3. Accuracy calculation
4. Backward pass -> compute gradients
5. Optimiser step -> update parameters

Repeat for many **epochs** (full passes over data).
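
The five steps above can be sketched on a much smaller problem than a neural network. The example below trains a logistic regression classifier on a made-up two-feature dataset. The data, the model and the hyperparameters are assumptions chosen to keep the loop short. It does not use the layer classes from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary dataset (hypothetical): label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5
losses = []

for epoch in range(200):
    # 1. Forward pass: sigmoid of a linear score
    probs = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))

    # 2. Loss: binary cross-entropy (clipped to avoid log(0))
    clipped = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = -np.mean(y * np.log(clipped) + (1 - y) * np.log(1 - clipped))
    losses.append(loss)

    # 3. Accuracy: fraction of correct predictions
    accuracy = np.mean((probs > 0.5) == (y == 1))

    # 4. Backward pass: gradients of the mean loss
    error = probs - y
    dweights = X.T @ error / len(X)
    dbiases = np.mean(error)

    # 5. Optimiser step: plain gradient descent update
    weights -= learning_rate * dweights
    bias -= learning_rate * dbiases

print(f"loss: {loss:.3f}, acc: {accuracy:.3f}")
```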

##### Stochastic Gradient Descent 
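The core idea of **SGD** is to iteratively update model parameters by taking small steps in the direction that reduces the error, estimating the gradient from just a few data points (a mini-batch) at a time.

Note that the training loop in this note passes the whole dataset through the network at every step, so here each update is a full-batch step and one iteration equals one epoch.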
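
A minimal sketch of the "few data points at a time" idea, assuming a toy linear regression problem: each update uses a random mini-batch of 16 samples rather than the full dataset. The data, the batch size and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): y = 3x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.05, size=256)

weight, bias = 0.0, 0.0
learning_rate = 0.1
batch_size = 16   # "a few data points at a time"

for epoch in range(50):
    order = rng.permutation(len(X))   # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        x_batch, y_batch = X[batch, 0], y[batch]

        # Gradient of the mean squared error on this batch only
        error = weight * x_batch + bias - y_batch
        dweight = 2.0 * np.mean(error * x_batch)
        dbias = 2.0 * np.mean(error)

        weight -= learning_rate * dweight
        bias -= learning_rate * dbias

print(f"weight: {weight:.2f}, bias: {bias:.2f}")
```

Each mini-batch gives a noisy estimate of the full gradient, but the estimates average out over many steps, so the parameters should still settle close to the values used to generate the data.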

**Implementation:**

In [None]:
# Requires numpy imported as np (see Module Initialisation below)
class Optimiser_SGD:
    """SGD optimiser with optional learning rate decay and momentum."""

    def __init__(self, learning_rate=1.0, decay=0.0, momentum=0.0):
        self.learning_rate = learning_rate          # initial learning rate
        self.current_learning_rate = learning_rate  # learning rate after decay
        self.decay = decay
        self.iterations = 0                         # update steps taken so far
        self.momentum = momentum

    def pre_update_params(self):
        # Call once per step, before updating any layer: apply decay
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                (1. / (1. + self.decay * self.iterations))

    def update_params(self, layer):
        if self.momentum:
            # Create the momentum arrays the first time a layer is updated
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)

            # Update = momentum * previous update - learning rate * gradient
            weight_updates = self.momentum * layer.weight_momentums - \
                self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            bias_updates = self.momentum * layer.bias_momentums - \
                self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        else:
            # Vanilla SGD: step against the gradient
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        # Apply the updates computed above
        layer.weights += weight_updates
        layer.biases += bias_updates

    def post_update_params(self):
        # Call once per step, after all layers have been updated
        self.iterations += 1


##### Module Initialisation

In [None]:
import micropip

await micropip.install("numpy")
await micropip.install("nnfs")
await micropip.install("matplotlib")

import matplotlib.pyplot as plt
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
import math

nnfs.init()


##### Training Loop with the SGD Optimiser Class
If testing this code block on its own, **first run** the Module Initialisation cell and the `Layer_Dense`, `Activation_ReLU`, `Activation_Softmax_Loss_CategoricalCrossentropy` and `Optimiser_SGD` classes.

In [None]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Network architecture
dense1 = Layer_Dense(2, 64)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(64, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

# Optimiser
optimiser = Optimiser_SGD(learning_rate=1.0, decay=1e-3, momentum=0.9)

# Training
for epoch in range(10001):
	dense1.forward(X)
	activation1.forward(dense1.output)
	dense2.forward(activation1.output)
	loss = loss_activation.forward(dense2.output, y)

	# Accuracy
	predictions = np.argmax(loss_activation.output, axis=1)
	if len(y.shape) == 2:
		y = np.argmax(y, axis=1)
	accuracy = np.mean(predictions == y)

	if not epoch % 100:
		print(f'epoch: {epoch}, acc: {accuracy:.3f}, loss: {loss:.3f}, lr: {optimiser.current_learning_rate}')

	# Backward
	loss_activation.backward(loss_activation.output, y)
	dense2.backward(loss_activation.dinputs)
	activation1.backward(dense2.dinputs)
	dense1.backward(activation1.dinputs)

	# Update
	optimiser.pre_update_params()
	optimiser.update_params(dense1)
	optimiser.update_params(dense2)
	optimiser.post_update_params()


##### What do we see here?
Key:
- epoch = number of full passes through the dataset so far
- accuracy = fraction of samples classified correctly
- loss = measure of how wrong the predictions are
- learning rate (lr) = current step size for parameter updates

Early in training with **Stochastic Gradient Descent**, the learning rate is still high, so the updates are large and the model converges faster. As **decay** shrinks the learning rate, the updates become smaller, so **SGD** fine-tunes around a minimum instead of overshooting it.

##### Next Step
[[Better Optimisers (RMSProp, Adam)]]