[[Neural Networks from Scratch]]

### L1 and L2 Regularisation

##### Why do we need Regularisation?
Regularisation penalises large weights to prevent overfitting.
##### L1 (Lasso)
Adds a penalty proportional to the **absolute value of weights**, driving some weights to exactly zero.
$$
Loss_{total} = Loss_{data} + \lambda \sum |\theta|
$$
**Use Case:** Useful when you have a large number of features and you suspect that only a subset of them are important
##### L2 (Ridge)
Adds a penalty proportional to the **square of the weights**, shrinking all weights towards zero without typically making any of them exactly zero, so influence is spread across many small weights rather than concentrated in a few large ones.
$$
Loss_{total} = Loss_{data} + \lambda \sum \theta^2
$$
**Use Case:** Useful when you have a large number of features and most of them contribute to the outcome
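To make the two penalty terms concrete, the sketch below evaluates both on a small, made-up weight vector. The names `weights` and `lam` and their values are illustrative assumptions, not taken from the model in this note.

```python
import numpy as np

# Hypothetical weights and regularisation strength, chosen for illustration
weights = np.array([0.5, -2.0, 0.0, 1.5])
lam = 0.1

# L1 penalty: lambda * sum of absolute values (0.5 + 2.0 + 0.0 + 1.5 = 4.0)
l1_penalty = lam * np.sum(np.abs(weights))
# L2 penalty: lambda * sum of squares (0.25 + 4.0 + 0.0 + 2.25 = 6.5)
l2_penalty = lam * np.sum(weights ** 2)

print(l1_penalty)
print(l2_penalty)
```

The weight `-2.0` contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which shows why L2 penalises large weights disproportionately.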
##### Dropout
Dropout randomly turns off neurons during training to prevent over-reliance on any one path.

#### Code Implementation
##### Layer Initialisation

In [None]:
class Layer_Dense:
	def __init__(self, n_inputs, n_neurons,
				 weight_regulariser_l1=0.0, weight_regulariser_l2=0.0,
				 bias_regulariser_l1=0.0, bias_regulariser_l2=0.0):
		self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
		self.biases = np.zeros((1, n_neurons))
		self.weight_regulariser_l1 = weight_regulariser_l1
		self.weight_regulariser_l2 = weight_regulariser_l2
		self.bias_regulariser_l1 = bias_regulariser_l1
		self.bias_regulariser_l2 = bias_regulariser_l2

##### Loss Class Method

In [None]:
def regularization_loss(self, layer):
	loss = 0
	if layer.weight_regulariser_l1 > 0:
		loss += layer.weight_regulariser_l1 * np.sum(np.abs(layer.weights))
	if layer.weight_regulariser_l2 > 0:
		loss += layer.weight_regulariser_l2 * np.sum(layer.weights ** 2)
	if layer.bias_regulariser_l1 > 0:
		loss += layer.bias_regulariser_l1 * np.sum(np.abs(layer.biases))
	if layer.bias_regulariser_l2 > 0:
		loss += layer.bias_regulariser_l2 * np.sum(layer.biases ** 2)
	return loss

##### Backward Pass In Layer

In [None]:
def backward(self, dvalues):
	self.dweights = np.dot(self.inputs.T, dvalues)
	self.dbiases = np.sum(dvalues, axis=0, keepdims=True)

	# L1 weights
	if self.weight_regulariser_l1 > 0:
		dL1 = np.ones_like(self.weights)
		dL1[self.weights < 0] = -1
		self.dweights += self.weight_regulariser_l1 * dL1

	# L2 weights
	if self.weight_regulariser_l2 > 0:
		self.dweights += 2 * self.weight_regulariser_l2 * self.weights

	# L1 biases
	if self.bias_regulariser_l1 > 0:
		dL1 = np.ones_like(self.biases)
		dL1[self.biases < 0] = -1
		self.dbiases += self.bias_regulariser_l1 * dL1

	# L2 biases
	if self.bias_regulariser_l2 > 0:
		self.dbiases += 2 * self.bias_regulariser_l2 * self.biases

	self.dinputs = np.dot(dvalues, self.weights.T)
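
One way to sanity-check the regularisation gradients used in the backward pass is to compare them against a numerical gradient of the penalty. The sketch below does this for a small random weight matrix; the variable names, seed, and strengths (`l1`, `l2`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))   # hypothetical small weight matrix
l1, l2 = 0.01, 0.05                 # hypothetical regularisation strengths

def penalty(w):
    # Combined L1 + L2 penalty, same form as the loss method above
    return l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

# Analytic gradient, mirroring the backward pass
dL1 = np.ones_like(weights)
dL1[weights < 0] = -1
analytic = l1 * dL1 + 2 * l2 * weights

# Numerical gradient via central differences, one element at a time
eps = 1e-6
numeric = np.zeros_like(weights)
for idx in np.ndindex(weights.shape):
    shifted_up = weights.copy()
    shifted_down = weights.copy()
    shifted_up[idx] += eps
    shifted_down[idx] -= eps
    numeric[idx] = (penalty(shifted_up) - penalty(shifted_down)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Note that `|w|` is not differentiable at exactly zero; the backward pass above assigns a gradient of `+1` there, so the comparison is only meaningful for non-zero weights.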

##### Training Loop Integration

In [None]:
data_loss = loss_function.calculate(predictions, y)
reg_loss = loss_function.regularization_loss(dense1) + \
		   loss_function.regularization_loss(dense2)
loss = data_loss + reg_loss

##### Example Layer Initialisation

In [None]:
dense1 = Layer_Dense(2, 64, weight_regulariser_l2=5e-4, bias_regulariser_l2=5e-4)


##### Key Observations:
- L1 is rarely used alone and often combined with L2
- Regularisation terms are mostly applied to hidden layers
- Regularisation improves generalisation and tames unstable gradients

### Dropout
##### Why do we need to implement Dropout?
Dropout is a regularisation technique that aims to prevent over-reliance on any single neuron and thereby reduce overfitting.

##### How does Dropout work?
During training, Dropout randomly disables a fraction of neurons by zeroing their outputs.
![[Pasted image 20250605183714.png]]

This is done by generating a Bernoulli mask that acts as a random filter to temporarily deactivate certain neurons:
$$
mask \sim Binomial(1,1-dropout_{rate})
$$
Then:
$$
output = inputs \times mask/(1 - dropout_{rate})
$$
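During inference, Dropout is **omitted**. Because the training-time outputs are already scaled by $1/(1 - dropout_{rate})$, the expected activation magnitude is the same in both modes and no further correction is needed.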
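
The masking and scaling steps can be sketched directly in NumPy. The drop rate, the all-ones `inputs` array, and the seed below are assumptions chosen so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
dropout_rate = 0.2                      # hypothetical drop probability
inputs = np.ones((1000, 100))           # hypothetical layer output, all ones

# Bernoulli mask: 1 with probability (1 - dropout_rate), else 0
mask = rng.binomial(1, 1 - dropout_rate, size=inputs.shape)
# Scale the surviving activations so the expected sum is unchanged
output = inputs * mask / (1 - dropout_rate)

kept_fraction = mask.mean()
print(kept_fraction)                    # should be close to 0.8
print(inputs.sum(), output.sum())       # the two sums should be close
```

The sums match only approximately because the mask is random; the scaling preserves the sum in expectation, not for every individual draw.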

#### Code Implementation

In [None]:
class Layer_Dropout:
	def __init__(self, rate):
		# Store the keep rate (1 - drop rate); np.random.binomial expects a success probability
		self.rate = 1 - rate

	def forward(self, inputs, training):
		self.inputs = inputs
		if training:
			# Sample dropout mask and scale
			self.binary_mask = np.random.binomial(1, self.rate, size=inputs.shape) / self.rate
			self.output = inputs * self.binary_mask
		else:
			self.output = inputs

	def backward(self, dvalues):
		self.dinputs = dvalues * self.binary_mask
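
A short usage sketch of the layer in both modes follows. It repeats the class so the snippet runs on its own; the input array, the drop rate of 0.5, and the seed are illustrative assumptions.

```python
import numpy as np

class Layer_Dropout:
    def __init__(self, rate):
        self.rate = 1 - rate  # keep rate

    def forward(self, inputs, training):
        self.inputs = inputs
        if training:
            # Scaled mask: entries are either 0 or 1 / keep_rate
            self.binary_mask = np.random.binomial(1, self.rate, size=inputs.shape) / self.rate
            self.output = inputs * self.binary_mask
        else:
            self.output = inputs

    def backward(self, dvalues):
        self.dinputs = dvalues * self.binary_mask

np.random.seed(0)
activations = np.ones((4, 5))           # hypothetical activations
dropout = Layer_Dropout(0.5)

# Training mode: some entries are zeroed, the rest are scaled by 1 / 0.5 = 2
dropout.forward(activations, training=True)
train_output = dropout.output
# The same mask is applied to the incoming gradient
dropout.backward(np.ones_like(activations))

# Inference mode: inputs pass through unchanged
dropout.forward(activations, training=False)
eval_output = dropout.output

print(train_output)
print(eval_output)
```

Reusing the stored mask in `backward` means gradients flow only through the neurons that were kept in the forward pass, with the same scaling.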

##### Integration into Model

In [None]:
dense1 = Layer_Dense(2, 64, weight_regulariser_l2=5e-4, bias_regulariser_l2=5e-4)
activation1 = Activation_ReLU()
dropout1 = Layer_Dropout(0.1)
dense2 = Layer_Dense(64, 3)

##### Forward Pass

In [None]:
dense1.forward(X)
activation1.forward(dense1.output)
dropout1.forward(activation1.output, training=True)
dense2.forward(dropout1.output)

##### Backward Pass

In [None]:
loss_activation.backward(dense2.output, y)
dense2.backward(loss_activation.dinputs)
dropout1.backward(dense2.dinputs)
activation1.backward(dropout1.dinputs)
dense1.backward(activation1.dinputs)

##### Optimiser Settings

In [None]:
optimiser = Optimiser_Adam(learning_rate=0.05, decay=5e-5)
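
The `decay` argument controls the schedule applied in `pre_update_params` of the optimiser classes in the full listing: the learning rate is divided by `1 + decay * iterations`. The standalone sketch below evaluates that formula for the settings above; the helper name `decayed_learning_rate` is hypothetical and not part of the classes in this note.

```python
def decayed_learning_rate(learning_rate, decay, iterations):
    # Same formula as pre_update_params in the optimiser classes
    return learning_rate * (1.0 / (1.0 + decay * iterations))

# Print the effective learning rate at a few points during training
for step in (0, 1000, 10000):
    print(step, decayed_learning_rate(0.05, 5e-5, step))
```

With these settings the rate falls from 0.05 at the start to roughly 0.033 by iteration 10,000, since `1 + 5e-5 * 10000 = 1.5`.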

##### Verification

In [None]:
# Before dropout
np.sum(output)

# After dropout: the rescaled sum is approximately equal to the original sum
np.sum(output * dropout_mask / (1 - dropout_rate))


### Full Code Up To This Point:

In [None]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class Layer_Dense:
    # Layer initialisation
    def __init__(self, n_inputs, n_neurons,
                 weight_regulariser_l1=0, weight_regulariser_l2=0,
                 bias_regulariser_l1=0, bias_regulariser_l2=0):
        # Initialise weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        # Set regularisation strength
        self.weight_regulariser_l1 = weight_regulariser_l1
        self.weight_regulariser_l2 = weight_regulariser_l2
        self.bias_regulariser_l1 = bias_regulariser_l1
        self.bias_regulariser_l2 = bias_regulariser_l2

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from inputs, weights, and biases
        self.output = np.dot(inputs, self.weights) + self.biases

    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)

        # Gradients on regularisation
        # L1 on weights
        if self.weight_regulariser_l1 > 0:
            dL1 = np.ones_like(self.weights)
            dL1[self.weights < 0] = -1
            self.dweights += self.weight_regulariser_l1 * dL1
        # L2 on weights
        if self.weight_regulariser_l2 > 0:
            self.dweights += 2 * self.weight_regulariser_l2 * self.weights
        # L1 on biases
        if self.bias_regulariser_l1 > 0:
            dL1 = np.ones_like(self.biases)
            dL1[self.biases < 0] = -1
            self.dbiases += self.bias_regulariser_l1 * dL1
        # L2 on biases
        if self.bias_regulariser_l2 > 0:
            self.dbiases += 2 * self.bias_regulariser_l2 * self.biases

        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)

class Layer_Dropout:
    # Initialisation
    def __init__(self, rate):
        # Store rate, we invert it as, for example, for dropout of 0.1, we need a success rate of 0.9
        self.rate = 1 - rate

    # Forward pass
    def forward(self, inputs):
        # Save input values
        self.inputs = inputs
        # Generate and save scaled mask
        self.binary_mask = np.random.binomial(1, self.rate, size=inputs.shape) / self.rate
        # Apply mask to output values
        self.output = inputs * self.binary_mask

    # Backward pass
    def backward(self, dvalues):
        # Gradient on values
        self.dinputs = dvalues * self.binary_mask

class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        # Since we need to modify the original variable, let's make a copy of the values first
        self.dinputs = dvalues.copy()
        # Zero gradient where input values were negative
        self.dinputs[self.inputs <= 0] = 0

class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Get unnormalised probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalise them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

    # Backward pass
    def backward(self, dvalues):
        # Create uninitialised array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten output array
            single_output = single_output.reshape(-1, 1)
            # Calculate Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)

class Optimiser_SGD:
    # Initialise optimiser - set settings
    def __init__(self, learning_rate=1., decay=0., momentum=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        # If we use momentum
        if self.momentum:
            # If layer does not contain momentum arrays, create them filled with zeros
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)

            # Build weight updates with momentum
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates
            # Build bias updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        # Update weights and biases using either vanilla or momentum updates
        layer.weights += weight_updates
        layer.biases += bias_updates

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

class Optimiser_Adagrad:
    # Initialise optimiser - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2

        # Vanilla SGD parameter update + normalisation with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

class Optimiser_RMSprop:
    # Initialise optimiser - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2

        # Vanilla SGD parameter update + normalisation with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

class Optimiser_Adam:
    # Initialise optimiser - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update momentum with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases

        # Get corrected momentum
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))

        # Update cache with squared current gradients
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2

        # Get corrected cache
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))

        # Vanilla SGD parameter update + normalisation with square rooted cache
        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

class Loss:
    # Regularisation loss calculation
    def regularization_loss(self, layer):
        regularisation_loss = 0
        # L1 regularisation - weights
        if layer.weight_regulariser_l1 > 0:
            regularisation_loss += layer.weight_regulariser_l1 * np.sum(np.abs(layer.weights))
        # L2 regularisation - weights
        if layer.weight_regulariser_l2 > 0:
            regularisation_loss += layer.weight_regulariser_l2 * np.sum(layer.weights * layer.weights)
        # L1 regularisation - biases
        if layer.bias_regulariser_l1 > 0:
            regularisation_loss += layer.bias_regulariser_l1 * np.sum(np.abs(layer.biases))
        # L2 regularisation - biases
        if layer.bias_regulariser_l2 > 0:
            regularisation_loss += layer.bias_regulariser_l2 * np.sum(layer.biases * layer.biases)
        return regularisation_loss

    # Calculates the data loss given model output and ground truth values
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses)
        return data_loss

class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples), y_true]
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)

        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods

    # Backward pass
    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        labels = len(dvalues[0])

        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]

        self.dinputs = -y_true / dvalues
        self.dinputs = self.dinputs / samples

class Activation_Softmax_Loss_CategoricalCrossentropy:
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()

    # Forward pass
    def forward(self, inputs, y_true):
        self.activation.forward(inputs)
        self.output = self.activation.output
        return self.loss.calculate(self.output, y_true)

    # Backward pass
    def backward(self, dvalues, y_true):
        samples = len(dvalues)

        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)

        self.dinputs = dvalues.copy()
        self.dinputs[range(samples), y_true] -= 1
        self.dinputs = self.dinputs / samples

# Create dataset
X, y = spiral_data(samples=1000, classes=3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64, weight_regulariser_l2=5e-4, bias_regulariser_l2=5e-4)

# Create ReLU activation
activation1 = Activation_ReLU()

# Create dropout layer
dropout1 = Layer_Dropout(0.1)

# Create second Dense layer with 64 input features and 3 output values
dense2 = Layer_Dense(64, 3)

# Create Softmax classifier's combined loss and activation
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

# Create optimiser
optimiser = Optimiser_Adam(learning_rate=0.05, decay=5e-5)

# Train in loop
for epoch in range(10001):
    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    activation1.forward(dense1.output)
    dropout1.forward(activation1.output)
    dense2.forward(dropout1.output)
    data_loss = loss_activation.forward(dense2.output, y)

    # Calculate regularisation penalty
    regularization_loss = loss_activation.loss.regularization_loss(dense1) + loss_activation.loss.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from the softmax output and targets
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions == y)

    if not epoch % 100:
        print(f'epoch: {epoch}, acc: {accuracy:.3f}, loss: {loss:.3f} (data_loss: {data_loss:.3f}, reg_loss: {regularization_loss:.3f}), lr: {optimiser.current_learning_rate}')

    # Backward pass
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    dropout1.backward(dense2.dinputs)
    activation1.backward(dropout1.dinputs)
    dense1.backward(activation1.dinputs)

    # Update weights and biases
    optimiser.pre_update_params()
    optimiser.update_params(dense1)
    optimiser.update_params(dense2)
    optimiser.post_update_params()

# Validate the model
X_test, y_test = spiral_data(samples=100, classes=3)
dense1.forward(X_test)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
loss = loss_activation.forward(dense2.output, y_test)
predictions = np.argmax(loss_activation.output, axis=1)
if len(y_test.shape) == 2:
    y_test = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions == y_test)
print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')


##### Results:
![[Screenshot_2025-06-05_18-50-46.png]]

New components in the results:

- Data Loss (`data_loss`): This measures how well the model's predictions match the actual target values. It is the primary component of the loss function
- Regularisation Loss (`reg_loss`): This is an additional term added to the loss function to penalise large weights in the model, helping to prevent overfitting

Summary:
- The model is trained over 10,000 epochs
- The model achieves a validation accuracy of 0.753 and a validation loss of 0.692

##### Next Step
[[Binary Logistic Regression]]