# Chapter 14: L1 and L2 Regularization

- techniques that reduce generalization error (overfitting, among others) fall under the umbrella of **regularization**
- L1 and L2 regularization both calculate a number (called a **penalty**) that is added to the loss value in an effort to **penalize the model for large weights**
- note that it is better to have many neurons contributing to the output rather than a select few
---
- the **L1 regularization** penalty is the sum of all the absolute values for the weights and biases
- this is a linear penalty as the returned regularization loss is directly proportional to the weight values
- L1 regularization penalizes small weights more than L2 regularization does, causing the model to grow invariant to small inputs and variant only to the bigger ones
- the **L2 regularization** penalty is the sum of the squared weights and biases 
- this is a non-linear approach because it penalizes larger weights and biases more than smaller ones
- with that being said, **L2 regularization is commonly used** as it does not affect small parameter values and does not allow the model's weights to grow too large
- L1 regularization is rarely used alone and is usually combined with L2 regularization, whereas L2 regularization, as just mentioned, is commonly used on its own
- regularization functions of this type drive the sum of weights towards 0, which is helpful towards alleviating exploding gradients
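To make the difference between the two penalties concrete, the following minimal sketch compares the per-weight L1 and L2 terms for one small and one large weight. The sample values are illustrative assumptions, not weights from the network in this chapter. For the small weight the L1 term is the larger of the two, while for the large weight the L2 term dominates.

```python
import numpy as np

# Illustrative weights (hypothetical values): one small, one large
weights = np.array([0.1, 2.0])

l1_terms = np.abs(weights)  # L1 contribution of each weight: |w|
l2_terms = weights ** 2     # L2 contribution of each weight: w^2

for w, l1, l2 in zip(weights, l1_terms, l2_terms):
    print(f"w={w}: L1 term={l1:.2f}, L2 term={l2:.2f}")
```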
---
- we also want to dictate how much of an impact the regularization penalty has on the overall loss
- in the mathematical equation, this value is referred to as **lambda**, where a higher value yields a larger penalty
---
- using code notation:
- `l1 = lambda_l1 * sum(abs(weights))`
- `l2 = lambda_l2 * sum(weights**2)`
- `loss = data_loss + l1 + l2`
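The notation above can be tried out directly with NumPy. This is a minimal sketch: the weights, both lambda values, and the `data_loss` value are hypothetical numbers chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical weights of a small layer (2 inputs x 3 neurons)
weights = np.array([[0.2, 0.8, -0.5],
                    [1.0, -0.1, 0.3]])

lambda_l1 = 0.01  # assumed L1 strength
lambda_l2 = 0.01  # assumed L2 strength
data_loss = 0.5   # assumed loss from the data alone

l1 = lambda_l1 * np.sum(np.abs(weights))  # 0.01 * 2.9
l2 = lambda_l2 * np.sum(weights ** 2)     # 0.01 * 2.03
loss = data_loss + l1 + l2

print(l1, l2, loss)
```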
---
- to implement regularization in our neural network code, we’ll start with the `__init__()` method of the `Layer_Dense()` class, which will house the lambda values for regularization as these can be set separately for each individual layer:

In [27]:
    # Layer initialization
    def __init__(self, inputs, neurons, weight_regularizer_l1=0, weight_regularizer_l2=0, bias_regularizer_l1=0, bias_regularizer_l2=0):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))
        # Set regularization strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2

- again, this method sets the lambda values
- now we’re going to create a new general `Loss()` class, which can be inherited by any of our specific loss functions (such as our existing `Loss_CategoricalCrossentropy()` class) 

In [28]:
# Common loss class
class Loss:

    # Regularization loss calculation
    def regularization_loss(self, layer):

        # 0 by default
        regularization_loss = 0

        # L1 regularization - weights
        if layer.weight_regularizer_l1 > 0:  # only calculate when factor greater than 0
            regularization_loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))

        # L2 regularization - weights
        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * np.sum(layer.weights * layer.weights)

        # L1 regularization - biases
        if layer.bias_regularizer_l1 > 0:  # only calculate when factor greater than 0
            regularization_loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))

        # L2 regularization - biases
        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * np.sum(layer.biases * layer.biases)

        return regularization_loss
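As a quick sanity check, the same calculation can be run on a tiny hand-made layer. This is only a sketch: `regularization_loss` here is a condensed stand-alone copy of the method above, and the `SimpleNamespace` object is a hypothetical stand-in for a `Layer_Dense` instance with made-up weights, biases, and lambda values.

```python
import numpy as np
from types import SimpleNamespace


def regularization_loss(layer):
    """Condensed stand-alone version of Loss.regularization_loss."""
    loss = 0.0
    if layer.weight_regularizer_l1 > 0:
        loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))
    if layer.weight_regularizer_l2 > 0:
        loss += layer.weight_regularizer_l2 * np.sum(layer.weights ** 2)
    if layer.bias_regularizer_l1 > 0:
        loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))
    if layer.bias_regularizer_l2 > 0:
        loss += layer.bias_regularizer_l2 * np.sum(layer.biases ** 2)
    return loss


# Hypothetical stand-in for a Layer_Dense object (values are made up)
layer = SimpleNamespace(
    weights=np.array([[1.0, -2.0], [0.5, 0.0]]),
    biases=np.array([[1.0, -1.0]]),
    weight_regularizer_l1=0.0,
    weight_regularizer_l2=0.1,
    bias_regularizer_l1=0.0,
    bias_regularizer_l2=0.1,
)

# Only the L2 terms are active: 0.1 * 5.25 (weights) + 0.1 * 2.0 (biases)
penalty = regularization_loss(layer)
print(penalty)
```

Because both L1 factors are 0, those branches are skipped and only the squared weights and biases contribute to the penalty.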

- next, as described above, we will update the `Loss_CategoricalCrossentropy()` class to inherit from the general `Loss()` class:

In [29]:
class Loss_CategoricalCrossentropy(Loss):  # inherit from the common Loss class
    pass # temporary

- then we’ll calculate the regularization loss and add it to our calculated data loss:

In [30]:
# These lines belong in the training loop (see the full code below), not in a class body;
# they assume loss_function, activation2, dense1, dense2 and y are already defined

# Calculate loss from output of activation2 (the softmax activation)
data_loss = loss_function.forward(activation2.output, y)

# Calculate regularization penalty
regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

# Calculate overall loss
loss = data_loss + regularization_loss

loss

- this completes the forward pass for regularization, but our overall loss can now include a regularization term, which must be accounted for when backpropagating the gradients
- thus, we will now cover the partial derivatives for both L1 and L2 regularization
- we are calculating the derivative with respect to the weights, and the resulting gradient is what we’ll use to update the weights:

In [31]:
weights = [0.2, 0.8, -0.5]  # weights of one neuron
dL1 = []  # array of partial derivatives of L1 regularization
for weight in weights:
    if weight >= 0:
        dL1.append(1)
    else:
        dL1.append(-1)
dL1

[1, 1, -1]
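For reference, the derivatives behind this section can be written out as follows. This is a sketch in standard notation, with $\lambda$ as the regularization strength and $w_m$ as a single weight. The derivative of $|w_m|$ is undefined at exactly 0, and the code here uses 1 in that case by convention.

$$
\frac{\partial}{\partial w_m}\,\lambda \sum_k |w_k| =
\lambda \begin{cases} 1 & w_m > 0 \\ -1 & w_m < 0 \end{cases}
\qquad\qquad
\frac{\partial}{\partial w_m}\,\lambda \sum_k w_k^2 = 2\lambda w_m
$$

The same expressions apply to the biases with their own lambda values.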

- now let's extend the L1 derivative calculation to work with multiple neurons in a layer:

In [32]:
weights = [[0.2, 0.8, -0.5, 1], # now we have 3 sets of weights
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]]
dL1 = []  # array of partial derivatives of L1 regularization (eventual list of lists)
for neuron in weights:
    neuron_dL1 = []  # derivatives related to one neuron
    for weight in neuron:
        if weight >= 0:
            neuron_dL1.append(1)
        else:
            neuron_dL1.append(-1)
    dL1.append(neuron_dL1)
dL1

[[1, 1, -1, 1], [1, -1, 1, -1], [-1, -1, 1, 1]]

- as with most of our functions, the above code can be simplified using NumPy
- with NumPy, we’re going to use conditions and binary masks

In [33]:
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

# two conditions
dL1 = weights.copy()
dL1[dL1 >= 0] = 1
dL1[dL1 < 0] = -1

print(dL1)

[[ 1.  1. -1.  1.]
 [ 1. -1.  1. -1.]
 [-1. -1.  1.  1.]]


- this returned an array of the same shape containing values of 1 and -1 (the gradient of the absolute value function, using 1 by convention where a weight is exactly 0)
- we can now take these and update the `backward()` method of the `Layer_Dense()` class
- for L1 regularization, we’ll multiply the array above by lambda for the weights and biases (separately) and add the product to the gradients
- for L2 regularization, the derivative of `lambda * w**2` with respect to `w` is `2 * lambda * w`, so we simply take the weights/biases, multiply them by (2 * lambda), and add that product to the gradients:

In [34]:
class Layer_Dense:
    ...
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)

        # Gradients on regularization
        # L1 on weights
        if self.weight_regularizer_l1 > 0:
            dL1 = self.weights.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dweights += self.weight_regularizer_l1 * dL1
        # L2 on weights
        if self.weight_regularizer_l2 > 0:
            self.dweights += 2 * self.weight_regularizer_l2 * self.weights
        # L1 on biases
        if self.bias_regularizer_l1 > 0:
            dL1 = self.biases.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dbiases += self.bias_regularizer_l1 * dL1
        # L2 on biases
        if self.bias_regularizer_l2 > 0:
            self.dbiases += 2 * self.bias_regularizer_l2 * self.biases

        # Gradient on values
        self.dvalues = np.dot(dvalues, self.weights.T)
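One way to gain confidence in these regularization gradients is to compare them against a finite-difference estimate. The sketch below does so for the penalty term alone. The helper names `reg_loss` and `reg_grad` and the lambda values are assumptions made for this illustration, and `np.where` is used as a one-line equivalent of the two masks above.

```python
import numpy as np


def reg_loss(w, lambda_l1, lambda_l2):
    # Combined L1 + L2 penalty for one parameter array
    return lambda_l1 * np.sum(np.abs(w)) + lambda_l2 * np.sum(w ** 2)


def reg_grad(w, lambda_l1, lambda_l2):
    # Analytical gradient: lambda * sign(w) for L1, 2 * lambda * w for L2
    dL1 = np.where(w >= 0, 1.0, -1.0)
    return lambda_l1 * dL1 + 2 * lambda_l2 * w


weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])
lambda_l1, lambda_l2 = 5e-4, 5e-4  # hypothetical strengths

analytic = reg_grad(weights, lambda_l1, lambda_l2)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(weights)
for idx in np.ndindex(weights.shape):
    w_plus = weights.copy()
    w_minus = weights.copy()
    w_plus[idx] += eps
    w_minus[idx] -= eps
    numeric[idx] = (reg_loss(w_plus, lambda_l1, lambda_l2)
                    - reg_loss(w_minus, lambda_l1, lambda_l2)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

The check is only meaningful away from zero: for a weight within `eps` of 0, the absolute value has a kink and the two estimates would disagree.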

- let's also update our `print()` statement to output the regularization loss and the overall loss

### Full Code

In [36]:
import numpy as np
import random


np.random.seed(0)


# Our sample dataset
def create_data(n, k):
    X = np.zeros((n*k, 2))  # data matrix (each row = single example)
    y = np.zeros(n*k, dtype='uint8')  # class labels
    for j in range(k):
        ix = range(n*j, n*(j+1))
        r = np.linspace(0.0, 1, n)  # radius
        t = np.linspace(j*4, (j+1)*4, n) + np.random.randn(n)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = j
    return X, y


# Dense layer
class Layer_Dense:

    # Layer initialization
    def __init__(self, inputs, neurons, weight_regularizer_l1=0, weight_regularizer_l2=0, bias_regularizer_l1=0, bias_regularizer_l2=0):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))
        # Set regularization strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)

        # Gradients on regularization
        # L1 on weights
        if self.weight_regularizer_l1 > 0:
            dL1 = self.weights.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dweights += self.weight_regularizer_l1 * dL1
        # L2 on weights
        if self.weight_regularizer_l2 > 0:
            self.dweights += 2 * self.weight_regularizer_l2 * self.weights
        # L1 on biases
        if self.bias_regularizer_l1 > 0:
            dL1 = self.biases.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dbiases += self.bias_regularizer_l1 * dL1
        # L2 on biases
        if self.bias_regularizer_l2 > 0:
            self.dbiases += 2 * self.bias_regularizer_l2 * self.biases

        # Gradient on values
        self.dvalues = np.dot(dvalues, self.weights.T)


# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        dvalues = dvalues.copy()  # Copy first so we don't modify the array passed in
        dvalues[self.inputs <= 0] = 0  # Zero gradient where input values were zero or negative
        self.dvalues = dvalues


# Softmax activation
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

        self.output = probabilities

    # Backward pass
    def backward(self, dvalues):
        self.dvalues = dvalues


# Common loss class
class Loss:

    # Regularization loss calculation
    def regularization_loss(self, layer):

        # 0 by default
        regularization_loss = 0

        # L1 regularization - weights
        if layer.weight_regularizer_l1 > 0:  # only calculate when factor greater than 0
            regularization_loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))

        # L2 regularization - weights
        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * np.sum(layer.weights * layer.weights)

        # L1 regularization - biases
        if layer.bias_regularizer_l1 > 0:  # only calculate when factor greater than 0
            regularization_loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))

        # L2 regularization - biases
        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * np.sum(layer.biases * layer.biases)

        return regularization_loss


# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):

    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = y_pred.shape[0]

        # Probabilities for target values - only if categorical labels
        if len(y_true.shape) == 1:
            y_pred = y_pred[range(samples), y_true]

        # Losses
        negative_log_likelihoods = -np.log(y_pred)

        # Mask values - only for one-hot encoded labels
        if len(y_true.shape) == 2:
            negative_log_likelihoods *= y_true

        # Overall loss
        data_loss = np.sum(negative_log_likelihoods) / samples
        return data_loss

    # Backward pass
    def backward(self, dvalues, y_true):

        samples = dvalues.shape[0]

        dvalues = dvalues.copy()  # Copy first so we don't modify the array passed in
        dvalues[range(samples), y_true] -= 1
        dvalues = dvalues / samples

        self.dvalues = dvalues


# SGD Optimizer
class Optimizer_SGD:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., momentum=0., nesterov=False):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum
        self.nesterov = nesterov

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain momentum arrays, create them filled with zeros
        if not hasattr(layer, 'weight_momentums'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)

        # If we use momentum
        if self.momentum:

            # Build weight updates with momentum - take previous updates multiplied by retain factor and update with current gradients
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            # Build bias updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates

            # Apply Nesterov as well?
            if self.nesterov:
                weight_updates = self.momentum * weight_updates - self.current_learning_rate * layer.dweights
                bias_updates = self.momentum * bias_updates - self.current_learning_rate * layer.dbiases

        # Vanilla SGD updates (as before momentum update)
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        # Update weights with updates which are either vanilla, momentum or momentum+nesterov updates
        layer.weights += weight_updates
        layer.biases += bias_updates

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# Adagrad Optimizer
class Optimizer_Adagrad:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# RMSprop Optimizer
class Optimizer_RMSprop:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# Adam Optimizer
class Optimizer_Adam:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update momentum  with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases
        # Get corrected momentum
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))  # self.iterations is 0 on the first pass and we need to start with 1 here
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        # Update cache with squared current gradients
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2
        # Get corrected cache
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# Create dataset
X, y = create_data(100, 3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=1e-5, bias_regularizer_l2=1e-5)  # first dense layer, 2 inputs (each sample has 2 features), 64 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output of previous layer here) and 3 output values
dense2 = Layer_Dense(64, 3)  # second dense layer, 64 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Create optimizer
#optimizer = Optimizer_SGD(decay=1e-8, momentum=0.9)
#optimizer = Optimizer_Adagrad(decay=1e-8)
#optimizer = Optimizer_RMSprop(learning_rate=0.05, decay=4e-8, rho=0.999)
optimizer = Optimizer_Adam(learning_rate=0.05, decay=1e-8)

# Train in loop
for epoch in range(10001):

    # Make a forward pass of our training data thru this layer
    dense1.forward(X)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation1.forward(dense1.output)

    # Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation2.forward(dense2.output)

    # Calculate loss from output of activation2 so softmax activation
    data_loss = loss_function.forward(activation2.output, y)

    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from output of activation2 and targets
    predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print('epoch:', epoch, 'acc:', f'{accuracy:.3f}', 'loss:', f'{loss:.3f}', '(data_loss:', f'{data_loss:.3f}', 'reg_loss:', f'{regularization_loss:.3f})', ')', 'lr:', optimizer.current_learning_rate)

    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dvalues)
    dense2.backward(activation2.dvalues)
    activation1.backward(dense2.dvalues)
    dense1.backward(activation1.dvalues)

    # Update weights
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

# Validate model (we just do a forward pass)

# Create test dataset
X_test, y_test = create_data(100, 3)

# Make a forward pass of our test data thru this layer
dense1.forward(X_test)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Calculate loss from output of activation2 so softmax activation
loss = loss_function.forward(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0 acc: 0.360 loss: 1.099 (data_loss: 1.099 reg_loss: 0.000) ) lr: 0.05
epoch: 100 acc: 0.707 loss: 0.698 (data_loss: 0.696 reg_loss: 0.002) ) lr: 0.04999752506207612
epoch: 200 acc: 0.797 loss: 0.527 (data_loss: 0.523 reg_loss: 0.004) ) lr: 0.049990050996574775
epoch: 300 acc: 0.843 loss: 0.419 (data_loss: 0.413 reg_loss: 0.006) ) lr: 0.04997758005043209
epoch: 400 acc: 0.850 loss: 0.364 (data_loss: 0.357 reg_loss: 0.007) ) lr: 0.04996011596895705
epoch: 500 acc: 0.883 loss: 0.322 (data_loss: 0.314 reg_loss: 0.008) ) lr: 0.04993766399395728
epoch: 600 acc: 0.897 loss: 0.291 (data_loss: 0.282 reg_loss: 0.009) ) lr: 0.04991023086111661
epoch: 700 acc: 0.907 loss: 0.269 (data_loss: 0.259 reg_loss: 0.010) ) lr: 0.049877824796627425
epoch: 800 acc: 0.897 loss: 0.257 (data_loss: 0.246 reg_loss: 0.011) ) lr: 0.0498404555130797
epoch: 900 acc: 0.917 loss: 0.230 (data_loss: 0.218 reg_loss: 0.012) ) lr: 0.04979813420460921
epoch: 1000 acc: 0.917 loss: 0.216 (data_loss: 0.203 reg_loss: 0.0

- after adding the L2 regularization term into the hidden layer, we've achieved a lower loss and a higher accuracy
- let's also take a moment to illustrate how a simple increase in training data can make a large difference (100 --> 1000 samples per class):

In [38]:
np.random.seed(0)

# Create dataset
X, y = create_data(1000, 3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=1e-5, bias_regularizer_l2=1e-5)  # first dense layer, 2 inputs (each sample has 2 features), 64 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output of previous layer here) and 3 output values
dense2 = Layer_Dense(64, 3)  # second dense layer, 64 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Create optimizer
#optimizer = Optimizer_SGD(decay=1e-8, momentum=0.9)
#optimizer = Optimizer_Adagrad(decay=1e-8)
#optimizer = Optimizer_RMSprop(learning_rate=0.05, decay=4e-8, rho=0.999)
optimizer = Optimizer_Adam(learning_rate=0.05, decay=1e-8)

# Train in loop
for epoch in range(10001):

    # Make a forward pass of our training data thru this layer
    dense1.forward(X)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation1.forward(dense1.output)

    # Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation2.forward(dense2.output)

    # Calculate loss from output of activation2 so softmax activation
    data_loss = loss_function.forward(activation2.output, y)

    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from output of activation2 and targets
    predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print('epoch:', epoch, 'acc:', f'{accuracy:.3f}', 'loss:', f'{loss:.3f}', '(data_loss:', f'{data_loss:.3f}', 'reg_loss:', f'{regularization_loss:.3f})', ')', 'lr:', optimizer.current_learning_rate)

    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dvalues)
    dense2.backward(activation2.dvalues)
    activation1.backward(dense2.dvalues)
    dense1.backward(activation1.dvalues)

    # Update weights
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

# Validate model (we just do a forward pass)

# Create test dataset
X_test, y_test = create_data(100, 3)

# Make a forward pass of our test data thru this layer
dense1.forward(X_test)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Calculate loss from output of activation2 so softmax activation
loss = loss_function.forward(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0 acc: 0.323 loss: 1.099 (data_loss: 1.099 reg_loss: 0.000) ) lr: 0.05
epoch: 100 acc: 0.633 loss: 0.806 (data_loss: 0.804 reg_loss: 0.001) ) lr: 0.04999752506207612
epoch: 200 acc: 0.741 loss: 0.645 (data_loss: 0.642 reg_loss: 0.003) ) lr: 0.049990050996574775
epoch: 300 acc: 0.755 loss: 0.583 (data_loss: 0.579 reg_loss: 0.004) ) lr: 0.04997758005043209
epoch: 400 acc: 0.766 loss: 0.547 (data_loss: 0.542 reg_loss: 0.005) ) lr: 0.04996011596895705
epoch: 500 acc: 0.796 loss: 0.504 (data_loss: 0.498 reg_loss: 0.006) ) lr: 0.04993766399395728
epoch: 600 acc: 0.818 loss: 0.480 (data_loss: 0.473 reg_loss: 0.007) ) lr: 0.04991023086111661
epoch: 700 acc: 0.848 loss: 0.409 (data_loss: 0.401 reg_loss: 0.008) ) lr: 0.049877824796627425
epoch: 800 acc: 0.855 loss: 0.366 (data_loss: 0.357 reg_loss: 0.009) ) lr: 0.0498404555130797
epoch: 900 acc: 0.865 loss: 0.364 (data_loss: 0.354 reg_loss: 0.010) ) lr: 0.04979813420460921
epoch: 1000 acc: 0.875 loss: 0.334 (data_loss: 0.323 reg_loss: 0.0

- in theory, this regularization should also allow us to create much larger models without as much fear of overfitting
- we can test this by increasing the number of neurons per layer to 512 (instead of our usual 64/layer):

In [39]:
np.random.seed(0)

# Create dataset
X, y = create_data(1000, 3)

# Create Dense layer with 2 input features and 512 output values
dense1 = Layer_Dense(2, 512, weight_regularizer_l2=1e-5, bias_regularizer_l2=1e-5)  # first dense layer, 2 inputs (each sample has 2 features), 512 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 512 input features (as we take output of previous layer here) and 3 output values
dense2 = Layer_Dense(512, 3)  # second dense layer, 512 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Create optimizer
#optimizer = Optimizer_SGD(decay=1e-8, momentum=0.9)
#optimizer = Optimizer_Adagrad(decay=1e-8)
#optimizer = Optimizer_RMSprop(learning_rate=0.05, decay=4e-8, rho=0.999)
optimizer = Optimizer_Adam(learning_rate=0.05, decay=1e-8)

# Train in loop
for epoch in range(10001):

    # Make a forward pass of our training data thru this layer
    dense1.forward(X)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation1.forward(dense1.output)

    # Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation2.forward(dense2.output)

    # Calculate data loss from the output of activation2 (the softmax activation)
    data_loss = loss_function.forward(activation2.output, y)

    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from output of activation2 and targets
    predictions = np.argmax(activation2.output, axis=1)  # index of the highest-confidence class for each sample (axis 1 runs across classes)
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print('epoch:', epoch, 'acc:', f'{accuracy:.3f}', 'loss:', f'{loss:.3f}', '(data_loss:', f'{data_loss:.3f}', 'reg_loss:', f'{regularization_loss:.3f})', ')', 'lr:', optimizer.current_learning_rate)

    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dvalues)
    dense2.backward(activation2.dvalues)
    activation1.backward(dense2.dvalues)
    dense1.backward(activation1.dvalues)

    # Update weights
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

# Validate model (we just do a forward pass)

# Create test dataset
X_test, y_test = create_data(100, 3)

# Make a forward pass of our test data thru this layer
dense1.forward(X_test)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Calculate validation loss from the output of activation2 (the softmax activation); no regularization penalty is added here
loss = loss_function.forward(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # index of the highest-confidence class for each sample (axis 1 runs across classes)
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0 acc: 0.380 loss: 1.099 (data_loss: 1.099 reg_loss: 0.000) ) lr: 0.05
epoch: 100 acc: 0.802 loss: 0.507 (data_loss: 0.502 reg_loss: 0.005) ) lr: 0.04999752506207612
epoch: 200 acc: 0.886 loss: 0.308 (data_loss: 0.299 reg_loss: 0.009) ) lr: 0.049990050996574775
epoch: 300 acc: 0.907 loss: 0.258 (data_loss: 0.246 reg_loss: 0.011) ) lr: 0.04997758005043209
epoch: 400 acc: 0.909 loss: 0.240 (data_loss: 0.228 reg_loss: 0.013) ) lr: 0.04996011596895705
epoch: 500 acc: 0.905 loss: 0.249 (data_loss: 0.235 reg_loss: 0.013) ) lr: 0.04993766399395728
epoch: 600 acc: 0.916 loss: 0.223 (data_loss: 0.209 reg_loss: 0.014) ) lr: 0.04991023086111661
epoch: 700 acc: 0.919 loss: 0.218 (data_loss: 0.204 reg_loss: 0.014) ) lr: 0.049877824796627425
epoch: 800 acc: 0.918 loss: 0.215 (data_loss: 0.200 reg_loss: 0.015) ) lr: 0.0498404555130797
epoch: 900 acc: 0.920 loss: 0.226 (data_loss: 0.210 reg_loss: 0.016) ) lr: 0.04979813420460921
epoch: 1000 acc: 0.915 loss: 0.219 (data_loss: 0.203 reg_loss: 0.0
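
- for a rough sense of how much larger the 512-neuron model is, we can count trainable parameters; this is only a sketch, which assumes each dense layer holds one weight per input/neuron pair plus one bias per neuron and that the "usual" model is 2 → 64 → 3 (the helper names `count_dense_params` and `count_model_params` are hypothetical and not part of the chapter's code):

```python
def count_dense_params(n_inputs, n_neurons):
    """Parameters in one dense layer: n_inputs * n_neurons weights + n_neurons biases."""
    return n_inputs * n_neurons + n_neurons


def count_model_params(layer_sizes):
    """Sum dense-layer parameters over consecutive sizes, e.g. [2, 512, 3]."""
    return sum(count_dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


small = count_model_params([2, 64, 3])    # assumed "usual" model with 64 neurons
large = count_model_params([2, 512, 3])   # model trained in the cell above

print(small, large)  # 387 3075
```

- under these assumptions the larger model has roughly 8x as many parameters, so the regularization penalty is summed over far more weights than before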

- in this case, we see that the accuracies and losses for both in-sample and out-of-sample data are almost identical
- from here, we could try adding even more layers and/or neurons (feel free to tinker around)
- next, we’re going to cover another regularization method: **dropout** regularization