L1 and L2 Regularization
- The goal of regularization is to keep weights and biases from growing too large by penalizing large values. A very large weight suggests a neuron may be trying to memorize a feature; the underlying idea is that it's better to have many neurons contribute to the network output than a few.
- There are two types of regularization, L1 and L2. Both follow the same general form: a penalty is added to the loss in the forward pass, and the gradient of that penalty is added to the gradient of the loss with respect to the layer's weights and biases in the backward pass.

Forward pass:
- L1 - the absolute value of each weight or bias, times a penalty hyperparameter lambda, added to the loss function. In our implementation, we sum the absolute values of all the weights (or biases) in a layer and multiply that sum by lambda, so each parameter increases the loss in proportion to its absolute value. Note that this calculates the penalty on the total loss; in the backward pass the penalty gradient is calculated at the parameter level. Formula = lambda * sum(abs(weights)), where weights = the weights in the layer; replace weights with biases for the bias form.
- L2 - the squared value of each weight or bias, summed over all the weights (or biases) in the layer and multiplied by lambda. The penalty scales with the square of the weight, so it penalizes larger weights heavily and affects smaller weights much less. Formula = lambda * sum(weights^2), where weights = the weights in the layer; replace weights with biases for the bias form. Again, note that this is the penalty on the loss value for the whole network, not the derivative for the layer.
- The non-linear nature of L2 regularization impacts smaller weights less and larger weights more. The linear nature of L1 regularization impacts small weights relatively more and can cause a model to become invariant to small values and variant only to larger values (i.e., because L1 pulls on every weight with the same constant strength, that pull is large relative to small weight values and has a much greater effect in pushing them to 0). As such, L2 regularization is used more often than L1, and L1 is rarely used on its own.
- lambda is the penalty scalar that dictates how strong the penalty is.
- In our implementation we set lambda independently for each layer (with separate values for weights and biases, and for L1 and L2). Per ChatGPT, this is a more efficient solution than setting an individual lambda for every parameter, which would be a lot to manage, while still maintaining flexibility. It is also simpler and more interpretable.
- Regularization drives model parameters closer to 0, which discourages the network from memorizing the data because, as mentioned previously, very large weights or biases can mean the network is memorizing. It does this by increasing the loss with a first-degree (L1) or second-degree (L2) term, which causes bigger steps towards 0 during backprop; this is expanded upon in the following notes.
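To make the L1 and L2 formulas above concrete, here is a minimal sketch on a small weight matrix. The values of `weights` and `lam` are made up for illustration. The gradient lines are the standard derivatives of the two penalty terms (lambda * sign(w) for L1 and 2 * lambda * w for L2) and are included only to show the "small vs. large weight" effect; they are not taken from the implementation in these notes.

```python
import numpy as np

# Hypothetical layer weights and penalty strength (illustrative values only)
weights = np.array([[0.01, -0.5],
                    [2.0, -3.0]])
lam = 0.1

# Forward pass: penalty terms that get added to the loss
l1_penalty = lam * np.sum(np.abs(weights))  # lambda * sum(abs(weights))
l2_penalty = lam * np.sum(weights ** 2)     # lambda * sum(weights^2)

# Backward pass: per-parameter derivatives of the penalty terms
l1_grad = lam * np.sign(weights)  # same magnitude (lambda) for every non-zero weight
l2_grad = 2 * lam * weights       # proportional to the weight itself

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
print("L1 gradient:\n", l1_grad)
print("L2 gradient:\n", l2_grad)
```

For the smallest weight (0.01), the L1 gradient has magnitude 0.1, ten times the weight itself, while the L2 gradient is only 0.002. For the largest weight (-3.0), the L2 gradient magnitude (0.6) is well above the L1 gradient magnitude (0.1).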

Updated Loss Class
- Added summing up the regularization loss for the whole network.
- Pass a layer object to the regularization_loss function and it will accumulate the total regularization loss for that layer.
- Outside this object, in the broader training loop, the regularization loss for each layer is calculated and added up: regularization_loss = loss_activation.loss.regularization_loss(dense1) + loss_activation.loss.regularization_loss(dense2)
- That sum is then added to the data loss to get the total loss: loss = data_loss + regularization_loss, which we capture in the graph and training printouts.
- Note that we have not included the layer regularization loss code, as the regularization loss for the layer is calculated in this function, not in the layer object. The layer object is used for the backward pass of the loss and to initialize the regularization parameters (the lambdas) for the layer.
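As a rough illustration of how a layer might carry its own lambdas, here is a minimal sketch. `DenseSketch`, its constructor signature, and the initialization scheme are assumptions made for this example; only the four `*_regularizer_*` attribute names come from the Loss code in these notes.

```python
import numpy as np

class DenseSketch:
    """Hypothetical dense layer showing only the regularization-related setup."""

    def __init__(self, n_inputs, n_neurons,
                 weight_regularizer_l1=0.0, weight_regularizer_l2=0.0,
                 bias_regularizer_l1=0.0, bias_regularizer_l2=0.0):
        # Small random weights and zero biases (an assumed initialization)
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        # Per-layer penalty strengths (lambda); 0 disables that penalty
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2

# Each layer gets its own lambdas; dense2 is left unregularized here
dense1 = DenseSketch(2, 64, weight_regularizer_l2=5e-4, bias_regularizer_l2=5e-4)
dense2 = DenseSketch(64, 3)
```

Because every lambda defaults to 0, a layer created without regularizer arguments contributes nothing to the regularization loss.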

In [None]:
import numpy as np
class Loss:
    # Regularization loss calculation
    def regularization_loss(self, layer):
        # 0 by default
        regularization_loss = 0
        # L1 regularization - weights
        # calculate only when factor greater than 0
        if layer.weight_regularizer_l1 > 0:
            regularization_loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))
        # L2 regularization - weights
        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * np.sum(layer.weights * layer.weights)
        # L1 regularization - biases
        # calculate only when factor greater than 0
        if layer.bias_regularizer_l1 > 0:
            regularization_loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))
        # L2 regularization - biases
        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * np.sum(layer.biases * layer.biases)
        return regularization_loss
    # Calculates the data loss given model output and ground truth values;
    # the regularization loss is computed separately by regularization_loss
    def calculate(self, output, y):
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        # Return loss
        return data_loss

- Regularization losses for both L1 and L2, on both weights and biases, are accumulated into a single regularization_loss value for the layer.
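The accumulation can be checked by hand with a small self-contained sketch. The `make_layer` helper, the parameter values, and the `data_loss` placeholder are hypothetical; the free function `regularization_loss` mirrors the logic of the method above so that the example runs on its own.

```python
import numpy as np
from types import SimpleNamespace

def make_layer(weights, biases, w_l1=0.0, w_l2=0.0, b_l1=0.0, b_l2=0.0):
    """Hypothetical stand-in for a layer: holds parameters and lambdas only."""
    return SimpleNamespace(
        weights=np.asarray(weights, dtype=float),
        biases=np.asarray(biases, dtype=float),
        weight_regularizer_l1=w_l1, weight_regularizer_l2=w_l2,
        bias_regularizer_l1=b_l1, bias_regularizer_l2=b_l2,
    )

def regularization_loss(layer):
    """Same accumulation as the Loss method, written as a free function."""
    total = 0.0
    if layer.weight_regularizer_l1 > 0:
        total += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))
    if layer.weight_regularizer_l2 > 0:
        total += layer.weight_regularizer_l2 * np.sum(layer.weights ** 2)
    if layer.bias_regularizer_l1 > 0:
        total += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))
    if layer.bias_regularizer_l2 > 0:
        total += layer.bias_regularizer_l2 * np.sum(layer.biases ** 2)
    return total

# dense1 uses L2 on weights and biases; dense2 uses L1 on weights only
dense1 = make_layer([[1.0, -2.0], [0.5, 0.0]], [[0.0, 1.0]], w_l2=0.01, b_l2=0.01)
dense2 = make_layer([[3.0], [-1.0]], [[0.5]], w_l1=0.1)

data_loss = 0.7  # placeholder for the mean sample loss from the forward pass
reg_loss = regularization_loss(dense1) + regularization_loss(dense2)
loss = data_loss + reg_loss

print("regularization loss:", reg_loss)
print("total loss:", loss)
```

By hand: dense1 contributes 0.01 * 5.25 + 0.01 * 1.0 = 0.0625 and dense2 contributes 0.1 * 4.0 = 0.4, so the regularization loss is 0.4625 and the total loss is 1.1625.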