Optimizers

Gradient Descent
- Stochastic gradient descent (SGD) - fits a single sample at a time
- Vanilla gradient descent, gradient descent, batch gradient descent - fits the whole dataset at once
- Mini-batch gradient descent - fits smaller (mini) batches of data instead of all the data at once
- These terms can get confusing. For the purposes of the book, mini-batches are simply called batches
- Some people call it stochastic gradient descent regardless of whether it fits a batch or a single sample
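
The three regimes above differ only in how many samples feed each parameter update. A minimal sketch of that idea, assuming a hypothetical helper `iterate_batches` and arbitrary toy data (neither comes from the book):

```python
import numpy as np

def iterate_batches(X, y, batch_size):
    """Hypothetical helper: yield consecutive (X, y) slices of at most batch_size samples."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Toy data: 10 samples with 2 features each (values are arbitrary)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# One update per sample (stochastic gradient descent in the strict sense)
single = [len(xb) for xb, _ in iterate_batches(X, y, batch_size=1)]
# One update per batch of up to 4 samples (the last batch is smaller)
mini = [len(xb) for xb, _ in iterate_batches(X, y, batch_size=4)]
# One update on the whole dataset (batch gradient descent)
full = [len(xb) for xb, _ in iterate_batches(X, y, batch_size=len(X))]

print(single)
print(mini)
print(full)
```

The training loops in this notebook correspond to the last case: the whole dataset is passed forward and backward in every epoch.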

- to implement gradient descent, we need a learning rate and the gradients of the loss function with respect to the parameters. The parameter update is -learning_rate * gradients, which is added to the parameters. The learning rate is negated because the gradient points in the direction of increasing loss, and we want to step towards the minimum
- see the optimizer object below - in previous work we stored each layer's weight and bias gradients as attributes of the layer object (dweights, dbiases), so the optimizer can now read them and modify the layer's parameters

In [1]:
class Optimizer_SGD:
    # Initialize optimizer - set settings,
    # learning rate of 1. is default for this optimizer
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    # Update parameters
    def update_params(self, layer):
        layer.weights += -self.learning_rate * layer.dweights
        layer.biases += -self.learning_rate * layer.dbiases
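
To check the arithmetic of a single update by hand, the sketch below repeats the optimizer and applies it to a stand-in object. The `SimpleNamespace` layer and all the numbers are assumptions made for illustration; the real code uses `Layer_Dense` objects.

```python
import numpy as np
from types import SimpleNamespace

# Same optimizer as in the cell above, repeated so this sketch runs on its own
class Optimizer_SGD:
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    def update_params(self, layer):
        layer.weights += -self.learning_rate * layer.dweights
        layer.biases += -self.learning_rate * layer.dbiases

# Hypothetical stand-in for a layer: only the four attributes the optimizer touches
layer = SimpleNamespace(
    weights=np.array([[1.0, -2.0]]),
    biases=np.array([[0.5, 0.5]]),
    dweights=np.array([[0.2, -0.4]]),
    dbiases=np.array([[0.1, 0.0]]),
)

optimizer = Optimizer_SGD(learning_rate=0.5)
optimizer.update_params(layer)

# weights: 1.0 - 0.5*0.2 = 0.9 and -2.0 - 0.5*(-0.4) = -1.8
print(layer.weights)
# biases: 0.5 - 0.5*0.1 = 0.45 and 0.5 - 0.5*0.0 = 0.5
print(layer.biases)
```

Each parameter moves against the sign of its gradient, scaled by the learning rate.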

Basic network training using SGD optimizer
- epoch - a full pass through the training data, including the forward and backward passes
- models are typically trained over multiple epochs, but the fewer epochs needed, the better
- here it is implemented as a for loop where each epoch is 1 iteration of the loop, including a forward and a backward pass. range(10001) runs epochs 0 through 10000
- for this initial run, we chose a learning rate of 1 (it is the default)
- run the final cell in this workbook with the full network code before running this example

In [10]:
import nnfs
from nnfs.datasets import spiral_data
import numpy as np
nnfs.init()

###NOTE: run full network code in final cell of this workbook so that this example works

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output
# of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(64, 3)

# Create Softmax classifier's combined loss and activation
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

# Create optimizer
optimizer = Optimizer_SGD()
# Train in loop
for epoch in range(10001):
    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    
    # Perform a forward pass through activation function
    # takes the output of first dense layer here
    activation1.forward(dense1.output)
    
    # Perform a forward pass through second Dense layer
    # takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)
    
    # Perform a forward pass through the activation/loss function
    # takes the output of second dense layer here and returns loss
    loss = loss_activation.forward(dense2.output, y)

    # Calculate accuracy from output of activation2 and targets
    # calculate values along first axis
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100: #only every hundredth epoch
        print(f'epoch: {epoch}, ' +
        f'acc: {accuracy:.3f}, ' +
        f'loss: {loss:.3f}')
    
    # Backward pass
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Update weights and biases
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)

epoch: 0, acc: 0.360, loss: 1.099
epoch: 100, acc: 0.400, loss: 1.087
epoch: 200, acc: 0.417, loss: 1.077
epoch: 300, acc: 0.413, loss: 1.076
epoch: 400, acc: 0.400, loss: 1.074
epoch: 500, acc: 0.403, loss: 1.071
epoch: 600, acc: 0.417, loss: 1.067
epoch: 700, acc: 0.440, loss: 1.062
epoch: 800, acc: 0.457, loss: 1.055
epoch: 900, acc: 0.410, loss: 1.062
epoch: 1000, acc: 0.407, loss: 1.058
epoch: 1100, acc: 0.407, loss: 1.057
epoch: 1200, acc: 0.403, loss: 1.064
epoch: 1300, acc: 0.427, loss: 1.051
epoch: 1400, acc: 0.443, loss: 1.067
epoch: 1500, acc: 0.400, loss: 1.058
epoch: 1600, acc: 0.420, loss: 1.070
epoch: 1700, acc: 0.410, loss: 1.049
epoch: 1800, acc: 0.460, loss: 1.040
epoch: 1900, acc: 0.483, loss: 1.033
epoch: 2000, acc: 0.403, loss: 1.038
epoch: 2100, acc: 0.447, loss: 1.022
epoch: 2200, acc: 0.467, loss: 1.023
epoch: 2300, acc: 0.437, loss: 1.005
epoch: 2400, acc: 0.497, loss: 0.993
epoch: 2500, acc: 0.513, loss: 0.981
epoch: 2600, acc: 0.453, loss: 0.991
epoch: 2700, 

- the output of this initial training run showed some learning (accuracy ~63%), but the loss did not drop much
- the visualization of this training: https://nnfs.io/pup
- in the visualization, there is a "flashy wiggle" effect, indicating the learning rate may be too high. We can also tell from the loss, which does not decrease smoothly but bounces around (i.e. the loss decreases between some epochs, then increases again between others)

The Learning Rate
- in most cases applying the full negative gradient to the parameters is too big a step, because the gradient changes continuously and is only a local estimate (the slope of the tangent at the current point). A big jump may overshoot the steepest areas of the function instead of hugging the function's curvature.
- small steps make sure we follow the direction of steepest descent, hugging the function curve more closely by not jumping too far. But too small is bad too - it takes longer to arrive at optimal parameters and the model is more prone to getting stuck in a local minimum, the lowest point of a function within a given range of x (vs the global minimum - the lowest possible y value the function can output)
- the ideal global minimum for ANNs is a loss of 0 - however this is typically not achieved. A loss that stops improving while still far from 0 suggests the model may be stuck in a local minimum
- The gradient descent algorithm follows the direction of steepest descent, no matter how large or small the gradient is. Gradients near a local minimum are small, so the parameter adjustments are small too, and this is what causes the model to get stuck there
- too low a learning rate can cause learning stagnation - stuck in a local minimum
- too high a learning rate can cause gradient explosion - where the parameter updates cause the model loss to rise instead of fall. Eventually the loss/gradients become so big that they cannot be stored in floating point, causing an overflow error. This can be costly if the model has taken a while to train - a waste of time and computing resources
- with just a learning rate, we need to select a value that is neither too high nor too low, for the reasons discussed above. Too high and the loss will bounce around or the gradients will explode; too low and the model will learn too slowly.
- additions to a fixed learning rate are learning rate decay and momentum
- settings like these are called hyperparameters. Setting them appropriately requires experience, and it is difficult to prescribe anything. We have to see how the model performs with different hyperparameters and adjust from there.
- see code example below for lower learning rate - BE SURE TO RUN FINAL CELL SO THE OBJECTS WORK
- in the code example, setting the learning rate to .85 gives slightly higher accuracy and slightly lower loss

In [12]:
import nnfs
from nnfs.datasets import spiral_data
import numpy as np
nnfs.init()

###NOTE: run full network code in final cell of this workbook so that this example works

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output
# of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(64, 3)

# Create Softmax classifier's combined loss and activation
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

# Create optimizer
optimizer = Optimizer_SGD(learning_rate=.85)
# Train in loop
for epoch in range(10001):
    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    
    # Perform a forward pass through activation function
    # takes the output of first dense layer here
    activation1.forward(dense1.output)
    
    # Perform a forward pass through second Dense layer
    # takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)
    
    # Perform a forward pass through the activation/loss function
    # takes the output of second dense layer here and returns loss
    loss = loss_activation.forward(dense2.output, y)

    # Calculate accuracy from output of activation2 and targets
    # calculate values along first axis
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100: #only every hundredth epoch
        print(f'epoch: {epoch}, ' +
        f'acc: {accuracy:.3f}, ' +
        f'loss: {loss:.3f}')
    
    # Backward pass
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Update weights and biases
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)

epoch: 0, acc: 0.360, loss: 1.099
epoch: 100, acc: 0.403, loss: 1.091
epoch: 200, acc: 0.410, loss: 1.078
epoch: 300, acc: 0.423, loss: 1.077
epoch: 400, acc: 0.413, loss: 1.075
epoch: 500, acc: 0.400, loss: 1.074
epoch: 600, acc: 0.410, loss: 1.071
epoch: 700, acc: 0.417, loss: 1.067
epoch: 800, acc: 0.440, loss: 1.064
epoch: 900, acc: 0.443, loss: 1.057
epoch: 1000, acc: 0.420, loss: 1.050
epoch: 1100, acc: 0.397, loss: 1.061
epoch: 1200, acc: 0.387, loss: 1.060
epoch: 1300, acc: 0.420, loss: 1.061
epoch: 1400, acc: 0.460, loss: 1.055
epoch: 1500, acc: 0.390, loss: 1.057
epoch: 1600, acc: 0.450, loss: 1.072
epoch: 1700, acc: 0.400, loss: 1.049
epoch: 1800, acc: 0.423, loss: 1.039
epoch: 1900, acc: 0.387, loss: 1.059
epoch: 2000, acc: 0.437, loss: 1.053
epoch: 2100, acc: 0.443, loss: 1.026
epoch: 2200, acc: 0.377, loss: 1.050
epoch: 2300, acc: 0.433, loss: 1.016
epoch: 2400, acc: 0.460, loss: 1.000
epoch: 2500, acc: 0.493, loss: 1.010
epoch: 2600, acc: 0.527, loss: 0.998
epoch: 2700, 
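
The too-low / too-high trade-off from the learning rate notes can also be seen on a toy problem, separate from the spiral-data runs. This is a minimal sketch assuming a 1-D loss f(x) = x**2 (gradient 2x), a hypothetical helper `descend`, and arbitrarily chosen rates; it is not the book's network.

```python
def descend(learning_rate, start=5.0, steps=50):
    """Hypothetical helper: plain gradient descent on f(x) = x**2, returns the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                    # derivative of x**2
        x += -learning_rate * gradient      # same update rule as Optimizer_SGD
    return x

# Rates chosen for illustration only
for lr in (0.01, 0.4, 1.0, 1.1):
    x = descend(lr)
    print(f'learning_rate={lr}: final x={x:.4g}, loss={x**2:.4g}')
```

On this function each step multiplies x by (1 - 2 * learning_rate). With 0.01 the factor is 0.98, so progress is slow; with 0.4 it is 0.2, so x reaches the minimum quickly; with 1.0 it is -1, so x bounces between the same two points and the loss never drops; with 1.1 it is -1.2, so every step overshoots further and the loss grows without bound.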

Learning Rate Decay
 - the general idea is to start with some learning rate at the beginning of training and decrease it during training
 - One way - monitor the loss across epochs and adjust the learning rate if there is learning stagnation (i.e. the loss curve is flattening, rate too low) or jumping in the loss (rate too high). This can be programmed or just done manually
 - Can also use a decay rate that decreases the learning rate per batch or per epoch, such as a 1/t decay or exponential decay.
 - we will use 1/t learning rate decay, specifically: starting_learning_rate * (1 / (1 + learning_rate_decay * step_number)). So the larger the step number (scaled by the decay factor), the lower the learning rate. Adding 1 in the denominator ensures that we never divide by a value less than 1, which would accidentally increase the learning rate.
 - See below for the new optimizer class with learning rate decay
 - decay is 0 by default; if decay is not 0, the current learning rate is recalculated from the step count before each round of updates
 - we keep track of iterations inside the optimizer. This may not be strictly necessary when the epochs/batches come from a range loop, but it keeps the optimizer self-contained, and it is useful when the training loop has no numeric counter.


In [17]:
class Optimizer_SGD:
    # Initialize optimizer - set settings,
    # learning rate of 1. is default for this optimizer
    def __init__(self, learning_rate=1., decay=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):
        layer.weights += -self.current_learning_rate * layer.dweights
        layer.biases += -self.current_learning_rate * layer.dbiases

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
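
The call order matters: pre_update_params once, then update_params for every layer, then post_update_params once. The sketch below repeats the class in compact form and runs three steps against a stand-in layer. The `SimpleNamespace` layer, its values, and the large decay of 0.5 are assumptions chosen so the change is easy to see.

```python
import numpy as np
from types import SimpleNamespace

# Compact copy of the decaying optimizer so this sketch runs on its own
class Optimizer_SGD:
    def __init__(self, learning_rate=1., decay=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                (1. / (1. + self.decay * self.iterations))

    def update_params(self, layer):
        layer.weights += -self.current_learning_rate * layer.dweights
        layer.biases += -self.current_learning_rate * layer.dbiases

    def post_update_params(self):
        self.iterations += 1

# Hypothetical stand-in layer with a constant gradient of 1 on each weight
layer = SimpleNamespace(
    weights=np.ones((1, 2)), biases=np.zeros((1, 2)),
    dweights=np.ones((1, 2)), dbiases=np.zeros((1, 2)),
)

optimizer = Optimizer_SGD(learning_rate=1.0, decay=0.5)
rates = []
for step in range(3):
    optimizer.pre_update_params()    # recompute the rate from the step count
    optimizer.update_params(layer)   # would be called once per layer
    optimizer.post_update_params()   # advance the step counter
    rates.append(optimizer.current_learning_rate)

# Rates follow 1 / (1 + 0.5 * step): 1.0, then 2/3, then 0.5
print(rates)
```

Because the step counter lives in the optimizer, the training loop does not need to pass the epoch number in.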

Training with Learning Rate Decay
- trial and error - a learning rate decay of 1e-2 was too high, that is, the rate dropped too fast and the model got stuck in a local minimum (a higher decay gives a larger denominator because we multiply the decay by the step number). 1e-3 is better, getting to our best accuracy/loss so far. This assumes a starting learning rate of 1.
- be sure to run the optimizer object cell above and the last cell in the notebook with all prerequisite code
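
To see why 1e-2 drops the rate so much faster than 1e-3, the decay formula can be evaluated at a few step numbers. A minimal sketch, assuming a hypothetical helper `decayed_rate` and a starting learning rate of 1; the steps shown are picked for illustration.

```python
def decayed_rate(starting_rate, decay, step):
    """Hypothetical helper: the 1/t decay formula used by the optimizer above."""
    return starting_rate * (1.0 / (1.0 + decay * step))

steps = (0, 100, 1000, 10000)
schedules = {}
for decay in (1e-2, 1e-3):
    schedules[decay] = [decayed_rate(1.0, decay, step) for step in steps]
    print(f'decay={decay}:', [round(rate, 4) for rate in schedules[decay]])
```

With decay=1e-2 the rate is already halved by step 100 and is about 0.09 by step 1000. With decay=1e-3 the rate is only halved at step 1000, which leaves far more of the run at a usable step size.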

In [18]:
import nnfs
from nnfs.datasets import spiral_data
import numpy as np
nnfs.init()

###NOTE: run full network code in final cell of this workbook so that this example works

# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output
# of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(64, 3)

# Create Softmax classifier's combined loss and activation
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

# Create optimizer
optimizer = Optimizer_SGD(decay=1e-3)
# Train in loop
for epoch in range(10001):
    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    
    # Perform a forward pass through activation function
    # takes the output of first dense layer here
    activation1.forward(dense1.output)
    
    # Perform a forward pass through second Dense layer
    # takes outputs of activation function of first layer as inputs
    dense2.forward(activation1.output)
    
    # Perform a forward pass through the activation/loss function
    # takes the output of second dense layer here and returns loss
    loss = loss_activation.forward(dense2.output, y)

    # Calculate accuracy from output of activation2 and targets
    # calculate values along first axis
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100: #only every hundredth epoch
        print(f'epoch: {epoch}, ' +
        f'acc: {accuracy:.3f}, ' +
        f'loss: {loss:.3f}')
    
    # Backward pass
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Update weights and biases
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

epoch: 0, acc: 0.360, loss: 1.099
epoch: 100, acc: 0.400, loss: 1.088
epoch: 200, acc: 0.423, loss: 1.078
epoch: 300, acc: 0.423, loss: 1.076
epoch: 400, acc: 0.420, loss: 1.076
epoch: 500, acc: 0.403, loss: 1.074
epoch: 600, acc: 0.403, loss: 1.072
epoch: 700, acc: 0.410, loss: 1.070
epoch: 800, acc: 0.410, loss: 1.068
epoch: 900, acc: 0.427, loss: 1.066
epoch: 1000, acc: 0.440, loss: 1.063
epoch: 1100, acc: 0.440, loss: 1.059
epoch: 1200, acc: 0.447, loss: 1.056
epoch: 1300, acc: 0.440, loss: 1.052
epoch: 1400, acc: 0.427, loss: 1.048
epoch: 1500, acc: 0.417, loss: 1.040
epoch: 1600, acc: 0.423, loss: 1.033
epoch: 1700, acc: 0.450, loss: 1.025
epoch: 1800, acc: 0.470, loss: 1.017
epoch: 1900, acc: 0.460, loss: 1.008
epoch: 2000, acc: 0.463, loss: 1.000
epoch: 2100, acc: 0.490, loss: 1.005
epoch: 2200, acc: 0.467, loss: 1.014
epoch: 2300, acc: 0.483, loss: 1.014
epoch: 2400, acc: 0.490, loss: 1.012
epoch: 2500, acc: 0.493, loss: 1.009
epoch: 2600, acc: 0.497, loss: 1.005
epoch: 2700, 

Full Network Code from Ch. 9, so we can use it in the examples above

In [2]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()
import matplotlib.pyplot as plt

# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases
    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)

# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        self.inputs = inputs
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        # Since we need to modify original variable,
        # let’s make a copy of values first
        self.dinputs = dvalues.copy()
        # Zero gradient where input values were negative
        self.dinputs[self.inputs <= 0] = 0

# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probabilities
    # Backward pass
    def backward(self, dvalues):
        # Create uninitialized array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten output array
            single_output = single_output.reshape(-1, 1)
            # Calculate Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)

# Common loss class
class Loss:
    # Calculates the data and regularization losses
    # given model output and ground truth values
    def calculate(self, output, y):
        
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        
        # Return loss
        return data_loss
    
class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        
        # Clip data to prevent log(0), which would give an infinite loss
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Probabilities for target values -
        # only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples),y_true]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0])
        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
        
        # Calculate gradient
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples


class Activation_Softmax_Loss_CategoricalCrossentropy():
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()
    # Forward pass
    def forward(self, inputs, y_true):
        # Output layer's activation function
        self.activation.forward(inputs)
        # Set the output
        self.output = self.activation.output
        # Calculate and return loss value
        return self.loss.calculate(self.output, y_true)
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)

        # If labels are one-hot encoded,
        # turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
    
        # For each row in dinputs, get what the network has for the correct class and subtract 1
        self.dinputs[range(samples), y_true] -= 1
        
        # Normalize gradient
        self.dinputs = self.dinputs / samples