# Chapter 15: Dropout

- another regularization technique for neural networks is **dropout** regularization
- a dropout layer disables some neurons to prevent a neural network from becoming too dependent on any specific neuron
- dropout also helps alleviate **co-adoption**, which occurs when neurons become too dependent on the output values of other neurons and fail to learn the underlying function on their own 
- dropout can also help with **noise** and other perturbations in the training data
---
- with a dropout layer, neurons are disabled randomly at a given rate during every forward pass, forcing the network to learn how to make accurate predictions with only the random selection of remaining neurons
- to make an important clarification, the dropout layer does not truly disable neurons, but instead zeroes their outputs
- in other words, dropout does not actually decrease the number of neurons used, nor does it speed up the training time
---
- in code, we'll “turn off” neurons with a filter that is an array with the same shape as the weight array, but filled with numbers from a **Bernoulli distribution** 
- a Bernoulli distribution is a binary (also called discrete) probability distribution where we can get a value of 1 with probability of $p$ and value of 0 with probability of $q$
---
- we'll have one hyperparameter for a dropout layer that dictates the percentage of neurons to disable in that layer (0.10 = 10% of the neurons will randomly be disabled during each forward pass) 
- before we use NumPy, we’ll show a raw Python example: 

In [7]:
import random

dropout_rate = 0.5
example_output = [0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73]  # example output containing 10 values

# Repeat as long as necessary
while True:
    
    # Randomly choose index and set value to 0
    index = random.randint(0, len(example_output) - 1)
    example_output[index] = 0
    
    # We might set an index that already is zeroed
    # There are different ways of overcoming this problem, for simplicity we count values that are exactly 0
    # as it's extremely rare in real model that weights are exactly 0, it's not the best method for sure
    dropped_out = 0
    for value in example_output:
        if value == 0:
            dropped_out += 1
    
    # If required number of outputs is zeroed - leave a loop
    if dropped_out / len(example_output) >= dropout_rate:
        break

example_output # outputs vary due to randomness

[0, -1.03, 0.67, 0, 0, 0, -2.01, 0, -0.07, 0.73]

- the idea is to just keep disabling neuron outputs (setting them to 0) at random until we’ve disabled our target % of neurons 
- a Bernoulli distribution is a special variation of a **binomial distribution** with $n=1$ 
- therefore, we can use `np.random.binomial()` 
- a binomial distribution differs from a Bernoulli distribution as it adds a parameter, $n$, which dictates the number of concurrent experiments (instead of just one) and returns the number of successes from these $n$ experiments
- `np.random.binomial()` has three parameters: $n$ (number of experiments), $p$ (probability of true value of experiment), and parameter size (`np.random.binomial(n, p, size)`)
---
- think of this function like a coin toss, where the result will either be 0 or 1
- the $n$ is how many tosses of the coin do you want to perform
- the $p$ is the probability for the toss to result in a 1 and the overall result is the sum of all the results
- the $size$ is our desired amount of “tests” we want to perform

In [56]:
np.random.binomial(2, 0.5, size=10)

array([0, 1, 0, 1, 0, 1, 1, 2, 2, 1])

- the code above will produce an array that is of size 10, where each element will be the sum of 2 coin tosses, where the probability of 1 will be 0.5, or 50%
---
- we can use this code to create our dropout layer
- our goal is to create a filter where the intended dropout % is represented as 0 and otherwise represented as 1
- suppose we have a dropout layer of 5 neurons that we’ll add after a layer and we wish to have a 20% dropout
- an example of a dropout layer might look like: `[1, 0, 1, 1, 1]` (as you can see, 20% of that list is 0)

In [77]:
np.random.binomial(1, 0.8, size=5) # my own attempt to yield the array above

array([1, 1, 1, 1, 0])

In [114]:
dropout_rate = 0.20 # actual way (same thing, but more explanatory)
np.random.binomial(1, 1-dropout_rate, size=5)

array([1, 1, 1, 1, 0])

- on a realistically sized layer, you will find that the probability more consistently matches your intended value
- assume a neural network layer’s output is:

In [115]:
example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73])  

- next, let’s assume our target dropout rate is 0.3, or 30%
- we apply a dropout layer like so:

In [121]:
dropout_rate = 0.3

example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73]) 
example_output *= np.random.binomial(1, 1-dropout_rate, example_output.shape)

print(example_output)

[ 0.   -1.03  0.    0.    0.05 -0.37 -2.01  0.   -0.07  0.  ]


- note that our dropout rate is the amount of neurons we intend to disable ($q$), but sometimes the implementation of dropout will include a rate parameter that instead dictates the amount of neurons you intend to keep ($p$) 
---
- while dropout helps a neural network generalize and is helpful for training, it’s not something to utilize when predicting
---
- in any specific example, you will find that scaling doesn’t equal the exact same sum as before because we’re randomly dropping neurons, but after enough samples, the scaling will eventually average out
- let's see a supporting example:

In [130]:
dropout_rate = 0.2
example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73]) 
print(f"sum initial {sum(example_output)}")

sums = []
for i in range(10000):

    example_output2 = example_output * np.random.binomial(1, 1-dropout_rate, example_output.shape) / (1-dropout_rate)
    sums.append(sum(example_output2))
    
print(f"mean sum: {np.mean(sums)}") 
# not exact, but you should get the idea ("the scaling will eventually average out")

sum initial 0.36000000000000015
mean sum: 0.37619250000000015


- our final task is to implement a `backward()` method 
- like before, we need to calculate the partial derivative of the dropout operation:

In [131]:
# Dropout
class Layer_Dropout:

    # Init
    def __init__(self, rate):
        self.rate = 1 - rate

    # Forward pass
    def forward(self, values):
        # Save input values
        self.input = values
       
        self.binary_mask = np.random.binomial(1, self.rate, size=values.shape) / self.rate
        # Apply mask to output values
        self.output = values * self.binary_mask

    # Backward pass
    def backward(self, dvalues):
        # Gradient on values
        self.dvalues = dvalues * self.binary_mask

- let’s take this new dropout layer and add it between our two dense layers (first defining it)
- add dropout regularization in the forward and backward passes
- let's take a look out our final code:

### Final Code

In [132]:
import numpy as np
import random

np.random.seed(0)

# Our sample dataset
def create_data(n, k):
    X = np.zeros((n*k, 2))  # data matrix (each row = single example)
    y = np.zeros(n*k, dtype='uint8')  # class labels
    for j in range(k):
        ix = range(n*j, n*(j+1))
        r = np.linspace(0.0, 1, n)  # radius
        t = np.linspace(j*4, (j+1)*4, n) + np.random.randn(n)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = j
    return X, y


# Dense layer
class Layer_Dense:

    # Layer initialization
    def __init__(self, inputs, neurons, weight_regularizer_l1=0, weight_regularizer_l2=0, bias_regularizer_l1=0, bias_regularizer_l2=0):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(inputs, neurons)
        self.biases = np.zeros((1, neurons))
        # Set regularization strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from input ones, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)

        # Gradients on regularization
        # L1 on weights
        if self.weight_regularizer_l1 > 0:
            dL1 = self.weights.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dweights += self.weight_regularizer_l1 * dL1
        # L2 on weights
        if self.weight_regularizer_l2 > 0:
            self.dweights += 2 * self.weight_regularizer_l2 * self.weights
        # L1 on biases
        if self.bias_regularizer_l1 > 0:
            dL1 = self.biases.copy()
            dL1[dL1 >= 0] = 1
            dL1[dL1 < 0] = -1
            self.dbiases += self.bias_regularizer_l1 * dL1
        # L2 on biases
        if self.bias_regularizer_l2 > 0:
            self.dbiases += 2 * self.bias_regularizer_l2 * self.biases

        # Gradient on values
        self.dvalues = np.dot(dvalues, self.weights.T)


# Dropout
class Layer_Dropout:

    # Init
    def __init__(self, rate):
        # Store rate, we invert it as for example for dropout of 0.1 we need success rate of 0.9
        self.rate = 1 - rate

    # Forward pass
    def forward(self, values):
        # Save input values
        self.input = values
        # Generate and save scaled mask
        self.binary_mask = np.random.binomial(1, self.rate, size=values.shape) / self.rate
        # Apply mask to output values
        self.output = values * self.binary_mask

    # Backward pass
    def backward(self, dvalues):
        # Gradient on values
        self.dvalues = dvalues * self.binary_mask


# ReLU activation
class Activation_ReLU:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs 
        # Calculate output values from input ones
        self.output = np.maximum(0, inputs)

    # Backward pass
    def backward(self, dvalues):
        dvalues = dvalues.copy()  # Since we need to modify original variable, let;s make a copy of values first
        dvalues[self.inputs <= 0] = 0  # Zero gradient where input values were negative
        self.dvalues = dvalues


# Softmax activation
class Activation_Softmax:

    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs

        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)

        self.output = probabilities

    # Backward pass
    def backward(self, dvalues):
        self.dvalues = dvalues


# Common loss class
class Loss:

    # Regularization loss calculation
    def regularization_loss(self, layer):

        # 0 by default
        regularization_loss = 0

        # L1 regularization - weights
        if layer.weight_regularizer_l1 > 0:  # only calculate when factor greaten than 0
            regularization_loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))

        # L2 regularization - weights
        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * np.sum(layer.weights * layer.weights)

        # L1 regularization - biases
        if layer.bias_regularizer_l1 > 0:  # only calculate when factor greater than 0
            regularization_loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.weights))

        # L2 regularization - biases
        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * np.sum(layer.weights * layer.weights)

        return regularization_loss


# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):

    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = y_pred.shape[0]

        # Probabilities for target values - only if categorical labels
        if len(y_true.shape) == 1:
            y_pred = y_pred[range(samples), y_true]

        # Losses
        negative_log_likelihoods = -np.log(y_pred)

        # Mask values - only for one-hot encoded labels
        if len(y_true.shape) == 2:
            negative_log_likelihoods *= y_true

        # Overall loss
        data_loss = np.sum(negative_log_likelihoods) / samples
        return data_loss

    # Backward pass
    def backward(self, dvalues, y_true):

        samples = dvalues.shape[0]

        dvalues = dvalues.copy()  # We need to modify variable directly, make a copy first then
        dvalues[range(samples), y_true] -= 1
        dvalues = dvalues / samples

        self.dvalues = dvalues


# SGD Optimizer
class Optimizer_SGD:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., momentum=0., nesterov=False):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum
        self.nesterov = nesterov

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain momentum arrays, create ones filled with zeros
        if not hasattr(layer, 'weight_momentums'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)

        # If we use momentum
        if self.momentum:

            # Build weight updates with momentum - take previous updates multiplied by retain factor and update with current gradients
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            # Build bias updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates

            # Apply Nesterov as well?
            if self.nesterov:
                weight_updates = self.momentum * weight_updates - self.current_learning_rate * layer.dweights
                bias_updates = self.momentum * bias_updates - self.current_learning_rate * layer.dbiases

        # Vanilla SGD updates (as before momentum update)
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        # Update weights with updates which are either vanilla, momentum or momentum+nesterov updates
        layer.weights += weight_updates
        layer.biases += bias_updates

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# Adagrad Optimizer
class Optimizer_Adagrad:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# RMSprop Optimizer
class Optimizer_RMSprop:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create ones filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update cache with squared current gradients
        layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1


# Adam Optimizer
class Optimizer_Adam:

    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.current_learning_rate * (1. / (1. + self.decay * self.iterations))

    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update momentum  with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases
        # Get corrected momentum
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))  # self.iteration is 0 at first pass ans we need to start with 1 here
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        # Update cache with squared current gradients
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2
        # Get corrected bias
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))

        # Vanilla SGD parameter update + normalization with square rooted cache
        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

#============================================================================================================================================================================#

# Create dataset
X, y = create_data(1000, 3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=5e-4, bias_regularizer_l2=5e-4)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create dropout layer
dropout1 = Layer_Dropout(0.1)

# Create second Dense layer with 3 input features (as we take output of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(64, 3)  # second dense layer, 3 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Create optimizer
#optimizer = Optimizer_SGD(decay=1e-8, momentum=0.9)
#optimizer = Optimizer_Adagrad(decay=1e-8)
#optimizer = Optimizer_RMSprop(learning_rate=0.05, decay=4e-8, rho=0.999)
optimizer = Optimizer_Adam(learning_rate=0.05, decay=1e-8)

# Train in loop
for epoch in range(10001):

    # Make a forward pass of our training data thru this layer
    dense1.forward(X)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation1.forward(dense1.output)

    # Make a forward pass thru Dropout layer 
    dropout1.forward(activation1.output)

    # Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
    dense2.forward(dropout1.output)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation2.forward(dense2.output)

    # Calculate loss from output of activation2 so softmax activation
    data_loss = loss_function.forward(activation2.output, y)

    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from output of activation2 and targets
    predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print('epoch:', epoch, 'acc:', f'{accuracy:.3f}', 'loss:', f'{loss:.3f}', '(data_loss:', f'{data_loss:.3f}', 'reg_loss:', f'{regularization_loss:.3f})', ')', 'lr:', optimizer.current_learning_rate)

    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dvalues)
    dense2.backward(activation2.dvalues)
    dropout1.backward(dense2.dvalues)
    activation1.backward(dropout1.dvalues)
    dense1.backward(activation1.dvalues)

    # Update weights
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

#============================================================================================================================================================================#

# Validate model

# Create test dataset
X_test, y_test = create_data(100, 3)

# Make a forward pass of our training data thru this layer
dense1.forward(X_test)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Calculate loss from output of activation2 so softmax activation
loss = loss_function.forward(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0 acc: 0.324 loss: 1.099 (data_loss: 1.099 reg_loss: 0.000) ) lr: 0.05
epoch: 100 acc: 0.580 loss: 0.928 (data_loss: 0.885 reg_loss: 0.043) ) lr: 0.04999752506207612
epoch: 200 acc: 0.627 loss: 0.883 (data_loss: 0.831 reg_loss: 0.052) ) lr: 0.049990050996574775
epoch: 300 acc: 0.670 loss: 0.826 (data_loss: 0.767 reg_loss: 0.059) ) lr: 0.04997758005043209
epoch: 400 acc: 0.665 loss: 0.808 (data_loss: 0.752 reg_loss: 0.056) ) lr: 0.04996011596895705
epoch: 500 acc: 0.672 loss: 0.780 (data_loss: 0.723 reg_loss: 0.057) ) lr: 0.04993766399395728
epoch: 600 acc: 0.692 loss: 0.785 (data_loss: 0.730 reg_loss: 0.056) ) lr: 0.04991023086111661
epoch: 700 acc: 0.684 loss: 0.779 (data_loss: 0.726 reg_loss: 0.053) ) lr: 0.049877824796627425
epoch: 800 acc: 0.687 loss: 0.764 (data_loss: 0.712 reg_loss: 0.052) ) lr: 0.0498404555130797
epoch: 900 acc: 0.690 loss: 0.769 (data_loss: 0.719 reg_loss: 0.050) ) lr: 0.04979813420460921
epoch: 1000 acc: 0.695 loss: 0.751 (data_loss: 0.701 reg_loss: 0.0

- while our accuracy and loss suffered considerably, we’ve found a scenario where our validation set actually performs better than our in-sample dataset, which is due to removing dropout when testing
- further tweaking would likely fix the accuracy issue; for example, due to our regularization tactics, we can change our layer sizes to 512: 

In [133]:
# Create dataset ---> changed number of neurons per layer to 512 (warning: this will take a very long time)
X, y = create_data(1000, 3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 512, weight_regularizer_l2=5e-4, bias_regularizer_l2=5e-4)  # first dense layer, 2 inputs (each sample has 2 features), 3 outputs

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create dropout layer
dropout1 = Layer_Dropout(0.1)

# Create second Dense layer with 3 input features (as we take output of previous layer here) and 3 output values (output values)
dense2 = Layer_Dense(512, 3)  # second dense layer, 3 inputs, 3 outputs

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Create loss function
loss_function = Loss_CategoricalCrossentropy()

# Create optimizer
#optimizer = Optimizer_SGD(decay=1e-8, momentum=0.9)
#optimizer = Optimizer_Adagrad(decay=1e-8)
#optimizer = Optimizer_RMSprop(learning_rate=0.05, decay=4e-8, rho=0.999)
optimizer = Optimizer_Adam(learning_rate=0.05, decay=1e-8)

# Train in loop
for epoch in range(10001):

    # Make a forward pass of our training data thru this layer
    dense1.forward(X)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation1.forward(dense1.output)

    # Make a forward pass thru Dropout layer 
    dropout1.forward(activation1.output)

    # Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
    dense2.forward(dropout1.output)

    # Make a forward pass thru activation function - we take output of previous layer here
    activation2.forward(dense2.output)

    # Calculate loss from output of activation2 so softmax activation
    data_loss = loss_function.forward(activation2.output, y)

    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)

    # Calculate overall loss
    loss = data_loss + regularization_loss

    # Calculate accuracy from output of activation2 and targets
    predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print('epoch:', epoch, 'acc:', f'{accuracy:.3f}', 'loss:', f'{loss:.3f}', '(data_loss:', f'{data_loss:.3f}', 'reg_loss:', f'{regularization_loss:.3f})', ')', 'lr:', optimizer.current_learning_rate)

    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dvalues)
    dense2.backward(activation2.dvalues)
    dropout1.backward(dense2.dvalues)
    activation1.backward(dropout1.dvalues)
    dense1.backward(activation1.dvalues)

    # Update weights
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

# Validate model

# Create test dataset
X_test, y_test = create_data(100, 3)

# Make a forward pass of our training data thru this layer
dense1.forward(X_test)

# Make a forward pass thru activation function - we take output of previous layer here
activation1.forward(dense1.output)

# Make a forward pass thru second Dense layer - it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass thru activation function - we take output of previous layer here
activation2.forward(dense2.output)

# Calculate loss from output of activation2 so softmax activation
loss = loss_function.forward(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
predictions = np.argmax(activation2.output, axis=1)  # calculate values along first axis
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')

epoch: 0 acc: 0.298 loss: 1.099 (data_loss: 1.099 reg_loss: 0.000) ) lr: 0.05
epoch: 100 acc: 0.714 loss: 0.756 (data_loss: 0.661 reg_loss: 0.096) ) lr: 0.04999752506207612
epoch: 200 acc: 0.796 loss: 0.677 (data_loss: 0.554 reg_loss: 0.124) ) lr: 0.049990050996574775
epoch: 300 acc: 0.812 loss: 0.612 (data_loss: 0.484 reg_loss: 0.128) ) lr: 0.04997758005043209
epoch: 400 acc: 0.804 loss: 0.620 (data_loss: 0.494 reg_loss: 0.126) ) lr: 0.04996011596895705
epoch: 500 acc: 0.834 loss: 0.580 (data_loss: 0.456 reg_loss: 0.124) ) lr: 0.04993766399395728
epoch: 600 acc: 0.836 loss: 0.561 (data_loss: 0.438 reg_loss: 0.123) ) lr: 0.04991023086111661
epoch: 700 acc: 0.832 loss: 0.562 (data_loss: 0.443 reg_loss: 0.119) ) lr: 0.049877824796627425
epoch: 800 acc: 0.818 loss: 0.588 (data_loss: 0.470 reg_loss: 0.118) ) lr: 0.0498404555130797
epoch: 900 acc: 0.845 loss: 0.540 (data_loss: 0.427 reg_loss: 0.113) ) lr: 0.04979813420460921
epoch: 1000 acc: 0.824 loss: 0.557 (data_loss: 0.447 reg_loss: 0.1