# Neural Network (MLP) from scratch

### Key notes:

- A network *"learns"* by modifying its weights and biases. Training a neural network means **finding the weights and biases that let the network solve the problem**, i.e. **minimizing a loss function**.
<br><br>
- Think of a neuron as a function that takes the activations of ALL neurons in the previous layer and outputs a single number.
<br><br>
- The activation of a neuron is a measure of how "positive" the relevant weighted sum is.
<br><br>
- The bias is simply a value that lets us choose when a neuron is meaningfully active. Think: "bias for inactivity". For example, if a neuron should only activate when its weighted sum is greater than 10, set the bias to -10.
<br><br>
- The weighted sum in a neural network represents the combined influence of all input neurons (i.e., neurons from the previous layer) on a single neuron in the current layer.
<br><br>
    - Each weight signifies the strength of one connection: it determines how much 
    impact that input has on the neuron's weighted sum, and therefore on whether 
    the neuron activates.
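
To make the notes above concrete, here is a minimal sketch of a single neuron: a weighted sum of the previous layer's activations, plus a bias, passed through ReLU. The helper name `neuron_activation` and the example numbers are illustrative assumptions and not part of the classes defined later in this notebook.

```python
import numpy as np

def neuron_activation(prev_activations, weights, bias):
    """Weighted sum of the previous layer's activations plus bias, passed through ReLU."""
    weighted_sum = np.dot(prev_activations, weights) + bias
    return np.maximum(0.0, weighted_sum)

prev = np.array([2.0, 3.0, 1.0])   # activations of the previous layer
w = np.array([1.0, 2.0, 4.0])      # one weight per incoming connection

# weighted sum = 2*1 + 3*2 + 1*4 = 12
print(neuron_activation(prev, w, bias=0.0))    # 12.0
print(neuron_activation(prev, w, bias=-10.0))  # 2.0 (12 - 10: just past the threshold)
print(neuron_activation(prev, w, bias=-15.0))  # 0.0 (12 - 15 < 0: the neuron stays inactive)
```

The more negative the bias, the larger the weighted sum has to be before the neuron produces a non-zero activation.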

### Summary:


## The Structure

- **Forward pass**: the process of passing the input data through the network to compute the output (predictions)

In [11]:
import numpy as np

In [16]:
class Neuron:
    def __init__(self):
        self.activation = None
        self.weight = None
        self.bias = None

    def set_activation(self, value: float) -> None:
        """ Set the activation value of the neuron """
        self.activation = value

    def set_bias(self, value: float) -> None:
        """ Set the bias of the neuron """
        self.bias = value

    def get_bias(self) -> float:
        """ Return the bias of the neuron """
        return self.bias


class Layer:

    # num_inputs : number of neurons in the previous layer
    def __init__(self, num_inputs: int, num_neurons: int):
        """ Initialize a layer.
            - num_inputs: number of neurons in the PREVIOUS layer.
            - num_neurons: number of neurons in the CURRENT layer.
        """
        self.neurons = [Neuron() for _ in range(num_neurons)]       # list of neuron objects
        self.weights = np.random.randn(num_inputs, num_neurons)     # matrix of (initially random) values/weights. shape (num_inputs, num_neurons)
        self.biases = np.zeros(num_neurons) # bias vector

    def set_activations(self, inputs: np.ndarray) -> None:
        """ Calculate the weighted sum for every neuron in the layer 
            (these are NOT yet activation values; apply_activation_function does that). """
        self.activations = np.dot(inputs, self.weights) + self.biases
    
    def apply_activation_function(self) -> None:
        """ Apply the activation function (ReLU) to the weighted sum. 
            This is the value of the neuron's activation. """
        self.activations = self.relu(self.activations)

    def get_activations(self) -> np.ndarray:
        """ Return the activations of the layer. """
        return self.activations
    
    @staticmethod
    def relu(x: np.ndarray) -> np.ndarray:
        """ 
        Apply the ReLU activation function. 
        If the input value > 0, return the input value.
        If the input value <= 0, return 0.
        """
        return np.maximum(0, x)
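
The forward pass through one layer can also be sketched with plain NumPy arrays, which makes the shapes easier to see. This is a rough illustration, not the notebook's `Layer` class: the batch size of 5, the layer sizes, and the seeded generator `rng` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable

batch = rng.standard_normal((5, 3))     # 5 examples, 3 inputs each
weights = rng.standard_normal((3, 4))   # shape (num_inputs, num_neurons)
biases = np.zeros(4)                    # one bias per neuron

z = np.dot(batch, weights) + biases     # weighted sums, shape (5, 4)
a = np.maximum(0, z)                    # ReLU activations, same shape

print(z.shape, a.shape)                 # (5, 4) (5, 4)
print(bool((a >= 0).all()))             # True: ReLU never outputs a negative value
```

Each row of `a` holds the activations of the 4 neurons for one example, so a whole batch is processed with a single matrix multiplication.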

## Training

#### Key notes:

- **Loss function**: returns how inaccurate the network's outputs are
<br><br>
- **Gradient descent**: an algorithm to minimize the loss function. Goal: find a minimum of the loss function (ideally the global minimum, although in practice it may settle in a local one)
    - i.e. find the network's parameters (weights and biases) that minimize the loss function (i.e. minimize the network's output errors)
    <br>
    - To calculate **gradients**: find the derivative of the loss w.r.t. network output
        - $\frac{\partial \text{loss}}{\partial a} = \frac{a - y_{\text{true}}}{m}$ 
        <br>
        where m is the number of examples
<br><br>
- **Epoch**: **one complete pass** through the entire training dataset
<br><br>
- **Learning rate:** hyperparameter that controls how big a step you take when updating your network's parameters (weights and biases) during gradient descent
    - **too high**: a high learning rate might cause training to overshoot the minimum of the loss function
    - **too low**: a low learning rate can make training very slow or leave it stuck in a local minimum
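
The effect of the learning rate can be seen on a toy problem. The sketch below runs gradient descent on the one-dimensional function f(w) = (w - 3)², whose minimum is at w = 3. The function, the starting point, and the helper name `gradient_descent` are assumptions made for illustration; a real loss surface is far less tidy.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)**2 starting from w, using its gradient 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad   # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With a learning rate of 0.1 the result lands very close to 3. With 0.001 it has barely moved from the starting point after 50 steps, and with 1.1 every step overshoots the minimum by more than the previous one, so `w` moves further and further away.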


In [17]:
# loss function: squared error per prediction (the mean over examples is taken when computing gradients)
def loss(y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """ Measures how inaccurate the network's predictions are (element-wise). """
    return 0.5 * (y_pred - y_true)**2   # coefficient of 1/2 to make derivative cleaner

# calculates gradients
def loss_derivative(y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """ Compute derivative of the loss function w.r.t. predictions. I.e. the gradients """
    return y_pred - y_true

# derivative of the activation function
def relu_derivative(z: np.ndarray) -> np.ndarray:
    """ Calculate the derivative of relu: returns 1 for positive values, 0 otherwise """
    # np.where works element-wise; `if z > 0` raises an error for arrays with more than one element
    return np.where(z > 0, 1, 0)
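
One way to gain confidence in a hand-derived gradient is to compare it with a numerical estimate. The sketch below checks the averaged form from the notes, (a - y_true) / m, against central finite differences. The names `mse` and `mse_derivative`, the sample values, and the step size `eps` are assumptions chosen for this example.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean over examples of 0.5 * squared error."""
    return np.mean(0.5 * (y_pred - y_true) ** 2)

def mse_derivative(y_pred, y_true):
    """Analytic gradient of mse w.r.t. each prediction: (a - y_true) / m."""
    return (y_pred - y_true) / y_true.shape[0]

y_true = np.array([[1.0], [0.0], [2.0]])
y_pred = np.array([[0.5], [0.8], [2.5]])

# Central finite differences: nudge one prediction at a time and watch the loss.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    up, down = y_pred.copy(), y_pred.copy()
    up[i, 0] += eps
    down[i, 0] -= eps
    numeric[i, 0] = (mse(up, y_true) - mse(down, y_true)) / (2 * eps)

# Prints whether the numerical and analytic gradients agree.
print(np.allclose(numeric, mse_derivative(y_pred, y_true), atol=1e-8))
```

If the two disagree, the analytic derivative (or the loss it is supposed to belong to) probably contains a mistake.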

In [18]:
# Functions for gradient descent

def compute_gradients(layer: Layer, input_data: np.ndarray, target_outputs: np.ndarray) -> tuple:
    """ Compute gradients. Tells network how to incrementally minimize loss function """

    # compute the gradient (loss derivative) w.r.t activations
    dA = loss_derivative(layer.get_activations(), target_outputs)

    # compute the gradient (loss derivative) w.r.t weights using matrix multiplication
    dW = (1 / input_data.shape[0]) * np.dot(input_data.T, dA)

    # compute gradient w.r.t biases
    dB = (1 / input_data.shape[0]) * np.sum(dA, axis=0)

    return dW, dB

def update_parameters(layer: Layer, dW: np.ndarray, dB: np.ndarray, learning_rate: float) -> None:
    """ Take one gradient descent step: move each parameter against its gradient """
    # update weights
    layer.weights -= learning_rate * dW

    # update biases
    layer.biases -= learning_rate * dB
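
Put together, the gradient and update steps form a training loop. The sketch below applies the same formulas to a single layer with no activation function, fitted to data generated from a known linear rule so the result is easy to check. The toy data, `true_w`, the offset of 0.5, and the hyperparameters are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a known linear rule, so we can see whether it is recovered.
x = rng.standard_normal((200, 2))
true_w = np.array([[2.0], [-1.0]])
y = x @ true_w + 0.5

weights = np.zeros((2, 1))
biases = np.zeros(1)
learning_rate = 0.1

for _ in range(500):
    pred = x @ weights + biases                 # forward pass (no activation)
    dA = pred - y                               # loss derivative w.r.t. outputs
    dW = (1 / x.shape[0]) * np.dot(x.T, dA)     # gradient w.r.t. weights
    dB = (1 / x.shape[0]) * np.sum(dA, axis=0)  # gradient w.r.t. biases
    weights -= learning_rate * dW               # gradient descent step
    biases -= learning_rate * dB

print(weights.ravel(), biases)   # should end up close to [2, -1] and [0.5]
```

Because the targets here follow an exact linear rule, the learned parameters converge to the values used to generate the data. With a ReLU layer the gradient would also need the activation's derivative, which is what backpropagation handles.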

### Backward process (backpropagation)

Backpropagation is the algorithm that determines how a single training example would nudge the network's weights and biases: not only whether each one should go up or down, but in what relative proportions, so that the loss decreases as rapidly as possible.
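
For a single layer, the chain rule behind this can be written out explicitly. As a sketch, assume a batch of inputs $X$ with $m$ rows, weights $W$, biases $b$, weighted sums $z = XW + b$ and activations $a = \text{ReLU}(z)$, and assume the gradient $\frac{\partial \text{loss}}{\partial a}$ has already been handed back from the next layer (or from the loss itself). Here $\odot$ denotes element-wise multiplication:

$$
\begin{aligned}
\frac{\partial \text{loss}}{\partial z} &= \frac{\partial \text{loss}}{\partial a} \odot \text{ReLU}'(z) \\
\frac{\partial \text{loss}}{\partial W} &= X^{\top} \, \frac{\partial \text{loss}}{\partial z} \\
\frac{\partial \text{loss}}{\partial b} &= \sum_{i=1}^{m} \left( \frac{\partial \text{loss}}{\partial z} \right)_{i} \\
\frac{\partial \text{loss}}{\partial X} &= \frac{\partial \text{loss}}{\partial z} \, W^{\top}
\end{aligned}
$$

The last line is what gets passed to the previous layer, which repeats the same four steps. These correspond to `dz`, `dW`, `dB` and `dinputs` in the `Layer.backward` method of the complete MLP.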

# Complete MLP

In [36]:
import numpy as np


# ------- Useful Functions -------

# relu (activation function)
def relu(x: np.ndarray) -> np.ndarray:
    """ Returns x if x > 0, else 0 (element-wise) """
    return np.maximum(0, x)

def relu_derivative(x: np.ndarray) -> np.ndarray:
    """ Returns 1 if x > 0, else 0 (element-wise) """
    return np.where(x > 0, 1, 0)

# loss function (mse)
def mse(y_pred, y_true)->float:
    """ Returns how inaccurate the network's predictions are 
    compared to expected output (lower is better) """
    return np.mean(0.5 * (y_pred - y_true)**2)

def mse_derivative(y_pred, y_true):
    # note: dividing by number of examples for gradient averaging
    return (y_pred - y_true) / y_true.shape[0]


# ------- Layer structure -------

class Layer:
    def __init__(self, num_inputs, num_neurons):
        # initialize layer with random weights and zero biases
        self.weights = np.random.randn(num_inputs, num_neurons)     # 2d array (num_inputs, num_neurons)
        self.biases = np.zeros((1, num_neurons))                    # effectively a vector (2d array with shape (1, num_neurons) )

    def forward(self, inputs):
        """ Forward process """
        # compute weighted sum (z) and activations (a)
        self.inputs = inputs    # store inputs for use in backprop
        self.z = np.dot(inputs, self.weights) + self.biases  # weighted sum
        self.activations = relu(self.z)     # apply activation function
        return self.activations
    
    def backward(self, dA, learning_rate):
        """ Backpropagation / backward process """
        # compute derivative of activation function
        dz = dA * relu_derivative(self.z)
        # compute gradients for weights and biases
        dW = np.dot(self.inputs.T, dz)
        dB = np.sum(dz, axis=0, keepdims=True)
        # compute gradient to pass to previous layer
        dinputs = np.dot(dz, self.weights.T)
        # update parameters
        self.weights -= learning_rate * dW
        self.biases -= learning_rate * dB
        return dinputs
    

# ------- Network structure -------

class Network:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # initialize network layers
        self.hidden_layer = Layer(input_dim, hidden_dim)
        self.output_layer = Layer(hidden_dim, output_dim)

    def forward(self, x):
        # forward pass through the network
        self.hidden_activations = self.hidden_layer.forward(x)
        self.output_activations = self.output_layer.forward(self.hidden_activations)
        return self.output_activations
    
    def backward(self, x, y, learning_rate):
        # perform a forward pass
        y_pred = self.forward(x)    

        # compute loss derivative at the output
        dLoss = mse_derivative(y_pred, y)  

        # backpropagate through output layer
        dHidden = self.output_layer.backward(dLoss, learning_rate) 

        # backpropagate through hidden layer
        self.hidden_layer.backward(dHidden, learning_rate) 

    def train(self, x, y, epochs, learning_rate):
        # training loop
        for epoch in range(epochs):
            y_pred = self.forward(x)
            loss_val = mse(y_pred, y)
            self.backward(x, y, learning_rate)
            if epoch % 100 == 0:
                print(f'epoch {epoch}, loss: {loss_val}')

In [37]:
# example usage:
if __name__ == '__main__':
    # create some dummy data: x as inputs and y as target outputs
    x = np.random.randn(100, 3)    # 100 examples, 3 features each (thus 3 inputs)
    #print(x)
    y = np.random.randn(100, 1)    # 100 target outputs

    # initialize the network (3 inputs, 4 neurons in hidden layer, 1 output)
    net = Network(input_dim=3, hidden_dim=4, output_dim=1)
    
    # train the network
    net.train(x, y, epochs=1000, learning_rate=0.01)

epoch 0, loss: 0.4750603820847576
epoch 100, loss: 0.441562207673828
epoch 200, loss: 0.43494305406359274
epoch 300, loss: 0.43316537643373715
epoch 400, loss: 0.4319842778344139
epoch 500, loss: 0.4311104397661298
epoch 600, loss: 0.4299355672965768
epoch 700, loss: 0.42835925122788915
epoch 800, loss: 0.42714772313095556
epoch 900, loss: 0.42603610675377207


If the loss decreases consistently during training, as it does in the output above, that is generally a good sign that the network is learning.

### Recommended Resources

- [3Blue1Brown Playlist](https://www.3blue1brown.com/topics/neural-networks)
- [ML cheatsheet](https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html)
- [Kaggle Intro to NN](https://www.kaggle.com/code/ryanholbrook/deep-neural-networks)