# Neural Network implementation from scratch

- Start by explaining how a neural network works
- Explain the basics of matrix multiplication
- How Layer.forward works
- How to structure Net
- How SGD works
- How backward works

This notebook goes through a basic implementation of a neural network using only NumPy, the only library we'll need, so let's import it. We also define a helper decorator that lets us add methods to the classes as we build them throughout the notebook.

In [12]:
import numpy as np

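def add_to_class(Class):
    # Decorator that attaches the decorated function to Class as a method
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
        return obj  # keep the function bound to its name in the notebook too
    return wrapper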

We'll go step by step, from the input data to the update of the network's weights and biases. We won't go deep into the math and derivations; check the Resources section at the bottom of the page to read more about those topics.

1. Neuron
2. Layer
3. Network
   
## Neuron
A neural network is made out of (yes, you guessed it) neurons. A neuron can basically be described as a linear mathematical function. It takes any number of inputs, multiplies each one by a number called a weight, sums everything, and adds another number called the bias. For example, a neuron with two inputs looks like this:

$$z(x_1, x_2) = w_1*x_1 + w_2*x_2 + b$$

This number $z$ then passes (most of the time) through an activation function. We're going to use the ReLU function here, which returns its input if that input is positive and 0 otherwise:
$$a = ReLU(z) = max(0, z)$$
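
As a minimal sketch of the two formulas above, a single neuron with two inputs can be computed with NumPy as follows. The input, weight and bias values are made up for illustration only:

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and replaces negative ones with 0
    return np.maximum(0, z)

# Illustrative values only (not taken from any trained network)
x = np.array([1.0, 2.0])      # inputs x_1, x_2
w = np.array([0.5, -1.0])     # weights w_1, w_2
b = 0.25                      # bias

z = np.dot(w, x) + b          # w_1*x_1 + w_2*x_2 + b
a = relu(z)

print(z)  # -1.25
print(a)  # 0.0
```

Here $z$ is negative, so the neuron's activation is 0.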

> TODO: add links to more information about ReLU

## Layer
One or more neurons with the same inputs form a layer. Here is where the well-known layer diagram comes in handy. The upper index on the letter indicates the layer number and the lower index indicates which neuron it is. The weights have a second lower index indicating their position _within the neuron_. For example, $w_{34}^2$ would be the fourth weight of the third neuron of the second layer.

<div style="text-align: center">
    <img src="./img/layer.png" width="600" style="border-radius: 20px"/>
    <br/>
    <img src="./img/weight.png" width="200" style="border-radius: 20px"/>
</div>

If we followed this notation literally in code, we would have to calculate the output of each neuron by itself. This is OK if we have 3 neurons, but modern neural networks can have hundreds or thousands of neurons, with hundreds of thousands of weights. Apart from this, we will want to pass several training examples through the network at once. To do that we'll use matrix operations, which is why we normally store a layer's weights and biases (what we call the parameters) as matrices. In the weights matrix, each column represents one neuron, and each row holds the weights that multiply one of the inputs:

$$\begin{bmatrix} w_{11}^1 & w_{21}^1 & w_{31}^1 \\ w_{12}^1 & w_{22}^1 & w_{32}^1 \end{bmatrix}$$

The bias matrix has only one row, with the biases for each neuron:

$$\begin{bmatrix} b_{1}^1 & b_{2}^1 & b_{3}^1 \end{bmatrix}$$

Let's start coding then. As we're connecting all the inputs to all the neurons, we call this a _fully connected_ layer. Sometimes it's also called a Linear layer (if it doesn't have an activation function) or a Dense layer. To initialize the parameters we'll use `np.random.normal(mean, std_dev, shape)`. If you want to know why we use a standard deviation of $\frac{1}{\sqrt{ins}}$ you can check this book. `TODO: add link to the derivatives book`. The weights matrix will have a shape of `ins` rows and `outs` columns. The class constructor then looks like this:


In [17]:
class FullyConnected:
    def __init__(self, ins, outs, activation='relu'):
        self.weights = np.random.normal(0, 1/(ins**.5), (ins, outs))
        self.bias = np.random.normal(0, 1/(ins**.5), (1, outs))
        self.activation = activation
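
One common motivation for the $\frac{1}{\sqrt{ins}}$ scaling (which may or may not be the exact argument the referenced book makes) is that it keeps the spread of a layer's pre-activations close to the spread of its inputs. The sketch below checks this numerically; the layer size, the seeded generator and the unit-variance inputs are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, an assumption for reproducibility
ins, outs = 256, 64              # hypothetical layer size

# Same initialization rule as the constructor: mean 0, std 1/sqrt(ins)
weights = rng.normal(0, 1 / ins**0.5, (ins, outs))

# A made-up batch of inputs with roughly unit variance
x = rng.normal(0, 1, (1000, ins))
z = x @ weights                  # pre-activations, bias left out for simplicity

print(weights.shape)             # (256, 64)
print(z.std())                   # close to 1
```

With a standard deviation of 1 for the weights instead, `z.std()` would grow to roughly $\sqrt{ins}$.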

## Network
The outputs of a layer become the inputs of the next one. Layers get connected to each other in this way, and that's what makes a network. Let's say that our network has only one more layer, with a single neuron. This will be the output _of the network_, so we won't pass it through an activation function in this case.

<div style="text-align: center">
    <img src="./img/net.png" width="600" style="border-radius: 20px"/>
</div>

To code this, we can simply use a list that we'll call `layers`. The constructor of the class `Net` accepts a sequence of integers named `neurons` indicating the number of neurons in each layer, where `neurons[0]` corresponds to the number of inputs. For example, if we have `neurons = (2, 3, 5, 1)`, we have a neural network with 2 inputs, 3 neurons in the first layer, 5 neurons in the second one, and one output. As we said before, each layer has the same number of inputs as the previous layer has outputs. The code is then quite self-explanatory:

In [14]:
class Net:
    def __init__(self, neurons, last_relu=False):
        self.layers = []
        for n in range(len(neurons)-1):
            ins = neurons[n]
            outs = neurons[n+1]
            activation = self.decide_activation(len(neurons), last_relu, n)
            layer = FullyConnected(ins, outs, activation=activation)
            self.layers.append(layer)

At line 7 we call the `decide_activation()` method. It's just a helper that sets ReLU as the activation for all layers but the last one (unless `last_relu` is `True`, in which case the last layer gets ReLU too):

_(we're using the add_to_class decorator we defined above)_

In [15]:
@add_to_class(Net)
def decide_activation(self, neurons, last_relu, n):
    # `neurons` is the length of the sizes sequence, so n == neurons-2 is the last layer
    if n == neurons-2 and not last_relu:
        activation = 'none'
    else:
        activation = 'relu'
    return activation
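
To see what the helper decides for each layer, here is a small sketch using a standalone copy of the same logic (written as a plain function so it runs without the `Net` class; the parameter name `n_sizes` is ours):

```python
def decide_activation(n_sizes, last_relu, n):
    # Standalone copy of the helper above: n_sizes is len(neurons),
    # n is the index of the layer being built
    if n == n_sizes - 2 and not last_relu:
        return 'none'
    return 'relu'

neurons = (2, 3, 5, 1)
n_layers = len(neurons) - 1

activations = [decide_activation(len(neurons), False, n) for n in range(n_layers)]
print(activations)  # ['relu', 'relu', 'none']

with_last = [decide_activation(len(neurons), True, n) for n in range(n_layers)]
print(with_last)    # ['relu', 'relu', 'relu']
```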

## Forward pass
We have all the elements we need to make the forward pass. This function just takes the outputs of one layer and feeds them as inputs to the next one, until we have the final output. The code is quite simple: in a for loop we call each layer's `forward()` method. We also define a `__call__()` method so that calling an instance of the class runs the forward pass directly.

In [18]:
@add_to_class(Net)
def __call__(self, input):
    return self.forward(input)

@add_to_class(Net)
def forward(self, input):
    for layer in self.layers:
        input = layer.forward(input)
    return input

As we mentioned above, we want to use the power of matrix multiplication to pass several training examples at once. In a matrix multiplication, the first matrix has to have the same number of columns as the second matrix has rows. We take the first row of the first matrix, multiply its elements by the corresponding elements of the first column of the second matrix, sum everything, and place the result in the output matrix as its first element. Then we do the same with the second column, and so on. It will be clearer with an animation and real numbers to visualize it.

Let's say we have 5 training examples, with two variables each, and a first layer with 3 neurons. The operation will look like this for the first row: 

$$\begin{bmatrix} x_{1} & y_{1} \\ x_{2} & y_{2} \\ x_{3} & y_{3}\\ x_{4} & y_{4} \\ x_{5} & y_{5}\end{bmatrix} · \begin{bmatrix}w_{11}^1 & w_{21}^1 & w_{31}^1 \\ w_{12}^1 & w_{22}^1 & w_{32}^1\end{bmatrix} = [x_{1}*w_{11}^1 + y_{1}*w_{12}^1 \quad , \quad x_{1}*w_{21}^1 + y_{1}*w_{22}^1\quad , \quad x_{1}*w_{31}^1 + y_{1}*w_{32}^1]$$

Then, we do the same with each of the remaining rows of the first matrix.
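
The layer's `forward()` method is not shown in this notebook. A minimal sketch of what it could look like follows, assuming a hypothetical `SketchLayer` class that mirrors the `FullyConnected` constructor and caches `input` and `z`, the two attributes the backward walkthrough reads later on. The real implementation may differ:

```python
import numpy as np

class SketchLayer:
    """Hypothetical stand-in for FullyConnected, for illustration only."""

    def __init__(self, ins, outs, activation='relu'):
        self.weights = np.random.normal(0, 1 / ins**0.5, (ins, outs))
        self.bias = np.random.normal(0, 1 / ins**0.5, (1, outs))
        self.activation = activation

    def forward(self, x):
        # Cache the input and the pre-activation; the backward pass needs them
        self.input = x
        # (batch, ins) @ (ins, outs) -> (batch, outs); the bias row is broadcast
        self.z = np.dot(x, self.weights) + self.bias
        if self.activation == 'relu':
            return np.maximum(0, self.z)
        return self.z

layer = SketchLayer(2, 3)
batch = np.random.rand(5, 2)   # 5 training examples, 2 variables each
out = layer.forward(batch)
print(out.shape)               # (5, 3)
```

Each row of `out` holds the three neuron activations for one training example.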


In [1]:
from src.model import Net
import numpy as np

BATCH_SIZE = 32
LR = 0.1
net = Net((2, 7, 7, 3, 1))

def f(x, y):
    return (x+4)**3 + 4 * y

def generate_data(size):
    X = np.zeros((size, 2))
    Y = np.zeros((size, 1))
    for n in range(size):
        X[n] = np.random.rand(2)
        Y[n] = f(X[n][0], X[n][1])
    return X, Y + np.random.randn(Y.size, 1) * 0.1

X, Y = generate_data(10000)

In [6]:
net.layers[-2].weights

array([[-0.12049929, -0.28950978,  0.66988556],
       [-0.35532109,  0.40661434,  0.44375281],
       [-0.07391101, -0.17677337, -0.20034583],
       [ 0.13708315, -0.35781561,  0.1689082 ],
       [-0.12742559, -0.25463383, -0.18297772],
       [-0.09330053, -0.28226656, -0.02389244],
       [ 0.21231667, -0.37340849, -0.48735504]])

In [35]:
Y[0]

array([86.76105675])

In [21]:
a = np.arange(1, 11).reshape(5, 2)
b= np.arange(1,7).reshape(2, 3)
np.dot(a, b)

array([[ 9, 12, 15],
       [19, 26, 33],
       [29, 40, 51],
       [39, 54, 69],
       [49, 68, 87]])

Take a sample from the dataset and calculate the derivative of the loss with respect to the network output for each training example.

In [14]:

indices = np.random.randint(len(X), size=(BATCH_SIZE))
sample_X = X[indices]
sample_Y = Y[indices]
out = net(sample_X)
d_loss = out - sample_Y

In [18]:
d_loss[0]

array([-84.63232543])
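
The walkthrough uses `out - sample_Y` as the loss gradient and divides by `BATCH_SIZE` later, when computing the weight gradients. Assuming the loss is the half mean squared error used in the training loop further down, the sketch below checks that formula against a finite-difference estimate. The targets and outputs are random made-up numbers, and `half_mse` is a name introduced here:

```python
import numpy as np

def half_mse(y, out):
    # Half mean squared error over the batch
    return ((y - out) ** 2).sum() / (2 * len(y))

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 1))       # made-up targets
out = rng.normal(size=(4, 1))     # made-up network outputs

analytic = (out - y) / len(y)     # d(loss)/d(out), including the 1/batch factor

# Central finite differences, perturbing one output at a time
eps = 1e-6
numeric = np.zeros_like(out)
for i in range(len(out)):
    step = np.zeros_like(out)
    step[i] = eps
    numeric[i] = (half_mse(y, out + step) - half_mse(y, out - step)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True
```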

If the layer has an activation, multiply the loss gradient by the derivative of ReLU.

In [30]:
if net.layers[-1].activation == 'relu':
    d_loss *= np.greater(net.layers[-1].z, 0)

In [33]:
print(d_loss.shape, net.layers[-1].weights.T.shape)

(32, 1) (1, 3)


To calculate the gradient with respect to the layer's input, take the matrix product of the loss gradient and the transpose of the weights. This is the value that we pass back to the previous layer (the next one in the backward pass).

In [56]:
dx_1 = np.dot(d_loss, net.layers[-1].weights.T)
dx_1.shape

(32, 3)

To calculate the gradient for each weight of this layer, take the matrix product of the transposed layer input and the loss gradient. Divide by the batch size to get the mean over the batch. (TODO: add a drawing of the matrix multiplication)

In [62]:
dw_1 = np.dot(net.layers[-1].input.T, d_loss)/BATCH_SIZE
print('dw.shape: ',  dw_1.shape, ', W.shape: ', net.layers[-1].weights.shape)

dw.shape:  (3, 1) , W.shape:  (3, 1)


The gradient for the bias is simply the mean over the batch of the loss gradients from above.

In [77]:
db_1 = d_loss.mean(axis=0)


Now we update the parameters, multiplying the gradients by the learning rate and subtracting that value from the weights and biases.

In [64]:
net.layers[-1].weights -= LR * dw_1
net.layers[-1].bias -= LR * db_1
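
To see this update rule at work in isolation, here is a toy sketch that applies it repeatedly to a single linear layer with no activation. The data, the target coefficients and the number of steps are made up for illustration and are not related to the network above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((32, 2))                     # made-up mini-batch
Y = X @ np.array([[2.0], [-1.0]]) + 0.5     # made-up linear targets

W = np.zeros((2, 1))
b = np.zeros((1, 1))
LR = 0.1

def loss(W, b):
    # Half mean squared error of the linear layer on the batch
    return ((X @ W + b - Y) ** 2).sum() / (2 * len(X))

before = loss(W, b)
for _ in range(100):
    d_loss = X @ W + b - Y                  # gradient of the loss w.r.t. the output
    W -= LR * (X.T @ d_loss) / len(X)       # same rule as the weights update
    b -= LR * d_loss.mean(axis=0)           # same rule as the bias update
after = loss(W, b)

print(after < before)  # True
```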

Now we pass the gradient that we calculated before for this layer's input (`dx_1`) back to the previous layer, and repeat the process.

In [65]:
if net.layers[-2].activation == 'relu':
    dx_1 *= np.greater(net.layers[-2].z, 0)

In [66]:
print(dx_1.shape, net.layers[-2].weights.T.shape)

(32, 3) (3, 7)


In [68]:
dx_2 = np.dot(dx_1, net.layers[-2].weights.T)
dx_2.shape

(32, 7)

In [70]:
print(net.layers[-2].input.T.shape, ',  dx_1.shape: ', dx_1.shape)


(7, 32) ,  dx_1.shape:  (32, 3)


In [82]:
dw_2 = np.dot(net.layers[-2].input.T, dx_1)/BATCH_SIZE
print('dw.shape: ',  dw_2.shape, ', W.shape: ', net.layers[-2].weights.shape)


dw.shape:  (7, 3) , W.shape:  (7, 3)


In [83]:
print('dx_1.shape: ',  dx_1.shape, ', B.shape: ', net.layers[-2].bias.shape)


dx_1.shape:  (32, 3) , B.shape:  (3,)


In [84]:
db_2 = dx_1.mean(axis=0)
print('db_2.shape: ',  db_2.shape, ', B.shape: ', net.layers[-2].bias.shape)

db_2.shape:  (3,) , B.shape:  (3,)


In [85]:
net.layers[-2].weights -= LR * dw_2
net.layers[-2].bias -= LR * db_2

All of this is coded internally in the layers; all we have to do is call `net.backward()`.
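
The `backward` implementation itself lives in `src.model` and is not shown in this notebook. As a rough sketch of what the per-layer step might look like, the hypothetical helper `layer_backward` below bundles the manual steps from the walkthrough: ReLU derivative, gradient for the input, and mean gradients for the weights and bias. The shapes and random values are assumptions for illustration:

```python
import numpy as np

def layer_backward(x, z, weights, d_out, activation, batch_size):
    """Hypothetical helper mirroring the manual steps above."""
    if activation == 'relu':
        d_out = d_out * np.greater(z, 0)      # derivative of ReLU
    dx = np.dot(d_out, weights.T)             # gradient passed to the previous layer
    dw = np.dot(x.T, d_out) / batch_size      # mean gradient for the weights
    db = d_out.mean(axis=0, keepdims=True)    # mean gradient for the bias row
    return dx, dw, db

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 7))                  # layer input
weights = rng.normal(size=(7, 3))
bias = rng.normal(size=(1, 3))
z = np.dot(x, weights) + bias                 # pre-activation cached in the forward pass
d_out = rng.normal(size=(32, 3))              # gradient arriving from the following layer

dx, dw, db = layer_backward(x, z, weights, d_out, 'relu', 32)
print(dx.shape, dw.shape, db.shape)           # (32, 7) (7, 3) (1, 3)
```

Note that `dw` has the same shape as the weights and `db` the same shape as the bias row, so both can be subtracted from the parameters directly.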

In [19]:
from src.optim import SGD

losses = []
net = Net((2, 3, 1))
optim = SGD(net, LR)
for epoch in range(50):
    indices = np.random.randint(len(X), size=(BATCH_SIZE))
    sample_X = X[indices]
    sample_Y = Y[indices]
    out = net(sample_X)
    sample_loss = ((sample_Y - out) ** 2).sum() / (2*BATCH_SIZE)
    losses.append(sample_loss)
    d_loss = (out - sample_Y)
    net.backward(d_loss, LR, BATCH_SIZE)
    optim.step()

In [36]:
sample_X[0]

array([0.75503691, 0.74366301])

In [37]:
sample_Y[0]

array([110.32976944])

In [38]:
out[0]

array([93.98300414])

In [20]:
import plotly.express as px
print(losses[-5:])
px.line(losses[3:])

[182.9329312503133, 152.85885210244192, 169.3045779163166, 179.8755711883813, 143.6956927396468]


# Classifying the MNIST dataset

In [1]:
import plotly.express as px
import numpy as np
from src.model import Net
from src.optim import SGD
from torchvision.datasets import MNIST
from torch.utils.data import random_split

data = MNIST('./data', train=False)
train, test = random_split(data, [0.8,0.2])

In [2]:
data[0][1]

7

In [3]:
px.imshow(np.asarray(data[0][0]), color_continuous_scale='greys')

In [4]:
def process_data(dataset):
    X = np.zeros((len(dataset), 28*28))
    Y = np.zeros((len(dataset), 10))
    for n in range(len(dataset)):
        X[n] = np.asarray(dataset[n][0]).flatten()/255
        Y[n][dataset[n][1]] = 1
    return X, Y

X_train, Y_train = process_data(train)
X_test, Y_test = process_data(test)
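
`process_data` builds the one-hot targets one row at a time. A possible vectorized alternative, shown here on made-up labels rather than the real dataset, is to index the rows of an identity matrix:

```python
import numpy as np

labels = np.array([7, 2, 1, 0])       # made-up digit labels

# Loop version, as in process_data
Y_loop = np.zeros((len(labels), 10))
for n in range(len(labels)):
    Y_loop[n][labels[n]] = 1

# Vectorized alternative: pick rows of the 10x10 identity matrix
Y_eye = np.eye(10)[labels]

print(np.array_equal(Y_loop, Y_eye))  # True
print(Y_eye[0])                       # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```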


In [5]:
def accuracy(exp, pred):
    exp_number = np.argmax(exp, axis=1)
    pred_number = np.argmax(pred, axis=1)
    true_pred = (exp_number == pred_number).sum()
    return true_pred/len(exp)

def MSE(exp, pred):
    return ((exp - pred) ** 2).sum() / (2*len(exp))
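
As a quick sanity check of the two metrics, the sketch below applies equivalent definitions to a tiny made-up batch with 3 classes instead of 10:

```python
import numpy as np

def accuracy(exp, pred):
    # Fraction of rows whose largest entry is in the same position
    return (np.argmax(exp, axis=1) == np.argmax(pred, axis=1)).sum() / len(exp)

def MSE(exp, pred):
    return ((exp - pred) ** 2).sum() / (2 * len(exp))

# Made-up one-hot targets and network outputs for 3 classes
exp = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
pred = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.6, 0.3, 0.1],    # wrong: predicts class 0 instead of class 2
                 [0.0, 1.0, 0.0]])

print(accuracy(exp, pred))  # 0.75
print(MSE(exp, pred))       # about 0.1775
```

Three of the four rows are classified correctly, and the misclassified row contributes most of the error.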

In [9]:
LR = 0.01
BATCH_SIZE = 32

costs = {'train': [], 'test': []}
accuracies = {'train': [], 'test': []}
net = Net((28*28, 15, 10), last_relu=False)
optim = SGD(net, LR)

for epoch in range(10000):
    # Sample minibatch:
    indices = np.random.randint(len(X_train), size=(BATCH_SIZE))
    sample_X = X_train[indices]
    sample_Y = Y_train[indices]
    train_out = net(sample_X)
    # Update net
    d_loss = (train_out - sample_Y)
    net.backward(d_loss, BATCH_SIZE)
    optim.step()
    # Save loss
    test_out = net(X_test)
    sample_cost = MSE(sample_Y, train_out)
    test_cost = MSE(Y_test, test_out)
    costs['train'].append(sample_cost)
    costs['test'].append(test_cost)
    # Save accuracy
    accuracies['train'].append(accuracy(sample_Y, train_out))
    accuracies['test'].append(accuracy(Y_test, test_out))

In [10]:
px.line(
    costs
)

In [11]:
px.line(accuracies)

In [41]:
px.imshow(X_test[432].reshape(28,28), color_continuous_scale='greys')

In [42]:
np.argmax(Y_test[432])

2

In [33]:
np.argmax(test_out[462])

2