# Neural Network implementation from scratch

- Start by explaining how a neural network works
- Briefly explain matrix multiplication
- How Layer.forward works
- How to structure Net
- How SGD works
- How backward works

This notebook walks through a basic implementation of a neural network using only NumPy. It goes step by step, from the input data to the update of the network's weights and biases. It doesn't go deep into the math or the derivations; check the Resources section at the bottom of the page to read more about those topics.

1. Neuron
2. Layer
3. Network
   
## Neuron
A neural network is made of (yes, you guessed it) neurons. A neuron can be described as a linear function of its inputs: it takes any number of inputs, multiplies each one by its own weight, sums the results, and adds another number called the bias. For example, a neuron with two inputs looks like this:

$$z(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$$

This number $z$ then usually passes through an activation function. We're going to use the ReLU function here, which returns its input if that input is positive and 0 otherwise:
$$\mathrm{ReLU}(z) = \max(0, z)$$
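
As a minimal sketch of the two formulas above, a single neuron followed by ReLU can be written in a few lines of NumPy. The function names `relu` and `neuron` and the numbers are made up for illustration; they are not taken from `src.model`.

```python
import numpy as np

def relu(z):
    # Returns z where z is positive and 0 everywhere else.
    return np.maximum(0, z)

def neuron(x, w, b):
    # Weighted sum of the inputs plus the bias, followed by ReLU.
    z = np.dot(w, x) + b
    return relu(z)

x = np.array([1.0, 2.0])    # inputs x_1, x_2 (made up)
w = np.array([0.5, -1.0])   # weights w_1, w_2 (made up)
b = 0.25                    # bias (made up)

z = np.dot(w, x) + b        # 0.5*1 + (-1)*2 + 0.25 = -1.25
a = neuron(x, w, b)         # ReLU clips the negative value to 0.0
```

With these values $z$ is negative, so the neuron outputs 0.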

> TODO: add links to more information about ReLU

## Layer
One or more neurons with the same number of inputs form a layer. This is where the well-known layer diagram comes in handy.
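
Because every neuron in a layer sees the same inputs, the whole layer can be computed with a single matrix multiplication. The class below is a hypothetical sketch, not the layer defined in `src.model`. Its attribute names (`weights`, `bias`, `input`, `z`, `activation`) are chosen to mirror the ones accessed later in this notebook, and the weight initialization is an assumption.

```python
import numpy as np

class DenseLayer:
    """Hypothetical layer; not the class defined in src.model."""

    def __init__(self, n_inputs, n_neurons, activation='relu', seed=0):
        rng = np.random.default_rng(seed)
        # One column of weights per neuron, one bias per neuron.
        self.weights = rng.normal(0.0, 0.1, size=(n_inputs, n_neurons))
        self.bias = np.zeros(n_neurons)
        self.activation = activation

    def forward(self, x):
        # Cache the input and the pre-activation; the backward pass needs both.
        self.input = x
        self.z = x @ self.weights + self.bias
        if self.activation == 'relu':
            return np.maximum(0, self.z)
        return self.z

layer = DenseLayer(n_inputs=2, n_neurons=3)
batch = np.ones((32, 2))          # 32 examples with 2 features each
out = layer.forward(batch)        # shape (32, 3): one output per neuron
```

Each row of `batch` is one example and each column of `weights` is one neuron, so the output has one row per example and one column per neuron.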


In [None]:
def add_to_class(Class):
    # Decorator factory: attaches the decorated function to Class as a method.
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper
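
A short usage sketch for the decorator above: it lets a method be added to a class after the class has been defined, which is handy when a class is built up across several notebook cells. The `Counter` class and `increment` method are made up for illustration.

```python
def add_to_class(Class):
    # Same decorator as in the cell above, repeated so this example runs alone.
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class Counter:                     # hypothetical class, for illustration only
    def __init__(self):
        self.count = 0

@add_to_class(Counter)
def increment(self):
    # Becomes Counter.increment once the decorator runs.
    self.count += 1
    return self.count

c = Counter()
result = c.increment()
```

Since `wrapper` does not return anything, the module-level name `increment` ends up bound to `None`; the function is only reachable through the class.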

In [11]:
from src.model import Net
import numpy as np

BATCH_SIZE = 32
LR = 0.1
net = Net((2,7, 7, 3, 1))

def f(x, y):
    return (x+4)**3 + 4 * y

def generate_data(size):
    X = np.zeros((size, 2))
    Y = np.zeros((size, 1))
    for n in range(size):
        X[n] = np.random.rand(2)
        Y[n] = f(X[n][0], X[n][1])
    return X, Y + np.random.randn(Y.size, 1) * 0.1

X, Y = generate_data(10000)

In [35]:
Y[0]

array([86.76105675])

Take a sample from the dataset and calculate the derivative of the loss with respect to the network output for each training example.

In [14]:

indices = np.random.randint(len(X), size=(BATCH_SIZE))
sample_X = X[indices]
sample_Y = Y[indices]
out = net(sample_X)
d_loss = out - sample_Y

In [18]:
d_loss[0]

array([-84.63232543])
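
The expression `out - sample_Y` is the derivative of the half squared error $\frac{1}{2}(out - y)^2$ with respect to `out`, the same per-example loss that is summed and averaged in the training loop later in this notebook. The sketch below checks this numerically on made-up values; the helper name `half_squared_error` is an assumption.

```python
import numpy as np

def half_squared_error(out, target):
    # Per-example loss: 0.5 * (out - target)^2
    return 0.5 * (out - target) ** 2

out = np.array([2.0, -1.0, 0.5])       # made-up predictions
target = np.array([1.0, 1.0, 0.5])     # made-up targets

analytic = out - target                # derivative of the loss w.r.t. out

# Central finite difference as an independent check.
eps = 1e-6
numeric = (half_squared_error(out + eps, target)
           - half_squared_error(out - eps, target)) / (2 * eps)
```

Both arrays agree up to floating-point error, which is why no explicit derivative code is needed for this loss.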

If the layer uses a ReLU activation, multiply the loss gradient by the derivative of ReLU, which is 1 where $z > 0$ and 0 elsewhere.

In [30]:
if net.layers[-1].activation == 'relu':
    d_loss *= np.greater(net.layers[-1].z, 0)
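
The effect of the mask in the cell above is easier to see on a tiny made-up array: the gradient passes through unchanged where the pre-activation was positive and is zeroed everywhere else.

```python
import numpy as np

z = np.array([[-2.0, 0.0, 3.0]])           # made-up pre-activations
upstream = np.array([[10.0, 10.0, 10.0]])  # made-up incoming gradient

mask = np.greater(z, 0)                    # [[False, False, True]]
grad = upstream * mask                     # gradient survives only where z > 0
```

Note that `np.greater(z, 0)` treats $z = 0$ as inactive, so the gradient there is 0.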

In [33]:
print(d_loss.shape, net.layers[-1].weights.T.shape)

(32, 1) (1, 3)


To calculate the gradient with respect to the layer's input, matrix-multiply the loss gradient by the transpose of the weights. This is the value that we pass on to the next layer in the backward pass, that is, the previous layer of the network.

In [56]:
dx_1 = np.dot(d_loss, net.layers[-1].weights.T)
dx_1.shape

(32, 3)
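
To see that multiplying by the transposed weights really gives the gradient with respect to the layer input, the sketch below compares it against a finite-difference estimate. It assumes a single linear layer without activation and a summed half squared error; all arrays are random, made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of 4 examples, 3 inputs
W = rng.normal(size=(3, 1))      # weights of a 3-input, 1-neuron layer
b = np.zeros(1)
y = rng.normal(size=(4, 1))

def loss(x_in):
    # Sum of half squared errors of a linear layer without activation.
    return 0.5 * ((x_in @ W + b - y) ** 2).sum()

d_out = x @ W + b - y            # gradient w.r.t. the layer output, (4, 1)
dx = d_out @ W.T                 # gradient w.r.t. the layer input, (4, 3)

# Finite-difference estimate for every entry of x.
eps = 1e-6
dx_numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        step = np.zeros_like(x)
        step[i, j] = eps
        dx_numeric[i, j] = (loss(x + step) - loss(x - step)) / (2 * eps)
```

`dx` has the same shape as the layer input, which is what makes it usable as the incoming gradient of the layer before it.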

To calculate the gradient for each weight of this layer, matrix-multiply the transpose of the layer's original input by the loss gradient. Divide by the batch size to get the mean over the batch. (TODO: add a diagram of the matrix multiplication)

In [62]:
dw_1 = np.dot(net.layers[-1].input.T, d_loss)/BATCH_SIZE
print('dw.shape: ',  dw_1.shape, ', W.shape: ', net.layers[-1].weights.shape)

dw.shape:  (3, 1) , W.shape:  (3, 1)
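
The matrix product above is a compact way of averaging one outer product per training example. The sketch below writes both forms side by side on random, made-up data; the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size = 5
layer_input = rng.normal(size=(batch_size, 3))   # what the layer received
d_out = rng.normal(size=(batch_size, 1))         # gradient at the layer output

# Matrix form, as in the cell above.
dw = layer_input.T @ d_out / batch_size

# Same thing written out: average of one outer product per example.
dw_loop = np.zeros((3, 1))
for n in range(batch_size):
    dw_loop += np.outer(layer_input[n], d_out[n])
dw_loop /= batch_size
```

The two results are identical up to floating-point error, and both have the same shape as the weight matrix.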


The gradient for the bias is simply the mean, over the batch, of the loss gradient from above.

In [77]:
db_1 = d_loss.mean(axis=0)
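
The bias is added to every example unchanged, so its gradient is just the batch mean of the gradient at the layer output. A small numerical check on made-up data, assuming a linear layer without activation and a half squared error averaged over the batch:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)
y = rng.normal(size=(6, 2))

def mean_loss(bias):
    # Half squared error, averaged over the batch.
    return 0.5 * ((x @ W + bias - y) ** 2).sum() / len(x)

d_out = x @ W + b - y
db = d_out.mean(axis=0)          # one value per neuron, shape (2,)

# Finite-difference estimate for every bias entry.
eps = 1e-6
db_numeric = np.zeros_like(b)
for k in range(b.size):
    step = np.zeros_like(b)
    step[k] = eps
    db_numeric[k] = (mean_loss(b + step) - mean_loss(b - step)) / (2 * eps)
```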


Now we update the parameters: multiply the gradients by the learning rate and subtract the result from the weights and biases.

In [64]:
net.layers[-1].weights -= LR * dw_1
net.layers[-1].bias -= LR * db_1
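
The same update rule can be watched in isolation on a toy problem with a single parameter. This sketch minimizes $\frac{1}{2}(w - 3)^2$, a made-up loss chosen only because its minimum is known to be at $w = 3$.

```python
w = 0.0            # made-up starting parameter
lr = 0.1           # learning rate

history = []
for _ in range(100):
    grad = w - 3.0         # derivative of 0.5 * (w - 3)^2
    w -= lr * grad         # same update rule as for the weights and biases
    history.append(0.5 * (w - 3.0) ** 2)
```

Each step moves `w` a fraction of the way toward 3, and the recorded loss shrinks accordingly.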

Now we pass the gradient with respect to this layer's input, which we calculated before (`dx_1`), to the next layer and repeat the process.

In [65]:
# dx_1 is the gradient at the output of layers[-2], so check that layer's activation
if net.layers[-2].activation == 'relu':
    dx_1 *= np.greater(net.layers[-2].z, 0)

In [66]:
print(dx_1.shape, net.layers[-2].weights.T.shape)

(32, 3) (3, 7)


In [68]:
dx_2 = np.dot(dx_1, net.layers[-2].weights.T)
dx_2.shape

(32, 7)

In [70]:
print(net.layers[-2].input.T.shape, ',  dx_1.shape: ', dx_1.shape)


(7, 32) ,  dx_1.shape:  (32, 3)


In [82]:
dw_2 = np.dot(net.layers[-2].input.T, dx_1)/BATCH_SIZE
print('dw.shape: ',  dw_2.shape, ', W.shape: ', net.layers[-2].weights.shape)


dw.shape:  (7, 3) , W.shape:  (7, 3)


In [83]:
print('dx_1.shape: ',  dx_1.shape, ', B.shape: ', net.layers[-2].bias.shape)


dx_1.shape:  (32, 3) , B.shape:  (3,)


In [84]:
db_2 = dx_1.mean(axis=0)
print('db_2.shape: ',  db_2.shape, ', B.shape: ', net.layers[-2].bias.shape)

db_2.shape:  (3,) , B.shape:  (3,)


In [85]:
net.layers[-2].weights -= LR * dw_2
net.layers[-2].bias -= LR * db_2
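
The per-layer steps repeated above can be folded into one method per layer and a loop that visits the layers in reverse order. The sketch below is a hypothetical, self-contained version: the `DenseLayer` class, its `backward(grad, lr)` signature, and the toy target are assumptions, and it updates the parameters inside `backward` instead of using a separate optimizer as the `src` package does.

```python
import numpy as np

class DenseLayer:
    """Hypothetical layer; the real one lives in src.model."""

    def __init__(self, n_in, n_out, activation, rng):
        self.weights = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.bias = np.zeros(n_out)
        self.activation = activation

    def forward(self, x):
        self.input = x
        self.z = x @ self.weights + self.bias
        return np.maximum(0, self.z) if self.activation == 'relu' else self.z

    def backward(self, grad, lr):
        if self.activation == 'relu':
            grad = grad * np.greater(self.z, 0)     # ReLU derivative
        dx = grad @ self.weights.T                  # passed to the previous layer
        dw = self.input.T @ grad / len(grad)        # mean over the batch
        db = grad.mean(axis=0)
        self.weights -= lr * dw
        self.bias -= lr * db
        return dx

rng = np.random.default_rng(0)
layers = [DenseLayer(2, 3, 'relu', rng), DenseLayer(3, 1, None, rng)]

x = rng.random((32, 2))
y = x.sum(axis=1, keepdims=True)                    # made-up target

def forward(x):
    for layer in layers:
        x = layer.forward(x)
    return x

loss_before = 0.5 * ((forward(x) - y) ** 2).mean()
for _ in range(200):
    grad = forward(x) - y
    for layer in reversed(layers):                  # last layer first
        grad = layer.backward(grad, lr=0.1)
loss_after = 0.5 * ((forward(x) - y) ** 2).mean()
```

`dx` is computed before the weights are updated, matching the order of the manual steps above.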

All of this is coded internally in the layers, so all we have to do is call `net.backward()`.

In [1]:
from src.optim import SGD

In [19]:
from src.optim import SGD

losses = []
net = Net((2, 3, 1))
optim = SGD(net, LR)
for epoch in range(50):
    indices = np.random.randint(len(X), size=(BATCH_SIZE))
    sample_X = X[indices]
    sample_Y = Y[indices]
    out = net(sample_X)
    sample_loss = ((sample_Y - out) ** 2).sum() / (2*BATCH_SIZE)
    losses.append(sample_loss)
    d_loss = (out - sample_Y)
    net.backward(d_loss, LR, BATCH_SIZE)
    optim.step()

In [36]:
sample_X[0]

array([0.75503691, 0.74366301])

In [37]:
sample_Y[0]

array([110.32976944])

In [38]:
out[0]

array([93.98300414])

In [20]:
import plotly.express as px
print(losses[-5:])
px.line(losses[3:])

[182.9329312503133, 152.85885210244192, 169.3045779163166, 179.8755711883813, 143.6956927396468]


# Classifying the MNIST dataset

In [6]:
import plotly.express as px
import numpy as np
from src.model import Net
from src.optim import SGD
from torchvision.datasets import MNIST
from torch.utils.data import random_split

data = MNIST('./data', train=False)
train, test = random_split(data, [0.8,0.2])

In [7]:
data[0][1]

7

In [8]:
px.imshow(np.asarray(data[0][0]), color_continuous_scale='greys')

In [9]:
def process_data(dataset):
    X = np.zeros((len(dataset), 28*28))
    Y = np.zeros((len(dataset), 10))
    for n in range(len(dataset)):
        X[n] = np.asarray(dataset[n][0]).flatten()/255
        Y[n][dataset[n][1]] = 1
    return X, Y

X_train, Y_train = process_data(train)
X_test, Y_test = process_data(test)
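
The label loop in `process_data` builds one-hot vectors: a row of ten zeros with a single 1 at the index of the digit. As an aside, the same encoding can be sketched without a loop by indexing an identity matrix. The labels below are made up.

```python
import numpy as np

labels = np.array([7, 2, 1, 0])          # made-up digit labels

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(10)[labels]             # shape (4, 10)
```

Taking `np.argmax` along each row recovers the original labels, which is what the accuracy computation relies on.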


In [10]:
def accuracy(exp, pred):
    exp_number = np.argmax(exp, axis=1)
    pred_number = np.argmax(pred, axis=1)
    true_pred = (exp_number == pred_number).sum()
    return true_pred/len(exp)

def MSE(exp, pred):
    return ((exp - pred) ** 2).sum() / (2*len(exp))
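
A quick sanity check of both metrics on two made-up examples with three classes. The functions are repeated here in equivalent form so the sketch runs on its own.

```python
import numpy as np

def accuracy(exp, pred):
    # Equivalent to the definition in the cell above.
    return (np.argmax(exp, axis=1) == np.argmax(pred, axis=1)).sum() / len(exp)

def MSE(exp, pred):
    # Equivalent to the definition in the cell above.
    return ((exp - pred) ** 2).sum() / (2 * len(exp))

# Two made-up examples with three classes each.
expected = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
predicted = np.array([[0.8, 0.1, 0.1],    # argmax 0: correct
                      [0.6, 0.3, 0.1]])   # argmax 0: wrong, should be 1

acc = accuracy(expected, predicted)       # 1 of 2 correct -> 0.5
cost = MSE(expected, predicted)           # (0.06 + 0.86) / 4 = 0.23
```

Accuracy only looks at the position of the largest value in each row, while `MSE` also penalizes a correct prediction for being less than fully confident.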

In [8]:
LR = 0.01
BATCH_SIZE = 32

costs = {'train': [], 'test': []}
accuracies = {'train': [], 'test': []}
net = Net((28*28, 15, 10), last_relu=False)
optim = SGD(net, LR)

for epoch in range(10000):
    # Sample minibatch:
    indices = np.random.randint(len(X_train), size=(BATCH_SIZE))
    sample_X = X_train[indices]
    sample_Y = Y_train[indices]
    train_out = net(sample_X)
    # Update net
    d_loss = (train_out - sample_Y)
    net.backward(d_loss, LR, BATCH_SIZE)
    optim.step()
    # Save loss
    test_out = net(X_test)
    sample_cost = MSE(sample_Y, train_out)
    test_cost = MSE(Y_test, test_out)
    costs['train'].append(sample_cost)
    costs['test'].append(test_cost)
    # Save accuracy
    accuracies['train'].append(accuracy(sample_Y, train_out))
    accuracies['test'].append(accuracy(Y_test, test_out))

In [40]:
px.line(
    costs
)

In [43]:
px.line(accuracies)

In [41]:
px.imshow(X_test[432].reshape(28,28), color_continuous_scale='greys')

In [42]:
np.argmax(Y_test[432])

2

In [33]:
np.argmax(test_out[462])

2