# __Neural Networks in Scikit-Learn and NumPy (Part B)__

In _Part A_, we looked at the Scikit-Learn Multi-layer Perceptron (MLP) object for building a neural network classifier.

In this practical we drop down to a lower level and write our own functions to build and train a neural network for the same problem. NumPy supplies the building blocks we need, such as matrix algebra and random number generation.

__As before, from Part A__

General imports

In [None]:
%matplotlib inline  
import numpy as np
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split

Generate some tricky data!

In [None]:
X, y = make_moons(n_samples = 500, noise = 0.2, random_state = 101)

Split the data into training and test sets

In [None]:
# insert code here to split the data into 80:20 training and test sets
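
One possible way to fill in this cell is sketched below, assuming the `train_test_split` function imported above. The `random_state=0` seed is an arbitrary choice made here for reproducibility; the practical does not prescribe one.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Same data as generated earlier in the practical
X, y = make_moons(n_samples=500, noise=0.2, random_state=101)

# test_size=0.2 gives the 80:20 split; random_state=0 is an assumed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (400, 2) (100, 2)
```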


In [None]:
# visualise the training data
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
fig, ax = plt.subplots()
ax.scatter(X_train[:,0], X_train[:,1], c = y_train, edgecolors='k', cmap = cm_bright)
plt.title('Training data')
plt.show()

__New for Part B__

To build a neural network classifier at a lower level than available in Scikit-Learn, we need to think first about the "ingredients" that we require.
- Definition of the model to do a __forward pass__: $y = \sigma(W^{[L-1]}(\sigma(W^{[L-2]}(... \sigma(W^{[1]}x + b^{[1]})...)+b^{[L-2]}))+b^{[L-1]})$ where 
    - $\sigma$ is the __activation function__
    - $Wx + b$ is an affine function (a linear map plus a bias) that goes from $\mathbb{R}^n$ to $\mathbb{R}^m$.
    - In our example, the output of the model is a value between 0 and 1: the predicted probability that a point belongs to class 1 (blue in the plot above) rather than class 0 (red). The inputs to the model are the coordinates of the point.
- Definition of a __cost function__ that measures how well the model fits the data, as a function of its parameters: $$J(\theta), \text{where } \theta:=\{W^{[1]}, b^{[1]}, \ldots, W^{[L-1]}, b^{[L-1]}\}$$
- __Optimisation__ of the cost function using gradient descent.
    - We need to apply the chain rule (backpropagation) in order to obtain $\partial_{\theta}J$ for each optimisation step.

__Activation Function__ - sigmoid

In [None]:
def sigmoid(z):
    output = 1/(1+np.exp(-z))
    return output
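
Backpropagation relies on the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$. The sketch below checks that identity numerically with a central finite difference; the grid of test points and the step size `eps` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)  # illustrative test points
eps = 1e-6                  # finite-difference step (assumed value)

# Central finite-difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# Closed form used later in the gradient descent step
analytic = sigmoid(z) * (1 - sigmoid(z))

print(sigmoid(0.0))                    # 0.5
print(np.allclose(numeric, analytic))  # True
```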

__Forward Model Definition__

In [None]:
def forward_model(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    # forward pass
    # we use a single hidden layer
    # complete the "a2 =" line using the activation function
    z2 = np.matmul(x,W1) + b1
    a2 = 
    
    # output layer
    # complete the "a3 =" line using the activation function
    z3 = np.matmul(a2,W2) + b2
    a3 = 
    
    #return the output of the model (a3) and the intermediate layers
    return z2, a2, z3, a3
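
For reference, here is one possible completion of the forward pass, assuming sigmoid activations on both layers as the comments suggest. The seed, the hidden-layer size of 4 and the five input points are made up purely to show the array shapes.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_model(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # hidden layer: (N, 2) @ (2, n_hidden) + (1, n_hidden) -> (N, n_hidden)
    z2 = np.matmul(x, W1) + b1
    a2 = sigmoid(z2)
    # output layer: (N, n_hidden) @ (n_hidden, 1) + (1, 1) -> (N, 1)
    z3 = np.matmul(a2, W2) + b2
    a3 = sigmoid(z3)
    return z2, a2, z3, a3

rng = np.random.default_rng(0)  # assumed seed, for a repeatable demo
n_hidden = 4                    # small illustrative size
model = {
    'W1': rng.standard_normal((2, n_hidden)),
    'b1': rng.standard_normal((1, n_hidden)),
    'W2': rng.standard_normal((n_hidden, 1)),
    'b2': rng.standard_normal((1, 1)),
}
x = rng.standard_normal((5, 2))  # five made-up input points

z2, a2, z3, a3 = forward_model(model, x)
print(a2.shape, a3.shape)  # (5, 4) (5, 1)
```

Note that the code stores one sample per row, so the layer is computed as `x @ W + b` rather than the $Wx + b$ of the formula; the two are transposes of each other.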

__Loss function__

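Let $\mathcal{D} = \{(x^1, y^1), (x^2, y^2), \ldots, (x^N, y^N)\}$ be our training set, where $x^i\in \mathbb{R}^n$ and $y^i \in \{0, 1\}$, and let $f_{\theta}(x^i)$ be the output of the model for input $x^i$.
We define the loss function as the negative log-likelihood (binary cross-entropy) $$J(\theta) = -\sum_{i=1}^{N} \left[ y^i \log(f_{\theta}(x^i)) + (1-y^i) \log(1-f_{\theta}(x^i)) \right]$$ The leading minus sign makes $J$ non-negative and smaller for better predictions, so that it can be minimised by gradient descent.

In [None]:
def loss_fn(model, x, y):
    _,_,_,y_pred = forward_model(model, x)
    loss_batch = y * np.log(y_pred) + (1-y) * np.log(1-y_pred)
    # complete the line below to calculate the loss
    # (sum over the batch, and remember the minus sign in J)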
    loss = 
    return loss
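
As a standalone sketch of this loss, the hypothetical helper `binary_cross_entropy` below works directly on labels and predicted probabilities. The sum is negated so that lower values mean a better fit, and the clipping constant `eps` is an added safeguard against `log(0)` that is not part of the practical.

```python
import numpy as np

def binary_cross_entropy(y, y_pred, eps=1e-12):
    # Clip to avoid log(0); eps is an assumed safeguard
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss_batch = y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)
    # Negative sum, so that a better fit gives a smaller value
    return -np.sum(loss_batch)

y = np.array([[1.0], [0.0], [1.0], [0.0]])
uninformed = np.full_like(y, 0.5)                   # always predicts 0.5
confident = np.array([[0.9], [0.1], [0.9], [0.1]])  # mostly right

print(binary_cross_entropy(y, uninformed))  # 4 * log(2), about 2.77
print(binary_cross_entropy(y, confident))   # smaller than the value above
```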

__Optimisation__ - gradient descent algorithm
- Update the weights and biases in the neural network according to the gradient of the cost function.
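
The gradients needed for the update follow from the chain rule. As a sketch, assume the single-hidden-layer network defined above, with one sample per row, sigmoid activations, and the cross-entropy loss taken with a leading minus sign. Here `z2`, `a2`, `z3`, `a3` in the code correspond to $z^{[2]}, a^{[2]}, z^{[3]}, a^{[3]}$, `W1` and `W2` to $W^{[1]}$ and $W^{[2]}$, $\odot$ is the element-wise product, and the sums run over the samples:

$$\delta^{[3]} := \frac{\partial J}{\partial z^{[3]}} = \frac{\partial J}{\partial a^{[3]}}\,\sigma'(z^{[3]}) = \frac{a^{[3]}-y}{a^{[3]}(1-a^{[3]})}\,a^{[3]}(1-a^{[3]}) = a^{[3]} - y$$

$$\partial_{W^{[2]}}J = (a^{[2]})^{\top}\delta^{[3]}, \qquad \partial_{b^{[2]}}J = \sum_{i}\delta^{[3]}_{i}$$

$$\delta^{[2]} := \frac{\partial J}{\partial z^{[2]}} = \left(\delta^{[3]}(W^{[2]})^{\top}\right)\odot a^{[2]}\odot(1-a^{[2]})$$

$$\partial_{W^{[1]}}J = x^{\top}\delta^{[2]}, \qquad \partial_{b^{[1]}}J = \sum_{i}\delta^{[2]}_{i}$$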

In [None]:
def GD_step(model, x, y, lr = 0.001):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    z2, a2, z3, a3 = forward_model(model, x)
    
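    # output-layer error: gradient of the cross-entropy loss with respect to z3
    delta3 = a3 - y
    dW2 = np.matmul(a2.T, delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)
    
    # hidden-layer error: a2*(1-a2) is the sigmoid derivative evaluated at z2
    delta2 = a2 * (1 - a2) * np.matmul(delta3, W2.T)
    dW1 = np.matmul(x.T, delta2)
    db1 = np.sum(delta2, axis=0, keepdims=True)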
    
    W2 = W2 - lr * dW2
    b2 = b2 - lr * db2
    W1 = W1 - lr * dW1
    b1 = b1 - lr * db1
    
    model['W1'], model['b1'], model['W2'], model['b2'] = W1, b1, W2, b2
    return model
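
A common way to gain confidence in hand-written backpropagation is a gradient check: compare the analytic gradients with central finite differences of the loss. The sketch below does this on a small made-up problem. The helper names (`forward`, `loss`, `backprop_grads`, `numeric_grads`), the seed and the sizes are all assumptions for illustration, and the loss is the negated cross-entropy sum.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(model, x):
    a2 = sigmoid(x @ model['W1'] + model['b1'])
    a3 = sigmoid(a2 @ model['W2'] + model['b2'])
    return a2, a3

def loss(model, x, y):
    _, a3 = forward(model, x)
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

def backprop_grads(model, x, y):
    # Same formulas as in GD_step, returned instead of applied
    a2, a3 = forward(model, x)
    delta3 = a3 - y
    delta2 = a2 * (1 - a2) * (delta3 @ model['W2'].T)
    return {
        'W1': x.T @ delta2,
        'b1': np.sum(delta2, axis=0, keepdims=True),
        'W2': a2.T @ delta3,
        'b2': np.sum(delta3, axis=0, keepdims=True),
    }

def numeric_grads(model, x, y, eps=1e-6):
    # Perturb one parameter entry at a time and difference the loss
    grads = {}
    for name, param in model.items():
        g = np.zeros_like(param)
        for idx in np.ndindex(param.shape):
            old = param[idx]
            param[idx] = old + eps
            plus = loss(model, x, y)
            param[idx] = old - eps
            minus = loss(model, x, y)
            param[idx] = old  # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads[name] = g
    return grads

rng = np.random.default_rng(0)  # assumed seed
n_hidden = 3                    # small, so the check runs quickly
model = {
    'W1': rng.standard_normal((2, n_hidden)),
    'b1': rng.standard_normal((1, n_hidden)),
    'W2': rng.standard_normal((n_hidden, 1)),
    'b2': rng.standard_normal((1, 1)),
}
x = rng.standard_normal((6, 2))                    # made-up inputs
y = rng.integers(0, 2, size=(6, 1)).astype(float)  # made-up labels

analytic = backprop_grads(model, x, y)
numeric = numeric_grads(model, x, y)
max_err = max(np.max(np.abs(analytic[k] - numeric[k])) for k in model)
print(max_err)  # should be tiny if the backprop formulas are right
```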

#### We put everything together for the training:
- Initialise $W^{[1]}, b^{[1]}, \ldots, W^{[L-1]}, b^{[L-1]}$.
- While not converged:
    - Calculate $J(\theta)$
    - Update $W^{[i]} := W^{[i]} - \alpha \cdot \partial_{W^{[i]}}J$
    - Update $b^{[i]} := b^{[i]} - \alpha \cdot \partial_{b^{[i]}}J$
    

__Define the training function__

In [None]:
def train(model, n_epochs, X, y):
    # for a pre-defined number of epochs
    for epoch in range(n_epochs):
        # update weights and biases
        model = GD_step(model, x=X, y=y)
        # calculate loss
        loss = loss_fn(model, x=X, y=y)
        # print information every 100 epochs
        if epoch%100 == 0:
            print("Epoch: {}/{}, loss: {}".format(epoch, n_epochs, loss))
    return model

__Now run the training__

In [None]:
%%time
n_hidden = 30
# Initialise weights and biases
W1 = np.random.randn(2, n_hidden)
b1 = np.random.randn(1, n_hidden)
W2 = np.random.randn(n_hidden, 1)
b2 = np.random.randn(1,1)

#The above functions expect to receive 'model' as a Python Dictionary object.
#Define it here
model = 
#Populate the model with the initialised weights and biases from above
#<enter code here>

# Train for 10000 epochs
# fit on the training split only, so that the test set stays unseen
model = train(model=model, n_epochs=10000, X=X_train, y=y_train.reshape(-1,1))
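
The `model` dictionary that has to be defined before the `train` call could be built as sketched below. The keys must match those read by `forward_model` and `GD_step` (`'W1'`, `'b1'`, `'W2'`, `'b2'`); the dictionary-literal form is just one option.

```python
import numpy as np

n_hidden = 30
# Initialise weights and biases, as in the cell above
W1 = np.random.randn(2, n_hidden)
b1 = np.random.randn(1, n_hidden)
W2 = np.random.randn(n_hidden, 1)
b2 = np.random.randn(1, 1)

# Keys match the names unpacked in forward_model and GD_step
model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

print({name: value.shape for name, value in model.items()})
```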

__Accuracy Metric:__

- _Use ideas from the code in Practical8a.ipynb to calculate the accuracy of the model._

- _Can you write this code as a function and incorporate it into the print statement in the training function, to track the accuracy during optimisation?_
    - _Remember: the loss function, not the accuracy metric, is used to improve the model!_

In [None]:
#<enter code here>
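
A minimal sketch of an accuracy function is shown below. It assumes the model outputs probabilities that are turned into class labels with a 0.5 threshold; the function name `accuracy` and the toy arrays are illustrative, and the approach used in Practical8a.ipynb may differ.

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    # Convert probabilities to hard 0/1 labels, then compare with the truth
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    return np.mean(y_hat.ravel() == np.asarray(y_true).ravel())

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([[0.1], [0.8], [0.4], [0.3]])  # shape (N, 1), like a3

print(accuracy(y_true, y_prob))  # 0.75
```

With the functions from this practical, the probabilities would come from the last output of the forward pass, for example `forward_model(model, X_test)[3]`.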

__Visualise:__
- _Adapt code from Practical8a.ipynb to plot the decision boundaries of the model_
- _Do the boundaries make more sense visually?_

In [None]:
#<enter code here>
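
One way to draw a decision boundary is to evaluate the model on a dense grid of points and plot the result as filled contours. The sketch below is not the code from Practical8a.ipynb: the helper names `decision_grid` and `plot_decision_boundary`, the grid step and the margin are assumptions, and `predict_proba` stands for any function mapping an `(M, 2)` array of points to `M` probabilities.

```python
import numpy as np

def decision_grid(predict_proba, X, step=0.02, margin=0.5):
    # Build a regular grid covering the data, padded by `margin`
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Evaluate the model at every grid point, then restore the grid shape
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = np.asarray(predict_proba(grid)).reshape(xx.shape)
    return xx, yy, zz

def plot_decision_boundary(predict_proba, X, y):
    # Imported here so that decision_grid itself only needs NumPy
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    xx, yy, zz = decision_grid(predict_proba, X)
    fig, ax = plt.subplots()
    # Red = low probability of class 1, blue = high probability
    ax.contourf(xx, yy, zz, levels=np.linspace(0, 1, 11),
                cmap='RdBu', alpha=0.8)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k',
               cmap=ListedColormap(['#FF0000', '#0000FF']))
    ax.set_title('Decision boundary')
    plt.show()

# Stand-in predictor, used only to exercise the grid helper
toy_predictor = lambda pts: 1 / (1 + np.exp(-pts[:, 0]))
X_demo = np.array([[-1.0, -1.0], [2.0, 1.5]])  # made-up points
xx, yy, zz = decision_grid(toy_predictor, X_demo, step=0.25)
print(xx.shape == zz.shape)  # True
```

For the network trained in this practical, a call such as `plot_decision_boundary(lambda pts: forward_model(model, pts)[3], X_train, y_train)` would plug the forward pass in as the predictor.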