<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/206_my-own-neural-network-2.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Building our own Neural Network (pt. 2)
___
In this notebook, we will unleash more of the unlimited power of neural networks by introducing a second output node.

Unfortunately, this comes at the cost of some extra confusion, also known as matrix algebra. The main purpose is to understand which changes, compared to the previous version, are necessary to make this work.

Lastly, you can play around with the parameters, in particular the number of hidden nodes and the initialization range. Compare the learning progress of the network with the one used in the previous example.

___
## Data pre-processing

In [None]:
# Import necessary packages
import numpy as np # Numerical computation package
import pandas as pd # Dataframe package
import matplotlib.pyplot as plt # Plotting package
import matplotlib as mpl # For colormaps
np.random.seed(1) # Set the random seed for reproducibility

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Read in the WDBC dataset
wdbc = pd.read_csv(f"{DATA_PATH}/wdbc.csv")
# Keep only necessary columns: the diagnosis, the perimeter, and the severity of concave portions
# of the cell nucleus
wdbc = wdbc[["perimeterM", "concaveM", "diagnosis"]]
wdbc # Display the dataset

In [None]:
# Recall the necessity to standardize our inputs!
from sklearn.preprocessing import StandardScaler

In [None]:
# Create the features
X = np.array(wdbc[["perimeterM", "concaveM"]])
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
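
As a quick sanity check, the standardization step can be reproduced by hand. The sketch below assumes the scaler's default behavior (subtract each column's mean, divide by its population standard deviation) and uses a small made-up array `X_toy` instead of the WDBC data.

```python
import numpy as np

# Made-up feature matrix (3 observations, 2 features), for illustration only
X_toy = np.array([[80.0, 0.05], [120.0, 0.15], [100.0, 0.10]])

# Column-wise standardization: subtract the mean, divide by the standard deviation
# (np.std uses ddof=0 by default, i.e. the population standard deviation)
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After this step, both features live on a comparable scale, which keeps the sigmoid inputs in a reasonable range.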

Notice how we didn't create the labels above? This is because this time, our labels are going to be *one-hot-encoded*, i.e., each individual target $y^{(i)}$ is a vector $[0\, 1]$ or $[1\, 0]$, such that

$$\mathbf{Y} = \begin{bmatrix}0 & 1 \\ 0 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{bmatrix}$$

is now a matrix!

In [None]:
# Get one-hot-encoded target
Y = np.array(pd.get_dummies(wdbc["diagnosis"])).astype(float)
Y # Display the labels
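
To see what the one-hot encoding does, here is a minimal sketch on a made-up series of diagnoses (`diagnosis_toy` is not part of the dataset). It also shows how `argmax` recovers the class index from a one-hot row, which is used further down when evaluating predictions.

```python
import numpy as np
import pandas as pd

# Made-up diagnoses, for illustration only
diagnosis_toy = pd.Series(["M", "B", "B", "M"])

# One column per category, sorted alphabetically: "B" first, then "M"
dummies = pd.get_dummies(diagnosis_toy)
Y_toy = np.array(dummies).astype(float)

print(list(dummies.columns))
print(Y_toy)
# Recover the class index (0 = "B", 1 = "M") from the one-hot rows
print(Y_toy.argmax(axis=1))
```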

___
## Training the neural network

In [None]:
# Initialize some parameters
N, K = X.shape # Number of observations and features
L = Y.shape[1] # Number of outputs
NZ = 3 # Number of hidden nodes
epochs = 100 # Number of training epochs
eta = 0.01 # Learning rate
init_range = [-.5, .5] # The range of our uniform distribution for weight initialization

Notice how now we keep track of the number of outputs. Now $\mathbf{W}_2$ will not be $N_Z \times 1$ anymore. Instead, it will be $N_Z \times L$!
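
The effect on the shapes can be traced with placeholder matrices. This is only a sketch: the sizes below are hypothetical and the matrices are filled with zeros, since only the dimensions matter here.

```python
import numpy as np

# Hypothetical sizes, for illustration only
N_toy, K_toy, NZ_toy, L_toy = 5, 2, 3, 2

X_toy = np.zeros((N_toy, K_toy))
W1_toy = np.zeros((K_toy, NZ_toy))
W2_toy = np.zeros((NZ_toy, L_toy))

hidden = X_toy @ W1_toy   # (N x K) @ (K x NZ) -> N x NZ
output = hidden @ W2_toy  # (N x NZ) @ (NZ x L) -> N x L

print(hidden.shape, output.shape)
```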

In [None]:
# Initialize weight matrices
np.random.seed(72) # Set seed
W1 = np.random.rand(K, NZ) # Randomly initialize W1 in [0, 1)
W1 = W1 * (init_range[1] - init_range[0]) + init_range[0] # Rescale to init_range
W2 = np.random.rand(NZ, L) # Randomly initialize W2 in [0, 1)
W2 = W2 * (init_range[1] - init_range[0]) + init_range[0] # Rescale to init_range
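
Mapping a draw `u` from $[0, 1)$ to the interval $[a, b)$ amounts to computing `u * (b - a) + a`. The sketch below checks this on a toy matrix and compares it with drawing from the interval directly. Note that it uses NumPy's `default_rng` generator rather than `np.random.seed`, which is an assumption made for this example and not what the notebook does.

```python
import numpy as np

low, high = -0.5, 0.5  # same bounds as init_range in this notebook
rng = np.random.default_rng(72)

# Option 1: rescale draws from [0, 1) by hand
u = rng.random((2, 3))
W_manual = u * (high - low) + low

# Option 2: let NumPy draw from [low, high) directly
W_direct = rng.uniform(low, high, size=(2, 3))

# Both checks should confirm that all values stay within the bounds
print(W_manual.min() >= low, W_manual.max() <= high)
print(W_direct.min() >= low, W_direct.max() <= high)
```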

In [None]:
# We will use the sigmoid function as the activation function
sigmoid = lambda x: 1 / (1 + np.exp(-x))

In [None]:
# Forward pass helper
def forward_pass(X, W1, W2):
    # Compute the matrix multiplication of the input layer
    S = X @ W1
    
    # Pass through the non-linear activation function
    Z = sigmoid(S)
    
    # Compute the matrix multiplication of the hidden layer
    T = Z @ W2
    
    # Pass through the non-linear activation function 
    # (This should always be sigmoid for the output to be (0, 1))
    # ⚠️ Notice how we don't flatten the output anymore, this time we want to keep it a matrix!
    return sigmoid(T)

In [None]:
# Check that the output of a forward pass is indeed an NxL matrix
pd.DataFrame(forward_pass(X, W1, W2), columns=[f"Y{i}" for i in range(L)])

___
#### 🤔 Pause and ponder
While the code for the loss is exactly the same, the calculation of the loss is now somewhat trickier. Recall from the math derivations that the loss function now has two terms, one for the first output node and one for the second. Conceptually, we calculate the two loss terms separately and add them together.

Fortunately, using `numpy`, we don't need to think about all of this... it just works. But be careful, it's a good idea to take time and see what each part of the loss looks like since we are now dealing with a matrix instead of a vector of targets.
___
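
To look at the individual parts of the loss, one option is to keep the element-wise terms in a matrix before summing. The targets and probabilities below are made up for illustration.

```python
import numpy as np

# Made-up one-hot targets and predicted probabilities (3 observations, 2 outputs)
Y_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
prob_toy = np.array([[0.2, 0.7], [0.9, 0.3], [0.6, 0.5]])

# Element-wise negative log-likelihood: one entry per observation and output node
nll = -(Y_toy * np.log(prob_toy) + (1 - Y_toy) * np.log(1 - prob_toy))

loss_per_output = nll.sum(axis=0)  # one loss term per output node (length 2)
total_loss = nll.sum()             # what a single np.sum over the matrix returns

print(nll.shape)
print(loss_per_output)
print(total_loss)
```

Summing the two per-output terms gives the same number as summing over the whole matrix, which is why the single `np.sum` expression keeps working.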

In [None]:
# We will use this function to evaluate our predictions
def eval_predictions(X, Y, W1, W2):
    # Use the forward_pass function defined above to compute our estimated probability
    prob = forward_pass(X, W1, W2)
    
    # To avoid log(0), clip the probability to not be exactly zero or one
    prob = np.clip(prob, 1e-8, 1-1e-8)
    # Calculate the loss function (negative log-likelihood in this case)
    loss = - np.sum(Y * np.log(prob) + (1 - Y) * np.log(1 - prob))

    # For the actual prediction, we select the outcome with the highest probability. 
    # Unlike in the previous case, the two probabilities need not add up to 1!
    pred = prob.argmax(1) # Take the argmax along the second axis
    
    # Compute the misclassification rate (share of wrong predictions)
    misclassifications = np.mean(pred != Y.argmax(1))
    
    # Output results as a dictionary
    return {
        "loss": loss, 
        "misclassifications": misclassifications, 
        "prob": prob, 
        "pred": pred
    }
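
The remark about the probabilities not adding up to 1 can be checked on a few made-up output pre-activations (`T_toy` is hypothetical). Each output node has its own sigmoid, so the two values in a row are not tied to each other.

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Made-up pre-activations of the two output nodes for 3 observations
T_toy = np.array([[2.0, 1.0], [-2.0, -1.0], [0.5, 0.0]])
prob_toy = sigmoid(T_toy)

print(prob_toy.sum(axis=1))     # row sums are generally not equal to 1
print(prob_toy.argmax(axis=1))  # predicted class: index of the larger probability
```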

In [None]:
# Initialization of lists for bookkeeping
loss_list = []
misclassification_list = []

# Create an array of indices (which we will shuffle later on) through which we 
# will iterate. These represent the indices of the observations
indices = np.arange(N)

In [None]:
# Ignore this, we use it to print nicely without creating too many lines
from sys import stdout

In [None]:
np.random.seed(72) # Reset random seed for reproducibility
# Compute the loss and misclassifications BEFORE training
res = eval_predictions(X, Y, W1, W2)

# Append to our result lists
loss_list.append(res["loss"])
misclassification_list.append(res["misclassifications"])

# Run the full training loop (iterate over the number of training epochs)
for epoch in range(epochs):
    # Reshuffle the indices
    np.random.shuffle(indices)
    
    # Iterate through each single data point
    for i in indices:
        # Note that we use [i:i+1, :] instead of [i, :] to keep it as a 1x2 matrix
        Xi = X[i:i+1, :] # Extract features for ith observation,
        Yi = Y[i:i+1, :] # Extract label for ith observation
        
        # ----- Forward pass -----
        # Computes the predictions using Xi, W1, W2. Here we avoid using the
        # forward_pass function because we need the hidden nodes values Zi
        # for backpropagation. So, instead, we repeat the code (ugh...)
        
        # Pass to the hidden nodes (pre-activation)
        Si = Xi @ W1
        # Compute activation function
        Zi = sigmoid(Si)
        
        # Pass to the output nodes
        Ti = Zi @ W2
        # Compute sigmoid probability transformation
        prob_i = sigmoid(Ti)
        # ⚠️ Notice that we now have two probabilities!!
        
        
        # ----- Backward pass -----
        # ⚠️ Since we have two output nodes, and two probabilities for prediction
        # we will also have two errors!!
        error_i = Yi - prob_i
        
        # Compute the gradient w.r.t. W2. ⚠️ W2 is now an NZ x L matrix! See math derivations
        grad2_i = -Zi.T @ error_i # ⚠️ error_i is now 1 x L, and Zi.T is NZ x 1
        # Compute the gradient w.r.t. W1. ⚠️ W1 is a K x NZ matrix! Making things even worse
        grad1_i = -Xi.T @ (error_i @ (W2.T * Zi * (1 - Zi))) # ⚠️ error_i is now 1 x L
        
        # Updating: Move 'eta' units in the direction of the negative gradient
        W1 -= eta * grad1_i
        W2 -= eta * grad2_i
    
    # Evaluate the learning process and store the results into our lists
    res = eval_predictions(X, Y, W1, W2)
    loss_list.append(res["loss"])
    misclassification_list.append(res["misclassifications"])    
        
    # Print the current status (🙀 🤯 ignore this part!)
    bar = "".join(["#" if epoch >= t * (epochs // 50) else " " for t in range(50)])
    stdout.write(f"\rEpoch: {epoch+1:>{int(np.floor(np.log10(epochs))+1)}}/{epochs} [{bar}]")
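
One way to gain confidence in the backward pass is a finite-difference gradient check. The sketch below is not part of the original training code: it copies the two gradient expressions from the loop into a hypothetical helper `analytic_grads` and compares them with numerical derivatives of the loss for a single made-up observation.

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def loss_fn(Xi, Yi, W1, W2):
    # Negative log-likelihood for one observation, summed over both outputs
    prob = sigmoid(sigmoid(Xi @ W1) @ W2)
    return -np.sum(Yi * np.log(prob) + (1 - Yi) * np.log(1 - prob))

def analytic_grads(Xi, Yi, W1, W2):
    # Same expressions as in the training loop
    Zi = sigmoid(Xi @ W1)
    error_i = Yi - sigmoid(Zi @ W2)
    grad2 = -Zi.T @ error_i
    grad1 = -Xi.T @ (error_i @ (W2.T * Zi * (1 - Zi)))
    return grad1, grad2

def numeric_grad(f, W, eps=1e-6):
    # Central finite differences, one weight at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

# Made-up observation and weights, for illustration only
rng = np.random.default_rng(0)
Xi_toy = rng.normal(size=(1, 2))
Yi_toy = np.array([[0.0, 1.0]])
W1_toy = rng.uniform(-0.5, 0.5, size=(2, 3))
W2_toy = rng.uniform(-0.5, 0.5, size=(3, 2))

grad1, grad2 = analytic_grads(Xi_toy, Yi_toy, W1_toy, W2_toy)
num1 = numeric_grad(lambda W: loss_fn(Xi_toy, Yi_toy, W, W2_toy), W1_toy)
num2 = numeric_grad(lambda W: loss_fn(Xi_toy, Yi_toy, W1_toy, W), W2_toy)

# Both comparisons should report agreement up to numerical precision
print(np.allclose(grad1, num1, atol=1e-5), np.allclose(grad2, num2, atol=1e-5))
```

If either comparison fails after you modify the gradient code, the analytic expression and the loss no longer match.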

___
## Evaluating the results
That's it, we can go ahead and look at the results of our own custom neural network!

In [None]:
# Set up the canvas
fig, axs = plt.subplots(1, 2, figsize=(16, 8))
# Plot the loss over the epochs (epoch 0 is before training!)
axs[0].plot(range(len(loss_list)), loss_list)
# Plot the misclassification rate over the epochs
axs[1].plot(range(len(misclassification_list)), misclassification_list)
# Add title, grid, axis labels
for ax in axs:
    ax.grid(True)
    ax.set_xlabel("Epoch number")
axs[0].set_ylabel("Loss (negative log-likelihood)")
axs[0].set_title("Evolution of loss function over training epochs")
axs[1].set_ylabel("Misclassification rate")
axs[1].set_title("Evolution of misclassification rate over training epochs")

In [None]:
# Display the weight matrix from the input layer to the hidden layer
pd.DataFrame(W1, columns=[f"Z{i}" for i in range(NZ)], index=["X1", "X2"])

In [None]:
# Display the weight matrix from the hidden layer to the output layer
pd.DataFrame(W2, columns=["Y0", "Y1"], index=[f"Z{i}" for i in range(NZ)])

___
## Visualizing the decision regions


In [None]:
# Define granularity and limits of grids
npoints = 200
x1_lims = (X[:, 0].min() - .1, X[:, 0].max() + .1)
x2_lims = (X[:, 1].min() - .1, X[:, 1].max() + .1)

# Create the x1 and x2 arrays
x1 = np.linspace(*x1_lims, num=npoints)
x2 = np.linspace(*x2_lims, num=npoints)
preds = np.empty((npoints, npoints))

# Compute the predictions of our network for every single combination
for i in range(x2.shape[0]):
    for j in range(x1.shape[0]):
        # Create the 1x2 input
        xi = np.array([[x1[j], x2[i]]])
        # Compute the prediction of the network 
        # (🙀 🤯 use a trick to keep the uncertainty regions...)
        probs = forward_pass(xi, W1, W2).flatten()
        argmax = probs.argmax()
        preds[i, j] = (1 - argmax) * (1 - probs[0]) + argmax * probs[1]

# Small trick to scale back values to their value before standardization
Xt = scaler.inverse_transform(np.array([x1, x2]).T)
x1, x2 = Xt[:, 0], Xt[:, 1]
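
The double loop is easy to follow but slow for fine grids. As a sketch of an alternative, all grid points can be stacked into one matrix and sent through a single forward pass. The weights and the small grid below are made up, and `forward_pass` is redefined in compact form so that the example runs on its own.

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def forward_pass(X, W1, W2):
    return sigmoid(sigmoid(X @ W1) @ W2)

# Made-up weights and a small grid, for illustration only
rng = np.random.default_rng(0)
W1_toy = rng.uniform(-0.5, 0.5, size=(2, 3))
W2_toy = rng.uniform(-0.5, 0.5, size=(3, 2))
g1 = np.linspace(-2, 2, num=4)
g2 = np.linspace(-1, 1, num=5)

# Build all (x1, x2) combinations at once; grid rows follow x2, columns follow x1
G1, G2 = np.meshgrid(g1, g2)
grid = np.column_stack([G1.ravel(), G2.ravel()])

# One forward pass for the whole grid, then the same "uncertainty" trick as in the loop
probs = forward_pass(grid, W1_toy, W2_toy)
winner = probs.argmax(axis=1)
preds_vec = np.where(winner == 0, 1 - probs[:, 0], probs[:, 1]).reshape(G1.shape)

# Reference: the double loop, written as in the cell above
preds_loop = np.empty((g2.shape[0], g1.shape[0]))
for i in range(g2.shape[0]):
    for j in range(g1.shape[0]):
        p = forward_pass(np.array([[g1[j], g2[i]]]), W1_toy, W2_toy).flatten()
        a = p.argmax()
        preds_loop[i, j] = (1 - a) * (1 - p[0]) + a * p[1]

print(np.allclose(preds_vec, preds_loop))
```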

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Add decision regions
ax.contourf(x1, x2, preds, cmap=mpl.colormaps["coolwarm"], alpha=0.8)

# Add true data points
for diag in ["B", "M"]:
    subset = wdbc.loc[wdbc["diagnosis"] == diag]
    ax.scatter(subset["perimeterM"], subset["concaveM"], label=diag, alpha=0.9)

# Add grid, labels, etc.
ax.grid(True)
ax.set_xlabel("Perimeter")
ax.set_ylabel("Concavity")
ax.legend()

___
### Exercises

#### <font style="color: green">**➡️ ✏️ Question 1**</font>
Try playing around with the parameters. Change the hidden nodes, learning rate, number of epochs.

How do the results change? Do they get better or worse? How do the decision regions change?

*Hint*: If you encounter an error, try changing some parameters back. A learning rate that is too high can cause your weights to explode, and `numpy` won't be able to handle such large numbers (in particular when trying to exponentiate them!)
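
If you want to experiment with aggressive settings anyway, one possible guard is to clip the input of the sigmoid before exponentiating. This is only a sketch: `sigmoid_clipped` and its `limit` parameter are hypothetical and not part of the notebook, and clipping does not cure exploding weights, it only keeps `np.exp` from overflowing.

```python
import numpy as np

def sigmoid_clipped(x, limit=500.0):
    # Hypothetical helper: clip the input so that np.exp never overflows.
    # exp(500) is still representable as a 64-bit float; exp(710) is not.
    x = np.clip(x, -limit, limit)
    return 1 / (1 + np.exp(-x))

# Extreme inputs that would overflow np.exp in the plain sigmoid
x_toy = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_clipped(x_toy))
```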