# Deep Learning Tutorial: An Introduction to Neural Networks

In this tutorial you are going to learn the basics of neural networks. We will walk through each of the different parts that make up a neural network. In the end, we will put these parts together to create a model that can classify animals from a zoo dataset.

The model will be developed from first principles using `Numpy`. This means we can see the inner workers of the neural network and how each part works together.

In [1]:
import numpy as np
import pandas as pd
import pprint

## What is a neural network?
So, what is a neural network. A neural network can be thought of as a more complicated linear equation.

$y = m*x + b$

Given some input (`x`), the model applies some weights (`m`) and biases (`b`) to predict an outcome, `y`.

But, why is it called 'neural'? That seems to imply something related to the brain. For that, we need to introduce the perceptron.

## What is a perceptron?
The perceptron is the most basic unit of a neural network. It functions much like a neuron in our brain. Each neuron in our brain reserves signals from dendrites. Depending on the signals received, the neuron will fire or remain quite.

The perceptron functions in much the same way. Each perceptron receives inputs from adjoining perceptrons. It will then combine these inputs and output either zero or a non-zero value. The functions that are used to make these decisions are called activation functions. Let's take a look at a couple of these activation functions.

### Sigmoid Activation
This activation function sets all inputs to values between 0 and 1. It does this using an exponential function.

In [2]:
def sigmoid_activation(Z):
    activation = 1/(1 + np.exp(-1*Z))
    return activation

### Relu Activation
This activation function sets all negative values to 0 and otherwise returns the positive value.

In [3]:
def relu_activation(Z):
    activation = np.maximum(0.0, Z)
    return activation

### Softmax Function
In our model, we are going to perform classification. Therefore, the neural network needs a way to predict a class as output, given some input features. This can be achieved by using a softmax function, which assigns a probability to each class. All probabibilities add up to one. The class with the highest probability is assigned the prediction for that class. The hidden layers allow the neural network to learn complex relationships between the input features to help it make predictions. 

In [4]:
def softmax_activation(Z):
    exp_Z = np.exp(Z - np.max(Z))
    activation = exp_Z / np.sum(exp_Z, axis=0, keepdims=True)
    return activation

### Connecting Perceptrons
Now that we have the 'neurons' (perceptrons) of the neural network, we need to create the network by connecting perceptrons together. Neurons are aligned in layers. The features are received as inputs to the initial layer. The final layer returns a prediction for the model. Inbetween are hidden layers. They are called hidden because we do not have access to information passed to these neurons. Let's define the architecture that we will use for our model. 

In [5]:
NUM_FEATURES = 20
NUM_CLASSES = 7
LAYER_SIZES = [25, 25]

LAYER_SIZES.insert(0, NUM_FEATURES)

LAYER_SIZES.append(NUM_CLASSES)

LAYER_ACTIVATIONS = ['relu', 'relu', 'softmax']

In [6]:
print(f'The layer sizes include: {LAYER_SIZES}.')

The layer sizes include: [20, 25, 25, 7].


Out dataset contains 20 features, so that is the input. The first and second layers contain 25 neurons. The final layer is the softmax layer, with seven outputs, one of each type of animal being predicted.

### How are values passed between layers?

Mathematically, this is done by using a weighted sum function. For each layer, the inputs to the perceptrons are multiplied by a weight. The weighted inputs are then added together. Finally, a bias term is added to the sum. This can be done exhaustively. But, a more elagant way is using linear algebra. Here is an example using numpy:

`Z = np.dot(np.transpose(W), X) + b`

`Z` is now a vector of size equal to the number of neurons in the layer. The vector `Z` now passes through an activation function before being passed to the next layer in the neural network.

### Layer Initialization
One of the most import steps in setting up a neural network is initiating the weights and biases. Basically, the neural network needs a starting point at which to begin it's learning process. The choice of initial values is very important. If all values are the same, then the outputs to each hidden layer will be the same. This will prevent each neuron in the hidden layer from learning anything useful. To prevent this problem, the weights and biases are initialized to small values. Let's initialize our neural network.

In [7]:
def initialize_network():
    architecture = {}
    for layer in range(1, len(LAYER_SIZES)):
        architecture[f'layer_{layer}'] = {
            'W': np.random.randn(LAYER_SIZES[layer],
                                 LAYER_SIZES[layer-1]) * 0.1,
            'b': np.random.randn(LAYER_SIZES[layer], 1) * 0.1,
            'activation': LAYER_ACTIVATIONS[layer-1]
        }
    return architecture

In [8]:
network = initialize_network()
# pprint.pprint(network)

### Single Forward Pass
Ok, now we have defined our neural network architecture and have initialized all weights and biases. We now need to define a function that will pass values from layer to layer.

In [9]:
act_map = {
    'sigmoid': sigmoid_activation,
    'relu': relu_activation,
    'softmax': softmax_activation
}

In [10]:
def single_forward_pass(A_previous, W, b, activation):
    try:
        act_function = act_map[activation]
    except KeyError:
        print(f'The activation {activation} is not recognized.\nIt must be one of the following: {list(act_map.keys())}')
        return None
    
    Z = np.dot(W, A_previous) + b
    A = act_function(Z)
    
    return A, Z

This function using linear algebra to perform the weighted sum for each layer that was discussed above. It outputs the activated outputs `A` and the non-activated outputs `Z`.

### Full Foward Pass
In the next function I define, we loop through each layer in the network and perform a forward pass using `single_forward_pass()`.

In [11]:
def dZ_sigmoid(dA, Z):
    sigmoid = sigmoid_activation(Z)
    dZ = dA * sigmoid * (1.0 - sigmoid)
    return dZ

In [12]:
def dZ_softmax(dA, Z):
    softmax = softmax_activation(Z)
    dZ = dA * softmax * (1.0 - softmax)
    return dZ

In [13]:
def dZ_relu(dA, Z):
    dZ = np.copy(dA)
    dZ[Z <= 0.0] = 0.0
    return dZ

In [14]:
dZ_map = {
    'sigmoid': dZ_sigmoid,
    'relu': dZ_relu,
    'softmax': dZ_softmax
}

In [15]:
def full_forward_pass(X, network):
    
    cache = {}
    A = np.transpose(X)
    
    for layer in range(1, len(network) + 1):
        A_previous = A
        A, Z = single_forward_pass(A_previous, 
                                   network[f'layer_{layer}']['W'], 
                                   network[f'layer_{layer}']['b'], 
                                   network[f'layer_{layer}']['activation'])
        
        cache[f'A_{layer-1}'] = A_previous
        cache[f'Z_{layer}'] = Z
        
    return A, cache

In [16]:
def compute_cross_entropy_cost(y_pred, y):
    
    cost = -1*np.mean(y * np.log(np.transpose(y_pred)))
    
    return cost

In [17]:
def single_backward_pass(dA, W, b, Z, A_previous, activation):
    
    try:
        backprop_activation = dZ_map[activation]
    except KeyError:
        print(f'The backprop activation {activation} is not recognized.\nIt must be one of the following: {list(dZ_map.keys())}')
        return None
    
    m = A_previous.shape[1]
    
    dZ = backprop_activation(dA, Z)
    
    dW = np.dot(dZ, np.transpose(A_previous)) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_previous = np.dot(np.transpose(dW), dZ)
    
    return dA_previous, dW, db

In [18]:
def full_backward_pass(y_pred, y, cache, network):
    
    stored_grads = {}
    m = y.shape[1]
    
    dA_previous = y_pred - np.transpose(y)
    
    for layer in reversed(range(1, len(network) + 1)):
        activation = network[f'layer_{layer}']['activation']
        layer_previous = layer - 1
        
        dA = dA_previous
        
        A_previous = cache[f'A_{layer_previous}']
        Z = cache[f'Z_{layer}']
        W = network[f'layer_{layer}']['W']
        b = network[f'layer_{layer}']['b']
        
        dA_previous, dW, db = single_backward_pass(dA, W, b, Z, A_previous, activation)
        stored_grads[f'dW_{layer}'] = dW
        stored_grads[f'db_{layer}'] = db
        
    return stored_grads

In [19]:
def update_network(network, stored_grads, learning_rate):
    for layer in range(1, len(network) + 1):
        network[f'layer_{layer}']['W'] = network[f'layer_{layer}']['W'] - learning_rate * stored_grads[f'dW_{layer}']
        network[f'layer_{layer}']['b'] = network[f'layer_{layer}']['b'] - learning_rate * stored_grads[f'db_{layer}']
    return network

In [20]:
def flatten_label_predictions(y_pred):
    y_pred_transpose = np.transpose(y_pred)
    y_pred_flat = np.argmax(y_pred_transpose, 1)
    return y_pred_flat

In [21]:
def compute_accuracy(y_pred, y):
    y_pred_transpose = np.transpose(y_pred)
    y_pred_flat = np.argmax(y_pred_transpose, 1)
    y_flat = np.argmax(y, 1)
    accuracy = np.mean(y_pred_flat == y_flat)
    return accuracy

In [22]:
def train_nn(X, y, network):
    
    stored_cost = []
    
    for epoch in range(HYPER_PARAMS['epochs']):
        y_pred, cache = full_forward_pass(X, network)
        cost = compute_cross_entropy_cost(y_pred, y)
        if epoch == 0:
            print(f' * The initial cost is {cost:0.3f}.')
        stored_cost.append(cost)
        stored_grads = full_backward_pass(y_pred, y, cache, network)
        network = update_network(network, stored_grads, HYPER_PARAMS['learning_rate'])
    final_accuracy = compute_accuracy(y_pred, y)
    print(f' * Final cost: {cost:0.3f}.')
    print(f' * Final accuracy: {final_accuracy:0.3%}')
    return network, stored_cost, y_pred

## Data Preprocessing
### Load Dataset

In [23]:
header_values = []
with open('./data/zoo.dat', 'r') as zoo_file:
    for line in zoo_file:
        if '@attribute' in line:
            header_values.append(line.split()[1])

In [24]:
df = pd.read_csv('./data/zoo.dat', skiprows=21, header=None, names=header_values)

In [25]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Hair,101.0,0.425743,0.496921,0.0,0.0,0.0,1.0,1.0
Feathers,101.0,0.19802,0.400495,0.0,0.0,0.0,0.0,1.0
Eggs,101.0,0.584158,0.495325,0.0,0.0,1.0,1.0,1.0
Milk,101.0,0.405941,0.493522,0.0,0.0,0.0,1.0,1.0
Airborne,101.0,0.237624,0.42775,0.0,0.0,0.0,0.0,1.0
Aquatic,101.0,0.356436,0.481335,0.0,0.0,0.0,1.0,1.0
Predator,101.0,0.554455,0.499505,0.0,0.0,1.0,1.0,1.0
Toothed,101.0,0.60396,0.491512,0.0,0.0,1.0,1.0,1.0
Backbone,101.0,0.821782,0.384605,0.0,1.0,1.0,1.0,1.0
Breathes,101.0,0.792079,0.407844,0.0,1.0,1.0,1.0,1.0


### Separate Features and Labels

In [26]:
df_X = df.drop(columns='Type')

In [27]:
df_y = df['Type']

### One-hot Encode Features
All features except 'Legs' include 1-0 values. Since 'Legs' is a categorical variable, it needs to be one-hot encoded. We can do this using Pandas `get_dummies()` method.

In [28]:
df_X_one_hot = pd.get_dummies(df_X, columns=['Legs'], drop_first=True)

In [29]:
df_X_one_hot.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Hair,101.0,0.425743,0.496921,0.0,0.0,0.0,1.0,1.0
Feathers,101.0,0.19802,0.400495,0.0,0.0,0.0,0.0,1.0
Eggs,101.0,0.584158,0.495325,0.0,0.0,1.0,1.0,1.0
Milk,101.0,0.405941,0.493522,0.0,0.0,0.0,1.0,1.0
Airborne,101.0,0.237624,0.42775,0.0,0.0,0.0,0.0,1.0
Aquatic,101.0,0.356436,0.481335,0.0,0.0,0.0,1.0,1.0
Predator,101.0,0.554455,0.499505,0.0,0.0,1.0,1.0,1.0
Toothed,101.0,0.60396,0.491512,0.0,0.0,1.0,1.0,1.0
Backbone,101.0,0.821782,0.384605,0.0,1.0,1.0,1.0,1.0
Breathes,101.0,0.792079,0.407844,0.0,1.0,1.0,1.0,1.0


In [30]:
X = df_X_one_hot.values

### One-hot Encode Labels

In [31]:
NUM_CLASSES = len(pd.unique(df_y))
print(f'There are {NUM_CLASSES} unique classes for the labels, which are {pd.unique(df_y)}.')

There are 7 unique classes for the labels, which are [1 4 7 2 6 3 5].


In [32]:
def encode_labels(x):
    encoded = np.zeros(NUM_CLASSES)
    encoded[x-1] = 1
    return encoded

In [33]:
df_y_one_hot = df_y.apply(lambda x: encode_labels(x))

In [34]:
df_y_one_hot.head()

0    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
1    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
3    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
4    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
Name: Type, dtype: object

In [35]:
y = df_y_one_hot.values

In [36]:
y = np.stack(y)

## Train the Model

In [66]:
HYPER_PARAMS = {
    'epochs': 10000,
    'learning_rate': 20
}

In [67]:
network, stored_cost, y_pred = train_nn(X, y, network)

 * The initial cost is 0.454.
 * Final cost: 0.219.
 * Final accuracy: 44.554%


In [80]:
def compute_label_counts(y):
    y_label_summary = pd.Series(y).value_counts(normalize=True).reset_index().sort_values(by='index')
    y_label_summary.columns = ['Label', 'Fraction']
    return y_label_summary

In [81]:
y_counts = compute_label_counts(df_y)

In [82]:
y_counts

Unnamed: 0,Label,Fraction
0,1,0.405941
1,2,0.19802
5,3,0.049505
2,4,0.128713
6,5,0.039604
4,6,0.079208
3,7,0.09901


In [68]:
y_pred_flat = flatten_label_predictions(y_pred)

In [83]:
y_pred_counts = compute_label_counts(y_pred_flat)

In [84]:
y_pred_counts

Unnamed: 0,Label,Fraction
0,0,0.950495
1,2,0.029703
2,4,0.009901
3,6,0.009901


### Resources
This notebook has been inspired by the Towards Data Science post [Let’s code a Neural Network in plain NumPy](https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795).

Additional resources include:

* [A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/).
* [Creating a Neural Network from Scratch in Python: Multi-class Classification](https://stackabuse.com/creating-a-neural-network-from-scratch-in-python-multi-class-classification/).
* [The Softmax Function Derivative (Part 1)](https://aimatters.wordpress.com/2019/06/17/the-softmax-function-derivative/).
* [Understanding and implementing Neural Network with SoftMax in Python from scratch](http://www.adeveloperdiary.com/data-science/deep-learning/neural-network-with-softmax-in-python/)