---
title: "Introduction to Neural Networks"
format: html
page-layout: full
code-line-numbers: true
code-block-border: true
toc: true
toc-location: left
number-sections: true
jupyter: python3
---

-  Neural Network with one hidden layer

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_01_12.jpg)

# Example - Iris Dataset

- a simple neural network that will classify the Iris flower dataset

In [1]:
import pandas as pd

dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                      names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

dataset['species'] = pd.Categorical(dataset['species']).codes

In [2]:
dataset

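|     | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |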


In [3]:
# Shuffle the dataset

dataset = dataset.sample(frac=1, random_state=1234)
dataset

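|     | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
| 91  | 6.1 | 3.0 | 4.6 | 1.4 | 1 |
| 63  | 6.1 | 2.9 | 4.7 | 1.4 | 1 |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | 2 |
| 6   | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 59  | 5.2 | 2.7 | 3.9 | 1.4 | 1 |
| ... | ... | ... | ... | ... | ... |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | 2 |
| 116 | 6.5 | 3.0 | 5.5 | 1.8 | 2 |
| 53  | 5.5 | 2.3 | 4.0 | 1.3 | 1 |
| 38  | 4.4 | 3.0 | 1.3 | 0.2 | 0 |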


In [4]:
# split the data set into train and test subsets
# Use 120 samples for training and 30 samples for testing

train_input = dataset.values[:120, :4]
train_target = dataset.values[:120, 4]

test_input = dataset.values[120:, :4]
test_target = dataset.values[120:, 4]

- Define a feedforward network with one hidden layer with five units,
- a ReLU activation function, *f(x) = max(0, x)*,
- and an output layer with three units.
    - Each output unit corresponds to one of the three classes of Iris flower.
    
- The following is the PyTorch definition of the network:

In [5]:
import torch

torch.manual_seed(1234)

hidden_units = 5

net = torch.nn.Sequential(
    torch.nn.Linear(4, hidden_units), # we'll use a network with 5 hidden units
    torch.nn.ReLU(), # ReLU activation
    torch.nn.Linear(hidden_units, 3) # 3 output units, one per class
)

- Conceptually, we'll use one-hot encoding for the target data.
- Each class of the flower is represented as an array
    - (Iris Setosa = [1, 0, 0], Iris Versicolour = [0, 1, 0], and Iris Virginica = [0, 0, 1]),
- and each element of the array is the target for one unit of the output layer.
- In the code, `torch.nn.CrossEntropyLoss` takes the integer class index (0, 1, or 2) directly, which is equivalent to using the one-hot target.
- When the network classifies a new sample, we'll determine the class by taking the unit with the highest activation value.
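As a small illustration of the encoding described above, the following sketch builds one-hot targets with NumPy and picks the predicted class with `argmax`. The helper name `one_hot` and the sample activation values are made up for this example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# 0 = Iris Setosa, 1 = Iris Versicolour, 2 = Iris Virginica
labels = np.array([0, 1, 2])
targets_one_hot = one_hot(labels, num_classes=3)

# Hypothetical output-layer activations for one sample
activations = np.array([0.1, 2.3, -0.7])
predicted_class = int(np.argmax(activations))  # index of the largest activation
```

Here the second unit has the highest activation, so the sample would be assigned to class 1.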


- Define the loss function
  - The loss function will measure how different the output of the network is compared to the target data.

In [6]:
criterion = torch.nn.CrossEntropyLoss()
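To make the loss concrete, here is a minimal NumPy sketch of what a cross-entropy loss computes for a batch: a softmax over the raw network outputs, followed by the mean negative log-probability of the correct class. The function name `cross_entropy` and the sample values are assumptions for illustration; `torch.nn.CrossEntropyLoss` with its default settings performs the equivalent computation.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy between raw outputs (logits) and integer class labels."""
    # Subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: log of the normalized class probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for each sample
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two hypothetical samples with three output units each
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])
loss = cross_entropy(logits, targets)
```

The loss is small when the unit for the correct class has the largest output, and it grows as the network puts more weight on the wrong classes.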

- Define the optimizer
   - stochastic gradient descent (SGD) optimizer (a variation of the gradient descent algorithm) with a learning rate of 0.1 and a momentum of 0.9

In [7]:
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

We'll run the training for 50 epochs, which means that we'll iterate 50 times over the training dataset: 


1.   Create the torch tensors `inputs` and `targets` from the NumPy arrays `train_input` and `train_target`.
2.   Zero the gradients of the optimizer to prevent accumulation from the previous iterations. Then feed the training data to the network, `net(inputs)`, and compute the loss, `criterion(out, targets)`, between the network output and the target data.
3.   Propagate the loss value back through the network to calculate how each network weight affects the loss function.
4.   The optimizer updates the weights of the network in a way that reduces future values of the loss function.

In [8]:
epochs = 50

for epoch in range(epochs):
    # torch.autograd.Variable is deprecated; plain tensors work directly
    inputs = torch.tensor(train_input, dtype=torch.float32)
    targets = torch.tensor(train_target, dtype=torch.long)

    optimizer.zero_grad()
    out = net(inputs)
    loss = criterion(out, targets)
    loss.backward()
    optimizer.step()

    if epoch == 0 or (epoch + 1) % 10 == 0:
        print('Epoch %d Loss: %.4f' % (epoch + 1, loss.item()))

Epoch 1 Loss: 1.2181
Epoch 10 Loss: 0.6745
Epoch 20 Loss: 0.2447
Epoch 30 Loss: 0.1397
Epoch 40 Loss: 0.1001
Epoch 50 Loss: 0.0855


Let's see what the final accuracy of our model is: 

In [9]:
import numpy as np

inputs = torch.tensor(test_input, dtype=torch.float32)
targets = torch.tensor(test_target, dtype=torch.long)

# Gradients are not needed for evaluation
with torch.no_grad():
    out = net(inputs)

# The predicted class is the output unit with the highest activation
_, predicted = torch.max(out, 1)

error_count = test_target.size - np.count_nonzero((targets == predicted).numpy())
accuracy = 100 * (test_target.size - error_count) / test_target.size
print('Errors: %d; Accuracy: %d%%' % (error_count, accuracy))

Errors: 0; Accuracy: 100%


# Math behind Neural Networks

## Vector
 - a one-dimensional array of numbers
   
   ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/34.png)

- Magnitude or length of a vector

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/37.png)
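A short numerical check of the vector magnitude, assuming the formula shown above is the usual Euclidean length (the square root of the sum of squared components). The example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, 4.0])

# Euclidean length computed from the definition: sqrt(x_1^2 + ... + x_n^2)
magnitude = np.sqrt(np.sum(x ** 2))

# The same value using NumPy's built-in norm
magnitude_np = np.linalg.norm(x)
```

Both expressions give a length of 5 for this vector.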

## Matrix

- a two-dimensional array of scalars

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/39.png)

## Tensor

- A multi-dimensional array with the following properties
    - rank  (the number of array dimensions)
    - shape (the size of each of the tensor's dimensions)
    - data type  (the data type of the tensor values)
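The three tensor properties can be inspected directly on a NumPy array, used here as a stand-in for a tensor. The shape `(2, 3, 4)` is an arbitrary choice for illustration.

```python
import numpy as np

# A rank-3 tensor: 2 blocks, each with 3 rows and 4 columns
t = np.zeros((2, 3, 4), dtype=np.float32)

rank = t.ndim     # number of array dimensions
shape = t.shape   # size of each dimension
dtype = t.dtype   # data type of the tensor values
```

PyTorch tensors expose the same information through `ndim`, `shape`, and `dtype` attributes.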

## Vector and matrix operations

 - Vector addition

   ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/40.png)
   
 - Dot product (or scalar product)
     - combines two *n*-dimensional vectors **a** and **b** into a scalar value 

   ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/41.png)

    - If the vectors are two-dimensional
  
     ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/45.png)


![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_02.jpg)
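The dot product can be sketched in NumPy as follows, assuming the formulas above are the standard ones: the sum of element-wise products, and the geometric form $|a||b|\cos\theta$. The two vectors are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Sum of element-wise products
dot = np.dot(a, b)

# Angle between the vectors from a . b = |a| |b| cos(theta)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta_degrees = np.degrees(np.arccos(cos_theta))
```

For these two vectors the dot product is 1 and the angle between them is 45 degrees.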
         
 - Cross product (or vector product)
     - a new vector perpendicular to both the input vectors
     - the magnitude of the output vector is equal to
  
       ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/48.png)

        ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_03.jpg)
 
      - the magnitude of the output vector is equal to the area of the parallelogram spanned by the two input vectors

- Matrix transpose

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/55.png)
  
  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/56.png)
  
  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/57.png)
  

- Matrix-scalar multiplication

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/58.png)

- Matrix-matrix addition

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/59.png)

- Matrix-vector multiplication

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/60.png)

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/61.png)


  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/62.png)


- Matrix-matrix multiplication

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/63.png)

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/64.png)
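The cross product and the matrix operations listed above map onto NumPy calls as in the sketch below. All values are arbitrary examples chosen so that the results are easy to check by hand.

```python
import numpy as np

# Cross product: a vector perpendicular to both inputs
a = np.array([2.0, 0.0, 0.0])
b = np.array([0.0, 3.0, 0.0])
c = np.cross(a, b)
area = np.linalg.norm(c)   # area of the parallelogram spanned by a and b

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])      # shape (3, 2)
x = np.array([1, 0, -1])    # shape (3,)

A_t = A.T        # transpose, shape (3, 2)
scaled = 2 * A   # matrix-scalar multiplication
added = A + A    # matrix-matrix addition (shapes must match)
Ax = A @ x       # matrix-vector multiplication, shape (2,)
AB = A @ B       # matrix-matrix multiplication, shape (2, 2)
```

Note the shape rule for multiplication: the number of columns of the left operand must equal the number of rows (or elements) of the right operand.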



## Units of Neural Networks

 - The smallest building blocks

   ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_08.jpg)
   

   ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/111.png)

 - Compute the weighted sum, $\Sigma x_i w_i + b$, also known as the activation value
     - the inputs $x_i$ represent either the outputs of other units of the network, or the values of the input data itself
     - the weights $w_i$ represent either the strength of the inputs or the strengths of the connections between the units
     - the weight $b$ is the **bias**, the weight of an always-on input unit with a value of 1
  
 - The sum $\Sigma x_i w_i + b$ serves as the input to $f$, the **activation function** (**transfer function**)
     - The output of $f$ is a single numerical value, representing the output of the unit itself
  
 - A unit with an identity activation function, $f(x) = x$, is equivalent to **multiple linear regression**
 - A unit with a **sigmoid activation** function is equivalent to **logistic regression**
 - A unit with a threshold activation function is equivalent to a **perceptron (binary classifier)**
     - $f(a) = 1$ if $a \ge 0$, else $0$
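A minimal sketch of a single unit with the three activation functions mentioned above. The function names and the input, weight, and bias values are hypothetical and chosen only for illustration.

```python
import numpy as np

def unit_output(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    a = np.dot(x, w) + b   # activation value
    return activation(a)

def identity(a):
    return a

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def threshold(a):
    return 1.0 if a >= 0 else 0.0

x = np.array([1.0, 2.0])     # inputs
w = np.array([0.5, -0.25])   # weights
b = 0.0                      # bias

linear_out = unit_output(x, w, b, identity)        # linear-regression-style output
sigmoid_out = unit_output(x, w, b, sigmoid)        # logistic-regression-style output
perceptron_out = unit_output(x, w, b, threshold)   # binary decision
```

The weighted sum is 0 for these values, so the three units output 0, 0.5, and 1 respectively.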

### Layers

- Fully connected (FC) layer in classical Neural Networks

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_09.jpg)


  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_10.jpg)

  
- Multi layer Neural Networks

 ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_11.jpg)


   - A unit from one layer is connected to all the units from the previous and following layers. Each connection has its own weight
   - The networks above are single-path networks with sequential layers
   - More generally, the layers can be connected in any way that forms a **directed acyclic graph**
   - For example, a network with randomly interconnected hidden layers

     ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_12.jpg)

   - $\theta$, the set of all weight matrices
   - A Neural network can be represented as a series of nested functions/operations

     ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/138.png)
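The nested-function view can be sketched as a forward pass through two fully connected layers. The layer sizes (3-4-2), the random weights, and the dictionary layout of `theta` are assumptions made for this example only.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def layer(x, W, b, activation):
    """One fully connected layer: activation(x W + b)."""
    return activation(x @ W + b)

rng = np.random.default_rng(0)

# theta: all weight matrices (and biases) of a hypothetical 3-4-2 network
theta = {
    "W1": rng.normal(size=(3, 4)), "b1": np.zeros(4),
    "W2": rng.normal(size=(4, 2)), "b2": np.zeros(2),
}

def network(x, theta):
    # Nested form f2(f1(x)); the output layer is kept linear here
    hidden = layer(x, theta["W1"], theta["b1"], relu)
    return layer(hidden, theta["W2"], theta["b2"], lambda a: a)

x = np.array([1.0, -2.0, 0.5])
output = network(x, theta)
```

Each layer is a function of the previous layer's output, so adding a layer simply adds one more level of nesting.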

## Activation functions

 - Without activation functions, the output of each unit would be the weighted sum of its inputs, a linear function
 - The entire network would then be a composition of linear functions, which is itself a linear function
 - Hence, the network would be equivalent to a simple linear regression model
 - To approximate non-linear functions, the network needs non-linear activation functions
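The collapse of stacked linear layers into a single linear layer can be checked numerically. In this sketch the shapes and random values are arbitrary; the point is only that the merged weights `W` and bias `b` reproduce the two-layer output.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 3))   # 5 samples with 3 features each

# Two stacked layers without activation functions
two_layers = (x @ W1 + b1) @ W2 + b2

# A single linear layer with merged weights and bias
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

same_output = np.allclose(two_layers, one_layer)
```

Inserting a non-linear function between the two layers breaks this equivalence, which is what gives a multi-layer network its extra expressive power.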

### Sigmoid

- Output bounded between 0 and 1
- Can be interpreted as the probability of the unit being active
  
![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_13.jpg)

### Hyperbolic tangent (tanh)

- Output is in the (-1, 1) range

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_14.jpg)

### Rectified Linear Unit (ReLU)

- ReLU repeats its input when $x > 0$, and stays $0$ otherwise
- Has an advantage over sigmoid and tanh when training neural networks with more hidden layers

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_15.jpg)
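The three activation functions above can be written in a few lines of NumPy. This is a plain sketch for comparing their output ranges, not the implementation used by any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
sigmoid_out = sigmoid(x)   # values in (0, 1)
tanh_out = tanh(x)         # values in (-1, 1)
relu_out = relu(x)         # 0 for negative inputs, x otherwise
```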

## Example - Boxcar function

![](figs_nn/fig01.png)

- A unit with single input and sigmoid activation

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_17.jpg)

- Unit outputs for different values of *b* and *w*

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_18.jpg)

- Combine two units with a hidden layer

  ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_19.jpg)
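The boxcar construction can be sketched with two steep sigmoid units whose outputs are subtracted. The interval `[0.25, 0.75]` and the steepness of 100 are assumed values for illustration and need not match the figures above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def boxcar(x, left=0.25, right=0.75, steepness=100.0):
    """Approximate a boxcar on [left, right] with two hidden sigmoid units."""
    h1 = sigmoid(steepness * (x - left))    # steps up at x = left
    h2 = sigmoid(steepness * (x - right))   # steps up at x = right
    # Output unit: weights +1 and -1 on the two hidden units
    return h1 - h2

x = np.array([0.0, 0.5, 1.0])
y = boxcar(x)
```

The output is close to 1 inside the interval and close to 0 outside it. Larger weights make the edges of the box sharper.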

## Universal approximation theorem

- $f_\theta(x) = g(x)$
  - *x* is the input data
  - $\theta$ are the NN weights
  - $g(x)$ is the target function, represented by a collection of input samples and labels

- Any continuous function can be approximated to an arbitrary degree of accuracy by a feedforward NN
   - with at least one hidden layer with a finite number of units and a non-linear activation

## Training NNs

 - Find parameters $\theta$ such that $f_\theta(x)$ will be the best approximation of $g(x)$
 - Train NN using **mean square error (MSE)** cost function
     - measures the difference (**error**) between the network output and the training data labels $t^{(i)}$ of all the training samples $x^{(i)}$

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/157.png)
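A minimal sketch of the MSE cost, assuming the plain mean of squared differences. The formula above may use a different constant factor (a factor of 1/2 is often added to simplify the derivative), which does not change the location of the minimum. The outputs and labels are hypothetical.

```python
import numpy as np

def mse(outputs, targets):
    """Mean of the squared differences between network outputs and labels."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((outputs - targets) ** 2)

outputs = [0.9, 0.2, 0.4]   # hypothetical network outputs
targets = [1.0, 0.0, 1.0]   # corresponding labels
cost = mse(outputs, targets)
```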

 - Gradient Descent (GD)
   - Compute the derivative (gradient) of $J(\theta)$ with respect to all the network weights
       - indicates how $J(\theta)$ changes with respect to each weight
   - Use this information to update the weights to minimize $J(\theta)$ in future iterations
   - Goal is to gradually reach the **global minimum** of the cost function, where the gradient is 0
  
![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_20.jpg)

- Initialize the network weights, $\theta$, with random values
- Repeat the following steps until the cost function, $J(\theta)$, falls below a certain threshold
    - **Forward pass**
        - Compute the MSE cost function $J(\theta)$ for all the samples of the training set
    - **Backward pass**
        - Compute the partial derivatives (gradients) of  $J(\theta)$ with respect to all the network weights $\theta_j$ using the chain rule
     
          ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/166.png)

        - Use the gradient values to update each of the network weights, where $\eta$ is the **learning rate**, which determines the step size at which the optimizer updates the weights during training

      ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/170.png)
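The loop above can be sketched on the simplest possible model, a single linear unit $y = wx + b$, where the gradients are easy to write by hand. The synthetic data, the learning rate of 0.1, and the stopping threshold are assumptions for illustration. A real network would compute the gradients with backpropagation instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data from t = 2x + 1 (hypothetical example)
x = np.linspace(-1.0, 1.0, 50)
t = 2.0 * x + 1.0

# Initialize the weights theta = (w, b) with random values
w, b = rng.normal(size=2)
eta = 0.1   # learning rate

for step in range(500):
    # Forward pass: model output and MSE cost over the whole training set
    y = w * x + b
    cost = np.mean((y - t) ** 2)
    if cost < 1e-8:   # stop once the cost falls below the threshold
        break

    # Backward pass: partial derivatives of the cost w.r.t. each weight
    grad_w = np.mean(2.0 * (y - t) * x)
    grad_b = np.mean(2.0 * (y - t))

    # Weight update: theta_j <- theta_j - eta * dJ/dtheta_j
    w -= eta * grad_w
    b -= eta * grad_b
```

After training, `w` and `b` are close to the values 2 and 1 that generated the data.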

### Momentum

 - Gradient descent may converge to a local minimum

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_02_21.jpg)

 - Adjust the current weight update with the values of the previous weight updates
     - if the weight update in the previous step was big, it will also increase the weight update in the next step (**momentum**)
  
 - Weight update rule

    ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/183.png)
   
 - During step *t* of the training process
     - First, calculate the current weight update value $v_t$ by also including the velocity of the previous update $v_{t-1}$. $\mu$ is a hyperparameter in the range [0,1], called the *momentum* rate

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/186.png)
     
 -  Then, do the actual weight update  
    
 ![](https://static.packt-cdn.com/products/9781837638505/graphics/image/188.png)


 - Adaptive learning rate algorithm **Adam**
     - Calculates individual and adaptive learning rates for every weight, based on previous weight updates (momentum) 

- **Batch gradient descent**
    - As described above
    - Accumulates the error across all the training samples and performs a single weight update 
- **Stochastic** (or **online**) **gradient descent**
    - Updates the weights after every training sample 
- **Mini-batch gradient descent**
    - Accumulates the error over batches of $k$ samples and performs one weight update after each mini-batch 
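The following sketch combines mini-batch gradient descent with the momentum update described earlier, assuming the rule $v_t = \mu v_{t-1} - \eta \nabla J$ followed by $\theta \leftarrow \theta + v_t$. The linear model, the synthetic data, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from t = 2x + 1 with 64 samples (hypothetical example)
x = rng.uniform(-1.0, 1.0, size=64)
t = 2.0 * x + 1.0

theta = np.zeros(2)      # theta[0] = w, theta[1] = b
velocity = np.zeros(2)   # v_{t-1}, the previous weight update
eta, mu = 0.05, 0.9      # learning rate and momentum rate
k = 8                    # mini-batch size

for epoch in range(200):
    order = rng.permutation(len(x))   # reshuffle the samples every epoch
    for start in range(0, len(x), k):
        batch = order[start:start + k]
        xb, tb = x[batch], t[batch]

        # Gradient of the MSE cost on this mini-batch only
        error = theta[0] * xb + theta[1] - tb
        grad = np.array([np.mean(2.0 * error * xb), np.mean(2.0 * error)])

        # v_t = mu * v_{t-1} - eta * gradient, then theta = theta + v_t
        velocity = mu * velocity - eta * grad
        theta = theta + velocity
```

Setting `k = len(x)` turns this loop into batch gradient descent, and `k = 1` turns it into stochastic (online) gradient descent.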

# Example - XOR Classification

In [10]:
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

In [11]:
def tanh(x):
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))


def tanh_derivative(x):
    return (1 + tanh(x)) * (1 - tanh(x))

In [12]:
class NeuralNetwork:
    # net_arch consists of a list of integers, indicating
    # the number of neurons in each layer
    def __init__(self, net_arch):
        self.activation_func = tanh
        self.activation_derivative = tanh_derivative
        self.layers = len(net_arch)
        self.steps_per_epoch = 1000
        self.net_arch = net_arch

        # initialize the weights with random values in the range (-1,1)
        self.weights = []
        for layer in range(len(net_arch) - 1):
            w = 2 * np.random.rand(net_arch[layer] + 1, net_arch[layer + 1]) - 1
            self.weights.append(w)

    def fit(self, data, labels, learning_rate=0.1, epochs=10):
        """
        :param data: the set of all possible pairs of booleans,
                     True or False indicated by the integers 1 or 0
        :param labels: the result of the logical operation 'xor'
                       on each of those input pairs (array of 0/1)
        """

        # Add bias units to the input layer
        bias = np.ones((1, data.shape[0]))
        input_data = np.concatenate((bias.T, data), axis=1)

        for k in range(epochs * self.steps_per_epoch):
            if k % self.steps_per_epoch == 0:
                print('epochs: {}'.format(k / self.steps_per_epoch))
                for s in data:
                    # use self rather than the global variable nn
                    print(s, self.predict(s))

            sample = np.random.randint(data.shape[0])
            y = [input_data[sample]]

            for i in range(len(self.weights) - 1):
                activation = np.dot(y[i], self.weights[i])
                activation_f = self.activation_func(activation)
                # add the bias for the next layer
                activation_f = np.concatenate((np.ones(1), np.array(activation_f)))
                y.append(activation_f)

            # last layer
            activation = np.dot(y[-1], self.weights[-1])
            activation_f = self.activation_func(activation)
            y.append(activation_f)

            # error for the output layer
            error = y[-1] - labels[sample]
            delta_vec = [error * self.activation_derivative(y[-1])]

            # we need to begin from the back from the next to last layer
            for i in range(self.layers - 2, 0, -1):
                error = delta_vec[-1].dot(self.weights[i][1:].T)
                error = error * self.activation_derivative(y[i][1:])
                delta_vec.append(error)

            # reverse
            # [level3(output)->level2(hidden)]  => [level2(hidden)->level3(output)]
            delta_vec.reverse()

            # backpropagation
            # 1. Multiply its output delta and input activation
            #    to get the gradient of the weight.
            # 2. Update the weight using the weight update formula
            for i in range(len(self.weights)):
                # use self.net_arch rather than the global variable nn
                layer = y[i].reshape(1, self.net_arch[i] + 1)

                delta = delta_vec[i].reshape(1, self.net_arch[i + 1])
                self.weights[i] -= learning_rate * layer.T.dot(delta)

    def predict(self, x):
        val = np.concatenate((np.ones(1).T, np.array(x)))
        for i in range(0, len(self.weights)):
            val = self.activation_func(np.dot(val, self.weights[i]))
            val = np.concatenate((np.ones(1).T, np.array(val)))

        return val[1]

    def plot_decision_regions(self, X, y, points=200):
        markers = ('o', '^')
        colors = ('red', 'blue')
        cmap = ListedColormap(colors)

        x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1

        resolution = max(x1_max - x1_min, x2_max - x2_min) / float(points)

        xx1, xx2 = np.meshgrid(np.arange(x1_min,
                                         x1_max,
                                         resolution),
                               np.arange(x2_min, x2_max, resolution))
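        # grid of points covering the plot area (avoids shadowing the builtin input)
        grid = np.array([xx1.ravel(), xx2.ravel()]).T
        Z = np.empty(0)
        for i in range(grid.shape[0]):
            val = self.predict(np.array(grid[i]))
            # threshold the network output into class 0 or 1
            val = 1 if val >= 0.5 else 0
            Z = np.append(Z, val)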

        Z = Z.reshape(xx1.shape)

        plt.pcolormesh(xx1, xx2, Z, cmap=cmap)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())
        # plot all samples

        classes = ["False", "True"]

        for idx, cl in enumerate(np.unique(y)):
            plt.scatter(x=X[y == cl, 0],
                        y=X[y == cl, 1],
                        alpha=1.0,
                        c=colors[idx],
                        edgecolors='black',
                        marker=markers[idx],
                        s=80,
                        label=classes[idx])

        plt.xlabel('x1')
        plt.ylabel('x2')
        plt.legend(loc='upper left')
        plt.show()


In [13]:
np.random.seed(0)

# Initialize the NeuralNetwork with 2 input, 2 hidden, and 1 output neurons
nn = NeuralNetwork([2, 2, 1])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

y = np.array([0, 1, 1, 0])

nn.fit(X, y, epochs=10)

print("Final prediction")
for s in X:
    print(s, nn.predict(s))



epochs: 0.0
[0 0] 0.31634987228520156
[0 1] 0.38455314510086014
[1 0] 0.49960366414001517
[1 1] 0.5470092417007291
epochs: 1.0
[0 0] 0.10110562119575028
[0 1] 0.4983062530300435
[1 0] 0.5483740117095983
[1 1] 0.635812878112665
epochs: 2.0
[0 0] 0.07164948329787507
[0 1] 0.861075813281495
[1 0] 0.8502850626450226
[1 1] 0.07158530421971575
epochs: 3.0
[0 0] 0.017586567899253898
[0 1] 0.9666637734019891
[1 0] 0.9651222166853127
[1 1] 0.011468668342141863
epochs: 4.0
[0 0] -0.0017118993569209561
[0 1] 0.9815292663780985
[1 0] 0.9828812324283912
[1 1] -0.00030370918696389856
epochs: 5.0
[0 0] 0.0026985374081322524
[0 1] 0.9885083594808965
[1 0] 0.9891298042443863
[1 1] 0.015552778145750895
epochs: 6.0
[0 0] 0.005625211435814411
[0 1] 0.9922099656276941
[1 0] 0.9915443580479176
[1 1] 0.01694332658239157
epochs: 7.0
[0 0] 0.0019544398675395446
[0 1] 0.9934850143000605
[1 0] 0.9934672674082785
[1 1] 0.0007886110283738284
epochs: 8.0
[0 0] 0.0036493566842653565
[0 1] 0.9950489745378326
[1 0] 0.

nn.plot_decision_regions(X, y)

# References
  - Python Deep Learning, Third Edition, Ivan Vasilev, Packt Publishing