# Introduction to Deep Learning - I

Created by [Nimblebox Inc.](https://www.nimblebox.ai/).

<img style="float:left; margin-left: 50px" src="https://databricks.com/wp-content/uploads/2019/02/neural1.jpg" alt="Neural network illustration" width="300" height="400">

<img style="float:right; margin-right: 50px" src="https://media-exp1.licdn.com/dms/image/C4E1BAQH3ErUUfLXoHQ/company-background_10000/0?e=2159024400&v=beta&t=9Z2hcX4LqsxlDd2BAAW8xDc-Obfvk_rziT1AkPKBcCc" alt="Nimblebox Logo" width="500" height="600">

## Introduction

So far we have looked at traditional statistical machine learning models such as Support Vector Machines, linear and logistic models, and Random Forests.

As we have seen, machine learning is a set of algorithms that parse data, learn from it, and then apply what they have learned to make intelligent decisions. Deep learning, by contrast, achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones.

<div style="text-align: center"><img src="https://i2.wp.com/semiengineering.com/wp-content/uploads/2018/01/MLvsDL.png"></div>

### What is Deep Learning?

Deep learning, a machine learning technique, is a way of learning that typically relies on large amounts of data, where the features that help a machine map an input to an output are extracted automatically by layers of "neurons". A neural network is a type of deep learning architecture, and in this webinar we will focus on a simple Artificial Neural Network.

### What is a Neural Network?

Neural networks are composed of simple building blocks called neurons. A neuron is a mathematical function that takes data as input, performs a transformation on it, and produces an output. In principle a neuron could apply any mathematical function; in neural networks, however, we typically use a weighted sum followed by a non-linear function.

<figure style="text-align: center"><img src="https://miro.medium.com/max/702/1*NZc0TcMCzpgVZXvUdEkqvA.png" width="400" height="500"><figcaption>A single neuron in a network</figcaption></figure>

Looking at the neuron above, you can see that it is composed of two main parts: the summation and the activation function. A neuron takes data (x₁, x₂, x₃) as input, multiplies each input by its own weight (w₁, w₂, w₃), sums the results (usually together with a bias term), and then passes that sum to a nonlinear function called the activation function to produce an output.
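
As a minimal sketch of the computation described above, the snippet below evaluates a single neuron with NumPy. The input values, weights, bias, and the choice of sigmoid as the activation are illustrative assumptions, not values taken from the figure.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    z = np.dot(w, x) + b   # summation step
    return sigmoid(z)      # activation step

# Hypothetical inputs and parameters, chosen only for illustration
x = np.array([1.0, 2.0, 3.0])    # x1, x2, x3
w = np.array([0.5, -0.25, 0.0])  # w1, w2, w3
b = 0.0

output = neuron(x, w, b)
print(output)  # z = 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5
```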

A neural network combines multiple neurons by stacking them vertically and horizontally to create a network of neurons, hence the name "neural network". A network with a single neuron is called a perceptron and is the simplest possible neural network.

### Layers of Neural Network

1. The first layer is called the input layer, and the number of nodes will depend on the number of features present in your dataset.
2. The final layer of the neural network is called the output layer, and the number of nodes depends on what you are trying to predict. For regression and binary classification tasks, you can use a single node, while for multi-class problems you will use multiple nodes, one per class.
3. The layers between the input and output layers are where the magic happens; these are called the hidden layers. The hidden layers can be as deep or wide as you want. A deeper network can represent more complex functions, but it also takes longer to compute and train, so deeper is not automatically better.

<div style="text-align: center"><img src="https://miro.medium.com/max/702/1*Z3zHoX1nhK6Rsmd4yNPdsg.jpeg" width="500" height="600"></div>
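
To make the layer sizes concrete, here is a small sketch that derives the shape of each weight matrix and bias vector from a list of layer sizes. The sizes 784, 30, and 10 match the MNIST network built at the end of this notebook; the helper name `parameter_shapes` is hypothetical and exists only for this illustration.

```python
def parameter_shapes(sizes):
    """Return (weight_shape, bias_shape) for each pair of consecutive layers."""
    shapes = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        shapes.append(((n_in, n_out), (1, n_out)))
    return shapes

# 784 input features, one hidden layer of 30 nodes, 10 output classes
sizes = [784, 30, 10]
shapes = parameter_shapes(sizes)
for layer, (w_shape, b_shape) in enumerate(shapes, start=1):
    print("layer", layer, "weights", w_shape, "biases", b_shape)

# Total number of learnable parameters (weights plus biases)
total = sum(n * m + m for (n, m), _ in shapes)
print("total parameters:", total)
```

Widening or deepening the hidden layers changes only the `sizes` list, which is why the parameter count (and the training time) grows quickly with network size.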

### Weights and Bias

Weights and biases are the learnable parameters that help a neural network correctly learn a function. Think of weights as a measure of how sure you are that a feature contributes to a prediction and the bias as a base value that your predictions must start from.

A machine learning model uses lots of examples to learn the correct weights and bias to assign to each feature in a dataset to help it correctly predict outputs.

### Activation Function

Activations are the nonlinear computations done in each node of a Neural Network. There are many types of activation functions used in deep learning - some popular ones are Sigmoid, ReLU, tanh, Leaky ReLU, and so on.

<div style="text-align: center"><img src="https://miro.medium.com/max/702/1*ZafDv3VUm60Eh10OeJu1vw.png" width="600" height="700"></div>
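
The activation functions named above can each be written in a line or two of NumPy. The following is a minimal sketch; the slope `alpha=0.01` for Leaky ReLU is a commonly used default and an assumption here, not a value given in this notebook.

```python
import numpy as np

def sigmoid(z):
    # Maps any real input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Maps any real input to the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Keeps positive inputs and replaces negative inputs with 0
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by alpha instead of zeroed
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(z))
```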

Each hidden layer receives values from the previous layer, calculates a weighted sum, adds the bias term, and passes the result through an activation function. This continues layer by layer until the last hidden layer. The result from the last hidden layer is then passed to the output layer, where another weighted sum is computed using the output layer's own weights and biases. Instead of a hidden-layer activation, this result is passed through an output activation function chosen to suit the task (for example, sigmoid or softmax for classification).

### Loss Function

The loss function is a way of measuring how good a model's prediction is so that the model can adjust its weights and biases. A loss function must be properly designed so that it penalizes a model that is wrong and rewards a model that is right. This means that you want the loss to tell you whether a prediction is far from or close to the true value. The choice of loss function depends on the task; for classification problems, you can use cross-entropy loss.

$$CE = -\sum_{i=1}^{C} y_i \log(\hat y_i)$$

Here $C$ is the number of classes, $y_i$ is the true value (1 for the correct class and 0 otherwise), and $\hat y_i$ is the predicted probability for class $i$.
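
As a rough illustration of the formula, the sketch below computes the cross-entropy for one example with three classes. The probability vectors are made up for the example, and the small clipping constant `eps` is an added safeguard against `log(0)` that is not part of the formula itself.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one example; y_true is one-hot, y_pred sums to 1."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])             # the true class is index 1
confident_right = np.array([0.05, 0.90, 0.05])  # hypothetical good prediction
confident_wrong = np.array([0.90, 0.05, 0.05])  # hypothetical bad prediction

loss_right = cross_entropy(y_true, confident_right)
loss_wrong = cross_entropy(y_true, confident_wrong)
print(loss_right)  # -log(0.90), a small loss
print(loss_wrong)  # -log(0.05), a much larger loss
```

Because only the term for the true class survives the sum, the loss depends solely on how much probability the model assigned to the correct class.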

### Forward Propagation

Forward propagation is the name given to the series of computations performed by the neural network before a prediction is made. In a two-layer network, it will perform the following computation for forward propagation:

1. Compute the weighted sum between the input and the first layer's weights and then add the bias: **Z1 = (W1 * X) + b1**
2. Pass the result through the sigmoid activation function: **A1 = sigmoid(Z1)**
3. Compute the weighted sum between the output (A1) of the previous step and the second layer's weights, then add the bias: **Z2 = (W2 * A1) + b2**
4. Compute the output function by passing the result through a sigmoid function: **A2 = sigmoid(Z2)**
5. And finally, compute the loss between the predicted output and the true labels: **loss(A2, Y)**

Note: For a three-layer neural network, you would additionally compute Z3 and A3 using W3 and b3, and A3 would become the network's output.
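
The five steps above can be sketched in NumPy as follows. This is an illustrative example under stated assumptions: the layer sizes, random inputs, and labels are made up, the loss is binary cross-entropy for a single sigmoid output, and each column of `X` is one example so that the code matches the `W * X + b` form of the formulas. Note that the `NeuralNetwork` class later in this notebook stores examples as rows instead, so it computes `a.dot(w) + b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of a two-layer network; columns of X are examples."""
    Z1 = W1 @ X + b1      # step 1: weighted sum plus bias
    A1 = sigmoid(Z1)      # step 2: hidden activation
    Z2 = W2 @ A1 + b2     # step 3: weighted sum plus bias
    A2 = sigmoid(Z2)      # step 4: output activation
    return A1, A2

def binary_cross_entropy(A2, Y):
    """Step 5: mean loss over examples for a single sigmoid output."""
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))            # 3 features, 4 examples (hypothetical)
Y = np.array([[0.0, 1.0, 1.0, 0.0]])   # one label per example (hypothetical)
W1 = rng.normal(scale=0.1, size=(5, 3))
b1 = np.zeros((5, 1))
W2 = rng.normal(scale=0.1, size=(1, 5))
b2 = np.zeros((1, 1))

A1, A2 = forward(X, W1, b1, W2, b2)
loss = binary_cross_entropy(A2, Y)
print(A1.shape, A2.shape, loss)
```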

### Backpropagation

Backpropagation is the algorithm that computes how much each weight and bias contributed to the loss, so that they can be updated during training.

A neural network learns to predict the correct values by repeatedly adjusting its weights and checking the loss. If the loss decreases, the new weights are better than the previous ones; if it increases, they are worse. This means that the neural net has to go through many prediction (forward propagation) and update (backpropagation) cycles in order to find good weights and biases. This cycle is what we generally refer to as the training phase, and the process of searching for the right weights is called optimization.

We do this by using the chain rule to compute the gradient of the loss with respect to each weight and bias, working backwards from the output layer to the input layer.
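
To see the chain rule at work on the smallest possible case, the sketch below uses a single sigmoid neuron with a binary cross-entropy loss, an assumption made to keep the example short. For this setup the chain rule gives dL/dz = a - y, so the gradients are (a - y) * x for the weights and (a - y) for the bias; the same `a - y` term appears as `delta` in the `backpropagation` method later in this notebook. The code compares that result with a finite-difference estimate. All numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of one sigmoid neuron on one example."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def analytic_gradient(w, b, x, y):
    """Chain rule: dL/dz = a - y, dz/dw = x, dz/db = 1."""
    a = sigmoid(np.dot(w, x) + b)
    dz = a - y
    return dz * x, dz

def numerical_gradient(w, b, x, y, h=1e-6):
    """Central finite differences, used only to check the chain-rule result."""
    grad_w = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad_w[i] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)
    grad_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return grad_w, grad_b

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b, y = 0.05, 1.0

gw, gb = analytic_gradient(w, b, x, y)
nw, nb = numerical_gradient(w, b, x, y)
print(np.allclose(gw, nw, atol=1e-6), np.isclose(gb, nb, atol=1e-6))
```

In a multi-layer network the same idea is applied layer by layer, with each layer reusing the gradient already computed for the layer after it.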

This is a vast topic, so I recommend that you go through this [book chapter on the topic](https://page.mi.fu-berlin.de/rojas/neural/chapter/K7.pdf) or watch some tutorials to get comfortable with the concept.

### Training Neural Networks

To automatically use this information to update the weights and biases, a neural network must perform hundreds, thousands, and even millions of forward and backward propagations. That is, in the training phase, the neural network must perform the following:
1. Forward propagation
2. Backpropagation
3. Weight updates with calculated gradients
4. Repeat
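
Before the full implementation, here is a minimal sketch of that four-step loop on a toy problem. It assumes a single sigmoid neuron learning the logical OR of two inputs; the dataset, learning rate, and number of epochs are illustrative choices rather than values used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: the logical OR of two binary inputs (rows are examples)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

rng = np.random.default_rng(0)
w = rng.uniform(-0.12, 0.12, size=(2, 1))
b = np.zeros((1, 1))
learning_rate = 0.5
losses = []

for epoch in range(2000):
    # 1. forward propagation
    a = sigmoid(X @ w + b)
    losses.append(float(-np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))))
    # 2. backpropagation (gradient of the mean cross-entropy loss)
    delta = a - y
    grad_w = X.T @ delta / len(X)
    grad_b = delta.mean(axis=0, keepdims=True)
    # 3. weight updates with the calculated gradients
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 4. repeat

predictions = (sigmoid(X @ w + b) > 0.5).astype(float)
print(losses[0], losses[-1])
print(predictions.ravel())
```

The class below follows the same loop, extended to several layers, mini-batches, and L2 regularization.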

In [None]:
import numpy as np
import random


class NeuralNetwork(object):
    def __init__(self, *sizes):
        self.num_layers = len(sizes)
        self.b = [np.random.uniform(-0.12, 0.12, (1, n)) for n in sizes[1:]]
        self.w = [np.random.uniform(-0.12, 0.12, (n, m)) for n, m in zip(sizes[:-1], sizes[1:])]
        # self.b = [ np.random.randn(1, n) for n in sizes[1:] ]
        # self.w = [ np.random.randn(n, m) for n, m in zip(sizes[:-1], sizes[1:]) ]
        self.a = [[] for i in range(self.num_layers)]
        self.z = [[] for i in range(self.num_layers-1)]

        self.training_size = 0
        self.reg_lambda = 1.0

        self.grad_w = [[] for i in range(self.num_layers-1)]
        self.grad_b = [[] for i in range(self.num_layers-1)]

    def feedForward(self, X):
        self.a[0] = X
        for i in range(self.num_layers-1):
            self.z[i] = self.a[i].dot(self.w[i]) + self.b[i]
            self.a[i+1] = self.sigmoid(self.z[i])

    def cost(self, y):
        diff_Ks = -y * np.log(self.a[-1]) - (1 - y) * np.log(1 - self.a[-1])
        J = sum(np.sum(diff_Ks, axis=0)) / self.training_size

        # regularization
        totalSum = np.array([])
        for w in self.w:
            totalSum = np.concatenate((totalSum, sum(w ** 2.0)))
        totalSum = sum(totalSum)
        J = J + ((self.reg_lambda * totalSum) / (2.0 * self.training_size))

        return J

    def backpropagation(self, y):
        batch_size = float(len(y))
        delta = (self.a[-1] - y)
        self.grad_w[-1] = ((delta.T).dot(self.a[-2])) / batch_size
        self.grad_b[-1] = np.sum(delta, axis=0) / batch_size
        for i in range(2, self.num_layers):
            delta = delta.dot(self.w[-i+1].T) * self.sigmoidPrime(self.z[-i])
            self.grad_w[-i] = ((delta.T).dot(self.a[-i - 1])) / batch_size
            self.grad_b[-i] = np.sum(delta, axis=0) / batch_size

        # regularization
        for i in range(self.num_layers-1):
            self.grad_w[i] = self.grad_w[i].T + (self.reg_lambda * self.w[i]) / self.training_size

    def gradientDescent(self, X, y, regularization, learning_rate, epochs, output=False):
        self.training_size = float(len(X))
        self.reg_lambda = regularization  # apply the strength passed by the caller

        for e in range(epochs):
            self.feedForward(X)
            self.backpropagation(y)

            # update weights
            for l in range(self.num_layers-1):
                self.w[l] = self.w[l] - learning_rate * self.grad_w[l]
                self.b[l] = self.b[l] - learning_rate * self.grad_b[l]
            
            if output:
                predictions = np.argmax(self.a[-1], axis=1)
                precision = self.evaluatePredictions(predictions, np.nonzero(y)[1])
                print("Epoch: {0} - precision: {1:.4f}, cost: {2:.4f}".format(e, precision, self.cost(y)))


    def stochasticGradientDescent(self, X, y, regularization, learning_rate, epochs, batch_size, output=False):
        self.training_size = float(len(X))
        self.reg_lambda = regularization  # apply the strength passed by the caller
        for e in range(epochs):
            X, y = self.shuffleData(X, y)
            batches = [(X[i:i+batch_size], y[i:i+batch_size]) for i in range(0, int(self.training_size), batch_size)]
            for batch in batches:
                Xi = batch[0]
                yi = batch[1]

                self.feedForward(Xi)
                self.backpropagation(yi)

                # update weights
                for l in range(self.num_layers-1):
                    self.w[l] = self.w[l] - learning_rate * self.grad_w[l]
                    self.b[l] = self.b[l] - learning_rate * self.grad_b[l]
            
            if output:
                self.outputTrainingStatus(e, X, y)

    def outputTrainingStatus(self, epoch_num, X, y):
        y1 = np.nonzero(y)[1]
        predictions = self.predict(X)
        precision = self.evaluatePredictions(predictions, y1)
        print("Epoch: {0} - precision: {1:.4f}, cost: {2:.4f}".format(epoch_num, precision, self.cost(y)))

    def shuffleData(self, X, y):
        c = list(zip(X, y))
        random.shuffle(c)
        X, y = zip(*c)
        return (np.asarray(X), np.asarray(y))

    def predict(self, X):
        self.feedForward(X)
        return np.argmax(self.a[-1], axis=1)

    def evaluatePredictions(self, predictions, y):
        return (predictions == y).sum() / float(len(y))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoidPrime(self, z):
        return self.sigmoid(z) * (1 - self.sigmoid(z))
    
    
def one_hot_encoding(output_layer_size, m, y_train):
    y = np.zeros((m, output_layer_size))
    for i in range(m):
        y[i, :] = [item == y_train[i] for item in range(output_layer_size)]
    return y

In [None]:
import time
import numpy as np
from mnist import MNIST

# load the dataset
mndata = MNIST('./data')
images_train, labels_train = mndata.load_training()
images_test, labels_test = mndata.load_testing()

# convert to numpy arrays
images_train = np.asarray(images_train)
images_test = np.asarray(images_test)
labels_test = np.asarray(labels_test)

# normalize to [0,1] scale
images_train = images_train / 255.0
images_test = images_test / 255.0

X = images_train
y = labels_train

# transform each label in y into a one-hot vector of zeros with a single one,
# e.g. the digit 5 becomes [0 0 0 0 0 1 0 0 0 0]
output_layer_size = 10
m = len(images_train)
y = one_hot_encoding(output_layer_size, m, y)


start = time.time()

# let's create our neural network (with one hidden layer) and train it using SGD
nn = NeuralNetwork(784, 30, 10)
nn.stochasticGradientDescent(X, y, regularization=5.0, learning_rate=0.1, epochs=50, batch_size=10, output=True)

# you can also use gradient descent
# nn.gradientDescent(X, y, regularization=1.0, learning_rate=0.3, epochs=150, output=True)

# our neural network is already trained. Now we'll check its precision on
# identifying digits in the test set
predictions = nn.predict(images_test)
print("Precision in test: {0}".format(nn.evaluatePredictions(predictions, labels_test)))

print('Time in seconds: ' + str(time.time() - start))