## Deep Learning Fundamentals

[Playlist link](https://www.youtube.com/watch?v=OT1jslLoCyA&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=2)

### What is Deep Learning

Deep learning is a sub-field of machine learning that uses algorithms inspired by the structure and function of the brain's neural networks.

With deep learning, we're still talking about algorithms that learn from data just like we discussed in the last post on machine learning. However, now the algorithms or models that do this learning are based loosely on the structure and function of the brain's neural networks.

### Artificial Neural Networks

An artificial neural network is a computing system composed of a collection of connected units called neurons, which are organized into layers.

The connected neural units form the so-called network. Each connection between neurons transmits a signal from one neuron to the other. The receiving neuron processes the signal and then signals the downstream neurons connected to it within the network. Note that neurons are also commonly referred to as nodes.




The neural networks that we use in deep learning aren't actual biological neural networks though. They simply share some characteristics with biological neural networks and for this reason, we call them artificial neural networks (ANNs).


![](http://deeplizard.com/images/neural%20network%203%20layers.png)


### ANN - Architecture

Nodes are organized into what we call layers. At the highest level, there are three types of layers in every ANN:

- Input layer
- Hidden layers
- Output layer

Different layers perform different kinds of transformations on their inputs. Data flows through the network starting at the input layer and moving through the hidden layers until the output layer is reached. This is known as a forward pass through the network. Layers positioned between the input and output layers are known as hidden layers.


Let’s consider the number of nodes contained in each type of layer:

- Input layer - One node for each component of the input data.
- Hidden layers - Arbitrarily chosen number of nodes for each hidden layer.
- Output layer - One node for each of the possible desired outputs.

![](http://deeplizard.com/images/neural%20network%202%203%202.png)

This ANN has three layers total. The layer on the left is the input layer. The layer on the right is the output layer, and the layer in the middle is the hidden layer. Remember that each layer is comprised of neurons or nodes. Here, the nodes are depicted with the circles, so let’s consider how many nodes are in each layer of this network.

Number of nodes in each layer:

- Input layer (left): 2 nodes
- Hidden layer (middle): 3 nodes
- Output layer (right): 2 nodes


Since this network has two nodes in the input layer, each input to this network must have two dimensions, for example height and weight.

Since this network has two nodes in the output layer, this tells us that there are two possible outputs for every input that is passed forward (left to right) through the network. For example, overweight or underweight could be the two output classes. Note that the output classes are also known as the prediction classes.



### Keras Sequential Model

In Keras, we can build what is called a sequential model. **Keras defines a sequential model as a linear stack of layers. This is what we might expect, as we have just learned that neurons are organized into layers.**

This sequential model is Keras’ implementation of an artificial neural network. Let’s see now how a very simple sequential model is built using Keras.



```python
from keras.models import Sequential
from keras.layers import Dense, Activation
```

In the code further down, `model` is an instance of a `Sequential` object.

`Dense` is the object used to create layers. It is just one type of layer, and there are many different types of layers.

Looking at the arrows in our image (in the above section) coming from the hidden layer to the output layer, we can see that each node in the hidden layer is connected to all nodes in the output layer. This is how we know that the **output layer** in the image is a dense layer. This same logic applies to the hidden layer.



Dense is the most basic type of layer: it connects each input to each output within the layer.

The first parameter is the number of neurons (nodes) in the layer.

The input shape parameter `input_shape=(2,)` tells us how many neurons our input layer has, so in our case, we have two.

The `activation` parameter specifies the activation function, a non-linear function that typically follows a dense layer.


```python
layers = [
    Dense(3, input_shape=(2,), activation='relu'),
    Dense(2, activation='softmax')
]

model = Sequential(layers)
```


### Layers in a NN

A few examples of layers in a neural network are:

- Dense (or fully connected) layers
- Convolutional layers
- Pooling layers
- Recurrent layers
- Normalization layers

Different layers perform different transformations on their inputs, and some layers are better suited for some tasks than others. For example, a convolutional layer is usually used in models that work with image data. Recurrent layers are used in models that work with time series data, and fully connected layers, as the name suggests, fully connect each input to each output within the layer.

Let’s consider the following example ANN:

![](http://deeplizard.com/images/deep%20neural%20network%20with%204%20layers.png)

We can see that the first layer, the input layer, consists of eight nodes. Each of the eight nodes in this layer represents an individual feature from a given sample in our dataset.

This tells us that a single sample from our dataset consists of eight dimensions. When we choose a sample from our dataset and pass this sample to the model, each of the eight values contained in the sample will be provided to a corresponding node in the input layer.

We can see that each of the eight input nodes is connected to every node in the next layer.

Each connection between the first and second layers transfers the output from the previous node to the input of the receiving node (left to right). The two layers in the middle that have six nodes each are hidden layers simply because they are positioned between the input and output layers.

#### Layer weights

Each connection between two nodes has an associated weight, which is just a number.

Each weight represents the strength of the connection between the two nodes. When the network receives an input at a given node in the input layer, this input is passed to the next node via a connection, and the input will be multiplied by the weight assigned to that connection.

For each node in the second layer, a weighted sum is then computed with each of the incoming connections. This sum is then passed to an activation function, which performs some type of transformation on the given sum. For example, an activation function may transform the sum to be a number between zero and one. The actual transformation will vary depending on which activation function is used.

`node output = activation(weighted sum of inputs)`
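As a rough illustration of the formula above, here is a minimal sketch of the computation for a single node. The input values, the weights, and the helper name `node_output` are hypothetical, and bias terms are left out because they are not discussed here.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, activation):
    """Compute activation(weighted sum of inputs) for one node."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical values: three incoming connections into one node.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.5]   # weighted sum = 0.5 - 2.0 + 1.5 = 0.0

output = node_output(inputs, weights, sigmoid)
print(output)  # sigmoid(0.0) = 0.5
```

Swapping `sigmoid` for another function changes the transformation applied to the weighted sum without changing the rest of the computation.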

#### Forward pass through a neural network


Once we obtain the output for a given node, the obtained output is the value that is passed as input to the nodes in the next layer.

This process continues until the output layer is reached. The number of nodes in the output layer depends on the number of possible output or prediction classes we have. In our example, we have four possible prediction classes.

Suppose our model was tasked with classifying four types of animals. Each node in the output layer would represent one of four possibilities. For example, we could have cat, dog, llama or lizard. The categories or classes depend on how many classes are in our dataset.

For a given sample from the dataset, the entire process from input layer to output layer is called a forward pass through the network.
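The forward pass can be sketched in plain Python for a very small network. This is only an illustration under stated assumptions: the 2-3-2 layout, the weight values, and the helper names `dense` and `softmax` are made up for the example, and biases are omitted.

```python
import math

def relu(z):
    return max(0.0, z)

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights):
    """Weighted sums for one layer; weights[j] holds the weights into node j."""
    return [sum(x * w for x, w in zip(inputs, node_weights))
            for node_weights in weights]

# Hypothetical weights for a 2-3-2 network (biases omitted for brevity).
hidden_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # 3 nodes, 2 inputs each
output_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # 2 nodes, 3 inputs each

sample = [2.0, 3.0]
hidden = [relu(z) for z in dense(sample, hidden_weights)]  # [2.0, 3.0, 0.0]
probabilities = softmax(dense(hidden, output_weights))

print(hidden)
print(probabilities)  # one probability per output class
```

The output of each layer becomes the input of the next, which is all a forward pass is. The class with the highest probability would be the model's prediction.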

#### Finding the optimal weights

As the model learns, the weights at all connections are updated and optimized so that the input data point maps to the correct output prediction class.



### Defining the neural network in code with Keras

In our previous discussion, we saw how to use Keras to build a sequential model. Now, let’s do this for our example network.

We'll start out by defining an array of Dense objects, our layers. This array will then be passed to the constructor of the sequential model.

Remember our network looks like this:

![](http://deeplizard.com/images/deep%20neural%20network%20with%204%20layers.png)

Given this, we have

```python
layers = [
    # first hidden layer: needs to have input shape specified
    Dense(6, input_shape=(8,), activation='relu'),
    Dense(6, activation='relu'),
    Dense(4, activation='softmax')
]
model = Sequential(layers)
```

Notice how the first Dense object specified in the array is not the input layer. The first Dense object is the first hidden layer. The input layer is specified as a parameter to the first Dense object’s constructor.

Each input sample has eight dimensions, which is why the input shape is specified as `input_shape=(8,)`. Our first hidden layer has six nodes, as does our second hidden layer, and our output layer has four nodes.
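Since every connection carries one weight, the layer sizes alone determine how many weights this network has. The short sketch below counts them. The bias count at the end is an assumption that goes beyond this discussion: it relies on the Keras `Dense` layer adding one bias per node by default.

```python
layer_sizes = [8, 6, 6, 4]   # input, hidden, hidden, output

# One weight per connection between consecutive layers.
weights_per_layer = [n_in * n_out
                     for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(weights_per_layer)        # [48, 36, 24]
print(sum(weights_per_layer))   # 108

# Assumption: one bias per non-input node (the Keras Dense default).
biases = sum(layer_sizes[1:])
print(sum(weights_per_layer) + biases)   # 124
```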


### Activation Functions

In an artificial neural network, an activation function is a function that maps a node's inputs to its corresponding output.

`node output = activation(weighted sum of inputs)`

The activation function performs some type of operation to transform the sum into a number that is often between some lower limit and some upper limit. This transformation is often non-linear.


#### Sigmoid activation function

Sigmoid takes in an input and does the following:

- For large negative inputs, sigmoid will transform the input to a number close to zero.
- For large positive inputs, sigmoid will transform the input into a number close to one.
- For inputs close to zero, sigmoid will transform the input into some number between zero and one, near 0.5.

![](http://deeplizard.com/images/sigmoid%20function%20graph%20curve.svg)

So, for sigmoid, zero is the lower limit, and one is the upper limit.
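To make these three cases concrete, here is a small sketch that evaluates the standard sigmoid formula, `1 / (1 + e^(-x))`, at a few sample inputs. The sample inputs are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative, small, and large positive inputs.
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"sigmoid({x:6.1f}) = {sigmoid(x):.5f}")
```

The printed values move from just above zero, through 0.5 at an input of zero, to just below one.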

Alright, we now understand mathematically what one of these activation functions does, but what’s the intuition?

#### Activation function intuition

Well, an activation function is biologically inspired by activity in our brains where different neurons fire (or are activated) by different stimuli.

For example, if you smell something pleasant, like freshly baked cookies, certain neurons in your brain will fire and become activated. If you smell something unpleasant, like spoiled milk, this will cause other neurons in your brain to fire.

Deep within the folds of our brains, certain neurons are either firing or they’re not. This can be represented by a zero for not firing or a one for firing.

With the Sigmoid activation function in an artificial neural network, we have seen that the neuron can be between zero and one, and the closer to one, the more activated that neuron is while the closer to zero the less activated that neuron is.


#### ReLU activation function

Now, it’s not always the case that our activation function is going to do a transformation on an input to be between zero and one.

In fact, one of the most widely used activation functions today called ReLU doesn’t do this. ReLU, which is short for rectified linear unit, transforms the input to the maximum of either zero or the input itself.

`ReLU(x) = max(0, x)`

So if the input is less than or equal to zero, then ReLU will output zero. If the input is greater than zero, ReLU will just output the given input.

The idea here is that the more positive the neuron is, the more activated it is. Now, we’ve only talked about two activation functions here, sigmoid and ReLU, but there are other types of activation functions that do different types of transformations to their inputs.
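The ReLU formula is short enough to write out directly. The sketch below applies it to a handful of arbitrary sample values.

```python
def relu(x):
    """Return the larger of zero and the input."""
    return max(0.0, x)

values = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([relu(v) for v in values])  # [0.0, 0.0, 0.0, 0.5, 3.0]
```

Unlike sigmoid, the output has no upper limit: positive inputs pass through unchanged.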

### Why do we use activation functions?


To understand why we use activation functions, we need to first understand linear functions.

Suppose that f is a function on a set X. 
Suppose that a and b are in X. 
Suppose that x is a real number.

The function f is said to be a linear function if and only if:

`f(a+b) = f(a) + f(b)` and `f(xa) = xf(a)`

An important feature of linear functions is that the composition of two linear functions is also a linear function. This means that, even in very deep neural networks, if we only had linear transformations of our data values during a forward pass, the learned mapping in our network from input to output would also be linear.

Typically, the types of mappings that we are aiming to learn with our deep neural networks are more complex than simple linear mappings.

This is where activation functions come in. Most activation functions are non-linear, and they are chosen in this way on purpose. Having non-linear activation functions allows our neural networks to compute arbitrarily complex functions.
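The argument about composing linear functions can be checked numerically. The sketch below uses one-dimensional functions of the form `x -> weight * x` as stand-ins for linear layers. The weights 2.0 and 3.0 and the helper names are arbitrary choices for illustration.

```python
def linear(weight):
    """Return the linear function x -> weight * x."""
    def apply(x):
        return weight * x
    return apply

def relu(x):
    return max(0.0, x)

f = linear(2.0)
g = linear(3.0)

def stacked(x):
    """Two linear 'layers' applied one after the other."""
    return g(f(x))

collapsed = linear(6.0)   # 3.0 * 2.0: a single linear function does the same job
print(stacked(5.0), collapsed(5.0))   # 30.0 30.0

def nonlinear(x):
    """The same two layers with ReLU in between."""
    return g(relu(f(x)))

# Additivity, f(a + b) = f(a) + f(b), fails once ReLU is inserted.
a, b = 1.0, -1.0
print(nonlinear(a + b))               # 0.0
print(nonlinear(a) + nonlinear(b))    # 6.0
```

However many linear layers are stacked, they collapse to one linear function. The activation function in between is what prevents the collapse.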

#### Activation functions in code with Keras

Let’s take a look at how to specify an activation function in a Keras Sequential model.

There are two basic ways to achieve this. First, we’ll import our classes.

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(5, input_shape=(3,), activation='relu')
])
```

In this case, we have a Dense layer, and we specify ReLU as its activation function with `activation='relu'`.

The second way is to add the layers and activation functions to our model after the model has been instantiated like so:

```python
model = Sequential()
model.add(Dense(5, input_shape=(3,)))
model.add(Activation('relu'))
```

Remember that:

`node output = activation(weighted sum of inputs)`

For our example, this means that each output from the nodes in our Dense layer will be equal to the relu result of the weighted sums like

`node output = relu(weighted sum of inputs)`

### Training an ANN

When we train a model, we’re basically trying to solve an optimization problem. We’re trying to optimize the weights within the model. Our task is to find the weights that most accurately map our input data to the correct output class. This mapping is what the network must learn.

#### Optimization algorithm

The weights are optimized using what we call an optimization algorithm. The optimization process depends on the chosen optimization algorithm. We also use the term optimizer to refer to the chosen algorithm. The most widely known optimizer is called stochastic gradient descent, or more simply, SGD.

When we have any optimization problem, we must have an optimization objective, so now let’s consider what SGD’s objective is in optimizing the model’s weights.

The objective of SGD is to minimize some given function that we call a loss function. So, SGD updates the model's weights in such a way as to make this loss function as close to its minimum value as possible.

#### Loss function

One common loss function is mean squared error (MSE), but there are several loss functions that we could use in its place. As deep learning practitioners, it's our job to decide which loss function to use.

Alright, but what is the actual loss we’re talking about? Well, during training, we supply our model with data and the corresponding labels to that data.

For example, suppose we have a model that we want to train to classify whether images are either images of cats or images of dogs. We will supply our model with images of cats and dogs along with the labels for these images that state whether each image is of a cat or of a dog.

Suppose we give one image of a cat to our model. Once the forward pass is complete and the cat image data has flowed through the network, the model is going to provide an output at the end. This will consist of what the model thinks the image is, either a cat or a dog.

In a literal sense, the output will consist of probabilities for cat or dog. For example, it may assign a 75% probability to the image being a cat, and a 25% probability to it being a dog. In this case, the model is assigning a higher likelihood to the image being of a cat than of a dog.

- 75% chance it's a cat
- 25% chance it's a dog

If we stop and think about it for a moment, this is very similar to how humans make decisions. Everything is a prediction!

The loss is the error or difference between what the network is predicting for the image versus the true label of the image, and SGD will try to minimize this error to make our model as accurate as possible in its predictions.

After passing all of our data through our model, we’re going to continue passing the same data over and over again. This process of repeatedly sending the same data through the network is considered training. It is during this training process that the model actually learns. More about learning in the next post. So, through this iterative process carried out by SGD, the model is able to learn from the data.

### Learning in artificial neural networks - More details

In a previous post, we learned about the training process and saw that each data point used for training is passed through the network. This pass through the network from input to output is called a forward pass, and the resulting output depends on the weights at each connection inside the network.

Once all of the data points in our dataset have been passed through the network, we say that an epoch is complete.

**An epoch refers to a single pass of the entire dataset to the network during training.**

Note that many epochs occur throughout the training process as the model learns.

#### What does it mean to learn?

Well, remember, when the model is initialized, the network weights are set to arbitrary values. We have also seen that, at the end of the network, the model will provide the output for a given input.

Once the output is obtained, the loss (or the error) can be computed for that specific output by looking at what the model predicted versus the true label.

After the loss is calculated, the gradient of this loss function is computed with respect to each of the weights within the network. Note, gradient is just a word for the derivative of a function of several variables.

Continuing with this explanation, let’s focus in on only one of the weights in the model.

At this point, we’ve calculated the loss of a single output, and we calculate the gradient of that loss with respect to our single chosen weight. This calculation is done using a technique called backpropagation.

Once we have the value for the gradient of the loss function, we can use this value to update the model’s weight. The gradient tells us which direction will move the loss towards the minimum, and our task is to move in a direction that lowers the loss and steps closer to this minimum value.

We then multiply the gradient value by something called a learning rate. A learning rate is a small number usually ranging between 0.01 and 0.0001, but the actual value can vary.

**The learning rate tells us how large of a step we should take in the direction of the minimum.**

Alright, so we multiply the gradient with the learning rate, and we subtract this product from the weight, which will give us the new updated value for this weight.

`new weight = old weight - (learning rate * gradient)`
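The update rule can be tried out on a toy problem. In the sketch below, the loss function `(weight - 3)^2`, its hand-written derivative, the starting weight, and the learning rate of 0.1 are all hypothetical choices made so that the effect is visible in a few steps. A real network would obtain the gradient through backpropagation instead.

```python
def loss(weight):
    """Hypothetical loss with its minimum at weight = 3."""
    return (weight - 3.0) ** 2

def gradient(weight):
    """Derivative of the loss above with respect to the weight."""
    return 2.0 * (weight - 3.0)

learning_rate = 0.1     # larger than typical, chosen so the toy converges quickly
weight = 0.0            # arbitrary starting value

history = [loss(weight)]
for step in range(25):
    # new weight = old weight - (learning rate * gradient)
    weight = weight - learning_rate * gradient(weight)
    history.append(loss(weight))

print(f"final weight: {weight:.4f}, final loss: {history[-1]:.6f}")
```

Each update moves the weight a little closer to 3, and the recorded loss shrinks at every step.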

In this discussion, we just focused on one single weight to explain the concept, but this same process is going to happen with each of the weights in the model each time data passes through it.

The only difference is that when the gradient of the loss function is computed, the value for the gradient is going to be different for each weight because the gradient is being calculated with respect to each weight.

So now imagine all these weights being iteratively updated with each epoch. The weights are going to be incrementally getting closer and closer to their optimized values while SGD works to minimize the loss function.

This updating of the weights is essentially what we mean when we say that the model is learning. It’s learning what values to assign to each weight based on how those incremental changes are affecting the loss function. As the weights change, the network is getting smarter in terms of accurately mapping inputs to the correct output.

After each epoch, the loss should generally decrease and the accuracy should generally increase.





### Preprocessing the data to be trained using our NN

[link](https://www.youtube.com/watch?v=UkzhouEk6uY)

```python
import numpy as np
from random import randint
from sklearn.preprocessing import MinMaxScaler
```

```python
train_labels = []
train_samples = []
```

Keras expects the samples to be a NumPy array or a list of NumPy arrays, and the labels to be a NumPy array.

We will generate some numeric data and preprocess it so that Keras can understand the data and train our neural network on it.

Example data:

- An experimental drug was tested on individuals aged 13 to 100
- The trial had 2100 participants. Half were under 65 and half were 65 or older
- Around 95% of patients 65 or older experienced side effects
- Around 95% of patients under 65 experienced no side effects

We want our neural network to predict whether an individual will have side effects or not.

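```python
# Roughly 95% of participants follow the expected pattern.
for i in range(1000):
    # Younger individuals (13-64) who did not experience side effects
    random_younger = randint(13, 64)
    train_samples.append(random_younger)
    train_labels.append(0)

    # Older individuals (65-100) who did experience side effects
    random_older = randint(65, 100)
    train_samples.append(random_older)
    train_labels.append(1)

# The remaining participants (roughly 5%) are the exceptions.
for i in range(50):
    # Younger individuals who did experience side effects
    random_younger = randint(13, 64)
    train_samples.append(random_younger)
    train_labels.append(1)

    # Older individuals who did not experience side effects
    random_older = randint(65, 100)
    train_samples.append(random_older)
    train_labels.append(0)
```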

```python
len(train_samples) == len(train_labels)
```

```
True
```

```python
train_labels = np.array(train_labels)
train_samples = np.array(train_samples)
```

```python
train_samples.shape
```

```
(2100,)
```

```python
train_labels.shape
```

```
(2100,)
```

Now we have our raw data in the format Keras wants.

The neural network might not learn very well from numbers ranging from 13 to 100, so we scale our data to the range 0 to 1.

```python
train_samples.reshape(2100,1)
```

```
array([[31],
       [77],
       [49],
       ...,
       [91],
       [26],
       [82]])
```

```python
scalar = MinMaxScaler(feature_range=(0, 1))

scaled_train_samples = scalar.fit_transform(train_samples.reshape(len(train_samples), 1))
```

```python
scaled_train_samples
```

```
array([[0.20689655],
       [0.73563218],
       [0.4137931 ],
       ...,
       [0.89655172],
       [0.14942529],
       [0.79310345]])
```

Now our data is perfect for training
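For intuition about what the scaling step does, min-max scaling to the range 0 to 1 can be written by hand as `(value - min) / (max - min)`. The sketch below is a plain-Python stand-in for `MinMaxScaler`, not its actual implementation, and it assumes that the youngest and oldest ages present in the generated data are 13 and 100.

```python
def min_max_scale(values, low, high):
    """Map each value from [low, high] onto [0, 1]."""
    return [(v - low) / (high - low) for v in values]

# Assumes the youngest and oldest ages in the data are 13 and 100.
ages = [13, 31, 77, 49, 100]
scaled = min_max_scale(ages, low=13, high=100)
print([round(s, 8) for s in scaled])
```

Under that assumption, the ages 31, 77 and 49 map to roughly 0.2069, 0.7356 and 0.4138, which lines up with the first three rows of the scaled array shown above.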

#### Training in code with Keras

```python
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
```

Next, we define our model:

```python
model = Sequential([
    Dense(6, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='sigmoid')
])
```

Before we can train our model, we must compile it like so:


```python
model.compile(optimizer=Adam(lr=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```

To the compile() function, we are passing the optimizer, the loss function, and the metrics that we would like to see. Notice that the optimizer we have specified is called Adam. Adam is just a variant of SGD. Inside the Adam constructor is where we specify the learning rate, and in this case Adam(lr=.0001), we have chosen 0.0001.

Finally, we fit our model to the data. Fitting the model to the data means to train the model on the data. We do this with the following code:

```python
model.fit(x=scaled_train_samples, y=train_labels, batch_size=10, epochs=20, shuffle=True, verbose=2)
```

Expected output:

```
Epoch 1/20 0s - loss: 0.6400 - acc: 0.5576
Epoch 2/20 0s - loss: 0.6061 - acc: 0.6310
Epoch 3/20 0s - loss: 0.5748 - acc: 0.7010
Epoch 4/20 0s - loss: 0.5401 - acc: 0.7633
Epoch 5/20 0s - loss: 0.5050 - acc: 0.7990
Epoch 6/20 0s - loss: 0.4702 - acc: 0.8300
Epoch 7/20 0s - loss: 0.4366 - acc: 0.8495
Epoch 8/20 0s - loss: 0.4066 - acc: 0.8767
Epoch 9/20 0s - loss: 0.3808 - acc: 0.8814
Epoch 10/20 0s - loss: 0.3596 - acc: 0.8962
Epoch 11/20 0s - loss: 0.3420 - acc: 0.9043
Epoch 12/20 0s - loss: 0.3282 - acc: 0.9090
Epoch 13/20 0s - loss: 0.3170 - acc: 0.9129
Epoch 14/20 0s - loss: 0.3081 - acc: 0.9210
Epoch 15/20 0s - loss: 0.3014 - acc: 0.9190
Epoch 16/20 0s - loss: 0.2959 - acc: 0.9205
Epoch 17/20 0s - loss: 0.2916 - acc: 0.9238
Epoch 18/20 0s - loss: 0.2879 - acc: 0.9267
Epoch 19/20 0s - loss: 0.2848 - acc: 0.9252
Epoch 20/20 0s - loss: 0.2824 - acc: 0.9286

```

`scaled_train_samples` is a NumPy array consisting of the training samples.

`train_labels` is a NumPy array consisting of the corresponding labels for the training samples.

`batch_size=10` specifies how many training samples should be sent to the model at once.

`epochs=20` means that the complete training set (all of the samples) will be passed to the model a total of 20 times.

`shuffle=True` indicates that the data should first be shuffled before being passed to the model.

`verbose=2` indicates how much logging we will see as the model trains.

The output gives us the following values for each epoch:

- Epoch number
- Duration in seconds
- Loss
- Accuracy


What you will notice is that the loss is going down and the accuracy is going up as the epochs progress.


### Loss functions in neural networks

The loss function is what SGD is attempting to minimize by iteratively updating the weights in the network.

At the end of each epoch during the training process, the loss will be calculated using the network’s output predictions and the true labels for the respective input.

Suppose our model is classifying images of cats and dogs, and assume that the label for cat is 0 and the label for dog is 1.

- cat: 0
- dog: 1

Now suppose we pass an image of a cat to the model, and the provided output is 0.25. In this case, the difference between the model’s prediction and the true label is 0.25 - 0.00 = 0.25. This difference is also called the error.

`error = 0.25 - 0.00 = 0.25`

This process is performed for every output. For each epoch, the error is accumulated across all the individual outputs.

Let’s look at a loss function that is commonly used in practice called the mean squared error (MSE).

#### MSE

For a single sample, with MSE, we first calculate the difference (the error) between the provided output prediction and the label. We then square this error. For a single input, this is all we do.

`MSE(input) = (output - label)^2`

If we passed multiple samples to the model at once (a batch of samples), then we would take the mean of the squared errors over all of these samples.
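As a small worked example, the sketch below computes MSE for a single sample and for a batch. The output values and the helper name `mean_squared_error` are hypothetical. The labels follow the cat = 0, dog = 1 convention used earlier.

```python
def mean_squared_error(outputs, labels):
    """Average of the squared differences between outputs and labels."""
    squared_errors = [(o - y) ** 2 for o, y in zip(outputs, labels)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical batch: labels use cat = 0 and dog = 1.
outputs = [0.25, 0.75, 0.5, 1.0]
labels = [0, 1, 0, 1]

print(mean_squared_error(outputs[:1], labels[:1]))  # single sample: 0.0625
print(mean_squared_error(outputs, labels))          # batch mean: 0.09375
```

The first value is the squared error from the cat example, `(0.25 - 0)^2`. The second averages the squared errors of all four samples.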

This was just illustrating the math behind how one loss function, MSE, works. There are several different loss functions that we could work with though.

The general idea that we just showed for calculating the error of individual samples will hold true for all of the different types of loss functions. The implementation of what we actually do with each of the errors will be dependent upon the algorithm of the given loss function we’re using. For example, we averaged the squared errors to calculate MSE, but other loss functions will use other algorithms to determine the value of the loss.

If we passed our entire training set to the model at once (that is, with a batch size equal to the number of training samples), then the process we just went over for calculating the loss would occur once, at the end of each epoch during training.

If we split our training set into batches, and passed batches one at a time to our model, then the loss would be calculated on each batch. With either method, since the loss depends on the weights, we expect to see the value of the loss change each time the weights are updated. Given that the objective of SGD is to minimize the loss, we want to see our loss decrease as we run more epochs.
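To see how the batch size determines how often the loss is computed within one epoch, here is a minimal sketch of splitting a dataset into batches. The helper `make_batches` is hypothetical, and the list of integers is only a stand-in for the 2100 training samples used earlier.

```python
def make_batches(samples, batch_size):
    """Split a list of samples into consecutive batches."""
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]

samples = list(range(2100))      # stand-in for the 2100 training samples

small_batches = make_batches(samples, batch_size=10)
print(len(small_batches))        # 210 batches, so 210 loss calculations per epoch

full_batch = make_batches(samples, batch_size=len(samples))
print(len(full_batch))           # 1 batch, so one loss calculation per epoch
```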

At the time of writing, the loss functions available in Keras were as follows:

- mean_squared_error
- mean_absolute_error
- mean_absolute_percentage_error
- mean_squared_logarithmic_error
- squared_hinge
- hinge
- categorical_hinge
- logcosh
- categorical_crossentropy
- sparse_categorical_crossentropy
- binary_crossentropy
- kullback_leibler_divergence
- poisson
- cosine_proximity

