## Introduction to Deep Learning###

Artificial neural networks are modelled after the human brain. The schema below shows a single neuron in the human brain.

Dendrites convey electrical signals from neurons to the cell body; resultant electrical signals are sent along the axon to other neurons. The human brain has approximately 10<sup>11</sup> neurons. If the electrical signal between two neurons is strong enough the receiving neuron is activated.

![](images/neuron.png)  

## What is Deep Learning?

We define machine learning as the practice of using algorithms to analyze data, learn from that data and then make a determination or prediction about new data. Deep learning is a sub-field of machine learning that uses algorithms inspired by the structure and function of the brain&#39;s neural networks.

With deep learning, we're still talking about algorithms that learn from data, just like we discussed in the previous lessons on machine learning. However, now the algorithms or models that do this learning are based loosely on the structure and function of the brain's neural networks.

The neural networks that we use in deep learning aren&#39;t actual biological neural networks though. They simply share some characteristics with biological neural networks, and for this reason, we call them artificial neural networks (ANNs).

### What Does Deep Mean In Deep Learning?###

1. ANNs are built using what we call neurons.
2. Neurons in an ANN are organized into what we call layers.
3. Layers within an ANN (all but the input and output layers) are called hidden layers.
4. If an ANN has more than one hidden layer, the ANN is said to be a deep ANN.

![](images/layers.png)

### What Is An Artificial Neural Network?###

An artificial neural network is a computing system that is comprised of a collection of connected units called neurons that are organized into what we call layers.

The connected neural units form the so-called network. Each connection between neurons transmits a signal from one neuron to the other. The receiving neuron processes the signal and then signals the downstream neurons connected to it within the network. Note that neurons are also commonly referred to as nodes.

Nodes are organized into what we call layers. At the highest level, there are three types of layers in every ANN:

1. Input layer
2. Hidden layers
3. Output layer

Different layers perform different kinds of transformations on their inputs. Data flows through the network starting at the input layer and moving through the hidden layers until the output layer is reached. This is known as a forward pass through the network. Layers positioned between the input and output layers are known as hidden layers.

Let&#39;s consider the number of nodes contained in each type of layer:

1. Input layer - One node for each component of the input data.
2. Hidden layers - Arbitrarily chosen number of nodes for each hidden layer.
3. Output layer - One node for each of the possible desired outputs.

![](images/hidden.jpg)

This ANN has three layers total. The layer on the left is the input layer. The layer on the right is the output layer, and the layer in the middle is the hidden layer. Remember that each layer is comprised of neurons or nodes. Here, the nodes are depicted with the circles, so let&#39;s consider how many nodes are in each layer of this network.

1. Input layer (left): 3 nodes
2. Hidden layer (middle): 4 nodes
3. Output layer (right): 2 nodes

Since this network has two nodes in the output layer, this tells us that there are two possible outputs for every input that is passed forward (left to right) through the network. For example, cats or dogs could be the two output classes. Note that the output classes are also known as the prediction classes.
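
To make the layer sizes concrete, here is a minimal sketch that counts the connections in this 3-4-2 network. It assumes every node is connected to every node in the next layer, as in the figure; the names `layer_sizes` and `count_connections` are introduced here purely for illustration.

```python
# Layer sizes for the example network: input, hidden, output.
layer_sizes = [3, 4, 2]

def count_connections(sizes):
    """Count connections in a fully connected network with the given layer sizes."""
    # Each node in one layer connects to every node in the next layer,
    # so each pair of adjacent layers contributes n_in * n_out connections.
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))

num_connections = count_connections(layer_sizes)  # 3*4 + 4*2 connections
```

Each of these connections carries one weight, so this count is also the number of weights the network would have under that assumption.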

### Layers of a Neural Network###

In the previous paragraph, we saw how the neurons in an ANN are organized into layers. The examples we looked at showed the use of dense layers, which are also known as fully connected layers. There are, however, different types of layers. Some examples include:

1. Dense (or fully connected) layers
2. Convolutional layers
3. Pooling layers
4. Recurrent layers
5. Normalization layers

Why Have Different Types Of Layers?

Different layers perform different transformations on their inputs, and some layers are better suited for some tasks than others.

For example, a convolutional layer is usually used in models that work with image data. Recurrent layers are used in models that work with time series data. A fully connected layer, as the name suggests, fully connects each input to each output within its layer.

### Example Artificial Neural Network###

Let&#39;s consider again the following example ANN:

![](images/layers.png)

We can see that the first layer, the input layer, consists of eight nodes. Each of the eight nodes in this layer represents an individual feature from a given sample in our dataset.

This tells us that a single sample from our dataset consists of eight dimensions. When we choose a sample from our dataset and pass this sample to the model, each of the eight values contained in the sample will be provided to a corresponding node in the input layer.

We can see that each of the eight input nodes is connected to every node in the next layer.

Each connection between the first and second layers transfers the output from the previous node to the input of the receiving node (left to right). The two layers in the middle that have six nodes each are hidden layers simply because they are positioned between the input and output layers.

### Layer Weights ###

Each connection between two nodes has an associated weight, which is just a number.

![](images/weights.png)

Each weight represents the strength of the connection between the two nodes. When the network receives an input at a given node in the input layer, this input is passed to the next node via a connection, and the input will be multiplied by the weight assigned to that connection.

![](images/weightscalcul.png)

For each node in the second layer, a weighted sum is then computed with each of the incoming connections. This sum is then passed to an activation function, which performs some type of transformation on the given sum. For example, an activation function may transform the sum to be a number between zero and one. The actual transformation will vary depending on which activation function is used. More on activation functions later.

node output = activation(weighted sum of inputs)
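
The formula above can be sketched for a single node as follows. The input values and weights are hypothetical, and the activation used here is one that squashes the sum into the range 0 to 1, which is only one possible choice.

```python
import math

def sigmoid(z):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, activation=sigmoid):
    """Compute activation(weighted sum of inputs) for one node."""
    # Multiply each input by the weight on its connection, then add them up.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical values arriving from three nodes in the previous layer,
# and hypothetical weights on the three incoming connections.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]

# The weighted sum is 0.5*0.4 + (-1.0)*0.3 + 2.0*0.1 = 0.1,
# which the activation then maps to a value slightly above 0.5.
out = node_output(inputs, weights)
```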

### Forward Pass Through A Neural Network###

Once we obtain the output for a given node, the obtained output is the value that is passed as input to the nodes in the next layer.

This process continues until the output layer is reached. The number of nodes in the output layer depends on the number of possible output or prediction classes we have. In our example, we have two possible prediction classes.

Suppose our model was tasked with classifying two types of animals. Each node in the output layer would represent one of two possibilities. For example, we could have cat and dog. The categories or classes depend on how many classes are in our dataset.

For a given sample from the dataset, the entire process from input layer to output layer is called a forward pass through the network.

![](images/forwardpass.png)
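
As a rough illustration, a forward pass through a small network with 3 input nodes, 4 hidden nodes, and 2 output nodes might look like the sketch below. All weight values are made up, the helper names `dense_layer` and `forward_pass` are assumptions of this example, and the same squashing activation is used in every layer for simplicity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights):
    """Compute the outputs of one fully connected layer.

    weights[j][i] is the weight on the connection from input i to node j.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward_pass(sample, layers):
    """Feed a sample through each layer in turn and return the final outputs."""
    activations = sample
    for weights in layers:
        # The outputs of one layer become the inputs of the next layer.
        activations = dense_layer(activations, weights)
    return activations

# Hypothetical weights: 4 hidden nodes with 3 inputs each,
# then 2 output nodes with 4 inputs each.
hidden_weights = [
    [0.2, -0.1, 0.4],
    [0.5, 0.3, -0.2],
    [-0.3, 0.8, 0.1],
    [0.7, -0.6, 0.2],
]
output_weights = [
    [0.1, -0.4, 0.6, 0.3],
    [-0.5, 0.2, 0.1, 0.7],
]

sample = [1.0, 0.5, -1.5]  # one hypothetical sample with three features
prediction = forward_pass(sample, [hidden_weights, output_weights])
```

Here `prediction` holds two numbers, one per output node, for example one for cat and one for dog.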

### Finding The Optimal Weights###

As the model learns, the weights at all connections are updated and optimized so that the input data point maps to the correct output prediction class.

### Activation Functions In A Neural Network###

In an artificial neural network, an activation function is a function that maps a node&#39;s inputs to its corresponding output.

For each node in the layer, the weighted sum of its incoming connections is passed to an activation function.

node output = activation(weighted sum of inputs)

The activation function does some type of operation to transform the sum to a number that is often between some lower limit and some upper limit. This transformation is often a non-linear transformation.

![](images/activationfunctions.gif)

### What Do Activation Functions Do?###

An example activation function: Sigmoid

Sigmoid takes in an input and does the following:

- For very negative inputs, sigmoid will transform the input to a number very close to 0.
- For very positive inputs, sigmoid will transform the input into a number very close to 1.
- For inputs relatively close to 0, sigmoid will transform the input into some number between 0 and 1.

Mathematically, we write

![](images/sigmoid.png)

So, for sigmoid, 0 is the lower limit, and 1 is the upper limit.
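
The three cases above can be checked with a short sketch. It uses the standard sigmoid definition, 1 / (1 + e^(-x)), and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Standard sigmoid: 1 / (1 + e^(-x))."""
    # Note: for extremely negative x, math.exp(-x) can overflow.
    # That is acceptable for this sketch, which only uses small inputs.
    return 1.0 / (1.0 + math.exp(-x))

very_negative = sigmoid(-10.0)  # very close to 0
near_zero = sigmoid(0.0)        # exactly halfway between the two limits
very_positive = sigmoid(10.0)   # very close to 1
```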

An activation function is biologically inspired by activity in our brains where different neurons fire (or are activated) by different stimuli.

For example, if you smell something pleasant, like freshly baked cookies, certain neurons in your brain will fire and become activated. If you smell something unpleasant, like spoiled milk, this will cause other neurons in your brain to fire.

Deep within the folds of our brains, certain neurons are either firing or they&#39;re not.

### Relu Activation Function###

Now, it&#39;s not always the case that our activation function is going to do a transformation on an input to be between 0 and 1.

In fact, one of the most widely used activation functions today, called ReLU, doesn't do this. ReLU, which is short for rectified linear unit, transforms the input to the maximum of either 0 or the input itself.

_relu(x) = max(0,x)_

![](images/relu.png)

So if the input is less than or equal to 0, then relu will output 0. If the input is greater than 0, relu will then just output the given input.
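
In code, this rule is a one-liner. The sketch below is a direct translation of the formula, with a few arbitrary inputs.

```python
def relu(x):
    """Rectified linear unit: the maximum of 0 and the input."""
    return max(0.0, x)

negative_case = relu(-3.0)  # negative inputs are clipped to 0
zero_case = relu(0.0)       # an input of 0 stays 0
positive_case = relu(2.5)   # positive inputs pass through unchanged
```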

The idea here is, the more positive the neuron is, the more activated it is. Now, we&#39;ve only talked about two activation functions here, sigmoid and relu, but there are other types of activation functions that do different types of transformations to their inputs.

Why do we use activation functions?

Most activation functions are non-linear, and they are chosen in this way on purpose. Having non-linear activation functions allows our neural networks to compute arbitrarily complex functions.

![](images/activation.png)
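
One way to see why non-linearity matters: without an activation function, stacking layers gains nothing, because two weighted-sum layers collapse into a single one. The sketch below checks this with small hypothetical integer weights; `matvec` and `matmul` are helper names introduced only for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns_of_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]

# Hypothetical weights: layer 1 maps 2 inputs to 3 nodes,
# layer 2 maps those 3 nodes to 2 outputs. No activation function is applied.
w1 = [[1, 2], [0, -1], [3, 1]]
w2 = [[2, 0, 1], [-1, 1, 0]]
x = [4, -2]

two_layers = matvec(w2, matvec(w1, x))  # pass x through both layers
one_layer = matvec(matmul(w2, w1), x)   # a single layer with combined weights
```

Both computations give the same result, so the two-layer network without activations is no more expressive than a single layer. Inserting a non-linear activation between the layers is what breaks this equivalence.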

### Training an Artificial Neural Network###

### What Is Training?###

When we train a model, we&#39;re basically trying to solve an optimization problem. We&#39;re trying to optimize the weights within the model. Our task is to find the weights that most accurately map our input data to the correct output class. This mapping is what the network must learn.

At the beginning, each connection between nodes has an arbitrary weight assigned to it. During training, these weights are iteratively updated and moved towards their optimal values.

### Optimization Algorithm###

The weights are optimized using what we call an optimization algorithm. The optimization process depends on the chosen optimization algorithm. We also use the term optimizer to refer to the chosen algorithm. The most widely known optimizer is called stochastic gradient descent, or more simply, SGD.

When we have any optimization problem, we must have an optimization objective.

The objective of SGD is to minimize some given function that we call a loss function. So, SGD updates the model&#39;s weights in such a way as to make this loss function as close to its minimum value as possible.

### Loss Function###

One common loss function is mean squared error (MSE), but there are several loss functions that we could use in its place.

During training, we supply our model with data and the corresponding labels to that data.

For example, suppose we have a model that we want to train to classify whether images are either images of cats or images of dogs. We will supply our model with images of cats and dogs along with the labels for these images that state whether each image is of a cat or of a dog. Assume that the label for cat is 0 and the label for dog is 1.

cat: 0

dog: 1

Suppose we give one image of a cat to our model. Once the forward pass is complete and the cat image data has flowed through the network, the model is going to provide an output at the end. This will consist of what the model thinks the image is, either a cat or a dog.

Now suppose the provided output is 0.25. In this case, the difference between the model&#39;s prediction and the true label is 0.25 - 0.00 = 0.25. This difference is also called the error.

error = 0.25 - 0.00 = 0.25

This process is performed for every output. For each epoch, the error is accumulated across all the individual outputs.

The loss is the error or difference between what the network is predicting for the image versus the true label of the image, and SGD will try to minimize this error to make our model as accurate as possible in its predictions.
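
As a small worked example, the sketch below computes the error for the single cat image above and then the mean squared error over a handful of outputs. The extra predictions and labels are hypothetical.

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and true labels."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)

# Error for the single cat image: prediction 0.25, true label 0 (cat).
error = 0.25 - 0.0

# Hypothetical model outputs for four images, with labels cat=0 and dog=1.
predictions = [0.25, 0.9, 0.6, 0.1]
labels = [0, 1, 1, 0]
loss = mean_squared_error(predictions, labels)
```

Squaring the errors keeps positive and negative differences from cancelling each other out when they are accumulated.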

If we passed our entire training set to the model at once (that is, with the batch size equal to the number of samples in the training set), then the process we just went over for calculating the loss would occur at the end of each epoch during training.

If we split our training set into batches, and passed batches one at a time to our model, then the loss would be calculated on each batch.
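
A minimal sketch of batching, assuming the training set is a plain Python list of samples. The helper name `make_batches` and the sizes used are made up for illustration.

```python
def make_batches(samples, batch_size):
    """Split a list of samples into consecutive batches of at most batch_size."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

training_set = list(range(10))  # ten hypothetical samples

# Passing the whole training set at once: a single batch per epoch.
full_batch = make_batches(training_set, batch_size=len(training_set))

# Splitting into smaller batches: the loss is calculated once per batch.
# The last batch is smaller when the sizes do not divide evenly.
mini_batches = make_batches(training_set, batch_size=4)
```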

After passing all of our data through our model, we're going to continue passing the same data over and over again. This process of repeatedly sending the same data through the network is considered training, and it is during this process that the model actually learns. So, as SGD iteratively repeats this process, the model is able to learn from the data.

### How A Neural Network Learns Explained###

In the previous paragraph we saw that each data point used for training is passed through the network. This pass through the network from input to output is called a forward pass, and the resulting output depends on the weights at each connection inside the network.

Once all of the data points in our dataset have been passed through the network, we say that an epoch is complete.

An epoch refers to a single pass of the entire dataset to the network during training.

Note that many epochs occur throughout the training process as the model learns.

### What Does It Mean To Learn?###

When the model is initialized, the network weights are set to arbitrary values. We have also seen that, at the end of the network, the model will provide the output for a given input.

Once the output is obtained, the loss (or the error) can be computed for that specific output by looking at what the model predicted versus the true label. The loss computation depends on the chosen loss function.

### Gradient Of The Loss Function###

After the loss is calculated, the gradient of this loss function is computed with respect to each of the weights within the network. Note, gradient is just a word for the derivative of a function of several variables.

At this point, we&#39;ve calculated the loss of a single output, and we calculate the gradient of that loss with respect to a single chosen weight. This calculation is done using a technique called backpropagation.

Once we have the value for the gradient of the loss function, we can use this value to update the model's weight. The gradient points in the direction in which the loss increases fastest, so our task is to move in the opposite direction, which lowers the loss and steps closer to its minimum value.
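
To illustrate what a gradient with respect to one weight looks like, here is a sketch for a deliberately tiny model with a single weight, where the prediction is `w * x` and the loss is the squared error. This toy model and its values are assumptions of the example; in a real network, backpropagation computes the same kind of quantity for every weight.

```python
def loss(w, x, y):
    """Squared error of the toy model prediction = w * x against label y."""
    return (w * x - y) ** 2

def gradient(w, x, y):
    """Derivative of the loss with respect to w: 2 * x * (w * x - y)."""
    return 2 * x * (w * x - y)

def numerical_gradient(w, x, y, eps=1e-6):
    """Approximate the same derivative by nudging w slightly in both directions."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

w, x, y = 0.5, 2.0, 3.0   # hypothetical weight, input and label
g = gradient(w, x, y)     # 2 * 2.0 * (1.0 - 3.0) = -8.0
g_check = numerical_gradient(w, x, y)
```

The gradient here is negative, which means that increasing `w` decreases the loss.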

### Learning Rate###

We then multiply the gradient value by something called a learning rate. A learning rate is a small number usually ranging between 0.01 and 0.0001, but the actual value can vary.

The learning rate tells us how large of a step we should take in the direction of the minimum.

### Updating The Weights###

So we multiply the gradient with the learning rate, and we subtract this product from the weight, which will give us the new updated value for this weight.

new weight = old weight - (learning rate \* gradient)
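
The update rule can be written as a small function. The weight, gradient and learning rate below are hypothetical values chosen only to show the arithmetic.

```python
def update_weight(old_weight, gradient, learning_rate):
    """Apply one update: new weight = old weight - (learning rate * gradient)."""
    return old_weight - learning_rate * gradient

# With a negative gradient, subtracting the product increases the weight:
# 0.5 - 0.01 * (-8.0), which is roughly 0.58.
new_weight = update_weight(old_weight=0.5, gradient=-8.0, learning_rate=0.01)
```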

This same process is going to happen with each of the weights in the model each time data passes through it.

The only difference is that when the gradient of the loss function is computed, the value for the gradient is going to be different for each weight because the gradient is being calculated with respect to each weight.

So now imagine all these weights being iteratively updated with each epoch. The weights are going to be incrementally getting closer and closer to their optimized values while SGD works to minimize the loss function.
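
Putting the pieces together, the following sketch trains a toy one-weight model (prediction = weight * x) over several epochs. It is not a full neural network: the data, learning rate and epoch count are made up, and the weight is updated after every single sample, which is one common way to run SGD.

```python
def train(samples, weight, learning_rate, epochs):
    """Train the toy model prediction = weight * x with per-sample updates.

    samples is a list of (x, y) pairs. Returns the final weight and the
    average squared error recorded for each epoch.
    """
    loss_history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in samples:
            error = weight * x - y            # forward pass and error
            epoch_loss += error ** 2          # accumulate the squared error
            gradient = 2 * x * error          # gradient of the squared error
            weight = weight - learning_rate * gradient  # update step
        loss_history.append(epoch_loss / len(samples))
    return weight, loss_history

# Hypothetical data generated by y = 2 * x, so the ideal weight is 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
final_weight, losses = train(samples, weight=0.0, learning_rate=0.05, epochs=20)
```

After training, `final_weight` ends up very close to 2, and the recorded loss in the last epoch is far smaller than in the first.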

### The Model Is Learning###

This updating of the weights is essentially what we mean when we say that the model is learning. It&#39;s learning what values to assign to each weight based on how those incremental changes are affecting the loss function. As the weights change, the network is getting smarter in terms of accurately mapping inputs to the correct output.

![](images/learning.jpg)

### Train, Validation and Test set###

To train and test our model, we should break our data down into three distinct datasets. These datasets will consist of the following:

- Training set
- Validation set
- Test set

### Training Set###

The training set is what it sounds like. It&#39;s the set of data used to train the model. During each epoch, our model will be trained over and over again on this same data in our training set, and it will continue to learn about the features of this data.

The hope with this is that later we can deploy our model and have it accurately predict on new data that it&#39;s never seen before. It will be making these predictions based on what it&#39;s learned about the training data.

### Validation Set###

The validation set is a set of data, separate from the training set, that is used to validate our model during training. This validation process gives information that may assist us with adjusting our hyperparameters.

Recall how we just mentioned that with each epoch during training, the model will be trained on the data in the training set. Well, it will also simultaneously be validated on the data in the validation set.

We know from the earlier sections on training that, during the training process, the model will be classifying the output for each input in the training set. After this classification occurs, the loss will then be calculated, and the weights in the model will be adjusted. Then, during the next epoch, it will classify the same input again.

Now, also during training, the model will be classifying each input from the validation set as well. It will be doing this classification based only on what it&#39;s learned about the data it&#39;s being trained on in the training set. The weights will not be updated in the model based on the loss calculated from our validation data.

Remember, the data in the validation set is separate from the data in the training set. So when the model is validating on this data, this data does not consist of samples that the model already is familiar with from training.

One of the major reasons we need a validation set is to ensure that our model is not overfitting to the data in the training set. Overfitting means that our model becomes really good at classifying data in the training set, but is unable to generalize and make accurate classifications on data that it wasn't trained on.

During training, if we&#39;re also validating the model on the validation set and see that the results it&#39;s giving for the validation data are just as good as the results it&#39;s giving for the training data, then we can be more confident that our model is not overfitting.

The validation set allows us to see how well the model is generalizing during training.

On the other hand, if the results on the training data are really good, but the results on the validation data are lagging behind, then our model is overfitting.

### Test Set###

The test set is a set of data that is used to test the model after the model has already been trained. The test set is separate from both the training set and validation set.

After our model has been trained and validated using our training and validation sets, we will then use our model to predict the output of the unlabeled data in the test set.

One major difference between the test set and the two other sets is that the model is not given labels for the test set: it predicts on the test data as though the data were unlabeled. The training set and validation set have to be labeled so that we can see the metrics given during training, like the loss and the accuracy from each epoch.

When the model is predicting on unlabeled data in our test set, this would be the same type of process that would be used if we were to deploy our model out into the field.

The test set provides a final check that the model is generalizing well before deploying the model to production.

For example, if we're using a model to classify data without knowing what the labels of the data are beforehand, or without it ever having been shown the exact data it's going to be classifying, then of course we wouldn't be giving our model labeled data to do this.

The entire goal of having a model be able to classify is to do it without knowing what the data is beforehand.

In Summary

| Dataset | Updates Weights | Description |
| --- | --- | --- |
| Training set | Yes | Used to train the model. The goal of training is to fit the model to the training set while still generalizing to unseen data. |
| Validation set | No | Used during training to check how well the model is generalizing. |
| Test set | No | Used to test the model&#39;s final ability to generalize before deploying to production. |

The main reason for having three separate datasets is to ensure that the model is able to generalize by predicting accurately on unseen data. When the model is failing to generalize, we are usually in a situation of overfitting or underfitting.
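
One possible way to produce the three sets is to shuffle the data and slice it, as sketched below. The 70/15/15 proportions, the fixed seed and the function name `split_dataset` are assumptions of this example, not a recommendation from the text above.

```python
import random

def split_dataset(samples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    shuffled = list(samples)
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return train_set, validation_set, test_set

# One hundred hypothetical samples, identified here by their index.
train_set, validation_set, test_set = split_dataset(range(100))
```

Because the three slices never overlap, no sample used for validation or testing is also seen during training.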

### Overfitting###

Overfitting occurs when our model becomes really good at being able to classify or predict on data that was included in the training set, but is not as good at classifying data that it wasn&#39;t trained on. So essentially, the model has overfit the data in the training set.

We can tell if the model is overfitting based on the metrics that are given for our training data and validation data during training. We previously saw that when we specify a validation set during training, we get metrics for the validation accuracy and loss, as well as the training accuracy and loss.

If the validation metrics are considerably worse than the training metrics, then that is an indication that our model is overfitting.
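
A rough way to express this check in code is to compare the two accuracies against a tolerance. The function name `looks_overfit` and the `max_gap` threshold of 0.05 are arbitrary choices for illustration; what counts as "considerably worse" depends on the problem.

```python
def looks_overfit(train_accuracy, validation_accuracy, max_gap=0.05):
    """Flag a model whose validation accuracy trails training accuracy by more than max_gap."""
    return (train_accuracy - validation_accuracy) > max_gap

# Hypothetical metrics from two training runs.
run_a = looks_overfit(train_accuracy=0.99, validation_accuracy=0.75)  # large gap
run_b = looks_overfit(train_accuracy=0.90, validation_accuracy=0.89)  # small gap
```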

We can also get an idea that our model is overfitting if during training, the model&#39;s metrics were good, but when we use the model to predict on test data, it doesn&#39;t accurately classify the data in the test set.

The concept of overfitting boils down to the fact that the model is unable to generalize well. It has learned the features of the training set extremely well, but if we give the model any data that slightly deviates from the exact data used during training, it&#39;s unable to generalize and accurately predict the output.

### Reducing Overfitting###

Overfitting is an incredibly common issue. How can we reduce it? Let&#39;s look at some techniques.

1. Adding More Data To The Training Set 

    The easiest thing we can do, as long as we have access to it, is to add more data. The more data we can train our model on, the more it will be able to learn from the training set. Also, with more data, we&#39;re hoping to be adding more diversity to the training set as well.

    For example, if we train a model to classify whether an image is an image of a dog or cat, and the model has only seen images of larger dogs, like Labs, Golden Retrievers, and Boxers, then in practice if it sees a Pomeranian, it may not do so well at recognizing that a Pomeranian is a dog.

    If we add more data to this model to encompass more breeds, then our training data will become more diverse, and the model will be less likely to overfit.

1. Data Augmentation

    Another technique we can deploy to reduce overfitting is to use data augmentation. This is the process of creating additional augmented data by reasonably modifying the data in our training set. For image data, for example, we can do these modifications by:

    - Cropping
    - Rotating
    - Flipping
    - Zooming

    Data augmentation allows us to add more data to our training set that is similar to the data that we already have, but is reasonably modified to some degree so that it's not exactly the same.

    For example, if most of our dog images were dogs facing to the left, then it would be a reasonable modification to add augmented flipped images so that our training set would also have dogs that faced to the right.

1. Reduce The Complexity Of The Model

    Something else we can do to reduce overfitting is to reduce the complexity of our model. We could reduce complexity by making simple changes, like removing some layers from the model, or reducing the number of neurons in the layers. This may help our model generalize better to data it hasn&#39;t seen before.

1. Dropout

    The last tip is to use something called dropout. The general idea behind dropout is that, if you add it to a model, it will randomly ignore some subset of nodes in a given layer during training, i.e., it drops out the nodes from the layer. Hence, the name dropout. This will prevent these dropped out nodes from participating in producing a prediction on the data.

    ![](images/dropout.png)
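
Two of the techniques above are simple enough to sketch directly. The example below assumes an image is stored as a list of pixel rows and a layer's outputs as a list of numbers; the function names are made up here. Note that this dropout sketch only zeroes out nodes; common library implementations also rescale the remaining values, which is omitted for brevity.

```python
import random

def flip_horizontal(image):
    """Data augmentation: mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def dropout(activations, drop_probability, rng):
    """Randomly zero each node's output with the given probability (training only)."""
    return [0.0 if rng.random() < drop_probability else a for a in activations]

# A tiny hypothetical 2x3 "image" and its flipped copy.
image = [[1, 2, 3],
         [4, 5, 6]]
flipped = flip_horizontal(image)

# Hypothetical outputs of a four-node layer, with roughly half the nodes dropped.
rng = random.Random(0)  # seeded so the example is reproducible
layer_outputs = [0.8, 0.1, 0.5, 0.9]
dropped = dropout(layer_outputs, drop_probability=0.5, rng=rng)
```

The flipped image can be added to the training set alongside the original, while dropout is applied anew on every training pass so that a different subset of nodes is ignored each time.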

_Source: [https://deeplizard.com/learn/video/gZmobeGL0Yg](https://deeplizard.com/learn/video/gZmobeGL0Yg)_