## Deep Learning Fundamentals

[Playlist link](https://www.youtube.com/watch?v=OT1jslLoCyA&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=2)

### What is Deep Learning

Deep learning is a sub-field of machine learning that uses algorithms inspired by the structure and function of the brain's neural networks.

With deep learning, we're still talking about algorithms that learn from data just like we discussed in the last post on machine learning. However, now the algorithms or models that do this learning are based loosely on the structure and function of the brain's neural networks.

### Artificial Neural Networks

An artificial neural network is a computing system that is comprised of a collection of connected units called neurons that are organized into what we call layers.

The connected neural units form the so-called network. Each connection between neurons transmits a signal from one neuron to the other. The receiving neuron processes the signal and signals to downstream neurons connected to it within the network. Note that neurons are also commonly referred to as nodes.




The neural networks that we use in deep learning aren't actual biological neural networks though. They simply share some characteristics with biological neural networks and for this reason, we call them artificial neural networks (ANNs).


![](http://deeplizard.com/images/neural%20network%203%20layers.png)


### ANN - Architecture

Nodes are organized into what we call layers. At the highest level, there are three types of layers in every ANN:

- Input layer
- Hidden layers
- Output layer

Different layers perform different kinds of transformations on their inputs. Data flows through the network starting at the input layer and moving through the hidden layers until the output layer is reached. This is known as a forward pass through the network. Layers positioned between the input and output layers are known as hidden layers.


Let’s consider the number of nodes contained in each type of layer:

- Input layer - One node for each component of the input data.
- Hidden layers - Arbitrarily chosen number of nodes for each hidden layer.
- Output layer - One node for each of the possible desired outputs.

![](http://deeplizard.com/images/neural%20network%202%203%202.png)

This ANN has three layers total. The layer on the left is the input layer. The layer on the right is the output layer, and the layer in the middle is the hidden layer. Remember that each layer is comprised of neurons or nodes. Here, the nodes are depicted with the circles, so let’s consider how many nodes are in each layer of this network.

Number of nodes in each layer:

- Input layer (left): 2 nodes
- Hidden layer (middle): 3 nodes
- Output layer (right): 2 nodes


Since this network has two nodes in the input layer, this tells us that each input to this network must have two dimensions, for example, height and weight.

Since this network has two nodes in the output layer, this tells us that there are two possible outputs for every input that is passed forward (left to right) through the network. For example, overweight or underweight could be the two output classes. Note that the output classes are also known as the prediction classes.



### Keras Sequential Model

In Keras, we can build what is called a sequential model. **Keras defines a sequential model as a linear stack of layers. This is what we might expect, as we have just learned that neurons are organized into layers.**

This sequential model is Keras’ implementation of an artificial neural network. Let’s see now how a very simple sequential model is built using Keras.



```python
from keras.models import Sequential
from keras.layers import Dense, Activation
```



`model` will be an instance of a `Sequential` object.

`Dense` is an object that represents a layer.

`Dense` is just one type of layer; there are many different types of layers.

Looking at the arrows in our image (in the above section) coming from the hidden layer to the output layer, we can see that each node in the hidden layer is connected to all nodes in the output layer. This is how we know that the **output layer** in the image is a dense layer. This same logic applies to the hidden layer.



`Dense` is the most basic type of layer. It connects each input to each output within the layer.

The first parameter is the number of neurons (nodes) in the layer.

The input shape parameter `input_shape=(2,)` tells us how many neurons our input layer has, so in our case, we have two.

`activation`: an activation function is a non-linear function that typically follows a dense layer.


```python
layers = [
    Dense(3, input_shape=(2,), activation='relu'),
    Dense(2, activation='softmax')
]

model = Sequential(layers)
```


### Layers in a NN

A few examples of layers in a NN are:

- Dense (or fully connected) layers
- Convolutional layers
- Pooling layers
- Recurrent layers
- Normalization layers

Different layers perform different transformations on their inputs, and some layers are better suited for some tasks than others. For example, a convolutional layer is usually used in models that work with image data. Recurrent layers are used in models that work with time series data, and fully connected layers, as the name suggests, fully connect each input to each output within the layer.

Let’s consider the following example ANN:

![](http://deeplizard.com/images/deep%20neural%20network%20with%204%20layers.png)

We can see that the first layer, the input layer, consists of eight nodes. Each of the eight nodes in this layer represents an individual feature from a given sample in our dataset.

This tells us that a single sample from our dataset consists of eight dimensions. When we choose a sample from our dataset and pass this sample to the model, each of the eight values contained in the sample will be provided to a corresponding node in the input layer.

We can see that each of the eight input nodes is connected to every node in the next layer.

Each connection between the first and second layers transfers the output from the previous node to the input of the receiving node (left to right). The two layers in the middle that have six nodes each are hidden layers simply because they are positioned between the input and output layers.

#### Layer weights

Each connection between two nodes has an associated weight, which is just a number.

Each weight represents the strength of the connection between the two nodes. When the network receives an input at a given node in the input layer, this input is passed to the next node via a connection, and the input will be multiplied by the weight assigned to that connection.

For each node in the second layer, a weighted sum is then computed with each of the incoming connections. This sum is then passed to an activation function, which performs some type of transformation on the given sum. For example, an activation function may transform the sum to be a number between zero and one. The actual transformation will vary depending on which activation function is used.

`node output = activation(weighted sum of inputs)`
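As a minimal sketch of the formula above, the following computes the output of a single node by hand. The input values, the weights, and the choice of sigmoid as the activation are made up for illustration; they are not taken from the network in the image.

```python
import math

# Hypothetical values: three incoming connections to a single node.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]

# Weighted sum: each input multiplied by the weight on its connection.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# Example activation (sigmoid) that maps the sum to a number between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

node_output = sigmoid(weighted_sum)
print(weighted_sum, node_output)
```

Swapping `sigmoid` for a different function changes the transformation but not the overall pattern of "weighted sum first, activation second".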

#### Forward pass through a neural network


Once we obtain the output for a given node, the obtained output is the value that is passed as input to the nodes in the next layer.

This process continues until the output layer is reached. The number of nodes in the output layer depends on the number of possible output or prediction classes we have. In our example, we have four possible prediction classes.

Suppose our model was tasked with classifying four types of animals. Each node in the output layer would represent one of four possibilities. For example, we could have cat, dog, llama or lizard. The categories or classes depend on how many classes are in our dataset.

For a given sample from the dataset, the entire process from input layer to output layer is called a forward pass through the network.
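To make the forward pass concrete, here is a rough NumPy sketch for a network with the same layer sizes as the example (8, 6, 6, 4). The weights are random placeholders rather than learned values, biases are left out, and the ReLU and softmax activations are assumptions chosen to mirror the Keras code used elsewhere in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Layer sizes from the example network: 8 inputs, two hidden layers of 6, 4 outputs.
sizes = [8, 6, 6, 4]
# Random placeholder weights (a trained model would have learned values).
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

sample = rng.random(8)              # one sample with eight features
h1 = relu(sample @ weights[0])      # output of the first hidden layer
h2 = relu(h1 @ weights[1])          # output of the second hidden layer
output = softmax(h2 @ weights[2])   # one value per prediction class

print(output.shape, output.sum())
```

Each layer's output becomes the next layer's input, which is exactly the left-to-right flow described above.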

#### Finding the optimal weights

As the model learns, the weights at all connections are updated and optimized so that the input data point maps to the correct output prediction class.



### Defining the neural network in code with Keras

In our previous discussion, we saw how to use Keras to build a sequential model. Now, let’s do this for our example network.

We'll start out by defining an array (a Python list) of `Dense` objects, our layers. This array will then be passed to the constructor of the sequential model.

Remember our network looks like this:

![](http://deeplizard.com/images/deep%20neural%20network%20with%204%20layers.png)

Given this, we have

```python
layers = [
    # first hidden layer: needs to have input shape specified
    Dense(6, input_shape=(8,), activation='relu'),
    Dense(6, activation='relu'),
    Dense(4, activation='softmax')
]
model = Sequential(layers)
```

Notice how the first Dense object specified in the array is not the input layer. The first Dense object is the first hidden layer. The input layer is specified as a parameter to the first Dense object’s constructor.

Our input shape is eight. This is why our input shape is specified as input_shape=(8,). Our first hidden layer has six nodes as does our second hidden layer, and our output layer has four nodes.
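As a quick sanity check on this architecture, the number of connection weights can be counted with plain arithmetic. The sketch below assumes the layer sizes 8, 6, 6, 4 from the image. The bias count is an addition not discussed above: Keras `Dense` layers include one bias per node by default, so the total should correspond to what `model.summary()` reports.

```python
# Layer sizes of the example network: input, two hidden layers, output.
layer_sizes = [8, 6, 6, 4]

# Each pair of adjacent dense layers has (nodes in) * (nodes out) connections.
weights_per_layer = [n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Keras Dense layers also add one bias per node by default (use_bias=True).
biases_per_layer = layer_sizes[1:]

total_weights = sum(weights_per_layer)
total_params = total_weights + sum(biases_per_layer)
print(weights_per_layer, total_weights, total_params)
# → [48, 36, 24] 108 124
```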


### Activation Functions

In an artificial neural network, an activation function is a function that maps a node's inputs to its corresponding output.

`node output = activation(weighted sum of inputs)`

The activation function performs some type of operation to transform the sum into a number that often lies between some lower limit and some upper limit. This transformation is often non-linear.


#### Sigmoid activation function

Sigmoid takes in an input and does the following:

- For most negative inputs, sigmoid will transform the input to a number close to zero.
- For most positive inputs, sigmoid will transform the input into a number close to one.
- For inputs close to zero, sigmoid will transform the input into a number near 0.5, well inside the range between zero and one.

![](http://deeplizard.com/images/sigmoid%20function%20graph%20curve.svg)

So, for sigmoid, zero is the lower limit, and one is the upper limit.
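The behaviour described above can be checked with a few lines of Python. This sketch uses the standard logistic formula `1 / (1 + e^(-x))`, which is not written out in these notes; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    # Standard logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

# Strongly negative inputs land near 0, strongly positive inputs near 1.
for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 4))
```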

Alright, we now understand mathematically what one of these activation functions does, but what’s the intuition?

#### Activation function intuition

Well, an activation function is biologically inspired by activity in our brains where different neurons fire (or are activated) by different stimuli.

For example, if you smell something pleasant, like freshly baked cookies, certain neurons in your brain will fire and become activated. If you smell something unpleasant, like spoiled milk, this will cause other neurons in your brain to fire.

Deep within the folds of our brains, certain neurons are either firing or they’re not. This can be represented by a zero for not firing or a one for firing.

With the sigmoid activation function in an artificial neural network, we have seen that a neuron's output can be anywhere between zero and one: the closer to one, the more activated that neuron is, and the closer to zero, the less activated it is.


#### ReLU activation function

Now, it’s not always the case that our activation function is going to do a transformation on an input to be between zero and one.

In fact, one of the most widely used activation functions today called ReLU doesn’t do this. ReLU, which is short for rectified linear unit, transforms the input to the maximum of either zero or the input itself.

`ReLU(x) = max(0, x)`

So if the input is less than or equal to zero, then ReLU will output zero. If the input is greater than zero, ReLU will then just output the given input.
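A minimal sketch of this rule in plain Python, with arbitrary sample inputs:

```python
def relu(x):
    # max(0, x): negative inputs become zero, positive inputs pass through unchanged.
    return max(0.0, x)

print([relu(x) for x in (-3.0, -0.5, 0.0, 0.5, 3.0)])
# → [0.0, 0.0, 0.0, 0.5, 3.0]
```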

The idea here is that the more positive the neuron's input is, the more activated it is. Now, we've only talked about two activation functions here, sigmoid and ReLU, but there are other types of activation functions that do different types of transformations to their inputs.

### Why do we use activation functions?


To understand why we use activation functions, we need to first understand linear functions.

Suppose that f is a function on a set X. 
Suppose that a and b are in X. 
Suppose that x is a real number.

The function f is said to be a linear function if and only if:

`f(a+b) = f(a) + f(b)` and `f(xa) = xf(a)`

An important feature of linear functions is that the composition of two linear functions is also a linear function. This means that, even in very deep neural networks, if we only had linear transformations of our data values during a forward pass, the learned mapping in our network from input to output would also be linear.

Typically, the types of mappings that we are aiming to learn with our deep neural networks are more complex than simple linear mappings.

This is where activation functions come in. Most activation functions are non-linear, and they are chosen in this way on purpose. Having non-linear activation functions allows our neural networks to compute arbitrarily complex functions.
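The point about composition can be illustrated numerically. In this sketch, two layers without activation functions are represented by two randomly generated weight matrices (hypothetical values, biases omitted), and their combined effect is compared with a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical layers with no activation function (purely linear).
W1 = rng.normal(size=(3, 4))  # maps 3 inputs to 4 hidden values
W2 = rng.normal(size=(4, 2))  # maps 4 hidden values to 2 outputs
x = rng.normal(size=3)

two_layers = (x @ W1) @ W2    # pass through both layers in turn
one_layer = x @ (W1 @ W2)     # a single equivalent linear layer

# With ReLU between the layers, the single-matrix shortcut no longer holds in general.
with_relu = np.maximum(0.0, x @ W1) @ W2

print(np.allclose(two_layers, one_layer))
```

However many linear layers are stacked, they can be collapsed into one matrix in the same way, which is why depth alone adds nothing without a non-linearity between layers.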

#### Activation functions in code with Keras

Let’s take a look at how to specify an activation function in a Keras Sequential model.

There are two basic ways to achieve this. First, we’ll import our classes.

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(5, input_shape=(3,), activation='relu')
])
```

In this case, we have a `Dense` layer, and we are specifying ReLU as our activation function with `activation='relu'`.

The second way is to add the layers and activation functions to our model after the model has been instantiated like so:

```python
model = Sequential()
model.add(Dense(5, input_shape=(3,)))
model.add(Activation('relu'))
```

Remember that:

`node output = activation(weighted sum of inputs)`

For our example, this means that each output from the nodes in our Dense layer will be equal to the relu result of the weighted sums like

`node output = relu(weighted sum of inputs)`

### Training an ANN

When we train a model, we’re basically trying to solve an optimization problem. We’re trying to optimize the weights within the model. Our task is to find the weights that most accurately map our input data to the correct output class. This mapping is what the network must learn.

#### Optimization algorithm

The weights are optimized using what we call an optimization algorithm. The optimization process depends on the chosen optimization algorithm. We also use the term optimizer to refer to the chosen algorithm. The most widely known optimizer is called stochastic gradient descent, or more simply, SGD.

When we have any optimization problem, we must have an optimization objective, so now let’s consider what SGD’s objective is in optimizing the model’s weights.

The objective of SGD is to minimize some given function that we call a loss function. So, SGD updates the model's weights in such a way as to make this loss function as close to its minimum value as possible.

#### Loss function

One common loss function is mean squared error (MSE), but there are several loss functions that we could use in its place. As deep learning practitioners, it's our job to decide which loss function to use.

Alright, but what is the actual loss we’re talking about? Well, during training, we supply our model with data and the corresponding labels to that data.

For example, suppose we have a model that we want to train to classify whether images are either images of cats or images of dogs. We will supply our model with images of cats and dogs along with the labels for these images that state whether each image is of a cat or of a dog.

Suppose we give one image of a cat to our model. Once the forward pass is complete and the cat image data has flowed through the network, the model is going to provide an output at the end. This will consist of what the model thinks the image is, either a cat or a dog.

In a literal sense, the output will consist of probabilities for cat or dog. For example, it may assign a 75% probability to the image being a cat, and a 25% probability to it being a dog. In this case, the model is assigning a higher likelihood to the image being of a cat than of a dog.

- 75% chance it's a cat
- 25% chance it's a dog

If we stop and think about it for a moment, this is very similar to how humans make decisions. Everything is a prediction!

The loss is the error or difference between what the network is predicting for the image versus the true label of the image, and SGD will try to minimize this error to make our model as accurate as possible in its predictions.

After passing all of our data through our model, we're going to continue passing the same data over and over again. This process of repeatedly sending the same data through the network is considered training, and it is during this process that the model actually learns. More about learning in the next post. So, through this iterative process with SGD, the model is able to learn from the data.

### Learning in artificial neural networks - More details

In a previous post, we learned about the training process and saw that each data point used for training is passed through the network. This pass through the network from input to output is called a forward pass, and the resulting output depends on the weights at each connection inside the network.

Once all of the data points in our dataset have been passed through the network, we say that an epoch is complete.

**An epoch refers to a single pass of the entire dataset to the network during training.**

Note that many epochs occur throughout the training process as the model learns.

#### What does it mean to learn?

Well, remember, when the model is initialized, the network weights are set to arbitrary values. We have also seen that, at the end of the network, the model will provide the output for a given input.

Once the output is obtained, the loss (or the error) can be computed for that specific output by looking at what the model predicted versus the true label.

After the loss is calculated, **the gradient of this loss function is computed with respect to each of the weights within the network.** Note, gradient is just a word for the derivative of a function of several variables.

Continuing with this explanation, let’s focus in on only one of the weights in the model.

At this point, we've calculated the loss of a single output, and we calculate the gradient of that loss with respect to our single chosen weight. This calculation is done using a technique called **backpropagation**.

Once we have the value for the gradient of the loss function, we can use this value to update the model's weight. The gradient points in the direction in which the loss increases most steeply, so stepping in the opposite direction lowers the loss and moves us closer to its minimum value.

We then multiply the gradient value by something called a learning rate. A learning rate is a small number, usually between 0.0001 and 0.01, but the actual value can vary.

**The learning rate tells us how large of a step we should take in the direction of the minimum.**

Alright, so we multiply the gradient with the learning rate, and we subtract this product from the weight, which will give us the new updated value for this weight.

`new weight = old weight - (learning rate * gradient)`
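To see the update rule in action, here is a small sketch with a hypothetical one-weight model, `prediction = weight * x`, and a squared-error loss. The training example, the starting weight, and the learning rate are all made up, and the gradient is worked out by hand instead of by backpropagation.

```python
# Hypothetical one-weight model: prediction = weight * x, loss = (prediction - label)^2
x, label = 2.0, 6.0      # a single made-up training example (the ideal weight is 3)
weight = 0.0             # arbitrary starting value
learning_rate = 0.01

losses = []
for step in range(100):
    prediction = weight * x
    losses.append((prediction - label) ** 2)
    gradient = 2 * (prediction - label) * x       # d(loss)/d(weight)
    weight = weight - learning_rate * gradient    # new weight = old weight - (learning rate * gradient)

print(weight, losses[0], losses[-1])
```

After enough steps the weight settles close to 3, the value that makes the loss zero for this example.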

In this discussion, we just focused on one single weight to explain the concept, but this same process is going to happen with each of the weights in the model each time data passes through it.

The only difference is that when the gradient of the loss function is computed, the value for the gradient is going to be different for each weight because the gradient is being calculated with respect to each weight.

So now imagine all these weights being iteratively updated with each epoch. The weights are going to be incrementally getting closer and closer to their optimized values while SGD works to minimize the loss function.

This updating of the weights is essentially what we mean when we say that the model is learning. It’s learning what values to assign to each weight based on how those incremental changes are affecting the loss function. As the weights change, the network is getting smarter in terms of accurately mapping inputs to the correct output.

After each epoch, the loss should generally decrease and the accuracy should generally increase.





### Preprocessing the data to be trained using our NN

[link](https://www.youtube.com/watch?v=UkzhouEk6uY)

```python
import numpy as np
from random import randint
from sklearn.preprocessing import MinMaxScaler

train_labels = []
train_samples = []
```

Keras expects the samples to be a NumPy array or a list of NumPy arrays, and the labels to be a NumPy array.

We will generate some numeric data and preprocess it so that Keras can understand the data and train our NN on it.

Example data:

- An experimental drug was tested on individuals aged 13 to 100
- The trial had 2100 participants. Half were under 65 and half were 65 or older
- 95% of patients 65 or older experienced side effects
- 95% of patients under 65 experienced no side effects

We want our NN to predict whether an individual will have side effects or not.

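```python
# The ~95% majority: younger patients without side effects (label 0)
# and older patients with side effects (label 1)
for i in range(1000):
    random_younger = randint(13, 64)
    train_samples.append(random_younger)
    train_labels.append(0)

    random_older = randint(65, 100)
    train_samples.append(random_older)
    train_labels.append(1)

# The ~5% minority: younger patients with side effects (label 1)
# and older patients without side effects (label 0)
for i in range(50):
    random_younger = randint(13, 64)
    train_samples.append(random_younger)
    train_labels.append(1)

    random_older = randint(65, 100)
    train_samples.append(random_older)
    train_labels.append(0)
```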

```python
len(train_samples) == len(train_labels)   # True

train_labels = np.array(train_labels)
train_samples = np.array(train_samples)

train_samples.shape   # (2100,)
train_labels.shape    # (2100,)
```

Now we have our raw data in the format Keras wants.

The NN might not learn very well from numbers ranging from 13 to 100.

So we scale our data to the range 0-1.

```python
train_samples.reshape(2100, 1)
```

```
array([[62],
       [70],
       [55],
       ...,
       [87],
       [45],
       [81]])
```

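```python
scalar = MinMaxScaler(feature_range=(0, 1))

# MinMaxScaler expects a 2D array, so reshape to (n_samples, 1) before fitting
scaled_train_samples = scalar.fit_transform(train_samples.reshape(len(train_samples), 1))
# Test samples are not defined in these notes. When scaling them, call
# scalar.transform(...) so they reuse the min and max fitted on the training data.
```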



```python
scaled_train_samples
```

```
array([[0.56321839],
       [0.65517241],
       [0.48275862],
       ...,
       [0.85057471],
       [0.36781609],
       [0.7816092 ]])
```
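The scaled values above can be reproduced by hand. With `feature_range=(0, 1)`, min-max scaling subtracts the smallest value and divides by the range. The sketch below applies that to a few illustrative ages and assumes that both extremes, 13 and 100, occur in the data, as the output above suggests.

```python
import numpy as np

# Illustrative ages, including both extremes of the trial's age range.
ages = np.array([13, 62, 70, 100], dtype=float)

# Min-max scaling to [0, 1]: subtract the minimum, divide by the range.
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled)
```

Here 62 maps to 49/87, roughly 0.5632, which matches the first scaled sample shown above.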

```python
train_labels
```

```
array([0, 1, 0, ..., 0, 1, 0])
```

Now our data is perfect for training

#### Training in code with Keras

```python
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
```

Next, we define our model:

```python
model = Sequential([
    Dense(6, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='sigmoid')
])
```

Before we can train our model, we must compile it like so:


```python
model.compile(optimizer=Adam(lr=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```

To the `compile()` function, we are passing the optimizer, the loss function, and the metrics that we would like to see. Notice that the optimizer we have specified is called Adam. Adam is just a variant of SGD. The learning rate is specified inside the `Adam` constructor, and in this case, with `Adam(lr=0.0001)`, we have chosen 0.0001.

Finally, we fit our model to the data. Fitting the model to the data means to train the model on the data. We do this with the following code:

```python
model.fit(x=scaled_train_samples, y=train_labels, batch_size=10, epochs=20, shuffle=True, verbose=2)
```

```
Epoch 1/20
 - 1s - loss: 0.6570 - acc: 0.5262
Epoch 2/20
 - 0s - loss: 0.6420 - acc: 0.5819
Epoch 3/20
 - 0s - loss: 0.6235 - acc: 0.6324
Epoch 4/20
 - 0s - loss: 0.6056 - acc: 0.6557
Epoch 5/20
 - 0s - loss: 0.5874 - acc: 0.6762
Epoch 6/20
 - 0s - loss: 0.5692 - acc: 0.7048
Epoch 7/20
 - 0s - loss: 0.5506 - acc: 0.7381
Epoch 8/20
 - 0s - loss: 0.5322 - acc: 0.7729
Epoch 9/20
 - 0s - loss: 0.5138 - acc: 0.7929
Epoch 10/20
 - 0s - loss: 0.4955 - acc: 0.8110
Epoch 11/20
 - 0s - loss: 0.4774 - acc: 0.8305
Epoch 12/20
 - 0s - loss: 0.4594 - acc: 0.8433
Epoch 13/20
 - 0s - loss: 0.4419 - acc: 0.8533
Epoch 14/20
 - 0s - loss: 0.4250 - acc: 0.8614
Epoch 15/20
 - 0s - loss: 0.4088 - acc: 0.8719
Epoch 16/20
 - 0s - loss: 0.3937 - acc: 0.8848
Epoch 17/20
 - 0s - loss: 0.3794 - acc: 0.8881
Epoch 18/20
 - 0s - loss: 0.3662 - acc: 0.8971
Epoch 19/20
 - 0s - loss: 0.3542 - acc: 0.9029
Epoch 20/20
 - 0s - loss: 0.3435 - acc: 0.9090
```

Expected output:

```
Epoch 1/20 0s - loss: 0.6400 - acc: 0.5576
Epoch 2/20 0s - loss: 0.6061 - acc: 0.6310
Epoch 3/20 0s - loss: 0.5748 - acc: 0.7010
Epoch 4/20 0s - loss: 0.5401 - acc: 0.7633
Epoch 5/20 0s - loss: 0.5050 - acc: 0.7990
Epoch 6/20 0s - loss: 0.4702 - acc: 0.8300
Epoch 7/20 0s - loss: 0.4366 - acc: 0.8495
Epoch 8/20 0s - loss: 0.4066 - acc: 0.8767
Epoch 9/20 0s - loss: 0.3808 - acc: 0.8814
Epoch 10/20 0s - loss: 0.3596 - acc: 0.8962
Epoch 11/20 0s - loss: 0.3420 - acc: 0.9043
Epoch 12/20 0s - loss: 0.3282 - acc: 0.9090
Epoch 13/20 0s - loss: 0.3170 - acc: 0.9129
Epoch 14/20 0s - loss: 0.3081 - acc: 0.9210
Epoch 15/20 0s - loss: 0.3014 - acc: 0.9190
Epoch 16/20 0s - loss: 0.2959 - acc: 0.9205
Epoch 17/20 0s - loss: 0.2916 - acc: 0.9238
Epoch 18/20 0s - loss: 0.2879 - acc: 0.9267
Epoch 19/20 0s - loss: 0.2848 - acc: 0.9252
Epoch 20/20 0s - loss: 0.2824 - acc: 0.9286

```

scaled_train_samples is a numpy array consisting of the training samples.

train_labels is a numpy array consisting of the corresponding labels for the training samples.

batch_size=10 specifies how many training samples should be sent to the model at once.

epochs=20 means that the complete training set (all of the samples) will be passed to the model a total of 20 times.

shuffle=True indicates that the data should first be shuffled before being passed to the model.

verbose=2 indicates how much logging we will see as the model trains.
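The `batch_size` and `epochs` values together determine how many times the weights are updated. A short arithmetic sketch, assuming all 2100 samples are used for training (no validation split) and one weight update per batch:

```python
import math

num_samples = 2100   # size of the training set built earlier
batch_size = 10
epochs = 20

# One weight update per batch; the last batch may be smaller if sizes do not divide evenly.
batches_per_epoch = math.ceil(num_samples / batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
# → 210 4200
```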

The output gives us the following values for each epoch:

- Epoch number
- Duration in seconds
- Loss
- Accuracy


What you will notice is that the loss is going down and the accuracy is going up as the epochs progress.


### Loss functions in neural networks

The loss function is what SGD is attempting to minimize by iteratively updating the weights in the network.

At the end of each epoch during the training process, the loss will be calculated using the network’s output predictions and the true labels for the respective input.

Suppose our model is classifying images of cats and dogs, and assume that the label for cat is 0 and the label for dog is 1.

- cat: 0
- dog: 1

Now suppose we pass an image of a cat to the model, and the provided output is 0.25. In this case, the difference between the model’s prediction and the true label is 0.25 - 0.00 = 0.25. This difference is also called the error.

`error = 0.25 - 0.00 = 0.25`

This process is performed for every output. For each epoch, the error is accumulated across all the individual outputs.

Let’s look at a loss function that is commonly used in practice called the mean squared error (MSE).

#### MSE

For a single sample, with MSE, we first calculate the difference (the error) between the provided output prediction and the label. We then square this error. For a single input, this is all we do.

`MSE(input) = (output - label)^2`

If we passed multiple samples to the model at once (a batch of samples), then we would take the mean of the squared errors over all of these samples.
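The batch version of MSE can be sketched in NumPy as follows. The outputs and labels are made-up numbers for a batch of four samples, using the cat = 0 and dog = 1 labelling from above.

```python
import numpy as np

# Made-up model outputs and true labels (cat = 0, dog = 1) for a batch of four samples.
outputs = np.array([0.25, 0.80, 0.10, 0.60])
labels = np.array([0.0, 1.0, 0.0, 1.0])

squared_errors = (outputs - labels) ** 2   # per-sample squared error
mse = squared_errors.mean()                # mean over the batch
print(squared_errors, mse)
```

The first sample reproduces the error of 0.25 from the earlier cat example; squaring it gives 0.0625.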

This was just illustrating the math behind how one loss function, MSE, works. There are several different loss functions that we could work with though.

The general idea that we just showed for calculating the error of individual samples will hold true for all of the different types of loss functions. The implementation of what we actually do with each of the errors will be dependent upon the algorithm of the given loss function we’re using. For example, we averaged the squared errors to calculate MSE, but other loss functions will use other algorithms to determine the value of the loss.

If we passed our entire training set to the model at once (that is, with `batch_size` equal to the number of training samples), then the process we just went over for calculating the loss would occur once per epoch, at the end of each epoch during training.

If we split our training set into batches, and passed batches one at a time to our model, then the loss would be calculated on each batch. With either method, since the loss depends on the weights, we expect to see the value of the loss change each time the weights are updated. Given that the objective of SGD is to minimize the loss, we want to see our loss decrease as we run more epochs.

At the time these notes were written, the loss functions available in Keras were as follows:

- mean_squared_error
- mean_absolute_error
- mean_absolute_percentage_error
- mean_squared_logarithmic_error
- squared_hinge
- hinge
- categorical_hinge
- logcosh
- categorical_crossentropy
- sparse_categorical_crossentropy
- binary_crossentropy
- kullback_leibler_divergence
- poisson
- cosine_proximity



### Train, Test, & Validation Sets explained


For training and testing purposes for our model, we should have our data broken down into three distinct datasets. These datasets will consist of the following:

- Training set
- Validation set
- Test set

#### Training set


The training set is what it sounds like. It’s the set of data used to train the model. During each epoch, our model will be trained over and over again on this same data in our training set, and it will continue to learn about the features of this data.

The hope with this is that later we can deploy our model and have it accurately predict on new data that it’s never seen before. It will be making these predictions based on what it’s learned about the training data. Ok, now let’s discuss the validation set.

#### Validation set

Recall how we just mentioned that with each epoch during training, the model will be trained on the data in the training set. Well, it will also simultaneously be validated on the data in the validation set.

We know from our previous posts on training, that during the training process, the model will be classifying the output for each input in the training set. After this classification occurs, the loss will then be calculated, and the weights in the model will be adjusted. Then, during the next epoch, it will classify the same input again.

Now, also during training, the model will be classifying each input from the validation set as well. It will be doing this classification based only on what it’s learned about the data it’s being trained on in the training set. **The weights will not be updated in the model based on the loss calculated from our validation data.**

Remember, the data in the validation set is separate from the data in the training set. So when the model is validating on this data, this data does not consist of samples that the model already is familiar with from training.

One of the major reasons we need a validation set is to ensure that our model is not overfitting to the data in the training set. We’ll discuss overfitting and underfitting in detail at a later time. But the idea of overfitting is that our model becomes really good at being able to classify data in the training set, but it’s unable to generalize and make accurate classifications on data that it wasn’t trained on.

During training, if we’re also validating the model on the validation set and see that the results it’s giving for the validation data are just as good as the results it’s giving for the training data, then we can be more confident that our model is not overfitting.

**The validation set allows us to see how well the model is generalizing during training.**

On the other hand, if the results on the training data are really good, but the results on the validation data are lagging behind, then our model is overfitting. Now let’s move on to the test set.

#### Test set

The test set is a set of data that is used to test the model after the model has already been trained. The test set is separate from both the training set and validation set.

After our model has been trained and validated using our training and validation sets, we will then use our model to predict the output of the unlabeled data in the test set.

One major difference between the test set and the two other sets is that the model should never see labels for the test set. The training set and validation set have to be labeled so that we can see the metrics given during training, like the loss and the accuracy from each epoch. If labels for the test set are available, they are only used after prediction, to score the model’s output.

When the model is predicting on unlabeled data in our test set, this would be the same type of process that would be used if we were to deploy our model out into the field.

**The test set provides a final check that the model is generalizing well before deploying the model to production.**

For example, if we’re using a model to classify data without knowing what the labels of the data are beforehand, and without the model ever having been shown the exact data it’s going to be classifying, then of course we wouldn’t be giving our model labeled data to do this.

The entire goal of having a model be able to classify is to do it without knowing what the data is beforehand.

**The ultimate goal of machine learning and deep learning is to build models that are able to generalize well.**
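As a rough illustration of how the three sets relate, the sketch below shuffles a list of labeled samples and carves it into disjoint training, validation, and test portions. The `split_dataset` helper, the 70/15/15 proportions, and the toy data are all assumptions made for this example, not values from the course.

```python
import random

def split_dataset(samples, train_frac=0.70, valid_frac=0.15, seed=0):
    """Shuffle samples and split them into disjoint train/valid/test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation portions becomes the test set.
    """
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Toy data: 100 (sample, label) pairs.
data = [(i, i % 2) for i in range(100)]
train_set, valid_set, test_set = split_dataset(data)
print(len(train_set), len(valid_set), len(test_set))
```

Because the three lists are slices of one shuffled copy, no sample can end up in more than one set, which is the property the sections above rely on.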

### In Keras:

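In `model.fit()` we can pass in `validation_split`, so we do not have to explicitly make a validation set. For example, `validation_split=0.20` will split out 20% of the training data and use it as validation data. Keras takes this portion from the end of the arrays passed in, before any shuffling.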

```python
model.fit(scaled_train_samples, train_labels, validation_split=0.20, batch_size=10, epochs=20, shuffle=True, verbose=2)
```

```text
Train on 1680 samples, validate on 420 samples
Epoch 1/20
 - 1s - loss: 0.6673 - acc: 0.5220 - val_loss: 0.6792 - val_acc: 0.5357
Epoch 2/20
 - 0s - loss: 0.6555 - acc: 0.5923 - val_loss: 0.6732 - val_acc: 0.5786
Epoch 3/20
 - 0s - loss: 0.6408 - acc: 0.6512 - val_loss: 0.6671 - val_acc: 0.5929
Epoch 4/20
 - 0s - loss: 0.6227 - acc: 0.7024 - val_loss: 0.6590 - val_acc: 0.6095
Epoch 5/20
 - 0s - loss: 0.6012 - acc: 0.7423 - val_loss: 0.6512 - val_acc: 0.6095
Epoch 6/20
 - 0s - loss: 0.5776 - acc: 0.7768 - val_loss: 0.6428 - val_acc: 0.6167
Epoch 7/20
 - 0s - loss: 0.5523 - acc: 0.8042 - val_loss: 0.6351 - val_acc: 0.6333
Epoch 8/20
 - 0s - loss: 0.5265 - acc: 0.8274 - val_loss: 0.6286 - val_acc: 0.6357
Epoch 9/20
 - 0s - loss: 0.5004 - acc: 0.8363 - val_loss: 0.6223 - val_acc: 0.6405
Epoch 10/20
 - 0s - loss: 0.4738 - acc: 0.8565 - val_loss: 0.6169 - val_acc: 0.6500
Epoch 11/20
 - 0s - loss: 0.4458 - acc: 0.8679 - val_loss: 0.6116 - val_acc: 0.6714
Epoch 12/20
 - 0s - loss: 0.4167 - acc

<keras.callbacks.History at 0x7ff7fade0ac8>
```

Now we have validation loss and validation accuracy metrics as well.

We implicitly created the validation set using the `validation_split` parameter.

But we could also create the validation set explicitly and pass it to the model:

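```python
# validation_data takes a tuple of arrays (samples, labels),
# not a list of (sample, label) pairs.
# scaled_valid_samples and valid_labels are hypothetical names for a
# validation set prepared the same way as the training arrays.
valid_set = (scaled_valid_samples, valid_labels)

model.fit(scaled_train_samples, train_labels, validation_data=valid_set, batch_size=10, epochs=20, shuffle=True, verbose=2)
```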
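To make the implicit split concrete: Keras takes the validation samples from the end of the arrays passed to `fit()`, before any shuffling. The sketch below mimics that slicing in plain Python. `split_for_validation` is a hypothetical helper written for illustration, not a Keras function, and the sample count of 2100 is inferred from the 1680 and 420 reported in the training log.

```python
def split_for_validation(samples, labels, validation_split):
    """Hold out the last `validation_split` fraction of the data, unshuffled."""
    split_at = int(len(samples) * (1.0 - validation_split))
    train = (samples[:split_at], labels[:split_at])
    valid = (samples[split_at:], labels[split_at:])
    return train, valid

samples = list(range(2100))          # stand-in for scaled_train_samples
labels = [s % 2 for s in samples]    # stand-in for train_labels

(train_x, train_y), (valid_x, valid_y) = split_for_validation(samples, labels, 0.20)
print(len(train_x), len(valid_x))    # 1680 420
```

Since the held-out portion comes from the end of the data, a dataset that is sorted (for example by label) should be shuffled before it is passed to `fit()`, or the validation set will not be representative.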



### Overfitting in a neural network

In this post, we’ll discuss what it means when a model is said to be overfitting. We’ll also cover some techniques we can use to try to reduce overfitting when it happens.

Overfitting occurs when our model becomes really good at being able to classify or predict on data that was included in the training set, but is not as good at classifying data that it wasn’t trained on. So essentially, the model has overfit the data in the training set.

#### How to spot overfitting

We can tell if the model is overfitting based on the metrics that are given for our training data and validation data during training. We previously saw that when we specify a validation set during training, we get metrics for the validation accuracy and loss, as well as the training accuracy and loss.

**If the validation metrics are considerably worse than the training metrics, then that is an indication that our model is overfitting.**

We can also get an idea that our model is overfitting if during training, the model’s metrics were good, but when we use the model to predict on test data, it doesn't accurately classify the data in the test set.

The concept of overfitting boils down to the fact that the model is unable to generalize well. It has learned the features of the training set extremely well, but if we give the model any data that slightly deviates from the exact data used during training, it’s unable to generalize and accurately predict the output.
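The comparison between training and validation metrics can be sketched as a small check. The helper names and the `max_gap` threshold of 0.10 are assumptions chosen for illustration; what counts as "considerably worse" depends on the problem.

```python
def generalization_gap(train_acc, val_acc):
    """Return how far validation accuracy trails training accuracy."""
    return train_acc - val_acc

def looks_overfit(train_acc, val_acc, max_gap=0.10):
    """Flag a run whose validation accuracy lags training by more than max_gap.

    max_gap is an arbitrary illustrative threshold, not a standard value.
    """
    return generalization_gap(train_acc, val_acc) > max_gap

# Epoch 1 and epoch 11 from the training log shown earlier.
print(looks_overfit(train_acc=0.5220, val_acc=0.5357))  # False
print(looks_overfit(train_acc=0.8679, val_acc=0.6714))  # True
```

In that log, the training accuracy keeps climbing while the validation accuracy improves much more slowly, which is the pattern described above.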

#### Reducing overfitting

- **Adding more data to the training set**

    The easiest thing we can do, as long as we have access to it, is to add more data. The more data we can train our model on, the more it will be able to learn from the training set. Also, with more data, we’re hoping to be adding more diversity to the training set as well.

    For example, if we train a model to classify whether an image is an image of a dog or cat, and the model has only seen images of larger dogs, like Labs, Golden Retrievers, and Boxers, then in practice if it sees a Pomeranian, it may not do so well at recognizing that a Pomeranian is a dog.
    If we add more data to this model to encompass more breeds, then our training data will become more diverse, and the model will be less likely to overfit.


    
- **Data augmentation**

    Another technique we can deploy to reduce overfitting is to use data augmentation. This is the process of creating additional augmented data by reasonably modifying the data in our training set. For image data, for example, we can do these modifications by:

    - Cropping
    - Rotating
    - Flipping
    - Zooming
    
    The general idea of data augmentation allows us to add more data to our training set that is similar to the data that we already have, but is just reasonably modified to some degree so that it’s not the exact same.

    For example, if most of our dog images were dogs facing to the left, then it would be a reasonable modification to add augmented flipped images so that our training set would also have dogs that faced to the right.
    

- **Reduce the complexity of the model**

    Something else we can do to reduce overfitting is to reduce the complexity of our model. We could reduce complexity by making simple changes, like removing some layers from the model, or reducing the number of neurons in the layers. This may help our model generalize better to data it hasn’t seen before.

- **Dropout**

    The last tip we'll cover for reducing overfitting is to use something called dropout. The general idea behind dropout is that, if you add it to a model, it will randomly ignore some subset of nodes in a given layer during training, i.e., it drops out the nodes from the layer. Hence, the name dropout. This will prevent these dropped out nodes from participating in producing a prediction on the data.
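To give a feel for what dropout does to a layer's output, here is a minimal pure-Python sketch. It uses the "inverted dropout" variant, in which surviving values are scaled up by `1 / (1 - rate)`; this matches how Keras documents its `Dropout` layer, but the `dropout` function below is a hypothetical helper for illustration, not the Keras implementation.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Apply inverted dropout to a list of activations.

    During training each value is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    Outside of training the activations pass through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

layer_output = [0.5, 1.2, 0.3, 0.9, 0.7, 1.1]   # made-up activations
rng = random.Random(42)                          # fixed seed so the run is repeatable
print(dropout(layer_output, rate=0.5, rng=rng))         # some entries zeroed
print(dropout(layer_output, rate=0.5, training=False))  # unchanged
```

The `training` flag reflects the fact that nodes are only dropped while training; when the model is validated or used for prediction, every node participates.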

### Underfitting in a neural network

Underfitting is on the opposite end of the spectrum. A model is said to be underfitting when it’s not even able to classify the data it was trained on, let alone data it hasn’t seen before.

**A model is said to be underfitting when it’s not able to classify the data it was trained on.**

We can tell that a model is underfitting when the metrics given for the training data are poor, meaning that the training accuracy of the model is low and/or the training loss is high.

If the model is unable to classify data it was trained on, it’s likely not going to do well at predicting on data that it hasn’t seen before.

#### Reducing underfitting


- **Increase the complexity of the model**

    One thing we can do is increase the complexity of our model. This is the exact opposite of a technique we gave to reduce overfitting. If our data is more complex, and we have a relatively simple model, then the model may not be sophisticated enough to be able to accurately classify or predict on our complex data.

    We can increase the complexity of our model by doing things such as:

    - Increasing the number of layers in the model.
    - Increasing the number of neurons in each layer.
    - Changing what type of layers we’re using and where.
    
- **Add more features to the input samples**

    Another technique we can use to reduce underfitting is to add more features to the input samples in our training set if we can. These additional features may help our model classify the data better.
    For example, say we have a model that is attempting to predict the price of a stock based on the last three closing prices of this stock. So our input would consist of three features:

    - day 1 close
    - day 2 close
    - day 3 close
    
    If we added additional features to this data, like the opening prices for these days or the volume of the stock for these days, then this may help our model learn more about the data and improve its accuracy.

- **Reduce dropout**

    The last tip we'll discuss about reducing underfitting is to reduce dropout. Again, this is exactly opposite of a technique we gave in a previous post for reducing overfitting.

    As mentioned in that post, dropout, which we’ll cover in more detail at a later time, is a regularization technique that randomly ignores a subset of nodes in a given layer. It essentially prevents these dropped out nodes from participating in producing a prediction on the data.
    
    When using dropout, we can specify a percentage of the nodes we want to drop. So if we’re using a 50% dropout rate, and we see that our model is underfitting, then we can decrease our amount of dropout by reducing the dropout percentage to something lower than 50 and see what types of metrics we get when we attempt to train again.

    These nodes are only dropped out for purposes of training and not during validation. So, if we see that our model is fitting better to our validation data than it is to our training data, then this is a good indicator to reduce the amount of dropout that we’re using.

### Regularization in a neural network

In general, regularization is a technique that helps reduce overfitting or reduce variance in our network by penalizing for complexity. The idea is that certain complexities in our model may make our model unlikely to generalize well, even though the model fits the training data.

**Regularization is a technique that helps reduce overfitting or reduce variance in our network by penalizing for complexity.**

Given this, if we add regularization to our model, we’re essentially trading in some of the ability of our model to fit the training data well for the ability to have the model generalize better to data it hasn’t seen before.

To implement regularization, we simply add a term to our loss function that penalizes large weights.

#### L2 Regularization

The most common regularization technique is called L2 regularization. We know that regularization basically involves adding a term to our loss function that penalizes for large weights.

With L2 regularization, the term we’re adding to the loss is the sum of the squared norms of the weight matrices

$$\sum_{j=1}^{n}\left\Vert w^{[j]}\right\Vert ^{2},$$

multiplied by a small constant

$$\frac{\lambda }{2m}.$$

If you’re not familiar with norms in general, understand that a norm is just a function that assigns a non-negative length or size to each vector in a vector space, and that this length is zero only for the zero vector. The vector space we’re working with here depends on the sizes of our weight matrices.

To oversimplify, know for now that the norm of each of our weight matrices is just going to be a non-negative number.

Suppose that v is a vector in a vector space. The norm of v is denoted as ∥v∥, and it is required that

$$\left\Vert v\right\Vert \geq 0.$$

Let’s look at what L2 regularization looks like. We have

$$loss + \left( \sum_{j=1}^{n}\left\Vert w^{[j]}\right\Vert ^{2}\right)\frac{\lambda }{2m}.$$

| Variable | Definition |
| --- | --- |
| $n$ | Number of layers |
| $w^{[j]}$ | Weight matrix for the $j^{th}$ layer |
| $m$ | Number of inputs |
| $\lambda$ | Regularization parameter |



The term λ is called the regularization parameter, and this is another hyperparameter that we’ll have to choose and then test and tune in order to choose the correct number for our specific model.
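The regularized loss above can be computed by hand for a tiny network. This sketch assumes the norm in the formula is the Frobenius norm (the square root of the sum of squared entries), which is the usual reading for weight matrices; the weights, `data_loss`, `lam`, and `m` values are made up for illustration.

```python
def squared_norm(weight_matrix):
    """Squared Frobenius norm: the sum of every squared entry in the matrix."""
    return sum(w * w for row in weight_matrix for w in row)

def l2_penalty(weight_matrices, lam, m):
    """Compute (lambda / 2m) * sum_j ||w^[j]||^2 over all layers."""
    return (lam / (2 * m)) * sum(squared_norm(w) for w in weight_matrices)

# Two toy layers with made-up weights.
weights = [
    [[1.0, -2.0], [0.5, 0.0]],   # layer 1: 1 + 4 + 0.25 + 0 = 5.25
    [[3.0], [-1.0]],             # layer 2: 9 + 1 = 10
]
data_loss = 0.40                 # hypothetical unregularized loss
penalty = l2_penalty(weights, lam=0.1, m=10)
print(penalty)                   # approximately 0.1 / 20 * 15.25 = 0.07625
print(data_loss + penalty)       # the regularized loss that SGD would minimize
```

Doubling any weight quadruples its contribution to the penalty, which is why large weights are discouraged more strongly than small ones.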
 
To summarize, we now know that regularization is just a technique that penalizes for relatively large weights in our model, and behind the scenes, the implementation of regularization is just the addition of a term to our existing loss function.

#### Impact of regularization


Using L2 regularization as an example, if we were to set λ to be large, then it would incentivize the model to set the weights close to zero, because the objective of SGD is to minimize the loss function. Remember that our original loss function is now being summed with the sum of the squared matrix norms, multiplied by λ/(2m).

If λ is large, then the factor λ/(2m) is relatively large as well, and when we multiply it by the sum of the squared norms, the product may be relatively large depending on how large our weights are. This means that our model is incentivized to make the weights small so that the value of this entire function stays relatively small in order to minimize loss.

Intuitively, we could think that maybe this technique will set the weights so close to zero, that it could basically zero-out or reduce the impact of some of our layers. If that’s the case, then it would conceptually simplify our model, making our model less complex, which may in turn reduce variance and overfitting.
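We can watch this shrinking effect on a one-weight toy problem. The sketch below minimizes a made-up data loss `(w - 3)^2` plus the L2 term `(lam / 2m) * w^2` with plain gradient descent; the function, the learning rate, and the step count are all assumptions chosen so the example converges, not settings from the course.

```python
def train_weight(lam, m=1, lr=0.01, steps=2000):
    """Minimize (w - 3)^2 + (lam / 2m) * w^2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the data loss plus gradient of the L2 penalty.
        grad = 2 * (w - 3) + (lam / m) * w
        w -= lr * grad
    return w

for lam in (0.0, 2.0, 20.0):
    print(lam, round(train_weight(lam), 4))
# 0.0 3.0
# 2.0 1.5
# 20.0 0.2727
```

Without regularization the weight settles at 3, the value that fits the data best. As λ grows, the learned weight is pulled toward zero, trading some fit on the training data for a smaller weight.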



### Learning rates and neural networks


We know that the objective during training is for SGD to minimize the loss between the actual output and the predicted output from our training samples. The path towards this minimized loss is occurring over several steps.

Recall that we start the training process with arbitrarily set weights, and then we incrementally update these weights as we move closer and closer to the minimized loss.

Now, the size of these steps we’re taking to reach our minimized loss is going to depend on the learning rate. Conceptually, we can think of the learning rate of our model as the step size.

Before going further, let’s first pause for a quick refresher. We know that during training, after the loss is calculated for our inputs, the gradient of that loss is then calculated with respect to each of the weights in our model.

Once we have the value of these gradients, this is where the idea of our learning rate comes in. The gradients will then get multiplied by the learning rate.

This learning rate is a small number usually ranging between 0.01 and 0.0001, but the actual value can vary, and any value we get for the gradient is going to become pretty small once we multiply it by the learning rate.

Alright, so we get the value of this product for each gradient multiplied by the learning rate, and we then take each of these values and update the respective weights by subtracting this value from them.

`new weight = old weight - (learning rate * gradient)`

We ditch the previous weights that were set on each connection and update them with these new values.
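The update rule above, applied to every weight at once, might look like the following. The `sgd_step` name and the weight and gradient values are made up for illustration.

```python
def sgd_step(weights, gradients, learning_rate):
    """Return updated weights: new = old - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.8, -0.3, 0.5]       # made-up current weights
gradients = [2.0, -1.0, 0.0]     # made-up gradients of the loss
print(sgd_step(weights, gradients, learning_rate=0.01))
# approximately [0.78, -0.29, 0.5]
```

A positive gradient lowers the weight, a negative gradient raises it, and a zero gradient leaves it unchanged.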

The value we choose for the learning rate is going to require some testing. The learning rate is another one of those hyperparameters that we have to test and tune with each model before we know exactly where we want to set it, but as mentioned earlier, a typical guideline is to set it somewhere between 0.01 and 0.0001.

When setting the learning rate to a number on the higher side of this range, we risk the possibility of overshooting. This occurs when we take a step that’s too large in the direction of the minimized loss function and shoot past this minimum and miss it.

To avoid this, we can set the learning rate to a number on the lower side of this range. With this option, since our steps will be really small, it will take us a lot longer to reach the point of minimized loss.

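Overall, choosing between a higher learning rate and a lower learning rate leaves us with a trade-off: a higher rate moves faster but risks overshooting the minimum, while a lower rate is more stable but takes longer to reach it.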
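A toy example makes both failure modes visible. Here we run gradient descent on `loss(w) = w^2`, whose minimum is at `w = 0`. The learning rates are deliberately exaggerated for this simple function and are not recommendations; which rate is "too high" depends on the loss surface of the actual model.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on loss(w) = w^2, whose minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * (2 * w)   # gradient of w^2 is 2w
    return w

for lr in (1.1, 0.1, 0.001):
    print(lr, descend(lr))
```

With a rate of 1.1, every step jumps past the minimum and lands farther away than it started, so the weight grows instead of shrinking. With 0.1, the weight gets close to zero within 20 steps. With 0.001, the weight is still near its starting value of 1.0 after 20 steps, because each step is tiny.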

Alright, so now we should have an idea about what the learning rate is and how it fits into the overall process of training.

### Batch size in artificial neural networks

Put simply, the batch size is the number of samples that will be passed through to the network at one time. Note that a batch is also commonly referred to as a mini-batch.

**The batch size is the number of samples that are passed to the network at once.**

Now, recall that an epoch is one single pass of the entire training set through the network. The batch size and an epoch are not the same thing. Let’s illustrate this with an example.

Let’s say we have 1000 images of dogs that we want to train our network on in order to identify different breeds of dogs. Now, let’s say we specify our batch size to be 10. This means that 10 images of dogs will be passed as a group, or as a batch, at one time to the network.

Given that a single epoch is one single pass of all the data through the network, it will take 100 batches to make up a full epoch. We have 1000 images divided by a batch size of 10, which equals 100 total batches.
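The arithmetic above can be checked with a short sketch. The helper names are hypothetical, and a list of integers stands in for the 1000 dog images.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to pass every sample through once."""
    return math.ceil(num_samples / batch_size)

def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

images = list(range(1000))                           # stand-in for 1000 dog images
print(batches_per_epoch(len(images), 10))            # 100
print(sum(1 for _ in iter_batches(images, 10)))      # 100
print(batches_per_epoch(1000, 32))                   # 32 (the last batch holds 8 images)
```

When the batch size does not divide the number of samples evenly, the final batch of the epoch is simply smaller than the rest.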

Ok, we have the idea of batch size down now, but what’s the point? Why not just pass each data element one-by-one to our model rather than grouping the data in batches?


Well, for one, generally the larger the batch size, the quicker our model will complete each epoch during training. This is because, depending on our computational resources, our machine may be able to process much more than one single sample at a time.

The trade-off, however, is that even if our machine can handle very large batches, the quality of the model may degrade as we set our batch larger and may ultimately cause the model to be unable to generalize well on data it hasn't seen before.

In general, the batch size is another one of the hyperparameters that we must test and tune based on how our specific model is performing during training. This parameter will also have to be tested with regard to how our machine is performing in terms of its resource utilization when using different batch sizes.

For example, if we were to set our batch size to a relatively high number, say 100, then our machine may not have enough computational power to process all 100 images in parallel, and this would suggest that we need to lower our batch size.

```python
model.fit(
    scaled_train_samples, 
    train_labels, 
    validation_data=valid_set, 
    batch_size=10,
    epochs=20, 
    shuffle=True, 
    verbose=2
)
```

The `fit()` function accepts a parameter called `batch_size`. This is where we specify the batch size for training. In this example, we’ve just arbitrarily set the value to 10.

Now, during the training of this model, we’ll be passing in 10 samples at a time until we eventually pass in all the training data to complete one single epoch. Then, we’ll start the same process over again to complete the next epoch.

### Fine-tuning neural networks


Fine-tuning is very closely linked with the term transfer learning.

**Transfer learning occurs when we use knowledge that was gained from solving one problem and apply it to a new but related problem.**

For example, knowledge gained from learning to recognize cars could be applied in a problem of recognizing trucks.

Fine-tuning is a way of applying or utilizing transfer learning. Specifically, fine-tuning is a process that takes a model that has already been trained for one given task and then tunes or tweaks the model to make it perform a second similar task.

#### Why use fine-tuning?


Assuming the original task is similar to the new task, using an artificial neural network that has already been designed and trained allows us to take advantage of what the model has already learned without having to develop it from scratch.

When building a model from scratch, we usually must try many approaches through trial-and-error.

For example, we have to make a decision about each of the following:

- Number of layers
- Types of layers
- Order of layers
- Number of nodes in each layer
- How much regularization to use
- Learning rate

Building and validating our model can be a huge task in its own right, depending on what data we’re training it on.

This is what makes the fine-tuning approach so attractive. If we can find a trained model that already does one task well, and that task is similar to ours in at least some remote way, then we can take advantage of everything the model has already learned and apply it to our specific task.

Now, of course, if the two tasks are different, then there will be some information that the model has learned that may not apply to our new task, or there may be new information that the model needs to learn from the data regarding the new task that wasn’t learned from the previous task.

For example, a model trained on cars is not going to have ever seen a truck bed, so this feature is something new the model would have to learn about. However, think about everything our model for recognizing trucks could use from the model that was originally trained on cars.

This already trained model has learned to understand edges, shapes, and textures and, more concretely, headlights, door handles, windshields, tires, etc. All of these learned features are definitely things we could benefit from in our new model for classifying trucks.

So this sounds fantastic, right, but how do we actually technically implement this?

#### How to fine-tune

Going back to the example we just mentioned, if we have a model that has already been trained to recognize cars and we want to fine-tune this model to recognize trucks, we can first import our original model that was used on the cars problem.

For simplicity, let’s say we remove the last layer of this model. The last layer would have previously been classifying whether an image was a car or not. After removing this, we want to add a new layer back whose purpose is to classify whether an image is a truck or not.

In some problems, we may want to remove more than just the last single layer, and we may want to add more than just one layer. This will depend on how similar the task is for each of the models.

Layers at the end of our model may have learned features that are very specific to the original task, whereas layers at the start of the model usually learn more general features like edges, shapes, and textures.

After we’ve modified the structure of the existing model, we then want to freeze the layers in our new model that came from the original model.

#### Freezing weights

By freezing, we mean that we don’t want the weights for these layers to update whenever we train the model on our new data for our new task. We want to keep all of these weights the same as they were after being trained on the original task. We only want the weights in our new or modified layers to be updating.

After we do this, all that’s left is just to train the model on our new data. Again, during this training process, the weights from all the layers we kept from our original model will stay the same, and only the weights in our new layers will be updating.
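The whole procedure (remove the old output layer, freeze what we kept, add a new layer, train) can be mimicked with a toy model in plain Python. This is a conceptual sketch only: the layer names, weights, gradients, and the `train_step` helper are all made up. In Keras, the corresponding idea is to set `layer.trainable = False` on the layers we want to freeze.

```python
def train_step(layers, gradients, learning_rate):
    """Apply one gradient step, skipping every layer marked as frozen."""
    for layer, grads in zip(layers, gradients):
        if not layer["trainable"]:
            continue                      # frozen: keep the pre-trained weights
        layer["weights"] = [w - learning_rate * g
                            for w, g in zip(layer["weights"], grads)]

# Layers of a hypothetical model trained on cars, with made-up weights.
model = [
    {"name": "edges",    "weights": [0.2, -0.4], "trainable": True},
    {"name": "shapes",   "weights": [0.7, 0.1],  "trainable": True},
    {"name": "car_head", "weights": [1.5, -0.9], "trainable": True},
]

model.pop()                               # remove the car-specific output layer
for layer in model:
    layer["trainable"] = False            # freeze everything we kept
model.append({"name": "truck_head", "weights": [0.0, 0.0], "trainable": True})

gradients = [[0.5, 0.5], [0.5, 0.5], [1.0, -1.0]]   # made-up gradients
train_step(model, gradients, learning_rate=0.1)
for layer in model:
    print(layer["name"], layer["weights"])
# edges [0.2, -0.4]
# shapes [0.7, 0.1]
# truck_head [-0.1, 0.1]
```

Even though gradients were supplied for every layer, only the new `truck_head` layer changed, which is exactly the behavior we want from freezing.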

### Data augmentation for machine learning


Data augmentation occurs when we create new data based on modifications of our existing data. Essentially, we’re creating new, augmented data by making reasonable modifications to data in our training set.

For example, we could augment image data by flipping the images, either horizontally or vertically. We could rotate the images, zoom in or out, crop, or even vary the color of the images. All of these are common data augmentation techniques.

- Horizontal flip
- Vertical flip
- Rotation
- Zoom in
- Zoom out
- Cropping
- Color variations
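
A few of the techniques listed above can be sketched with plain NumPy array operations. The tiny array below is a hypothetical stand-in for real pixel data. In practice an image library or the augmentation utilities of a deep learning framework would usually be used instead.

```python
import numpy as np

# A tiny 2 x 3 "image" standing in for real pixel data (hypothetical values).
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

horizontal_flip = np.fliplr(image)   # mirror left <-> right
vertical_flip = np.flipud(image)     # mirror top <-> bottom
rotated_90 = np.rot90(image)         # rotate 90 degrees counter-clockwise
cropped = image[:, 1:]               # crop away the first column

# Each augmented array can be added to the training set as a new sample.
augmented = [horizontal_flip, vertical_flip, rotated_90, cropped]
```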


Why would we want to do this, though? Why use data augmentation?

Well, we may just want or need to add more data to our training set. For example, say we have a relatively small number of samples to include in our training set, and it’s difficult to get more. Then we could use data augmentation to create more samples from our existing data set.

#### Reducing overfitting


Additionally, we may want to use data augmentation to reduce overfitting. Recall, we mentioned this point in our post that covered overfitting.

If our model is overfitting, one technique to reduce it is to add more data to the training set. As we just mentioned, if we don’t have access to additional data, we can create more using data augmentation.

Also, in regards to overfitting, think about if we had a data set full of images of dogs, but most of the dogs were facing to the right.

If a model was trained on these images, it’s reasonable to think that the model would believe that only these right-facing dogs were actually dogs. It may very well not classify left-facing dogs as actually being dogs when we deploy this model in the field or use it to predict on test images.

With this, producing new left-facing images of dogs by augmenting the original images of right-facing dogs would be a reasonable modification. We would do this by horizontally flipping the original images to produce new ones.

Now, some data augmentation techniques may not be appropriate for a given data set. Sticking with the dog example, we stated that horizontally flipping our dog images makes sense. However, it wouldn’t necessarily be reasonable to modify our dog images by vertically flipping them.

In real world images of dogs, it’s not really as likely that we’ll be seeing many images of dogs flipped upside down on their heads or backs.

### Predicting with a Neural Network


In an earlier post, we discussed what it means to train a neural network. After this training has completed, if we’re happy with the metrics that the model gave us for our training and validation data, then the next step would be to have our model predict on the data in our test set.

Recall from our post on training, testing, and validation sets, that unlike the train and validation data that get passed to the model with their respective labels, when we pass our test data to the model, we do not pass the corresponding labels. So, the model is not aware of the labels for the test set at all.

For predicting, essentially what we’re doing is passing our unlabeled test data to the model and having the model output a prediction for each sample in our test data. These predictions are based on what the model learned during training.

For example, suppose we trained a model to classify different breeds of dogs based on dog images. For each sample image, the model outputs which breed it thinks is most likely.

Now, suppose our test set contains images of dogs our model hasn’t seen before. We pass these samples to our model, and ask it to predict the output for each image. Remember, the model does not have access to the labels for these images.

This process will tell us how well our model performs on data it hasn’t seen before based on how well its predictions match the true labels for the data.

This process will also help give us some insight on what our model has or hasn’t learned. For example, suppose we trained our model only on images of large dogs, but our test set has some images of small dogs. When we pass a small dog to our model, it likely isn’t going to do well at predicting what breed the dog is, since it’s not been trained very well on smaller dogs in general.

This means that we need to make sure that our training and validation sets are representative of the actual data we want our model to be predicting on.

Aside from running predictions on our test data, we can also have our model predict on real world data once it’s deployed to serve its actual purpose.

For example, if we deployed this neural network for classifying dog breeds to a website that anyone could visit and upload an image of their dog, then we’d want to be predicting the breed of the dog based on the image.

This image would likely not have been one that was included in our training, validation, or test sets, so this prediction would be occurring with true data from out in the field.

#### Using a Keras model to get a prediction

```python

predictions = model.predict(
    scaled_test_samples, 
    batch_size=10, 
    verbose=0
) 

```

The first item we have here is a variable we’ve called predictions. We’re assuming that we already have our model built and trained. Our model in this example is the object called model. We’re setting predictions equal to model.predict().

This predict() function is what we call to actually have the model make predictions. To the predict() function, we’re passing the variable called scaled_test_samples. This is the variable that’s holding our test data.

We set our batch_size here arbitrarily to 10. We set the verbosity, which is how much we want to see printed to the screen when we run these predictions, to 0 here to show nothing.

Before going forward, note that we are just using a sample model here that we’ve used in previous posts. We won’t go into any details about the actual model now, but if you’re interested in building the same model and running these same predictions, then check out the posts from the Keras series on preprocessing data and creating a confusion matrix. They will give you the full picture regarding this test data.

For now, we’re just showing the concept of how to run predictions in code with Keras.

Ok, so we ran our predictions. Now let’s look at our output.

```
for p in predictions:
    print(p)

[ 0.7410683  0.2589317]
[ 0.14958295  0.85041702]
...
[ 0.87152088  0.12847912]
[ 0.04943148  0.95056852]
```

For this sample model, we have two output categories, and we’re just printing each prediction from each sample in our test set, which is stored in our predictions variable.

We see we have two columns here. These represent the two output categories, and are showing us probabilities for each category. These are the actual predictions. Let’s call the categories 0 and 1 for simplicity.

For example, for the first sample in our test set, the model is assigning a 74% probability that the sample falls into category 0 and only a 26% probability that it falls into category 1.

The second sample shows us that the model is assigning an 85% probability to the sample being in category 1, and a 15% probability that it’s in category 0, and this occurs for each of the test samples in our predictions variable.
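
To turn these probabilities into a single predicted category per sample, one common approach is to take the index of the largest value in each row. The sketch below uses the four rows printed above. The `true_labels` array is made up purely to show how an accuracy figure could be computed once predictions are compared against known labels.

```python
import numpy as np

# The four prediction rows shown in the output above.
predictions = np.array([
    [0.7410683, 0.2589317],
    [0.14958295, 0.85041702],
    [0.87152088, 0.12847912],
    [0.04943148, 0.95056852],
])

# Index of the largest probability in each row = predicted category.
predicted_categories = predictions.argmax(axis=1)

# Hypothetical true labels, invented here only to illustrate the comparison.
true_labels = np.array([0, 1, 1, 1])
accuracy = float((predicted_categories == true_labels).mean())
```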

### Supervised learning for machine learning

Supervised learning occurs when the data in our training set is labeled.

Recall from our post on training, validation, and testing sets, we explained that both the training data and validation data are labeled when passed to the model. This is the case for supervised learning.

With supervised learning, each piece of data passed to the model during training is a pair that consists of the input object, or sample, along with the corresponding label or output value.

Essentially, with supervised learning, the model is learning how to create a mapping from given inputs to particular outputs based on what it’s learning from the labeled training data.

For example, say we’re training a model to classify different types of reptiles based on images of reptiles. Now during training, we pass in an image of a lizard.

Since we’re doing supervised learning, we’ll also be supplying our model with the label for this image, which in this case is simply just lizard.

Based on what we saw in our post on training, we know that the model will then classify the output of this image, and then determine the error for that image by looking at the difference between the value it predicted and the actual label for the image.

To do this, the labels need to be encoded into something numeric. In this case, the label of lizard may be encoded as 0, whereas the label of turtle may be encoded as 1.
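
As a minimal sketch of this encoding step, the snippet below maps each class name to an integer in order of first appearance. The label list is hypothetical, and libraries offer ready-made encoders that do the same job.

```python
# Hypothetical string labels for a few training images.
labels = ["lizard", "turtle", "lizard", "turtle", "turtle"]

# Assign each distinct class name an integer, in order of first appearance.
class_to_index = {}
for name in labels:
    if name not in class_to_index:
        class_to_index[name] = len(class_to_index)

# Numeric labels that can be passed to the model alongside the samples.
encoded_labels = [class_to_index[name] for name in labels]
```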

After this, we go through this process of determining the error or loss for all of the data in our training set for as many epochs as we specify. Remember, during this training, the objective of the model is to minimize the loss, so when we deploy our model and use it to predict on data it wasn’t trained on, it will be making these predictions based on the labeled data that it did see during training.

If we didn’t supply our labels to the model, though, then what’s the alternative? Well, as opposed to supervised learning, we could instead use something called unsupervised learning. We could also use another technique called semi-supervised learning. We’ll be covering each of these topics in future posts.

For now, we’re going to take a peek at some Keras code to reiterate how and where we’re supplying our labeled samples to our model.

Suppose we have a simple Sequential model here with two hidden dense layers and an output layer with two output categories.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(16, input_shape=(2,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='sigmoid')
])
```

We’re assuming the task of this model is to classify whether an individual is male or female based on his or her height and weight.

```python
from keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```

After compiling our model, we have an example here of some training data that is completely made up for illustration purposes.

```python
# weight, height
train_samples = [
    [150, 67], 
    [130, 60], 
    [200, 65], 
    [125, 52], 
    [230, 72], 
    [181, 70]
]
```

The actual training data is stored in the train_samples variable. Here, we have a list of pairs, and each of these pairs is an individual sample, and a sample is the weight and height of a person.

The first element in each pair is the weight measured in pounds, and the second element is the height measured in inches.

Next, we have our labels stored in this train_labels variable. Here, a 0 represents a male, and a 1 represents a female.

```python

# 0: male
# 1: female
train_labels = [1, 1, 0, 1, 0, 0]

```

The position of each of these labels corresponds to the positions of each sample in our train_samples variable. For example, this first 1 here, which represents a female, is the label for the first element in the train_samples array. This second 1 in train_labels corresponds to the second sample in train_samples, and so on.

```python
model.fit(
    x=train_samples, 
    y=train_labels,
    batch_size=3,
    epochs=10,
    shuffle=True,
    verbose=2
)
```

Now, when we go to train our model, we call model.fit() as we’ve discussed in previous posts, and the first parameter here specified by x is going to be our train_samples variable, and the second parameter, specified by y, is going to be the corresponding train_labels.
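
To make the role of the labels concrete, here is a hand-written sketch of what a sparse categorical cross-entropy loss computes from integer labels like those in `train_labels`. It is an illustration in NumPy, not Keras's own implementation, and both sets of model outputs below are invented.

```python
import numpy as np

def mean_sparse_crossentropy(probs, labels):
    """Mean negative log-probability assigned to each sample's true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick, for every row, the probability in the column given by its label.
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs)))

train_labels = [1, 1, 0, 1, 0, 0]

# Hypothetical model outputs: one row of class probabilities per sample.
confident_and_right = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1],
                       [0.1, 0.9], [0.8, 0.2], [0.9, 0.1]]
uncertain = [[0.5, 0.5]] * 6

low_loss = mean_sparse_crossentropy(confident_and_right, train_labels)
high_loss = mean_sparse_crossentropy(uncertain, train_labels)
```

The outputs that agree confidently with the labels give the lower loss, which is the direction training pushes the model in.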



### Unsupervised learning in machine learning


In contrast to supervised learning, unsupervised learning occurs when the data in our training set is not labeled.

With unsupervised learning, each piece of data passed to our model during training is solely an unlabeled input object, or sample. There is no corresponding label that’s paired with the sample.

Hm... but if the data isn’t labeled, then how is the model learning? How is it evaluating itself to understand if it’s performing well or not?

Well, first, let’s go ahead and touch on the fact that, with unsupervised learning, since the model is unaware of the labels for the training data, there is no way to measure accuracy. Accuracy is not typically a metric that we use to analyze an unsupervised learning process.

Essentially, with unsupervised learning, the model is going to be given an unlabeled dataset, and it’s going to attempt to learn some type of structure from the data and will extract the useful information or features from this data.

It’s going to be learning how to create a mapping from given inputs to particular outputs based on what it’s learning about the structure of this data without any labels.

#### Clustering algorithms


One of the most popular applications of unsupervised learning is through the use of clustering algorithms. Sticking with our example from our previous post on supervised learning, let’s suppose we have the height and weight data for a particular age group of males and females.

This time, we don’t have the labels for this data, so any given sample from this data set would just be a pair consisting of one person’s height and weight. There is no associated label telling us whether this person was a male or female.

Now, a clustering algorithm could analyze this data and start to learn the structure of it even though it’s not labeled. Through learning the structure, it can start to cluster the data into groups.

We could imagine that if we were to plot this height and weight data on a chart, then maybe it would look something like this with weight on the x-axis and height on the y-axis.

![](http://deeplizard.com/images/cluster%20two%202%20groups.png)

There’s nothing explicitly telling us the labels for this data, but we can see that there are two pretty distinct clusters here, and so we could infer that perhaps this clustering is occurring based on whether these individuals are male or female.

One of these clusters may be made up predominately of females, while the other is predominately male, so clustering is one area that makes use of unsupervised learning. Let’s look at another.
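
As a rough sketch of how a clustering algorithm might group such data, the snippet below runs a very small k-means loop in NumPy. The samples are invented. Using the first `k` points as initial centers is a simplification chosen to keep the example deterministic, where real implementations usually initialize more carefully.

```python
import numpy as np

# Made-up, unlabeled (weight in pounds, height in inches) samples.
samples = np.array([
    [125.0, 61.0], [130.0, 62.0], [135.0, 63.0], [128.0, 60.0],
    [190.0, 71.0], [200.0, 72.0], [185.0, 70.0], [195.0, 73.0],
])

def k_means(points, k, n_iterations=10):
    # Simple deterministic start: use the first k points as initial centers.
    centers = points[:k].copy()
    for _ in range(n_iterations):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assignments, centers

assignments, centers = k_means(samples, k=2)
```

No labels are used anywhere. The two groups emerge only from the distances between samples. Because weight spans a larger numeric range than height here, it dominates the distance, so features are usually scaled first in practice.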

#### Autoencoders

Unsupervised learning is also used by autoencoders.

In the most basic terms, an autoencoder is an artificial neural network that takes in input and then outputs a reconstruction of this input.

Based on everything we’ve learned so far on neural networks, this seems pretty strange, but let’s explain this idea further using an example.


![](http://deeplizard.com/images/autoencoder.jpg)

The example we’ll use comes from a [blog post](https://blog.keras.io/building-autoencoders-in-keras.html) by François Chollet, the author of Keras, the neural network API we’ve used in several posts.


Suppose we have a set of images of handwritten digits, and we want to pass them through an autoencoder. Remember, an autoencoder is just a neural network.

This neural network will take in this image of a digit, and it will then encode the image. Then, at the end of the network, it will decode the image and output the decoded reconstructed version of the original image.

The goal here is for the reconstructed image to be as close as possible to the original image.

A question we might ask about this process is: How can we even measure how well this autoencoder is doing at reconstructing the original image without visually inspecting it?

Well, we can think of the loss function for this autoencoder as measuring how similar the reconstructed version of the image is to the original version. The more similar the reconstructed image is to the original image, the lower the loss.

Since this is an artificial neural network after all, we’ll still be using some variation of SGD during training, and so we’ll still have the same objective of minimizing our loss function.

During training, our model is incentivized to make the reconstructed images closer and closer to the originals.
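
One way to picture such a loss is as a pixel-wise comparison between the original and the reconstruction. The sketch below assumes mean squared error, which is one common choice and not necessarily the loss used in the referenced blog post. The 2 x 2 "images" are made up.

```python
import numpy as np

def reconstruction_loss(original, reconstruction):
    """Mean squared error between the pixels of two images of the same shape."""
    original = np.asarray(original, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    return float(np.mean((original - reconstruction) ** 2))

# Hypothetical 2 x 2 grayscale "image" with pixel values in [0, 1].
original = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
close_reconstruction = np.array([[0.1, 0.9],
                                 [0.9, 0.1]])
poor_reconstruction = np.array([[0.5, 0.5],
                                [0.5, 0.5]])

close_loss = reconstruction_loss(original, close_reconstruction)
poor_loss = reconstruction_loss(original, poor_reconstruction)
```

The closer reconstruction receives the lower loss, and a perfect reconstruction would score exactly zero.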

#### Applications of autoencoders


Alright, so hopefully we have the very basic idea of an autoencoder down, but what would be an application for doing this? Why would we just want to reconstruct input?

Well, one application for this could be to denoise images. Once the model has been trained, then it can accept other similar images that may have a lot of noise surrounding them, and it will be able to extract the underlying meaningful features and reconstruct the image without the noise.



### Semi-supervised learning for machine learning


Semi-supervised learning kind of takes a middle ground between supervised learning and unsupervised learning.

As a quick refresher, recall from previous posts that supervised learning is the learning that occurs during training of an artificial neural network when the data in our training set is labeled. Unsupervised learning, on the other hand, is the learning that occurs when the data in our training set is not labeled. Now, onto semi-supervised learning.

Semi-supervised learning uses a combination of supervised and unsupervised learning techniques, and that’s because, in a scenario where we’d make use of semi-supervised learning, we would have a combination of both labeled and unlabeled data.

Let’s expand on this idea with an example.

Suppose we have access to a large unlabeled dataset that we’d like to train a model on and that manually labeling all of this data ourselves is just not practical.

Well, we could go through and manually label some portion of this large data set ourselves and use that portion to train our model.

This is fine. In fact, this is how a lot of data used for neural networks becomes labeled. However, if we have access to large amounts of data, and we’ve only labeled some small portion of this data, then what a waste it would be to just leave all the other unlabeled data on the table.

I mean, after all, we know the more data we have to train a model, the better and more robust our model will be. What can we do to make use of the remaining unlabeled data in our data set?

Well, one thing we can do is implement a technique that falls under the category of semi-supervised learning called pseudo-labeling.

#### Pseudo-labeling

This is how pseudo-labeling works. As just mentioned, we’ve already labeled some portion of our data set. Now, we’re going to use this labeled data as the training set for our model. We’re then going to train our model, just as we would with any other labeled data set.

Just through the regular training process, we get our model performing pretty well, and so everything we’ve done up to this point has been regular old supervised learning in practice.

Now here’s where the unsupervised learning piece comes into play. After we’ve trained our model on the labeled portion of the data set, we then use our model to predict on the remaining unlabeled portion of data, and we then take these predictions and label each piece of unlabeled data with the individual outputs that were predicted for them.

This process of labeling the unlabeled data with the output that was predicted by our neural network is the very essence of pseudo-labeling.

After labeling the unlabeled data through this pseudo-labeling process, we train our model on the full dataset, which is now comprised of both the data that was actually truly labeled along with the data that was pseudo labeled.

**Pseudo-labeling allows us to train on a vastly larger dataset.**


We’re also able to train on data that may otherwise have taken many tedious hours of human labor to label manually.

As we can imagine, sometimes the cost of acquiring or generating a fully labeled data set is just too high, or the pure act of generating all the labels itself is just not feasible.

Through this process, we can see how this approach makes use of both supervised learning, with the labeled data, and unsupervised learning, with the unlabeled data, which together give us the practice of semi-supervised learning.

**If the unlabeled portion vastly outnumbers the labeled portion, it seems like you're taking a risk pushing through the pseudo-labeled content as it could very well contain a larger number of incorrectly labeled items than the original set.  Isn't this going to be counter productive?  Is there a way to avoid this without manually evaluating a significant percentage of the giant data set?**

- Yeah, you’re right that we could be taking a risk of mislabeling data by using pseudo-labeled samples in our training set. Something we could do to lessen the risk is to only include the pseudo-labeled samples in our training set that received a predicted probability for a particular category that was higher than X%. For example, we could make a rule to only include pseudo-labeled samples in the training set that received a prediction for a specific category of, say, 80% or more. This doesn’t completely strip out the risk of mislabeling, but it does decrease it. The samples that didn't make the cut because their predictions fell short of the X% rule could then be predicted on again after the model was retrained with a larger data set that included the first round of pseudo-labeled samples.

    Also, before going through the pseudo-labeling process, we need to ensure that our model is performing well during training and validation (“well-performing” is subjective here). Additionally, the labeled data that the model was initially trained on should be a decent representation of the full data set. For example, we’d be in trouble if we were training on images of cats and dogs, but the only labeled dogs we had were of larger breeds, like Labs or Boxers. If the remaining unlabeled data that we end up pseudo-labeling had images of Chihuahuas and Pomeranians, then you can imagine that these small breeds may become mislabeled as cats since the model was never trained to recognize small dogs as actually being dogs.
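
The confidence rule described in this answer can be sketched as a small filter over the model's predictions. The function name `select_pseudo_labels`, the 0.8 threshold, and the prediction values are all illustrative assumptions.

```python
import numpy as np

def select_pseudo_labels(probabilities, threshold=0.8):
    """Return (indices, labels) for samples whose top probability meets the threshold."""
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probabilities[keep].argmax(axis=1)

# Hypothetical predictions on four unlabeled samples (two categories).
unlabeled_predictions = [
    [0.95, 0.05],
    [0.60, 0.40],
    [0.10, 0.90],
    [0.45, 0.55],
]

kept_indices, pseudo_labels = select_pseudo_labels(unlabeled_predictions, threshold=0.8)
```

Only the first and third samples pass the rule here. The other two would stay unlabeled and could be predicted on again after retraining.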
    
    

### Deep learning with convolutional neural networks


In this post, we’ll be discussing convolutional neural networks. A convolutional neural network, also known as a CNN or ConvNet, is an artificial neural network that has so far been most popularly used for analyzing images for computer vision tasks.

Although image analysis has been the most widespread use of CNNs, they can also be used for other data analysis or classification problems as well. Let's get started!

Most generally, we can think of a CNN as an artificial neural network that has some type of specialization for being able to pick out or detect patterns. This pattern detection is what makes CNNs so useful for image analysis.

If a CNN is just an artificial neural network, though, then what differentiates it from a standard multilayer perceptron or MLP?

CNNs have hidden layers called convolutional layers, and these layers are what make a CNN, well... a CNN!

CNNs can, and usually do, have other, non-convolutional layers as well, but the basis of a CNN is the convolutional layers.

Alright, so what do these convolutional layers do?


Just like any other layer, a convolutional layer receives input, transforms the input in some way, and then outputs the transformed input to the next layer. The inputs to convolutional layers are called input channels, and the outputs are called output channels.

With a convolutional layer, the transformation that occurs is called a convolution operation. This is the term that’s used by the deep learning community anyway. Mathematically, the convolution operations performed by convolutional layers are actually called cross-correlations.

We’ll come back to this operation in a bit. For now, let’s look at a high level idea of what convolutional layers are doing.

#### Filters and convolution operations

As mentioned earlier, convolutional neural networks are able to detect patterns in images.

With each convolutional layer, we need to specify the number of filters the layer should have. These filters are actually what detect the patterns.

Let's expand on precisely what we mean when we say that the filters are able to detect patterns. Think about how much may be going on in any single image: multiple edges, shapes, textures, objects, etc. These are what we mean by patterns.

- edges
- shapes
- textures
- curves
- objects
- colors

One type of pattern that a filter can detect in an image is edges, so this filter would be called an edge detector.

Aside from edges, some filters may detect corners. Some may detect circles. Others, squares. Now these simple, and kind of geometric, filters are what we’d see at the start of a convolutional neural network.

The deeper the network goes, the more sophisticated the filters become. In later layers, rather than edges and simple shapes, our filters may be able to detect specific objects like eyes, ears, hair or fur, feathers, scales, and beaks.

In even deeper layers, the filters are able to detect even more sophisticated objects like full dogs, cats, lizards, and birds.

To understand what’s actually happening here with these convolutional layers and their respective filters, let’s look at an example.

Suppose we have a convolutional neural network that is accepting images of handwritten digits (like from the MNIST data set) and our network is classifying them into their respective categories of whether the image is of a 1, 2, 3, etc.

Let’s now assume that the first hidden layer in our model is a convolutional layer. As mentioned earlier, when adding a convolutional layer to a model, we also have to specify how many filters we want the layer to have.


**The number of filters determines the number of output channels.**

A filter can technically just be thought of as a relatively small matrix (tensor) for which we decide the number of rows and columns. The values within this matrix are initialized with random numbers.

For this first convolutional layer of ours, we’re going to specify that we want the layer to contain one filter of size 3 x 3.

#### Convolutional layer

Let’s look at an example animation of the convolution operation:


![](http://deeplizard.com/images/same_padding_no_strides.gif)

This animation showcases the convolution process without numbers. We have an input channel in blue on the bottom, a shaded convolutional filter sliding across the input channel, and a green output channel on top:

- Blue (bottom) - Input channel
- Shaded (on top of blue) - 3 x 3 convolutional filter
- Green (top) - Output channel


For each position on the blue input channel, the 3 x 3 filter does a computation that maps the shaded part of the blue input channel to the corresponding shaded part of the green output channel.

This convolutional layer receives an input channel, and the filter will slide over each 3 x 3 set of pixels of the input itself until it’s slid over every 3 x 3 block of pixels from the entire image.

Here we have only one filter, so we get only one output channel.


#### Convolution operation

This sliding is referred to as convolving, so really, we should say that this filter is going to convolve across each 3 x 3 block of pixels from the input.

The blue input channel is a matrix representation of an image from the MNIST dataset. The values in this matrix are the individual pixels from the image. These images are grayscale images, and so we only have a single input channel.

- Grayscale images have a single color channel
- RGB images have three color channels

This input will be passed to a convolutional layer.

As just discussed, we’ve specified the first convolutional layer to only have one filter, and this filter is going to convolve across each 3 x 3 block of pixels from the input. When the filter lands on its first 3 x 3 block of pixels, the dot product of the filter itself with the 3 x 3 block of pixels from the input will be computed and stored. This will occur for each 3 x 3 block of pixels that the filter convolves.

For example, we take the dot product of the filter with the first 3 x 3 block of pixels, and then that result is stored in the output channel. Then, the filter slides to the next 3 x 3 block, computes the dot product, and stores the value as the next pixel in the output channel.

After this filter has convolved the entire input, we’ll be left with a new representation of our input, which is now stored in the output channel. This output channel is called a feature map.
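
The sliding dot product described above can be written out directly. The sketch below is a minimal NumPy version with no padding and a stride of one, so the output comes out smaller than the input, unlike the padded animation. The 4 x 4 input and the all-ones filter are hypothetical values chosen to make the arithmetic easy to follow.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and store one
    dot product per position. The kernel is not flipped, so strictly speaking
    this is a cross-correlation."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(block * kernel)
    return feature_map

# Hypothetical 4 x 4 input channel and 3 x 3 filter.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))

feature_map = convolve2d_valid(image, kernel)
```

With a 4 x 4 input and a 3 x 3 filter there are only four positions for the filter to land on, so the resulting feature map is 2 x 2.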

This green output channel becomes the input channel to the next layer, and then the process we just went through with the filter will happen to this new input channel with the next layer’s filters.

This was just a very simple illustration, but as mentioned earlier, **we can think of these filters as pattern detectors.**

#### Input and output channels

Suppose that this grayscale image (single color channel) of a seven from the MNIST data set is our input:

![](http://deeplizard.com/images/mnist%207.PNG)

Let’s suppose that we have four 3 x 3 filters for our first convolutional layer, and these filters are filled with the values you see below. These values can be represented visually by having -1s correspond to black, 1s correspond to white, and 0s correspond to grey.

![](./img/diag1.png)

If we convolve our original image of a seven with each of these four filters individually, this is what the output would look like for each filter:

![](./img/diag2.png)


We can see that all four of these filters are detecting edges. In the output channels, the brightest pixels can be interpreted as what the filter has detected. In the first one, we can see that the filter detects the top horizontal edges of the seven, as indicated by the brightest pixels (white).

The second detects left vertical edges, again being displayed with the brightest pixels. The third detects bottom horizontal edges, and the fourth detects right vertical edges.

These filters, as we mentioned before, are really basic and just detect edges. These are filters we may see towards the start of a convolutional neural network. More complex filters would be located deeper in the network and would gradually be able to detect more sophisticated patterns like the ones shown here:

![](http://deeplizard.com/images/CNN%20layer%202%20filters.PNG)

We can see the shapes that the filters on the left detected from the images on the right. We can see circles, curves and corners. As we go further into our layers, the filters are able to detect much more complex patterns like dog faces or bird legs shown here:

![](http://deeplizard.com/images/CNN%20layer%204%20filters.PNG)

**The amazing thing is that the pattern detectors are derived automatically by the network. The filter values start out with random values, and the values change as the network learns during training. The pattern detecting capability of the filters emerges automatically.**

**Pattern detectors emerge as the network learns.**

In the past, computer vision experts would develop filters (pattern detectors) manually. One example of this is the Sobel filter, an edge detector. However, with deep learning, we can learn these filters automatically using neural networks!
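
As an illustration of a hand-designed filter, the sketch below applies the standard horizontal-gradient Sobel kernel to a made-up image whose left half is dark and whose right half is bright. It assumes NumPy 1.20 or later for `sliding_window_view`. The image itself is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sobel kernel that responds to left-to-right intensity changes (vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Hypothetical 5 x 6 image: dark left half (0), bright right half (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# All 3 x 3 blocks of the image, shape (3, 4, 3, 3).
blocks = sliding_window_view(image, (3, 3))
# Dot product of every block with the kernel (no padding, kernel not flipped).
response = np.einsum("ijkl,kl->ij", blocks, sobel_x)
```

The response is large only where the filter straddles the boundary between the dark and bright halves and is zero over the flat regions, which is the behavior we would expect from an edge detector.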





### Visualizing convolutional filters

In our post on convolutional neural networks, we discussed how each convolutional layer has some set number of filters and that these filters are what actually detect patterns in the given input. We explained technically how this works, and then at the end of the post, we looked at some filters from a CNN and observed what they were able to detect from real world images.

We’re going to be using Keras, a neural network API, to visualize the filters of the convolutional layers from the VGG16 network. We’ve talked about VGG16 previously in the Keras series, but in short, VGG16 is a CNN from the VGG team's entry in the 2014 ImageNet competition (ILSVRC), where it placed first in the localization task and second in the classification task. This is a competition where teams build algorithms to compete on visual recognition tasks.

Most of the code we’ll be using to visualize the filters comes from the blog, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), by the creator of Keras, François Chollet.

Rather than going over the code line-by-line, we’re going to instead give a high-level overview of what the code is doing, and then we’ll get to the visualization piece. This github link contains the original code from the blog so you can check it out or run it yourself.

The first step is to import the pre-trained VGG16 model.

```
from keras.applications import vgg16

# build the VGG16 network with ImageNet weights
model = vgg16.VGG16(weights='imagenet', include_top=False)
```

Then we define a loss function whose objective is to maximize the activation of a given filter within a given layer. We then compute the gradient of this loss with respect to the input image, which is what gradient ascent uses to update the image.

```
# we build a loss function that maximizes the activation
# of the nth filter of the layer considered
layer_output = layer_dict[layer_name].output
if K.image_data_format() == 'channels_first':
    loss = K.mean(layer_output[:, filter_index, :, :])
else:
    loss = K.mean(layer_output[:, :, :, filter_index])

# we compute the gradient of the input picture wrt this loss
grads = K.gradients(loss, input_img)[0]

```

Note that gradient ascent is the same thing as gradient descent, except that rather than trying to minimize our loss, we’re trying to maximize it.

We can think of the purpose of maximizing our loss here as basically trying to activate the filter as much as possible in order for us to be able to visually inspect what types of patterns the filter is detecting.
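As a rough illustration of the idea, the sketch below runs gradient ascent on a toy problem in plain NumPy. It is not the Keras code from the blog: the 3 x 3 filter, the `activation` function, the step size, and the iteration count are all hypothetical stand-ins chosen so the gradient can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3 x 3 filter standing in for one learned filter of a conv layer.
filt = np.array([[-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0]])

def activation(x):
    """Toy 'filter activation': mean of the elementwise product of filter and input."""
    return np.mean(filt * x)

def activation_grad(x):
    """Gradient of activation with respect to x (constant here, because the toy is linear)."""
    return filt / filt.size

# Start from small random noise, loosely mirroring the noisy gray starting image.
x = rng.normal(scale=0.1, size=(3, 3))
start = activation(x)

step = 1.0
for _ in range(20):
    # Gradient ASCENT: add the gradient so the activation goes up, not down.
    x += step * activation_grad(x)

end = activation(x)
print(start, end)
print(np.round(x, 2))
```

After the loop, the input has a dark left column and a bright right column, which is exactly the pattern this toy filter responds to. The real visualization does the same thing with a full image and a filter deep inside VGG16.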

We then pass the network a plain gray image with some random noise as input.

```
# we start from a gray image with some random noise
if K.image_data_format() == 'channels_first':
    input_img_data = np.random.random((1, 3, img_width, img_height))
else:
    input_img_data = np.random.random((1, img_width, img_height, 3))
input_img_data = (input_img_data - 0.5) * 20 + 128
```

After we maximize the loss, we’re then able to obtain a visual representation of what sort of input maximizes the activation for each filter in each layer.

This is generated from the original gray image that we supplied the network.

Running this code on a CPU took a while, about an hour to generate all of the visualizations.

That’s a summary of what our code is actually doing. Now, let’s get to the cool part and step through some of these generated visualizations from each convolutional layer.

#### Generated CNN layer visualizations


Here, we’re looking at 25 filters from the first convolutional layer in the first convolutional block of the network. It looks like most of these have encoded some type of direction or color.

1st conv layer from the 1st conv block:

![](http://deeplizard.com/images/stitched_filters_block1_conv1.png)

We can see some that indicate vertical patterns and others that indicate left and right diagonal patterns.

Let’s skip to another deeper convolutional layer. We’ll choose the second conv layer from the second conv block.

2nd conv layer from the 2nd conv block:

![](http://deeplizard.com/images/stitched_filters_block2_conv2.png)

Here, the visualizations have become more complex and a little more interesting with regard to the types of patterns some of the filters have encoded.

Let's check out some even deeper layers.

2nd conv layer from the 3rd conv block:


![](http://deeplizard.com/images/stitched_filters_block3_conv2.png)

3rd conv layer from 4th conv block:

![](http://deeplizard.com/images/stitched_filters_block4_conv3.png)

2nd conv layer from 5th conv block:

![](http://deeplizard.com/images/stitched_filters_block5_conv2.png)

Notice how with each deeper convolutional layer, we’re getting more complex and more interesting visualizations. This whole visualization process was pretty fascinating for me when I first observed it, so I hope you think it’s just as cool!

Recall, in the last post, we showed the visualization of these filters on the left relative to the input images on the right.

![](http://deeplizard.com/images/CNN%20layer%204%20filters.PNG)

Let’s focus on the one of dog faces at the top. Recall that none of the filter visualizations we just observed gave us anything that looked remotely like an actual real world object. Instead, we just saw those cool patterns.

Why is this? Why didn’t we see things like dog faces? Well, recall, what we were previously observing was a visual representation of what sort of input would maximize the activation for any given filter.

Here, what we’re looking at is the patterns that a given filter was able to detect on specific image input for which the filter was highly activated. I just wanted to touch on the differences between those two illustrations.



### One-hot encodings for machine learning

We know that when we’re training a neural network via supervised learning, we pass labeled input to our model, and the model gives us a predicted output.

If our model is an image classifier, for example, we may be passing labeled images of animals as input. When we do this, the model is usually not interpreting these labels as words, like dog or cat. Additionally, the output that our model gives us for its predictions isn’t typically words like dog or cat either. Instead, most of the time our labels become encoded, so they can take on the form of an integer or of a vector of integers.

One type of encoding that is widely used for encoding categorical data with numerical values is called one-hot encoding.

One-hot encodings transform our categorical labels into vectors of 0s and 1s. The length of these vectors is the number of classes or categories that our model is expected to classify.

- 0: Cold
- 1: Hot

If we were classifying whether images were either of a dog or of a cat, then our one-hot encoded vectors that corresponded to these classes would each be of length 2 reflecting the two categories.

If we added another category, like lizard, so that we could then classify whether images were of dogs, cats, or lizards, then our corresponding one-hot encoded vectors would each be of length 3 since we now have three categories.

Alright, so we know the labels are transformed or encoded into vectors. We know that each of these vectors has a length that is equal to the number of output categories, and we briefly mentioned that the vectors contain 0s and 1s. Let’s go into further detail on this last piece.

Let’s stick with the example of classifying images as being either of a cat, dog, or lizard. With each of the corresponding vectors for these categories being of length 3, we can think of each index or each element within the vector corresponding to one of the three categories.

Let’s say for this example that the cat label corresponds to the first element, dog corresponds to the second element, and lizard corresponds to the third element.

With each of these categories having their own place in the corresponding vectors, we can now discuss the intuition behind the name one-hot.

With each one-hot encoded vector, every element will be a zero EXCEPT for the element that corresponds to the actual category of the given input. This element will be a hot one.


| Label | Element 1 | Element 2 | Element 3 |
|---|---|---|---|
| Cat | 1 | 0 | 0 |
| Dog | 0 | 1 | 0 |
| Lizard | 0 | 0 | 1 |


For cat, we see that the first element is a one and the next two elements are zeros. This is because each element within the vector is a zero except for the element that corresponds to the actual category, and we said that the cat category corresponded to the first element.

Similarly, for dog, we see that the second element is a one, while the first and third elements are zeros. Lastly, for lizard, the third element is a one, while the first and second elements are zeros.

We can see that each time the model receives input that is a cat, it’s not interpreting the label as the word cat, but instead is interpreting the label as this vector [1,0,0].

For images labeled as dog, the model is interpreting the dog label as the vector [0,1,0], and for images labeled as lizard, the model is interpreting the label as the vector [0,0,1].


- Cat	[1,0,0]
- Dog	[0,1,0]
- Lizard	[0,0,1]

Just for clarity purposes, say we add another category, llama, to the mix. Now, we have four categories total, and so this will cause each one-hot encoded vector corresponding to each of these categories to be of length 4 now.

The vectors will now look like this.


- Cat	[1,0,0,0]
- Dog	[0,1,0,0]
- Lizard	[0,0,1,0]
- Llama	[0,0,0,1]
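The encoding above can be sketched in a few lines of plain Python. The `one_hot` helper and the `classes` list are hypothetical names for illustration; the only real requirement is that the order of the classes is fixed so each category keeps its own position in the vector.

```python
def one_hot(label, classes):
    """Return a list of 0s with a single 1 at the position of label in classes."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1
    return vector

# The order of this list decides which element belongs to which category.
classes = ["cat", "dog", "lizard", "llama"]

for label in classes:
    print(label, one_hot(label, classes))
```

Adding a fifth category to `classes` would make every vector one element longer, just as going from three categories to four did above.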

### Batch Normalization (Batch Norm)

Before getting to the details about batch normalization, let’s quickly first discuss regular normalization techniques.

$$z=\frac{x-\text{mean}}{\text{std}}$$

Generally speaking, when training a neural network, we want to normalize or standardize our data in some way ahead of time as part of the pre-processing step. This is the step where we prepare our data to get it ready for training.

Normalization and standardization have the same objective of transforming the data to put all the data points on the same scale.

A typical normalization process consists of scaling numerical data down to be on a scale from zero to one, and a typical standardization process consists of subtracting the mean of the dataset from each data point, and then dividing that difference by the data set’s standard deviation.

This forces the standardized data to take on a mean of zero and a standard deviation of one. In practice, this standardization process is often just referred to as normalization as well.
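Here is a small NumPy sketch of both processes side by side. The `miles` values are made-up sample data, and min-max scaling is used as the "typical" normalization, which is an assumption about which zero-to-one scaling is meant.

```python
import numpy as np

# Hypothetical data: total miles driven by five individuals.
miles = np.array([1000.0, 25000.0, 40000.0, 60000.0, 100000.0])

# Normalization (min-max scaling): rescale the data to the range [0, 1].
normalized = (miles - miles.min()) / (miles.max() - miles.min())

# Standardization: subtract the mean, then divide by the standard deviation.
standardized = (miles - miles.mean()) / miles.std()

print(normalized)
print(standardized)
print(standardized.mean(), standardized.std())
```

The last line prints a mean of (approximately) zero and a standard deviation of one, which is the property described above.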

#### Why use normalization techniques?


In general, this all boils down to putting our data on some type of known or standard scale. Why do we do this?

Well, if we didn’t normalize our data in some way, we can imagine that we may have some numerical data points in our data set that might be very high, and others that might be very low.

For example, suppose we have data on the number of miles individuals have driven a car over the last 5 years. We may have someone who has driven 100,000 miles total, and we may have someone else who’s only driven 1000 miles total. This data has a relatively wide range and isn’t necessarily on the same scale.

Additionally, each one of the features for each of our samples could vary widely as well. If we have one feature which corresponds to an individual’s age and the other feature corresponds to the number of miles that individual has driven a car over the last five years, then, again, we can see that these two pieces of data, age and miles driven, will not be on the same scale.

The larger data points in these non-normalized data sets can cause instability in neural networks because the relatively large inputs can cascade down through the layers in the network, which may cause imbalanced gradients, which may therefore cause the famous exploding gradient problem.

For now, understand that this imbalanced, non-normalized data may cause problems with our network that make it drastically harder to train. Additionally, non-normalized data can significantly decrease our training speed.

When we normalize our inputs, however, we put all of our data on the same scale, in an attempt to increase training speed as well as avoid the problem we just discussed because we won’t have this relatively wide range between data points.

This is good, but there is another problem that can arise even with normalized data. From our previous post on how a neural network learns, we know how the weights in our model become updated over each epoch during training via the process of stochastic gradient descent.

#### Weights that tip the scale


What if, during training, one of the weights ends up becoming drastically larger than the other weights?

Well, this large weight will then cause the output from its corresponding neuron to be extremely large, and this imbalance will, again, continue to cascade through the network, causing instability. This is where batch normalization comes into play.

**Batch norm is applied to layers that we choose within our network.**

When applying batch norm to a layer, the first thing batch norm does is normalize the output from the activation function. Recall from our post on activation functions that the output from a layer is passed to an activation function, which transforms the output in some way depending on the function itself, before being passed to the next layer as input.

After normalizing the output from the activation function, batch norm multiplies this normalized output by some arbitrary parameter and then adds another arbitrary parameter to this resulting product.

**This calculation with the two arbitrary parameters sets a new standard deviation and mean for the data. The two arbitrarily set parameters, g and b, are trainable, meaning that they will be learned and optimized during the training process.**

![](./img/diag3.png)

This process makes it so that the weights within the network don’t become imbalanced with extremely high or low values since the normalization is included in the gradient process.

This addition of batch norm to our model can greatly increase the speed at which training occurs and reduce the ability of outlying large weights to over-influence the training process.

Everything we just mentioned about the batch normalization process occurs on a per-batch basis, hence the name batch norm.

These batches are determined by the batch size we set when we train our model.
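The per-batch calculation described above can be sketched in NumPy as follows. This is a simplified training-time forward pass only: the function name `batch_norm_forward`, the small constant `eps` added for numerical stability, and the sample batch are assumptions for illustration, and in a real network g and b would be updated by the optimizer rather than set by hand.

```python
import numpy as np

def batch_norm_forward(batch, g, b, eps=1e-5):
    """Normalize each feature over the batch, then scale by g and shift by b."""
    mean = batch.mean(axis=0)            # per-feature mean over the batch
    var = batch.var(axis=0)              # per-feature variance over the batch
    normalized = (batch - mean) / np.sqrt(var + eps)
    return g * normalized + b

# Hypothetical batch of 4 samples with 3 features on very different scales.
batch = np.array([[1.0, 200.0, 0.5],
                  [2.0, 400.0, 1.0],
                  [3.0, 600.0, 1.5],
                  [4.0, 800.0, 2.0]])

# g sets the new standard deviation and b sets the new mean of each feature.
g = np.full(3, 2.0)
b = np.full(3, 5.0)

out = batch_norm_forward(batch, g, b)
print(out.mean(axis=0))
print(out.std(axis=0))
```

Whatever scale the three features started on, each one comes out with a mean close to b and a standard deviation close to g.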

```python
model = Sequential([
    Dense(16, input_shape=(1,5), activation='relu'),
    Dense(32, activation='relu'),
    BatchNormalization(axis=1),
    Dense(2, activation='softmax')
])
```

We have a model with two hidden layers with 16 and 32 nodes respectively, both using relu() as their activation functions, and an output layer with two output categories using the softmax() activation function.

The only difference here is the line between the last hidden layer and the output layer.

This is how we specify batch normalization in Keras. Following the layer for which we want the activation output normalized, we specify a BatchNormalization object. To do this, we need to import BatchNormalization from Keras, as shown below.

```python
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization
```

The only parameter that we’re specifying for BatchNormalization is the axis parameter, and that is just to specify the axis from the data that should be normalized, which is typically the features axis.

There are several other parameters that we can optionally specify, including two called beta_initializer and gamma_initializer. These are the initializers for the arbitrarily set parameters that we mentioned when we were describing how batch norm works.

These are set by default to 0 and 1 by Keras, but we can optionally change these, along with several other optionally specified parameters.



### Zero Padding in Convolutional Neural Networks


We’ve seen in our post on CNNs that each convolutional layer has some number of filters that we define, and we also define the dimension of these filters as well. We also showed how these filters convolve image input.

When a filter convolves a given input channel, it gives us an output channel. This output channel is a matrix of pixels with the values that were computed during the convolutions that occurred on the input channel.

**When this happens, the dimensions of our image are reduced.**

Let’s check this out using the same image of a seven that we used in our previous post on CNNs. Recall, we have a 28 x 28 matrix of the pixel values from an image of a 7 from the MNIST data set. We'll use a 3 x 3 filter. This gives us the following items:

![](./img/diag4.png)

![](./img/diag5.png)

![](./img/diag6.png)

We can see that the output is actually not the same size as the original input. The output size is 26 x 26. Our original input channel was 28 x 28, and now we have an output channel that has shrunk in size to 26 x 26 after convolving the image. Why is that?

With our 28 x 28 image, our 3 x 3 filter can only fit into 26 x 26 possible positions, not all 28 x 28. Given this, we get the resulting 26 x 26 output. This is due to what happens when we convolve the edges of our image.

For ease of visualizing this, let’s look at a smaller scale example. Here we have an input of size 4 x 4 and then a 3 x 3 filter. Let’s look at how many times we can convolve our input with this filter, and what the resulting output size will be.

![](http://deeplizard.com/images/zero%20padding%20example.PNG)

This means that when this 3 x 3 filter finishes convolving this 4 x 4 input, it will give us an output of size 2 x 2.

We see that the resulting output is 2 x 2, while our input was 4 x 4, and so again, just like in our larger example with the image of a seven, we see that our output is indeed smaller than our input in terms of dimensions.

We can know ahead of time by how much our dimensions are going to shrink. In general, if our image is of size n x n, and we convolve it with an f x f filter, then the size of the resulting output is

$$(n-f+1) \times (n-f+1)$$

Indeed, for our 4 x 4 input and 3 x 3 filter, this gives 4 - 3 + 1 = 2, so a 2 x 2 output channel, which is exactly what we saw a moment ago. This holds up for the example with the larger input of the seven as well: 28 - 3 + 1 = 26, giving the 26 x 26 output.
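The formula is easy to check with a tiny helper. The function name `conv_output_size` is hypothetical, and the sketch assumes the filter moves one pixel at a time (a stride of 1) with no padding, which matches the examples so far.

```python
def conv_output_size(n, f):
    """Side length of the output when an f x f filter convolves an n x n input
    with no padding and a stride of 1."""
    return n - f + 1

for n, f in [(4, 3), (28, 3)]:
    size = conv_output_size(n, f)
    print(f"{n} x {n} input, {f} x {f} filter -> {size} x {size} output")
```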


#### Issues with reducing the dimensions

Consider the resulting output of the image of a seven again. It doesn’t really appear to be a big deal that this output is a little smaller than the input, right?

We didn’t lose that much data or anything because most of the important pieces of this input are kind of situated in the middle. But we can imagine that this would be a bigger deal if we did have meaningful data around the edges of the image.

Additionally, we only convolved this image with one filter. What happens as this original input passes through the network and gets convolved by more filters as it moves deeper and deeper?

Well, what’s going to happen is that the resulting output is going to continue to become smaller and smaller. This is a problem.

If we start out with a 4 x 4 image, for example, then just after a convolutional layer or two, the resulting output may become almost meaningless with how small it becomes. Another issue is that we’re losing valuable data by completely throwing away the information around the edges of the input.

What can we do here? Cue the superhero music because this is where zero padding comes into play.

#### Zero padding to the rescue

Zero padding is a technique that allows us to preserve the original input size. This is something that we specify on a per-convolutional layer basis. With each convolutional layer, just as we define how many filters to have and the size of the filters, we can also specify whether or not to use padding.

Zero padding occurs when we add a border of pixels all with value zero around the edges of the input images. This adds kind of a padding of zeros around the outside of the image, hence the name zero padding. Going back to our small example from earlier, if we pad our input with a border of zero valued pixels, let’s see what the resulting output size will be after convolving our input.

![](http://deeplizard.com/images/zero%20padding%20example%202.PNG)

We see that our output size is indeed 4 x 4, maintaining the original input size. Now, sometimes we may need to add more than a border that’s only a single pixel thick. Sometimes we may need to add something like a double border or triple border of zeros to maintain the original size of the input. This is just going to depend on the size of the input and the size of the filters.
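The small example can be reproduced with NumPy's `np.pad`. The helper `correlate2d_valid`, the input values, and the all-ones filter are assumptions for illustration. For an f x f filter with odd f and a stride of 1, a border of (f - 1) / 2 pixels is what keeps the size unchanged, so a 3 x 3 filter needs a border that is one pixel thick.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # hypothetical 4 x 4 input
kernel = np.ones((3, 3))                           # hypothetical 3 x 3 filter

# Without padding, the output shrinks.
unpadded = correlate2d_valid(image, kernel)

# Add a one-pixel border of zeros around the input, then convolve again.
padded_image = np.pad(image, pad_width=1, mode="constant", constant_values=0)
padded = correlate2d_valid(padded_image, kernel)

print(unpadded.shape)      # (2, 2)
print(padded_image.shape)  # (6, 6)
print(padded.shape)        # (4, 4)
```

The center of the padded result is identical to the unpadded result; padding only adds the extra border positions where the filter overlaps the zeros.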

The good thing is that most neural network APIs figure the size of the border out for us. All we have to do is just specify whether or not we actually want to use padding in our convolutional layers.

#### Valid and same padding

There are two categories of padding. One is referred to by the name valid. This just means no padding. If we specify valid padding, that means our convolutional layer is not going to pad at all, and our input size won’t be maintained.

The other type of padding is called same. This means that we want to pad the original input before we convolve it so that the output size is the same size as the input size.



```python
import keras
from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten
```

Now, we’ll create a completely arbitrary CNN.

```python
model_valid = Sequential([
    Dense(16, input_shape=(20,20,3), activation='relu'),
    Conv2D(32, kernel_size=(3,3), activation='relu', padding='valid'),
    Conv2D(64, kernel_size=(5,5), activation='relu', padding='valid'),
    Conv2D(128, kernel_size=(7,7), activation='relu', padding='valid'),
    Flatten(),
    Dense(2, activation='softmax')
])
```

We’ve specified that the input size of the images that are coming into this CNN is 20 x 20, and our first convolutional layer has a filter size of 3 x 3, which is specified in Keras with the kernel_size parameter. Then, the second conv layer specifies size 5 x 5, and the third, 7 x 7.

With this model, we’re specifying the parameter called padding for each convolutional layer. We’re setting this parameter equal to the string 'valid'. Remember from earlier that valid padding means no padding.

This is actually the default for convolutional layers in Keras, so if we don’t specify this parameter, it’s going to default to valid padding. Since we’re using valid padding here, we expect the dimension of our output from each of these convolutional layers to decrease.

Let’s check. Here is the summary of this model.

```python
model_valid.summary()
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
dense_1 (Dense)              (None, 20, 20, 16)        64
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 18, 18, 32)        4640
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 64)        51264
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 8, 128)         401536
_________________________________________________________________
flatten_1 (Flatten)          (None, 8192)              0
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 16386
Total params: 473,890
Trainable params: 473,890
Non-trainable params: 0
_________________________________________________________________
```


We can see the output shape of each layer in the second column. The first two integers specify the dimension of the output in height and width. Starting with our first layer, we see our output size is the original size of our input, 20 x 20.

Once we get to the output of our first convolutional layer, the dimensions decrease to 18 x 18, and again at the next layer, it decreases to 14 x 14, and finally, at the last convolutional layer, it decreases to 8 x 8.

So, we start with 20 x 20 and end up with 8 x 8 when it’s all done and over with.
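These numbers can be reproduced without running Keras at all. The sketch below uses two hypothetical helpers, `trace_sizes` and `conv2d_params`, and assumes a stride of 1 throughout. It walks the spatial size through the three conv layers for both padding modes and recomputes the `Param #` column for the conv layers as (filter height x filter width x input channels + 1 bias) x number of filters.

```python
def trace_sizes(input_size, kernel_sizes, padding):
    """Return the spatial size after each conv layer, starting with the input size."""
    sizes = [input_size]
    for f in kernel_sizes:
        n = sizes[-1]
        if padding == "valid":
            sizes.append(n - f + 1)   # no padding: the output shrinks
        elif padding == "same":
            sizes.append(n)           # padded so the output matches the input
        else:
            raise ValueError(f"unknown padding: {padding}")
    return sizes

def conv2d_params(f, in_channels, filters):
    """Weights plus one bias per filter for a conv layer with f x f filters."""
    return (f * f * in_channels + 1) * filters

kernel_sizes = [3, 5, 7]
valid_sizes = trace_sizes(20, kernel_sizes, "valid")
same_sizes = trace_sizes(20, kernel_sizes, "same")
print(valid_sizes)  # [20, 18, 14, 8]
print(same_sizes)   # [20, 20, 20, 20]

# (filter size, input channels, number of filters) for the three conv layers.
layers = [(3, 16, 32), (5, 32, 64), (7, 64, 128)]
params = [conv2d_params(f, c, k) for f, c, k in layers]
print(params)       # [4640, 51264, 401536]
```

The valid-padding trace matches the 20, 18, 14, 8 progression in the summary, while same padding would keep every conv output at 20 x 20.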

By contrast, we can now create a second model.

In [None]:
model_same = 