![](https://mlnotebook.github.io/img/CNN/convSobel.gif)

# <p style="text-align: center;"> Table of Contents </p>
- ## 1. [Introduction](#Introduction)
   - ### 1.1 [Abstract](#abstract)
- ## 2. [Understanding Convolution Operations](#Understanding_Convolution_Operations)
   - ### 2.1 [Edge Detection Example](#Edge_Detection_Example)
   - ### 2.2 [More Edge Detection](#More)
- ## 3. [Padding](#Padding)
- ## 4. [Strided Convolutions](#Strided)
- ## 5. [Convolutions Over Volume](#Convolutions)
- ## 6. [One Layer of a Convolutional Network](#One_Layer)
  - ### 6.1 [Simple Convolutional Network Example](#Simple)
- ## 7. [Pooling](#Pooling)
- ## 8. [CNN Example](#cnn)
  - ### 8.1 [Why Convolutions?](#con)
- ## 9. [A brief overview of Imitation Learning](#imi)
  - ### 9.1 [Basics of Imitation Learning](#bimi)
- ## 10. [Contribution](#Contribution)
- ## 11. [Citation](#Citation) 
- ## 12. [License](#License)

# <a id="Introduction"> 1 Introduction </a>
##   <a id='abstract'> 1.1 Abstract </a>

The main goals of this notebook are:
- To understand the convolution operation
- To understand the pooling operation
- To remember the vocabulary used in convolutional neural networks (padding, stride, filter, etc.)
- To build a convolutional neural network for multi-class image classification

#   <a id='Understanding_Convolution_Operations'> 2. Understanding Convolution Operations </a>

One major challenge in computer vision is that the input data can get really big. Suppose an image is of the size 64 X 64 X 3. The input feature dimension then becomes 64 × 64 × 3 = 12,288. This will be even bigger if we have larger images (say, of size 720 X 720 X 3). Now, if we pass such a big input to a fully connected neural network, the number of parameters will swell up to a HUGE number (depending on the number of hidden layers and hidden units). This results in computational and memory requirements that most of us cannot deal with.

#### We will explain the Convolution Operation with an example

## <a id="Edge_Detection_Example"> 2.1. Edge Detection Example </a>

The early layers of a neural network detect edges in an image. Deeper layers might be able to detect parts of objects, and even deeper layers might detect complete objects (like a person’s face).

In this section, we will focus on how the edges can be detected from an image. Suppose we are given the below image:

![](images/1.png)

As you can see, there are many vertical and horizontal edges in the image. The first thing to do is to detect these edges:

![](images/2.png)

Suppose the input is a 6 X 6 grayscale image, i.e. a 6 X 6 matrix of pixel values. We convolve this matrix with a 3 X 3 filter:

![](images/3.png)

After the convolution, we will get a 4 X 4 image. The first element of the 4 X 4 matrix will be calculated as:

![](images/4.png)

So, we take the first 3 X 3 block from the 6 X 6 image and multiply it element-wise with the filter. The first element of the 4 X 4 output is the sum of these element-wise products, i.e. (3×1) + (0×0) + (1×-1) + (1×1) + (5×0) + (8×-1) + (2×1) + (7×0) + (2×-1) = -5. To calculate the second element of the 4 X 4 output, we shift the filter one step to the right and again take the sum of the element-wise products:

![](images/5.png)

Similarly, we will convolve over the entire image and get a 4 X 4 output:

![](images/6.png)

So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4. Consider one more example:

![](images/7.png)
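
The sliding-window computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `conv2d_valid` is made up for this notebook, and only the top-left 3 X 3 block of the input and the filter are taken from the worked arithmetic above; the remaining pixel values are assumed, since the figures are not reproduced here.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(patch * kernel)  # sum of element-wise products
    return out

# Top-left 3 X 3 block matches the worked example; the other values are assumed.
image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])

# Vertical edge filter used in the arithmetic above.
vertical_edge = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

result = conv2d_valid(image, vertical_edge)
print(result.shape)   # (4, 4)
print(result[0, 0])   # -5.0
```

The explicit double loop is slow but mirrors the description step by step; real frameworks use vectorised implementations.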

## <a id="More"> 2.2. More Edge Detection </a>

The type of filter that we choose helps to detect the vertical or horizontal edges. We can use the following filters to detect different edges:

![](images/8.png)

Some of the commonly used filters are:

![](images/9.png)

The Sobel filter puts a little more weight on the central pixels. Instead of using these hand-designed filters, we can also treat the filter values as parameters that the model learns using backpropagation.
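
As a rough illustration of how the choice of filter selects the edge direction, the sketch below applies a Sobel-style vertical-edge filter and its transpose (a horizontal-edge filter) to a toy image. The image (bright left half, dark right half) and the helper name `conv2d_valid` are assumptions made for this example, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(image, kernel):
    """Stride-1, no-padding convolution (no kernel flip), vectorised."""
    windows = sliding_window_view(image, kernel.shape)  # (n-f+1, n-f+1, f, f)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Sobel filter for vertical edges: the middle row gets twice the weight.
sobel_vertical = np.array([
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
])
# Transposing it gives the filter for horizontal edges.
sobel_horizontal = sobel_vertical.T

# Toy 6 X 6 image (assumed): bright left half, dark right half.
image = np.zeros((6, 6), dtype=int)
image[:, :3] = 10

vertical_response = conv2d_valid(image, sobel_vertical)
horizontal_response = conv2d_valid(image, sobel_horizontal)

print(vertical_response[0])        # [ 0 40 40  0]
print(horizontal_response.any())   # False
```

The vertical filter responds strongly in the middle columns, where the brightness changes from left to right, while the horizontal filter gives zero everywhere because the image has no horizontal edge.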

# <a id="Padding"> 3. Padding </a>

We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4 output. We can generalize it and say that if the input is n X n and the filter size is f X f, then the output size will be (n-f+1) X (n-f+1):

> - Input: n X n
> - Filter size: f X f
> - Output: (n-f+1) X (n-f+1)

#### There are primarily two disadvantages here:

> 1. Every time we apply a convolution, the image shrinks
> 2. Pixels in the corners of the image are used far fewer times during convolution than the central pixels. The corners therefore get little weight, which can lead to information loss

To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6 matrix). Applying a 3 X 3 convolution to it results in a 6 X 6 matrix, which is the original size of the image. In general, with padding:

> - Input: n X n
> - Padding: p
> - Filter size: f X f
> - Output: (n+2p-f+1) X (n+2p-f+1)

There are two common choices for padding:

> #### 1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X (n-f+1)

> #### 2. Same: Here, we apply padding so that the output size is the same as the input size, i.e., n+2p-f+1 = n. So, p = (f-1)/2, which is why f is usually odd
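
The two padding choices can be checked numerically with a short sketch. The helper names `same_padding` and `output_size` are made up for this example; `np.pad` is the standard NumPy routine for adding a border, and zero padding is assumed.

```python
import numpy as np

def same_padding(f):
    """Padding p that keeps output size equal to input size (stride 1)."""
    if f % 2 == 0:
        raise ValueError("p = (f - 1) / 2 is only an integer for odd f")
    return (f - 1) // 2

def output_size(n, f, p=0):
    """Output side length for an n X n input, f X f filter, padding p, stride 1."""
    return n + 2 * p - f + 1

image = np.arange(36).reshape(6, 6)

p = same_padding(3)
# Add a border of p zeros on every side of the image.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

print(p)                      # 1
print(padded.shape)           # (8, 8)
print(output_size(6, 3))      # 4  ("valid")
print(output_size(6, 3, p))   # 6  ("same")
```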

We now know how to use padded convolution. This way we don’t lose a lot of information and the image does not shrink either. Next, we will look at how to implement strided convolutions.



# <a id="Strided"> 4. Strided Convolutions </a>

Suppose we choose a stride of 2. While convolving over the image, the filter then moves two pixels at a time, both in the horizontal and in the vertical direction. The dimensions for stride s will be:

> - Input: n X n
> - Padding: p
> - Stride: s
> - Filter size: f X f
> - Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1]

If (n+2p-f)/s is not an integer, it is rounded down.

A stride larger than 1 shrinks the output, which reduces the amount of computation in the following layers.
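
A minimal sketch of the stride formula, assuming zero padding and rounding down when the division is not exact. The function names `conv_output_size` and `conv2d_strided` are hypothetical helpers for this notebook, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def conv2d_strided(image, kernel, p=0, s=1):
    """Zero-pad by p, then apply the kernel at every s-th position."""
    padded = np.pad(image, p)
    # All stride-1 windows, then keep every s-th window in each direction.
    windows = sliding_window_view(padded, kernel.shape)[::s, ::s]
    return np.einsum("ijkl,kl->ij", windows, kernel)

image = np.arange(49).reshape(7, 7)
kernel = np.ones((3, 3), dtype=int)

out = conv2d_strided(image, kernel, s=2)
print(out.shape)                          # (3, 3)
print(conv_output_size(7, 3, p=0, s=2))   # 3
print(conv_output_size(6, 3, p=0, s=2))   # 2 (3 / 2 is rounded down)
```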

# <a id="Convolutions"> 5. Convolutions Over Volume </a>

Suppose, instead of a 2-D image, we have a 3-D input image of shape 6 X 6 X 3. How will we apply convolution on this image? We will use a 3 X 3 X 3 filter instead of a 3 X 3 filter. Let’s look at an example:

> - Input: 6 X 6 X 3
> - Filter: 3 X 3 X 3


The dimensions above represent the height, width and number of channels of the input and the filter. Keep in mind that the number of channels in the input and in the filter must be the same. Convolving the two results in an output of 4 X 4. Let’s understand it visually:

![](images/10.png)

Since there are three channels in the input, the filter will consequently also have three channels. After convolution, the output shape is a 4 X 4 matrix. So, the first element of the output is the sum of the element-wise product of the first 27 values from the input (9 values from each channel) and the 27 values from the filter. After that we convolve over the entire image.

Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s say the first filter will detect vertical edges and the second filter will detect horizontal edges from the image. If we use multiple filters, the output dimension will change. So, instead of having a 4 X 4 output as in the above example, we would have a 4 X 4 X 2 output (if we have used 2 filters):

![](images/11.png)

#### Generalized dimensions can be given as:

- Input: n X n X nc
- Filter: f X f X nc
- Padding: p
- Stride: s
- Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X nc’

Here, nc is the number of channels in the input and filter, while nc’ is the number of filters.
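
The generalized dimensions can be verified with a small sketch for stride 1 and no padding. The function name `conv_volume` and the `(n_filters, f, f, n_c)` layout of the filter array are choices made for this example, and the input and filter values are random placeholders.

```python
import numpy as np

def conv_volume(x, filters):
    """Convolve x of shape (n, n, n_c) with filters of shape (n_f, f, f, n_c).

    Stride 1, no padding. Each filter spans all input channels and
    produces one output channel.
    """
    n_h, n_w, n_c = x.shape
    n_f, f_h, f_w, f_c = filters.shape
    if n_c != f_c:
        raise ValueError("input and filters must have the same number of channels")
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1, n_f))
    for k in range(n_f):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Sum over f * f * n_c products (27 values for a 3 X 3 X 3 filter).
                out[i, j, k] = np.sum(x[i:i + f_h, j:j + f_w, :] * filters[k])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))           # 6 X 6 X 3 input
filters = rng.standard_normal((2, 3, 3, 3))  # two 3 X 3 X 3 filters

out = conv_volume(x, filters)
print(out.shape)   # (4, 4, 2)
```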

 

# <a id="One_Layer" > 6. One Layer of a Convolutional Network</a>

Once we get an output after convolving over the entire image using a filter, we add a bias term to those outputs and finally apply an activation function to generate activations. This is one layer of a convolutional network. Recall that the equation for one forward pass is given by:

![](images/12.png)
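
In case the figure does not render, the standard forward-pass equations for the first layer can be written as follows. This assumes the figure uses the usual notation, with g denoting the activation function and b[1] the bias:

$$
z^{[1]} = w^{[1]} a^{[0]} + b^{[1]}, \qquad a^{[1]} = g\left(z^{[1]}\right)
$$

In a convolutional layer, the product of w[1] and a[0] is replaced by the convolution of the filters with the input.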

In our case, the input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. The activations from layer 1 act as the input for layer 2, and so on. Clearly, the number of parameters in a convolutional layer is independent of the size of the image. It depends only on the filter size and the number of filters. Suppose we have 10 filters, each of shape 3 X 3 X 3. What will be the number of parameters in that layer? Let’s try to solve this:

- Number of parameters for each filter = 3 × 3 × 3 = 27
- There will be a bias term for each filter, so total parameters per filter = 28
- As there are 10 filters, the total parameters for that layer = 28 × 10 = 280
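
The same count can be expressed as a small helper. The function name `conv_layer_params` and its argument names are hypothetical, chosen only for this sketch:

```python
def conv_layer_params(f, n_c_prev, n_filters, use_bias=True):
    """Number of learnable parameters in one convolutional layer.

    Each filter has f * f * n_c_prev weights plus (optionally) one bias.
    The image height and width do not appear anywhere in the formula.
    """
    per_filter = f * f * n_c_prev + (1 if use_bias else 0)
    return per_filter * n_filters

print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))   # 280
```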

No matter how big the image is, the number of parameters depends only on the filter size and the number of filters. Awesome, isn’t it? Let’s have a look at the summary of notations for a convolution layer:

- f[l] = filter size
- p[l] = padding
- s[l] = stride
- n[c][l] = number of filters

Let’s combine all the concepts we have learned so far and look at a convolutional network example.

## <a id="Simple"> 6.1. Simple Convolutional Network Example </a>

This is what a typical convolutional network looks like:

![](images/13.png)

We take an input image (size = 39 X 39 X 3 in our case), convolve it with 10 filters of size 3 X 3, and take the stride as 1 and no padding. This will give us an output of 37 X 37 X 10. We convolve this output further and get an output of 7 X 7 X 40 as shown above. Finally, we take all these numbers (7 X 7 X 40 = 1960), unroll them into a large vector, and pass them to a classifier that will make predictions. This is a microcosm of how a convolutional network works.
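
The shapes quoted above can be traced layer by layer with the output-size formula. Only the first layer (10 filters of size 3 X 3, stride 1, no padding) and the final 7 X 7 X 40 shape are stated in the text; the settings of the two intermediate layers below are assumptions chosen so that the shapes work out, and the helper `conv_output_shape` is made up for this sketch.

```python
def conv_output_shape(shape, f, n_filters, p=0, s=1):
    """Shape after a conv layer: floor((n + 2p - f) / s) + 1 per spatial side."""
    n_h, n_w, _ = shape
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return (out_h, out_w, n_filters)

layers = [
    dict(f=3, n_filters=10, s=1),   # stated in the text
    dict(f=5, n_filters=20, s=2),   # assumed
    dict(f=5, n_filters=40, s=2),   # assumed
]

shape = (39, 39, 3)
shapes = [shape]
for layer in layers:
    shape = conv_output_shape(shape, **layer)
    shapes.append(shape)
    print(shape)
# (37, 37, 10)
# (17, 17, 20)
# (7, 7, 40)

# Unroll the final volume into one long vector for the classifier.
flattened = shape[0] * shape[1] * shape[2]
print(flattened)   # 1960
```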

There are a number of hyperparameters that we can tweak while building a convolutional network. These include the number of filters, the filter size, the stride, the padding, etc. Just keep in mind that as we go deeper into the network, the height and width of the image shrink whereas the number of channels usually increases.

In a convolutional network (ConvNet), there are basically three types of layers:

- Convolution layer
- Pooling layer
- Fully connected layer

Let’s understand the pooling layer in the next section.



# <a id="Pooling"> 7. Pooling Layers </a>
Pooling layers are generally used to reduce the size of the inputs and hence speed up the computation. Consider a 4 X 4 matrix as shown below:

![](images/14.png)

Applying max pooling on this matrix will result in a 2 X 2 output:

![](images/15.png)

For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of size 2 and a stride of 2. These are the hyperparameters for the pooling layer. Apart from max pooling, we can also apply average pooling where, instead of taking the max of the numbers, we take their average. In summary, the hyperparameters for a pooling layer are:

- Filter size
- Stride
- Max or average pooling

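If the input of the pooling layer is nh X nw X nc, then the output will be [(nh-f)/s+1] X [(nw-f)/s+1] X nc, where each quotient is rounded down. Pooling is applied to each channel independently, so the number of channels does not change.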
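
A minimal single-channel sketch of max and average pooling is shown below. The function name `pool2d` and the 4 X 4 input values are assumptions made for illustration, since the figure's values are not reproduced here.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Max or average pooling of a 2-D array with window f and stride s."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    reduce = np.max if mode == "max" else np.mean
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = reduce(window)
    return out

# Assumed example values.
x = np.array([
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
])

print(pool2d(x, mode="max"))
# [[9. 2.]
#  [6. 3.]]
print(pool2d(x, mode="avg"))
# [[3.75 1.25]
#  [3.75 2.  ]]
```

Note that a pooling layer has hyperparameters (f, s, max or average) but no parameters to learn.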

 



# <a id="cnn"> 8. CNN Example </a>

We’ll take things up a notch now. Let’s look at how a convolutional neural network with convolutional and pooling layers works. Suppose we have an input of shape 32 X 32 X 3:

![](images/16.png)

There is a combination of convolution and pooling layers at the beginning, a few fully connected layers at the end and finally a softmax classifier to classify the input into various categories. There are a lot of hyperparameters in this network which we have to specify as well.

Generally, we start from a set of hyperparameters that has already worked well in published research. As seen in the above example, the height and width of the input shrink as we go deeper into the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3 to 10).

#### All of these concepts and techniques bring up a very fundamental question – why convolutions? Why not something else?



## <a id="con"> 8.1. Why Convolution? </a>

There are two major advantages of using convolutional layers over using just fully connected layers:

> - Parameter sharing
> - Sparsity of connections

Consider the below example:

![](images/17.png)

If we had used a fully connected layer instead, the number of weights would be 32 × 32 × 3 × 28 × 28 × 6, which is roughly 14 million. That is a lot of parameters for such a small image!

In a convolutional layer with 6 filters of size 5 X 5, each filter spans all 3 input channels, so the number of parameters is (5 × 5 × 3 + 1) × 6 = 456 (for a single-channel input it would be (5 × 5 + 1) × 6 = 156). Convolutional layers reduce the number of parameters and speed up the training of the model significantly.
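
The comparison can be reproduced with plain arithmetic. This sketch assumes a 32 X 32 X 3 input, a 28 X 28 X 6 output and 5 X 5 filters, as in the example; a 5 X 5 filter applied to a three-channel input has 5 × 5 × 3 weights plus one bias, which gives 456 parameters, whereas counting a single channel gives 156.

```python
# Fully connected: every input value connects to every output value
# (biases are not counted here).
n_inputs = 32 * 32 * 3     # 3,072
n_outputs = 28 * 28 * 6    # 4,704
fc_weights = n_inputs * n_outputs
print(fc_weights)          # 14450688

# Convolutional: 6 filters of size 5 X 5 X 3, each with one bias.
conv_params = (5 * 5 * 3 + 1) * 6
print(conv_params)         # 456

# Same count if the input had a single channel.
conv_params_single_channel = (5 * 5 * 1 + 1) * 6
print(conv_params_single_channel)   # 156
```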

In convolutions, we share the parameters while convolving through the input. The intuition behind this is that a feature detector, which is helpful in one part of the image, is probably also useful in another part of the image. So a single filter is convolved over the entire input and hence the parameters are shared.

The second advantage of convolution is the sparsity of connections. For each layer, each output value depends on a small number of inputs, instead of taking into account all the inputs.

 

# <a id="imi"> 9. A brief overview of Imitation Learning</a>

Reinforcement learning (RL) is one of the most interesting areas of machine learning, where an agent interacts with an environment by following a policy. In each state of the environment, it takes an action based on the policy and, as a result, receives a reward and transitions to a new state. The goal of RL is to learn an optimal policy which maximizes the long-term cumulative reward.

To achieve this, there are several RL algorithms and methods that use the received rewards to approximate the best policy. Generally, these methods perform really well. In some cases, though, the learning process is challenging. This can be especially true in an environment where the rewards are sparse (e.g. a game where we only receive a reward when the game is won or lost). To help with this issue, we can manually design reward functions that provide the agent with more frequent rewards. Also, in certain scenarios there isn’t any direct reward function (e.g. teaching a self-driving vehicle), so the manual approach is necessary.

However, manually designing a reward function that produces the desired behaviour can be extremely complicated. A feasible solution to this problem is imitation learning (IL). In IL, instead of trying to learn from sparse rewards or manually specifying a reward function, an expert (typically a human) provides us with a set of demonstrations. The agent then tries to learn the optimal policy by imitating the expert’s decisions.


## <a id="bimi"> 9.1. Basics of Imitation Learning</a>

Generally, imitation learning is useful when it is easier for an expert to demonstrate the desired behaviour than to specify a reward function that would generate the same behaviour, or to learn the policy directly.

The main component of IL is the environment, which is essentially a Markov Decision Process (MDP). The environment has a set of states S, a set of actions A, a transition model P(s’|s,a) (the probability that action a in state s leads to state s’) and an unknown reward function R(s,a). The agent performs actions in this environment based on its policy π. We also have the expert’s demonstrations (also known as trajectories) τ = (s0, a0, s1, a1, …), where the actions are based on the expert’s (“optimal”) policy π*. In some cases, we even have access to the expert at training time, which means that we can query the expert for more demonstrations or for evaluation. Finally, the loss function and the learning algorithm are the two main components in which the various imitation learning methods differ from each other.

### - Behavioural Cloning
The simplest form of imitation learning is behavioural cloning (BC), which focuses on learning the expert’s policy using supervised learning. An important example of behavioural cloning is ALVINN, a vehicle equipped with sensors that learned to map sensor inputs to steering angles and drive autonomously. This project was carried out in 1989 by Dean Pomerleau, and it was also the first application of imitation learning in general.

The way behavioural cloning works is quite simple. Given the expert’s demonstrations, we divide them into state-action pairs, treat these pairs as i.i.d. examples and apply supervised learning. The loss function can depend on the application. The algorithm is the following:

![](images/18.png)
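
To make the steps concrete, here is a toy sketch of behavioural cloning with a linear policy. Everything in it is an assumption made for illustration: the expert is a made-up linear controller `K_expert`, the states are sampled at random instead of being recorded from real trajectories, and the supervised learner is ordinary least squares (a mean squared error loss). It is not the method used by ALVINN or by any of the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a fixed linear controller a = K_expert @ s.
K_expert = np.array([[0.5, -1.0, 0.2]])

def expert_policy(state):
    return K_expert @ state

# 1. Collect demonstrations and split them into (state, action) pairs.
states = rng.standard_normal((500, 3))   # assumed: randomly sampled states
actions = states @ K_expert.T            # expert action for each state, (500, 1)

# 2. Treat the pairs as i.i.d. examples and fit a policy by supervised
#    learning; here, least squares on a linear model.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)   # W has shape (3, 1)

def learned_policy(state):
    return W.T @ state

# 3. Compare the cloned policy with the expert on a new state.
test_state = np.array([1.0, 2.0, -1.0])
print(expert_policy(test_state))    # [-1.7]
print(learned_policy(test_state))   # approximately [-1.7]
```

Because the toy expert is exactly linear and noise-free, the clone recovers it almost perfectly. Real demonstrations are noisy and cover only part of the state space.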

In some applications, behavioural cloning can work excellently. In the majority of cases, though, it can be quite problematic. The main reason for this is the i.i.d. assumption: supervised learning assumes that the state-action pairs are i.i.d., but in an MDP an action in a given state induces the next state, which breaks this assumption. This also means that errors made in different states add up, so a mistake made by the agent can easily put it into a state that the expert has never visited and the agent has never been trained on. In such states, the behaviour is undefined and this can lead to catastrophic failures.

![](images/19.png)

Still, behavioural cloning can work quite well in certain applications. Its main advantages are its simplicity and efficiency. Suitable applications are those where we don’t need long-term planning, where the expert’s trajectories cover the state space, and where committing an error doesn’t lead to fatal consequences. We should avoid using BC when any of these conditions does not hold.

#### Other types of imitation learning include:

- Direct Policy Learning (via Interactive Demonstrator)
- Inverse Reinforcement Learning

We won't go into the details of these here.

# <a id='Contribution'> 10. Contribution</a>
As this was a learning assignment, the majority of the code has been taken from the various Kaggle kernels listed below in [Citation](#Citation).
    
- Code by self : 45%
- Code from external Sources : 55%


# <a id='Citation'>11. Citation </a>

- https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/
- https://www.jessicayung.com/explaining-tensorflow-code-for-a-convolutional-neural-network/
- https://github.com/unccv/autonomous_driving
- https://towardsdatascience.com/basics-of-the-classic-cnn-a3dce1225add
- https://medium.com/@SmartLabAI/a-brief-overview-of-imitation-learning-8a8a75c44a9c
- https://github.com/hchkaiban/CarRacingImitationLearning/blob/master/Demo.mp4


# <a id='License'> 12. License </a>
Copyright (c) 2020 Manali Sharma, Rushabh Nisher

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

