# Demystifying Convolutional Neural Networks 

In this article we will concretely explain how a Convolutional Neural Network works and how it differs from a regular Artificial Neural Network.

![](https://image.slidesharecdn.com/nvidiaces2015presentationdeck-150105190022-conversion-gate02/95/visual-computing-the-road-ahead-nvidia-ceo-jenhsun-huang-at-ces-2015-30-638.jpg?cb=1424436369)

## Definition:

Simply put, a [**Convolutional Neural Network**](https://en.wikipedia.org/wiki/Convolutional_neural_network) is a deep learning model, a kind of multilayer perceptron similar to regular Artificial Neural Networks, that is most commonly applied to analyzing visual imagery.
The founding father of Convolutional Neural Networks is widely considered to be [**Yann LeCun**](https://en.wikipedia.org/wiki/Yann_LeCun), the well-known computer scientist working at Facebook, who was the first to use them to solve the handwritten digit recognition problem using the famous [**MNIST**](http://yann.lecun.com/exdb/mnist/) dataset.

![](https://img-0.journaldunet.com/XPrXteY4uHzxXounKVbjhBCh0aQ=/250x/smart/0744f6823657461ebd47a7af55f79a6c/ccmcms-jdn/10823697.jpg) 

Convolutional Neural Networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Neural_pathway_diagram.svg/512px-Neural_pathway_diagram.svg.png)

Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

![](https://slideplayer.com/10833762/39/images/7/Computer+Vision+What+we+see+What+a+computer+sees.jpg)

As you can see, we can't talk about any type of neural network without mentioning a little neuroscience: the human body (especially the brain) and its functions have been the primary inspiration for various deep learning models.

## The architecture of ConvNets:

![](http://www.lirmm.fr/~chaumont/images/CNN_ElectronicImaging2016.jpg)

As you can see in the illustration above, a ConvNet architecture is very similar to a regular ANN architecture, especially in the last layers of the network (the fully connected layers). You will also notice that a ConvNet accepts a volume as input instead of a vector.

Let's now explore the layers that constitute a ConvNet and the mathematical operations it performs to classify pictures based on the features and attributes it has learnt during the training process.

### Input Layer:

The input is usually an n × m × 3 RGB (short for Red, Green and Blue) image, unlike a regular Artificial Neural Network, which is fed an n × 1 vector. Nothing hard to grasp here.

![](http://www.sai-tai.com/blog/wp-content/uploads/2017/04/cross-section.png)

### Convolution Layer:

In the **Convolution layer** we compute the dot product between an area of the input image and a weight matrix called a **filter** (or kernel). The filter slides across the whole image, repeating the same dot product at every position. Two things should be mentioned:
- The filter must have the same number of channels as the input image.
- Deeper layers of the network commonly use more filters; the intuition is that the more filters we have, the more edges and features the layer can detect.
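
To make the sliding dot product concrete, here is a minimal sketch in plain Python. It assumes a single-channel image, no padding and a hypothetical 2 × 2 filter; the name `conv2d_single` is ours and not part of any library. With several channels you would also sum over the channel axis.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` (lists of lists), no padding."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the image patch under it
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image and 2x2 filter, chosen only for illustration
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d_single(image, kernel)
print(feature_map)  # [[0, -1, -2], [-1, 1, 2], [2, -1, -3]]
```

Note that the same four filter weights are reused at every position, which is what keeps the number of parameters small.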

![](https://cdn-images-1.medium.com/max/2000/1*wqZ0Q4mBaHKjqWx45GPIow.gif)

We calculate the dimensions of the output of the Convolution layer:

***Output Width:*** $$ \frac{W - F_w + 2P}{S} + 1 $$

***Output Height:*** $$ \frac{H - F_h + 2P}{S} + 1 $$

where:
- $W$: the width of the input image
- $H$: the height of the input image
- $F_{w}$: the width of the filter or kernel
- $F_{h}$: the height of the filter
- $P$: padding
- $S$: stride

The number of channels of the Convolution layer output equals the **number of filters** used during the convolution operation.
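
The formulas above translate directly into a small helper. This is a sketch rather than a library function: the names `conv_output_size` and `conv_output_shape` are ours, and the floor division is an assumption for the case where the stride does not divide evenly.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """One spatial dimension of the output: (W - F + 2P) / S + 1."""
    # Floor division is an assumption for strides that do not divide evenly
    return (input_size - filter_size + 2 * padding) // stride + 1


def conv_output_shape(height, width, filter_h, filter_w, num_filters,
                      padding=0, stride=1):
    """Height, width and channels of the convolution layer output."""
    out_h = conv_output_size(height, filter_h, padding, stride)
    out_w = conv_output_size(width, filter_w, padding, stride)
    # The output has one channel per filter
    return (out_h, out_w, num_filters)


print(conv_output_shape(28, 28, 5, 5, num_filters=32))             # (24, 24, 32)
print(conv_output_shape(28, 28, 5, 5, num_filters=32, padding=2))  # (28, 28, 32)
```

With a 5 × 5 filter, a padding of 2 keeps the 28 × 28 spatial size unchanged.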

#### Why Convolutions ?

You are probably asking yourself: why do we use convolutions in the first place? Why not flatten the input images from the beginning? If we did that, we would end up with a massive number of parameters to train, and most of us don't have the computational power to train them in a reasonable amount of time.
In addition, because ConvNets have fewer parameters, they are less prone to overfitting.
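
A quick back-of-the-envelope count shows the difference. The sketch below compares a fully connected layer on a flattened 28 × 28 × 1 image with a convolution layer of 32 filters of size 5 × 5; the layer sizes are borrowed from the MNIST network built later in this article, and the helper names are ours.

```python
def dense_params(num_inputs, num_units):
    # One weight per (input, unit) pair plus one bias per unit
    return num_inputs * num_units + num_units


def conv_params(filter_h, filter_w, in_channels, num_filters):
    # Each filter has filter_h * filter_w * in_channels weights plus one bias
    return filter_h * filter_w * in_channels * num_filters + num_filters


dense_count = dense_params(28 * 28 * 1, 1024)
conv_count = conv_params(5, 5, 1, 32)

print(dense_count)  # 803840
print(conv_count)   # 832
```

The convolution layer needs far fewer parameters because the same filter weights are shared across every position of the image.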

### Pooling Layer:

There are two widely used types of pooling, average pooling and max pooling, the latter being the more common of the two.
The pooling layer is used to reduce the spatial dimensions, but not the depth, of the volume flowing through a convolutional neural network.
With max pooling we take the highest number (the most responsive area in the image) in each area of the input (an n × m matrix), whereas with average pooling we take the mean of each area instead.
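
Both kinds of pooling can be sketched with one small function. This is a minimal plain-Python illustration on a hypothetical 4 × 4 feature map with a 2 × 2 window and a stride of 2; the name `pool2d` is ours.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2-D feature map (lists of lists)."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values inside the current pooling window
            window = [feature_map[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical feature map, chosen only for illustration
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

print(pool2d(feature_map, mode="max"))  # [[6, 4], [7, 9]]
print(pool2d(feature_map, mode="avg"))  # [[3.75, 2.25], [3.5, 5.0]]
```

The 4 × 4 input shrinks to 2 × 2 in both cases; only the way each window is summarized differs.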

![](https://shafeentejani.github.io/assets/images/pooling.gif)

#### Why Pooling ?

One of the core goals of the pooling layer (max pooling in this case) is to provide spatial invariance, which simply means that you or the machine will be capable of recognizing an object as an object even when its position or appearance varies in some way.
For a more in-depth explanation of the pooling layer, check this rigorous [**paper**](http://yann.lecun.org/exdb/publis/pdf/boureau-icml-10.pdf) co-authored by Yann LeCun.

### Non-linearity Layer:

In the Non-linearity layer we use the ReLU activation function most, if not all, of the time instead of the sigmoid or tanh activation functions.
The ReLU activation function returns 0 for every negative value in its input and returns the value unchanged for every positive value (for a more in-depth explanation of activation functions please check this [**article**](https://blog.goodaudience.com/artificial-neural-networks-explained-436fcf36e75) of mine).
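
In code, ReLU is a one-liner applied to every element. The sketch below uses a hypothetical 2 × 2 feature map; the name `relu` is ours.

```python
def relu(feature_map):
    """Apply ReLU element-wise: negatives become 0, positives pass through."""
    return [[max(0.0, value) for value in row] for row in feature_map]


activated = relu([[-2.0, 1.5], [0.5, -0.1]])
print(activated)  # [[0.0, 1.5], [0.5, 0.0]]
```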

![](https://cdn-images-1.medium.com/max/800/1*6HyqifN4M_bJ7DTJ0RFRJA.jpeg)     

### Fully Connected Layer:

In the FC layer we flatten the output of the last convolution or pooling layer and connect every node of the current layer to every node of the next layer. A fully connected layer is just another name for a regular artificial neural network, as you will see in the image below.
The operations in the fully connected layer are exactly the same as in any artificial neural network:

$$ y = \sigma(\sum_{i=1}^{n} \theta_{i}^{T} x_i + b) $$
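
The equation above can be sketched in a few lines of plain Python. The example assumes a sigmoid for $\sigma$ and uses hypothetical weights and a tiny 2 × 2 × 1 volume; `flatten` and `dense_forward` are names we introduce for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def flatten(volume):
    """Turn an H x W x C volume (nested lists) into a flat vector."""
    return [value for row in volume for pixel in row for value in pixel]


def dense_forward(inputs, weights, biases, activation=sigmoid):
    """weights[j] holds the weights connecting every input to unit j."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs plus the bias, then the activation
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical 2x2x1 volume, weights and biases, chosen for illustration
volume = [[[1.0], [0.0]], [[0.0], [2.0]]]
weights = [[0.5, -1.0, 0.25, 0.0], [0.0, 0.0, 0.0, 1.0]]
biases = [-0.5, 0.0]

x = flatten(volume)
y = dense_forward(x, weights, biases)
print(x)  # [1.0, 0.0, 0.0, 2.0]
print(y)
```

The first unit sees a weighted sum of exactly 0, so its sigmoid output is 0.5.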

![](https://cdn-images-1.medium.com/max/800/1*Zd5ScCO-meZl9yrCw6ZC0Q.jpeg)

![](https://image.slidesharecdn.com/styletemp-161002182243/95/deep-learning-behind-prisma-8-638.jpg?cb=1475432629)

The layers and operations discussed above are the core components of every Convolutional neural network.

Now that we've discussed the operations that a ConvNet goes through in a forward pass let's jump to the operations that a ConvNet goes through in a backward pass.

## Backpropagation:

### Fully Connected Layer:

In the fully connected layer, backpropagation works exactly as in any regular artificial neural network. Using gradient descent as the optimization algorithm, we need the partial derivative of the loss function with respect to the weights, $\frac{\partial J(\theta_{i})}{\partial \theta_{i}}$. To calculate it we use a well-known rule from calculus called **the chain rule**, which (in the backpropagation context) multiplies three terms: the derivative of the loss function w.r.t. the activated output, $\frac{\partial J(\theta_{i})}{\partial \sigma(\sum_{i=1}^{n} \theta_{i}^{T} x_i + b)}$; the derivative of the activated output w.r.t. the non-activated output, $\frac{\partial \sigma(\sum_{i=1}^{n} \theta_{i}^{T} x_i + b)}{\partial \sum_{i=1}^{n} \theta_{i}^{T} x_i + b}$; and the derivative of the non-activated output w.r.t. the weights, $\frac{\partial \sum_{i=1}^{n} \theta_{i}^{T} x_i + b}{\partial \theta_{i}}$.
$$ \frac{\partial J(\theta_{i})}{\partial \theta_{i}} := \frac{\partial J(\theta_{i})}{\partial \sigma(\sum_{i=1}^{n} \theta_{i}^{T} x_i + b)} \bullet \frac{\partial \sigma(\sum_{i=1}^{n} \theta_{i}^{T} x_i + b)}{\partial \sum_{i=1}^{n} \theta_{i}^{T} x_i + b} \bullet \frac{\partial \sum_{i=1}^{n} \theta_{i}^{T} x_i + b}{\partial \theta_{i}} $$
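
To see the three factors at work, here is a minimal sketch for a single neuron. It assumes a sigmoid activation and a squared-error loss $J = \frac{1}{2}(a - t)^2$, neither of which is fixed by the text above, and it checks the chain-rule result against a numerical estimate. All names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(theta, x, b):
    z = sum(t * xi for t, xi in zip(theta, x)) + b  # non-activated output
    return z, sigmoid(z)                            # activated output


def loss(a, target):
    return 0.5 * (a - target) ** 2


def chain_rule_gradient(theta, x, b, target):
    _, a = forward(theta, x, b)
    dJ_da = a - target      # loss w.r.t. activated output
    da_dz = a * (1.0 - a)   # activated w.r.t. non-activated output (sigmoid)
    # non-activated output w.r.t. weight i is simply x_i
    return [dJ_da * da_dz * xi for xi in x]


def numerical_gradient(theta, x, b, target, eps=1e-6):
    grads = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        j_plus = loss(forward(plus, x, b)[1], target)
        j_minus = loss(forward(minus, x, b)[1], target)
        grads.append((j_plus - j_minus) / (2 * eps))
    return grads


theta, x, b, target = [0.5, -0.3], [1.0, 2.0], 0.1, 1.0
analytic = chain_rule_gradient(theta, x, b, target)
numeric = numerical_gradient(theta, x, b, target)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places, which is a handy sanity check whenever you derive gradients by hand.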


![](https://matthewmazur.files.wordpress.com/2018/03/output_1_backprop-4.png?w=525)

After calculating the gradient, we scale it by the learning rate and subtract it from the current weights to get the updated ones:
$$ \theta_{i + 1} := \theta_{i} - \alpha \nabla J(\theta_{i}) $$
where:
- $ \theta_{i + 1} :$ updated weights
- $ \theta_{i} :$ current weights
- $ \alpha :$ learning rate
- $ \nabla J(\theta_{i}) :$ gradient of the loss function


![](https://cdn-images-1.medium.com/max/1600/0*rBQI7uBhBKE8KT-X.png)

In the animation below, gradient descent is applied to linear regression. You can clearly see that the more the cost function is minimized, the better the linear model fits the data.

![](https://imamdigmi.github.io/images/memahami-epoch-batch-size-iteration/gradient-descent.gif)

Note that you should choose the learning rate carefully: a very high learning rate can cause the updates to overshoot the target minimum.
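
The effect of the learning rate is easy to reproduce. The sketch below applies the update rule $\theta := \theta - \alpha \nabla J(\theta)$ to a toy cost $J(\theta) = \theta^2$, whose minimum is at 0. The cost function, the starting point and both learning rates are assumptions chosen for illustration.

```python
def gradient_descent(gradient, theta, learning_rate, num_steps):
    """Repeat theta := theta - alpha * gradient(theta) and keep the history."""
    history = [theta]
    for _ in range(num_steps):
        theta = theta - learning_rate * gradient(theta)
        history.append(theta)
    return history


def cost_gradient(theta):
    # J(theta) = theta ** 2, so dJ/dtheta = 2 * theta
    return 2.0 * theta


careful = gradient_descent(cost_gradient, theta=5.0, learning_rate=0.1, num_steps=20)
reckless = gradient_descent(cost_gradient, theta=5.0, learning_rate=1.1, num_steps=20)

print(careful[-1])   # close to the minimum at 0
print(reckless[-1])  # far away from it: every step overshoots
```

With $\alpha = 0.1$ each step shrinks $\theta$ by a factor of 0.8, while with $\alpha = 1.1$ each step multiplies it by -1.2, so the iterates bounce across the minimum and grow.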

![](https://cdn-images-1.medium.com/max/1600/0*QwE8M4MupSdqA3M4.png)

In all optimization tasks, whether in physics, economics or computer science, partial derivatives are used overwhelmingly. A partial derivative measures the rate of change of a dependent variable $ f(x, y, z) $ with respect to one of its independent variables while the remaining variables are held constant. For example, imagine you own a share of a company whose stock goes up or down based on multiple factors (security, politics, sales revenue, etc.). To apply partial derivatives to your situation, you would calculate how much the stock price changes when one factor (security, for example) changes while the other factors remain constant, and then repeat the same process for each of the other factors.
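
The "change one factor, hold the others constant" idea can be sketched numerically with a central difference. The function $f(x, y, z) = xy + z^2$ below is a hypothetical example, and `partial_derivative` is a name we introduce for illustration.

```python
def partial_derivative(f, point, index, eps=1e-6):
    """Estimate df/d(point[index]) while holding the other variables fixed."""
    plus, minus = list(point), list(point)
    plus[index] += eps
    minus[index] -= eps
    return (f(*plus) - f(*minus)) / (2 * eps)


def f(x, y, z):
    # Hypothetical function: df/dx = y, df/dy = x, df/dz = 2z
    return x * y + z ** 2


point = (2.0, 3.0, 4.0)
df_dx = partial_derivative(f, point, 0)
df_dy = partial_derivative(f, point, 1)
df_dz = partial_derivative(f, point, 2)
print(df_dx, df_dy, df_dz)  # approximately 3, 2 and 8
```

Collecting the three partial derivatives into a vector gives the gradient $\nabla f$ used in the update rule.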

### Pooling Layer:

In the max pooling layer the gradient is backpropagated through the maximum values only, since slightly changing the non-maximum values won't affect the output.
In the process we build a mask that holds 1 at the position of each maximum value (before max pooling) and 0 at every non-maximum position, then use the **chain rule** to multiply the gradient by it.

![](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_4/BackPropagation_MaxPool.png)

Unlike the max pooling layer, in the average pooling layer the gradient passes through all the inputs (before average pooling), maximum and non-maximum alike, with each input in a pooling area receiving an equal share.
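
Both backward passes can be sketched in plain Python. The example assumes a 2 × 2 window with a stride of 2, a hypothetical 4 × 4 input and a hypothetical upstream gradient; when a window contains ties, this sketch routes the gradient to the first maximum it finds, which is our assumption.

```python
def max_pool_backward(inputs, upstream, size=2, stride=2):
    """Route each upstream gradient to the position of the window maximum."""
    grad = [[0.0] * len(inputs[0]) for _ in inputs]
    for i, upstream_row in enumerate(upstream):
        for j, value in enumerate(upstream_row):
            best = (i * stride, j * stride)
            for a in range(size):
                for b in range(size):
                    r, c = i * stride + a, j * stride + b
                    if inputs[r][c] > inputs[best[0]][best[1]]:
                        best = (r, c)
            # Mask of 1 at the maximum, 0 elsewhere, times the upstream gradient
            grad[best[0]][best[1]] += value
    return grad


def avg_pool_backward(inputs, upstream, size=2, stride=2):
    """Share each upstream gradient equally among the inputs of its window."""
    grad = [[0.0] * len(inputs[0]) for _ in inputs]
    for i, upstream_row in enumerate(upstream):
        for j, value in enumerate(upstream_row):
            for a in range(size):
                for b in range(size):
                    grad[i * stride + a][j * stride + b] += value / (size * size)
    return grad


inputs = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
upstream = [[1.0, 2.0], [3.0, 4.0]]

max_grad = max_pool_backward(inputs, upstream)
avg_grad = avg_pool_backward(inputs, upstream)
print(max_grad)
print(avg_grad)
```

In `max_grad` only four entries are non-zero, one per pooling window, while `avg_grad` has a non-zero entry everywhere.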

### Convolution layer:

You are probably asking yourself right now: if the forward pass of a convolution layer is a convolution, then what is its backward pass? Luckily, its backward pass is also a convolution (as you can clearly see below), so you don't need to worry about learning a new set of hard-to-grasp mathematical operations.

![](https://cdn-images-1.medium.com/max/1000/1*CkzOyjui3ymVqF54BR6AOQ.gif)

where:
- $\partial{h_{ij}}$: the derivative of the loss function w.r.t the output of the convolution layer

![](https://cdn-images-1.medium.com/max/800/1*VruqyvXfFMrFCa3E9U6Eog.png)

This is in a nutshell how backpropagation works in a Convolution layer.
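
As a sketch of this idea, the gradient of the loss w.r.t. the filter can be computed by sliding $\partial h$ over the input, which is the same operation as the forward pass. The example assumes a single channel, stride 1, no padding and hypothetical values, and it verifies the result numerically. Like most deep learning code, the sliding dot product here is technically a cross-correlation, hence the name `correlate2d`.

```python
def correlate2d(image, kernel):
    """Sliding dot product of `kernel` over `image` (stride 1, no padding)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]


def filter_gradient(image, upstream):
    # dJ/dW: slide the upstream gradient (dJ/dh) over the input
    return correlate2d(image, upstream)


def surrogate_loss(image, kernel, upstream):
    # Hypothetical scalar loss whose gradient w.r.t. the output h is `upstream`
    output = correlate2d(image, kernel)
    return sum(o * u for o_row, u_row in zip(output, upstream)
               for o, u in zip(o_row, u_row))


def numerical_filter_gradient(image, kernel, upstream, eps=1e-6):
    grads = []
    for a in range(len(kernel)):
        row = []
        for b in range(len(kernel[0])):
            plus = [list(r) for r in kernel]
            minus = [list(r) for r in kernel]
            plus[a][b] += eps
            minus[a][b] -= eps
            row.append((surrogate_loss(image, plus, upstream)
                        - surrogate_loss(image, minus, upstream)) / (2 * eps))
        grads.append(row)
    return grads


image = [[1, 2, 0], [0, 1, 3], [2, 1, 0]]
kernel = [[1, 0], [0, -1]]
upstream = [[1.0, 0.5], [-1.0, 2.0]]  # dJ/dh, same shape as the output

analytic = filter_gradient(image, upstream)
numeric = numerical_filter_gradient(image, kernel, upstream)
print(analytic)  # [[4.0, 7.0], [0.5, 1.5]]
```

The gradient w.r.t. the input is obtained in a similar way, by a "full" convolution of $\partial h$ with the flipped filter, which this sketch does not implement.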

Now that you have a robust theoretical understanding of Convolutional Neural Networks let's build our first ConvNet with TensorFlow.

## Convolutional Neural Network with TensorFlow:

### What is Tensorflow ?

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/TensorFlowLogo.svg/220px-TensorFlowLogo.svg.png)

[**TensorFlow**](https://opensource.com/article/17/11/intro-tensorflow) is an open source software library for numerical computation using data-flow graphs. It was originally developed by the Google Brain Team within Google's Machine Intelligence research organization for machine learning and deep neural networks research.

### What is a Tensor ?

A [**tensor**](https://en.wikipedia.org/wiki/Tensor) is an organized multidimensional array of numerical values. The order (also degree or rank) of a tensor is the dimensionality of the array needed to represent it.
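
As a small illustration, nested Python lists can stand in for tensors of increasing rank. The helper `tensor_shape` is ours and assumes regular, non-empty nesting.

```python
def tensor_shape(tensor):
    """Shape of a regular nested list; its length is the tensor's rank."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


scalar = 5.0                                   # rank 0
vector = [1.0, 2.0, 3.0]                       # rank 1
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # rank 2
rgb_image = [[[0, 0, 0], [255, 255, 255]],
             [[255, 0, 0], [0, 0, 255]]]       # rank 3: height x width x channels

for tensor in (scalar, vector, matrix, rgb_image):
    shape = tensor_shape(tensor)
    print("rank", len(shape), "shape", shape)
```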

![](https://cdn-images-1.medium.com/max/2000/1*_D5ZvufDS38WkhK9rK32hQ.jpeg)

### What is a Computational Graph ?

[**Computational graphs**](http://www.cs.columbia.edu/~mcollins/ff2.pdf) are a powerful formalism that have been extremely fruitful in deriving algorithms and software packages for neural networks and other models in machine learning. The basic idea in a computational graph is to express some model (for example a feedforward neural network) as a directed graph expressing a sequence of computational steps. Each step in the sequence corresponds to a vertex in the computational graph; each step corresponds to a simple operation that takes some inputs and produces some output as a function of its inputs.

In the illustrated graph below we have two inputs, $ w_{1} = x $ and $ w_{2} = y $. The inputs flow through the graph, where each node is a mathematical operation, to give us the following outputs:
- $ w_{3} = \cos x $ where the operation is the cosine trigonometric function
- $ w_{4} = \sin y $ where the operation is the sine trigonometric function
- $ w_{5} = w_{3} \bullet w_{4} $ where the operation is multiplication 
- $ w_{6} = \frac{w_{1}}{w_{2}} $ where the operation is division 
- $ w_{7} = w_{5} + w_{6} $ where the operation is addition 

![](http://www.columbia.edu/~ahd2125/static/img/2015-12-05/Fig1.png)

Now that we understand what a computational graph is, let's build our own in TensorFlow. We will build the same one as above.

### Code:

In [1]:
# Import the deep learning library (this notebook uses the TensorFlow 1.x API, e.g. tf.Session)
import tensorflow as tf

# Define our computational graph
W1 = tf.constant(5.0, name = "x")
W2 = tf.constant(3.0, name = "y")
W3 = tf.cos(W1, name = "cos")
W4 = tf.sin(W2, name = "sin")
W5 = tf.multiply(W3, W4, name = "mult")
W6 = tf.divide(W1, W2, name = "div")
W7 = tf.add(W5, W6, name = "add")

# Open the session
with tf.Session() as sess:

    cos = sess.run(W3)
    sin = sess.run(W4)
    mult = sess.run(W5)
    div = sess.run(W6)
    add = sess.run(W7)
    
    # Before running TensorBoard, make sure you have generated summary data in a log directory by creating a summary writer
    writer = tf.summary.FileWriter("./Desktop/ComputationGraph", sess.graph)
    # Flush the event file to disk so that TensorBoard can read it
    writer.close()
    
    # Once you have event files, run TensorBoard and provide the log directory
    # Command: tensorboard --logdir="path/to/logs" 
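
The values computed by the session above can be cross-checked without TensorFlow. The sketch below stores the same graph as a plain dictionary, where each vertex maps to an operation and the vertices it reads from, and evaluates it in order. The dictionary layout and the `evaluate` helper are our own illustration, not how TensorFlow represents graphs internally.

```python
import math

# Each vertex: name -> (operation, names of the vertices it reads from)
graph = {
    "w3": (math.cos, ["w1"]),
    "w4": (math.sin, ["w2"]),
    "w5": (lambda a, b: a * b, ["w3", "w4"]),
    "w6": (lambda a, b: a / b, ["w1", "w2"]),
    "w7": (lambda a, b: a + b, ["w5", "w6"]),
}


def evaluate(graph, inputs):
    """Run every vertex once, feeding it the values of its parents."""
    values = dict(inputs)
    # Dictionaries keep insertion order, and the vertices above are listed
    # so that each one comes after the vertices it depends on
    for name, (operation, parents) in graph.items():
        values[name] = operation(*(values[parent] for parent in parents))
    return values


# Same inputs as the TensorFlow constants: x = 5.0 and y = 3.0
values = evaluate(graph, {"w1": 5.0, "w2": 3.0})
for name in ("w3", "w4", "w5", "w6", "w7"):
    print(name, "=", values[name])
```

The value of `w7` should match what `sess.run(W7)` returns, up to floating point precision.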

## Visualization with Tensorboard:

### What is Tensorboard ?

[**TensorBoard**](https://github.com/tensorflow/tensorboard) is a suite of web applications for inspecting and understanding your TensorFlow runs and graphs. It is one of the biggest edges that Google's TensorFlow has over Facebook's [**PyTorch**](https://pytorch.org/).

![](https://cdn-images-1.medium.com/max/800/1*BIy_Sqob3hsC-oQkdd9UTg.jpeg)

Now that you have a robust understanding of ConvNets, TensorFlow and TensorBoard, let's build our first ConvNet that will recognize handwritten digits using the **MNIST** dataset.

![](https://camo.githubusercontent.com/b06741b45df8ffe29c7de999ab2ec4ff6b2965ba/687474703a2f2f6e657572616c6e6574776f726b73616e64646565706c6561726e696e672e636f6d2f696d616765732f6d6e6973745f3130305f6469676974732e706e67)

The architecture of our ConvNet will be a set of convolution, non-linearity and max-pooling layers similar to the [**LeNet-5**](http://yann.lecun.com/exdb/lenet/) architecture.

![](https://thumbs.gfycat.com/BonyTotalArthropods-size_restricted.gif)

### Code:

In [2]:
# Import the deep learning library
import tensorflow as tf
import time

In [3]:
# Import the MNIST dataset
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [4]:
# Network inputs and outputs
# The network's input is a 28×28 dimensional input
n = 28
m = 28
num_input = n * m # MNIST data input 
num_classes = 10 # MNIST total classes (0-9 digits)

# tf Graph input
X = tf.placeholder(tf.float32, [None, num_input])
Y = tf.placeholder(tf.float32, [None, num_classes])

In [5]:
# Storing the parameters of our LeNet-5 inspired Convolutional Neural Network
weights = {
   "W_ij": tf.Variable(tf.random_normal([5, 5, 1, 32])),
   "W_jk": tf.Variable(tf.random_normal([5, 5, 32, 64])),
   "W_kl": tf.Variable(tf.random_normal([7 * 7 * 64, 1024])),
   "W_lm": tf.Variable(tf.random_normal([1024, num_classes]))
    }

biases = {
   "b_ij": tf.Variable(tf.random_normal([32])),
   "b_jk": tf.Variable(tf.random_normal([64])),
   "b_kl": tf.Variable(tf.random_normal([1024])),
   "b_lm": tf.Variable(tf.random_normal([num_classes]))
    }
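
The `7 * 7 * 64` in `W_kl` comes from tracing the input shape through the network. The sketch below does that arithmetic in plain Python, assuming stride-1 convolutions and 2 × 2 max pooling with stride 2, both with `padding='SAME'`, for which TensorFlow's output size is the input size divided by the stride, rounded up. The helper names are ours.

```python
import math


def same_padding_output(size, stride):
    # With padding='SAME' the output size is ceil(input size / stride)
    return math.ceil(size / stride)


def trace_shapes(height, width, channels, conv_filters, pool_stride=2):
    """Shapes after each convolution and each pooling layer."""
    shapes = [(height, width, channels)]
    for num_filters in conv_filters:
        # Stride-1 'SAME' convolution keeps height and width
        height = same_padding_output(height, 1)
        width = same_padding_output(width, 1)
        shapes.append((height, width, num_filters))
        # Max pooling with stride 2 halves height and width
        height = same_padding_output(height, pool_stride)
        width = same_padding_output(width, pool_stride)
        shapes.append((height, width, num_filters))
    return shapes


shapes = trace_shapes(28, 28, 1, conv_filters=[32, 64])
print(shapes)
# [(28, 28, 1), (28, 28, 32), (14, 14, 32), (14, 14, 64), (7, 7, 64)]

final_h, final_w, final_c = shapes[-1]
flattened = final_h * final_w * final_c
print(flattened)  # 3136
```

The flattened size of 3136 is the number of rows of `W_kl`, the first fully connected weight matrix.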

In [6]:
# The hyper-parameters of our Convolutional Neural Network
learning_rate = 1e-3
num_steps = 500
batch_size = 128
display_step = 10

In [7]:
def ConvolutionLayer(x, W, b, strides=1):
    # Convolution Layer
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return x


def ReLU(x):
    # ReLU activation function
    return tf.nn.relu(x)


def PoolingLayer(x, k=2, strides=2):
    # Max Pooling layer
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, strides, strides, 1],
                          padding='SAME')


def Softmax(x):
    # Softmax activation function for the CNN's final output
    return tf.nn.softmax(x)


# Create model
def ConvolutionalNeuralNetwork(x, weights, biases):
    # MNIST data input is a 1-D row vector of 784 features (28×28 pixels)
    # Reshape to match picture format [Height x Width x Channel]
    # Tensor input becomes 4-D: [Batch Size, Height, Width, Channel]
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # Convolution Layer
    Conv1 = ConvolutionLayer(x, weights["W_ij"], biases["b_ij"])
    # Non-Linearity
    ReLU1 = ReLU(Conv1)
    # Max Pooling (down-sampling)
    Pool1 = PoolingLayer(ReLU1, k=2)

    # Convolution Layer
    Conv2 = ConvolutionLayer(Pool1, weights["W_jk"], biases["b_jk"])
    # Non-Linearity
    ReLU2 = ReLU(Conv2)
    # Max Pooling (down-sampling)
    Pool2 = PoolingLayer(ReLU2, k=2)
    
    # Fully connected layer
    # Reshape the Pool2 output to fit the fully connected layer input
    FC = tf.reshape(Pool2, [-1, weights["W_kl"].get_shape().as_list()[0]])
    FC = tf.add(tf.matmul(FC, weights["W_kl"]), biases["b_kl"])
    FC = ReLU(FC)

    # Output, class prediction
    output = tf.add(tf.matmul(FC, weights["W_lm"]), biases["b_lm"])
    
    return output

In [8]:
# Construct model
logits = ConvolutionalNeuralNetwork(X, weights, biases)
prediction = Softmax(logits)

In [9]:
# Softmax cross entropy loss function
loss_function = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=Y))
# Optimization using the Adam optimizer (an adaptive variant of gradient descent)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_process = optimizer.minimize(loss_function)
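
`tf.nn.softmax_cross_entropy_with_logits` applies the softmax internally, which is why the raw `logits` are passed to it rather than `prediction`. As a sketch of what it computes for a single example, here is a plain-Python version with hypothetical logits; subtracting the maximum before exponentiating is a common stability trick that we assume here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow and does not change the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot_label):
    """Negative log-probability assigned to the correct class."""
    return -sum(y * math.log(p)
                for y, p in zip(one_hot_label, probabilities) if y > 0)


# Hypothetical logits for a 3-class problem; the true class is the first one
logits = [2.0, 1.0, 0.1]
label = [1, 0, 0]

probabilities = softmax(logits)
example_loss = cross_entropy(probabilities, label)
print(probabilities)
print(example_loss)
```

The loss gets smaller as the probability of the correct class approaches 1; `tf.reduce_mean` then averages this quantity over the batch.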

In [10]:
# Evaluate model
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

In [11]:
# Record how the loss function and the accuracy vary over time during training
cost = tf.summary.scalar("cost", loss_function)
training_accuracy = tf.summary.scalar("accuracy", accuracy)
train_summary_op = tf.summary.merge([cost,training_accuracy])

In [12]:
train_writer = tf.summary.FileWriter("./Desktop/logs",
                                        graph=tf.get_default_graph())

In [13]:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

In [14]:
# Start training
with tf.Session() as sess:

    # Run the initializer
    sess.run(init)
    
    start_time = time.time()
    
    for step in range(1, num_steps+1):
        
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Run optimization op (backprop)
        sess.run(training_process, feed_dict={X: batch_x, Y: batch_y})
        
        if step % display_step == 0 or step == 1:
            # Calculate batch loss and accuracy
            loss, acc, summary = sess.run([loss_function, accuracy, train_summary_op], feed_dict={X: batch_x,
                                                                 Y: batch_y})
            train_writer.add_summary(summary, step)
            
            print("Step " + str(step) + ", Minibatch Loss= " + \
                  "{:.4f}".format(loss) + ", Training Accuracy= " + \
                  "{:.3f}".format(acc))
            
    end_time = time.time() 
    
    print("Time duration: " + str(int(end_time-start_time)) + " seconds")
    print("Optimization Finished!")
            
    # Calculate accuracy for 256 MNIST test images
    print("Testing Accuracy:", \
        sess.run(accuracy, feed_dict={X: mnist.test.images[:256],
                                      Y: mnist.test.labels[:256]}))

Step 1, Minibatch Loss= 74470.4844, Training Accuracy= 0.117
Step 10, Minibatch Loss= 20529.4141, Training Accuracy= 0.250
Step 20, Minibatch Loss= 14074.7539, Training Accuracy= 0.531
Step 30, Minibatch Loss= 7168.9839, Training Accuracy= 0.586
Step 40, Minibatch Loss= 4781.1060, Training Accuracy= 0.703
Step 50, Minibatch Loss= 3281.0979, Training Accuracy= 0.766
Step 60, Minibatch Loss= 2701.2451, Training Accuracy= 0.781
Step 70, Minibatch Loss= 2478.7153, Training Accuracy= 0.773
Step 80, Minibatch Loss= 2312.8320, Training Accuracy= 0.820
Step 90, Minibatch Loss= 2143.0774, Training Accuracy= 0.852
Step 100, Minibatch Loss= 1373.9169, Training Accuracy= 0.852
Step 110, Minibatch Loss= 1852.9535, Training Accuracy= 0.852
Step 120, Minibatch Loss= 1845.3500, Training Accuracy= 0.891
Step 130, Minibatch Loss= 1677.2566, Training Accuracy= 0.844
Step 140, Minibatch Loss= 1683.3661, Training Accuracy= 0.875
Step 150, Minibatch Loss= 1859.3821, Training Accuracy= 0.836
Step 160, Miniba

We have just finished building our first convolutional neural network. As you can see in the results above, the accuracy increased dramatically over the course of training, but there is still room for improving our ConvNet.

Let's now visualize our ConvNet in TensorBoard:

![](https://cdn-images-1.medium.com/max/1000/1*IG7juGoP2xRsyEZP8hA6hA.png)

![](https://cdn-images-1.medium.com/max/800/1*qfK6YdI0miD3wNhwrhISGA.jpeg)

## Conclusion:

Convolutional Neural Networks are powerful deep learning models that are applied in a wide range of fields, such as radiology. The use of ConvNets will only increase as datasets get bigger and problems become more sophisticated and challenging.

## References:

- https://en.wikipedia.org/wiki/Convolutional_neural_network
- https://en.wikipedia.org/wiki/Yann_LeCun
- http://yann.lecun.com/exdb/mnist/
- https://opensource.com/article/17/11/intro-tensorflow
- https://en.wikipedia.org/wiki/Tensor
- http://www.cs.columbia.edu/~mcollins/ff2.pdf
- https://github.com/tensorflow/tensorboard
- http://yann.lecun.com/exdb/lenet/