<a href="https://colab.research.google.com/github/MohamedElhossin/Vision/blob/master/CnnIntroduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to convolutional neural network**

First, let's take a practical look at a very simple convnet example. We will use our convnet to classify MNIST digits, with a densely-connected classifier on top. Even though our convnet will be very basic, it reaches a test accuracy of about 99.3% after five epochs of training.

The 6 lines of code below show you what a basic convnet looks like. It's a stack of Conv2D and MaxPooling2D layers.

*   A convnet takes as input tensors of shape (image_height, image_width, image_channels), not including the batch dimension.

*   We will configure our convnet to process inputs of size (28, 28, 1), which is the format of MNIST images. We do this by passing the argument input_shape=(28, 28, 1) to our first layer.

In [None]:
import keras
from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))

![Conv_basics](https://www.researchgate.net/publication/326963855/figure/fig2/AS:658367580213249@1533978471914/The-sub-convolution-pooling-neural-network.png)


**Display the architecture of our CNN:**

In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 26, 26, 64)        640       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 13, 13, 64)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 11, 11, 64)        36928     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 3, 3, 128)         73856     
Total params: 111,424
Trainable params: 111,424
Non-trainable params: 0
_________________________________________________________________


You can see above that the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as we go deeper in the network. The number of channels is controlled by the first argument passed to the Conv2D layers (e.g. 64 or 128).
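The shrinking sizes in the summary can be reproduced by hand. The sketch below is plain Python, not Keras code. The helper names `conv_out` and `pool_out` are made up for illustration, and they assume unpadded 3 × 3 convolutions with stride 1 and 2 × 2 pooling with stride 2, as in the model above.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Output length of non-overlapping pooling (stride == window), rounding down."""
    return size // window


size = 28
trace = [size]
# Same layer order as the model: conv, pool, conv, pool, conv
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    size = conv_out(size, 3) if layer == "conv" else pool_out(size)
    trace.append(size)

print(trace)  # [28, 26, 13, 11, 5, 3]
```

Note how the 11 × 11 map becomes 5 × 5: the pooling window does not fit a sixth time, so the last row and column are dropped.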




In [None]:
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

The next step is to feed our last output tensor (of shape (3, 3, 128)) into a densely-connected classifier network. These classifiers process vectors, which are 1D, whereas our current output is a 3D tensor. So first we flatten the 3D outputs to 1D, and then we add a few Dense layers on top, as the cell above does.


---

We are going to do 10-way classification, so we use a final layer with 10 outputs and a softmax activation. Here's what our network looks like now:
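As a rough illustration of what the softmax activation on the last layer does, here is a minimal NumPy sketch. The `scores` values are invented for the example and do not come from the trained model.

```python
import numpy as np


def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Made-up raw outputs for the 10 digit classes (index = digit)
scores = np.array([1.0, 2.0, 0.5, 0.1, 3.0, 0.0, -1.0, 0.3, 0.2, 1.5])
probs = softmax(scores)
predicted_digit = int(np.argmax(probs))

print(probs.sum())       # close to 1.0
print(predicted_digit)   # 4, the index of the largest score
```

The predicted class is simply the index with the highest probability.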


![Conv_dense](https://miro.medium.com/max/2000/0*HWj5PgxWxdcld_ye)

**Display the architecture of our CNN again:**

In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 26, 26, 64)        640       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 13, 13, 64)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 11, 11, 64)        36928     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 3, 3, 128)         73856     
_________________________________________________________________
flatten_1 (Flatten)          (None, 1152)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)              

As you can see, our (3, 3, 128) outputs were flattened into vectors of shape (1152,), before going through two Dense layers.
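The flattening step can be sketched with NumPy, assuming a hypothetical batch of two feature maps. The weights below are random stand-ins, not the trained Keras weights. They only show how a Dense layer with ReLU maps a 1152-vector to a 128-vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 2 outputs from the last Conv2D layer
feature_maps = rng.random((2, 3, 3, 128))

# Flatten: keep the batch axis, merge height, width and channels (3 * 3 * 128 = 1152)
flat = feature_maps.reshape(feature_maps.shape[0], -1)

# Dense(128, activation='relu') as a matrix product; random placeholder weights
weights = rng.standard_normal((1152, 128)) * 0.01
bias = np.zeros(128)
hidden = np.maximum(flat @ weights + bias, 0.0)

print(flat.shape, hidden.shape)  # (2, 1152) (2, 128)
```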


---


Now, let's train our convnet on the MNIST digits. 

*   Download the MNIST digit dataset, which comes already split into a training set and a test set
*   Reshape the images to (28, 28, 1) and scale the pixel values to the [0, 1] range
*   One-hot encode the labels of the 10 classes



In [None]:
from keras.datasets import mnist
from keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
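To make the preprocessing concrete, here is a small NumPy sketch of the two transformations applied above: scaling pixel values and one-hot encoding labels. The `one_hot` helper is a hypothetical stand-in written for illustration. The notebook itself relies on Keras's `to_categorical`.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Encode integer labels as rows with a single 1 at the label's index."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


# Pixel scaling: uint8 values in [0, 255] become floats in [0, 1]
pixels = np.array([0, 128, 255], dtype="uint8")
scaled = pixels.astype("float32") / 255

# Label encoding for three example digits
labels = np.array([5, 0, 4])
encoded = one_hot(labels)

print(encoded[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```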


**Let's compile and train our model:**


*   [Information about optimizers](https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f)

*   [Information about loss functions](https://towardsdatascience.com/understanding-different-loss-functions-for-neural-networks-dd1ed0274718f)



In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7f7f5f91b748>

**Let's evaluate the model on the test data:**


In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels)



In [None]:
test_acc

0.9929999709129333

## **What happened above?**

Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis).

*   For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue.
*   For a black-and-white picture, like the MNIST digits, the depth is 1 (levels of gray).


---



The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.

Filter vs. Feature map

This output feature map is still a 3D tensor: it has a width and a height.

Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters.

Feature map = 3D tensor

Filter = kernel of shape (kernel_height, kernel_width, input_depth) ==> each output channel = the 2D response map of one filter

Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept “presence of a face in the input,” for instance.



---

In the MNIST example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 64): it computes 64 filters over its input. Each of these 64 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input.
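The patch-by-patch computation can be sketched in NumPy. This is a deliberately naive loop with random filters, written for illustration under the assumption of no padding, stride 1, no bias and no activation. It is not how Keras implements Conv2D internally.

```python
import numpy as np


def conv2d_valid(image, filters):
    """image: (H, W, C_in); filters: (kh, kw, C_in, C_out) -> (H-kh+1, W-kw+1, C_out)."""
    h, w, _ = image.shape
    kh, kw, _, c_out = filters.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]  # one (kh, kw, C_in) patch
            # Apply every filter to the same patch: one response value per filter
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))               # stand-in for one MNIST digit
filters = rng.standard_normal((3, 3, 1, 64))  # 64 random 3 x 3 filters
response = conv2d_valid(image, filters)

print(response.shape)  # (26, 26, 64)
```

Each slice `response[:, :, k]` is the 26 × 26 response map of filter `k`.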





![5.1_8_response_map.png](https://github.com/ahmadelsallab/practical_dl/blob/master/Keras/notebooks/imgs/5.1_8_response_map.png?raw=true)

### **Convolutions are defined by two key parameters:**



*   Kernel size: the size of the patches extracted from the inputs. These are typically 3 × 3 or 5 × 5. In the example, they were 3 × 3, which is a common choice.

*   Output channels: the depth of the output feature map, that is, the number of filters computed by the convolution. The example started with a depth of 64 and ended with a depth of 128.
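These two parameters, together with the input depth, also determine how many trainable parameters a convolution layer has. The helper below is a hypothetical illustration (the name `conv2d_params` is not a Keras function). It assumes one bias per filter and reproduces the Param # column of the summary shown earlier.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of all filters plus one bias per filter."""
    kh, kw = kernel_size
    return kh * kw * in_channels * out_channels + out_channels


counts = [
    conv2d_params((3, 3), 1, 64),    # first Conv2D: 1 input channel, 64 filters
    conv2d_params((3, 3), 64, 64),   # second Conv2D: 64 input channels, 64 filters
    conv2d_params((3, 3), 64, 128),  # third Conv2D: 64 input channels, 128 filters
]

print(counts, sum(counts))  # [640, 36928, 73856] 111424
```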



## **UNDERSTANDING BORDER EFFECTS AND PADDING**

Consider a 5 × 5 feature map (25 tiles total). There are only 9 tiles around which you can center a 3 × 3 window, forming a 3 × 3 grid. Hence, the output feature map will be 3 × 3. It shrinks a little: by exactly two tiles alongside each dimension, in this case. You can see this border effect in action in the earlier example: you start with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.

**VALID conv**
*   Kernel = MxM
*   Input = NxN
*   Output = (N-M+1) x (N-M+1)
*   PAD = No padding

In Conv2D layers, padding is configurable via the padding argument, which takes two values: "valid", which means no padding (only valid window locations will be used); and "same", which means “pad in such a way as to have an output with the same width and height as the input.” The padding argument defaults to "valid".

Note that the normal convolution (called "full" convolution in signal processing, where it is the default) produces an output larger than its input, but Keras Conv2D does not offer it as a padding option.

In this type, we pad with the size of the kernel minus one (M-1) on each side.

**NORM conv**

* Kernel = MxM
* Input = NxN
* Output = (N+M-1) x (N+M-1)
* PAD = M-1, so out = (N + 2*(M-1)) - M + 1 = N + M - 1
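The three output sizes can be checked in one dimension with NumPy's `np.convolve`, which supports the `valid`, `same` and `full` modes. This is only a 1D sketch with N = 5 and M = 3, chosen to match the 5 × 5 example above. The 2D case behaves the same way along each axis.

```python
import numpy as np

signal = np.arange(1.0, 6.0)  # N = 5
kernel = np.ones(3)           # M = 3

lengths = {
    mode: np.convolve(signal, kernel, mode=mode).shape[0]
    for mode in ("valid", "same", "full")
}
print(lengths)  # {'valid': 3, 'same': 5, 'full': 7}

# Full convolution equals a valid convolution over an input padded by M-1 zeros per side
padded = np.pad(signal, len(kernel) - 1)
full_via_padding = np.convolve(padded, kernel, mode="valid")
```

So `valid` gives N-M+1, `same` gives N, and `full` gives N+M-1.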


# **The pooling operation**

_Downsampling_: In the convnet example, you may have noticed that the size of the feature maps is halved after every MaxPooling2D layer. 

For instance, before the first MaxPooling2D layer, the feature map is 26 × 26, but the max-pooling operation halves it to 13 × 13.

_That’s the role of max pooling: to aggressively downsample feature maps, much like strided convolutions._

Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel.

_It’s conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), they’re transformed via a hardcoded max tensor operation._

_A big difference from convolution is that max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2. On the other hand, convolution is typically done with 3 × 3 windows and no stride (stride 1)._
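A 2 × 2 max pooling with stride 2 can be sketched with a NumPy reshape. The function name `max_pool_2x2` is made up for this example. It assumes a single (height, width, channels) feature map and drops a trailing odd row or column, which is why 11 × 11 becomes 5 × 5 in the model summary.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """feature_map: (H, W, C) -> (H // 2, W // 2, C)."""
    h, w, c = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2, :]  # drop an odd last row/column
    windows = trimmed.reshape(h2, 2, w2, 2, c)  # split each axis into 2-wide blocks
    return windows.max(axis=(1, 3))             # max over each 2 x 2 window, per channel


small = np.arange(16.0).reshape(4, 4, 1)
pooled = max_pool_2x2(small)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]

print(max_pool_2x2(np.zeros((26, 26, 64))).shape)  # (13, 13, 64)
print(max_pool_2x2(np.zeros((11, 11, 64))).shape)  # (5, 5, 64)
```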

Max pooling selects the brightest pixel in each window. It is useful when the background of the image is dark and we are only interested in the lighter pixels. For example, in the MNIST dataset the digits are white on a black background, so max pooling is a natural choice. Conversely, min pooling suits dark features on a light background.

Average pooling, in contrast, takes the mean of each window, which smooths the feature map and softens sharp features such as edges.
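The difference between the three pooling variants shows up on a tiny example. The 4 × 4 patch below is invented for illustration: a bright stroke on a dark background, loosely in the spirit of an MNIST digit.

```python
import numpy as np

# Hypothetical patch: bright stroke (values near 1) on a black background (0)
patch = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# Group into 2 x 2 windows: (row block, row in block, col block, col in block)
windows = patch.reshape(2, 2, 2, 2)

pooled_max = windows.max(axis=(1, 3))   # keeps the stroke
pooled_min = windows.min(axis=(1, 3))   # keeps only the background
pooled_avg = windows.mean(axis=(1, 3))  # dilutes the stroke

print(pooled_max)  # [[0.9 1. ] [0.8 1. ]]
print(pooled_min)  # all zeros
print(pooled_avg)  # [[0.225 0.25 ] [0.2   0.25 ]]
```

On this input, max pooling preserves the stroke intensity, min pooling erases it, and average pooling keeps a much fainter trace of it.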

[More information](https://medium.com/@bdhuma/which-pooling-method-is-better-maxpooling-vs-minpooling-vs-average-pooling-95fb03f45a9)

# **Topics not covered:**

1. [Global Average Pooling](https://adventuresinmachinelearning.com/global-average-pooling-convolutional-neural-networks/)

2.  [Global Average Pooling (GAP) vs Flatten](https://arxiv.org/pdf/1312.4400.pdf)

3. [Special Convolutions](https://machinelearningmastery.com/introduction-to-1x1-convolutions-to-reduce-the-complexity-of-convolutional-neural-networks/)

4. [Deconvolution](https://distill.pub/2016/deconv-checkerboard/)

5. [Upsampling](https://distill.pub/2016/deconv-checkerboard/)

