# Chapter 8 Introduction to Deep Learning for Computer Vision

This chapter introduces convolutional neural networks, also known as convnets, the
type of deep learning model that is now used almost universally in computer vision
applications. You’ll learn to apply convnets to image-classification problems, in particular those involving small training datasets, which are the most common use case if
you aren’t a large tech company.

## 8.1 Introduction to convnets

First, let’s take a practical look at a simple convnet example that classifies MNIST digits, a task we performed in chapter 2 using a
densely connected network (our test accuracy then was 97.8%). Even though the
convnet will be basic, its accuracy will blow our densely connected model from chapter 2 out of the water.

The following listing shows what a basic convnet looks like. 

It’s a stack of Conv2D and MaxPooling2D layers. 

You’ll see in a minute exactly what they do. We’ll build the
model using the Functional API, which we introduced in the previous chapter.

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

In [3]:
# Input: 28 x 28 grayscale images, shape (height, width, channels)
inputs = keras.Input(shape=(28, 28, 1))

# Convolutional base: Conv2D layers with MaxPooling2D layers in between
x = layers.Conv2D(filters=32, kernel_size=3, activation='relu')(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation='relu')(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation='relu')(x)

# Classifier: flatten the 3D feature map to 1D, then 10-way softmax
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=outputs)




Importantly, a convnet takes as input tensors of shape __(image_height, image_width,
image_channels)__, not including the batch dimension. 

In this case, we’ll configure the
convnet to process inputs of size __(28, 28, 1)__, which is the format of MNIST images.

Listing 8.2 Displaying the model’s summary

In [15]:
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 13, 13, 32)       0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_5 (Conv2D)           (None, 3, 3, 128)         73856     

1. You can see that the output of every Conv2D and MaxPooling2D layer is a rank-3 tensor
of shape (height, width, channels). The width and height dimensions tend to
shrink as you go deeper in the model. The number of channels is controlled by the
first argument passed to the Conv2D layers (32, 64, or 128).


2. After the last Conv2D layer, we end up with an output of shape (3, 3, 128)—a 3 × 3
feature map of 128 channels. The next step is to feed this output into a densely connected classifier like those you’re already familiar with: a stack of Dense layers. These
classifiers process vectors, which are 1D, whereas the current output is a rank-3 tensor.


3. To bridge the gap, we flatten the 3D outputs to 1D with a Flatten layer before adding
the Dense layers.
 
 
4. Finally, we do 10-way classification, so our last layer has 10 outputs and a softmax
activation.
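The parameter counts in the summary can be checked by hand. The sketch below is only an arithmetic aid, not part of the model: `conv2d_params` and `dense_params` are hypothetical helper names, and they assume every layer uses a bias, which is the Keras default.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * kernel_size * in_channels weights plus one bias."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(in_features, units):
    """Each unit has one weight per input feature plus one bias."""
    return (in_features + 1) * units


print(conv2d_params(3, 1, 32))        # first Conv2D  -> 320
print(conv2d_params(3, 32, 64))       # second Conv2D -> 18496
print(conv2d_params(3, 64, 128))      # third Conv2D  -> 73856
print(dense_params(3 * 3 * 128, 10))  # Dense on the flattened (3, 3, 128) map -> 11530
```

The first two values match the `Param #` column of the summary above. The pooling and flatten layers have no weights, so they contribute 0.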

Now, let’s train the convnet on the MNIST digits. We’ll reuse a lot of the code from
 the MNIST example in chapter 2. 
 
Because we’re doing 10-way classification with a
 softmax output, we’ll use the categorical crossentropy loss, and because our labels are
 integers, we’ll use the sparse version, sparse_categorical_crossentropy.
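To see why integer labels call for the sparse version, here is a minimal NumPy sketch of both losses. The function names mirror the Keras loss names but are plain re-implementations written for illustration; the probabilities are made-up toy values, not model output.

```python
import numpy as np


def categorical_crossentropy(one_hot_labels, probs):
    # Mean over the batch of -sum(y * log(p)); expects one-hot labels.
    return float(np.mean(-np.sum(one_hot_labels * np.log(probs), axis=1)))


def sparse_categorical_crossentropy(int_labels, probs):
    # Same quantity, but the true class is looked up directly by its integer index.
    picked = probs[np.arange(len(int_labels)), int_labels]
    return float(np.mean(-np.log(picked)))


# Toy softmax outputs for 2 samples and 3 classes (illustrative values only)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
int_labels = np.array([0, 1])
one_hot_labels = np.eye(3)[int_labels]

dense_loss = categorical_crossentropy(one_hot_labels, probs)
sparse_loss = sparse_categorical_crossentropy(int_labels, probs)
print(dense_loss, sparse_loss)  # the two values agree
```

Both compute the same number; the sparse version just skips the one-hot encoding step.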

Listing 8.3 Training the convnet on MNIST images

In [9]:
from tensorflow.keras.datasets import mnist
import numpy as np

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channels axis and rescale pixel values from [0, 255] to [0, 1]
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255



In [11]:
model.compile(
    optimizer=keras.optimizers.RMSprop(),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)

model.fit(train_images, train_labels, epochs=5, batch_size=64)


Let's check the accuracy on the test set:

In [12]:
test_loss, test_acc = model.evaluate(test_images,test_labels)
print("Test Accuracy: {}".format(test_acc))

Test Accuracy: 0.991599977016449


### 8.1.1 The convolution operation

The fundamental difference between a __densely connected layer__ and a __convolution layer__ is this:

__Dense layers__ learn __global patterns__ in their input feature space (for example, for an MNIST digit, patterns involving all pixels),

whereas __convolution layers__ learn
 __local patterns__: in the case of images, patterns found in small 2D windows of the
 inputs. In the previous example, these windows were all 3 × 3.

This key characteristic gives convnets two interesting properties:

1. The patterns they learn are _translation-invariant_.

After learning a certain pattern in
the lower-right corner of a picture, a convnet can recognize it anywhere: for
example, in the upper-left corner. A densely connected model would have to
learn the pattern anew if it appeared at a new location. 

This makes convnets data-efficient when processing images (because the visual world is fundamentally
translation-invariant): they need fewer training samples to learn representations
that have generalization power.

2. They can learn _spatial hierarchies of patterns_.

A first convolution layer will learn
small local patterns such as edges, a second convolution layer will learn larger
patterns made of the features of the first layers, and so on.

This allows convnets to efficiently learn increasingly complex and abstract visual concepts, because the visual world is fundamentally spatially hierarchical.
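The first property can be illustrated with a toy NumPy sketch. This is not a trained convnet: `response_map` and `place` are hypothetical helpers, the "filter" is a hand-made 3 × 3 diagonal stroke, and bias and activation are left out. It only shows that one and the same filter responds equally strongly to a pattern wherever the pattern sits, with the location of the peak moving along with it.

```python
import numpy as np


def response_map(image, kernel):
    """Slide `kernel` over every valid position of `image` (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def place(pattern, top, left, size=8):
    """Return a blank size x size image with `pattern` pasted at (top, left)."""
    image = np.zeros((size, size))
    image[top:top + pattern.shape[0], left:left + pattern.shape[1]] = pattern
    return image


# Hand-made pattern (a diagonal stroke) that also serves as the filter
pattern = np.eye(3)

lower_right = response_map(place(pattern, 5, 5), pattern)
upper_left = response_map(place(pattern, 0, 0), pattern)

# Position of the strongest response in each case
peak_lower_right = tuple(int(v) for v in np.unravel_index(lower_right.argmax(), lower_right.shape))
peak_upper_left = tuple(int(v) for v in np.unravel_index(upper_left.argmax(), upper_left.shape))

print(peak_lower_right, lower_right.max())  # (5, 5) 3.0
print(peak_upper_left, upper_left.max())    # (0, 0) 3.0
```

A dense layer has a separate weight for every pixel position, so it would need to see the stroke in both corners during training to respond to both.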

Convolutions operate over rank-3 tensors called _feature maps_, with two _spatial axes_
(height and width) as well as a _depth axis_ (also called the channels axis).

+ For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. 

+ For a black-and-white picture, like the MNIST digits, the
depth is 1 (levels of gray).

The convolution operation extracts patches from its _input feature map_
and applies the same transformation to all of these patches, producing
an _output feature map_.

The _output feature map_ is still a rank-3 tensor:

+ Its _spatial axes_ (width and height) may differ in size from those of the input.

+ Its _depth_ can be __arbitrary__, because the output depth is a parameter of the
    layer, and the different channels in that __depth axis__ __no longer__ stand for specific __colors__
    as in RGB input; rather, they stand for __filters__.

__Filters__ encode specific aspects of the
 input data: at a high level, a single filter could encode the concept “presence of a face
 in the input,” for instance.

In the MNIST example, the first convolution layer takes a feature map of size __(28,
28, 1)__ and __outputs__ a feature map of size __(26, 26, 32)__: it computes __32 filters__ over its
input. 

__Each__ of these 32 output channels __contains a 26 × 26 grid of values__, which is a
__response map of the filter over the input__, indicating the response of that filter pattern at
different locations in the input.

That is what the term _feature map_ means:

Every dimension in the __depth axis__ is a __feature
(or filter)__, and the __rank-2 tensor output[:, :, n]__ is the __2D spatial map__ of the response
of this filter over the input.

Convolutions are defined by two key parameters:
+ _Size of the patches extracted from the inputs_: These are typically 3 × 3 or 5 × 5. In the
example, they were 3 × 3, which is a common choice.

+ _Depth of the output feature map_: This is the number of filters computed by the convolution. The example started with a depth of 32 and ended with a depth of 128.

In Keras Conv2D layers, these parameters are the first arguments passed to the layer:


__Conv2D(output_depth, (window_height, window_width))__

A convolution works by _sliding_ these windows of size 3 × 3 or 5 × 5 over the 3D
 input feature map, stopping at every possible location, and extracting the 3D patch of
 surrounding features __(shape (window_height, window_width, input_depth))__.

Each such 3D patch is then transformed into a __1D vector__ of __shape (output_depth,)__, which is
done via a tensor product with a learned weight matrix, called the _convolution kernel_.
The same kernel is reused across every patch.

All of these vectors (one per patch) are
 then __spatially__ reassembled into a __3D__ output map of shape __(height, width, output_depth)__.

Every spatial location in the output feature map corresponds to the same
 location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input).
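The sliding-window procedure just described can be written out as a naive NumPy loop. This is a minimal sketch for understanding, not how Keras implements `Conv2D`: the function name `naive_conv2d` is made up, the kernel is stored as a rank-4 array of shape `(window_height, window_width, input_depth, output_depth)`, and bias, activation, padding, and strides are all left out.

```python
import numpy as np


def naive_conv2d(feature_map, kernel):
    """feature_map: (height, width, input_depth)
    kernel:      (window_height, window_width, input_depth, output_depth)
    Returns an output feature map of shape (out_height, out_width, output_depth)."""
    height, width, input_depth = feature_map.shape
    window_h, window_w, kernel_in_depth, output_depth = kernel.shape
    assert kernel_in_depth == input_depth, "kernel and input depths must match"

    out_h = height - window_h + 1  # number of valid window positions vertically
    out_w = width - window_w + 1   # ... and horizontally
    output = np.zeros((out_h, out_w, output_depth))

    for i in range(out_h):
        for j in range(out_w):
            # 3D patch of shape (window_h, window_w, input_depth)
            patch = feature_map[i:i + window_h, j:j + window_w, :]
            # Tensor product with the shared kernel -> 1D vector of shape (output_depth,)
            output[i, j, :] = np.tensordot(patch, kernel, axes=3)
    return output


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))     # stand-in for one MNIST image
kernel = rng.random((3, 3, 1, 32))  # random values in place of learned weights
output = naive_conv2d(image, kernel)
print(output.shape)  # (26, 26, 32)
```

The slice `output[:, :, n]` is the response map of filter `n`, and the same `kernel` is applied at every `(i, j)`.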

Note that the output width and height may differ from the input width and height for
two reasons:

1.  __Border effects__, which can be countered by __padding__ the input feature map

2.  The use of strides, which I’ll define in a second

#### Understanding border effects and padding

Consider a 5 × 5 feature map (25 tiles total). There are only 9 tiles around which you
can center a 3 × 3 window, forming a 3 × 3 grid.

Hence, the output feature map will be 3 × 3. It shrinks a little: by exactly two tiles along each dimension,
in this case.

You can see this border effect in action in the earlier example: you start
with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.

If you want to get an output feature map with the same spatial dimensions as the
 input, you can use _padding_:

 _Padding_ consists of __adding an appropriate number of rows and columns on each side of the input feature map__ so as to make it possible to fit centered convolution windows around every input tile.

In __Conv2D()__ layers, padding is configurable via the padding argument, which takes two
 values: 
 
 1. __"valid"__, which means no padding (only valid window locations will be used),
 
 2. __"same"__, which means “pad in such a way as to have an output with the same width
 and height as the input.” 
 
 The padding argument defaults to "valid".
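The 5 × 5 example can be reproduced with `np.pad`. This is a small sketch under two assumptions: the window size is odd and the stride is 1, so padding `(window - 1) // 2` zeros on each side is enough to restore the input size.

```python
import numpy as np

feature_map = np.arange(25).reshape(5, 5)  # the 5 x 5 feature map from the text
window = 3

# "valid": only window positions that fit entirely inside the input
valid_size = feature_map.shape[0] - window + 1

# "same" (odd window, stride 1): add (window - 1) // 2 rows/columns of zeros per side
pad = (window - 1) // 2
padded = np.pad(feature_map, pad_width=pad, mode="constant", constant_values=0)
same_size = padded.shape[0] - window + 1

print(valid_size)    # 3
print(padded.shape)  # (7, 7)
print(same_size)     # 5
```

With one row and column of zeros on every side, a 3 × 3 window can be centered on each of the 25 original tiles.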

#### Understanding convolution strides

The other factor that can influence output size is the notion of __strides__.

Our description of convolution so far has assumed that the center tiles of the convolution windows are
 all contiguous.

_stride_: the __distance between two successive windows__, a parameter of the convolution that defaults to 1

_strided convolutions_: convolutions with a _stride_ higher than 1

Using a _stride_ of 2 means the width and height of the feature map are __downsampled__ by a
 __factor of 2__ (in addition to any changes induced by border effects).

Strided convolutions are rarely used in classification models, but they come in handy for some types of
 models, as you will see in the next chapter.
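Border effects and strides combine into a simple size formula along each spatial axis. The helper below is a sketch: `conv_output_size` is a made-up name, and the two formulas are the ones commonly given for `"valid"` and `"same"` padding, stated here as an assumption rather than taken from the text.

```python
import math


def conv_output_size(input_size, window, stride=1, padding="valid"):
    """Output length along one spatial axis for a sliding window."""
    if padding == "valid":
        # Count the window positions that fit entirely inside the input.
        return (input_size - window) // stride + 1
    if padding == "same":
        # The input is padded so that only the stride reduces the size.
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding: {padding!r}")


print(conv_output_size(5, 3))                     # 3  (the 5 x 5 example)
print(conv_output_size(28, 3))                    # 26 (first Conv2D on MNIST)
print(conv_output_size(28, 3, padding="same"))    # 28
print(conv_output_size(28, 3, stride=2))          # 13
print(conv_output_size(26, 2, stride=2))          # 13 (2 x 2 window, stride 2)
```

The last line uses the same arithmetic for a 2 × 2 window with stride 2, which halves 26 to 13.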

### 8.1.2 The max-pooling operation

The __role__ of max pooling is to aggressively __downsample__ feature maps, much like
 strided convolutions.

It’s conceptually similar to convolution,
 except that __instead of transforming local patches__ via a learned linear transformation (the convolution kernel), they’re __transformed via a hardcoded max tensor
 operation__.

A big difference from convolution is that 

+  max pooling is usually done
 with __2 × 2__ windows and __stride 2__, in order to downsample the feature maps by a factor of 2. 
 
+  On the other hand, __convolution__ is typically done with __3 × 3__ windows and __no
 stride (stride 1)__
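A minimal NumPy sketch of max pooling, assuming 2 × 2 windows with stride 2 as described above. `max_pool2d` is a hypothetical helper written for illustration, not the Keras `MaxPooling2D` implementation.

```python
import numpy as np


def max_pool2d(feature_map, pool_size=2, stride=2):
    """feature_map: (height, width, channels). Takes the max of each channel per window."""
    height, width, channels = feature_map.shape
    out_h = (height - pool_size) // stride + 1
    out_w = (width - pool_size) // stride + 1
    output = np.zeros((out_h, out_w, channels))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = feature_map[top:top + pool_size, left:left + pool_size, :]
            # Hardcoded max over the two spatial axes; nothing here is learned.
            output[i, j, :] = patch.max(axis=(0, 1))
    return output


feature_map = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool2d(feature_map)
print(pooled[:, :, 0].tolist())                   # [[5.0, 7.0], [13.0, 15.0]]
print(max_pool2d(np.zeros((26, 26, 32))).shape)   # (13, 13, 32)
```

The second call reproduces the 26 × 26 to 13 × 13 halving seen in the model summary, with the channel count unchanged.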

Why downsample feature maps this way? Why not remove the max-pooling layers
and keep fairly large feature maps all the way up? Let’s look at this option. Our model
would then look like the following listing.

Listing 8.5 An incorrectly structured convnet missing its max-pooling layers

In [13]:
inputs = keras.Input(shape=(28, 28, 1))

x = layers.Conv2D(filters=32, kernel_size=3, activation='relu')(inputs)
x = layers.Conv2D(filters=64, kernel_size=3, activation='relu')(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation='relu')(x)
x = layers.Flatten()(x)

outputs = layers.Dense(10, activation='softmax')(x)

model_no_max_pool = keras.Model(inputs, outputs)

In [14]:
model_no_max_pool.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_6 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 conv2d_7 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 conv2d_8 (Conv2D)           (None, 22, 22, 128)       73856     
                                                                 
 flatten_2 (Flatten)         (None, 61952)             0         
                                                                 
 dense_2 (Dense)             (None, 10)                619530    
                                                                 
Total params: 712,202
Trainable params: 712,202
Non-trainable params: 0

There are __two__ problems with this setup:

1. It isn’t conducive to learning a spatial hierarchy of features. The __3 × 3__ windows
in the __third layer__ will only contain information coming from __7 × 7__ windows in
the __initial input__. 
The high-level patterns learned by the convnet will still be very
small with regard to the initial input, which may __not be enough__ to learn to classify digits (try recognizing a digit by only looking at it through windows that are
7 × 7 pixels!). We need the features from the last convolution layer to contain
information about the totality of the input.


2. The final feature map has 22 × 22 × 128 = __61,952 total coefficients per sample__.
This is huge. When you flatten it to stick a Dense layer of size 10 on top, that
layer would have over half a million parameters. __This is far too large for such a
small model and would result in intense overfitting.__
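Both problems can be checked with a few lines of arithmetic. The sketch below uses a standard receptive-field recurrence, which is an assumption brought in for illustration; `receptive_field` is a made-up helper that takes `(window, stride)` pairs from input to output.

```python
def receptive_field(layers):
    """Side length of the input window seen by one position of the last layer.

    layers: list of (window_size, stride) pairs, ordered from input to output.
    """
    size, jump = 1, 1
    for window, stride in layers:
        size += (window - 1) * jump  # each layer widens the view by (window - 1) steps
        jump *= stride               # strides make later steps cover more input pixels
    return size


no_pooling = [(3, 1), (3, 1), (3, 1)]                    # three 3 x 3 convolutions
with_pooling = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]  # conv, pool, conv, pool, conv

print(receptive_field(no_pooling))    # 7
print(receptive_field(with_pooling))  # 18

# Size of the flattened feature map and of the Dense layer on top of it
flattened = 22 * 22 * 128
print(flattened)                      # 61952
print((flattened + 1) * 10)           # 619530 (weights plus one bias per unit)
```

With the two pooling layers in place, each position of the last convolution layer sees an 18 × 18 region of the 28 × 28 input instead of 7 × 7.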

In short, the reasons to use downsampling are:

1. __Reduce__ the number of feature-map coefficients to process

2. __Induce__ spatial-filter hierarchies by making successive convolution layers look at increasingly large windows

Max pooling isn’t the only way to achieve such downsampling. Alternatives:

1. Use strides in the prior convolution layer.

2. Use average pooling instead of max pooling, where each local input patch is transformed by taking the average value of each channel over the patch.



__Max pooling__ tends to work __better__ than these alternative solutions.

The reason is that:

1. Features tend to encode the spatial presence of some pattern or concept
 over the different tiles of the feature map.

2. So it’s more
 informative to look at the maximal presence of different features than at their average
 presence.

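The __most reasonable subsampling strategy__ is therefore to:

1. Produce dense maps of features (via unstrided convolutions)

2. Look at the maximal activation of the features over small patches

This is preferable to looking at sparser windows of the inputs (via strided convolutions) or averaging input patches, which could cause you to miss or dilute feature-presence information.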
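A toy comparison makes the "miss or dilute" point concrete. The response map below is made up for illustration: a feature that fires strongly in a single tile and nowhere else, pooled with 2 × 2 windows and stride 2.

```python
import numpy as np

# Hypothetical 4 x 4 response map: the feature is present in exactly one tile
response = np.zeros((4, 4))
response[1, 2] = 1.0

# Split into non-overlapping 2 x 2 blocks: result[block_row, block_col, row, col]
blocks = response.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = blocks.max(axis=(2, 3))
avg_pooled = blocks.mean(axis=(2, 3))

print(max_pooled.tolist())  # [[0.0, 1.0], [0.0, 0.0]]
print(avg_pooled.tolist())  # [[0.0, 0.25], [0.0, 0.0]]
```

Max pooling keeps the full activation of 1.0 for the block where the feature appears, while average pooling dilutes it to 0.25.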