<!--TITLE: Moving Windows-->
# Introduction #

In the previous two lessons, we learned about the three operations that carry out feature extraction from an image:
1. *filter* with a **convolution** layer
2. *detect* with **ReLU** activation
3. *condense* with a **maximum pooling** layer

The convolution and pooling operations share a common feature: they are both performed over a **moving window**. With convolution, this "window" is given by the dimensions of the kernel, the parameter `kernel_size`. With pooling, it is the pooling window, given by `pool_size`.

<figure>
<img src="https://i.imgur.com/LueNK6b.gif" width=400 alt="A 2D moving window.">
</figure>
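
As a rough illustration of the idea, the moving window can be sketched in plain Python. This is not how Keras implements these layers; the helper names (`sliding_windows`, `apply_over_windows`, `weighted_sum`, `window_max`) and the tiny image and kernel are made up for this sketch. The window moves one pixel at a time and stays entirely inside the image.

```python
def sliding_windows(image, size):
    """Yield (row, col, window) for every size x size window inside the image."""
    rows, cols = len(image), len(image[0])
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            window = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, window


def apply_over_windows(image, size, fn):
    """Apply fn to each window and collect the results into a 2D output."""
    rows, cols = len(image), len(image[0])
    out = [[None] * (cols - size + 1) for _ in range(rows - size + 1)]
    for r, c, window in sliding_windows(image, size):
        out[r][c] = fn(window)
    return out


# Hypothetical 3x4 "image" and 2x2 kernel, chosen only for illustration.
image = [
    [1, 2, 0, 1],
    [0, 3, 1, 2],
    [4, 1, 0, 0],
]
kernel = [[1, 0], [0, -1]]


def weighted_sum(window):
    """Convolution-style step: multiply the window by the kernel and sum."""
    return sum(w * k
               for w_row, k_row in zip(window, kernel)
               for w, k in zip(w_row, k_row))


def window_max(window):
    """Pooling-style step: keep the largest value in the window."""
    return max(max(row) for row in window)


filtered = apply_over_windows(image, size=2, fn=weighted_sum)
pooled = apply_over_windows(image, size=2, fn=window_max)
print(filtered)  # [[-2, 1, -2], [-1, 3, 1]]
print(pooled)    # [[3, 3, 2], [4, 3, 2]]
```

Both operations share the same traversal; only the function applied to each window differs.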

There are two additional parameters affecting both convolution and pooling layers -- these are the `strides` of the window and whether to use `padding` at the image edges. The `strides` parameter says how far the window should move at each step, and the `padding` parameter describes how we handle the pixels at the edges of the input.

With these two parameters, defining the two layers becomes:

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(filters=64,
                  kernel_size=3,
                  strides=1,
                  padding='same',
                  activation='relu'),
    layers.MaxPool2D(pool_size=2,
                     strides=1,
                     padding='same')
    # More layers follow
])

# Stride #

The distance the window moves at each step is called the **stride**. We need to specify the stride in both dimensions of the image: one for moving left to right and one for moving top to bottom. This animation shows `strides=(2, 2)`, a movement of 2 pixels each step.

<figure>
<img src="https://i.imgur.com/Tlptsvt.gif" width=400 alt="Moving window with a stride of (2, 2).">
</figure>

What effect does the stride have? Whenever the stride in either direction is greater than 1, the moving window will skip over some of the pixels in the input at each step.

Because we want high-quality features to use for classification, convolutional layers will most often have `strides=(1, 1)`. Increasing the stride means that we miss out on potentially valuable information in our summary. Maximum pooling layers, however, will almost always have stride values greater than 1, like `(2, 2)` or `(3, 3)`, but not larger than the window itself.

Finally, note that when the value of the `strides` is the same number in both directions, you only need to set that number; for instance, instead of `strides=(2, 2)`, you could use `strides=2` for the parameter setting.
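
To see why the stride should not exceed the window size, here is a small one-dimensional sketch in plain Python. The helper names `window_starts` and `covered_pixels` are hypothetical, and the window is assumed to stay entirely inside the input. With a window of 2, a stride of 3 leaves some pixels that no window ever touches.

```python
def window_starts(input_size, window, stride):
    """Start indices of a window that stays entirely inside a 1-D input."""
    return list(range(0, input_size - window + 1, stride))


def covered_pixels(input_size, window, stride):
    """Set of input indices that fall inside at least one window position."""
    covered = set()
    for start in window_starts(input_size, window, stride):
        covered.update(range(start, start + window))
    return covered


size = 8
for stride in (1, 2, 3):
    starts = window_starts(size, window=2, stride=stride)
    missed = sorted(set(range(size)) - covered_pixels(size, 2, stride))
    print(f"stride={stride}: starts={starts}, never covered={missed}")
# stride=1: starts=[0, 1, 2, 3, 4, 5, 6], never covered=[]
# stride=2: starts=[0, 2, 4, 6], never covered=[]
# stride=3: starts=[0, 3, 6], never covered=[2, 5]
```

A larger stride also means fewer window positions, which is why it shrinks the output.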

# Padding #

When performing the moving window computation, there is a question as to what to do at the boundaries of the input. Staying entirely inside the input image means the window will never sit squarely over these boundary pixels like it does for every other pixel in the input. Since we aren't treating all the pixels exactly the same, could there be a problem?

What the convolution does with these boundary values is determined by its `padding` parameter. In TensorFlow, you have two choices: either `padding='same'` or `padding='valid'`. There are trade-offs with each.

When we set `padding='valid'`, the convolution window will stay entirely inside the input. The drawback is that the output shrinks (loses pixels), and shrinks more for larger kernels. This will limit the number of layers the network can contain, especially when inputs are small in size.
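
The shrinkage can be worked out with a little arithmetic. The sketch below assumes the usual output-size rule for a window that stays inside the input, `(input_size - kernel_size) // stride + 1` along each dimension. The function names and the 32-pixel input are illustrative only.

```python
def valid_output_size(input_size, kernel_size, stride=1):
    """Output length along one dimension when the window stays inside the input."""
    return (input_size - kernel_size) // stride + 1


def trace_valid_stack(input_size, kernel_size, n_layers):
    """Sizes after each of n_layers stacked stride-1 'valid' layers."""
    sizes = [input_size]
    for _ in range(n_layers):
        sizes.append(valid_output_size(sizes[-1], kernel_size))
    return sizes


print(trace_valid_stack(32, kernel_size=3, n_layers=4))  # [32, 30, 28, 26, 24]
print(trace_valid_stack(32, kernel_size=7, n_layers=4))  # [32, 26, 20, 14, 8]
```

Each stride-1 layer removes `kernel_size - 1` pixels per dimension, so larger kernels use up a small input in fewer layers.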

The alternative is to use `padding='same'`. The trick here is to **pad** the input with 0's around its borders, using just enough 0's to make the size of the output the *same* as the size of the input. This can, however, have the effect of diluting the influence of pixels at the borders. The animation below shows a moving window with `'same'` padding.

<figure>
<img src="https://i.imgur.com/RvGM2xb.gif" width=400 alt="Illustration of zero (same) padding.">
</figure>
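
How many zeros is "just enough"? The sketch below follows the convention TensorFlow documents for `'same'` padding, as far as we understand it. The output size is `ceil(input_size / stride)`, and any odd padding amount puts the extra zero at the end. The function name `same_padding` is made up for this example.

```python
import math


def same_padding(input_size, kernel_size, stride=1):
    """Return (output_size, pad_before, pad_after) along one dimension."""
    output_size = math.ceil(input_size / stride)
    # Total zeros needed so the last window position still fits.
    total = max((output_size - 1) * stride + kernel_size - input_size, 0)
    before = total // 2
    after = total - before
    return output_size, before, after


print(same_padding(224, kernel_size=3, stride=1))  # (224, 1, 1)
print(same_padding(6, kernel_size=3, stride=2))    # (3, 0, 1)
```

With a stride of 1, a $3 \times 3$ kernel needs a one-pixel border of zeros on every side, and the output keeps the input's size.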

The VGG model we've been looking at uses `'same'` padding for all of its convolutional layers. Most modern convnets will use some combination of the two. (Another parameter to tune!)

# Example - Exploring Feature Extraction #

One of the fundamental problems in convolutional networks is how to extract information from raw image data **efficiently** while also minimizing the amount of information lost before it reaches the output. Many of the major advances in convnet architecture have been advances in efficient feature extraction.

The choice of operations used for feature extraction (`conv2d`, `relu`, `maxpool2d`) together with their parameters is one of the primary ways through which a convnet attempts to solve this problem. In this example and in the exercises, we'll explore the choices made by some of the most successful networks, as implemented in the `tf.keras.applications` module.

In this example, we'll look at VGG16. In the exercises, you'll have a chance to explore feature extraction as performed by the ResNet50 and InceptionV3 architectures.

## VGG16 ##

This hidden cell will load an image that we'll use and also the VGG16 model.

In [None]:
#$HIDE_INPUT$
import tensorflow as tf
import matplotlib.pyplot as plt
from matplotlib import gridspec
import visiontools
import warnings

plt.rc('figure', autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large',
       titleweight='bold', titlesize=18, titlepad=10)
plt.rc('image', cmap='magma')
warnings.filterwarnings("ignore") # to clean up output cells

IMAGE_PATH = '/kaggle/input/computer-vision-resources/car_illus.jpg'
image = visiontools.read_image(IMAGE_PATH, channels=1)
image = tf.image.resize(image, size=[224, 224], method='nearest')

vgg16 = tf.keras.models.load_model(
    '/kaggle/input/cv-course-models/cv-course-models/vgg16-pretrained-base',
)

VGG16 has a relatively simple way of extracting features. It performs a filtering step with $3 \times 3$ kernels, strides of 1, and "same" padding, followed by a condensing step with $2 \times 2$ windows and strides of 2. (This is what we used for the examples in Lessons 2 and 3.) We'll use a function from the `visiontools` script to help us visualize what effect this has.

In [None]:
# Get a random kernel from the first convolutional layer of VGG16
kernel = visiontools.random_kernel(model=vgg16, layer='block1_conv1')
# Show the kernel
visiontools.show_kernel(kernel, label=False)

# Show feature extraction with given parameters
visiontools.show_extraction(
    image, kernel,
    conv_stride=1,
    conv_padding='same',
    activation='relu',
    pool_size=2,
    pool_stride=2,
    pool_padding='same',
    subplot_shape=(1, 4),
    figsize=(16, 6),
)

Pretty neat! And this is just one feature map from a single grayscale input. The first extraction performed by VGG16 actually produces 64 feature maps using all three color channels. (In fact, it uses *two* convolutions before pooling, instead of just one as we did here. We'll see why this is advantageous in the exercises.)
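
The feature map sizes produced by this kind of extraction can be checked with some quick arithmetic. This sketch assumes `'same'` padding throughout, so each step's output size is `ceil(size / stride)`, and it assumes the standard VGG16 base with five convolution-and-pooling blocks. The function names are hypothetical and not part of `visiontools`.

```python
import math


def same_output_size(input_size, stride):
    """Output length along one dimension with 'same' padding."""
    return math.ceil(input_size / stride)


def extraction_output_size(input_size, conv_stride=1, pool_stride=2):
    """Size after one filter-then-condense step, both with 'same' padding."""
    after_conv = same_output_size(input_size, conv_stride)
    return same_output_size(after_conv, pool_stride)


# Trace a 224-pixel side through five such blocks (assumed VGG16 layout).
sizes = [224]
for _ in range(5):
    sizes.append(extraction_output_size(sizes[-1]))
print(sizes)  # [224, 112, 56, 28, 14, 7]
```

The stride-1 convolutions leave the size unchanged, so all of the downsampling comes from the stride-2 pooling.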

# Conclusion #

In this lesson, we looked at a characteristic computation common to both convolution and pooling: the **moving window** and the parameters affecting its behavior in these layers. This style of windowed computation contributes much of what is characteristic of convolutional networks and is an essential part of their functioning.

# Your Turn #

Move on to the [**Exercise**](#$NEXT_NOTEBOOK_URL$), where you'll explore feature extraction further, learn how *stacking* convolutional layers can increase the effective window size, and learn how convolution can be used with *one-dimensional* data, like time series.