# Convolutional Neural Networks

## 1. Convolutions for Images
Now that we understand how convolutional layers work in theory, we are ready to see how this works in practice. Since we have motivated convolutional neural networks by their applicability to image data, we will stick with image data in our examples, and begin by revisiting the convolutional layer that we introduced in the previous section. We note that strictly speaking, convolutional layers are a slight misnomer, since the operations are typically expressed as cross-correlations.

### 1.1 The Cross-Correlation Operator 
In a convolutional layer, an input array and a correlation kernel array are combined to produce an output array through a cross-correlation operation. Let’s see how this works for two dimensions. In our example, the input is a two-dimensional array with a height of 3 and width of 3. We mark the shape of the array as 3×3 or (3, 3). The height and width of the kernel array are both 2. Common names for this array in the deep learning research community include kernel and filter. The shape of the kernel window (also known as the convolution window) is given precisely by the height and width of the kernel (here it is 2×2). 
![cnn_pic1.PNG](attachment:cnn_pic1.PNG)
In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the top-left corner of the input array and slide it across the input array, both from left to right and top to bottom. When the convolution window slides to a certain position, the input subarray contained in that window and the kernel array are multiplied (element-wise) and the resulting array is summed up, yielding a single scalar value. This result is precisely the value of the output array at the corresponding location. Here, the output array has a height of 2 and width of 2, and the four elements are derived from the two-dimensional cross-correlation operation:
![cnn_pic2.PNG](attachment:cnn_pic2.PNG)
Note that along each axis, the output is slightly smaller than the input. Because the kernel has a width greater than one, and we can only compute the cross-correlation for locations where the kernel fits wholly within the image, the output size is given by the input size H×W minus the size of the convolutional kernel h×w via (H−h+1)×(W−w+1). This is the case since we need enough space to ‘shift’ the convolutional kernel across the image (later we will see how to keep the size unchanged by padding the image with zeros around its boundary such that there’s enough space to shift the kernel). Next, we implement the above process in the corr2d function. It accepts the input array X with the kernel array K and outputs the array Y.

In [1]:
from mxnet import autograd, nd 
from mxnet.gluon import nn
# Save to the d2l package. 
def corr2d(X, K): 
    """Compute 2D cross-correlation.""" 
    h, w = K.shape 
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]): 
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum() 
    return Y

We can construct the input array X and the kernel array K from the figure above to validate the output of the above implementations of the two-dimensional cross-correlation operation.

In [2]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) 
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)


[[19. 25.]
 [37. 43.]]
<NDArray 2x2 @cpu(0)>
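
For readers without MXNet at hand, the same computation can be sketched with plain nested lists. The helper names `corr2d` (re-implemented here) and `flip` are illustrative and not part of any library. The sketch also illustrates the remark from the introduction: a true convolution is the cross-correlation computed with the kernel reversed along both axes.

```python
def corr2d(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]


def flip(K):
    """Reverse a kernel along both axes."""
    return [row[::-1] for row in K[::-1]]


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]

cross_corr = corr2d(X, K)        # same values as the MXNet corr2d above
true_conv = corr2d(X, flip(K))   # convolution: cross-correlate with flipped K
print(cross_corr)  # [[19, 25], [37, 43]]
print(true_conv)   # [[5, 11], [23, 29]]
```

Because the kernel is learned from data anyway, a layer that uses cross-correlation can represent the same functions as one that uses true convolution; it simply learns the flipped kernel.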

## 1.2 Convolutional Layers 

A convolutional layer cross-correlates the input and kernels and adds a scalar bias to produce an output. The parameters of the convolutional layer are precisely the values that constitute the kernel and the scalar bias. When training the models based on convolutional layers, we typically initialize the kernels randomly, just as we would with a fully-connected layer. 

We are now ready to implement a two-dimensional convolutional layer based on the corr2d function defined above. In the __init__ constructor function, we declare weight and bias as the two model parameters. The forward computation function forward calls the corr2d function and adds the bias. As with h×w cross-correlation we also refer to convolutional layers as h×w convolutions. 

In [3]:
class Conv2D(nn.Block): 
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs) 
        self.weight = self.params.get('weight', shape=kernel_size) 
        self.bias = self.params.get('bias', shape=(1,))
    def forward(self, x): 
        return corr2d(x, self.weight.data()) + self.bias.data()

## 1.3 Object Edge Detection in Images

Let’s look at a simple application of a convolutional layer: detecting the edge of an object in an image by finding the location of the pixel change. First, we construct an ‘image’ of 6 × 8 pixels. The middle four columns are black (0) and the rest are white (1). 

In [4]:
X = nd.ones((6, 8))
X[:, 2:6] = 0 
X


[[1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 0. 0. 0. 0. 1. 1.]]
<NDArray 6x8 @cpu(0)>

Next, we construct a kernel K with a height of 1 and width of 2. When we perform the cross-correlation operation with the input, if the horizontally adjacent elements are the same, the output is 0. Otherwise, the output is non-zero. 

In [5]:
K = nd.array([[1, -1]])

We pass X and our designed kernel K to corr2d to perform the cross-correlation operation. As you can see, we detect 1 for the edge from white to black and -1 for the edge from black to white. The rest of the outputs are 0.

In [6]:
Y = corr2d(X, K) 
Y


[[ 0.  1.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0. -1.  0.]]
<NDArray 6x7 @cpu(0)>

Let’s apply the kernel to the transposed image. As expected, it vanishes. The kernel K only detects vertical edges. 

In [7]:
corr2d(X.T, K)


[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
<NDArray 8x5 @cpu(0)>
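
To detect the horizontal edges of the transposed image, one option is to transpose the kernel as well. The following is a minimal sketch in plain Python (no MXNet), with the cross-correlation for the 2×1 kernel [[1], [-1]] written out by hand; the names `X_t` and `Y_t` are illustrative.

```python
# Transposed 'image': 8 rows x 6 columns, rows 2..5 are black (0).
X_t = [[0.0 if 2 <= i < 6 else 1.0 for _ in range(6)] for i in range(8)]

# Cross-correlation with the 2x1 kernel [[1], [-1]], written out directly:
# each output is a pixel minus the pixel directly below it.
Y_t = [[X_t[i][j] - X_t[i + 1][j] for j in range(6)] for i in range(7)]

for row in Y_t:
    print(row)
```

The output has shape 7×6. Row 1 is all 1 (white to black, going down), row 5 is all -1 (black to white), and every other row is 0.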

## 1.4 Learning a Kernel 

Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what we are looking for. However, as we look at larger kernels, and consider successive layers of convolutions, it might be impossible to specify precisely what each filter should be doing manually.

Now let’s see whether we can learn the kernel that generated Y from X by looking at the (input, output) pairs only. We first construct a convolutional layer and initialize its kernel as a random array. Next, in each iteration, we will use the squared error to compare Y and the output of the convolutional layer, then calculate the gradient to update the weight. For the sake of simplicity, in this convolutional layer, we will ignore the bias.

We previously constructed the Conv2D class. However, since we used single-element assignments, Gluon has some trouble finding the gradient. Instead, we use the built-in Conv2D class provided by Gluon below. 

In [8]:
# Construct a convolutional layer with 1 output channel 
# (channels will be introduced in the following section) 
# and a kernel array shape of (1, 2) 
conv2d = nn.Conv2D(1, kernel_size=(1, 2)) 
conv2d.initialize()
# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8)) 
Y = Y.reshape((1, 1, 6, 7))

In [9]:
for i in range(10): 
    with autograd.record():
        Y_hat = conv2d(X) 
        l = (Y_hat - Y) ** 2
    l.backward() 
    # For the sake of simplicity, we ignore the bias here
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad() 
    if (i + 1) % 2 == 0: 
        print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))

batch 2, loss 4.949
batch 4, loss 0.831
batch 6, loss 0.140
batch 8, loss 0.024
batch 10, loss 0.004


As you can see, the error has dropped to a small value after 10 iterations. Now we will take a look at the kernel array we learned. 

In [10]:
conv2d.weight.data().reshape((1, 2))


[[ 0.9895    -0.9873705]]
<NDArray 1x2 @cpu(0)>

Indeed, the learned kernel array is remarkably close to the kernel array K we defined earlier.
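
The update performed by the training loop can also be sketched without autograd. The version below is a plain-Python illustration, not the Gluon implementation: the gradient of the squared error is derived by hand for a 1×2 kernel without bias, and the weights start from fixed zeros (an assumption made for reproducibility; the text above uses random initialization).

```python
# Image from the edge-detection example: 6 x 8, middle four columns are 0.
X = [[0.0 if 2 <= j < 6 else 1.0 for j in range(8)] for _ in range(6)]
# Target produced by the hand-designed kernel [1, -1].
Y = [[row[j] - row[j + 1] for j in range(7)] for row in X]

w = [0.0, 0.0]   # fixed start (assumption; the text initializes randomly)
lr = 3e-2        # same learning rate as the Gluon loop above

for step in range(30):
    grad = [0.0, 0.0]
    loss = 0.0
    for x_row, y_row in zip(X, Y):
        for j in range(7):
            err = w[0] * x_row[j] + w[1] * x_row[j + 1] - y_row[j]
            loss += err ** 2
            # d(err^2)/dw0 = 2 * err * x[j], d(err^2)/dw1 = 2 * err * x[j + 1]
            grad[0] += 2 * err * x_row[j]
            grad[1] += 2 * err * x_row[j + 1]
    w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]

print(w, loss)
```

After 30 steps the weights are close to [1, -1], the kernel K that generated Y.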

## 1.5 Padding

As described above, one tricky issue when applying convolutional layers is that we lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution we might lose only a few pixels, but this can add up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image. Typically, we set the values of the extra pixels to 0. In the figure below, we pad a 3×5 input, increasing its size to 5×7. The corresponding output then increases to a 4×6 matrix.
![cnn_pic3.PNG](attachment:cnn_pic3.PNG)
In general, if we add a total of ph rows of padding (roughly half on top and half on bottom) and a total of pw columns of padding (roughly half on the left and half on the right), the output shape will be 
![cnn_pic4.PNG](attachment:cnn_pic4.PNG)
This means that the height and width of the output will increase by ph and pw respectively.

In many cases, we will want to set ph = kh −1 and pw = kw −1 to give the input and output the same height and width. This will make it easier to predict the output shape of each layer when constructing the network. Assuming that kh is odd here, we will pad ph/2 rows on both sides of the height. If kh is even, one possibility is to pad ⌈ph/2⌉ rows on the top of the input and ⌊ph/2⌋ rows on the bottom. We will pad both sides of the width in the same way. 

Convolutional neural networks commonly use convolutional kernels with odd height and width values, such as 1, 3, 5, or 7. Choosing odd kernel sizes has the benefit that we can preserve the spatial dimensionality while padding with the same number of rows on top and bottom, and the same number of columns on left and right. 

Moreover, this practice of using odd kernels and padding to precisely preserve dimensionality offers a clerical benefit. For any two-dimensional array X, when the kernel size is odd and the number of padding rows and columns on all sides is the same, producing an output with the same height and width as the input, we know that the output Y[i,j] is calculated by cross-correlation of the input and convolution kernel with the window centered on X[i,j].

In the following example, we create a two-dimensional convolutional layer with a height and width of 3 and apply 1 pixel of padding on all sides. Given an input with a height and width of 8, we find that the height and width of the output is also 8. 

In [11]:
from mxnet import nd 
from mxnet.gluon import nn

In [12]:
# For convenience, we define a function to calculate the convolutional layer. 
# This function initializes the convolutional layer weights and performs 
# corresponding dimensionality elevations and reductions on the input and 
# output
def comp_conv2d(conv2d, X):
    conv2d.initialize() 
    # (1,1) indicates that the batch size and the number of channels
    # (described in later chapters) are both 1 
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X) 
    # Exclude the first two dimensions that do not interest us: batch and
    # channel
    return Y.reshape(Y.shape[2:])
# Note that here 1 row or column is padded on either side, so a total of 2 
# rows or columns are added 
conv2d = nn.Conv2D(1, kernel_size=3, padding=1) 
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape

(8, 8)

When the height and width of the convolution kernel are different, we can make the output and input have the same height and width by setting different padding numbers for height and width. 

In [13]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on both sides of the height and width are 2 and 1,
# respectively 
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1)) 
comp_conv2d(conv2d, X).shape

(8, 8)
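
The padding arithmetic can be checked by hand with a small plain-Python sketch. `pad2d` is a hypothetical helper introduced here, and `corr2d` is re-implemented on nested lists so the block runs without MXNet. The kernels below keep only one element of each window, which makes it easy to see which input element each output is aligned with.

```python
def pad2d(X, top, bottom, left, right):
    """Surround a nested-list array with zeros."""
    width = len(X[0]) + left + right
    rows = [[0] * left + list(row) + [0] * right for row in X]
    return ([[0] * width for _ in range(top)] + rows
            + [[0] * width for _ in range(bottom)])


def corr2d(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Odd kernel (3x3) that keeps only the centre of its window.
# Total padding is k - 1 = 2 per axis, i.e. 1 row/column on every side.
K_odd = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
Y_odd = corr2d(pad2d(X, 1, 1, 1, 1), K_odd)

# Even kernel (2x2): total padding is 1 per axis, placed on top and left
# (ceil(1/2) = 1 before, floor(1/2) = 0 after).
K_even = [[0, 0], [0, 1]]
Y_even = corr2d(pad2d(X, 1, 0, 1, 0), K_even)

print(Y_odd)
print(Y_even)
```

Both outputs have the same 3×3 shape as X, and with these pick-one kernels they reproduce X exactly. For the odd kernel this confirms that the window producing Y[i,j] is centered on X[i,j].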

## 1.6 Stride 

When computing the cross-correlation, we start with the convolution window at the top-left corner of the input array, and then slide it over all locations both down and to the right. In previous examples, we default to sliding one pixel at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one pixel at a time, skipping the intermediate locations. 

We refer to the number of rows and columns traversed per slide as the stride. So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. The figure below shows a two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizontally. We can see that when the second element of the first column is output, the convolution window slides down three rows. The convolution window slides two columns to the right when the second element of the first row is output. When the convolution window slides a further two columns to the right on the input, there is no output because the input elements cannot fill the window (unless we add padding).

In general, when the stride for the height is sh and the stride for the width is sw, the output shape is
![cnn_pic5.PNG](attachment:cnn_pic5.PNG)
![cnn_pic6.PNG](attachment:cnn_pic6.PNG)

If we set ph = kh −1 and pw = kw −1, then the output shape will be simplified to ⌊(nh + sh −1)/sh⌋× ⌊(nw + sw −1)/sw⌋. Going a step further, if the input height and width are divisible by the strides on the height and width, then the output shape will be (nh/sh)×(nw/sw). 

Below, we set the strides on both the height and width to 2, thus halving the input height and width. 

In [14]:
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape

(4, 4)

In [15]:
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape

(2, 2)
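
The shapes printed above can be reproduced with a few lines of integer arithmetic. This sketch assumes the general formula ⌊(n − k + p + s)/s⌋ per axis, where p is the total padding; it is consistent with the simplified form ⌊(n + s − 1)/s⌋ obtained for p = k − 1. The function names are illustrative and not part of Gluon.

```python
def conv_output_size(n, k, p_total=0, s=1):
    """Output length along one axis: floor((n - k + p_total + s) / s)."""
    return (n - k + p_total + s) // s


def conv_output_shape(input_shape, kernel_shape, padding_per_side=(0, 0),
                      strides=(1, 1)):
    """Apply the one-axis formula to height and width.

    padding_per_side follows the convention of the padding argument used in
    the code above (rows/columns added on each side), so the total padding
    in the formula is twice that value.
    """
    return tuple(conv_output_size(n, k, 2 * p, s)
                 for n, k, p, s in zip(input_shape, kernel_shape,
                                       padding_per_side, strides))


print(conv_output_shape((3, 3), (2, 2)))                  # (2, 2)
print(conv_output_shape((8, 8), (3, 3), (1, 1)))          # (8, 8)
print(conv_output_shape((8, 8), (5, 3), (2, 1)))          # (8, 8)
print(conv_output_shape((8, 8), (3, 3), (1, 1), (2, 2)))  # (4, 4)
print(conv_output_shape((8, 8), (3, 5), (0, 1), (3, 4)))  # (2, 2)
```

The last four calls correspond to the four Gluon layers constructed in the padding and stride examples.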

For the sake of brevity, when the padding on each side of the input height and width is ph and pw, respectively, we call the padding (ph, pw). Specifically, when ph = pw = p, the padding is p. Note that in this shorthand ph and pw count rows and columns per side, as in the padding argument of the code above, whereas the output-shape formulas use the total padding. When the strides on the height and width are sh and sw, respectively, we call the stride (sh, sw). Specifically, when sh = sw = s, the stride is s. By default, the padding is 0 and the stride is 1. In practice we rarely use inhomogeneous strides or padding, i.e., we usually have ph = pw and sh = sw.

## 1.6 Multiple Input Channels 

When the input data contains multiple channels, we need to construct a convolution kernel with the same number of input channels as the input data, so that it can perform cross-correlation with the input data. Assuming that the number of channels for the input data is ci, the number of input channels of the convolution kernel also needs to be ci. If our convolution kernel’s window shape is kh×kw, then when ci = 1, we can think of our convolution kernel as just a two-dimensional array of shape kh×kw.

However, when ci > 1, we need a kernel that contains an array of shape kh×kw for each input channel. Concatenating these ci arrays together yields a convolution kernel of shape ci×kh×kw. Since the input and convolution kernel each have ci channels, we can perform a cross-correlation operation on the two-dimensional array of the input and the two-dimensional kernel array of the convolution kernel for each channel, adding the ci results together (summing over the channels) to yield a two-dimensional array. This is the result of a two-dimensional cross-correlation between multi-channel input data and a multi-input channel convolution kernel.

In the figure below, we demonstrate an example of a two-dimensional cross-correlation with two input channels. The shaded portions are the first output element as well as the input and kernel array elements used in its computation: (1×1 + 2×2 + 4×3 + 5×4) + (0×0 + 1×1 + 3×2 + 4×3) = 56.
![cnn_pic7.PNG](attachment:cnn_pic7.PNG)

To make sure we really understand what’s going on here, we can implement cross-correlation operations with multiple input channels ourselves. Notice that all we are doing is performing one cross-correlation operation per channel and then adding up the results using the add_n function.

In [16]:
from mxnet import nd

def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add them together by using * to turn the result list into a
    # positional argument of the add_n function. corr2d is the function
    # defined in Section 1.1 (the one saved to the d2l package).
    return nd.add_n(*[corr2d(x, k) for x, k in zip(X, K)])





We can construct the input array X and the kernel array K corresponding to the values in the above diagram to validate the output of the cross-correlation operation. 

In [17]:
X = nd.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]], [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]) 
K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])
corr2d_multi_in(X, K)


[[ 56.  72.]
 [104. 120.]]
<NDArray 2x2 @cpu(0)>
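
To see where each entry comes from, the per-channel results can be printed before they are summed. The following plain-Python sketch re-implements both helpers on nested lists (so it does not depend on MXNet); the variable `per_channel` is introduced here for illustration only.

```python
def corr2d(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def corr2d_multi_in(X, K):
    """Cross-correlate each channel with its kernel slice, then sum."""
    per_channel = [corr2d(x, k) for x, k in zip(X, K)]
    # zip(*per_channel) groups matching rows; zip(*rows) groups matching
    # entries, which are then summed over the channel dimension.
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*per_channel)]
    return per_channel, summed


X = [[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
     [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
K = [[[0, 1], [2, 3]],
     [[1, 2], [3, 4]]]

per_channel, Y = corr2d_multi_in(X, K)
print(per_channel)  # [[[19, 25], [37, 43]], [[37, 47], [67, 77]]]
print(Y)            # [[56, 72], [104, 120]]
```

The top-left entry is 19 + 37 = 56, matching the two bracketed sums in the worked example above.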

## 1.7 Multiple Output Channels 

Regardless of the number of input channels, so far we always ended up with one output channel. However, as we discussed earlier, it turns out to be essential to have multiple channels at each layer. In the most popular neural network architectures, we actually increase the channel dimension as we go higher up in the neural network, typically downsampling to trade off spatial resolution for greater channel depth. Intuitively, you could think of each channel as responding to some different set of features. Reality is a bit more complicated than the most naive interpretations of this intuition, since representations aren’t learned independently but are rather optimized to be jointly useful. So it may not be that a single channel learns an edge detector but rather that some direction in channel space corresponds to detecting edges.

Denote by ci and co the number of input and output channels, respectively, and let kh and kw be the height and width of the kernel. To get an output with multiple channels, we can create a kernel array of shape ci×kh×kw for each output channel. We concatenate them on the output channel dimension, so that the shape of the convolution kernel is co×ci×kh×kw. In cross-correlation operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input array.

We implement a cross-correlation function to calculate the output of multiple channels as shown below. 

In [18]:
def corr2d_multi_in_out(X, K): 
    # Traverse along the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are merged
    # together using the stack function 
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])

We construct a convolution kernel with 3 output channels by concatenating the kernel array K with K+1 (plus one for each element in K) and K+2.

In [19]:
K = nd.stack(K, K + 1, K + 2)
K.shape

(3, 2, 2, 2)

Below, we perform cross-correlation operations on the input array X with the kernel array K. Now the output contains 3 channels. The result of the first channel is consistent with the result of the previous input array X and the multi-input channel, single-output channel kernel. 

In [20]:
corr2d_multi_in_out(X, K)


[[[ 56.  72.]
  [104. 120.]]

 [[ 76. 100.]
  [148. 172.]]

 [[ 96. 128.]
  [192. 224.]]]
<NDArray 3x2x2 @cpu(0)>

## 1.8 1×1 Convolutional Layer 

At first, a 1×1 convolution, i.e. kh = kw = 1, doesn’t seem to make much sense. After all, a convolution correlates adjacent pixels. A 1×1 convolution obviously doesn’t. Nonetheless, they are popular operations that are sometimes included in the designs of complex deep networks. Let’s see in some detail what it actually does. Because the minimum window is used, the 1×1 convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions. The only computation of the 1×1 convolution occurs on the channel dimension.

The figure below shows the cross-correlation computation using the 1×1 convolution kernel with 3 input channels and 2 output channels. Note that the inputs and outputs have the same height and width. Each element in the output is derived from a linear combination of elements at the same position in the input image. You could think of the 1×1 convolutional layer as constituting a fully-connected layer applied at every single pixel location to transform the ci corresponding input values into co output values. Because this is still a convolutional layer, the weights are tied across pixel locations. Thus the 1×1 convolutional layer requires co×ci weights (plus the bias terms).
![cnn_pic8.PNG](attachment:cnn_pic8.PNG)

Let’s check whether this works in practice: we implement the 1×1 convolution using a fully-connected layer. The only thing is that we need to make some adjustments to the data shape before and after the matrix multiplication.

In [21]:
def corr2d_multi_in_out_1x1(X, K): 
    c_i, h, w = X.shape 
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i)) 
    # Matrix multiplication in the fully connected layer
    Y = nd.dot(K, X)
    return Y.reshape((c_o, h, w)) 

When performing 1×1 convolution, the above function is equivalent to the previously implemented cross-correlation function corr2d_multi_in_out. Let’s check this with some reference data.

In [22]:
X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6

True
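
The "fully-connected layer at every pixel" view can be made explicit with a tiny hand-checkable example. This is a plain-Python sketch; the input values and the weight matrix `W` are made up for illustration.

```python
c_i, c_o, h, w = 3, 2, 2, 2

# Input with 3 channels, indexed X[channel][row][col] (illustrative values).
X = [[[1, 2], [3, 4]],
     [[0, 1], [0, 1]],
     [[2, 0], [0, 2]]]
# A 1x1 kernel of shape (c_o, c_i, 1, 1) reduces to a c_o x c_i matrix.
W = [[1, 0, -1],
     [2, 1, 0]]

# Every pixel is transformed by the same weights W: a fully-connected layer
# applied independently at each (row, col) position.
Y = [[[sum(W[o][c] * X[c][i][j] for c in range(c_i)) for j in range(w)]
      for i in range(h)]
     for o in range(c_o)]

num_weights = c_o * c_i  # shared across all pixel locations, bias excluded
print(Y)            # [[[-1, 2], [3, 2]], [[2, 5], [6, 9]]]
print(num_weights)  # 6
```

The output keeps the 2×2 spatial shape of the input; only the channel dimension changes from 3 to 2.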

## 1.9 Pooling

Often, as we process images, we want to gradually reduce the spatial resolution of our hidden representations, aggregating information so that the higher up we go in the network, the larger the receptive field (in the input) to which each hidden node is sensitive.

Often our ultimate task asks some global question about the image, e.g., does it contain a cat? So typically the nodes of our final layer should be sensitive to the entire input. By gradually aggregating information, yielding coarser and coarser maps, we accomplish this goal of ultimately learning a global representation, while keeping all of the advantages of convolutional layers at the intermediate layers of processing.

Moreover, when detecting lower-level features, such as edges (as discussed in Section 1.3), we often want our representations to be somewhat invariant to translation. For instance, if we take the image X with a sharp delineation between black and white and shift the whole image by one pixel to the left, i.e. Z[i,j] = X[i,j+1], then the output for the new image Z might be vastly different. The edge will have shifted by one pixel and with it all the activations. In reality, objects hardly ever occur exactly at the same place. In fact, even with a tripod and a stationary object, vibration of the camera due to the movement of the shutter might shift everything by a pixel or so (high-end cameras are loaded with special features to address this problem).

This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no filter). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively. In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the top left of the input array and sliding across the input array from left to right and top to bottom. At each location that the pooling window hits, it computes the maximum or average value of the input subarray in the window (depending on whether max or average pooling is employed).
![cnn_pic9.PNG](attachment:cnn_pic9.PNG)
The output array in the figure above has a height of 2 and a width of 2. Each of the four elements is the maximum value within the corresponding pooling window:
![cnn_pic10.PNG](attachment:cnn_pic10.PNG)

A pooling layer with a pooling window shape of p×q is called a p×q pooling layer. The pooling operation is called p×q pooling.

Let us return to the object edge detection example mentioned at the beginning of this section. Now we will use the output of the convolutional layer as the input for 2×2 maximum pooling. Set the convolutional layer input as X and the pooling layer output as Y. Whether the values of X[i, j] and X[i, j+1] differ, or those of X[i, j+1] and X[i, j+2] differ, the pooling layer always outputs Y[i, j] = 1. That is to say, using the 2×2 maximum pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves by no more than one element in height and width.

In the code below, we implement the forward computation of the pooling layer in the pool2d function. This function is similar to the corr2d function in Section 1.1. However, here we have no kernel, computing the output as either the max or the average of each region in the input.

In [23]:
from mxnet import nd 
from mxnet.gluon import nn
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size 
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1)) 
    for i in range(Y.shape[0]): 
        for j in range(Y.shape[1]): 
            if mode == 'max': 
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

We can construct the input array X in the above diagram to validate the output of the two-dimensional maximum pooling layer. 

In [24]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))


[[4. 5.]
 [7. 8.]]
<NDArray 2x2 @cpu(0)>

We can also experiment with the average pooling layer.

In [25]:
pool2d(X, (2, 2), 'avg')


[[2. 3.]
 [5. 6.]]
<NDArray 2x2 @cpu(0)>
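
Returning to the edge-detection example, the tolerance of 2×2 max pooling to a one-pixel shift can be checked directly. This is a plain-Python sketch; `edge_response` and `max_pool2d` are illustrative helpers written for nested lists, and `Z` is the image with every edge moved one pixel to the right.

```python
def edge_response(X):
    """Cross-correlation with the 1x2 kernel [1, -1]."""
    return [[row[j] - row[j + 1] for j in range(len(row) - 1)] for row in X]


def max_pool2d(X, p_h, p_w):
    """Max pooling with stride 1 and no padding, on nested lists."""
    return [[max(X[i + a][j + b] for a in range(p_h) for b in range(p_w))
             for j in range(len(X[0]) - p_w + 1)]
            for i in range(len(X) - p_h + 1)]


X = [[1, 1, 0, 0, 0, 0, 1, 1] for _ in range(6)]
# Same picture with both edges one pixel further right.
Z = [[1, 1, 1, 0, 0, 0, 0, 1] for _ in range(6)]

conv_X, conv_Z = edge_response(X), edge_response(Z)
pool_X, pool_Z = max_pool2d(conv_X, 2, 2), max_pool2d(conv_Z, 2, 2)

print(conv_X[0], conv_Z[0])  # the 1 moves from column 1 to column 2
print(pool_X[0], pool_Z[0])  # both pooled rows still have a 1 at column 1
```

Before pooling, column 1 of the convolution output changes from 1 to 0 when the edge moves. After pooling, column 1 is 1 for both images, so a later layer reading that position still sees the edge.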

As with convolutional layers, pooling layers can also change the output shape. And as before, we can alter the operation to achieve a desired output shape by padding the input and adjusting the stride. We can demonstrate the use of padding and strides in pooling layers via the two-dimensional maximum pooling layer MaxPool2D shipped in MXNet Gluon’s nn module. We first construct an input data of shape (1, 1, 4, 4), where the first two dimensions are batch and channel. 

In [26]:
X = nd.arange(16).reshape((1, 1, 4, 4))
X


[[[[ 0.  1.  2.  3.]
   [ 4.  5.  6.  7.]
   [ 8.  9. 10. 11.]
   [12. 13. 14. 15.]]]]
<NDArray 1x1x4x4 @cpu(0)>

By default, the stride in the MaxPool2D class has the same shape as the pooling window. Below, we use a pooling window of shape (3, 3), so we get a stride shape of (3, 3) by default. 

In [27]:
pool2d = nn.MaxPool2D(3)
# Because there are no model parameters in the pooling layer, we do not need 
# to call the parameter initialization function 
pool2d(X)


[[[[10.]]]]
<NDArray 1x1x1x1 @cpu(0)>

The stride and padding can be manually specified. 

In [28]:
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)


[[[[ 5.  7.]
   [13. 15.]]]]
<NDArray 1x1x2x2 @cpu(0)>

Of course, we can specify an arbitrary rectangular pooling window and specify the padding and stride for height and width, respectively. 

In [29]:
pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3)) 
pool2d(X)


[[[[ 0.  3.]
   [ 8. 11.]
   [12. 15.]]]]
<NDArray 1x1x3x2 @cpu(0)>
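
The three MaxPool2D outputs above can be reproduced with a short plain-Python sketch. It is not the Gluon implementation: it assumes that padded cells never win the maximum (modelled here as negative infinity) and that the output size is rounded down, which happens to match the results printed above. The function name `max_pool2d` is illustrative.

```python
def max_pool2d(X, pool_size, padding=(0, 0), strides=None):
    """Max pooling on a nested-list 2D array.

    padding is per side; padded cells are -inf so they never win the max
    (an assumption of this sketch). strides defaults to the pooling window
    shape, as described in the text.
    """
    p_h, p_w = pool_size
    pad_h, pad_w = padding
    s_h, s_w = strides if strides is not None else pool_size
    neg_inf = float('-inf')
    width = len(X[0]) + 2 * pad_w
    padded = ([[neg_inf] * width for _ in range(pad_h)]
              + [[neg_inf] * pad_w + list(row) + [neg_inf] * pad_w
                 for row in X]
              + [[neg_inf] * width for _ in range(pad_h)])
    out_h = (len(padded) - p_h) // s_h + 1
    out_w = (width - p_w) // s_w + 1
    return [[max(padded[i * s_h + a][j * s_w + b]
                 for a in range(p_h) for b in range(p_w))
             for j in range(out_w)]
            for i in range(out_h)]


X = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(max_pool2d(X, (3, 3)))
# [[10.0]]
print(max_pool2d(X, (3, 3), padding=(1, 1), strides=(2, 2)))
# [[5.0, 7.0], [13.0, 15.0]]
print(max_pool2d(X, (2, 3), padding=(1, 2), strides=(2, 3)))
# [[0.0, 3.0], [8.0, 11.0], [12.0, 15.0]]
```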

# Homework
1. Create a model with only fully connected layers, then train and test it on the Fashion-MNIST dataset.
2. Create a model with only convolutional and fully connected layers, then train and test it on the Fashion-MNIST dataset. (The number of parameters needs to be the same as in the first question.)
3. Create a model with only convolutional, pooling, and fully connected layers, then train and test it on the Fashion-MNIST dataset. (The number of parameters needs to be the same as in the first question.)

## 1. Create a model with only fully connected layers, then train and test it on the Fashion-MNIST dataset.

In [None]:
import mxnet as mx
from mxnet import gluon, init, nd, autograd
from mxnet.gluon import data as gdata, nn, loss as gloss, utils as gutils
import os
#os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import sys
import time

def try_gpu():
    """If GPU is available, return mx.gpu(0); else return mx.cpu()."""
    try:
        ctx = mx.gpu()
        _ = nd.array([0], ctx=ctx)
    except mx.base.MXNetError:
        ctx = mx.cpu()
    return ctx

def train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx,
              num_epochs):
    """Train and evaluate a model with CPU or GPU."""
    print('training on', ctx)
    loss = gloss.SoftmaxCrossEntropyLoss()
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X, y = X.as_in_context(ctx), y.as_in_context(ctx)
            with autograd.record():
                y_hat = net(X)            
                l = loss(y_hat, y).sum()
            l.backward()
            trainer.step(batch_size)
            y = y.astype('float32')
            train_l_sum += l.asscalar()
            train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()
            n += y.size
        test_acc = evaluate_accuracy(test_iter, net, ctx)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '
              'time %.1f sec'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc,
                 time.time() - start))
        
def evaluate_accuracy(data_iter, net, ctx=[mx.cpu()]):
    """Evaluate accuracy of a model on the given data set."""
    if isinstance(ctx, mx.Context):
        ctx = [ctx]
    acc_sum, n = nd.array([0]), 0
    for batch in data_iter:
        features, labels, _ = _get_batch(batch, ctx)
        for X, y in zip(features, labels):
            y = y.astype('float32')
            acc_sum += (net(X).argmax(axis=1) == y).sum().copyto(mx.cpu())
            n += y.size
        acc_sum.wait_to_read()
    return acc_sum.asscalar() / n

def _get_batch(batch, ctx):
    """Return features and labels on ctx."""
    features, labels = batch
    if labels.dtype != features.dtype:
        labels = labels.astype(features.dtype)
    return (gutils.split_and_load(features, ctx),
            gutils.split_and_load(labels, ctx), features.shape[0])

def load_data_fashion_mnist(batch_size, resize=None, root=os.path.join(
        '~', '.mxnet', 'datasets', 'fashion-mnist')):
    root = os.path.expanduser(root)  # Expand the user path '~'.
    transformer = []
    if resize:
        transformer += [gdata.vision.transforms.Resize(resize)]
    transformer += [gdata.vision.transforms.ToTensor()]
    transformer += [gdata.vision.transforms.Normalize(0.13, 0.31)]
    transformer = gdata.vision.transforms.Compose(transformer)
    mnist_train = gdata.vision.FashionMNIST(root=root, train=True)
    mnist_test = gdata.vision.FashionMNIST(root=root, train=False)
    num_workers = 0 if sys.platform.startswith('win32') else 4
    train_iter = gdata.DataLoader(
        mnist_train.transform_first(transformer), batch_size, shuffle=True,
        num_workers=num_workers)
    test_iter = gdata.DataLoader(
        mnist_test.transform_first(transformer), batch_size, shuffle=False,
        num_workers=num_workers)
    return train_iter, test_iter

lr, num_epochs, ctx = 1e-3, 100, try_gpu()
#############################################################################
# TODO:                                                                     #                                  
#############################################################################


## 2. Create a model with only convolutional and fully connected layers, then train and test it on the Fashion-MNIST dataset.

In [None]:
#############################################################################
# TODO:                                                                     #                                  
#############################################################################


## 3. Create a model with only convolutional, pooling, and fully connected layers, then train and test it on the Fashion-MNIST dataset.

In [None]:
#############################################################################
# TODO:                                                                     #                                  
#############################################################################
