# Chapter 15:  Classifying Images with Deep Convolutional Neural Networks.

In this chapter, we'll now learn about **Convolutional Neural Networks** (CNNs), and how we can implement CNNs in TensorFlow.  We will also apply this type of Deep Neural Network architecture to **image classification.**

So we'll start by discussing the basic building blocks of CNN's, using a bottom-up approach.  Then we will take a deeper dive into the CNN architecture and how to implement deep CNNs in TensorFlow.  Along the way we'll be covering the following topics:
* Understanding convolutional operations in one and two dimensions.
* Learning about the building blocks of CNN architectures.
* Implementing deep convolutional neural networks in TensorFlow.

## Building blocks of convolutional neural networks.

In the following sections, we next see how CNNs are used as **feature extraction engines**, and then we'll delve into the theoretical definition of convolution and computing convolution in one and two dimensions.

## Understanding CNNs and learning feature hierarchies

Neural networks are able to automatically learn the features from raw data that are most useful for a particular task.  For this reason, it is common to consider a neural networks as a feature extraction engine: the early layers (those right after the input layer) extract **low-level features.**

Deep convolutional neural networks construct a so-called feature hierarchy by combining the low-level features in a layer-wise fashion to form high-level features.  For example, if we're dealing with images then low-level features, such as edges and blobs, are extracted from the earlier layers, which are combined together to form high-level features---as object shapes like a building, a car, or a dog.

As you can see in the following image, a CNN computes feature maps from an input image, where each element comes from a local patch of pixels in the input image:

<img src="images/15_01.png" style="width:500px">

This local patch of pixels is referred to as the **local receptive field**.  CNNs will usually perform very well for image-related tasks, and that is largely due to two important ideas:
1. **Sparse-connectivity**:  A single element in the feature map is connected to only a small patch of pixels.  (This is very different from connecting to the whole input image, in the case of perceptrons.)
2. **Parameter-sharing**:  The same weights are used for different patches of the input image.

As a direct consequence of those two ideas, the number of weights (parameters) in the network decreases dramatically, and we see an improvement in the ability to capture salient features.  Intuitively, it makes sense that nearby pixels are probably more relevant to each other than pixels that are far away from each other.

Typically, CNNs are composed of several Convolutional (conv) layers, and subsampling (also known as Pooling (p)) layers that are followed by one or more Fully Connected (FC) layers at the end.

Please note that subsampling layers, commonly known as pooling layers, do not have any learnable parameters.

In the following sections, we'll study convolutional and pooling layers in more detail and see how they work.  To understand how convolution operations work, let's start with a convolution in one dimension before working through the typical two-dimensional cases as applications for two-dimensional images later.

## Performing discrete convolutions

In this section, we will learn the mathematical definition and discuss some of the **naive** algorithms to compute convolutions of two one-dimensional vectors or two two-dimensional matrices.

Take note of the following mathematical notation:
* $A_{n_1 \times n_2}$ represents a two-dimensional array of size $n_1 \times n_2$.
* We use brackets $[]$ to denote the indexing of a multi-dimensional array.
* We use the special symbol $*$ to denote the convolution operation between two vectors or matrices, which is not to be confused with the multiplication operator $*$ in Python.

## Performing a discrete convolution in one dimension.

A discrete convolution for two one-dimensional vectors $\mathbf{x}$ and $\mathbf{w}$ is denoted by $\mathbf{y} = \mathbf{x} * \mathbf{w}$, in which vector $\mathbf{x}$ is our input (sometimes called **signal**), and $\mathbf{w}$ is called the **filter** or **kernel**.  A discrete convolution is mathematically defined as follows:

$$
\mathbf{y} = \mathbf{x} * \mathbf{w} \rightarrow \mathbf{y}[i] = \sum_{k=-\infty}^{k=+\infty}\mathbf{x}[i - k]\mathbf{w}[k]
$$

Here, the brackets $[]$ are used to denote the indexing for vector elements.  The index $i$ runw through each element of the output vector $\mathbf{y}$.  There are two odd things in the preceding formula that we need to clarify: $-\infty$ to $+\infty$ and negative indexing for $\mathbf{x}$.

In machine learning applications, we always deal with finite feature vectors.  Therefore to correctly compute the summation showin in the preceding formula, it is assumed that $\mathbf{x}$ and $\mathbf{w}$ are filled with zeros.  This will result in an output vector $\mathbf{y}$ that also has infinite size with lots of zeros as well.  Since this is not useful in practical situations, $\mathbf{x}$ is padded only with a finite number of zeros.

This process is called zero-padding or simply padding.  Here, the number of zeros padded on each side is denoted by $p$.  An example of padding of a one-dimensional vector $\mathbf{x}$ is shown in the following figure:

<img src="images/15_02.png" style="width:500px">

Let us assume that the original input $\mathbf{x}$ and filter $\mathbf{w}$ have $n$ and $m$ elements, respectively, where $m \le n$.  Therefore, the padded vector $\mathbf{x}^p$ has size $n + 2p$.  Then, the practical formula for computing a discrete convolution will change to the following:

$$
\mathbf{y} = \mathbf{x} * \mathbf{w} \rightarrow \mathbf{y}[i] = \sum_{k=0}^{k=m-1}\mathbf{x}^p[i + m - k]\mathbf{w}[k]
$$

Now that we have solved the infinite index issue, the second issue is indexing $\mathbf{x}$ with $i + m - k$.  The important point to notice here is that $\mathbf{x}$ and $\mathbf{w}$ are indexed in different directions in this summation.  For this reason, we can flip one of those vectors, $\mathbf{x}$ or $\mathbf{w}$, after they are padded.  Then we can simply compute their dot products.

This operation is repeated like in a sliding window approach to get all the output elements.  The following figure provides an example with $\mathbf{x} = [3, 2, 1, 7, 1, 2, 5, 4]$ and $\mathbf{w} = \left[ \frac{1}{2}, \frac{3}{4}, 1, \frac{1}{4} \right]$ so that the first three output elements are computed as follows:

<img src="images/15_03.png" style="width:500px">

We can see in the preceding example that the padding size is zero ($p = 0$).  Notice that the rotated filter $\mathbf{w}^r$ is shifted by 2 cells each time we shift.  This shift is another hyperparameter of a convolution, the **stride** $s$.  In this example, the stride is 2, $s = 2$.  Note that the stride has to be a postive number smaller than the size of the input vector.

## The effect of zero-padding in a convolution.

Technically, padding can be applied with any $p \ge 0$.  Depending on the choice of $p$, bounday cells may be treated different than the cells located in the middle of $\mathbf{x}$.

The size of the output $\mathbf{y}$ depends on the choice of the padding strategy that we use.  There are three modes of padding that are commonly used in practice: **full**, **same**, and **valid**.
* In the **full** mode, the padding parameter $p$ is set to $p = m - 1$.  Full padding increases the dimensions of the output; this, it is rarely used in convolutional neural network architectures.
* **Same** padding is usually used if you want to have the size of the output the same as the input vector $\mathbf{x}$.
* Finally, computing a convolution in the **valid** mode refers to the case where $p = 0$ (no padding).

The following figure illustrates the three different padding modes for a simple $5 \times 5$ pixel input with a kernel size of $3\times 3$ and a stride of 1:

<img src="images/15_11.png" style="width:500px">

One of the advantages of **same** padding over other padding modes is that it preserves the height and width of the imput images of tensors, which makes designing a network architecture more convenient.

One big disadvantage of the **valid** padding versus **full** and **same** padding, for example, is that the volume of the tensors would decreasse substantially in neural networks with many layers, which can be detrimental to the network performance.

In practice, it is recommended that you preserve the spatial size using same padding for the convolutional layers and decrease the spatial size via pooling layers instead.  Full padding is usually used in signal processing applications where it is important to minimize boundary effects.  However, in the deep learning context, boundary effects is not usually an issue, so we rarely see full padding.

## Determining the size of the convolution output.

The output size of a convolution is determined by the total number of times that we **shift** the filter $\mathbf{w}$ along the input vector.  The size of the output resulting from the $\mathbf{x} * \mathbf{w}$ with padding $p$ and stride $s$ is determined as follows:

$$
o = \left\lfloor \frac{n + 2p - m}{s} \right\rfloor + 1
$$

Here, $\lfloor \ldots \rfloor$ denotes the floor operation.

If you want to learn more about the size of the convolution output, look at the manuscript _A guide to convolution arithmetic for deep learning, Vincent Dumouling and Franceso Visin, 2016_ which is freely available at <https://arxiv.org/abs/1603.07285>.

Finally, in order to learn how to compute convolutions in one dimension, a n\"{a}ive implementation is shown in the following code block, and the results are compared with the `numpy.convolve` function.  The code is as follows:

In [11]:
import numpy as np

def conv1d(x, w, p=0, s=1):
    w_rot = np.array(w[::-1])
    x_padded = np.array(x)
    if p > 0:
        zero_pad = np.zeros(shape=p)
        x_padded = np.concatenate([zero_pad, x_padded, zero_pad], axis=None)
    res = []
    for i in range(0, int(len(x)/s), s):
        res.append(np.sum(x_padded[i:i+w_rot.shape[0]] * w_rot))
    return np.array(res)

## Testing
x = [1, 3, 2, 4, 5, 6, 1, 3]
w = [1, 0, 3, 1, 2]

print("conv1d Implementation:", conv1d(x, w, p=2, s=1))

print("Numpy Results", np.convolve(x, w, mode="same"))

conv1d Implementation: [ 5. 14. 16. 26. 24. 34. 19. 22.]
Numpy Results [ 5 14 16 26 24 34 19 22]


So far, here, we have explored the convolution in 1D.  We started withk the 1D case to make the concepts easier to understand.  In the next section, we will extend this to 2 dimensions.

## Performing a discrete convolution in 2D:

The 2D convolution is mathematically defined as follows:

$$
\mathbf{Y} = \mathbf{X} * \mathbf{W} \rightarrow \sum_{k_1 = -\infty}^{+\infty}\sum_{k_2 = -\infty}^{+\infty}\mathbf{X}[i - k_1, j - k_2]\mathbf{W}[k_1, k_2]
$$

The following example illustrates the computation of a 2D convolution between an input matrix $\mathbf{X}_{x\times 3}$, a kernel matrix $\mathrmb{W}_{3\times 3}$, padding $p = (1, 1)$, and stride $s = (2, 2)$.  According to the specified padding, one layer of zeros are padded on each side of the input matrix, which results in the padded matrix $\mathbf{X}_{5\times 5}^{padded}$, as follows:

<img src="images/15_04.png" style="width:500px">

With the preceding filter, the rotated filter will be:

$$
\mathbf{W}^r = \begin{bmatrix}
0.5 & 1 & 0.5 \\
0.1 & 0.4 & 0.3 \\
0.4 & 0.7 & 0.5 \\
\end{bmatrix}
$$

Notice that this rotation is not the same as the transpose matrix.  To get the roated filter in NumPy, we can write `W_rot = W[::-1, ::-1]`.  Next, we can shift the rotated filter matrix along the padded input matrix $\mathbf{X}^{\mathrm{padded}}$ like a sliding window and compute the sum of the element-wise product, which is denoted by the $\odot$ operator in the following figure:

<img src="images/15_05.png" style="width:500px">

The result will be the $2\times 2$ matrix $\mathbf{Y}$.

Let us also implement the 2D convolution according to the na\"{i}ve algorithm described.  
The `scipy.signal` package provides a way to compute 2D convolution via the `scipy.signal.convolve2d` function:

In [13]:
import numpy as np
import scipy.signal

def conv2d(X, W, p=(0,0), s=(1,1)):
    W_rot = np.array(W)[::-1, ::-1]
    X_orig = np.array(X)
    n1 = X_orig.shape[0] + 2*p[0]
    n2 = X_orig.shape[1] + 2*p[1]
    X_padded = np.zeros(shape=(n1, n2))
    X_padded[p[0]:p[0]+X_orig.shape[0], p[1]:p[1]+X_orig.shape[1]] = X_orig
    
    res = []
    
    for i in range(0, int((X_padded.shape[0] - W_rot.shape[0])/s[0])+1, s[0]):
        res.append([])
        for j in range(0, int((X_padded.shape[1] - W_rot.shape[1])/s[1])+1, s[1]):
            X_sub = X_padded[i:i+W_rot.shape[0], j:j+W_rot.shape[1]]
            res[-1].append(np.sum(X_sub * W_rot))
    return(np.array(res))

X = [[1, 3, 2, 4],
     [5, 6, 1, 3],
     [1, 2, 0, 2],
     [3, 4, 3, 2]]

W = [[1, 0, 3],
     [1, 2, 1],
     [0, 1, 1]]

print("Conv2d Implementation:\n", conv2d(X, W, p=(1, 1), s=(1, 1)))
print("Scipy Results:\n", scipy.signal.convolve2d(X, W, mode="same"))

Conv2d Implementation:
 [[11. 25. 32. 13.]
 [19. 25. 24. 13.]
 [13. 28. 25. 17.]
 [11. 17. 14.  9.]]
Scipy Results:
 [[11 25 32 13]
 [19 25 24 13]
 [13 28 25 17]
 [11 17 14  9]]


In the next section, we will discuss subsampling, which is another important operation often used in CNNs.

## Subsampling

Subsampling is typically applied in two forms of pooling opertions in convolutional neural networks:
* **max-pooling**
* **mean-pooling**

The pooling layer is usually denoted by $\mathbf{P}_{n_1 \times n_2}$.  Here, the subscript determines the size of the neighbourhood (the number of adjacent pixel in each dimension), where the max of mean operation is performed.  We refer to such a neighbourhood as the pooling size.

The operation is described in the following figure.  Here, max-pooling takes the maximum value from a neighbourhood of pixels, and mean-pooling computes their average:

<img src="images/15_06.png" style="width:500px">

The advantages of Pooling is twofold:
1. Pooling (max-pooling) introduces some sort of local invariance.  This means that small changes in a local neighbourhood do not change the result of max-pooling.  Therefore, it helps generate features that are more robust to noise in the input data.
2. Pooling decreases the size of features, which results in higher computational efficiency.  Furthermore, reducing the number of features may reduce the degree of overfitting as well.

Traditionally, pooling is assumed to be non-overlapping.  This can be done by setting the stride parameter equal to the pooling size.  For example, a nonoverlapping pooling layer $\mathbf{P}_{n_1 \times n_2}$ requires a stride parameter $s = (n_1, n_2)$.

## Working with multiple input or color channels

An input sample to a convolutional layer may contain one or more 2D array of matrices with dimensions $N_1 \times N_2$ (for example, the image height and width in pixels).  These $N_1\times N_2$ matrices are called channels.  Therefore, using multiple channels as input to a convolutional layer requires us to use a rank-3 tensor or a three-dimensional array:  $\mathbf{X}_{N_1 \times N_2 \times C_{\mathrm{in}}}$, where $C_{\mathrm{in}}$ is the number of input channels.

When we work with images, we can read the images into Numpy arrays using the `uint8` (unsigned 8-bit integer) data type to reduce memory usage compare to 16-bit, 32-bit, or 64-bit integer types, for example.  Unsigned 8-bit integers take values in the range of $[0, 255]$, which are sufficient to store the pixel information in RGB images, which also take values in the same range.

Now, let's look at an example of how we can read in an image into our Python session using SciPy.

Once Pillow is installed, we can use the `imread` function from the `scipy.misc` module to read an RGB image.

In [19]:
import scipy.misc
import imageio

img = imageio.imread("example-image.png")[:,:,0:3]
print("Image shape: {}.".format(img.shape))

print("Number of channels: {}.".format(img.shape[2]))

print("Image data type: {}.".format(img.dtype))

print(img[100, 100, :])

print(img[100, 100:102, :])

print(img[100:102, 100:102, :])

Image shape: (252, 221, 3).
Number of channels: 3.
Image data type: uint8.
[179 134 110]
[[179 134 110]
 [182 136 112]]
[[[179 134 110]
  [182 136 112]]

 [[180 135 111]
  [182 137 113]]]


Now that we have familiarized ourselves with the structure of the input data, the next question is how can we incorporate multiple input channels in the convolution operation that we discussed in the previous sections?

The answer is very simple:  we perform the vonvolution operation for each channel separately and then add the results together using the matrix summation.  The convolution associated with eacdh channel ($c$) has its own kernel matrix as $\mathbf{W}[::,c]$.  The total pre-activation result is computed in the following formula:

Given a sample $\mathbf{X}_{n_1\times n_2\times c_{in}}$, a kernel matrix $\mathbf{W}_{m_1\times m_2\times c_{\mathrm{in}}}$, and a bias value $b$, it follows that:
* $\mathbf{Y}^{\mathrm{Conv}} = \sum_{c=1}^{C_{in}}\mathbf{W}[:,:,c] * \mathbf{X}[:,:,c]$
* Pre-activation:  $\mathbf{A} = \mathbf{Y}^{\mathrm{Conv}} + b$
* Feature Map:  $\mathbf{H} = \phi(\mathbf{A})$

The final result, $\mathbf{H}$, is called a **feature map**.  Usually, a convolutional layer of a CNN has more than one feature map.  If we use multiple feature maps, the kernel tensor become four-dimensional: $width\times height\times C_{in}\times C_{out}$.  Here, $width\times height$ is the kernel size, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output feature maps.  So, now let's include the number of output feature maps in the preceding formula and update it as follows:

Given a sample $\mathbf{X}_{n_1\times n_2\times c_{in}}$, a kernel matrix $\mathbf{W}_{m_1\times m_2\times c_{\mathrm{in}}\times C_{out}}$, and a bias value $\mathbf{b}_{C_{out}}$, it follows that:
* $\mathbf{Y}^{\mathrm{Conv}}[:, :, k] = \sum_{c=1}^{C_{in}}\mathbf{W}[:,:,c, k] * \mathbf{X}[:,:,c]$
* Pre-activation:  $\mathbf{A}[:,:,k] = \mathbf{Y}^{\mathrm{Conv}}[:,:, k] + \mathbf{b}[k]$
* Feature Map:  $\mathbf{H}[:,:,k] = \phi(\mathbf{A}[:,:,k])$

To conclude our discussion of computing convolutions in the context of neural networs, let us look at the example in the following figure that shows a convolutionl layer, followed by a pooling layer.

In this example, there are 3 input channels.  The kernel tensor is four-dimensional.  Each kernel matrix is denoted as $m_1\times m_2$, and there are three of them, one for each input channel.  Furthermore, there are five such kernels, accounting for the five output feature maps.  Finally, there is a pooling layer for subsampling the feature maps, as shown in the following figure:

<img src="images/15_07.png" style="width:500px">
In this example, there are $m_1 \times m_2 \times 3\times 5 + 5$ trainable parameters.

In the next section, we will talk about how to regularize a neural network.

## Regularizing a neural network with dropout

The **capacity** of a netwok refers to the level of complexity of the function that it can learn.  Small networks are likely to underfit, while large networks may more easily result in overfitting.  When we deal with real-world machine learning problems, we do not know how large the network should be **a priori**.

One way to address this problem is to build a network with a relatively large capacity (in practice, we want to choose a capacity that is slightly larger than necessary) to do well on the training set.  Then, to prevent overfitting, we can apply one or multiple regularization schemes to achive good generalization performance on new data.

In recent years, another popular regularization technique called dropout has emerged that works amazingly well for regularizing deep neural networks.  (Refer to the original paper _Dropout: A simple way to prevent neural networks from overfitting, Nitish Srivastava and others, Journal of Machine Learning Research 15.1_ pages 1929-1958, _2014_, <http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf>).

Intuitively, dropout can be considered as the consensus (averaging) of an ensemble of models.  Furtheremore, here, dropout offers a workaround with an efficient way to train many models at once and compute their average predictions at test or prediction time.

Dropout is usually applied to the hidden units of higher layers.  During the training phase of a neural network, a fraction of the hidden units is randomly dropped at every iteration with probability $p_{\mathrm{drop}}$ (or the keep probability $p_{\mathrm{keep}} = 1 - p_{\mathrm{drop}}$).

This dropout probability is determined by the user, and the common choice is $p = 0.5$.  When dropping a certain fraction of the input neurons, the weights associated with the remaining neurons are rescaled to account for the missing (dropped) neurons.

The effect of this random dropout forces the network to learn a redundant representation of the data.  Therefore, the network cannot rely on an activation of any set of hidden units since they may be turned off at any time during training and is forced to learn more general and robust patterns from the data.

However, during prediction, all neurons will contribute to computing the pre-activations of the next layer.

<img src="images/15_08.png" style="width:500px">
As shown here, one important point to remember is that units may drop randomly during training only, while for the evaluation phase, all the hidden units must be active (for instance, $p_{\mathrm{drop}} = 0$, or $p_{\mathrm{keep}} = 1$).  To ensure that the overall activations are on the same scale during training and prediction, the activations of the active neurons have to be scaled appropriately (for example, by halving the activation if the dropout probability was set to $p = 0.5$).

However, since it is inconvenient to always scale activations when we make predictions in practice, TensorFlow and other tools scale the activations during training (for example, by double the activations if the droupout probability was set to $p = 0.5$).

So, what is the relationship between dropout and ensemble learning?  Since we drop different hidden neurons at each iteration, effectively we are training different models.  When all these models are finally trained, we set the keep probability to 1 and use all the hidden units.  This means we are taking the average activation from all the hidden units.

## Implementing a deep convolutional neural network using TensorFlow.

So now, we want to implement a CNN to solve this MNIST problem and see its predictive power in classifying handwritten digits.

## The multilayer CNN architecture

The architecture that we are going to implement is shown in the following figure.  The input is $28\times 28$ grayscale images.  Considering the number of channels (which is 1 for grayscale images) and a batch of input images, the input tensor's dimension will be $batchsize\times 28\times 28\times 1$.

The input data goes through two convolutional layers that have a kernel size of $5\times 5$.  The first convolution has 32 output feature maps, and the second one has 64 output feature maps.  Each convolutional layer is followed by a subsampling layer in the form of a max-pooling operation.

Then a fully-connected layer passes the output to a second fully-connected layer, which acts as the final _softmax_ output layer.  The architecture of the network that we are going to implement is shown in the following figure:

<img src="images/15_09.png" style="width:500px">

The dimensions of the tensors in each layer are as follows:
* Input:  $batchsize\times 28\times 28\times 1$
* Conv_1:  $batchsize\times 24\times 24\times 32$
* Pooling_1:  $batchsize\times 12\times 12\times 32$
* Conv_2:  $batchsize\times 8\times 8\times 64$
* Pooling_2:  $batchsize\times 4\times 4\times 64$
* FC_1:  $batchsize\times 1024$
* FC_2 and softmax layer:  $batchsize \times 10$

We will implement this network using two APIs:  the low-level TensorFlow API and the TensorFlow Layers API.  But first, let's define some helper functions at the beginning of the next section.

## Loading and preprocessing the data

If you recall again from _Chapter 13, Parallelizing Neural Network Training with TensorFlow_, we used a function called `load_mnist` to read the MNIST handwritten digit dataset.  Now we need to repeat the same procedure here as well, as follows:

In [6]:
import os
import struct
import numpy as np

def load_mnist(path, kind="train"):
    """Load the MNIST data from `path`"""
    labels_path = os.path.join(path, "{}-labels-idx1-ubyte".format(kind))
    images_path = os.path.join(path, "{}-images-idx3-ubyte".format(kind))
    
    with open(labels_path, "rb") as lbpath:
        magic, n = struct.unpack(">II", lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    
    with open(images_path, "rb") as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
    
    return images, labels

In [7]:
### Loading the Data ###
X_data, y_data = load_mnist("./", kind="train")
print("Rows: {}, Columns: {}.".format(X_data.shape[0], X_data.shape[1]))

X_test, y_test = load_mnist("./", kind="t10k")
print("Rows: {}, Columns: {}.".format(X_test.shape[0], X_test.shape[1]))

X_train, y_train = X_data[:50000, :], y_data[:50000]
X_valid, y_valid = X_data[50000:, :], y_data[50000:]

print("Training:  {} {}.".format(X_train.shape, y_train.shape))
print("Validation:  {} {}.".format(X_valid.shape, y_valid.shape))
print("Test Set:  {} {}.".format(X_test.shape, y_test.shape))

# Notice that we are splitting the data into training, a validation, and test sets.
# The following results shows the shpae of each set:

Rows: 60000, Columns: 784.
Rows: 10000, Columns: 784.
Training:  (50000, 784) (50000,).
Validation:  (10000, 784) (10000,).
Test Set:  (10000, 784) (10000,).


After we have loaded the data, we need a function for iterating through the mini-batches of data, as follows:

In [8]:
def batch_generator(X, y, batch_size=64, shuffle=False, random_seed=None):
    
    idx = np.arange(y.shape[0])
    
    if shuffle:
        rng = np.random.RandomState(random_seed)
        rng.shuffle(idx)
        X = X[idx]
        y = y[idx]
    
    for i in range(0, X.shape[0], batch_size):
        yield (X[i:i+batch_size, :], y[i:i+batch_size])

This function returns a generator with a tuple for a mathc of samples, for instance, data X and labels y.  We then need to normalize the data (mean centering and divising by the standard deviation) for better training performance and convergence.

We compute the mean of each feature using the training data (`X_train`) and calculate the standard deviation across all the features.  The reason why we don't compute the standard deviation for each feature individually is vecause some features (piuxel positions) in image datasets such as MNIST have a constant value of 255 across all the images corresponding to white pixles in a grayscale image.

A constant value across all samples indicates no variation, and therefore, the standard deviation of those features will be zero, and a result would yield by the division-by-zero error, which is why we compute the standard deviation from the `X_train` array using np.std without specifying an `axis` argument:

In [10]:
mean_vals = np.mean(X_train, axis=0)
print(mean_vals.shape)
std_val = np.std(X_train)

X_train_centered = (X_train - mean_vals)/std_val
X_valid_centered = (X_valid - mean_vals)/std_val
X_test_centered = (X_test - mean_vals)/std_val

(784,)


Now we are ready to implement the CNN we just described.  We will proceed by implementing the CNN model in TensorFlow:

In [11]:
import tensorflow as tf
import numpy as np

def conv_layer(input_tensor, name, kernel_size, n_output_channels, padding_mode="SAME", strides=(1, 1, 1, 1)):
    with tf.variable_scope(name):
        ## get n_input channels:
        ## get input tensor shape:
        ## [batch  x  width  x  height  x  channels_in]
        input_shape = input_tensor.get_shape().as_list()
        n_input_channels = input_shape[-1]
        
        weights_shape = list(kernel_size) + [n_input_channels, n_output_channels]
        
        weights = tf.get_variable(name="_weights", shape=weights_shape)
        print(weights)
        
        biases = tf.get_variable(name="_biases", initializer=tf.zeros(shape=n_output_channels))
        print(biases)
        
        conv = tf.nn.conv2d(input=input_tensor,
                            filter=weights,
                            strides=strides,
                            padding=padding_mode)
        print(conv)
        conv = tf.nn.bias_add(conv, biases, name="net_pre-activation")
        print(conv)
        conv = tf.nn.relu(conv, name="activation")
        print(conv)
        
        return conv

In [12]:
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
    conv_layer(x, name="convtest", kernel_size=(3,3), n_output_channels=32)

<tf.Variable 'convtest/_weights:0' shape=(3, 3, 1, 32) dtype=float32_ref>
<tf.Variable 'convtest/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("convtest/Conv2D:0", shape=(?, 28, 28, 32), dtype=float32)
Tensor("convtest/net_pre-activation:0", shape=(?, 28, 28, 32), dtype=float32)
Tensor("convtest/activation:0", shape=(?, 28, 28, 32), dtype=float32)


In [13]:
del g, x

The next wrapper function if for defiing our fully connected layers:

In [14]:
def fc_layer(input_tensor, name, n_output_units, activation_fn=None):
    with tf.variable_scope(name):
        input_shape = input_tensor.get_shape().as_list()[1:]
        n_input_units = np.prod(input_shape)
        if len(input_shape) > 1:
            input_tensor = tf.reshape(input_tensor, shape=(-1, n_input_units))
        
        weights_shape = [n_input_units, n_output_units]
        weights = tf.get_variable(name="_weights", shape=weights_shape)
        print(weights)
        
        biases = tf.get_variable(name="_biases", initializer=tf.zeros(shape=[n_output_units]))
        print(biases)
        
        layer = tf.matmul(input_tensor, weights)
        print(layer)
        layer = tf.nn.bias_add(layer, biases, name="net_pre-activation")
        print(layer)
        
        if activation_fn is None:
            return layer
        
        layer = activation_fn(layer, name="activation")
        print(layer)
        return layer

In [16]:
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
    
    fc_layer(x, name="fc_test", n_output_units=32, activation_fn=tf.nn.relu)

<tf.Variable 'fc_test/_weights:0' shape=(784, 32) dtype=float32_ref>
<tf.Variable 'fc_test/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("fc_test/MatMul:0", shape=(?, 32), dtype=float32)
Tensor("fc_test/net_pre-activation:0", shape=(?, 32), dtype=float32)
Tensor("fc_test/activation:0", shape=(?, 32), dtype=float32)


In [17]:
del g, x

In [47]:
def build_cnn():
    ## Placeholders for x and y:
    tf_x = tf.placeholder(tf.float32, shape=[None, 784], name="tf_x")
    tf_y = tf.placeholder(tf.int32, shape=[None], name="tf_y")
    
    # Reshape x to a 4D tensor
    # [batchsize, height, width, 1]
    tf_x_image = tf.reshape(tf_x, shape=[-1, 28, 28, 1], name="tf_x_reshaped")
    
    ## One-hot encoding
    tf_y_onehot = tf.one_hot(indices=tf_y, depth=10, dtype=tf.float32, name="tf_y_onehot")
    
    ## 1st layer: Conv_1
    print("\nBuilding the first layer:")
    h1 = conv_layer(tf_x_image, name="conv_1", kernel_size=(5, 5), padding_mode="VALID", n_output_channels=32)
    
    ## MaxPooling
    h1_pool = tf.nn.max_pool(h1, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding="SAME")
    
    ## 2nd layer: Conv_2
    print("\nBuilding the second layer:")
    h2 = conv_layer(h1_pool, name="conv_2", kernel_size=(5, 5), padding_mode="VALID", n_output_channels=64)
    
    ## MaxPooling
    h2_pool = tf.nn.max_pool(h2, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding="SAME")
    
    ## 3rd layer: Fully Connected
    print("\nBuilding the 3rd layer:")
    h3 = fc_layer(h2_pool, name="fc_3", n_output_units=1024, activation_fn=tf.nn.relu)
    
    ## Dropout
    keep_prob = tf.placeholder(tf.float32, name="fc_keep_prob")
    h3_drop = tf.nn.dropout(h3, keep_prob=keep_prob, name="dropout_layer")
    
    ## 4th layer: Fully Connected (linear activation)
    print("\nBuilding the 4th layer:")
    h4 = fc_layer(h3_drop, name="fc_4", n_output_units=10, activation_fn=None)
    
    ## Prediction
    predictions = {"probabilities": tf.nn.softmax(h4, name="probabilities"),
                   "labels": tf.cast(tf.argmax(h4, axis=1), tf.int32, name="labels")}
    
    ## Visualise the graph with TensorBoard
    
    ## Loss Function and Optimization
    cross_entropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=h4, labels=tf_y_onehot),
                                        name="cross_entropy_loss")
    
    ## Optimizer:
    optimizer = tf.train.AdamOptimizer(learning_rate)
    optimizer = optimizer.minimize(cross_entropy_loss, name="train_op")
    
    ## Computing the prediction accuracy
    correct_predictions = tf.equal(predictions["labels"], tf_y, name="correct_preds")
    
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")

def save(saver, sess, epoch, path="./model/"):
    """Saving checkpoints of the trained model."""
    if not os.path.isdir(path):
        os.makedirs(path)
    print("Saving the model in {}.".format(path))
    saver.save(sess, os.path.join(path, "cnn-model.ckpt"), global_step=epoch)

def load(saver, sess, path, epoch):
    """Loading checkpoints of the trained model."""
    print("Loading model from {}.".format(path))
    saver.restore(sess, os.path.join(path, "cnn-model.ckpt-{:.0f}".format(epoch)))

def train(sess, training_set, validation_set=None, initialize=True,
          epochs=20, shuffle=True, dropout=0.5, random_seed=None):
    """Training the model by using `training_set`."""
    X_data = np.array(training_set[0])
    y_data = np.array(training_set[1])
    training_loss = []
    
    ## initialize the variables
    if initialize:
        sess.run(tf.global_variables_initializer())
    
    np.random.seed(random_seed)  # for shuffling in the batch generator
    for epoch in range(1, epochs + 1):
        batch_gen = batch_generator(X_data, y_data, shuffle=shuffle)
        
        avg_loss = 0.0
        for i, (batch_x, batch_y) in  enumerate(batch_gen):
            feed = {"tf_x:0": batch_x,
                    "tf_y:0": batch_y,
                    "fc_keep_prob:0": dropout}
            loss, _ = sess.run(["cross_entropy_loss:0", "train_op"], feed_dict=feed)
            avg_loss += loss
        
        training_loss.append(avg_loss / (i + 1))
        print("Epoch {:>2} Training Avg. Loss: {:7.3f}.".format(epoch, avg_loss), end=" ")
        if validation_set:
            feed = {"tf_x:0": validation_set[0],
                    "tf_y:0": validation_set[1],
                    "fc_keep_prob:0": 1.0}
            valid_acc = sess.run("accuracy:0", feed_dict=feed)
            print(" Validation Acc.: {:7.3f}.".format(valid_acc))
        else:
            print("\n")

def predict(sess, X_test, return_proba=False):
    """Get prediction probabilities or prediciton labels of the test data."""
    feed_dict = {"tf_x:0": X_test, "fc_keep_prob:0": 1.0}
    if return_proba:
        return sess.run("probabilities:0", feed_dict=feed_dict)
    else:
        return sess.run("labels:0", feed_dict=feed_dict)

Now we can create a TensorFlow graph object, set the graph-level random seed, and build the CNN model in that graph, as follows:

In [48]:
## Define the hyperparameters
learning_rate = 1e-4
random_seed = 123

g = tf.Graph()
with g.as_default():
    tf.set_random_seed(random_seed)
    
    ## build the graph
    build_cnn()
    
    ## saver
    saver = tf.train.Saver()


Building the first layer:
<tf.Variable 'conv_1/_weights:0' shape=(5, 5, 1, 32) dtype=float32_ref>
<tf.Variable 'conv_1/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("conv_1/Conv2D:0", shape=(?, 24, 24, 32), dtype=float32)
Tensor("conv_1/net_pre-activation:0", shape=(?, 24, 24, 32), dtype=float32)
Tensor("conv_1/activation:0", shape=(?, 24, 24, 32), dtype=float32)

Building the second layer:
<tf.Variable 'conv_2/_weights:0' shape=(5, 5, 32, 64) dtype=float32_ref>
<tf.Variable 'conv_2/_biases:0' shape=(64,) dtype=float32_ref>
Tensor("conv_2/Conv2D:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("conv_2/net_pre-activation:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("conv_2/activation:0", shape=(?, 8, 8, 64), dtype=float32)

Building the 3rd layer:
<tf.Variable 'fc_3/_weights:0' shape=(1024, 1024) dtype=float32_ref>
<tf.Variable 'fc_3/_biases:0' shape=(1024,) dtype=float32_ref>
Tensor("fc_3/MatMul:0", shape=(?, 1024), dtype=float32)
Tensor("fc_3/net_pre-activation:0", shape=(?, 1024)

Note that in the preceding code, after we built the model by calling the `build_cnn` function, we created a saver object from the `tf.train.Saver` class for saving and restoring trained models.

The next step is to train our CNN model.  For this, we need to create a TensorFlow session to launch the graph; then, we call the `train` function.  To train the model for the first time, we have to initialize all the variables in the network.

For this purpose, we have defined an argument named `initialize` that will take care of the initialization.  When `initialize=True`, we will execute `tf.global_variables_initializer` through `session.run`.  This initialization step should be avoided in case you want to train additional epochs; for example, you can restore an already trained model and train further for additional 10 epochs.  The code for training the model for the first time is as follows:

In [50]:
## Create a TF session and train the CNN model.

with tf.Session(graph=g) as sess:
    train(sess, training_set=(X_train_centered, y_train), validation_set=(X_valid_centered, y_valid),
          initialize=True, random_seed=123)
    save(saver, sess, epoch=20)
    file_writer = tf.summary.FileWriter(logdir="./logs/", graph=g)

Epoch  1 Training Avg. Loss: 278.592.  Validation Acc.:   0.975.
Epoch  2 Training Avg. Loss:  75.561.  Validation Acc.:   0.983.
Epoch  3 Training Avg. Loss:  49.742.  Validation Acc.:   0.987.
Epoch  4 Training Avg. Loss:  39.254.  Validation Acc.:   0.988.
Epoch  5 Training Avg. Loss:  30.857.  Validation Acc.:   0.988.
Epoch  6 Training Avg. Loss:  26.932.  Validation Acc.:   0.990.
Epoch  7 Training Avg. Loss:  22.153.  Validation Acc.:   0.991.
Epoch  8 Training Avg. Loss:  19.955.  Validation Acc.:   0.991.
Epoch  9 Training Avg. Loss:  17.585.  Validation Acc.:   0.992.
Epoch 10 Training Avg. Loss:  14.826.  Validation Acc.:   0.992.
Epoch 11 Training Avg. Loss:  12.878.  Validation Acc.:   0.991.
Epoch 12 Training Avg. Loss:  11.369.  Validation Acc.:   0.992.
Epoch 13 Training Avg. Loss:   9.850.  Validation Acc.:   0.991.
Epoch 14 Training Avg. Loss:   8.994.  Validation Acc.:   0.993.
Epoch 15 Training Avg. Loss:   8.237.  Validation Acc.:   0.992.
Epoch 16 Training Avg. Lo

<img src="images/tensorboard.png" style="width:1000px">

After the 20 epochs are finished, we save the trained model for future use so that we do not have to retrain the model every time, and therefore save computation time.

The following code shows how to restore a saved model.
1. We delete the graph `g`.
2. Create a new graph `g2`.
3. Reload the trained model to do prediction on the test set.

In [39]:
### Calculate prediction accuracy on the test set.
### Restoring the saved model.

del g

### Create a new graph and build the model

g2 = tf.Graph()
with g2.as_default():
    tf.set_random_seed(random_seed)
    # build the graph
    build_cnn()
    
    ## saver
    saver = tf.train.Saver()

## Create a new session and restore the model
with tf.Session(graph=g2) as sess:
    load(saver, sess, epoch=20, path="./model/")
    
    preds = predict(sess, X_test_centered, return_proba=False)
    
    print("Test Accuracy: {:.3f}%.".format(100*np.sum(preds==y_test)/len(y_test)))


Building the first layer:
<tf.Variable 'conv_1/_weights:0' shape=(5, 5, 1, 32) dtype=float32_ref>
<tf.Variable 'conv_1/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("conv_1/Conv2D:0", shape=(?, 24, 24, 32), dtype=float32)
Tensor("conv_1/net_pre-activation:0", shape=(?, 24, 24, 32), dtype=float32)
Tensor("conv_1/activation:0", shape=(?, 24, 24, 32), dtype=float32)

Building the second layer:
<tf.Variable 'conv_2/_weights:0' shape=(5, 5, 32, 64) dtype=float32_ref>
<tf.Variable 'conv_2/_biases:0' shape=(64,) dtype=float32_ref>
Tensor("conv_2/Conv2D:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("conv_2/net_pre-activation:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("conv_2/activation:0", shape=(?, 8, 8, 64), dtype=float32)

Building the 3rd layer:
<tf.Variable 'fc_3/_weights:0' shape=(1024, 1024) dtype=float32_ref>
<tf.Variable 'fc_3/_biases:0' shape=(1024,) dtype=float32_ref>
Tensor("fc_3/MatMul:0", shape=(?, 1024), dtype=float32)
Tensor("fc_3/net_pre-activation:0", shape=(?, 1024)

The output contains several extra lines from the `print` statements in the `build_cnn` function.  As you can see, the prediction accuracy on the test set is already better than what we achieved using the multilayer perceptron in _Chapter 13_.

Now, let us look at the predicted labels as well as their probabilities on the first 10 test samples.

We already have the predictions stored in `preds`; however, in order to have more practice in using the session and launching the graph, we repeat those stesp here:

In [44]:
## run the prediction on some test samples
np.set_printoptions(precision=2, suppress=True)

with tf.Session(graph=g2) as sess:
    load(saver, sess, epoch=20, path="./model/")
    
    print(predict(sess, X_test_centered[:10], return_proba=False))
    
    print(predict(sess, X_test_centered[:10], return_proba=True))

Loading model from ./model/.
INFO:tensorflow:Restoring parameters from ./model/cnn-model.ckpt-20
[7 2 1 0 4 1 4 9 5 9]
[[0.   0.   0.   0.   0.   0.   0.   1.   0.   0.  ]
 [0.   0.   1.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   1.   0.   0.   0.   0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   1.   0.   0.   0.   0.   0.  ]
 [0.   1.   0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   1.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   1.  ]
 [0.   0.   0.   0.   0.   0.99 0.01 0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   1.  ]]


Finally, let us see how we can train the model further to reach a total of 40 epochs.  We can save time by restoring the already trained model and continue training for 20 additional epochs.  This will be very easy to do with our setup.  W need to call the `train` function again, but this time, we set `initialize=False` to avoid the initialization step.

The code is as follows:

In [46]:
## Continue training for 20 more epochs without re-initializing:  initialize=False

## Create a new session and restore the model

with tf.Session(graph=g2) as sess:
    load(saver, sess, epoch=20, path="./model/")
    
    train(sess, training_set=(X_train_centered, y_train), validation_set=(X_valid_centered, y_valid),
          initialize=False, epochs=30, random_seed=123)
    
    save(saver, sess, epoch=50, path="./model/")
    
    preds = predict(sess, X_test_centered, return_proba=False)
    
    print("Test accuracy: {:.3f}%.".format(100*np.sum(preds==y_test)/len(y_test)))

Loading model from ./model/.
INFO:tensorflow:Restoring parameters from ./model/cnn-model.ckpt-20
Epoch  1 Training Avg. Loss:   4.414.  Validation Acc.:   0.991.
Epoch  2 Training Avg. Loss:   4.341.  Validation Acc.:   0.992.
Epoch  3 Training Avg. Loss:   3.634.  Validation Acc.:   0.992.
Epoch  4 Training Avg. Loss:   3.178.  Validation Acc.:   0.992.
Epoch  5 Training Avg. Loss:   2.926.  Validation Acc.:   0.992.
Epoch  6 Training Avg. Loss:   3.488.  Validation Acc.:   0.993.
Epoch  7 Training Avg. Loss:   3.021.  Validation Acc.:   0.991.
Epoch  8 Training Avg. Loss:   1.981.  Validation Acc.:   0.992.
Epoch  9 Training Avg. Loss:   2.652.  Validation Acc.:   0.992.
Epoch 10 Training Avg. Loss:   2.019.  Validation Acc.:   0.992.
Epoch 11 Training Avg. Loss:   1.878.  Validation Acc.:   0.993.
Epoch 12 Training Avg. Loss:   2.480.  Validation Acc.:   0.993.
Epoch 13 Training Avg. Loss:   1.959.  Validation Acc.:   0.993.
Epoch 14 Training Avg. Loss:   2.329.  Validation Acc.:   

The result shows that training for 30 additional epochs slightly improved the performance to get 99.340% prediction accuracy on the test set.

In this section, we saw how to implement a multilayer convolutional neural network in the low-level TensorFlow API.  In the next section, we'll now implement the same network but we'll use the TensorFlow Layers API.

## Implementing a CNN in the TensorFlow Layers API

For the implementation in the TensorFlow Layers API, we need to repeat the same process of loading the data and preprocessing steps to get `X_train_centered`, `X_valid_centered`, and `X_test_centered`:

In [52]:
import os
import struct
import numpy as np

def load_mnist(path, kind="train"):
    """Load the MNIST data from `path`"""
    labels_path = os.path.join(path, "{}-labels-idx1-ubyte".format(kind))
    images_path = os.path.join(path, "{}-images-idx3-ubyte".format(kind))
    
    with open(labels_path, "rb") as lbpath:
        magic, n = struct.unpack(">II", lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    
    with open(images_path, "rb") as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
    
    return images, labels


### Loading the Data ###
X_data, y_data = load_mnist("./", kind="train")
print("Rows: {}, Columns: {}.".format(X_data.shape[0], X_data.shape[1]))

X_test, y_test = load_mnist("./", kind="t10k")
print("Rows: {}, Columns: {}.".format(X_test.shape[0], X_test.shape[1]))

X_train, y_train = X_data[:50000, :], y_data[:50000]
X_valid, y_valid = X_data[50000:, :], y_data[50000:]

print("Training:  {} {}.".format(X_train.shape, y_train.shape))
print("Validation:  {} {}.".format(X_valid.shape, y_valid.shape))
print("Test Set:  {} {}.".format(X_test.shape, y_test.shape))

def batch_generator(X, y, batch_size=64, shuffle=False, random_seed=None):
    
    idx = np.arange(y.shape[0])
    
    if shuffle:
        rng = np.random.RandomState(random_seed)
        rng.shuffle(idx)
        X = X[idx]
        y = y[idx]
    
    for i in range(0, X.shape[0], batch_size):
        yield (X[i:i+batch_size, :], y[i:i+batch_size])

mean_vals = np.mean(X_train, axis=0)
print(mean_vals.shape)
std_val = np.std(X_train)

X_train_centered = (X_train - mean_vals)/std_val
X_valid_centered = (X_valid - mean_vals)/std_val
X_test_centered = (X_test - mean_vals)/std_val

Rows: 60000, Columns: 784.
Rows: 10000, Columns: 784.
Training:  (50000, 784) (50000,).
Validation:  (10000, 784) (10000,).
Test Set:  (10000, 784) (10000,).
(784,)


Finally, we can implement the model in a new class as follows:

In [53]:
import tensorflow as tf
import numpy as np

class ConvNN(object):
    def __init__(self, batchsize=64, epochs=20, learning_rate=1e-4,
                 dropout_rate=0.5, shuffle=True, random_seed=None):
        np.random.seed(random_seed)
        self.batchsize = batchsize
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate
        self.shuffle = shuffle
        
        g = tf.Graph()
        with g.as_default():
            ## set random_seed:
            tf.set_random_seed(random_seed)
            
            ## build the network
            self.build()
            
            ## initializer
            self.init_op = tf.global_variables_initializer()
            
            ## saver
            self.saver = tf.train.Saver()
        
        ## create a session
        self.sess = tf.Session(graph=g)
        
    def build(self):
        
        ## Placeholder for X and y:
        tf_x = tf.placeholder(tf.float32, shape=[None, 784], name="tf_x")
        tf_y = tf.placeholder(tf.int32, shape=[None], name="tf_y")
        is_train = tf.placeholder(tf.bool, shape=(), name="is_train")
        
        ## reshape x to a 4D tensor:
        ## [batchsize, height, width, 1]
        tf_x_image = tf.reshape(tf_x, shape=[-1, 28, 28, 1], name="input_x_2dimages")
        
        ## One-hot encoding:
        tf_y_onehot = tf.one_hot(indices=tf_y, depth=10, dtype=tf.float32, name="input_y_onehot")
        
        ## 1st layer: Conv_1
        h1 = tf.layers.conv2d(tf_x_image, kernel_size=(5, 5), filters=32, activation=tf.nn.relu)
        
        ## Max-Pooling
        h1_pool = tf.layers.max_pooling2d(h1, pool_size=(2,2), strides=(2,2))
        
        ## 2nd layer: Conv_2
        h2 = tf.layers.conv2d(h1_pool, kernel_size=(5,5), filters=64, activation=tf.nn.relu)
        
        ## Max-Pooling
        h2_pool = tf.layers.max_pooling2d(h2, pool_size=(2,2), strides=(2,2))
        
        ## 3rd layer: Fully Connected
        input_shape = h2_pool.get_shape().as_list()
        n_input_units = np.prod(input_shape[1:])
        h2_pool_flat = tf.reshape(h2_pool, shape=[-1, n_input_units])
        h3 = tf.layers.dense(h2_pool_flat, 1024, activation=tf.nn.relu)
        
        ## Dropout
        h3_drop = tf.layers.dropout(h3, rate=self.dropout_rate, training=is_train)
        
        ## 4th layer: Fully Connected (linear activation)
        h4 = tf.layers.dense(h3_drop, 10, activation=None)
        
        ## Prediction
        predictions = {"probabilities": tf.nn.softmax(h4, name="probabilities"),
                       "labels": tf.cast(tf.argmax(h4, axis=1), tf.int32, name="labels")}
        
        ## Loss function and Optimization
        cross_entropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=h4, labels=tf_y_onehot),
                                            name="cross_entropy_loss")
        
        ## Optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        optimizer = optimizer.minimize(cross_entropy_loss, name="train_op")
        
        ## Finding the Accuracy
        correct_predictions = tf.equal(predictions["labels"], tf_y, name="correct_preds")
        accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")
        
    def save(self, epoch, path="./tflayers-model/"):
        if not os.path.isdir(path):
            os.makedirs(path)
        print("Saving model in {}.".format(path))
        self.saver.save(self.sess, os.path.join(path, "model.ckpt"), global_step=epoch)
    
    def load(self, epoch, path):
        print("Loading the model from {}.".format(path))
        self.saver.restore(self.sess, os.path.join(path, "model.ckpt-{:.0f}".format(epoch)))
    
    def train(self, training_set, validation_set=None, initialize=True):
        ## initialize the variables
        if initialize:
            self.sess.run(self.init_op)
        
        self.train_cost_ = []
        X_data = np.array(training_set[0])
        y_data = np.array(training_set[1])
        
        for epoch in range(1, self.epochs+1):
            batch_gen = batch_generator(X_data, y_data, shuffle=self.shuffle)
            avg_loss = 0.0
            for i, (batch_x, batch_y) in enumerate(batch_gen):
                feed = {"tf_x:0": batch_x, "tf_y:0": batch_y, "is_train:0": True}  # for the dropout
                loss, _ = self.sess.run(["cross_entropy_loss:0", "train_op"], feed_dict=feed)
                avg_loss += loss
            
            print("Epoch {:>2}: Training Average Loss: {:7.3f}.".format(epoch, avg_loss), end=" ")
            if validation_set:
                feed = {"tf_x:0": batch_x, "tf_y:0": batch_y, "is_train:0": False}  # for the dropout
                valid_acc = self.sess.run("accuracy:0", feed_dict=feed)
                print("Validation Accuracy: {:7.3f}.".format(valid_acc))
            else:
                print("\n")
    
    def predict(self, X_test, return_proba=False):
        feed = {"tf_x:0": X_test, "is_train:0": False}  # for the dropout
        if return_proba:
            return self.ses.run("probabilities:0", feed_dict=feed)
        else:
            return self.sess.run("labels:0", feed_dict=feed)

The structure of this class is very similar to the previous section with the low-level TensorFlow API.  The class has a constructor that sets the training parameters, creates a graph `g`, and builds the model.  Besides the constructor, there are five major methods:
1. `.build`:  Builds the model.
2. `.save`:  To save the trained model.
3. `.load`:  To restore a saved model.
4. `.train`:  Trains the model.
5. `.predict`:  To do prediction on a test set.

Similar to the implementation in the previous section, we have used a dropout layer after the first fully connected layer.  In the previous implementation that used the low-level TensorFlow API, we used the `tf.nn.dropout` function, but here we used `tf.layers.dropout`, which is a wrapper for the `tf.nn.dropout` function.  There are 2 major differences between these two functions that we need to be careful about:
1. `tf.nn.dropout`:  This has an argument called `keep_prob` that indicates the probability of keeping the units, while `tf.layers.dropout` has a `rate` parameter, which is the rate of dropping units---therefore `rate = 1 - keep_prob`.
2. In the `tf.nn.dropout` function, we fed the `keep_prob` parameter using a placeholder so that that during the training, we will use `keep_prob=0.5`.  Then, during the inference (or prediction) mode, we used `keep_prob=1`.  However, in `tf.layers.dropout`, the value of `rate` is provided upon the creation of the dropout layer in the graph, and we cannot change it during the training or the inference modes.  Instead, we need to provide a Boolean argument called `training` to determine whether we need to apply dropout or not.  This can be done using a placeholder of type `tf.bool`, which we will feed with the value `True` during the training mode and `False` during the inference mode.

We can create an instance of the `ConvNN` class, train it for 20 epochs, and save the model.  The code for this is as follows:

In [54]:
cnn = ConvNN(random_seed=123)

## train the model
cnn.train(training_set=(X_train_centered, y_train), validation_set=(X_valid_centered, y_valid), initialize=True)
cnn.save(epoch=20)

Epoch  1: Training Average Loss: 272.391. Validation Accuracy:   1.000.
Epoch  2: Training Average Loss:  74.064. Validation Accuracy:   1.000.
Epoch  3: Training Average Loss:  51.987. Validation Accuracy:   1.000.
Epoch  4: Training Average Loss:  39.056. Validation Accuracy:   1.000.
Epoch  5: Training Average Loss:  32.421. Validation Accuracy:   0.938.
Epoch  6: Training Average Loss:  26.798. Validation Accuracy:   1.000.
Epoch  7: Training Average Loss:  23.933. Validation Accuracy:   1.000.
Epoch  8: Training Average Loss:  19.666. Validation Accuracy:   1.000.
Epoch  9: Training Average Loss:  17.833. Validation Accuracy:   1.000.
Epoch 10: Training Average Loss:  14.502. Validation Accuracy:   1.000.
Epoch 11: Training Average Loss:  12.899. Validation Accuracy:   1.000.
Epoch 12: Training Average Loss:  11.653. Validation Accuracy:   1.000.
Epoch 13: Training Average Loss:  10.141. Validation Accuracy:   1.000.
Epoch 14: Training Average Loss:   9.552. Validation Accuracy:  

After the training is finished, the model can be used to do prediction on the test dataset as follows:

In [55]:
del cnn

cnn2 = ConvNN(random_seed=123)
cnn2.load(epoch=20, path="./tflayers-model/")
print(cnn2.predict(X_test_centered[:10, :]))

Loading the model from ./tflayers-model/.
INFO:tensorflow:Restoring parameters from ./tflayers-model/model.ckpt-20
[7 2 1 0 4 1 4 9 5 9]


Finally, we can measure the accuracy of the test dataset as follows:

In [56]:
preds = cnn2.predict(X_test_centered)

print("Test Accuracy: {:.2f}%.".format(100*np.sum(y_test == preds)/len(y_test)))

Test Accuracy: 99.23%.


This concludes our discussion on implementing convolution neural networks using the TensorFlow low-level API and TensorFlow Layers API.  We defined some wrapper functions for the first implementation using the low-level API.  The second implementation was more straightforward since we could use the `tf.layers.conv2d` and `tf.layers.dense` functions to build the convolutional and the fully connected layers.

## Summary:

In this chapter, we learned about CNNs, or convolutional neural networks, and explored the building blocks that form different CNN architectures.  We started by defining the convolution operation, then we learned about its fundamentals by discussing 1D as well as 2D implementations.

We also covered subsampling by discussing two forms of pooling operations: max-pooling and average-pooling.  Then, putting all these blocks together, we built a deep convolutional neural network and implemented it using the TensorFlow core API as well as the TensorFlow Layers API to apply CNNs for image classification.

In the next capter, we'll move on to Recurrent Neural Networks (RNN).  RNNs are used for learning the structure of sequence data, and they have some fascinating applications, including language translation and image captioning.