# Implementing convolutional and recurrent neural networks
By Brett Naul (UC Berkeley)

In this exercise we'll explore how to implement basic convolutional and recurrent neural networks for classifying image and sequence data. The networks we'll see here differ from state-of-the-art classification techniques mostly in scale: we'll be training smaller networks on small datasets, whereas more powerful classifiers are much deeper, contain more complicated connections between layers, and are trained using enormous quantities of data.

*Based on various notebooks from the [`keras` examples](https://github.com/fchollet/keras/tree/master/examples) repository.*

In [None]:
# Imports / plotting configuration
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('poster')
plt.rcParams['image.cmap'] = 'viridis'
np.random.seed(13)

In [None]:
import json
import os

# Change this to `'theano'` if you prefer
backend = 'tensorflow'
config = {'image_dim_ordering': 'tf', 'epsilon': 1e-07, 
          'floatx': 'float32', 'backend': backend}
!mkdir -p ~/.keras
with open(os.path.expanduser('~/.keras/keras.json'), 'w') as f:
    json.dump(config, f)

In [None]:
!pip install -q keras-tqdm  # Install Jupyter-friendly progress bar 

## Part 1: Fully-connected network for digit recognition
Handwritten digit recognition is one of the most famous neural network applications; even before the recent advances in "deep learning," neural networks were able to achieve excellent performance for this problem and have been used in real-world applications for many years.

The canonical digit recognition example uses the so-called MNIST dataset, which consists of 60,000 training and 10,000 test examples. Each example is a 28x28 grayscale image of a handwritten digit. First we'll load the MNIST dataset and visualize the first few training examples (the download may take a minute or two the first time).

In [None]:
from keras.datasets import mnist
from keras.utils import np_utils

nrow = 28; ncol = 28; nb_classes = 10  # MNIST data parameters
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.astype('float32'); X_test = X_test.astype('float32')  # int -> float
X_train = X_train.reshape(-1, nrow * ncol); X_test = X_test.reshape(-1, nrow * ncol)  # flatten
X_train /= 255; X_test /= 255  # normalize pixels to between 0 and 1

# convert class vectors to binary class matrices (i.e., one-hot encoding)
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

fig, ax = plt.subplots(2, 5, figsize=(15, 8))
for i in range(10):
    plt.sca(ax.ravel()[i])
    plt.imshow(X_train[i].reshape(nrow, ncol))
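
To make the one-hot encoding step concrete, here is a minimal NumPy sketch of the kind of matrix `to_categorical` produces. The helper name `one_hot` and the sample labels are made up for this illustration and are not part of `keras`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype='float32')
    encoded[np.arange(labels.shape[0]), labels] = 1.0  # mark the true class
    return encoded

example_labels = [5, 0, 4]  # made-up digit labels
example_onehot = one_hot(example_labels, 10)
print(example_onehot)
```

Taking the `argmax` of each row recovers the original integer label, which is handy when comparing predictions against `y_test`.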

We'll start with our old friend the single-layer perceptron that we implemented in the "Basic Neural Network Exercise." The perceptron consists of a single fully-connected (a.k.a. dense) layer with some activation function, plus an output that we pass to the softmax function. An example of a `keras` implementation of such a network is given below; you'll want to use this as the template for the rest of your models in this exercise.

The steps below should mostly be self-explanatory, but here's a quick breakdown:
- `Sequential` is a `keras` class that allows us to add new layers one at a time
- `Dense` is a fully-connected layer; the input size (`input_dim` here, or equivalently `input_shape`) only needs to be given for the first layer
- `Activation` passes the output of the previous layer through a specified activation function
- `compile` prepares the underlying `tensorflow` or `theano` graph corresponding to the requested model
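- `rmsprop` is a variant of gradient descent that rescales each parameter's step using a running average of its recent squared gradients, which usually converges more quickly; vanilla gradient descent is rarely used for training neural networks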
- "Categorical crossentropy" is the same standard loss function we used in the previous example for evaluating predicted class probabilities
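
The last two steps of the network (softmax output and categorical crossentropy loss) can be sketched in a few lines of NumPy. This is only an illustration on made-up numbers; the function names and the clipping constant `eps` are assumptions of this sketch rather than the `keras` internals.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)  # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Two made-up examples with three classes each
logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = softmax(logits)
loss = categorical_crossentropy(targets, probs)
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the correct class approaches one, which is what training tries to achieve.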

In [None]:
# Define model architecture
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(8, input_dim=nrow * ncol))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()

In [None]:
# Fit model to training data and check accuracy
from keras_tqdm import TQDMNotebookCallback

batch_size = 128
nb_epoch = 20

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))
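
For intuition about what the `rmsprop` optimizer does, here is a rough NumPy sketch of the RMSprop update rule applied to a toy quadratic problem. The function names, the toy loss, and the hyperparameter values are all chosen for this illustration and are not the `keras` defaults or implementation.

```python
import numpy as np

def rmsprop_minimize(grad_fn, w0, lr=0.05, rho=0.9, eps=1e-7, steps=500):
    """Minimize a function given its gradient, using the RMSprop update."""
    w = np.array(w0, dtype=float)
    mean_sq = np.zeros_like(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        mean_sq = rho * mean_sq + (1 - rho) * g ** 2
        w -= lr * g / (np.sqrt(mean_sq) + eps)  # per-parameter step size
    return w

curvature = np.array([1.0, 100.0])  # toy problem: one flat and one steep direction

def loss_fn(w):
    return float(np.sum(curvature * w ** 2))

def grad_fn(w):
    return 2 * curvature * w

w_start = np.array([5.0, -3.0])
w_final = rmsprop_minimize(grad_fn, w_start)
print(loss_fn(w_start), loss_fn(w_final))
```

Because each gradient is divided by its own running magnitude, the flat and steep directions move at similar speeds; plain gradient descent would need a step size small enough for the steep direction and would then crawl along the flat one.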

### Part 1a: Fit model and examine predictions
Look through the above code carefully and make sure you understand each step. Identify a couple of test cases that are incorrectly classified by our model; the `model.predict_classes` function will be useful (`model.predict` in this case is more like `scikit-learn`'s `predict_proba`). Are these examples particularly difficult, or is our model underperforming?

In [None]:
pred_test = model.predict_classes(X_test, verbose=0)
misclassified = y_test != pred_test

fig, ax = plt.subplots(2, 5, figsize=(15, 8))
for i in range(10):
    plt.sca(ax.ravel()[i])
    plt.imshow(X_test[misclassified][i].reshape(nrow, ncol))
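
The relationship between predicted probabilities and predicted classes is just an `argmax` over each row. The sketch below uses made-up probabilities in place of real `model.predict` output; it may also be useful because `predict_classes` is not available in every `keras` release.

```python
import numpy as np

# Made-up class probabilities for three inputs and four classes,
# standing in for the output of `model.predict`.
probs = np.array([[0.1, 0.7, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.2, 0.2, 0.5, 0.1]])
true_labels = np.array([1, 2, 2])  # hypothetical ground truth

pred_labels = probs.argmax(axis=1)  # most probable class per row
wrong_idx = np.flatnonzero(pred_labels != true_labels)  # misclassified rows
print(pred_labels, wrong_idx)
```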

### Part 1b: Multi-layer classifier
Write a function that takes parameters `hidden_size` and `num_layers` and returns a classifier like the one above but with multiple fully-connected layers. Compare the performance of your multi-layer classifier with that of the single-layer network above. Also, check the output of `model.summary()` and see how the number of parameters varies with `hidden_size` and `num_layers`.

In [None]:
def fully_connected(hidden_size, num_layers):
    model = Sequential()
    for i in range(num_layers):
        model.add(Dense(hidden_size, input_dim=nrow * ncol if i == 0 else None))
        model.add(Activation('relu'))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    return model

model = fully_connected(64, 3)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))
model.summary()
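
To check the parameter counts reported by `model.summary()`, we can count weights and biases by hand. The helper below is a hypothetical convenience function written for this exercise; it assumes every hidden layer has the same width, as in `fully_connected` above.

```python
def dense_param_count(input_dim, hidden_size, num_layers, num_classes):
    """Count weights and biases in a stack of fully-connected layers."""
    sizes = [input_dim] + [hidden_size] * num_layers + [num_classes]
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

single_layer = dense_param_count(28 * 28, 8, 1, 10)
three_layers = dense_param_count(28 * 28, 64, 3, 10)
print(single_layer, three_layers)
```

Most of the parameters sit in the first layer, since that is where the 784-dimensional input is connected to the hidden units; adding further hidden layers of the same width is comparatively cheap.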

## Part 2: Convolutional network
In the above example, we treated our images as one-dimensional vectors and input them to a simple feed-forward network. This has the major disadvantage of ignoring the local structure of the image: each pixel is considered separately from its surroundings. Convolutional networks, on the other hand, apply a number of filters to small regions of each image; these filters are trained to recognize common shapes that appear in the image and are useful for distinguishing classes. Here we'll train a basic convolutional network to perform the same image classification task.
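
To see what "applying a filter to small regions" means, here is a bare-bones NumPy sketch of a single unpadded filter sliding over a single-channel image. The function name, the toy image, and the hand-written edge filter are assumptions of this sketch; a real convolutional layer learns its filters and applies many of them at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            patch = image[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0
# Hand-written filter that responds to dark-to-bright vertical edges
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)
response = conv2d_valid(image, edge_filter)
print(response)
```

The response is nonzero only in the columns where the filter straddles the edge, which is the sense in which a filter "recognizes" a local shape.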

In [None]:
# First we'll reshape the data back into two-dimensional form
X_train = X_train.reshape(X_train.shape[0], nrow, ncol, 1)
X_test = X_test.reshape(X_test.shape[0], nrow, ncol, 1)
input_shape = (nrow, ncol, 1)  # only 1 channel since the images are grayscale

### Part 2a: Simple convolutional network
First, implement a simple convolutional network consisting of a single convolutional layer, a single dense ReLU layer, and a softmax output. The structure should look basically like the single-layer network from earlier, but with a couple of additions.

A couple of hints:
- The `Conv2D` layer takes the number of filters followed by the filter dimensions (`nb_filter`, `nb_row`, and `nb_col` in Keras 1; `filters` and `kernel_size` in Keras 2); try a small number (e.g. 12) of small (e.g. 3 x 3) filters to start.
- Since a convolutional layer expects 2D inputs and a fully-connected layer expects 1D inputs, you'll want to add a `Flatten()` layer in between; this is just like the flattening we performed on the MNIST inputs themselves before passing them to our fully-connected network, except applied to the intermediate values of our network.
- This is a more computationally-intensive training procedure, so try reducing `nb_epoch` to something like 5.

In [None]:
from keras.layers import Conv2D, Flatten

model = Sequential()
model.add(Conv2D(12, 3, 3, input_shape=input_shape))
model.add(Flatten())
model.add(Dense(8))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

nb_epoch = 5
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))

### Part 2b: Deep-ish convolutional network
The above network is slow in part because of the connection between the convolutional and fully-connected layers: the convolutional layer's weights consist of just a few small filters, but its output is large (roughly 28x28x{# filters}), so the following fully-connected layer needs roughly 28x28x{# filters}x{hidden layer size} parameters; in other words, a lot. For this reason, it's common to insert a pooling layer after a convolutional layer, which reduces the output size considerably. The most common type of pooling is max pooling, primarily because of its simplicity. For more about pooling see, e.g., the [CS 231n lecture notes](http://cs231n.github.io/convolutional-networks/#pool) about convolutional architectures.
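
The arithmetic behind this can be checked with a short sketch. The helper names are hypothetical, and the calculation assumes an unpadded convolution with stride 1 followed by non-overlapping pooling, with the sizes taken from the simple network in Part 2a (12 filters of size 3 x 3 and a hidden layer of 8 units).

```python
def conv_output_size(size, kernel, pool=1):
    """Spatial size after an unpadded convolution and optional pooling."""
    return (size - kernel + 1) // pool

def conv_dense_params(size, kernel, n_filters, hidden, pool=1, channels=1):
    """Return (convolution parameters, following dense-layer parameters)."""
    conv_params = kernel * kernel * channels * n_filters + n_filters
    side = conv_output_size(size, kernel, pool)
    flat = side * side * n_filters  # length of the flattened feature maps
    dense_params = flat * hidden + hidden
    return conv_params, dense_params

no_pool = conv_dense_params(28, 3, 12, 8)
with_pool = conv_dense_params(28, 3, 12, 8, pool=2)
print(no_pool, with_pool)
```

The convolutional layer itself has only 120 parameters, while the dense layer that follows it has tens of thousands; 2 x 2 pooling cuts the dense layer's size by roughly a factor of four.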

Extend the network above by first replacing the first dense layer with a max pooling layer; the `MaxPooling2D` layer takes a two-element tuple `pool_size`, which controls the size of the regions that are pooled together. Experiment with this parameter and see how the training time changes; then try adding additional convolutional+pooling layers to form a deeper network. Check the output of `model.summary()` to see how the number of parameters changes depending on whether you apply pooling after a convolutional layer.
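
As a rough picture of what a max pooling layer computes, here is a NumPy sketch for a single feature map. It assumes non-overlapping windows and drops any rows or columns that do not fill a complete window; the function name and toy input are made up for this example.

```python
import numpy as np

def max_pool2d(feature_map, pool_size):
    """Non-overlapping max pooling over a 2D array."""
    p_rows, p_cols = pool_size
    out_rows = feature_map.shape[0] // p_rows
    out_cols = feature_map.shape[1] // p_cols
    # Drop leftover rows/columns that do not fill a complete window
    trimmed = feature_map[:out_rows * p_rows, :out_cols * p_cols]
    windows = trimmed.reshape(out_rows, p_rows, out_cols, p_cols)
    return windows.max(axis=(1, 3))  # largest value in each window

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(feature_map, (2, 2))
print(pooled)
```

Under these assumptions, a 26 x 26 feature map pooled with `(3, 3)` windows shrinks to 8 x 8, so larger values of `pool_size` reduce the size of the following layers quickly.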

In [None]:
from keras.layers import MaxPooling2D

model = Sequential()
model.add(Conv2D(32, 3, 3, input_shape=input_shape))
model.add(MaxPooling2D((3, 3)))
model.add(Conv2D(32, 3, 3))
model.add(MaxPooling2D((3, 3)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

nb_epoch = 5
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))

### Part 2c: Regularization
Because of the extremely high number of parameters contained in deep convolutional networks, the chances of overfitting your model to the training data are rather high. The term "regularization" is something of a catch-all for techniques that try to mitigate this overfitting behavior. Here we'll try adding a couple of approaches to our existing network: dropout and weight penalization.

Dropout is a technique that randomly sets some fraction of neuron activations to zero during training; the idea is that the network will learn to cope with this by developing multiple different representations for the various patterns in the data, and thereby increase its robustness to unseen data. For validation/test data, dropout is then disabled to generate the best predictions possible.
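
The mechanics of dropout fit in a few lines of NumPy. This sketch uses the common "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; the function signature and the toy activations are assumptions of this example.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random subset and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # test time: pass everything through unchanged
    keep_mask = rng.random_sample(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.RandomState(0)
activations = np.ones((2, 1000))  # toy activations, all equal to 1
train_out = dropout(activations, 0.25, training=True, rng=rng)
test_out = dropout(activations, 0.25, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

During training roughly a quarter of the values are zeroed and the rest are scaled up by `1 / 0.75`, so the average activation stays about the same in both modes.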

Weight penalization is a standard idea from linear regression (there it's called ridge regression/LASSO/elastic net, among other things). For each set of weights, a penalty that grows with the magnitude of the coefficients is added to the loss function (the sum of their squares in the $\ell_2$ case); the result is that the network prefers a combination of many smaller weights rather than some extremely large weights, which again (hopefully) should lead to a more robust model.
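
Here is a small sketch of how an $\ell_2$ penalty changes the loss, using made-up weight arrays and a made-up data loss. The function name and the `strength` value are illustrative assumptions.

```python
import numpy as np

def penalized_loss(data_loss, weight_arrays, strength):
    """Add an L2 (ridge-style) penalty on every weight array to the data loss."""
    penalty = sum(float(np.sum(w ** 2)) for w in weight_arrays)
    return data_loss + strength * penalty

# Two weight settings whose entries sum to the same total (2.0)
small_weights = [np.full((2, 2), 0.5)]                 # four weights of 0.5
large_weights = [np.array([[2.0, 0.0], [0.0, 0.0]])]   # one weight of 2.0

loss_small = penalized_loss(1.0, small_weights, strength=0.01)
loss_large = penalized_loss(1.0, large_weights, strength=0.01)
print(loss_small, loss_large)
```

With the same data loss, the setting that spreads the total across several small weights is penalized less than the one that concentrates it in a single large weight.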

Extend your network by adding dropout layers between the convolutional layers, and/or by adding $\ell_2$ regularization to the weights in an existing layer. Does this help reduce the gap between the training and test error? Dropout in `keras` is simply another layer; regularization can be passed into an existing layer with the keyword argument `W_regularizer=l2({some small value})`.

In [None]:
from keras.layers import Dropout
from keras.regularizers import l2

model = Sequential()
model.add(Conv2D(32, 3, 3, input_shape=input_shape))
model.add(MaxPooling2D((3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(32, 3, 3))
model.add(MaxPooling2D((3, 3)))
model.add(Dropout(0.25))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

nb_epoch = 5
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))

## Part 3: Recurrent neural network
Recurrent neural networks are typically used to process sequence data, such as text data, time series, etc. In this case, we'll treat each image as a one-dimensional sequence of pixels and process them sequentially using recurrent layers. This is generally not the preferred way to handle image data as it somewhat distorts the spatial structure, but as we'll see the results aren't too bad.

In [None]:
# First, reshape the data into one-dimensional sequences
X_train = X_train.reshape(X_train.shape[0], -1, 1)
X_test = X_test.reshape(X_test.shape[0], -1, 1)
input_shape = X_train.shape[1:]

The most popular type of recurrent cell is called a "Long Short-Term Memory" cell, or LSTM. Follow the same structure as above and implement a simple recurrent neural network classifier using the `LSTM` layer. 

- Note that a recurrent cell takes sequences as inputs, but can output either sequences (i.e. a new value each time it processes a value from the input), or individual values (i.e. only one value after the entire sequence is processed). In `keras` this is controlled using the `return_sequences` keyword argument.
- Recurrent networks generally use different activations than convolutional networks; you can omit the `Activation` layers here and just use the `LSTM` cell's default, which is tanh.
- Training LSTMs is computationally-intensive so for now keep your hidden layer sizes small and only use the first 250 training examples; this won't be enough to train an effective model but you'll at least see the steps needed to do so given more time/computing power.
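
To illustrate what `return_sequences` controls, here is a simplified NumPy sketch of a single LSTM layer processing one sequence. The weight layout, gate ordering, and random initialization are assumptions of this sketch and are not meant to mirror the `keras` implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b, return_sequences=False):
    """Run one LSTM layer over `inputs` of shape (timesteps, input_dim).

    W: (input_dim, 4 * hidden), U: (hidden, 4 * hidden), b: (4 * hidden,),
    with the four blocks ordered as input, forget, candidate, output.
    """
    hidden = U.shape[0]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state
    outputs = []
    for x_t in inputs:
        z = np.dot(x_t, W) + np.dot(h, U) + b
        i = sigmoid(z[:hidden])                # input gate
        f = sigmoid(z[hidden:2 * hidden])      # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden])  # candidate cell values
        o = sigmoid(z[3 * hidden:])            # output gate
        c = f * c + i * g                      # update cell state
        h = o * np.tanh(c)                     # new hidden state
        outputs.append(h)
    return np.stack(outputs) if return_sequences else h

rng = np.random.RandomState(0)
timesteps, input_dim, hidden = 784, 1, 16
W = rng.normal(scale=0.1, size=(input_dim, 4 * hidden))
U = rng.normal(scale=0.1, size=(hidden, 4 * hidden))
b = np.zeros(4 * hidden)
pixels = rng.uniform(size=(timesteps, input_dim))  # stand-in for one flattened image

all_states = lstm_forward(pixels, W, U, b, return_sequences=True)
last_state = lstm_forward(pixels, W, U, b, return_sequences=False)
n_params = W.size + U.size + b.size
print(all_states.shape, last_state.shape, n_params)
```

With `return_sequences=True` the layer emits one hidden state per pixel, which is what a second recurrent layer needs as input; with `return_sequences=False` only the final state is kept, which is what a `Dense` output layer expects.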

In [None]:
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(16, input_shape=input_shape, return_sequences=True))
model.add(LSTM(16, return_sequences=False))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

nb_epoch = 5
history = model.fit(X_train[:250], Y_train[:250],
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    validation_data=(X_test, Y_test),
                    verbose=0, callbacks=[TQDMNotebookCallback()])
score, accuracy = model.evaluate(X_test, Y_test, verbose=0)
print('Test score: {}; test accuracy: {}'.format(score, accuracy))

## Part 4: CIFAR10 dataset (optional)
Another standard neural network test case is the CIFAR10 image dataset, which consists of 60,000 32x32 color images from 10 classes. This dataset is available in `keras` as `keras.datasets.cifar10`. Use one of the above network architectures and see how it performs on this (more difficult) problem.

In [None]:
# Here's a larger network from the Keras examples page that performs decently on CIFAR10
# https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras_tqdm import TQDMNotebookCallback

batch_size = 32
nb_classes = 10
nb_epoch = 200

# input image dimensions
img_rows, img_cols = 32, 32
# The CIFAR10 images are RGB.
img_channels = 3

# The data, shuffled and split between train and test sets:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Let's train the model using RMSprop
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

model.fit(X_train, Y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, Y_test),
          shuffle=True, verbose=0,
          callbacks=[TQDMNotebookCallback()])