# Highway networks
**A quick example of how to implement [highway networks](http://arxiv.org/abs/1505.00387) in Lasagne.**

## What's a highway network?
The paper linked above introduces a new type of neural network layer, which works roughly as follows.

Let $x$ be the input to a layer. Then a typical neural network layer computes some nonlinear transform of this input $y = H(x)$.

A highway layer also computes an additional nonlinear transform $T(x)$, which in practice is constrained to the interval $[0, 1]$. The output of the layer is then $y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$, where the multiplication is elementwise.

In other words, **depending on the gate values $T(x)$, the layer behaves as a traditional layer would ($T(x) = 1$), or passes its input through unchanged ($T(x) = 0$)**. This idea is inspired by the gates in LSTM units. According to the authors, **it enables gradient descent-based training of much deeper networks** with as many as 900 layers.
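To make the two extremes concrete, here is a minimal NumPy sketch of the gating formula. The helper name `highway_output` and the stand-in for $H(x)$ (a plain ReLU applied directly to $x$) are assumptions made for illustration; they are not taken from the paper or from Lasagne.

```python
import numpy as np

def highway_output(x, h, t):
    """Mix the transform h = H(x) with the input x using gate values t = T(x)."""
    return t * h + (1.0 - t) * x

x = np.array([1.0, -2.0, 3.0])
h = np.maximum(x, 0.0)  # hypothetical H(x): a ReLU with identity weights

y_open = highway_output(x, h, np.ones_like(x))        # T(x) = 1: regular layer, y == H(x)
y_closed = highway_output(x, h, np.zeros_like(x))     # T(x) = 0: input passes through, y == x
y_mixed = highway_output(x, h, np.full_like(x, 0.5))  # T(x) = 0.5: halfway between the two
```

With the gate fully open the output equals `h`, with the gate fully closed it equals `x`, and anything in between is an elementwise interpolation.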

Note that a highway layer needs to have as many outputs as inputs: $x$, $H(x)$ and $T(x)$ must all have the same shape. To change the dimensionality in a highway network, the authors suggest inserting a traditional neural network layer.
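The shape constraint can be sketched in NumPy as follows. This is only an illustration under assumed sizes (784 inputs projected down to 40 units) and with hypothetical names such as `dense_highway` and `W_proj`; the random weights are placeholders, not a recommended initialization.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_highway(x, W_h, b_h, W_t, b_t):
    """Dense highway layer: W_h and W_t must be square so that H(x), T(x) and x match."""
    h = np.maximum(x.dot(W_h) + b_h, 0.0)  # H(x), here with a ReLU nonlinearity
    t = sigmoid(x.dot(W_t) + b_t)          # T(x), constrained to [0, 1]
    return t * h + (1.0 - t) * x

batch, n_in, n_hidden = 4, 784, 40
x = rng.rand(batch, n_in)

# A traditional dense layer changes the dimensionality from 784 to 40 ...
W_proj = rng.randn(n_in, n_hidden) * 0.01
x_proj = np.maximum(x.dot(W_proj), 0.0)

# ... and the highway layer that follows keeps it at 40.
W_h = rng.randn(n_hidden, n_hidden) * 0.1
W_t = rng.randn(n_hidden, n_hidden) * 0.1
y = dense_highway(x_proj, W_h, np.zeros(n_hidden), W_t, np.zeros(n_hidden))
```

Feeding `x` directly into `dense_highway` would fail, because the 40-dimensional `h` and `t` could not be combined with the 784-dimensional input.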

## Purpose
I had read the paper and wanted to try it out. I figured this would be a good way of showing how to implement a new concept or idea in Lasagne. **This use case of trying out new ideas and implementing new types of layers is extremely important to the Lasagne development team, and we are trying to make it as easy as possible to use Lasagne in this way.**

## Approach
I decided to implement this idea in two steps. First, I added a `MultiplicativeGatingLayer`, which performs the following operation: $y(t, x_1, x_2) = t \cdot x_1 + (1 - t) \cdot x_2$. In other words, the first input $t$ multiplicatively gates between the other two inputs, $x_1$ and $x_2$.

This then makes it possible to use any layer we like for computing $t$ and $x_1$ (and $x_2$ is taken to be the output of the previous layer). I implemented two "macro functions" on top of this: `highway_dense` and `highway_conv2d`. They create fully connected and 2D convolutional highway layers respectively.

This two-step approach allows for some code reuse and easy implementation of different types of highway layers.

In [1]:
import numpy as np
import theano
import theano.tensor as T
import lasagne as nn

Using gpu device 0: GeForce GTX 980


The MNIST dataset (15MB) can be downloaded with:

In [2]:
!wget -N http://deeplearning.net/data/mnist/mnist.pkl.gz

--2015-08-22 23:00:18--  http://deeplearning.net/data/mnist/mnist.pkl.gz
Resolving deeplearning.net (deeplearning.net)... 132.204.26.28
Connecting to deeplearning.net (deeplearning.net)|132.204.26.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16168813 (15M) [application/x-gzip]
Server file no newer than local file ‘mnist.pkl.gz’ -- not retrieving.



In [3]:
import gzip
import sys

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3: cPickle no longer exists

PY2 = sys.version_info[0] == 2 # check if we're running Python 2 or 3
# we need to know this because unpickling is slightly different in both cases

if PY2:
    def pickle_load(f, encoding):
        return pickle.load(f)
else:
    def pickle_load(f, encoding):
        return pickle.load(f, encoding=encoding)

def load_data():
    """Get data with labels, split into training, validation and test set."""
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        data = pickle_load(f, encoding='latin-1')
    X_train, y_train = data[0]
    X_valid, y_valid = data[1]
    X_test, y_test = data[2]

    return dict(
        X_train=theano.shared(nn.utils.floatX(X_train)),
        y_train=T.cast(theano.shared(y_train), 'int32'),
        X_valid=theano.shared(nn.utils.floatX(X_valid)),
        y_valid=T.cast(theano.shared(y_valid), 'int32'),
        X_test=theano.shared(nn.utils.floatX(X_test)),
        y_test=T.cast(theano.shared(y_test), 'int32'),
        num_examples_train=X_train.shape[0],
        num_examples_valid=X_valid.shape[0],
        num_examples_test=X_test.shape[0],
        input_dim=X_train.shape[1],
        output_dim=10,
    )

**First, create a custom layer class for the multiplicative gating operation.** This is a layer with multiple input layers, three to be precise: $x$, $H(x)$ and $T(x)$. In Lasagne, this means it needs to inherit from `MergeLayer`.

In [4]:
class MultiplicativeGatingLayer(nn.layers.MergeLayer):
    """
    Generic layer that combines its 3 inputs t, h1, h2 as follows:
    y = t * h1 + (1 - t) * h2
    """
    def __init__(self, gate, input1, input2, **kwargs):
        incomings = [gate, input1, input2]
        super(MultiplicativeGatingLayer, self).__init__(incomings, **kwargs)
        assert gate.output_shape == input1.output_shape == input2.output_shape
    
    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]
    
    def get_output_for(self, inputs, **kwargs):
        return inputs[0] * inputs[1] + (1 - inputs[0]) * inputs[2]

**Now we can define a macro function to create a dense highway layer.** Note that it does not take a `num_units` input argument: the number of outputs should always be the same as the number of inputs, so it is redundant.

We initialize the biases of the gates to `-4.0` to disable all of them initially. This means all layers will basically pass through the inputs (and gradients) unchanged at the start of training.
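As a quick sanity check on that choice, the sketch below evaluates the sigmoid at `-4.0`, which is the gate value whenever the weighted input to the gate is zero. The zero pre-activation and the example vectors are assumptions chosen to keep the arithmetic simple; with non-zero weights the actual gate values will vary around this number.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gate value when the weighted input is zero, so only the bias contributes.
gate_at_init = sigmoid(-4.0)

x = np.array([1.0, 2.0, 3.0])
h = np.zeros_like(x)  # hypothetical H(x), picked to differ clearly from x
y = gate_at_init * h + (1.0 - gate_at_init) * x

print("gate value: %.4f" % gate_at_init)
print("fraction of x carried through: %.4f" % (1.0 - gate_at_init))
```

The gate sits at roughly 0.018, so about 98% of the input is carried through even when $H(x)$ is very different from $x$.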

In [5]:
def highway_dense(incoming, Wh=nn.init.Orthogonal(), bh=nn.init.Constant(0.0),
                  Wt=nn.init.Orthogonal(), bt=nn.init.Constant(-4.0),
                  nonlinearity=nn.nonlinearities.rectify, **kwargs):
    num_inputs = int(np.prod(incoming.output_shape[1:]))
    # regular layer
    l_h = nn.layers.DenseLayer(incoming, num_units=num_inputs, W=Wh, b=bh,
                               nonlinearity=nonlinearity)
    # gate layer
    l_t = nn.layers.DenseLayer(incoming, num_units=num_inputs, W=Wt, b=bt,
                               nonlinearity=T.nnet.sigmoid)
    
    return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming)

**We can easily do the same for a 2D convolution highway layer.** As mentioned in the paper, we need to use 'same' convolutions here to ensure that the shape of $H(x)$ and $T(x)$ matches that of $x$.

Unfortunately the implementation of 'same' convolutions in Theano using the default convolution operations `T.nnet.conv.conv2d` is a bit challenging. The default approach in Lasagne is to perform a 'full' convolution and then crop it, which can be slow. This is implemented in `lasagne.layers.Conv2DLayer`.

To get an actual 'same' convolution, you could use one of the alternative convolution layer implementations that Lasagne provides, such as `lasagne.layers.dnn.Conv2DDNNLayer`, `lasagne.layers.corrmm.Conv2DMMLayer` or `lasagne.layers.cuda_convnet.Conv2DCCLayer`, all of which support the 'same' convolution mode properly.
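The "full convolution, then crop" idea is easiest to see in one dimension. The following NumPy sketch is an analogy only: it uses `np.convolve` on a 1D signal with an odd filter size, not the Theano or Lasagne code path, and the signal and filter values are made up.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])  # odd filter size, so the crop is symmetric

# A 'full' convolution returns len(signal) + len(kernel) - 1 values.
full = np.convolve(signal, kernel, mode='full')

# Cropping (len(kernel) - 1) // 2 values from each side restores the input length.
crop = (len(kernel) - 1) // 2
same_by_cropping = full[crop:crop + len(signal)]

# NumPy can also compute the 'same' result directly.
same_direct = np.convolve(signal, kernel, mode='same')
```

Both routes give an output with the same length as the input, which is what a highway layer needs; the cropping route simply computes border values that are then thrown away.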

In [6]:
def highway_conv2d(incoming, filter_size,
                   Wh=nn.init.Orthogonal(), bh=nn.init.Constant(0.0),
                   Wt=nn.init.Orthogonal(), bt=nn.init.Constant(-4.0),
                   nonlinearity=nn.nonlinearities.rectify, **kwargs):
    num_channels = incoming.output_shape[1]
    # regular layer
    l_h = nn.layers.Conv2DLayer(incoming, num_filters=num_channels,
                                filter_size=filter_size,
                                border_mode='same', W=Wh, b=bh,
                                nonlinearity=nonlinearity)
    # gate layer
    l_t = nn.layers.Conv2DLayer(incoming, num_filters=num_channels,
                                filter_size=filter_size,
                                border_mode='same', W=Wt, b=bt,
                                nonlinearity=T.nnet.sigmoid)
    
    return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming)

Now let's **build a model** with a number of dense highway layers.

In [7]:
def build_model(input_dim, output_dim, batch_size,
                num_hidden_units, num_hidden_layers):
    """Create a symbolic representation of a neural network with `input_dim`
    input nodes, `output_dim` output nodes, `num_hidden_layers` hidden layers
    and `num_hidden_units` per hidden layer.
    
    The training function of this model must have a mini-batch size of
    `batch_size`.
    """
    l_in = nn.layers.InputLayer((batch_size, input_dim))
    
    # first, project it down to the desired number of units per layer
    l_hidden1 = nn.layers.DenseLayer(l_in, num_units=num_hidden_units)
    
    # then stack highway layers on top of this
    l_current = l_hidden1
    for k in range(num_hidden_layers - 1):
        l_current = highway_dense(l_current)
        
    # finally add an output layer
    l_out = nn.layers.DenseLayer(
        l_current, num_units=output_dim,
        nonlinearity=nn.nonlinearities.softmax,
    )
    
    return l_in, l_out

Now we can **load the data, build the model and compile the necessary Theano functions.**

In [8]:
num_epochs = 50
batch_size = 100
learning_rate = 0.01
momentum = 0.9

print("Loading data...")
dataset = load_data()

print("Building model and compiling functions...")
l_in, l_out = build_model(
    input_dim=dataset['input_dim'],
    output_dim=dataset['output_dim'],
    batch_size=batch_size,
    num_hidden_units=40,
    num_hidden_layers=50,
)

x = l_in.input_var
y = T.ivector('y')
y_pred = nn.layers.get_output(l_out)
loss = T.mean(nn.objectives.categorical_crossentropy(y_pred, y))
params = nn.layers.get_all_params(l_out)
updates = nn.updates.nesterov_momentum(loss, params, learning_rate, momentum)

# compile iteration functions
batch_index = T.iscalar('batch_index')
batch_slice = slice(batch_index * batch_size,
                    (batch_index + 1) * batch_size)

pred = T.argmax(y_pred, axis=1)
accuracy = T.mean(T.eq(pred, y), dtype=theano.config.floatX)

iter_train = theano.function(
    [batch_index], loss,
    updates=updates,
    givens={
        x: dataset['X_train'][batch_slice],
        y: dataset['y_train'][batch_slice],
    },
)

iter_valid = theano.function(
    [batch_index], [loss, accuracy],
    givens={
        x: dataset['X_valid'][batch_slice],
        y: dataset['y_valid'][batch_slice],
    },
)

iter_test = theano.function(
    [batch_index], [loss, accuracy],
    givens={
        x: dataset['X_test'][batch_slice],
        y: dataset['y_test'][batch_slice],
    },
)

Loading data...
Building model and compiling functions...
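The compiled functions take only a batch index and use `batch_slice` to pick rows out of the shared dataset. Here is a small NumPy sketch of the same indexing; the dataset size of 1050 and the helper `batch_rows` are made up for illustration, and the batch count uses floor division in the way the training loop does.

```python
import numpy as np

num_examples, batch_size = 1050, 100  # hypothetical sizes, chosen to leave a remainder
data = np.arange(num_examples)

# Floor division counts only complete batches.
num_batches = num_examples // batch_size

def batch_rows(batch_index):
    """Rows selected by the same slice arithmetic as `batch_slice`."""
    return data[batch_index * batch_size:(batch_index + 1) * batch_size]

first = batch_rows(0)
last = batch_rows(num_batches - 1)
num_skipped = num_examples - num_batches * batch_size
```

With these made-up sizes the last 50 examples are never visited. That is harmless when the dataset size is a multiple of the batch size, but worth keeping in mind otherwise.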


Finally, here's the **main training loop**.

In [9]:
import time

num_batches_train = dataset['num_examples_train'] // batch_size
num_batches_valid = dataset['num_examples_valid'] // batch_size

print("Starting training...")
now = time.time()

try:
    for epoch in range(num_epochs):
        batch_train_losses = []
        for b in range(num_batches_train):
            batch_train_loss = iter_train(b)
            batch_train_losses.append(batch_train_loss)

        avg_train_loss = np.mean(batch_train_losses)

        batch_valid_losses = []
        batch_valid_accuracies = []
        for b in range(num_batches_valid):
            batch_valid_loss, batch_valid_accuracy = iter_valid(b)
            batch_valid_losses.append(batch_valid_loss)
            batch_valid_accuracies.append(batch_valid_accuracy)

        avg_valid_loss = np.mean(batch_valid_losses)
        avg_valid_accuracy = np.mean(batch_valid_accuracies)

        print("Epoch %d of %d took %.3f s" % (epoch + 1, num_epochs, time.time() - now))
        now = time.time()
        print("  training loss:\t\t%.6f" % avg_train_loss)
        print("  validation loss:\t\t%.6f" % avg_valid_loss)
        print("  validation accuracy:\t\t%.2f %%" % (avg_valid_accuracy * 100))
except KeyboardInterrupt:
    pass

Starting training...
Epoch 1 of 50 took 5.509 s
  training loss:		0.903543
  validation loss:		0.337289
  validation accuracy:		90.37 %
Epoch 2 of 50 took 5.529 s
  training loss:		0.314176
  validation loss:		0.232568
  validation accuracy:		92.90 %
Epoch 3 of 50 took 5.574 s
  training loss:		0.239013
  validation loss:		0.190169
  validation accuracy:		94.42 %
Epoch 4 of 50 took 5.588 s
  training loss:		0.196143
  validation loss:		0.166020
  validation accuracy:		95.14 %
Epoch 5 of 50 took 5.602 s
  training loss:		0.166447
  validation loss:		0.151690
  validation accuracy:		95.66 %
Epoch 6 of 50 took 5.618 s
  training loss:		0.144108
  validation loss:		0.140717
  validation accuracy:		96.02 %
Epoch 7 of 50 took 5.608 s
  training loss:		0.126167
  validation loss:		0.134176
  validation accuracy:		96.18 %
Epoch 8 of 50 took 5.614 s
  training loss:		0.111839
  validation loss:		0.129229
  validation accuracy:		96.30 %
Epoch 9 of 50 took 5.623 s
  training loss:		0.099244
  val