# Highway networks
**A quick example of how to implement [highway networks](http://arxiv.org/abs/1505.00387) in Lasagne.** This uses the code from the MNIST example bundled with Lasagne, so you will need to have the `mnist.py` file available for importing.

## What's a highway network?
The paper linked above introduces a new type of neural network layer, which works roughly as follows.

Let `x` be the input to a layer. Then a typical neural network layer computes some nonlinear transform of this input `y = H(x)`.

A highway layer also computes an additional nonlinear transform `T(x)`, which in practice is constrained to the interval `[0, 1]`. The output of the layer is then `y = T(x) * H(x) + (1 - T(x)) * x`, where the multiplication is elementwise.

In other words, **depending on the gate values, the layer behaves as a traditional layer would (`T(x) = 1`), or passes its input through unchanged (`T(x) = 0`)**. This idea is inspired by the gates in LSTM units. According to the authors, **it enables gradient descent-based training of much deeper networks** with as many as 900 layers.
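The two extreme gate settings can be checked numerically. The following is a minimal NumPy sketch rather than Lasagne code: `highway_output` is a hypothetical helper name, and the ReLU of a scaled input is only a stand-in for `H(x)`.

```python
import numpy as np

def highway_output(x, h, t):
    """Combine the transform h = H(x) with the input x using gate values t = T(x)."""
    return t * h + (1.0 - t) * x

x = np.array([1.0, -2.0, 3.0])
h = np.maximum(0.0, 2.0 * x)  # stand-in for H(x): ReLU of a scaled input

# T(x) = 1 everywhere: the layer behaves like a traditional layer.
y_open = highway_output(x, h, np.ones_like(x))
# T(x) = 0 everywhere: the input passes through unchanged.
y_closed = highway_output(x, h, np.zeros_like(x))
# T(x) = 0.5 everywhere: an equal blend of H(x) and x.
y_mixed = highway_output(x, h, np.full_like(x, 0.5))
```

In a real highway layer the gate values are computed from `x` and generally differ per unit, so each output lies somewhere between these extremes.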

Note that a highway layer needs to have as many outputs as inputs: the shapes of `x`, `H(x)` and `T(x)` all have to match. To change the dimensionality in a highway network, the authors suggest inserting a traditional neural network layer.
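To illustrate the shape constraint, here is a rough NumPy sketch in which a traditional layer first changes the dimensionality and a few highway layers then keep it fixed. The helper names (`dense`, `highway`), the sizes, and the random weight scales are all assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    """Traditional fully connected layer with a ReLU nonlinearity."""
    return np.maximum(0.0, x.dot(W) + b)

def highway(x, Wh, bh, Wt, bt):
    """Highway layer: Wh and Wt are square, so the output shape equals the input shape."""
    h = np.maximum(0.0, x.dot(Wh) + bh)
    t = sigmoid(x.dot(Wt) + bt)
    return t * h + (1.0 - t) * x

batch_size, input_dim, num_units = 4, 784, 20  # illustrative sizes
x = rng.randn(batch_size, input_dim)

# A traditional layer changes the dimensionality: 784 -> 20.
W_proj = 0.01 * rng.randn(input_dim, num_units)
hidden = dense(x, W_proj, np.zeros(num_units))

# Highway layers then keep the width fixed at 20.
for _ in range(3):
    Wh = 0.1 * rng.randn(num_units, num_units)
    Wt = 0.1 * rng.randn(num_units, num_units)
    hidden = highway(hidden, Wh, np.zeros(num_units), Wt, np.zeros(num_units))

print(hidden.shape)  # (4, 20)
```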

## Purpose
I had read the paper and wanted to try it out. I figured this would be a good way of showing how to implement a new concept or idea in Lasagne. **This use case of trying out new ideas and implementing new types of layers is extremely important to us, and we are trying to make it as easy as possible to use Lasagne in this way.**

## Approach
I decided to implement this idea in two steps. First, I added a `MultiplicativeGatingLayer`, which performs the following operation: `y(t, x1, x2) = t * x1 + (1 - t) * x2`. In other words, the first input `t` multiplicatively gates between the other two inputs, `x1` and `x2`.

This then makes it possible to use any layer we like for computing `t` and `x1` (and `x2` is taken to be the output of the previous layer). I implemented two "macro functions" on top of this: `highway_dense` and `highway_conv2d`. They create fully connected and 2D convolutional highway layers respectively.

This two-step approach allows for some code reuse and easy implementation of different types of highway layers.

In [1]:
import numpy as np
import theano
import theano.tensor as T
import lasagne as nn

Using gpu device 0: GeForce GT 540M


**First, create a custom layer class for the multiplicative gating operation.** This is a layer with multiple input layers, three to be precise: `x`, `H(x)` and `T(x)`. In Lasagne, this means it needs to inherit from `MergeLayer`.

In [2]:
class MultiplicativeGatingLayer(nn.layers.MergeLayer):
    """
    Generic layer that combines its 3 inputs t, h1, h2 as follows:
    y = t * h1 + (1 - t) * h2
    """
    def __init__(self, gate, input1, input2, **kwargs):
        incomings = [gate, input1, input2]
        super(MultiplicativeGatingLayer, self).__init__(incomings, **kwargs)
        assert gate.output_shape == input1.output_shape == input2.output_shape
    
    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]
    
    def get_output_for(self, inputs, **kwargs):
        return inputs[0] * inputs[1] + (1 - inputs[0]) * inputs[2]

**Now we can define a macro function to create a dense highway layer.** Note that it does not take a `num_units` input argument: the number of outputs should always be the same as the number of inputs, so it is redundant.

In [3]:
def highway_dense(incoming, Wh=nn.init.Orthogonal(), bh=nn.init.Constant(0.0),
                  Wt=nn.init.Orthogonal(), bt=nn.init.Constant(-4.0),
                  nonlinearity=nn.nonlinearities.rectify, **kwargs):
    num_inputs = int(np.prod(incoming.output_shape[1:]))
    # regular layer
    l_h = nn.layers.DenseLayer(incoming, num_units=num_inputs, W=Wh, b=bh,
                               nonlinearity=nonlinearity)
    # gate layer
    l_t = nn.layers.DenseLayer(incoming, num_units=num_inputs, W=Wt, b=bt,
                               nonlinearity=T.nnet.sigmoid)
    
    return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming,
                                     **kwargs)
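
The gate bias default of `-4.0` in `highway_dense` biases each layer towards passing its input through. The sketch below is a NumPy illustration under a simplifying assumption: it evaluates the gate at a pre-activation equal to the bias alone, ignoring the contribution of `Wt`, which in practice depends on the input. The all-zero `h` is just a placeholder for some `H(x)` that differs from `x`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplifying assumption: the gate pre-activation equals the bias bt = -4.0.
gate_at_init = sigmoid(-4.0)
print(round(gate_at_init, 4))  # 0.018

x = np.array([0.5, 1.0, 2.0])
h = np.zeros_like(x)  # placeholder for an H(x) unrelated to x

# With a gate this close to 0, the output stays within about 2% of the input.
y = gate_at_init * h + (1.0 - gate_at_init) * x
```

Under this assumption a freshly initialised stack of such layers starts out close to the identity, and training can open the gates where the transform `H(x)` is useful.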

**We can easily do the same for a 2D convolution highway layer.** As mentioned in the paper, we need to use 'same' convolutions here to ensure that the shape of `H(x)` and `T(x)` matches that of `x`.

Unfortunately, implementing 'same' convolutions with Theano's default convolution operation, `T.nnet.conv.conv2d`, is a bit challenging. The default approach in Lasagne is to perform a 'full' convolution and then crop the result, which can be slow. This is implemented in `lasagne.layers.Conv2DLayer`.
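
The "full convolution, then crop" idea can be shown in one dimension with NumPy. This is only an analogy to what the 2D layer does, not the Lasagne or Theano implementation, and it assumes an odd filter length so that the crop is symmetric.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])  # odd length, so the crop is symmetric

# 'full' convolution: output length is len(signal) + len(kernel) - 1 = 7.
full = np.convolve(signal, kernel, mode='full')

# Crop (len(kernel) - 1) // 2 values from each side to get back to length 5.
pad = (len(kernel) - 1) // 2
cropped = full[pad:pad + len(signal)]

# NumPy's own 'same' mode gives the same result directly.
same = np.convolve(signal, kernel, mode='same')
```

The cropped result matches the 'same' result, but the full convolution computes border values that are then thrown away, which is where the extra cost comes from.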

To get an actual 'same' convolution, you could use one of the alternative convolution layer implementations that Lasagne provides, such as `lasagne.layers.dnn.Conv2DDNNLayer`, `lasagne.layers.corrmm.Conv2DMMLayer` or `lasagne.layers.cuda_convnet.Conv2DCCLayer`, all of which support the 'same' convolution mode properly.

In [4]:
def highway_conv2d(incoming, filter_size,
                   Wh=nn.init.Orthogonal(), bh=nn.init.Constant(0.0),
                   Wt=nn.init.Orthogonal(), bt=nn.init.Constant(-4.0),
                   nonlinearity=nn.nonlinearities.rectify, **kwargs):
    num_channels = incoming.output_shape[1]
    # regular layer
    l_h = nn.layers.Conv2DLayer(incoming, num_filters=num_channels,
                                filter_size=filter_size,
                                border_mode='same', W=Wh, b=bh,
                                nonlinearity=nonlinearity)
    # gate layer
    l_t = nn.layers.Conv2DLayer(incoming, num_filters=num_channels,
                                filter_size=filter_size,
                                border_mode='same', W=Wt, b=bt,
                                nonlinearity=T.nnet.sigmoid)
    
    return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming,
                                     **kwargs)

Now we'll import some helper code from the MNIST example bundled with Lasagne, and use it to build and train a highway model.

In [5]:
from mnist import load_data, train, create_iter_functions

In [6]:
def build_model(input_dim, output_dim, batch_size=100,
                num_hidden_units=20, num_hidden_layers=50):
    """Create a symbolic representation of a neural network with `input_dim`
    input nodes, `output_dim` output nodes, `num_hidden_layers` hidden layers
    and `num_hidden_units` per hidden layer.
    
    The training function of this model must have a mini-batch size of
    `batch_size`.
    """
    l_in = nn.layers.InputLayer((batch_size, input_dim))
    
    # first, project it down to the desired number of units per layer
    l_hidden1 = nn.layers.DenseLayer(l_in, num_units=num_hidden_units)
    
    # then stack highway layers on top of this
    l_current = l_hidden1
    for k in range(num_hidden_layers - 1):
        l_current = highway_dense(l_current)
        
    # finally add an output layer
    l_out = nn.layers.DenseLayer(
        l_current, num_units=output_dim,
        nonlinearity=nn.nonlinearities.softmax,
    )
    
    return l_out

Unfortunately we need to redefine the `main` function here. We cannot import it because it has to use our version of `build_model`, not the one defined in `mnist.py`.

In [7]:
import time

def main(num_epochs=50):
    print("Loading data...")
    dataset = load_data()

    print("Building model and compiling functions...")
    output_layer = build_model(
        input_dim=dataset['input_dim'],
        output_dim=dataset['output_dim'],
    )
    iter_funcs = create_iter_functions(dataset, output_layer)

    print("Starting training...")
    now = time.time()
    try:
        for epoch in train(iter_funcs, dataset):
            print("Epoch {} of {} took {:.3f}s".format(
                epoch['number'], num_epochs, time.time() - now))
            now = time.time()
            print("  training loss:\t\t{:.6f}".format(epoch['train_loss']))
            print("  validation loss:\t\t{:.6f}".format(epoch['valid_loss']))
            print("  validation accuracy:\t\t{:.2f} %%".format(
                epoch['valid_accuracy'] * 100))

            if epoch['number'] >= num_epochs:
                break

    except KeyboardInterrupt:
        pass

    return output_layer

In [8]:
main()

Loading data...
Building model and compiling functions...




Starting training...
Epoch 1 of 50 took 5.111s
  training loss:		2.030891
  validation loss:		1.723327
  validation accuracy:		62.33 %%
Epoch 2 of 50 took 5.135s
  training loss:		1.473506
  validation loss:		1.177809
  validation accuracy:		76.32 %%
Epoch 3 of 50 took 5.040s
  training loss:		0.994532
  validation loss:		0.762730
  validation accuracy:		82.26 %%
Epoch 4 of 50 took 5.199s
  training loss:		0.679902
  validation loss:		0.535956
  validation accuracy:		86.06 %%
Epoch 5 of 50 took 5.134s
  training loss:		0.514816
  validation loss:		0.430797
  validation accuracy:		88.16 %%
Epoch 6 of 50 took 5.120s
  training loss:		0.427798
  validation loss:		0.372764
  validation accuracy:		89.27 %%
Epoch 7 of 50 took 5.271s
  training loss:		0.374679
  validation loss:		0.333487
  validation accuracy:		90.05 %%
Epoch 8 of 50 took 5.325s
  training loss:		0.338382
  validation loss:		0.303995
  validation accuracy:		90.65 %%
Epoch 9 of 50 took 5.467s
  training loss:		0.311969
  val

<lasagne.layers.dense.DenseLayer at 0x7f5741462e50>