# Train a recurrent neural net on FML

##### Today, while guitaring my morning advice with my time of walking girl. I lost bored, oddly buriover at a 2%, my boyfriend dumped me next to me. FML 

###### -- FMLBot

Bots have tough lives too, so we should listen to their sorrows.

Table of contents:
1. [Introduction](#Introduction)
2. [Imports](#Imports)
3. [Load FML data](#Load-FML-data)
4. [Parameter settings](#Parameters)
5. [Training the model](#Training-the-model)
6. [Sampling from the model](#Sampling-from-the-model)
7. [Saving the model](#Saving-the-model)

## Introduction

<b>Spaghetto</b> provides tools for training recurrent neural networks (RNNs) on sequences and for generating new sequences from them. This works especially well for text.

The idea for Spaghetto was sparked by Andrej Karpathy's [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness). While he provided working code, it was written in Lua, which fewer people are familiar with than Python. Spaghetto's interface closely resembles that of the excellent [nolearn](https://github.com/dnouri/nolearn) library, which in turn is mostly the same as that of [scikit-learn](http://scikit-learn.org). The layers are taken from the [Lasagne](https://github.com/Lasagne/Lasagne) library. Further inspiration was drawn from [Shawn Tan](https://github.com/shawntan/theano-nlp)'s [Theano](http://deeplearning.net/software/theano)-based implementation.

## Imports

In [1]:
from __future__ import unicode_literals

In [2]:
import random
import re
from functools import partial

In [3]:
from lasagne.layers import Layer, InputLayer, EmbeddingLayer
from lasagne import nonlinearities
from lasagne.updates import rmsprop
from lasagne.layers import GRULayer

Using gpu device 0: GeForce GTX 970 (CNMeM is disabled)


In [4]:
from spaghetto.model import RNN
from spaghetto.utils import TokenEncoder
from spaghetto.recurrent import RNNDenseLayer

## Load FML data

First we load the data from FML. Unfortunately, I don't think I can provide this data to you. 

Original data source: [FML](http://www.fmylife.com)

Each FML is stored in its own row of the text file. Therefore, _X_ will be a list of strings, with each element corresponding to one FML. This is a good representation if the lines are independent of each other. However, there are cases when this is not true, for example when we want the RNN to train on a book. In that case, our _X_ should just be one long string. Spaghetto will then automatically break this string into lines for us.

To recapitulate:
* Use one long string as input if the content of the data is related.
* Use a list of strings if the elements are independent of one another, as is the case here.

In [5]:
X = open('/home/bbossan_dev/Downloads/fmls.txt').readlines()

The text is encoded in latin-1.

In [6]:
X = [x.decode('latin-1') for x in X]
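The two cells above read raw bytes and decode them afterwards, which relies on Python 2 string behaviour. As a minimal sketch, reading and decoding can also be done in one step with `io.open`, which works on both Python 2 and Python 3. The helper name `read_lines` and the demo file are assumptions made for illustration; they are not part of Spaghetto.

```python
import io
import os
import tempfile


def read_lines(path, encoding='latin-1'):
    """Return the lines of a text file as unicode strings, one entry per line."""
    # io.open decodes while reading, on both Python 2 and Python 3.
    with io.open(path, encoding=encoding) as handle:
        return handle.readlines()


# Demonstration on a throwaway file, since the real fmls.txt is not distributed.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_fmls.txt')
with io.open(demo_path, 'w', encoding='latin-1') as handle:
    handle.write(u'Today, I spilled my caf\xe9 au lait. FML\n')
    handle.write(u'Today, my bot wrote better FMLs than me. FML\n')

X_demo = read_lines(demo_path)
```

As with `readlines()` in the original cell, each element keeps its trailing newline.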

Here is what a sample FML looks like.

In [7]:
print(X[0])

Today, my brother told me to, "Stop bitching and get over it" after I complained of pain from my stomach after invasive surgery. This from the guy who spends multiple hours a day playing Halo and whining about the stupid ways he got killed. FML 



### Encode data

The RNN will need to know how to encode the text. For that, we use the spaghetto TokenEncoder. Internally, this converts the characters into integers that are used to look up the character's embedding from the embedding layer.

In [8]:
encoder = TokenEncoder().fit(X)

We could also tell the encoder to split on the word level instead of the character level by setting the _separator_ argument to a space, i.e. _encoder = TokenEncoder(separator=' ')_. You will generally get better results from a character-level encoding, though.
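To make the idea of token encoding concrete, here is a toy encoder that maps tokens to integers and back, on either the character or the word level. This is a hypothetical stand-in written for illustration: the class `ToyTokenEncoder` and its attribute names are my own and say nothing about how Spaghetto's `TokenEncoder` is implemented internally.

```python
class ToyTokenEncoder(object):
    """Hypothetical stand-in for an encoder; not Spaghetto's implementation."""

    def __init__(self, separator=None):
        # separator=None means character level; ' ' means word level.
        self.separator = separator

    def _tokenize(self, line):
        if self.separator is None:
            return list(line)
        return line.split(self.separator)

    def fit(self, lines):
        # Collect the vocabulary and give every token a stable integer index.
        tokens = sorted({tok for line in lines for tok in self._tokenize(line)})
        self.token_to_index_ = {tok: idx for idx, tok in enumerate(tokens)}
        self.index_to_token_ = {idx: tok for tok, idx in self.token_to_index_.items()}
        return self

    def transform(self, lines):
        return [[self.token_to_index_[tok] for tok in self._tokenize(line)]
                for line in lines]

    def inverse_transform(self, encoded):
        joiner = '' if self.separator is None else self.separator
        return [joiner.join(self.index_to_token_[idx] for idx in row)
                for row in encoded]


lines = ['Today, FML', 'Today, oops']
char_encoder = ToyTokenEncoder().fit(lines)
word_encoder = ToyTokenEncoder(separator=' ').fit(lines)
char_codes = char_encoder.transform(lines)
word_codes = word_encoder.transform(lines)
```

The integers produced this way are what an embedding layer uses as row indices. On the word level the vocabulary grows much faster than on the character level, which is one reason character-level models are easier to train on small data sets.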

## Parameters

Here we specify the parameters for the model:

In [9]:
embedding_size = 50
num_units = 200
max_len_line = 300
update = rmsprop
learning_rate = 1e-2

The meanings are:
* _embedding_size_: The size of character embeddings.
* _num_units_: Number of units used for the hidden state of the network.
* _max_len_line_: Cut a line if it contains more than this many tokens. This is mainly there because a few outliers may be very long, which would lead to all other samples in the batch being padded to that length, which in turn lowers performance.
* _update_: The updating rule for the parameters; for RNNs, RMSprop seems to be the default.
* _learning_rate_: The learning rate of the updater.
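The effect of _max_len_line_ on padding can be illustrated with a small sketch. The function `pad_batch` and the padding value `0` are assumptions made for this example; Spaghetto's actual batching (which may use masks rather than a padding value) is not shown here.

```python
def pad_batch(encoded_lines, max_len_line, pad_value=0):
    """Truncate each sequence to max_len_line, then pad to the longest one left."""
    truncated = [seq[:max_len_line] for seq in encoded_lines]
    width = max(len(seq) for seq in truncated)
    return [seq + [pad_value] * (width - len(seq)) for seq in truncated]


# Two short lines and one extreme outlier of 1000 tokens.
batch = [[1, 2, 3], [4, 5], list(range(1000))]

unbounded = pad_batch(batch, max_len_line=10 ** 6)
bounded = pad_batch(batch, max_len_line=5)

# Number of cells the network has to process in each case.
cells_unbounded = sum(len(row) for row in unbounded)
cells_bounded = sum(len(row) for row in bounded)
```

Without a limit, the two short lines are padded to 1000 tokens each, so most of the computation is spent on padding.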

Next we specify the architecture of the net:

In [10]:
layers = [
    (InputLayer, {}),
    (EmbeddingLayer, {'output_size': embedding_size}),
    (GRULayer, {'num_units': num_units}),
    (GRULayer, {'num_units': num_units}),
    (RNNDenseLayer, {'nonlinearity': nonlinearities.identity}),
]

The layers are:
* _InputLayer_: Always begin with an input layer. Spaghetto takes care of setting all required parameters of the InputLayer automatically for you.
* _EmbeddingLayer_: Each character/token has its own embedding, which is stored in the embedding layer. Spaghetto automatically sets the input size for us. This layer could be used to find token similarities, similar to [word2vec](https://code.google.com/p/word2vec) word embeddings. This of course makes more sense when you tokenize on words and not on characters.
* 2 _GRULayers_: This is our recurrent layer of choice here. We could also use Lasagne's LSTMLayer, which might improve the outcome at the cost of slower training time. You can also try to use more recurrent layers, or different numbers of units per recurrent layer.
* _RNNDenseLayer_: This is a convenient DenseLayer provided by Spaghetto that preserves the 3-dimensional shape of recurrent layers (batch size x time x number of units) instead of flattening the output to 2 dimensions, as Lasagne's DenseLayer would. We should set the output nonlinearity to _identity_ because Spaghetto automatically applies the correct nonlinearity afterwards. Spaghetto also takes care to automatically set the number of output units for us.
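What "preserving the 3-dimensional shape" means can be sketched in plain NumPy. One common way to do it is to merge the batch and time axes, apply an ordinary dense transform, and split the axes again. The function `dense_keep_time` and the sizes below are assumptions for illustration; this is not taken from Spaghetto's `RNNDenseLayer` source.

```python
import numpy as np


def dense_keep_time(x, W, b):
    """Apply the same dense transform at every time step of a 3-D input."""
    batch_size, n_steps, n_in = x.shape
    # Merge batch and time so that an ordinary matrix product can be used.
    flat = x.reshape(batch_size * n_steps, n_in)
    out = flat.dot(W) + b
    # Restore the (batch size x time x units) layout.
    return out.reshape(batch_size, n_steps, W.shape[1])


rng = np.random.RandomState(0)
x = rng.normal(size=(4, 7, 200))   # batch of 4, 7 time steps, 200 hidden units
W = rng.normal(size=(200, 60))     # 60 output units, e.g. one per token
b = np.zeros(60)

y = dense_keep_time(x, W, b)
```

Every time step of every sample gets its own vector of 60 outputs, which is what is needed to predict the next token at each position.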

Finally, we specify the recurrent neural network itself:

In [11]:
rnn = RNN(
    layers,
    encoder=encoder,
    verbose=1,
    updater=partial(update, learning_rate=learning_rate),
    max_epochs=15,
    max_len_line=max_len_line,
    eval_size=0.2,
)

The parameters are:
* _layers_: First and most importantly, we pass the layers we just defined.
* _encoder_: The net needs to know how to encode the data, so we pass the encoder that we fitted above.
* _verbose_: Set this value to greater than 0 to receive some useful information about the net during training.
* _updater_: The update rule.
* _max_epochs_: Maximum number of epochs to train.
* _max_len_line_: The maximum number of characters per line.
* _eval_size_: Proportion of data held back for validation.
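The _updater_ argument above is built with `functools.partial`, which fixes the learning rate so that the net only has to supply the remaining arguments later. The sketch below shows the mechanism with a toy update rule. `toy_update` is a hypothetical stand-in (plain gradient descent) whose argument order is assumed to mirror that of Lasagne's update functions; it is not `rmsprop`.

```python
from functools import partial


def toy_update(loss_or_grads, params, learning_rate=1.0):
    """Hypothetical stand-in for an update rule: one gradient descent step."""
    return [p - learning_rate * g for p, g in zip(params, loss_or_grads)]


# Bind the learning rate now; gradients and parameters are supplied later.
updater = partial(toy_update, learning_rate=1e-2)

grads = [1.0, -2.0]
params = [0.5, 0.5]
new_params = updater(grads, params)
```

This way the learning rate can be changed in one place without touching the call site inside the model.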

## Training the model

One special parameter for training that does not exist in sklearn or nolearn is _on_nth_batch_. The Spaghetto RNN has a callback after each batch, which allows you to get more frequent feedback from the model. Waiting for a whole epoch to finish can be tedious, especially if you train on a lot of data. On the other hand, getting feedback after every batch is too frequent in most cases. The _on_nth_batch_ parameter therefore lets you control how often you get feedback: with _on_nth_batch=250_, you get feedback after every 250th batch.
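The every-nth-batch idea can be sketched as a plain training loop. All names here (`train_one_epoch`, `train_step`, `report`) are hypothetical and only illustrate the control flow; they are not Spaghetto's internals.

```python
def train_one_epoch(batches, train_step, on_nth_batch, report):
    """Train on every batch, but only report after every nth one."""
    for count, batch in enumerate(batches, start=1):
        train_step(batch)
        if count % on_nth_batch == 0:
            report(count)


reported = []
train_one_epoch(
    batches=range(1000),             # stand-in for 1000 batches of data
    train_step=lambda batch: None,   # stand-in for the real update
    on_nth_batch=250,
    report=reported.append,
)
```

With 1000 batches and `on_nth_batch=250`, the report function is called four times within a single epoch.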

During training you will receive some information from the model, as you would when using nolearn. Here the information is:
* _epoch_: How many epochs have passed (i.e. full passes through the training data).
* _train perpl._: The perplexity on the training set. Lower is better, with 1 being the minimum. If a new best value is achieved, it is colored <font color="cyan">cyan</font>.
* _valid perpl._: The perplexity on the held-out validation set. If a new best value is achieved, it is colored <font color="green">green</font>.
* _train/valid_: Ratio of train to validation perplexity. If this becomes too low, it means that your model overfits. How about adding a dropout layer?
* _duration_: Time it took to train the last batches.
* _sample_: Especially with RNNs that can take quite some time to train, it is important to monitor the progress during training. In addition to the perplexity, a more *human-friendly* way to achieve this is to look at samples generated by the RNN. Are they gibberish or do they look more and more reasonable as training progresses?
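For readers unfamiliar with perplexity, the sketch below uses the standard definition: the exponential of the mean negative log-probability that the model assigns to the true next tokens. I assume that the values reported by Spaghetto follow this definition; the function and the probabilities are made up for illustration.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is always certain and right reaches the minimum of 1.
perfect = perplexity([1.0, 1.0, 1.0])

# A model that guesses uniformly among 50 tokens has a perplexity of 50.
uniform = perplexity([1.0 / 50] * 4)

# Train/valid ratio for made-up probabilities on train and validation data.
ratio = perplexity([0.3, 0.2]) / perplexity([0.25, 0.2])
```

A perplexity of about 4, as reached in the log below, can thus be read as the model being roughly as uncertain as if it had to choose uniformly among four characters at each step.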

In [12]:
rnn.fit(X, on_nth_batch=250)



 - Compiling functions ...
   ... finished compilation.

  epoch    train perpl.    valid perpl.    train/valid  duration    sample
-------  --------------  --------------  -------------  ----------  ------------------------------------------------------------------------------------------------------
      1        14.05023        13.94870        1.00728  108.52s     Today, I I whemos haok oth h" I wathrey boeithI honl harkg I vhe theol ont an dtor wuttar7 and, FML
      1         3.90302         4.03950        0.96621  114.22s     Today, my butel, I girled mirur gomelliendyi't gertsery, witt gloset'ss. Oil_ater, writeth, so deower, octior onreat in al, it lookked a liglried, and, intiouttyrarmiends, oneler, sone bablitoldor mirster, ous,b'ssserp, arriend, antalle woilled lastsicarpedsse of my seecid it really pelletuous." So
      1         3.78602         3.74330        1.01140  108.59s     Today, I had buy. He walked my date can

<spaghetto.model.RNN at 0x7fd7e00bf6d0>

## Sampling from the model

It is possible to sample from the RNN using the _sample_ method. The first argument is the number of samples you want to get back. This method is unfortunately a little slow.
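Conceptually, sampling from a character-level model means drawing one token at a time from the predicted distribution and feeding it back in, which may be part of why it is slow. The sketch below shows this loop with a toy model. Everything in it (`draw`, `sample_sequence`, `toy_model`, the three-token vocabulary) is hypothetical and does not reflect how `rnn.sample` is implemented.

```python
import random


def draw(probabilities, rng):
    """Draw an index with probability proportional to its entry."""
    threshold = rng.random() * sum(probabilities)
    cumulative = 0.0
    for index, prob in enumerate(probabilities):
        cumulative += prob
        if threshold < cumulative:
            return index
    return len(probabilities) - 1  # guard against rounding errors


def sample_sequence(next_token_probs, start, end_token, max_len, rng):
    """Generate tokens one by one until end_token or max_len is reached."""
    tokens = list(start)
    while len(tokens) < max_len:
        token = draw(next_token_probs(tokens), rng)
        tokens.append(token)
        if token == end_token:
            break
    return tokens


def toy_model(tokens):
    # Hypothetical 3-token model that ignores its input; token 2 ends a line.
    return [0.45, 0.45, 0.10]


rng = random.Random(0)
samples = [sample_sequence(toy_model, [0], 2, 20, rng) for _ in range(10)]
```

A real RNN would replace `toy_model` with a forward pass that depends on the tokens generated so far.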

In [24]:
samples = rnn.sample(10)

In [23]:
for sample in samples:
    print(sample)

Today, we watched it and killvas that nert work and turn and that looks like me walk. Horror, the nur had bugs. FML 

Today, my 5-year old daughter picked because birthday. FML

Today, I discovered my potion months got banaled from his email punbous money that I've snaved myself his camera's. FML 

Today, while guitaring my morning advice with my time of walking girl. I lost bored, oddly buriover at a 2%, my boyfriend dumped me next to me. FML 

Today, thinking I gave hot both horribly dumping myself look going how had doesn't tell my book. And multing over my walle centre awake and be giving me blood camera my mom. FML 

Today, after a strearic, I'm in line that she'd ringzendy borrowed morning. The engagement ran video phone and said, "Surry, she has been landing. " FML 

Today, my ex girls asked me to move ever had been slowly on sticking my dream. Then my mom had been kissing our excuse for a high group. FML 

Today, on my body)ry, I moded stage midnight. After an employy, I google

The results already look quite nice but could certainly improve. Mainly, this could be achieved with more training data, but for this, we unfortunately have to wait. Further simple ways to improve the outcome could be:
* train longer
* use LSTMLayers instead of GRULayers
* use more layers or a different architecture
* use a different update rule or learning rate
* use different embedding sizes or number of units
* use dropout or other ways to regularize the net

## Saving the model

The RNN can be saved with _save_params_to_ so that we can later load it again to train some more or just create some fresh samples. To load the model, initialize it again and call _load_params_from_.

In [13]:
rnn.save_params_to('../save/fmybotlife.pkl')