# Language Modeling with Recurrent Neural Networks

In this tutorial we'll look at what "modern" Language Modeling (LM) looks like, see what it takes to implement a language model in PyTorch, and then use the trained model to generate some text.

Here is a list of some relevant references:

* http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture8.pdf
* http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture9.pdf
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* https://github.com/pytorch/examples/tree/master/word_language_model
* https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/language_model

As the first order of business, we import some useful Python libraries, along with [PyTorch](https://pytorch.org/).

**Note**: this tutorial expects at least basic familiarity with PyTorch. We strongly encourage you to work through the ["blitz" introduction](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html) to this framework first.

**Technical note**: this tutorial also expects PyTorch to be installed. To install it, we encourage you to follow the ["Get started"](https://pytorch.org/get-started/locally/) guide. Note that this is not necessary when using PyTorch within the Google Colaboratory environment.

In [0]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import torch.nn.functional as F
import random
import numpy as np
from collections import Counter, OrderedDict
from copy import deepcopy
flatten = lambda l: [item for sublist in l for item in sublist]
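random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)  # weight init and sampling below draw from PyTorch's RNG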

The following cell lets us take advantage of GPU acceleration if a GPU is available.

In [0]:
USE_CUDA = torch.cuda.is_available()
gpus = [0]
if USE_CUDA:
    # Only select a device when CUDA is present; this call fails on CPU-only machines.
    torch.cuda.set_device(gpus[0])

FloatTensor = torch.cuda.FloatTensor if USE_CUDA else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if USE_CUDA else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if USE_CUDA else torch.ByteTensor

## Data loading and preprocessing

In order to try out Language Modeling on a relevant data source, we'll use the [Penn Treebank](https://nlpprogress.com/english/language_modeling.html#penn-treebank) dataset, a very common benchmark for Language Modeling.

The dataset consists of portions of WSJ articles that have already been tokenized and split into sentences, one per line of the input file. Furthermore, only the 10,000 most frequent words are kept -- all others have been replaced with the special `<unk>` word.

Running the following cell should download the dataset, provided that `wget` is installed in your environment.

In [0]:
!wget https://raw.githubusercontent.com/NaiveNeuron/nlp-exercises/master/tutorial1-lm/data/ptb.train.txt
!wget https://raw.githubusercontent.com/NaiveNeuron/nlp-exercises/master/tutorial1-lm/data/ptb.test.txt
!wget https://raw.githubusercontent.com/NaiveNeuron/nlp-exercises/master/tutorial1-lm/data/ptb.valid.txt  
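
To make the `<unk>` replacement described above concrete, here is a minimal sketch on a toy corpus. The function name `replace_rare_words` and the toy sentences are made up for illustration; this is not the script that produced the Penn Treebank files, which already arrive preprocessed.

```python
from collections import Counter

def replace_rare_words(sentences, vocab_size):
    """Keep the `vocab_size` most frequent words and map the rest to '<unk>'."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word for word, _ in counts.most_common(vocab_size)}
    return [[w if w in kept else '<unk>' for w in sentence] for sentence in sentences]

# Hypothetical toy corpus: 'the' occurs 3 times, 'cat' and 'sat' twice, the rest once.
toy_sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran'],
]
processed = replace_rare_words(toy_sentences, vocab_size=3)
print(processed)
```

With a vocabulary of three words, only `dog` and `ran` fall outside it and are replaced.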


Although the structure described above makes this dataset nicely human-readable, it needs to be converted into a format that a neural network model can consume. To simplify this conversion, we introduce the following helper functions.

In [0]:
def prepare_sequence(seq, to_index):
    # Words missing from the vocabulary fall back to the index of '<unk>'.
    idxs = [to_index.get(w, to_index["<unk>"]) for w in seq]
    return LongTensor(idxs)

In [0]:
def prepare_ptb_dataset(filename, word2index=None):
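    # Read the file inside a context manager so that it is closed afterwards.
    with open(filename, 'r', encoding='utf-8') as f:
        corpus = f.readlines()
    # Mark the end of every sentence (line) with the special '<eos>' token.
    corpus = flatten([co.strip().split() + ['<eos>'] for co in corpus])

    if word2index is None:
        # Sorting makes the word -> index assignment reproducible across runs.
        vocab = sorted(set(corpus))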
        word2index = {'<unk>': 0}
        for vo in vocab:
            if word2index.get(vo) is None:
                word2index[vo] = len(word2index)
    
    return prepare_sequence(corpus, word2index), word2index

In [0]:
# borrowed code from https://github.com/pytorch/examples/tree/master/word_language_model

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).contiguous()
    if USE_CUDA:
        data = data.cuda()
    return data

In [0]:
def get_batch(data, seq_length):
    for i in range(0, data.size(1) - seq_length, seq_length):
        inputs = Variable(data[:, i: i + seq_length])
        targets = Variable(data[:, (i + 1): (i + 1) + seq_length].contiguous())
        yield (inputs, targets)

With all the helper functions prepared, we are ready to convert the data from a sequence of words (strings) into a sequence of integers, where each integer represents a specific word.

Note that we are only interested in the `word2index` dictionary when loading the training dataset -- in all the other datasets we use the previously created dictionary to encode the input data.
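
As a rough illustration of why the training vocabulary is reused, the sketch below builds a vocabulary from a toy "training" sequence and encodes a toy "validation" sequence with it. The helpers `build_vocab` and `encode` are simplified stand-ins written for this example (plain lists instead of tensors), not the functions defined above.

```python
def build_vocab(tokens):
    """Assign each distinct token an index, reserving 0 for '<unk>'."""
    word2index = {'<unk>': 0}
    for token in tokens:
        if token not in word2index:
            word2index[token] = len(word2index)
    return word2index

def encode(tokens, word2index):
    """Map tokens to indices; unseen tokens fall back to '<unk>'."""
    return [word2index.get(t, word2index['<unk>']) for t in tokens]

# Hypothetical toy data: 'dog' never occurs in the training tokens.
train_tokens = ['the', 'cat', 'sat', '<eos>']
valid_tokens = ['the', 'dog', 'sat', '<eos>']

toy_word2index = build_vocab(train_tokens)
train_ids = encode(train_tokens, toy_word2index)
valid_ids = encode(valid_tokens, toy_word2index)
print(train_ids, valid_ids)
```

Because `dog` was never seen during "training", it is encoded as index 0 (`<unk>`), while the shared words keep the same indices in both sequences.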

In [0]:
train_data, word2index = prepare_ptb_dataset('./ptb.train.txt')
dev_data, _ = prepare_ptb_dataset('./ptb.valid.txt', word2index)
test_data, _ = prepare_ptb_dataset('./ptb.test.txt', word2index)

As a sanity check, let's look at how many entries the `word2index` dictionary contains.

In [0]:
len(word2index)

As we will see in a bit, being able to convert a word to its unique index is very useful. It turns out that the inverse mapping is just as useful, so we'll create the `index2word` dictionary in the next cell.

In [0]:
index2word = {v:k for k, v in word2index.items()}

## Modeling 

<img src="./images/rnnlm-architecture.png">
Image borrowed from http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture8.pdf

In the following cell we'll define a very simple Language Model by subclassing PyTorch's `nn.Module`.

Note that the module is parameterized by `embedding_size`, `hidden_size`, the number of LSTM layers (`n_layers`), and the dropout probability `dropout_p`, which is applied to the output of the embedding layer.

In [0]:
class LanguageModel(nn.Module): 
    def __init__(self, vocab_size, embedding_size, hidden_size, n_layers=1, dropout_p=0.5):

        super(LanguageModel, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.LSTM(embedding_size, hidden_size, n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout_p)
        
    def init_weight(self):
        self.embed.weight = nn.init.xavier_uniform_(self.embed.weight)
        self.linear.weight = nn.init.xavier_uniform_(self.linear.weight)
        self.linear.bias.data.fill_(0)
        
    def init_hidden(self, batch_size):
        hidden = Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size))
        context = Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size))
        return (hidden.cuda(), context.cuda()) if USE_CUDA else (hidden, context)
    
    def detach_hidden(self, hiddens):
        return tuple([hidden.detach() for hidden in hiddens])
    
    def forward(self, inputs, hidden, is_training=False): 
        embeds = self.embed(inputs)
        
        if is_training:
            embeds = self.dropout(embeds)
        out, hidden = self.rnn(embeds, hidden)
        
        return self.linear(out.contiguous().view(out.size(0) * out.size(1), -1)), hidden
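
To see how tensor shapes flow through the three layers used above, here is a small standalone sketch. It rebuilds the layers directly rather than using the `LanguageModel` class, and all sizes are arbitrary toy values chosen for illustration.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the shapes easy to read.
vocab_size, embedding_size, hidden_size, n_layers = 50, 8, 16, 1
batch_size, seq_length = 4, 5

embed = nn.Embedding(vocab_size, embedding_size)
rnn = nn.LSTM(embedding_size, hidden_size, n_layers, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_length))
state = (torch.zeros(n_layers, batch_size, hidden_size),   # hidden state
         torch.zeros(n_layers, batch_size, hidden_size))   # cell state

embeds = embed(token_ids)        # (batch, seq, embedding) = (4, 5, 8)
out, state = rnn(embeds, state)  # (batch, seq, hidden)    = (4, 5, 16)
# Flatten batch and time so each row holds the logits for one position.
logits = linear(out.contiguous().view(batch_size * seq_length, -1))  # (20, 50)

print(embeds.shape, out.shape, logits.shape)
```

Flattening to `(batch * seq, vocab)` is what lets the logits be compared against a flat vector of target indices during training.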

## Training

In [0]:
EMBED_SIZE = 128
HIDDEN_SIZE = 1024
NUM_LAYER = 1
LR = 0.01
SEQ_LENGTH = 30 # for bptt
BATCH_SIZE = 100
EPOCH = 10
VOCAB_SIZE = len(word2index)
DROPOUT_PROB = 0.5
USE_RESCHEDULING = False

In [0]:
train_data = batchify(train_data, BATCH_SIZE)
dev_data = batchify(dev_data, BATCH_SIZE//2)
test_data = batchify(test_data, BATCH_SIZE//2)
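
To make the effect of `batchify` and `get_batch` more tangible, the sketch below applies CPU-only copies of them to a toy stream of 15 consecutive integers. The `toy_` names are introduced here only so that the real helpers and data are not overwritten.

```python
import torch

def toy_batchify(data, bsz):
    # Trim the remainder, then split the stream into `bsz` parallel rows.
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    return data.view(bsz, -1).contiguous()

def toy_get_batch(data, seq_length):
    # Targets are the inputs shifted one position to the right.
    for i in range(0, data.size(1) - seq_length, seq_length):
        inputs = data[:, i: i + seq_length]
        targets = data[:, i + 1: i + 1 + seq_length].contiguous()
        yield inputs, targets

toy_stream = torch.arange(15)                   # 0, 1, ..., 14
toy_batched = toy_batchify(toy_stream, bsz=2)   # the last element (14) is trimmed
toy_batches = list(toy_get_batch(toy_batched, seq_length=3))
first_inputs, first_targets = toy_batches[0]

print(toy_batched)
print(first_inputs)
print(first_targets)
```

Each row of the batched tensor is a contiguous slice of the original stream, which is why the hidden state can be carried over from one batch to the next during training.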

In [0]:
model = LanguageModel(VOCAB_SIZE, EMBED_SIZE, HIDDEN_SIZE, NUM_LAYER, DROPOUT_PROB)
model.init_weight() 
if USE_CUDA:
    model = model.cuda()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

In [0]:
rescheduled = False
for epoch in range(EPOCH):
    total_loss = 0
    losses = []
    hidden = model.init_hidden(BATCH_SIZE)
    for i,batch in enumerate(get_batch(train_data, SEQ_LENGTH)):
        inputs, targets = batch
        hidden = model.detach_hidden(hidden)
        model.zero_grad()
        preds, hidden = model(inputs, hidden, True)

        loss = loss_function(preds, targets.view(-1))
        losses.append(loss.item())
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) # gradient clipping
        optimizer.step()

        if i > 0 and i % 50 == 0:
            print("[{0:2d}/{1:2d}] mean_loss : {2:.2f}, Perplexity : {3:.2f}".format(epoch + 1, EPOCH, np.mean(losses), np.exp(np.mean(losses))))
            losses = []
      
        if USE_RESCHEDULING:
          # learning rate annealing
          # You can use http://pytorch.org/docs/master/optim.html#how-to-adjust-learning-rate
          if not rescheduled and epoch == EPOCH//2:
              LR *= 0.1
              optimizer = optim.Adam(model.parameters(), lr=LR)
              rescheduled = True
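
The comment in the loop above points to PyTorch's built-in learning-rate schedulers. As one possible alternative to recreating the optimizer by hand, the sketch below uses `torch.optim.lr_scheduler.MultiStepLR` to apply the same tenfold decrease halfway through training. The toy model, the random data and the `toy_` names are assumptions made for the sake of a self-contained example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

toy_epochs = 10
toy_lr = 0.01

toy_model = nn.Linear(4, 2)  # stand-in for the language model
toy_optimizer = optim.Adam(toy_model.parameters(), lr=toy_lr)
# Multiply the learning rate by 0.1 once `toy_epochs // 2` epochs have finished.
toy_scheduler = optim.lr_scheduler.MultiStepLR(
    toy_optimizer, milestones=[toy_epochs // 2], gamma=0.1)

lr_per_epoch = []
for epoch in range(toy_epochs):
    lr_per_epoch.append(toy_optimizer.param_groups[0]["lr"])
    # One dummy optimization step per epoch on random data.
    loss = toy_model(torch.randn(8, 4)).pow(2).mean()
    toy_optimizer.zero_grad()
    loss.backward()
    toy_optimizer.step()
    toy_scheduler.step()  # advance the schedule once per epoch

print(lr_per_epoch)
```

Unlike recreating the optimizer, a scheduler keeps Adam's running moment estimates intact when the learning rate changes.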

## Testing
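
Perplexity, the metric reported in this tutorial, is the exponential of the mean per-token cross-entropy. A quick sanity check, sketched below with made-up toy values, is that a model assigning equal probability to every word has a perplexity equal to the vocabulary size.

```python
import math
import torch
import torch.nn as nn

toy_vocab_size = 10
criterion = nn.CrossEntropyLoss()

# All-zero logits correspond to a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(6, toy_vocab_size)
toy_targets = torch.tensor([0, 3, 5, 7, 9, 1])

mean_nll = criterion(uniform_logits, toy_targets).item()  # equals log(10)
perplexity = math.exp(mean_nll)
print(mean_nll, perplexity)
```

A trained model should therefore score far below the vocabulary size; a value close to it suggests the model has learned very little.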

In [0]:
total_loss = 0.0
total_steps = 0
hidden = model.init_hidden(BATCH_SIZE//2)
with torch.no_grad():  # no gradients are needed for evaluation
    for batch in get_batch(test_data, SEQ_LENGTH):
        inputs, targets = batch
        preds, hidden = model(inputs, hidden)
        # Weight each batch's mean loss by its number of time steps.
        total_loss += inputs.size(1) * loss_function(preds, targets.view(-1)).item()
        total_steps += inputs.size(1)

# Average over the time steps that were actually evaluated.
total_loss = total_loss / total_steps
print("Test Perplexity : {:.2f}".format(np.exp(total_loss)))

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_SAMPLES = 500

with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(NUM_LAYER, 1, HIDDEN_SIZE).to(device),
                 torch.zeros(NUM_LAYER, 1, HIDDEN_SIZE).to(device))

        # Select one word id randomly
        prob = torch.ones(VOCAB_SIZE)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(NUM_SAMPLES):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = index2word[word_id]
            word = '\n' if word in ['<eos>', '</s>'] else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
              print('Sampled [{}/{}] words and saved to {}'.format(i+1, NUM_SAMPLES, 'sample.txt'))
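
The sampling cell above feeds exponentiated logits straight into `torch.multinomial`. This works because `torch.multinomial` treats its input as non-negative weights that need not sum to one, so `exp(logits)` and `softmax(logits)` describe the same distribution. The sketch below checks this on made-up logits; `F.softmax` may be the safer choice in practice, since it avoids overflow for large logits.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a vocabulary of four words.
toy_logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])

weights = toy_logits.exp()                                   # unnormalized weights
probs_from_exp = weights / weights.sum(dim=-1, keepdim=True)
probs_from_softmax = F.softmax(toy_logits, dim=-1)           # normalized probabilities

# Either form can be passed to torch.multinomial to draw a word id.
sampled_id = torch.multinomial(probs_from_softmax, num_samples=1).item()
print(probs_from_softmax, sampled_id)
```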

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_SAMPLED_WORDS = 200
NUM_SAMPLES = 5

def sequence_to_variables(seq, word2index):
  return [Variable(x).view(1, 1) for x in prepare_sequence(seq, word2index)]

starting_sequence = ['this', 'is', 'the']

with torch.no_grad():
    with open('sample_prefilled.txt', 'w') as f:
      for k in range(NUM_SAMPLES):
        # Set initial hidden and cell states
        state = (torch.zeros(NUM_LAYER, 1, HIDDEN_SIZE).to(device),
                 torch.zeros(NUM_LAYER, 1, HIDDEN_SIZE).to(device))

        seq_of_variables = sequence_to_variables(starting_sequence, word2index)

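        # Warm up the state on every word of the starting sequence except the last one
        for input in seq_of_variables[:-1]:
          output, state = model(input, state)

        # The last word of the starting sequence is the first input of the sampling loop
        input = seq_of_variables[-1]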
        for i in range(NUM_SAMPLED_WORDS):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = index2word[word_id]
            word = '\n' if word in ['<eos>', '</s>'] else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
              print('Sampled {} [{}/{}] words and saved to {}'.format(k+1, i+1, NUM_SAMPLED_WORDS, 'sample_prefilled.txt'))
              
        f.write('\n\n')

## Further topics

* [Pointer Sentinel Mixture Models](https://arxiv.org/pdf/1609.07843.pdf)
* [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182)