# HW 2: Language Modeling

In this homework you will be building several varieties of language models.

## Goal

We ask that you construct the following models in PyTorch:

1. A trigram model with linear interpolation: $$p(y_t | y_{1:t-1}) =  \alpha_1 p(y_t | y_{t-2}, y_{t-1}) + \alpha_2 p(y_t | y_{t-1}) + (1 - \alpha_1 - \alpha_2) p(y_t) $$
2. A neural network language model (consult *A Neural Probabilistic Language Model* http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
3. An LSTM language model (consult *Recurrent Neural Network Regularization*, https://arxiv.org/pdf/1409.2329.pdf) 
4. Your own extensions to these models...


Consult the papers provided for hyperparameters.
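
To make the interpolation formula in item 1 concrete, here is a minimal count-based sketch in plain Python. The function names, the toy corpus, and the weights `alpha1=0.5`, `alpha2=0.3` are illustrative assumptions, not values prescribed by the assignment; in practice the weights would be tuned on validation data, and the final model is expected to be written in PyTorch.

```python
from collections import Counter

def train_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(w, prev2, prev1, counts, alpha1=0.5, alpha2=0.3):
    """p(w | prev2, prev1) as a weighted mix of trigram, bigram and unigram estimates."""
    uni, bi, tri = counts
    total = sum(uni.values())
    p_uni = uni[w] / total
    # Maximum-likelihood estimates; fall back to 0 when the context was never seen.
    # (Using uni[prev1] as the bigram denominator is slightly off for the very last
    # token of the corpus, which has no successor.)
    p_bi = bi[(prev1, w)] / uni[prev1] if uni[prev1] else 0.0
    p_tri = tri[(prev2, prev1, w)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return alpha1 * p_tri + alpha2 * p_bi + (1 - alpha1 - alpha2) * p_uni

# Toy corpus, purely for illustration
tokens = "the cat sat on the mat <eos> the cat ate <eos>".split()
counts = train_counts(tokens)
p = interpolated_prob("sat", "the", "cat", counts)
```

Because the unigram term is never zero for a word seen in training, the interpolated estimate stays positive even when the trigram or bigram was never observed.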

 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [1]:
# Text processing library
import torchtext
from torchtext.vocab import Vectors, GloVe
import numpy as np
import time
from utils import variable
from tqdm import tqdm

The dataset we will use for this problem is known as the Penn Treebank (http://aclweb.org/anthology/J93-2004). It is the most famous dataset in NLP and includes a large set of different types of annotations. We will be using it here in a simple case as just a language modeling dataset.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [2]:
# Our input $x$
TEXT = torchtext.data.Field()

Next we input our data. Here we will use the first 10k sentences of the standard PTB language modeling split, and tell it the fields.

In [39]:
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path='../HW2/',
    train="shuffled_train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)

In [42]:
import os
from sklearn.utils import shuffle

def shuffle_train_txt_file(input_filename, output_filename):
    # Shuffle the lines (sentences) of a text file and write them back out
    with open(input_filename, 'r') as ifile:
        text = shuffle(ifile.read().split('\n'))
    with open(output_filename, 'w') as ofile:
        ofile.write('\n'.join(text))

def rebuild_iterators():
    # First call: create the shuffled file from the original training data.
    # Later calls: reshuffle the existing shuffled file in place.
    if 'shuffled_train.txt' not in os.listdir():
        shuffle_train_txt_file('train.txt', 'shuffled_train.txt')
    else:
        shuffle_train_txt_file('shuffled_train.txt', 'shuffled_train.txt')
    train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
        path='../HW2/',
        train="shuffled_train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)
    train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
        (train, val, test), batch_size=10, device=-1, bptt_len=32, repeat=False)
    return train_iter, val_iter, test_iter

The data format for language modeling is strange. We pretend the entire corpus is one long sentence.

In [4]:
print('len(train)', len(train))

len(train) 1


Here's the vocab itself. (This dataset has unk symbols already, but torchtext adds its own.)

In [27]:
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

len(TEXT.vocab) 10001


When debugging you may want to use a smaller vocab size. This will run much faster.

In [28]:
if False:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)

The batching is done in a strange way for language modeling. Each element of the batch consists of `bptt_len` words in order. This makes it easy to run recurrent models like RNNs. 
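
As a rough illustration of this layout, the sketch below splits a corpus into `batch_size` contiguous streams and cuts them into chunks of `bptt_len` time steps. It is a plain-Python approximation of the idea, not torchtext's actual implementation (which, among other things, pads the corpus rather than dropping the remainder); `bptt_batches` is a hypothetical helper name.

```python
def bptt_batches(token_ids, batch_size, bptt_len):
    """Yield batches as lists of rows; each row holds one time step for all streams."""
    stream_len = len(token_ids) // batch_size  # drop the remainder for simplicity
    # Stream b is the b-th contiguous slice of the corpus
    streams = [token_ids[b * stream_len:(b + 1) * stream_len] for b in range(batch_size)]
    for start in range(0, stream_len, bptt_len):
        end = min(start + bptt_len, stream_len)
        # batch[t][b] is the token at time step start + t in stream b
        yield [[streams[b][t] for b in range(batch_size)] for t in range(start, end)]

# A toy "corpus" of 20 token ids, split into 2 streams of 10 tokens each
batches = list(bptt_batches(list(range(20)), batch_size=2, bptt_len=4))
```

Here column 0 of the first batch holds tokens 0 to 3 and column 0 of the second batch holds tokens 4 to 7, so each column of one batch picks up where the same column of the previous batch stopped.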

In [40]:
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=10, device=-1, bptt_len=32, repeat=False)

In [41]:
for j, batch in enumerate(train_iter):
    print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:, 0].data]))
    if j > 4:
        break

companies listed below reported quarterly profit substantially different from the average of analysts ' estimates <eos> sterling was quoted at $ N up from $ N late tuesday <eos> mr. seidman said
yesterday for example that sen. dennis <unk> d. ariz. who received $ N in contributions from mr. keating <unk> mr. seidman to request that he push for a sale of lincoln before
it would be seized <eos> the canadian government previously said merieux 's bid did n't offer enough net benefit to canada to be approved and gave merieux an until <unk> to submit
additional information <eos> belgium was closed for two days france closed for a couple of hours germany was stuck <eos> international copyright secured <eos> a small yield premium over comparable treasurys and
a lack of liquidity is <unk> dealers ' efforts to drum up interest in the so-called bailout bonds <eos> changing legislation has opened the field to thousands of <unk> soviet players many
who promise more than they can deliver <eos> a ups

Here's what these batches look like. Each column of the batch is a sequence of 32 token indices. Sentences end with a special `<eos>` token.

In [31]:
it = iter(train_iter)
batch = next(it) 
print("Size of text batch [max bptt length, batch size]", batch.text.size())
print("Second in batch", batch.text[:, 2])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Size of text batch [max bptt length, batch size] torch.Size([32, 10])
Second in batch Variable containing:
  146
    3
  486
   55
    5
    2
   91
 5652
    0
   17
  440
  136
  268
  184
    8
    2
 2091
 7884
  120
    6
    7
   36
  242
    5
   13
    4
   49
   70
   41
    3
 2553
   16
[torch.LongTensor of size 32]

Converted back to string:  investment <eos> despite one of the most devastating <unk> on record net cash income in the farm belt rose to a new high of $ N billion last year <eos> northeast said


The next batch will be the continuation of the previous. This is helpful for running recurrent neural networks where you remember the current state when transitioning.

In [32]:
batch = next(it)
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Converted back to string:  it would <unk> its request and still hopes for an <unk> review by the ferc so that it could complete the purchase by next summer if its bid is the one approved


There are no separate labels. But you can just use an offset `batch.text[1:]` to get the next word.
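
The offset trick can be sketched on a single column of tokens as follows. This uses plain Python lists rather than tensors, and `inputs_and_targets` is a hypothetical helper introduced only for illustration.

```python
def inputs_and_targets(column):
    """Pair each token with the token that follows it in the same column."""
    inputs = column[:-1]   # every token except the last
    targets = column[1:]   # the same tokens shifted left by one position
    return inputs, targets

column = ["the", "cat", "sat", "<eos>"]
inputs, targets = inputs_and_targets(column)
pairs = list(zip(inputs, targets))
```

Note that the last token of a column has no target inside the same batch; its target is the first token of that column in the next batch.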

In [33]:
# Load pretrained GloVe word embeddings for the vocabulary
TEXT.vocab.load_vectors(vectors=GloVe())

print("Word embeddings size ", TEXT.vocab.vectors.size())
print("Word embedding of 'follows', first 10 dim ", TEXT.vocab.vectors[TEXT.vocab.stoi['follows']][:10])


100%|██████████████████████████████████████████████████████████████████████| 2196017/2196017 [11:18<00:00, 3235.92it/s]


Word embeddings size  torch.Size([10001, 300])
Word embedding of 'follows', first 10 dim  
 0.2057
 0.1047
-0.3900
-0.1086
-0.0722
-0.1184
-0.1109
 0.1917
 0.4781
 2.0576
[torch.FloatTensor of size 10]



## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 3 different torch models that take in batch.text and produce a distribution over the next word. 

When a model is trained, use the following test function to produce predictions, and then upload them to the Kaggle competition: https://www.kaggle.com/c/cs287-hw2-s18

For the final Kaggle test, we will have you do a next-word prediction task. We will provide a 10-word prefix of each sentence, and it is your job to predict 20 candidate next words.

In [None]:
!head input.txt

As a sample Kaggle submission, let us build a simple unigram model.  

In [None]:
from collections import Counter
count = Counter()
for b in iter(train_iter):
    count.update(b.text.view(-1).data.tolist())
count[TEXT.vocab.stoi["<eos>"]] = 0
predictions = [TEXT.vocab.itos[i] for i, c in count.most_common(20)]
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print("%d,%s"%(i, " ".join(predictions)), file=fout)


In [None]:
!head sample.txt

The metric we are using is mean average precision of your 20-best list. 

$$MAP@20 = \frac{1}{|D|} \sum_{u=1}^{|D|} \sum_{k=1}^{20} Precision(u, 1:k)$$

Ideally we would use log-likelihood or perplexity as discussed in class, but this is the best Kaggle gives us. This takes into account whether you got the right answer and how highly you ranked it.
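
To get a feel for the metric offline, here is a small sketch that assumes each example has exactly one correct next word. Under that assumption, the average precision of a ranked list reduces to 1 / rank of the correct word, or 0 if it does not appear in the list. Kaggle's exact scoring may differ, and the function names below are hypothetical.

```python
def average_precision_at_k(predictions, truth, k=20):
    """Return 1 / rank of `truth` within the first k predictions, or 0.0 if absent."""
    for rank, word in enumerate(predictions[:k], start=1):
        if word == truth:
            return 1.0 / rank
    return 0.0

def map_at_k(all_predictions, truths, k=20):
    """Mean of the per-example scores over the whole evaluation set."""
    scores = [average_precision_at_k(p, t, k) for p, t in zip(all_predictions, truths)]
    return sum(scores) / len(scores)

# First example: correct word ranked second (score 1/2); second example: missed (score 0)
score = map_at_k([["the", "of", "a"], ["in", "to", "and"]], ["of", "is"])
```

Under this reading, moving the correct word from rank 2 to rank 1 doubles the score for that example, which is why the ordering of the 20 predictions matters.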

In particular, we ask that you do not game this metric. Please submit *exactly 20* unique predictions for each example.


As always, you should submit a 5-6 page write-up following the template provided in the repository: https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

In [12]:
import torch as t
import torch.nn.functional as F

In [13]:
class CNN(t.nn.Module):
    def __init__(self, context_size, embeddings):
        super(CNN, self).__init__()
        self.context_size = context_size
        self.vocab_size = embeddings.size(0)
        self.embed_dim = embeddings.size(1)
        
        self.w = t.nn.Embedding(self.vocab_size, self.embed_dim)
        self.w.weight = t.nn.Parameter(embeddings, requires_grad=False)
        
        self.conv = t.nn.Conv1d(self.embed_dim, self.vocab_size, context_size)
        
    def forward(self, x):
        xx = self.w(x).transpose(2,1)
        xx = self.conv(xx)
        return xx

In [14]:
context_size = 5
cnn = CNN(context_size, TEXT.vocab.vectors)

```
loss(x, class) = -log(exp(x[class]) / (\sum_j exp(x[j])))
               = -x[class] + log(\sum_j exp(x[j]))
```
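
The two forms of the loss above can be checked numerically with a few lines of plain Python. This is only a sanity check of the identity on one hand-picked score vector, not how `F.cross_entropy` is implemented; the max-subtraction step is a standard trick for numerical stability and is an addition to the formula as written.

```python
import math

def cross_entropy(scores, target):
    """Compute -scores[target] + log(sum_j exp(scores[j])) with a stable log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -scores[target] + log_sum_exp

scores = [2.0, 0.5, -1.0]  # unnormalized scores for a 3-word vocabulary
loss = cross_entropy(scores, 0)

# First form of the formula: negative log of the softmax probability
direct = -math.log(math.exp(scores[0]) / sum(math.exp(s) for s in scores))
```

Both `loss` and `direct` should agree up to floating-point error.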


In [15]:
def criterion(pred, true):
    l = 0
    for k in range(pred.size(2)):
        pred_ = pred[:,:,k:k+1].squeeze()
        true_ = true[:, k:k+1].squeeze()
        l += F.cross_entropy(pred_, true_)
    return l / pred.size(2)

def accuracy(pred, true):
    pred_ = t.max(pred, 1)[1]
    return t.mean((pred_==true).float()).data.numpy()[0]

In [16]:
N_EPOCHS = 10
lr = 1e-3
optimizer = t.optim.Adam(filter(lambda p: p.requires_grad, cnn.parameters()), lr=lr)

train_accuracies = []
train_losses = []

In [17]:
print(len(train_iter), train_iter.batch_size)

2905 10


In [None]:
t0 = time.time()
for _ in range(N_EPOCHS):
    train_iter.shuffle = True
    train_iter.init_epoch()
    for i, next_batch in tqdm(enumerate(train_iter)):
        if i == 0:
            current_batch = next_batch
        else:
            optimizer.zero_grad()
            
            # context for starting words
            if i > 1:
                starting_words = last_batch.text.transpose(0,1)[:, -context_size:]
            else:
                starting_words = t.zeros(current_batch.text.size(1), context_size).float()
            x = t.cat([variable(starting_words, to_float=False).long(), current_batch.text.transpose(0,1).long()], 1)
            
            # you need the next batch first word to know what the target of the last word of the current batch is
            ending_word = next_batch.text.transpose(0,1)[:, :1]
            target = t.cat([current_batch.text.transpose(0,1)[:, 1:], ending_word], 1)
            
            # backprop
            # conv output k scores the word that follows x[:, k:k+context_size]; output 0 scores the
            # first word of the current batch, which is not in `target`, so drop it to align with target
            pred = cnn(x)[:, :, 1:]
            
            loss = criterion(pred, target)
            loss.backward()
            optimizer.step()
            
            train_accuracies.append(accuracy(pred, target))
            train_losses.append(loss.data.numpy()[0])
            
            # update batches
            last_batch = current_batch
            current_batch = next_batch
    print('--------------------------\nEpoch %d took %.3fs' % (_, time.time() - t0))
    print("For epoch %d, train accuracy is : %.3f" % (_, np.mean(train_accuracies[-len(train_iter):])))
    print("For epoch %d, train loss is : %.3f" % (_, np.mean(train_losses[-len(train_iter):])))
    t0 = time.time()

In [24]:
def data_generator(iterator, model_str, context_size, cuda=True):
    for i, next_batch in enumerate(iterator):
        if i == 0:
            current_batch = next_batch
        else:
            if model_str == 'NNLM':
                if context_size is not None:
                    if i > 1:
                        starting_words = last_batch.text.transpose(0, 1)[:, -context_size:]
                    else:
                        starting_words = t.zeros(current_batch.text.size(1), context_size).float()
                    x = t.cat([variable(starting_words, to_float=False, cuda=cuda).long(), current_batch.text.transpose(0, 1).long()], 1)
                else:
                    raise ValueError('`context_size` should not be None')
            else:
                x = current_batch.text.transpose(0, 1).long()

            target = current_batch.text.transpose(0, 1)

            last_batch = current_batch
            current_batch = next_batch

            yield x, target


In [27]:
for i,(x,y) in enumerate(data_generator(train_iter, 'NNLM', 5, False)):
    print('i',i)
    print('x')
    print(x[:,-11:])
    print('y')
    print(y[:,-11:])
    if i > 3:
        break

i 0
x
Variable containing:
  9998   9999  10000      3   9257      0      4     73    394     34   2134
    31    295   4901     13      4     49      3      0      0     25   2471
     0     20      2    273   7821     17      9    117   2815    969      6
    40     14   3582      9      0   7238     10    391     45    487      0
    11    271     44     13      4      8      2    380     14   4073     21
    13      4     49    156      4    121   1188    363    547     35   2130
  2258     24   1891      0      6      7      0     12    212     62    608
    38     23     74     11    864     12   3959      9   1197    351      3
   155      3      7     36     93     61    112   7859   1555   1827   2786
  3464     51     44      4      4      5      2     48     63      3   4332
[torch.LongTensor of size 10x11]

y
Variable containing:
  9998   9999  10000      3   9257      0      4     73    394     34   2134
    31    295   4901     13      4     49      3      0      0     25

In [None]:
pred.size()

In [None]:
# t.max(pred, 1)[1]
# # t.mean((pred==true).float()).data.numpy()[0]
accuracy(pred,target)

In [None]:
target.size()

In [None]:
train_iter.init_epoch()
for i, next_batch in enumerate(train_iter):
    if i == 0:
        current_batch = next_batch
    else:
        # context for starting words. NO NEED FOR LSTM/RNN.
        if i > 1:
            starting_words = last_batch.text.transpose(0,1)[:, -context_size:]
        else:
            starting_words = t.zeros(current_batch.text.size(1), context_size).float()
        x = t.cat([variable(starting_words, to_float=False).long(), current_batch.text.transpose(0,1).long()], 1)

        # you need the next batch first word to know what the target of the last word of the current batch is
        ending_word = next_batch.text.transpose(0,1)[:, :1]
        target = t.cat([current_batch.text.transpose(0,1)[:, 1:], ending_word], 1)
        
        last_batch = current_batch
        current_batch = next_batch