# HW 2: Language Modeling

In this homework you will build several varieties of language models.

## Goal

We ask that you construct the following models in Torch / NamedTensor:

1. A count-based trigram model with linear interpolation: $$p(y_t | y_{1:t-1}) = \alpha_1 p(y_t | y_{t-2}, y_{t-1}) + \alpha_2 p(y_t | y_{t-1}) + (1 - \alpha_1 - \alpha_2) p(y_t)$$
2. A neural network language model (consult *A Neural Probabilistic Language Model*, http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
3. An LSTM language model (consult *Recurrent Neural Network Regularization*, https://arxiv.org/pdf/1409.2329.pdf)
4. Your own extensions to these models.


Consult the papers provided for hyperparameters.
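
The interpolation formula in item 1 can be illustrated with a small, self-contained sketch in plain Python. This is not the `TrigramModel` used in this notebook; the helper names `fit_counts` and `interpolated_prob`, the toy corpus, and the weights are assumptions made only for illustration. Unseen contexts simply contribute zero here, so a fuller model would need some form of backoff or smoothing.

```python
from collections import Counter

def fit_counts(tokens):
    """Collect the n-gram and context counts needed for maximum-likelihood estimates."""
    return {
        "uni": Counter(tokens),
        "bi": Counter(zip(tokens, tokens[1:])),
        "tri": Counter(zip(tokens, tokens[1:], tokens[2:])),
        # Context counts only include positions that actually have a successor.
        "bi_ctx": Counter(tokens[:-1]),
        "tri_ctx": Counter(zip(tokens[:-2], tokens[1:-1])),
        "total": len(tokens),
    }

def interpolated_prob(w, prev2, prev1, c, alpha1, alpha2):
    """alpha1 * p(w | prev2, prev1) + alpha2 * p(w | prev1) + (1 - alpha1 - alpha2) * p(w)."""
    p_uni = c["uni"][w] / c["total"]
    bi_ctx = c["bi_ctx"][prev1]
    p_bi = c["bi"][(prev1, w)] / bi_ctx if bi_ctx else 0.0
    tri_ctx = c["tri_ctx"][(prev2, prev1)]
    p_tri = c["tri"][(prev2, prev1, w)] / tri_ctx if tri_ctx else 0.0
    return alpha1 * p_tri + alpha2 * p_bi + (1 - alpha1 - alpha2) * p_uni

# Toy corpus and illustrative weights (hypothetical values, not tuned).
tokens = "the cat sat on the mat the cat ate".split()
counts = fit_counts(tokens)
p_sat = interpolated_prob("sat", "the", "cat", counts, alpha1=0.8, alpha2=0.16)

# For a context seen in training, the interpolated distribution sums to one.
total_mass = sum(
    interpolated_prob(w, "the", "cat", counts, 0.8, 0.16) for w in counts["uni"]
)
```

Because each component is a proper distribution for a seen context and the three weights sum to one, the mixture is also a proper distribution, which is what `total_mass` checks.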

 


## Setup

This notebook provides a working setup for the problem. You may build your models inline or, preferably, in external modules that the notebook imports.

In [1]:
import torch
import torchtext
from torchtext.vocab import Vectors

from namedtensor import ntorch, NamedTensor
from namedtensor.text import NamedField

from load_data import load_text
from models import TrigramModel, LSTM
from train_models import make_kaggle_submission

In [3]:
train_iter, val_iter, test_iter, TEXT = load_text("./data")

## Trigram Model

In [3]:
model = TrigramModel(.8, .16, len(TEXT.vocab))
model.fit(train_iter)

In [4]:
make_kaggle_submission(model, TEXT)

## LSTM

In [10]:
# Build the vocabulary with word embeddings
TEXT.vocab.load_vectors('fasttext.simple.300d')

.vector_cache/wiki.simple.vec: 293MB [03:07, 1.56MB/s]                              
  0%|          | 0/111051 [00:00<?, ?it/s]Skipping token b'111051' with 1-dimensional vector [b'300']; likely a header
 99%|█████████▉| 109946/111051 [00:09<00:00, 10917.56it/s]


In [163]:
class LSTM(ntorch.nn.Module):
    def __init__(self, TEXT, hidden_size=50, layers=1, dropout = 0.2, device = 'cpu'):
        super(LSTM, self).__init__()
        self.TEXT = TEXT
        self.pretrained_emb = TEXT.vocab.vectors.to(device)
        self.embedding = ntorch.nn.Embedding.from_pretrained(self.pretrained_emb, freeze=True)
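        # NOTE: assuming data.target is data.text shifted by one position, bidirectional=True
        # lets the backward direction read the very token being predicted. For a valid
        # next-word model, use bidirectional=False and hidden_size (not 2*hidden_size) below.
        self.lstm = ntorch.nn.LSTM(self.pretrained_emb.shape[1], hidden_size, bidirectional=True).spec("embedding", "seqlen", "lstm")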
        self.lstm_dropout = ntorch.nn.Dropout(dropout)
        self.linear = ntorch.nn.Linear(2*hidden_size, len(TEXT.vocab.itos)).spec('lstm', 'out')

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        x = self.lstm_dropout(x)
        x = self.linear(x)
        return x  

    def fit(self, train_iter, lr = 1e-2, momentum = 0.9, batch_size = 128, epochs = 10, interval = 1):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(self.parameters(), lr=lr, momentum=momentum)
        train_iter.batch_size = batch_size

        for epoch in range(epochs):  # loop over the dataset multiple times

            running_loss = 0.0
            for i, data in enumerate(train_iter, 0):
                # get the inputs
                inputs, labels = data.text, data.target

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward + backward + optimize
                outputs = self(inputs)
                loss = criterion(
                    outputs.transpose('batch', 'out', 'seqlen').values, 
                    labels.transpose('batch','seqlen').values
                )
                loss.backward()
                optimizer.step()

                # print statistics
                running_loss += loss.item()
                if i % interval == interval-1:    # print every `interval` mini-batches
                    print('[epoch: {}, batch: {}] loss: {}'.format(epoch + 1, i + 1, running_loss / interval))
                    running_loss = 0.0  # reset only after reporting, so the average spans `interval` batches

        print('Finished Training')

    def predict(self, text, predict_last = False):
        pass
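
The `fit` method above transposes the outputs to `('batch', 'out', 'seqlen')` before calling the loss because `torch.nn.CrossEntropyLoss` expects the class dimension second for sequence inputs. The following sketch, using plain `torch` tensors with made-up toy sizes rather than NamedTensor, suggests that this layout gives the same loss as flattening to one row per token.

```python
import torch

torch.manual_seed(0)
batch, vocab, seqlen = 4, 7, 5  # toy sizes, chosen only for illustration

# Sequence layout: logits are (batch, classes, seqlen), targets are (batch, seqlen).
logits = torch.randn(batch, vocab, seqlen)
targets = torch.randint(0, vocab, (batch, seqlen))

criterion = torch.nn.CrossEntropyLoss()
loss_seq = criterion(logits, targets)

# Equivalent flat layout: one row of logits per token, shape (batch * seqlen, classes).
flat_logits = logits.permute(0, 2, 1).reshape(-1, vocab)
flat_targets = targets.reshape(-1)
loss_flat = criterion(flat_logits, flat_targets)
```

With the default mean reduction, both calls average the per-token losses over all `batch * seqlen` positions.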

In [165]:
net = LSTM(TEXT)
net.fit(train_iter, lr=1)

[epoch: 1, batch: 1] loss: 9.213565826416016
[epoch: 1, batch: 2] loss: 9.193037033081055
[epoch: 1, batch: 3] loss: 9.151989936828613
[epoch: 1, batch: 4] loss: 9.087571144104004
[epoch: 1, batch: 5] loss: 9.003547668457031
[epoch: 1, batch: 6] loss: 8.880457878112793
[epoch: 1, batch: 7] loss: 8.683112144470215
[epoch: 1, batch: 8] loss: 8.439841270446777
[epoch: 1, batch: 9] loss: 8.09653091430664
[epoch: 1, batch: 10] loss: 7.86899471282959
[epoch: 1, batch: 11] loss: 7.858415603637695
[epoch: 1, batch: 12] loss: 7.57770299911499
[epoch: 1, batch: 13] loss: 7.568233013153076
[epoch: 1, batch: 14] loss: 7.561644077301025
[epoch: 1, batch: 15] loss: 7.583150863647461
[epoch: 1, batch: 16] loss: 7.524545192718506
[epoch: 1, batch: 17] loss: 7.360491752624512
[epoch: 1, batch: 18] loss: 7.307711601257324
[epoch: 1, batch: 19] loss: 7.2943243980407715
[epoch: 1, batch: 20] loss: 7.205414295196533
[epoch: 1, batch: 21] loss: 7.203909397125244
[epoch: 1, batch: 22] loss: 7.111798286437988

[epoch: 1, batch: 178] loss: 4.297300815582275
[epoch: 1, batch: 179] loss: 4.225895881652832
[epoch: 1, batch: 180] loss: 4.177648067474365
[epoch: 1, batch: 181] loss: 4.1401448249816895
[epoch: 1, batch: 182] loss: 4.17262077331543
[epoch: 1, batch: 183] loss: 4.141529560089111
[epoch: 1, batch: 184] loss: 4.212691307067871
[epoch: 1, batch: 185] loss: 4.048442840576172
[epoch: 1, batch: 186] loss: 4.012235164642334
[epoch: 1, batch: 187] loss: 4.0375590324401855
[epoch: 1, batch: 188] loss: 4.246653079986572
[epoch: 1, batch: 189] loss: 4.302437782287598
[epoch: 1, batch: 190] loss: 4.24398946762085
[epoch: 1, batch: 191] loss: 4.106969833374023
[epoch: 1, batch: 192] loss: 4.153709888458252
[epoch: 1, batch: 193] loss: 4.127676486968994
[epoch: 1, batch: 194] loss: 4.157070636749268
[epoch: 1, batch: 195] loss: 4.242199420928955
[epoch: 1, batch: 196] loss: 4.081156253814697
[epoch: 1, batch: 197] loss: 4.167463779449463
[epoch: 1, batch: 198] loss: 4.115771770477295
[epoch: 1, ba

KeyboardInterrupt: 
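
As a rough sanity check on the log above: a model that spreads probability uniformly over a vocabulary of size V has cross-entropy ln(V), so the first logged loss hints at the effective vocabulary size. The vocabulary size is not stated in this notebook, so the sketch below only inverts the logged value and makes no claim about the actual `len(TEXT.vocab)`.

```python
import math

# First loss value copied from the training log above.
first_loss = 9.213565826416016

# If the untrained model were exactly uniform, loss = ln(V), so V = exp(loss).
implied_vocab = math.exp(first_loss)

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform distribution over vocab_size classes."""
    return math.log(vocab_size)

print("implied vocabulary size: {:.0f}".format(implied_vocab))
```

An initial loss far from `uniform_loss(len(TEXT.vocab))` would usually suggest a problem with initialization or with how the loss is wired up.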