# Student: Dorin Doncenco

Parts of the code have been written with the help of Copilot.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import random

# Lab exercise: neural language modeling

The goal of this lab exercise is build two neural language models:
- a neural n-gram model based on a simple MLP
- an autoregressive model based on a LSTM

Although the n-gram model is straighforward to code, there are a few "tricks" that you need to implement for the autoregressive model:
- word dropout
- variational dropout
- loss function masking

## Variational dropout

The idea of variational dropout is to apply the same mask at each position for a given sentence (if there are several sentences in a minibatch, use different masks for each input).
The idea is as follows:
- assume a sentence of n words whose embeddings are e_1, e_2, ... e_n
- at the input of the LSTM, instead of apply dropout independently to each embedding, sample a single mask that will be applied similarly at each position
- same at the output of the LSTM

See Figure 1 of this paper: https://proceedings.neurips.cc/paper/2016/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf

To implement this, you need to build a custom module that applies the dropout only if the network is in training mode.

## Data preprocessing

You first need to download the Penn Treebank as pre-processed by Tomas Mikolov. It is available here: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
We will use the following files:
- ptb.train.txt
- ptb.valid.txt
- ptb.test.txt

Check manually the data.

Todo:
- build a word dictionnary, i.e. a bijective mapping between words and integers. You will need to add a special token "\<BOS\>" to the dictionnary even if it doesn't appear in sentences. (if you want to generate data, you will also need a "\<EOS\>" token, but this is not a requirement for this lab exercise --- you can do this at the end if you want)
- build python list of integers representing each input. For example, for the sentence "I sleep", the tensor could look like [10, 5] if 10 is the integer associated with "I" and 5 the integer associated with "sleep". You can add this directly to the dictionnaries in \*\_data

In [3]:
def read_file(path):
    data = list()
    with open(path) as inf:
        for line in inf:
            line = line.strip()
            if len(line) == 0:
                continue
            data.append({"text": line.split()})
    return data

In [4]:
train_data = read_file("./dataset/ptb.train.txt")
dev_data = read_file("./dataset/ptb.valid.txt")
test_data = read_file("./dataset/ptb.test.txt")

In [5]:
# count the amount of each words in the train data
word_count = dict()
for data in train_data:
    for word in data["text"]:
        if word not in word_count:
            word_count[word] = 0
        word_count[word] += 1

# print the most commmon 10 words
word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
print(word_count[:10])

[('the', 50770), ('<unk>', 45020), ('N', 32481), ('of', 24400), ('to', 23638), ('a', 21196), ('in', 18000), ('and', 17474), ("'s", 9784), ('that', 8931)]


In [6]:
len(dev_data)

3370

In [7]:
print(len(train_data), len(dev_data), len(test_data))
print("\n\n".join(" ".join(s["text"]) for s in train_data[:5]))

42068 3370 3761
aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter

pierre <unk> N years old will join the board as a nonexecutive director nov. N

mr. <unk> is chairman of <unk> n.v. the dutch publishing group

rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate

a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported


In [8]:
class WordDict:
    # constructor, words must be a set containing all words
    def __init__(self, words):
        assert type(words) == set
        self.words = words
        self.word_to_id = dict()
        self.id_to_word = dict()
        for idx, word in enumerate(words):
            self.word_to_id[word] = idx
            self.id_to_word[idx] = word

    # return the integer associated with a word
    def word_to_id(self, word):
        assert type(word) == str
        return self.word_to_id[word]
    
    # return the word associated with an integer
    def id_to_word(self, idx):
        assert type(idx) == int
        return self.id_to_word[idx]
    
    # number of word in the dictionnary
    def __len__(self):
        return len(self.words)

In [9]:
train_words = set()
for sentence in train_data:
    train_words.update(sentence["text"])
train_words.update(["<bos>", "<eos>"])
word_dict = WordDict(train_words)
len(word_dict)  # should be 10001

10001

## Evaluation

For evaluation, you must compute the perplexity of the test dataset (i.e. assume the dataset is one very long sentence), see:
https://lena-voita.github.io/nlp_course/language_modeling.html#evaluation

Note that you don't need to explicitly compute the root, you can use log probabilities and properties of log functions for this.
As during evaluation, you will see sentences one after the other, you can code a small class to keep track of log probabilities of words and compute the global perplexity at the end.

In [10]:
class Perplexity:
    def __init__(self):
        self.reset()
               
    def reset(self):
        self.log_probs = list()
        self.num_words = 0
        
    def add_sentence(self, log_probs):
        # log_probs: vector of log probabilities of words in a sentence
        for log_prob in log_probs:
            self.log_probs.append(log_prob)
            self.num_words += 1
        
    def compute_perplexity(self):
        # compute perplexity from the stored log probabilities
        log_probs_sum = sum(self.log_probs) / self.num_words
        perplexity = 2 ** (-log_probs_sum)
        return perplexity

## LSTM model

This model should rely on a LSTM.

1. transform the data into tensors => you can't use the same trick as for the n-gram model
2. train the network by batching the input --- be very careful when computing the loss function! And explain how to batch data, compute the loss with batch data, etc, in the report!
3. compute the perplexity on the test data
4. implement variational dropout at input and output of the LSTM

Warning: you need to use the option batch_first=True for the LSTM.

Here we will convert the data to a tensor format which contain token indices; to make them all of the same length, we pad the shorter sentences with $<$ eos $>$ tokens to match the length of the longest sentence. In order to be able to recover the length of the original sentence, we will keep tensors sentence lengths, which will be used during training to create a mask and remove the padding tokens.

In [11]:
def convert_data(data, word_dict):
    converted_data = list()
    for sentence in data:
        converted_data.append([word_dict.word_to_id["<bos>"]] + [word_dict.word_to_id[word] for word in sentence["text"]] + [word_dict.word_to_id["<eos>"]])
    return converted_data

train_data_tensor = convert_data(train_data, word_dict)
dev_data_tensor = convert_data(dev_data, word_dict)
test_data_tensor = convert_data(test_data, word_dict)

# max length of a train sentence
max_len = max([len(s) for s in train_data_tensor])

# pad sentences
def pad_sentences(data, max_len):
    sentence_lengths = [len(sentence) for sentence in data]

    for i, sentence in enumerate(data):
        if len(sentence) < max_len:
            sentence += [word_dict.word_to_id["<eos>"]] * (max_len - len(sentence))
        else:
            #if the sentence is longer than the longest in train, we truncate it and add <eos>
            sentence = sentence[:max_len-1] + [word_dict.word_to_id["<eos>"]]
            sentence_lengths[i] = max_len

    sentence_lengths = torch.tensor(sentence_lengths)
    return data, sentence_lengths

train_data_tensor, train_sentence_lenghts = pad_sentences(train_data_tensor, max_len)
dev_data_tensor, dev_sentence_lengths = pad_sentences(dev_data_tensor, max_len)
test_data_tensor, test_sentence_lengths = pad_sentences(test_data_tensor, max_len)

# convert data into tensors
def convert_to_tensors(data):
    converted_data = list()
    for sentence in data:
        converted_data.append(torch.tensor(sentence))
    converted_data = torch.stack(converted_data)
    return converted_data

train_data_tensor = convert_to_tensors(train_data_tensor)
dev_data_tensor = convert_to_tensors(dev_data_tensor)
test_data_tensor = convert_to_tensors(test_data_tensor)

The batch generator will create training batches from the data and sentence lengths, ensuring the respective order between each other is maintained, and it randomises the order of the data to add stochasticity to the training of the model. In order to save computation time, it also returns the maximum length of each batch, which can be used to reduce the size of the input and to save computation time during training.

In [12]:
def batch_generator(data, sentence_lengths, batch_size):
    # randomize sentence order
    # data: (num_sentences, max_len)
    # sentence_lengths: (num_sentences)
    # batch_size: int

    # get shuffled indices
    indices = torch.randperm(data.shape[0])
    
    # create batches
    num_batches = data.shape[0] // batch_size
    batches = list()
    batches_lengths = list()
    for i in range(num_batches):
        batch = data[indices[i*batch_size:(i+1)*batch_size], :]
        batch_lengths = sentence_lengths[indices[i*batch_size:(i+1)*batch_size]]
        batches.append(batch)
        batches_lengths.append(batch_lengths)
    if data.shape[0] % batch_size != 0:
        batch = data[indices[num_batches*batch_size:], :]
        batches.append(batch)
        batch_lengths = sentence_lengths[indices[num_batches*batch_size:]]
        batches_lengths.append(batch_lengths)

    # batches length will find the maximum sentence length in each batch
    batches_max_len = torch.tensor([torch.max(batch_lengths) for batch_lengths in batches_lengths])
    return batches, batches_lengths, batches_max_len


The language model architecture consists of an embedding layer, an LSTM layer, and a linear layer; the linear layer will output logits corresponding to each word in the dictionary, which can be converted into a probability using the softmax function. A word dropout function is available in the architecture, which will randomly replace tokens in the input with the $<$ unk $>$ token, to prevent overfitting to the training data.

The generate function is able to generate sentences given an input sequence, up until a certain limit or until the end of sentence token is generated.

In [13]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout, word_dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.linear = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.word_dropout = word_dropout
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.init_weights()
    
    def init_weights(self):
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden, sentence_lengths):
        # input: (batch_size, seq_len)
        # hidden: (num_layers, batch_size, hidden_dim)
        # sentence_lengths: (batch_size)
        batch_size = input.shape[0]
        seq_len = input.shape[1]
        if self.word_dropout > 0.0 and self.training:
            # randomly replace some input words with <unk>
            # word_dropout: probability of replacing a word with <unk>
            # input: (batch_size, seq_len)
            mask = torch.rand(input.shape) < self.word_dropout
            # clone the input tensor to avoid modifying it
            input = input.clone()
            input[mask] = word_dict.word_to_id["<unk>"]
            
        embedded = self.dropout(self.embedding(input))
        # embedded: (batch_size, seq_len, embedding_dim)
        output, hidden = self.lstm(embedded, hidden)
        # output: (batch_size, seq_len, hidden_dim)
        output = self.dropout(output)
        output = output.reshape(batch_size * seq_len, self.hidden_dim)
        # output: (batch_size * seq_len, hidden_dim)
        output = self.linear(output)
        return output
    
    def init_hidden(self, batch_size, device):
        # return a tensor of zeros of shape (num_layers, batch_size, hidden_dim)
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device))  
    
    def generate(self, sentence, hidden, sentence_length, max_len, device):
        # given a sentence, generate the next words until :
        #        <eos> is generated or max_len is reached
        # sentence: (seq_len) of word ids
        # hidden: (num_layers, 1, hidden_dim)
        # sentence_length: int
        # max_len: int
       
        sentence_generated = sentence.unsqueeze(0)
        for i in range(max_len):
            # we generate the next word
            output = self.forward(sentence_generated, hidden, torch.tensor([1]).to(device))
            # output: (1, vocab_size)
            output = F.softmax(output, dim=1)
            # output: (1, vocab_size)
            word_id = torch.multinomial(output[-1:], 1)
            # word_id: (1)
            sentence_generated = torch.cat((sentence_generated, word_id), dim=1)
            # sentence_generated: (seq_len)
            if word_id == word_dict.word_to_id["<eos>"]:
                break
        sentence_generated = sentence_generated.squeeze(0).cpu().numpy()
        return sentence_generated


In [14]:
# hyperparameters
vocab_size = len(word_dict)
embedding_dim = 128
hidden_dim = 128
num_layers = 1
dropout = 0.0
batch_size = 32
learning_rate = 0.005
num_epochs = 50
word_dropout = 0.25
#log_interval = len(train_data_tensor) // batch_size // 3
log_interval = len(train_data_tensor) + 1 # to never print loss during the epoch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# create model
model = LSTMLanguageModel(vocab_size, embedding_dim, hidden_dim, num_layers, dropout, word_dropout=word_dropout).to(device)
# create optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# create loss function
criterion = nn.CrossEntropyLoss(reduction="none")

In [15]:
# train
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    epoch_perplexity = Perplexity()
    num_batches = 0
    batches, batches_lengths, batches_max_len = batch_generator(train_data_tensor, train_sentence_lenghts, batch_size)
    for batch, batch_lengths, batches_max_len in zip(batches, batches_lengths, batches_max_len):
        if num_batches > 20:
            break
        # batch: (batch_size, seq_len)
        # batch_lengths: (batch_size)
        optimizer.zero_grad()
        hidden = model.init_hidden(len(batch), device)
        # target is all words except first one
        target = batch[:, 1:batches_max_len].to(device)
        # send to device
        batch = batch.to(device)
        batch_lengths = batch_lengths.to(device)
        # pass all words except last one
        output = model(batch[:, :batches_max_len-1], hidden, batch_lengths)
        output = output.reshape(len(batch), -1, vocab_size)
        #log_output = F.log_softmax(output, dim=2)
        # output: (batch_size, vocab_size)
        # create mask to ignore padding in the loss
        mask = torch.zeros_like(target, dtype=torch.float)
        for i, length in enumerate(batch_lengths):
            mask[i, :length-1] = 1

        output = output.reshape(len(batch) * (batches_max_len-1), vocab_size)
        target = target.reshape(len(batch) * (batches_max_len-1))
        # compute loss
        loss = criterion(output, target)
        loss = torch.sum(loss * mask.reshape(-1)) / torch.sum(mask)
    
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        num_batches += 1
        if num_batches % log_interval == 0:
            print("Epoch: {}, Batch: {}/{}, Loss: {:.4f}".format(epoch+1, num_batches, len(batches), loss.item()))
        
        # add prediction to perplexity
        #output_probs = F.log_softmax(output, dim=-1)
        output_probs = F.softmax(output, dim=-1)
        output_probs = torch.log2(output_probs)
        selected_probs = output_probs[torch.arange(len(output_probs)), target].to("cpu")
        epoch_perplexity.add_sentence(selected_probs.tolist())
    print("Epoch: {}, Loss: {:.4f}, Perplexity: {:.4f}".format(epoch+1, epoch_loss / num_batches, epoch_perplexity.compute_perplexity()))
    # evaluate
    model.eval()
    # evaluate on dev set 
    dev_epoch_loss = 0
    dev_epoch_perplexity = Perplexity()
    num_batches = 0
    batches, batches_lengths, batches_max_len = batch_generator(dev_data_tensor, dev_sentence_lengths, batch_size)
    for batch, batch_lengths, batches_max_len in zip(batches, batches_lengths, batches_max_len):
        hidden = model.init_hidden(len(batch), device)
        # target is all words except first one
        target = batch[:, 1:batches_max_len].to(device)
        # send to device
        batch = batch.to(device)
        batch_lengths = batch_lengths.to(device)
        # pass all words except last one
        output = model(batch[:, :batches_max_len-1], hidden, batch_lengths)
        output = output.reshape(len(batch), -1, vocab_size)
        #log_output = F.log_softmax(output, dim=2)
        # output: (batch_size, vocab_size)
        # create mask to ignore padding in loss
        mask = torch.zeros_like(target, dtype=torch.float)
        for i, length in enumerate(batch_lengths):
            mask[i, :length-1] = 1

        output = output.reshape(len(batch) * (batches_max_len-1), vocab_size)
        target = target.reshape(len(batch) * (batches_max_len-1))
        # compute loss
        loss = criterion(output, target)
        loss = torch.sum(loss * mask.reshape(-1)) / torch.sum(mask)
    
        dev_epoch_loss += loss.item()
        num_batches += 1
        
        # add prediction to perplexity
        #output_probs = F.log_softmax(output, dim=-1)
        output_probs = F.softmax(output, dim=-1)
        output_probs = torch.log2(output_probs)
        selected_probs = output_probs[torch.arange(len(output_probs)), target].to("cpu")
        dev_epoch_perplexity.add_sentence(selected_probs.tolist())

    print("Epoch: {}, Dev Loss: {:.4f}, Dev Perplexity: {:.4f}".format(epoch+1, dev_epoch_loss / num_batches, dev_epoch_perplexity.compute_perplexity()))

    # generate a sentence
    hidden = model.init_hidden(1, device)
    generated = model.generate(torch.tensor([word_dict.word_to_id["<bos>"]]).to(device), hidden, 1, max_len, device)
    if generated[0] == word_dict.word_to_id["<bos>"]:
        generated = generated[1:]
    eos_flag = False
    if generated[-1] == word_dict.word_to_id["<eos>"]:
        generated = generated[:-1]
        eos_flag = True
    if eos_flag:
        print("Generated sentence: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))
    else:
        print("Generated sentence without <eos> token: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))
    model.train()
    print("")

Epoch: 1, Loss: 7.6749, Perplexity: 377.3656
Epoch: 1, Dev Loss: 7.0032, Dev Perplexity: 166.3631
Generated sentence: 
 stake find growth texas said

Epoch: 2, Loss: 6.8056, Perplexity: 147.8680
Epoch: 2, Dev Loss: 6.6853, Dev Perplexity: 141.1949
Generated sentence: 
 

Epoch: 3, Loss: 6.6472, Perplexity: 128.8401
Epoch: 3, Dev Loss: 6.5762, Dev Perplexity: 120.5470
Generated sentence: 
 N month writer premium have three eliminated N <unk> been a big missing played pressure laurence

Epoch: 4, Loss: 6.5641, Perplexity: 96.9004
Epoch: 4, Dev Loss: 6.5065, Dev Perplexity: 87.4678
Generated sentence: 
 when at N

Epoch: 5, Loss: 6.5281, Perplexity: 93.3364
Epoch: 5, Dev Loss: 6.4175, Dev Perplexity: 95.4229
Generated sentence: 
 <unk> from area debts 's secretary by however the that their this shevardnadze senior bank salomon longer an $ of four the that was bell experiencing

Epoch: 6, Loss: 6.4328, Perplexity: 86.5513
Epoch: 6, Dev Loss: 6.3379, Dev Perplexity: 76.4309
Generated senten

In [16]:
def generate_sentence(model, device, sentence_start):
    sentence = ["<bos>"] + sentence_start
    sentence = [word_dict.word_to_id[word] for word in sentence]
    sentence = torch.tensor(sentence).to(device)
    hidden = model.init_hidden(1, device)
    generated = model.generate(sentence, hidden, len(sentence), max_len, device)
    eos_flag = False
    if generated[-1] == word_dict.word_to_id["<eos>"]:
        generated = generated[:-1]
        eos_flag = True

    if eos_flag:
        print("Generated sentence: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))
    else:
        print("Generated sentence without <eos> token: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))

In [17]:
generate_sentence(model, device, "profits at all costs".split())

generate_sentence(model, device, "ensuring worker rights".split())

Generated sentence: 
 <bos> profits at all costs for minneapolis
Generated sentence: 
 <bos> ensuring worker rights succeed worked across to the facility house might become an oklahoma ought possible foreign bank declared <unk> users on the company the the


In [18]:
# save model
torch.save(model.state_dict(), "./lstm_language_model.pt")


In [19]:
# load model
device = 'cpu'
model = LSTMLanguageModel(vocab_size, embedding_dim, hidden_dim, num_layers, dropout, word_dropout=word_dropout).to(device)
model.load_state_dict(torch.load("./lstm_language_model.pt", map_location=torch.device(device)))

# generate some sentences
model.eval()
for i in range(10):
    hidden = model.init_hidden(1, device)
    generated = model.generate(torch.tensor([word_dict.word_to_id["<bos>"]]).to(device), hidden, 1, max_len, device)
    if generated[0] == word_dict.word_to_id["<bos>"]:
        generated = generated[1:]
    eos_flag = False
    if generated[-1] == word_dict.word_to_id["<eos>"]:
        generated = generated[:-1]
        eos_flag = True
    if eos_flag:
        print("Generated sentence: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))
    else:
        print("Generated sentence without <eos> token: \n {}".format(" ".join([word_dict.id_to_word[word_id] for word_id in generated])))

Generated sentence: 
 steve <unk> an result with current offers mining chinese at dpc <unk> of datapoint that follow birmingham have been missed a number of bankruptcy court
Generated sentence: 
 <unk> manufacturers traded for in the same amount of income begins days over compaq given its auction standards sounds empty their <unk> years that too likely to about the fine with what they had to have to desk
Generated sentence: 
 its offer increased a N N rise in procedural propaganda
Generated sentence: 
 after the N vote one economists darman allowed encouraged increases users to see the state to restore the massive accounts spokeswoman is serious <unk> mixte for direct <unk>
Generated sentence: 
 the capital-gains four pacific banks have been honest risk and would <unk> reached maidenform d. data to enter southern a. asian and palo alto christopher business and medium-sized signs of financial factors were the alternative would that will be able to improve the sidelines
Generated sentenc