<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 7 -- Introduction to Natural Language Processing II </h2>

# Exercise 0

- Import the same modules as discussed in the lecture notebook.
- Check if your module versions are correct.
- Use your GPU if available.

In [1]:
import u7_utils as u7

import numpy as np
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import sys
import os
import io
import time
import math


In [2]:
u7.check_module_versions()


Installed Python version: 3.7 (✓)
Installed numpy version: 1.18.1 (✓)
Installed matplotlib version: 3.1.3 (✓)
Installed PyTorch version: 1.5.0 (✓)


In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
print("Device:", device)


Device: cpu


<h1 style="color:rgb(208,90,80)">ABOUT THIS NOTEBOOK</h1>
<span style="color:rgb(208,90,80)">In this notebook you should solve a small task on your own. <br><br> The goal is to train LSTM networks with different numbers of hidden cells on the Penn Treebank dataset. Use the validation dataset to decide which model works best, then evaluate that model on the test dataset. This is a first example of a hyperparameter search. <br> We only evaluate how you build this hyperparameter search.</span>

<h3 style="color:rgb(0,120,170)">Defining hyper-parameters</h3>
In contrast to the lecture notebook, we do not set the parameter <i>nhid</i> here. It is the hyperparameter that we will search over later.

In [4]:
data_path = 'resources/penn/'
emsize = 200 # size of word embeddings
lr = 20 # initial learning rate
clipping = 0.25 # gradient clipping
epochs = 3 # upper epoch limit
train_batch_size = 10 # batch size for training
eval_batch_size = 5 # batch size for validation/test
max_seq_len = 35 # sequence length
seed = 1111 # random seed to facilitate reproducibility
print_interval = 1000 # report interval


In [5]:
torch.manual_seed(seed)


<torch._C.Generator at 0x7fc7c39a56b0>

<h3 style="color:rgb(0,120,170)">Data & dictionary</h3>

In [6]:
train_corpus = u7.Corpus(os.path.join(data_path, 'train.txt'))
valid_corpus = u7.Corpus(os.path.join(data_path, 'valid.txt'))
test_corpus = u7.Corpus(os.path.join(data_path, 'test.txt'))

dictionary = u7.Dictionary()
train_corpus.fill_dictionary(dictionary)
ntokens = len(dictionary)
print (f'Number of tokens in dictionary {ntokens}')

train_data = train_corpus.words_to_ids(dictionary)
print (f'Train data: number of tokens {len(train_data)}')

valid_data = valid_corpus.words_to_ids(dictionary)
print (f'Validation data: number of tokens {len(valid_data)}')

test_data = test_corpus.words_to_ids(dictionary)
print (f'Test data: number of tokens {len(test_data)}')


Number of tokens in dictionary 10001
Train data: number of tokens 929589
Validation data: number of tokens 73760
Test data: number of tokens 82430


In [7]:
train_data_batches = u7.batchify(train_data, train_batch_size, device)
print (f'Train batchified data shape: {train_data_batches.shape}')

val_data_batches = u7.batchify(valid_data, eval_batch_size, device)
print (f'Validation batchified data shape: {val_data_batches.shape}')

test_data_batches = u7.batchify(test_data, eval_batch_size, device)
print (f'Test batchified data shape: {test_data_batches.shape}')


Train batchified data shape: torch.Size([92958, 10])
Validation batchified data shape: torch.Size([14752, 5])
Test batchified data shape: torch.Size([16486, 5])
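
The shapes above follow from integer arithmetic, assuming `u7.batchify` drops the tokens that do not fill a complete row and arranges the rest into `batch_size` columns. The helper's source is not shown in this notebook, so this is an assumption that merely reproduces the printed shapes, and the function name `batchified_shape` is hypothetical.

```python
def batchified_shape(n_tokens: int, batch_size: int) -> tuple:
    """Shape of the batchified tensor, assuming leftover tokens are dropped."""
    rows = n_tokens // batch_size  # tokens per column after trimming the remainder
    return (rows, batch_size)

# Token counts and batch sizes taken from the cells above.
train_shape = batchified_shape(929589, 10)
valid_shape = batchified_shape(73760, 5)
test_shape = batchified_shape(82430, 5)
print(train_shape, valid_shape, test_shape)

# Number of training steps per epoch for max_seq_len = 35, using the
# range() expression that the training loop of this notebook iterates over.
steps_per_epoch = len(range(0, train_shape[0] - 1, 35))
print(steps_per_epoch)
```

The last value is one larger than the batch total printed in the training log, because the log computes `int(len(train_data_batches) / max_seq_len)` and therefore ignores the final, shorter chunk.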


<h3 style="color:rgb(0,120,170)">Training</h3>
Nothing to do here

In [8]:
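def train(model: torch.nn.Module, dictionary: u7.Dictionary,
          max_seq_len: int, train_batch_size: int,
          train_data_batches: torch.Tensor, optimizer: torch.optim.Optimizer,
          criterion: torch.nn.Module, clipping: float, learning_rate: float,
          print_interval: int, epoch: int) -> None:
    """
    Train the model for one epoch on the batchified training data.

    The loss and perplexity are printed every `print_interval` batches.
    `learning_rate` and `epoch` are only used in these log lines.
    :return: None
    """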
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(dictionary)
    start_hidden = model.init_hidden(train_batch_size)
    for batch, i in enumerate(range(0, train_data_batches.size(0) - 1, max_seq_len)):
        data, targets = u7.get_batch(train_data_batches, i, max_seq_len)

        # forward pass
        model.zero_grad()
        start_hidden = u7.repackage_hidden(start_hidden)
        output, last_hidden = model(data, start_hidden)

        # loss computation & backward pass
        output = output.view(-1, ntokens)
        loss = criterion(output, targets.view(-1))
        loss.backward()

        start_hidden = last_hidden
        # clipping gradient
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping)
        optimizer.step()

        total_loss += loss.item()
        if batch % print_interval == 0 and batch > 0:
            cur_loss = total_loss / print_interval
            elapsed = time.time() - start_time
            print(f'| epoch {epoch :3d} | {batch :5d}/{int(len(train_data_batches)/max_seq_len) :5d} batches' 
                  f'| lr {learning_rate :02.2f} | ms/batch {elapsed * 1000 / print_interval :5.2f} |'
                  f'loss {cur_loss :5.2f} | perplexity {math.exp(cur_loss) :8.2f}')
            total_loss = 0
            start_time = time.time()
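
The call to `torch.nn.utils.clip_grad_norm_` in the training loop rescales all gradients together so that their combined L2 norm does not exceed `clipping`. A minimal pure-Python sketch of that idea follows. The function `clip_by_global_norm` and the toy gradient values are hypothetical; they only illustrate the arithmetic and are not PyTorch's actual implementation.

```python
import math
from typing import List

def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> List[List[float]]:
    """Scale all gradients by one common factor if their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The small epsilon guards against division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in grad] for grad in grads]

# Toy gradients of two "parameters"; their joint norm is 5.0.
toy_grads = [[3.0], [4.0]]
clipped = clip_by_global_norm(toy_grads, max_norm=0.25)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(clipped_norm)  # close to 0.25

# Gradients that are already below the limit are left unchanged.
small = clip_by_global_norm([[0.1], [0.1]], max_norm=0.25)
print(small)
```

Because a single factor is applied to every gradient, the direction of the update is preserved and only its length is limited.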
            

In [9]:
class LM_LSTMModel(nn.Module):

    def __init__(self, ntoken, ninp, nhid):
        super(LM_LSTMModel, self).__init__()
        self.ntoken = ntoken
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = nn.LSTM(ninp, nhid)
        self.decoder = nn.Linear(nhid, ntoken)
        self.nhid = nhid
        
    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(1, bsz, self.nhid),
                weight.new_zeros(1, bsz, self.nhid))

    def forward(self, input, hidden):
        emb = self.encoder(input)
        hiddens, last_hidden = self.rnn(emb, hidden)
        
        decoded = self.decoder(hiddens)
        return F.log_softmax(decoded, dim=-1), last_hidden
    

# Exercise 1

- Train the model for three epochs and validate after each epoch. Repeat this procedure with different numbers of LSTM cells (the <i>nhid</i> parameter in the lecture notebook). For each run, save the model with the best validation loss.
- Which model is the best? You can use the suggested parameter values, but feel free to try other values as well. Please note that training can be quite time-consuming for larger numbers of LSTM cells.
- Load the best model and evaluate it on the test dataset.
- NOTA BENE: use the Adam optimizer to get better performance, <code>optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)</code>, instead of SGD as done in the lecture (you can check for it in earlier notebooks).
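
The search described in the list above can be sketched independently of PyTorch: for every candidate size, keep the lowest validation loss over the epochs, then pick the size with the lowest of those values. In this sketch, `search_hidden_sizes` and `toy_validation_loss` are hypothetical names, and the toy loss values are made up purely for illustration; in the exercise, that callback would train for one epoch and return the real validation loss.

```python
from typing import Callable, Dict, Iterable, Tuple

def search_hidden_sizes(hidden_sizes: Iterable[int], n_epochs: int,
                        train_and_validate: Callable[[int, int], float]
                        ) -> Tuple[int, Dict[int, float]]:
    """Return the best hidden size and the best validation loss per size."""
    best_per_size: Dict[int, float] = {}
    for nhid in hidden_sizes:
        best = None
        for epoch in range(n_epochs):
            val_loss = train_and_validate(nhid, epoch)
            # Keep the best epoch; this is where a checkpoint would be saved.
            if best is None or val_loss < best:
                best = val_loss
        best_per_size[nhid] = best
    best_nhid = min(best_per_size, key=best_per_size.get)
    return best_nhid, best_per_size

def toy_validation_loss(nhid: int, epoch: int) -> float:
    # Made-up stand-in: the loss shrinks with more epochs and more cells.
    return 6.0 - 0.1 * epoch - 0.001 * nhid

best_nhid, best_per_size = search_hidden_sizes([8, 16, 32], 3, toy_validation_loss)
print(best_nhid, best_per_size)
```

Only the validation loss enters the selection; the test dataset is touched once, after the best model has been chosen.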

In [10]:
nhid = [8, 16, 32, 64, 128]


In [13]:
criterion = nn.NLLLoss()
list_path = []
for cells in nhid:
    # Each number of LSTM cells gets its own checkpoint file.
    save_path = f'model{cells}.pt'
    list_path.append(save_path)
    best_val_loss = None  # best validation loss seen so far for this model

    print('#################################################################################################')
    print('Number of LSTM cells: ', cells)
    model = LM_LSTMModel(ntokens, emsize, cells).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)
    # your code goes here
    for epoch in range(epochs):
        print('\n')
        print(f'{epoch}. epoch:')
        epoch_start_time = time.time()
        # Note: `lr` is only printed by train(); Adam runs with the lr=1e-2 set above.
        train(model, dictionary, max_seq_len, train_batch_size, train_data_batches,
              optimizer, criterion, clipping, lr, print_interval, epoch)

        val_loss = u7.evaluate(model, dictionary, max_seq_len, eval_batch_size,
                               val_data_batches, criterion)  # validation loss
        print('-' * 89)
        print(f'| end of epoch {epoch :3d} | time: {time.time() - epoch_start_time :5.2f}s' 
              f'| valid loss {val_loss :5.2f} | valid perplexity {math.exp(val_loss):8.2f}')
        print('-' * 89)

        # Save the model if the validation loss is the best we've seen so far.
        if best_val_loss is None or val_loss < best_val_loss:
            with open(save_path, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
            print('saved on path', save_path)
        else:
            # Anneal the optimizer's learning rate if the validation loss did not improve.
            for group in optimizer.param_groups:
                group['lr'] /= 4.0
            

#################################################################################################
Number of LSTM cells:  8


0. epoch:
| epoch   0 |  1000/ 2655 batches| lr 20.00 | ms/batch 114.48 |loss  6.36 | perplexity   575.52
| epoch   0 |  2000/ 2655 batches| lr 20.00 | ms/batch 102.30 |loss  5.94 | perplexity   378.47
-----------------------------------------------------------------------------------------
| end of epoch   0 | time: 283.80s| valid loss  5.85 | valid perplexity   345.87
-----------------------------------------------------------------------------------------
saved on path model8.pt


1. epoch:
| epoch   1 |  1000/ 2655 batches| lr 20.00 | ms/batch 95.00 |loss  5.78 | perplexity   325.27
| epoch   1 |  2000/ 2655 batches| lr 20.00 | ms/batch 100.03 |loss  5.73 | perplexity   308.31
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 265.62s| valid loss  5.73 | valid perplexity   308.76
-----------------------------------------------------------------------------------------
saved on path model8.pt


2. epoch:
| epoch   2 |  1000/ 2655 batches| lr 20.00 | ms/batch 96.36 |loss  5.69 | perplexity   296.49
| epoch   2 |  2000/ 2655 batches| lr 20.00 | ms/batch 98.99 |loss  5.67 | perplexity   289.87
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 265.67s| valid loss  5.68 | valid perplexity   294.12
-----------------------------------------------------------------------------------------
saved on path model8.pt
#

Best model: the model with 128 LSTM cells, with its best checkpoint saved at epoch 2.

In [15]:
# Load the best saved model.
with open('model128.pt', 'rb') as f:
    model = torch.load(f)
    

In [16]:
test_loss = u7.evaluate(model, dictionary, max_seq_len,
                        eval_batch_size, test_data_batches, criterion)
print('=' * 89)
print(f'| End of training | test loss {test_loss :5.2f} | test perplexity {math.exp(test_loss) :5.2f}')
print('=' * 89)


| End of training | test loss  5.00 | test perplexity 149.05
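
The perplexity in the line above is the exponential of the mean negative log-likelihood per token. The two printed numbers do not match exactly under `exp` because the loss is rounded to two decimals for display, while the perplexity is computed from the unrounded loss. A small sketch of that relationship, where the helper name `perplexity` is hypothetical:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    return math.exp(mean_nll)

# Perplexity computed from the loss as printed (rounded to two decimals).
rounded = perplexity(5.00)

# Loss implied by the printed perplexity of 149.05.
recovered_loss = math.log(149.05)

print(f'{rounded:.2f} {recovered_loss:.4f}')
```

The recovered loss is slightly above 5.00, which is consistent with the rounding in the log line.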


# Exercise 2

- Count the parameters of the best model. How many parameters does it have?

In [20]:
# your code goes here
# model.parameters() yields weight tensors, so sum their element counts
count_para = sum(p.numel() for p in model.parameters())

print(f'The best model has {count_para} parameters.')


The best model has 3459289 parameters.
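
`model.parameters()` yields one tensor per weight matrix or bias vector, so the number of individual weights is the sum of the tensor sizes. For the `LM_LSTMModel` architecture this can be cross-checked by hand from the layer shapes. The sketch below assumes `ntokens = 10001`, `emsize = 200`, and `nhid = 128` as used in this notebook; the function name `lstm_lm_parameter_count` is hypothetical.

```python
def lstm_lm_parameter_count(ntoken: int, ninp: int, nhid: int) -> int:
    """Count the weights of an embedding -> single-layer LSTM -> linear decoder model."""
    embedding = ntoken * ninp
    # Four gates, each with input-to-hidden and hidden-to-hidden weights;
    # PyTorch's nn.LSTM keeps two bias vectors of size 4 * nhid.
    lstm = 4 * nhid * ninp + 4 * nhid * nhid + 2 * 4 * nhid
    decoder = nhid * ntoken + ntoken  # weight matrix plus bias
    return embedding + lstm + decoder

total = lstm_lm_parameter_count(ntoken=10001, ninp=200, nhid=128)
print(total)
```

Most of the weights sit in the embedding and decoder layers, whose sizes scale with the vocabulary, while the LSTM itself contributes a comparatively small share.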
