# Assignment 2. Language modeling.

This assignment is devoted to language modeling. The goal is to write an RNN-based language model in PyTorch. Since word-based language modeling requires long training and is memory-consuming due to the large vocabulary, we start with character-based language modeling. We are going to train the model to generate words as sequences of characters. During training, we teach it to predict the characters of the words in the training set.



## Task 1. Character-based language modeling: data preparation (15 points).

We train the language models on the materials of **Sigmorphon 2018 Shared Task**. First, download the Russian datasets.

In [4]:
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-train-high
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-dev
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-test

--2020-03-27 23:39:38--  https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-train-high
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.244.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.244.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 533309 (521K) [text/plain]
Saving to: ‘russian-train-high’


2020-03-27 23:39:38 (1.31 MB/s) - ‘russian-train-high’ saved [533309/533309]

--2020-03-27 23:39:38--  https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-dev
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.244.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.244.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53671 (52K) [text/plain]
Saving to: ‘russian-dev’


2020-03-27 23:39:38 (1.29 MB/s) - ‘russian-dev’ saved [53671/53671]

--2020-03-27 23:39:39--

**1.1 (1 point)**
All the files contain tab-separated triples ```<lemma>-<form>-<tags>```, where ```<form>``` may contain spaces (*будете соответствовать*). Write a function that loads a list of all word forms that do not contain spaces.  
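A minimal sketch of such a loader, assuming every valid line has exactly three tab-separated fields. The name `read_forms_without_spaces` is illustrative and not part of the assignment; the function accepts any iterable of lines, so an open file object works as well.

```python
def read_forms_without_spaces(lines):
    """Return the word forms (second column) that contain no spaces."""
    forms = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        form = fields[1]
        if " " not in form:
            forms.append(form)
    return forms


# Two lines in the same format as the training file.
sample = [
    "тютя\tтюти\tN;NOM;PL\n",
    "облётывать\tбудете облётывать\tV;FUT;2;PL\n",
]
print(read_forms_without_spaces(sample))  # ['тюти']
```

With a real file, the call would look like `read_forms_without_spaces(open("russian-train-high", encoding="utf-8"))`.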

In [1]:
!head russian-train-high

валлонский	валлонскому	ADJ;DAT;NEUT;SG
незаконченный	незаконченным	ADJ;INS;NEUT;SG
истрёпывать	истрёпывав	V.CVB;PST
личный	личного	ADJ;ANIM;ACC;MASC;SG
серьга	серьгам	N;DAT;PL
необоснованный	необоснованным	ADJ;INS;NEUT;SG
тютя	тюти	N;NOM;PL
зарасти	заросла	V;PST;SG;FEM
облётывать	будете облётывать	V;FUT;2;PL
идеальный	идеальна	ADJ;FEM;SG;LGSPEC1


In [2]:
def read_infile(infile):
    """
    == YOUR CODE HERE ==
    """
    with open(infile) as f:
        content = f.readlines()
        
    words = []
    for line in content:
        line_split = line.split('\t')
        if len(line_split) == 3:
            words.append(line_split[0])
            words.append(line_split[1])
    return words

In [3]:
train_words = read_infile("russian-train-high")
dev_words = read_infile("russian-dev")
test_words = read_infile("russian-test")
print(len(train_words), len(dev_words), len(test_words))
print(*train_words[:10])

20000 2000 2000
валлонский валлонскому незаконченный незаконченным истрёпывать истрёпывав личный личного серьга серьгам


**1.2 (2 points)** Write a **Vocabulary** class that transforms symbols into their indexes. The class should have a ```__call__``` method that applies this transformation to sequences of symbols as well as to batches of sequences. You can also use [SimpleVocabulary](https://github.com/deepmipt/DeepPavlov/blob/c10b079b972493220c82a643d47d718d5358c7f4/deeppavlov/core/data/simple_vocab.py#L31) from DeepPavlov. Fit an instance of this class on the training data.

In [4]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary
"""
== YOUR CODE HERE ==
"""
class Vocabulary:
    def __init__(self):
        self.unk_index = 0
        self.begin_index = 1
        self.end_index = 2
    
    def fit(self, sentences):
        self.word2index = {'UNK': 0, 'BEGIN': 1, 'END': 2}
        word_index = 3
        
        for sentence in sentences:
            for word in sentence:
                if word not in self.word2index.keys():
                    self.word2index[word] = word_index
                    word_index += 1
                    
    def __call__(self, sentence):
        # A batch is a sequence of sequences: vectorize each of them recursively.
        if len(sentence) > 0 and isinstance(sentence[0], (list, tuple)):
            return [self(item) for item in sentence]
        # Symbols that were not seen during fit map to the UNK index.
        return [self.word2index.get(word, self.unk_index) for word in sentence]

    def __len__(self):
        return len(self.word2index)

vocab = Vocabulary()
vocab.fit([list(x) for x in train_words])
print(len(vocab))

55


**1.3 (2 points)** Write a **Dataset** class, which should be inherited from ```torch.utils.data.Dataset```. It should take a list of words and the ```vocab``` as initialization arguments.

In [5]:
import torch
from torch.utils.data import Dataset as TorchDataset

class Dataset(TorchDataset):
    
    """Custom data.Dataset compatible with data.DataLoader."""
    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __getitem__(self, index):
        """
        Returns one tensor pair (source and target). The source tensor corresponds to the input word,
        with "BEGIN" and "END" symbols attached. The target tensor should contain the answers
        for the language model that receives this word as input.
        """
        """
        == YOUR CODE HERE ==
        """
        example = ['BEGIN'] + list(self.data[index]) + ['END']
        example_indexes = torch.tensor(self.vocab(example))
        return example_indexes[:-1], example_indexes[1:]

    def __len__(self):
        """
        == YOUR CODE HERE ==
        """
        return len(self.data)

In [6]:
train_dataset = Dataset(train_words, vocab)
dev_dataset = Dataset(dev_words, vocab)
test_dataset = Dataset(test_words, vocab)

**1.4 (3 points)** Use a standard ```torch.utils.data.DataLoader``` to obtain an iterable over batches. Print the shapes of the first 10 input batches with ```batch_size=1```.

In [7]:
from torch.utils.data import DataLoader

"""
== YOUR CODE HERE ==
"""
train_loader = DataLoader(train_dataset, batch_size=1)
dev_loader = DataLoader(dev_dataset, batch_size=1)
test_loader = DataLoader(test_dataset, batch_size=1)

In [8]:
import itertools
from pprint import pprint

pprint(list(itertools.islice(train_loader, 10)))

[[tensor([[ 1,  3,  4,  5,  5,  6,  7,  8,  9, 10, 11]]),
  tensor([[ 3,  4,  5,  5,  6,  7,  8,  9, 10, 11,  2]])],
 [tensor([[ 1,  3,  4,  5,  5,  6,  7,  8,  9,  6, 12, 13]]),
  tensor([[ 3,  4,  5,  5,  6,  7,  8,  9,  6, 12, 13,  2]])],
 [tensor([[ 1,  7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 11]]),
  tensor([[ 7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 11,  2]])],
 [tensor([[ 1,  7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 12]]),
  tensor([[ 7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 12,  2]])],
 [tensor([[ 1, 10,  8, 18, 19, 20, 21, 17,  3,  4, 18, 22]]),
  tensor([[10,  8, 18, 19, 20, 21, 17,  3,  4, 18, 22,  2]])],
 [tensor([[ 1, 10,  8, 18, 19, 20, 21, 17,  3,  4,  3]]),
  tensor([[10,  8, 18, 19, 20, 21, 17,  3,  4,  3,  2]])],
 [tensor([[ 1,  5, 10, 16,  7, 17, 11]]),
  tensor([[ 5, 10, 16,  7, 17, 11,  2]])],
 [tensor([[ 1,  5, 10, 16,  7,  6, 23,  6]]),
  tensor([[ 5, 10, 16,  7,  6, 23,  6,  2]])],
 [tensor([[ 1,  8, 14, 19, 22, 23,  4]]),
  tensor([[ 8,

**(1.5) 1 point** Explain why this does not work with a larger batch size.

**Because the examples have different lengths, torch cannot stack them into a single tensor. The resulting exception is shown below.**

In [9]:
next(iter(DataLoader(train_dataset, batch_size=2)))

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 11 and 12 in dimension 1 at ../aten/src/TH/generic/THTensor.cpp:612

**(1.6) 5 points** Write a function **collate** that allows you to deal with batches of greater size. See [discussion](https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418/8) for an example. Implement your function as the ```__call__``` method of a class to make it more flexible.

In [10]:
def pad_tensor(vec, length, dim, pad_symbol):
    """
    Pads a vector ``vec`` up to length ``length`` along axis ``dim`` with pad symbol ``pad_symbol``.
    """
    """
    == YOUR CODE HERE ==
    """
    pad_size = list(vec.shape)
    pad_size[dim] = length - vec.size(dim)
    # Build the padding block directly, without shadowing the function name.
    padding = torch.full(pad_size, pad_symbol, dtype=vec.dtype)
    return torch.cat([vec, padding], dim=dim)

class Padder:
    
    def __init__(self, dim=0, pad_symbol=0):
        self.dim = dim
        self.pad_symbol = pad_symbol
        
    def __call__(self, batch):
        """
        == YOUR CODE HERE ==
        """
        max_len = max(map(lambda x: x[0].shape[self.dim], batch)) # max len of y will be the same
        padded_batch = list(map(lambda x: (pad_tensor(x[0], max_len, self.dim, self.pad_symbol),
                                pad_tensor(x[1], max_len, self.dim, self.pad_symbol)), batch))
        xs = torch.stack(list(map(lambda x: x[0], padded_batch)), dim=0)
        ys = torch.stack(list(map(lambda x: x[1], padded_batch)), dim=0)
        return xs, ys

**(1.7) 1 point** Again, use ```torch.utils.data.DataLoader``` to obtain an iterable over batches. Print the shapes of the first 10 input batches with the batch size you like.

In [11]:
from torch.utils.data import DataLoader

"""
== YOUR CODE HERE ==
"""
train_loader = DataLoader(train_dataset, collate_fn=Padder(), batch_size=2)
dev_loader = DataLoader(dev_dataset, collate_fn=Padder(), batch_size=2)
test_loader = DataLoader(test_dataset, collate_fn=Padder(), batch_size=2)

In [12]:
pprint(list(itertools.islice(train_loader, 10)))

[(tensor([[ 1,  3,  4,  5,  5,  6,  7,  8,  9, 10, 11,  0],
        [ 1,  3,  4,  5,  5,  6,  7,  8,  9,  6, 12, 13]]),
  tensor([[ 3,  4,  5,  5,  6,  7,  8,  9, 10, 11,  2,  0],
        [ 3,  4,  5,  5,  6,  7,  8,  9,  6, 12, 13,  2]])),
 (tensor([[ 1,  7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 11],
        [ 1,  7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 12]]),
  tensor([[ 7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 11,  2],
        [ 7, 14, 15,  4,  9,  6,  7, 16, 14,  7,  7, 17, 12,  2]])),
 (tensor([[ 1, 10,  8, 18, 19, 20, 21, 17,  3,  4, 18, 22],
        [ 1, 10,  8, 18, 19, 20, 21, 17,  3,  4,  3,  0]]),
  tensor([[10,  8, 18, 19, 20, 21, 17,  3,  4, 18, 22,  2],
        [10,  8, 18, 19, 20, 21, 17,  3,  4,  3,  2,  0]])),
 (tensor([[ 1,  5, 10, 16,  7, 17, 11,  0],
        [ 1,  5, 10, 16,  7,  6, 23,  6]]),
  tensor([[ 5, 10, 16,  7, 17, 11,  2,  0],
        [ 5, 10, 16,  7,  6, 23,  6,  2]])),
 (tensor([[ 1,  8, 14, 19, 22, 23,  4,  0],
        [ 1,  8, 14, 19,

## Task 2. Character-based language modeling. (35 points)

**2.1 (5 points)** Write a network that performs language modeling. It should include three layers:
1. **Embedding** layer that transforms input symbols into vectors.
2. An **RNN** layer that outputs a sequence of hidden states (you may use https://pytorch.org/docs/stable/nn.html#gru).
3. A **Linear** layer with ``softmax`` activation that produces the output distribution for each symbol.

In [25]:
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):

    def __init__(self, vocab_size, embeddings_dim, hidden_size):
        super(RNNLM, self).__init__()
        """
        == YOUR CODE HERE ==
        """
        self.embedding = nn.Embedding(vocab_size, embeddings_dim)
        # batch_first=True: the batches produced by Padder have shape (batch, seq_len).
        # Without it the GRU would treat the batch dimension as time.
        self.rnn = nn.GRU(embeddings_dim, hidden_size, batch_first=True)
        self.classification = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs, hidden=None):
        """
        == YOUR CODE HERE ==
        """
        x = self.embedding(inputs)    # (batch, seq_len, embeddings_dim)
        x, _ = self.rnn(x, hidden)    # (batch, seq_len, hidden_size)
        x = self.classification(x)    # (batch, seq_len, vocab_size)
        # Probabilities with the class dimension second: (batch, vocab_size, seq_len),
        # the layout that PyTorch loss functions expect for sequence targets.
        return F.softmax(x, dim=-1).permute(0, 2, 1)

**2.2 (1 point)** Write a function ``validate_on_batch`` that takes as input a model, a batch of inputs, a batch of outputs, and the loss criterion, and outputs the loss tensor for the whole batch. This loss should not be normalized.

In [14]:
def validate_on_batch(model, criterion, x, y):
    """
    == YOUR CODE HERE ==
    """
    y_pred = model(x)
    return criterion(y_pred, y)

**2.3 (1 point)** Write a function ``train_on_batch`` that accepts all the arguments of ``validate_on_batch`` and also an optimizer, calculates the loss and makes a single step of gradient optimization. This function should call ``validate_on_batch`` inside.

In [15]:
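def train_on_batch(model, criterion, x, y, optimizer):
    """
    == YOUR CODE HERE ==
    """
    optimizer.zero_grad()
    loss = validate_on_batch(model, criterion, x, y)
    # An unreduced criterion yields one loss per symbol, but backward() needs a scalar.
    # mean() leaves an already reduced (scalar) loss unchanged.
    loss.mean().backward()
    optimizer.step()
    return loss.detach()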

**2.4 (3 points)** Write a training loop. You should define your ``RNNLM`` model, the criterion, the optimizer and the hyperparameters (number of epochs and batch size). Then train the model for a required number of epochs. On each epoch evaluate the average training loss and the average loss on the validation set. 

**2.5 (3 points)** Do not forget to average your loss over only non-padding symbols, otherwise it will be too optimistic.
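One way to check the masking is to compare a manual masked average with the `ignore_index` argument of `nn.CrossEntropyLoss`, which excludes targets with the given index from the mean. The sketch below uses random logits and assumes that padding has index 0, as in `Padder`; all tensor names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_index = 0                                      # assumed padding index
batch, vocab_size, seq_len = 2, 5, 4

logits = torch.randn(batch, vocab_size, seq_len)   # (batch, classes, seq_len)
targets = torch.tensor([[3, 1, 2, 0],              # last position is padding
                        [4, 2, 0, 0]])             # last two positions are padding

# Manual version: per-symbol losses, averaged over non-padding positions only.
per_symbol = nn.CrossEntropyLoss(reduction="none")(logits, targets)
mask = (targets != pad_index).float()
manual_loss = (per_symbol * mask).sum() / mask.sum()

# Built-in version: ignore_index drops padding targets from the mean.
builtin_loss = nn.CrossEntropyLoss(ignore_index=pad_index)(logits, targets)

# Naive mean over all positions, padding included, for comparison.
naive_loss = per_symbol.mean()

print(manual_loss.item(), builtin_loss.item(), naive_loss.item())
```

The first two values should coincide, while the naive mean generally differs. Note that in the `Vocabulary` above index 0 is also `UNK`, so masking by this index excludes unknown symbols as well.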

In [30]:
"""
== YOUR CODE HERE ==
"""
from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter

# PARAMS
embeddings_dim = 100
hidden_size = 256
batch_size = 128
n_epochs = 5
log_every = 1 # log train loss every n batches
eval_every = 2_000 # evaluate every n examples
pad_symbol = 0

# Data
train_loader = DataLoader(train_dataset, collate_fn=Padder(pad_symbol=pad_symbol), batch_size=batch_size)
dev_loader = DataLoader(dev_dataset, collate_fn=Padder(pad_symbol=pad_symbol), batch_size=batch_size)
test_loader = DataLoader(test_dataset, collate_fn=Padder(pad_symbol=pad_symbol), batch_size=batch_size)
writer = SummaryWriter()


# Model
model = RNNLM(len(vocab), embeddings_dim, hidden_size)
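# RNNLM returns softmax probabilities, so the loss is NLL on their logarithm;
# nn.CrossEntropyLoss would apply a second softmax and flatten the distribution.
# reduction='none' keeps per-symbol losses so that padding can be masked out.
nll = nn.NLLLoss(reduction='none')
criterion = lambda probs, targets: nll(torch.log(probs.clamp_min(1e-12)), targets)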
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

In [31]:
x, y = next(iter(train_loader))

In [32]:
# model(x)

tensor([[[0.0194, 0.0205, 0.0209,  ..., 0.0188, 0.0188, 0.0188],
         [0.0162, 0.0183, 0.0168,  ..., 0.0177, 0.0177, 0.0177],
         [0.0193, 0.0179, 0.0195,  ..., 0.0152, 0.0152, 0.0152],
         ...,
         [0.0188, 0.0190, 0.0159,  ..., 0.0157, 0.0157, 0.0157],
         [0.0174, 0.0184, 0.0182,  ..., 0.0180, 0.0180, 0.0180],
         [0.0222, 0.0201, 0.0171,  ..., 0.0214, 0.0214, 0.0214]],

        [[0.0192, 0.0211, 0.0218,  ..., 0.0184, 0.0184, 0.0184],
         [0.0151, 0.0183, 0.0154,  ..., 0.0171, 0.0171, 0.0171],
         [0.0201, 0.0181, 0.0200,  ..., 0.0138, 0.0138, 0.0138],
         ...,
         [0.0194, 0.0197, 0.0149,  ..., 0.0149, 0.0149, 0.0149],
         [0.0177, 0.0186, 0.0187,  ..., 0.0187, 0.0187, 0.0187],
         [0.0231, 0.0205, 0.0158,  ..., 0.0227, 0.0227, 0.0227]],

        [[0.0188, 0.0183, 0.0196,  ..., 0.0182, 0.0182, 0.0182],
         [0.0145, 0.0159, 0.0195,  ..., 0.0169, 0.0169, 0.0169],
         [0.0205, 0.0174, 0.0184,  ..., 0.0131, 0.0131, 0.

In [29]:
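def average_non_padding(loss, y):
    # Average over all non-padding symbols of the batch. pad_symbol (0) is also the
    # UNK index of Vocabulary, so unknown targets are excluded as well.
    mask = (y != pad_symbol).float()
    return (loss * mask).sum() / mask.sum()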

In [68]:
for epoch in range(n_epochs):
    model.train()
    t = tqdm(train_loader, position=0)
    for i, (x, y) in enumerate(t):
        optimizer.zero_grad()
        output = model(x)
        loss = average_non_padding(criterion(output, y), y)
        loss.backward()
        optimizer.step()
        
        # Number of training examples seen so far (epochs are counted from 0).
        global_step = epoch * len(train_dataset) + (i + 1) * batch_size
        if i % log_every == 0:
            t.set_description(f"Loss: {loss.item()}")
            writer.add_scalar('training_loss', loss.item(), global_step)
        
        if i * batch_size % eval_every < batch_size:
            model.eval()
            eval_loss = 0
            # Evaluation needs no gradients; separate names keep the training-loop
            # variables (i, x, y, loss) from being overwritten.
            with torch.no_grad():
                for dev_x, dev_y in dev_loader:
                    dev_output = model(dev_x)
                    eval_loss += average_non_padding(criterion(dev_output, dev_y), dev_y).item()
            writer.add_scalar('eval_loss', eval_loss / len(dev_loader), global_step)
            model.train()  # switch back to training mode after evaluation

Loss: 4.017172813415527: 100%|██████████| 157/157 [00:24<00:00,  6.41it/s] 
Loss: 4.016729354858398: 100%|██████████| 157/157 [00:23<00:00,  6.65it/s] 
Loss: 4.020309925079346:  50%|█████     | 79/157 [00:13<00:08,  8.96it/s] 

KeyboardInterrupt: 

**2.6 (5 points)** Write a function **predict_on_batch** that outputs letter probabilities of all words in the batch.
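A possible sketch of such a function, assuming the model returns probabilities of shape `(batch, vocab_size, seq_len)`, as `RNNLM.forward` above does. `UniformModel` is a hypothetical stand-in for a trained model and serves only to make the example runnable; the `pad_index` parameter is likewise an assumption.

```python
import torch
import torch.nn as nn


def predict_on_batch(model, x, y, pad_index=0):
    """Return the probability assigned to each target symbol, shape (batch, seq_len)."""
    model.eval()
    with torch.no_grad():
        probs = model(x)                       # (batch, vocab_size, seq_len)
    # At every position pick the probability of the symbol that actually follows.
    target_probs = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    # Zero out padding positions so that they are easy to skip when printing.
    return target_probs * (y != pad_index).float()


class UniformModel(nn.Module):
    """Hypothetical stand-in: a uniform distribution over all symbols."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, inputs):
        batch, seq_len = inputs.shape
        return torch.full((batch, self.vocab_size, seq_len), 1.0 / self.vocab_size)


x = torch.tensor([[1, 3, 4], [1, 5, 0]])   # inputs, second word padded with 0
y = torch.tensor([[3, 4, 2], [5, 2, 0]])   # targets shifted by one position
letter_probs = predict_on_batch(UniformModel(vocab_size=8), x, y)
print(letter_probs)
```

Every real letter gets probability 1/8 here, and the padded position gets 0.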

In [None]:
"""
== YOUR CODE HERE ==
"""

**2.7 (1 point)** Calculate the letter probabilities for all words in the test dataset. Print them for the last 20 words. Do not forget to disable shuffling in the ``DataLoader``.

In [None]:
"""
== YOUR CODE HERE ==
"""

**2.8 (5 points)** Write a function that generates a single word (sequence of indexes) given the model. Do not forget about the hidden state! Be careful about start and end symbol indexes. Use ``torch.multinomial`` for sampling.
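The sampling loop could look roughly as follows. This is a sketch under one important assumption: the model returns the hidden state together with the logits, which `RNNLM` above does not do. `TinyCharLM` is therefore a hypothetical, untrained model defined only for this illustration.

```python
import torch
import torch.nn as nn


class TinyCharLM(nn.Module):
    """Hypothetical model for illustration; forward also returns the hidden state."""

    def __init__(self, vocab_size, embeddings_dim=8, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embeddings_dim)
        self.rnn = nn.GRU(embeddings_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs, hidden=None):
        states, hidden = self.rnn(self.embedding(inputs), hidden)
        return self.out(states), hidden        # logits: (batch, seq_len, vocab_size)


def generate(model, max_length=20, start_index=1, end_index=2):
    """Sample one word as a list of indexes, without the start and end symbols."""
    model.eval()
    symbol = torch.tensor([[start_index]])     # (batch=1, seq_len=1)
    hidden = None
    word = []
    with torch.no_grad():
        for _ in range(max_length):
            # Feed only the last symbol and carry the hidden state forward.
            logits, hidden = model(symbol, hidden)
            probs = torch.softmax(logits[:, -1, :], dim=-1)    # (1, vocab_size)
            symbol = torch.multinomial(probs, num_samples=1)   # (1, 1)
            index = symbol.item()
            if index == end_index:
                break
            word.append(index)
    return word


torch.manual_seed(0)
model = TinyCharLM(vocab_size=10)
word = generate(model)
print(word)
```

Passing `hidden` back into the model on every step is what lets the network condition on the whole prefix while receiving a single symbol at a time.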

In [None]:
def generate(model, max_length=20, start_index=1, end_index=2):
    """
    == YOUR CODE HERE ==
    """

**2.9 (1 point)** Use ``generate`` to sample 20 pseudowords. Do not forget to transform indexes to letters.
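Transforming indexes back to letters only needs the inverse of the symbol-to-index mapping. The sketch below uses a toy mapping laid out like `Vocabulary.word2index`; the helper name `indexes_to_word` is illustrative.

```python
def indexes_to_word(indexes, index2symbol, special=("UNK", "BEGIN", "END")):
    """Join the symbols for the given indexes, skipping special symbols."""
    symbols = (index2symbol[i] for i in indexes)
    return "".join(s for s in symbols if s not in special)


# Toy mapping in the same layout as Vocabulary.word2index.
word2index = {"UNK": 0, "BEGIN": 1, "END": 2, "к": 3, "о": 4, "т": 5}
index2symbol = {index: symbol for symbol, index in word2index.items()}

print(indexes_to_word([1, 3, 4, 5, 2], index2symbol))  # кот
```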

In [None]:
for i in range(20):
    """
    == YOUR CODE HERE ==
    """

**(2.10) 5 points** Write a batched version of the generation function. You should sample the following symbol only for the words that are not finished yet, so apply a boolean mask to trace active words.
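A sketch of the masking idea, under the same assumption as for single-word generation: the model returns `(logits, hidden)`. `TinyCharLM` is a hypothetical, untrained model defined only for this example. For simplicity the model is still run on all rows at every step; only the sampling and the bookkeeping are restricted to active words.

```python
import torch
import torch.nn as nn


class TinyCharLM(nn.Module):
    """Hypothetical model for illustration; forward also returns the hidden state."""

    def __init__(self, vocab_size, embeddings_dim=8, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embeddings_dim)
        self.rnn = nn.GRU(embeddings_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs, hidden=None):
        states, hidden = self.rnn(self.embedding(inputs), hidden)
        return self.out(states), hidden


def generate_batch(model, batch_size, max_length=20, start_index=1, end_index=2):
    """Sample batch_size words at once; returns a list of index lists."""
    model.eval()
    symbols = torch.full((batch_size, 1), start_index, dtype=torch.long)
    hidden = None
    active = torch.ones(batch_size, dtype=torch.bool)   # words not finished yet
    words = [[] for _ in range(batch_size)]
    with torch.no_grad():
        for _ in range(max_length):
            logits, hidden = model(symbols, hidden)
            probs = torch.softmax(logits[:, -1, :], dim=-1)
            # Sample only for active words; finished rows keep the end symbol.
            sampled = torch.full((batch_size,), end_index, dtype=torch.long)
            sampled[active] = torch.multinomial(probs[active], num_samples=1).squeeze(1)
            # A word becomes inactive once it samples the end symbol.
            active = active & (sampled != end_index)
            for row in active.nonzero().flatten().tolist():
                words[row].append(sampled[row].item())
            if not active.any():
                break
            symbols = sampled.unsqueeze(1)
    return words


torch.manual_seed(0)
words = generate_batch(TinyCharLM(vocab_size=10), batch_size=4)
print(words)
```

Words that never sample the end symbol are cut off at `max_length` symbols.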

In [None]:
def generate_batch(model, batch_size, max_length = 20, start_index=1, end_index=2):
    """
    == YOUR CODE HERE ==
    """

In [None]:
generated = []
for _ in range(2):
    generated += generate_batch(model, batch_size=10)
"""
== YOUR CODE HERE ==
"""
for elem in transformed:
    print("".join(elem))

**(2.11) 5 points** Experiment with the type of RNN, number of layers, units and/or dropout to improve the perplexity of the model.
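Perplexity is the exponent of the average negative log-likelihood per symbol, so it can be read directly off the loss as long as the loss uses the natural logarithm and is averaged over non-padding symbols. The snippet below is a small illustration; the function name is an assumption. It also shows that a loss of about 4.017, as logged during training above, corresponds to a perplexity close to the vocabulary size of 55, which is the level of uniform guessing.

```python
import math


def perplexity(mean_nll):
    """Perplexity for a mean negative log-likelihood per symbol (natural log)."""
    return math.exp(mean_nll)


vocab_size = 55
uniform_ppl = perplexity(math.log(vocab_size))   # uniform guessing over 55 symbols
logged_ppl = perplexity(4.017)                   # loss value from the training log

print(round(uniform_ppl, 2), round(logged_ppl, 2))
```

A lower perplexity after changing the architecture means the model on average spreads its probability mass over fewer candidate symbols.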