# Assigment 5

**Submission deadlines**:

* last lab before 27.06.2022 

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1uufpGn46Mwv4oBwajIeOj4rvAK96iaS-?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Consider the vowel reconstruction task -- i.e. inserting missing vowels (aeuioy) to obtain proper English text. For instance for the input sentence:

<pre>
h m gd smbd hs stln ll m vwls
</pre>

the best result is

<pre>
oh my god somebody has stolen all my vowels
</pre>

In this task both dev and test data come from the two books about Winnie-the-Pooh. You have to train two RNN Language Models on *pooh-train.txt*. For the first model use the code below, for the second choose different hyperparameters (different dropout, smaller number of units or layers, or just do any modification you want). 

The code below is based on
https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html

In [1]:
import torch
from collections import Counter
import numpy as np
import gensim.utils as utils
from torch.utils.data import DataLoader
from torch import nn, optim
from collections import defaultdict as dd

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = torch.device('cpu')

SEQUENCE_LENGTH = 15

class PoohDataset(torch.utils.data.Dataset):
    def __init__(self, sequence_length, device, data_file = './data/pooh_train.txt', words_file = "./data/pooh_words.txt"):
        txt = open(data_file).read()
        
        self.words = utils.simple_preprocess(txt, min_len=1)

        self.uniq_words = set(utils.simple_preprocess(open(words_file).read(), min_len=1))

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]
        self.sequence_length = sequence_length
        self.device = device


    def get_uniq_words(self):
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.words_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )
        
pooh_dataset = PoohDataset(SEQUENCE_LENGTH, device)

In [3]:
class LSTMModel(nn.Module):
    def __init__(self, dataset, device, lstm_size = 512, embedding_dim = 100, num_layers = 2, dropout = 0.2):
        super(LSTMModel, self).__init__()
        self.lstm_size = lstm_size
        self.embedding_dim = embedding_dim
        self.num_layers = num_layers
        self.device = device
        

        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=dropout,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))
        
# model = LSTMModel(pooh_dataset, device) 
# model.to(device)

In [4]:
def train(dataset, model, batch_size = 512, max_epochs = 30):
    model.train()

    dataloader = DataLoader(dataset, batch_size=batch_size)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(max_epochs):
        state_h, state_c = model.init_state(SEQUENCE_LENGTH)
        
        for batch, (x, y) in enumerate(dataloader):
            optimizer.zero_grad()

            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            loss = criterion(y_pred.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()            

            loss.backward()
            optimizer.step()

        print({ 'epoch': epoch, 'batch': batch, 'loss': loss.item() })
            
# train(pooh_dataset, model)

In [56]:
torch.save(model.state_dict(), './models/pooh_2x512_30ep_v1.model')

In [None]:
model.load_state_dict(torch.load('./models/pooh_2x512_30ep.model'))

In [25]:
# You can use the code if you want

def devowelize(s):
    vowels = set("aoiuye'")
    rv = ''.join(a for a in s if a not in vowels)
    if rv:
        return rv
    return '_' # Symbol for words without consonants

pooh_words = set(utils.simple_preprocess(open('./data/pooh_words.txt').read()))
representation = dd(set)

for w in pooh_words:
    r = devowelize(w)
    representation[r].add(w)
    
hard_words = set()
for r, ws in representation.items():
    if len(ws) > 1:
        hard_words.update(ws)
        
def reconstruction(sentence):
    result = [[]]
    for word in sentence:
        variants = representation[word]
        result = [prefix + [v] for v in variants for prefix in result]

    return result


def predict(dataset, model, text, repeats = 1):

    with torch.no_grad():
        model.eval()

        if type(text) == str:
            words = text.split()
        else:
            words = text

            
        state_h, state_c = model.init_state(len(words))

        variants = reconstruction(words)

        batch_size = 1000

        variants_log_probs_sum = torch.zeros(len(variants))

        for i in range(0, len(variants), batch_size):
            x = torch.tensor([[dataset.word_to_index[w] for w in ws] for ws in variants[i:i+batch_size]], dtype=torch.long)

            x = x.to(device)

            y_pred, _ = model(x, (state_h, state_c))

            log_probs = nn.functional.log_softmax(y_pred, dim=2)[:,:-1,:]

            probs_indices = x[:, 1:]
            correct_words_log_probs = torch.tensor([[log_probs[i,j, probs_indices[i,j]].item() for j in range(probs_indices.shape[1])]
                for i in range(probs_indices.shape[0])])

            variants_log_probs_sum[i:i+batch_size] = correct_words_log_probs.sum(dim=1)

        # variants_log_probs_sum = correct_words_log_probs.sum(dim=1)
        variants_probs = nn.functional.softmax(variants_log_probs_sum, dim=0).numpy()

        variant_picks = np.zeros(len(variants))

        for _ in range(repeats):
            index = np.random.choice(len(variants), p=variants_probs)
            variant_picks[index] += 1

        best_index = np.argmax(variant_picks)

        return ' '.join(variants[0])

You can assume that only words from pooh_words.txt can occur in the reconstructed text. For decoding you have two options (choose one, or implement both ang get **+1** bonus point)

1. Sample reconstructed text several times (with quite a low temperature), choose the most likely result.
2. Perform beam search.

Of course in the sampling procedure you should consider only words matching the given consonants.

Report accuracy of your methods (for both language models). The accuracy should be computed by the following function, it should be *greater than 0.25*.


```python
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    return score / len(original_sequence)
```


In [48]:
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    # print(score)
    return score / len(original_sequence)

def test_sequence(dataset, model, seq):
    devowelized = list(map(devowelize, seq))

    reconstructed = predict(dataset, model, devowelized, repeats=30)
    
    # print(f"{' '.join(seq)} | {reconstructed}")

    return accuracy(seq, reconstructed.split())
    

In [49]:
test_file = open("data/pooh_test.txt")
test_data = utils.simple_preprocess(test_file.read())
print(test_data[:20])

['it', 'missage', 'he', 'said', 'to', 'himself', 'that', 'what', 'it', 'is', 'and', 'that', 'letter', 'is', 'and', 'so', 'is', 'that', 'and', 'so']


In [57]:
from tqdm import tqdm

def sample_data(data, n_samples, seq_length):
    indices = np.random.choice(len(data)-seq_length, n_samples)

    return [data[i:i+seq_length] for i in indices]

seq_len = 5
data = sample_data(test_data, 100, seq_len)

score = np.mean([test_sequence(pooh_dataset, model, seq) for seq in data])

print(f"Base model score: {score}")


Base model score: 0.53


In [62]:
model_2 = LSTMModel(pooh_dataset, device, lstm_size=256, embedding_dim=50, num_layers=1) 
model_2.to(device)



LSTMModel(
  (embedding): Embedding(2581, 50)
  (lstm): LSTM(50, 256, dropout=0.2)
  (fc): Linear(in_features=256, out_features=2581, bias=True)
)

In [63]:
train(pooh_dataset, model_2)

{'epoch': 0, 'batch': 89, 'loss': 6.095155239105225}
{'epoch': 1, 'batch': 89, 'loss': 5.66037654876709}
{'epoch': 2, 'batch': 89, 'loss': 5.257487773895264}
{'epoch': 3, 'batch': 89, 'loss': 4.873077392578125}
{'epoch': 4, 'batch': 89, 'loss': 4.4854044914245605}
{'epoch': 5, 'batch': 89, 'loss': 4.171422958374023}
{'epoch': 6, 'batch': 89, 'loss': 3.836507797241211}
{'epoch': 7, 'batch': 89, 'loss': 3.4898197650909424}
{'epoch': 8, 'batch': 89, 'loss': 3.2256109714508057}
{'epoch': 9, 'batch': 89, 'loss': 2.9390370845794678}
{'epoch': 10, 'batch': 89, 'loss': 2.67284893989563}
{'epoch': 11, 'batch': 89, 'loss': 2.414641857147217}
{'epoch': 12, 'batch': 89, 'loss': 2.208237409591675}
{'epoch': 13, 'batch': 89, 'loss': 1.9928768873214722}
{'epoch': 14, 'batch': 89, 'loss': 1.7707672119140625}
{'epoch': 15, 'batch': 89, 'loss': 1.5765235424041748}
{'epoch': 16, 'batch': 89, 'loss': 1.399997591972351}
{'epoch': 17, 'batch': 89, 'loss': 1.2500643730163574}
{'epoch': 18, 'batch': 89, 'loss

In [64]:
score = np.mean([test_sequence(pooh_dataset, model_2, seq) for seq in data])

print(f"2nd model score: {score}")


2nd model score: 0.53


## Task 2 (6 points)

This task is about text generation. You have to:

**A**. Create text corpora containing texts with similar vocabulary (for instance books from the same genre, or written by the same author). This corpora should have approximately 1M words. You can consider using the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), parts of BookCorpus, https://github.com/soskek/bookcorpus, but generally feel free. Texts could be in English, Polish or any other language you know.

**B**. choose the tokenization procedure. It should have two stages:

1. word tokenization (you can use nltk.tokenize.word_tokenize, tokenizer from spaCy, pytorch, keras, ...). Test your tokenizer on your corpora, and look at a set of tokens containing both letters and special characters. If some of them should be in your opinion treated as a sequence of tokens, then modify the tokenization procedure

2. sub-word tokenization (you can either use the existing procedure, like wordpiece or sentencepiece, or create something by yourself). Here is a simple idea: take 8K most popular words (W), 1K most popular suffixes (S), and 1K most popular prefixes (P). Words in W are its own tokens. Word x outside W should be tokenized as 'p_ _s' where p is the longest prefix of x in P, and s is the longest suffix of x in S

**C**. write text generation procedure. The procedure should fulfill the following requirements:

1. it should use the RNN language model (trained on sub-word tokens)
2. generated tokens should be presented as a text containing words (without extra spaces, or other extra characters, as begin-of-word introduced during tokenization)
3. all words in a generated text should belond to the corpora (note that this is not guaranteed by LSTM)
4. in generation Top-P sampling should be used (see NN-NLP.6, slide X) 
5. in generated texts every token 3-gram should be uniq
6. *(optionally, +1 point)* all token bigrams in generated texts occur in the corpora

In [5]:
# Nie dokończyłem 2 zadania, BooksDataset dzieli na tokeny w sposób opisany w B2

import nltk

W_SIZE = 8000
PS_SIZE = 1000


class BooksDataset(torch.utils.data.Dataset):
    def __init__(self, sequence_length, device, file_paths):
        txt = "\n\n\n\n".join([open(f).read() for f in file_paths])
        
        self.words = nltk.tokenize.word_tokenize(txt)

        self.unique_words = self.get_unique_words()

        self.tokens = self.subword_tokenization()

        self.unique_tokens = self.get_unique_tokens()

        self.uniq_words = self.unique_tokens

        self.index_to_token = {index: token for index, token in enumerate(self.unique_tokens)}
        self.token_to_index = {token: index for index, token in enumerate(self.unique_tokens)}

        self.tokens_indexes = [self.token_to_index[t] for t in self.tokens]
        
        self.sequence_length = sequence_length
        self.device = device


    def subword_tokenization(self):
        word_counts = Counter(self.words)
        self.word_tokens = set(sorted(word_counts, key=word_counts.get, reverse=True)[:W_SIZE])
        self.prefix_tokens = self.get_prefix_tokens()
        self.suffix_tokens = self.get_suffix_tokens()
        self.unique_tokens = self.word_tokens | self.prefix_tokens | self.suffix_tokens

        return [t for w in self.words for t in self.tokenize_word(w)]


    def tokenize_word(self, word):
        if word in self.word_tokens:
            return [word, " "]

        wl = len(word)

        possible_tokenizes = [[word[:i], word[i:j], word[j:]] for i in range(wl) for j in range(i, wl)
            if word[:i] in self.prefix_tokens and word[j:] in self.suffix_tokens]


        if len(possible_tokenizes) == 0:
            return [word, " "]

        index = np.argmax([len(t[0])**2 + len(t[2])**2 for t in possible_tokenizes])

        return possible_tokenizes[index] + [" "]
        
        
    def get_most_popular(self, count):
        return set(sorted(count, key= lambda key: (count.get(key), len(key)), reverse=True)[:PS_SIZE])
        

    def get_prefix_tokens(self):
        prefix_count = dd(int)

        for word in self.words:
            for i in range(1, len(word)):
                prefix_count[word[:i]] += 1

        result = self.get_most_popular(prefix_count)

        return result

    def get_suffix_tokens(self):
        suffix_count = dd(int)

        for word in self.words:
            for i in range(0, len(word)-1):
                suffix_count[word[i:]] += 1

        result = self.get_most_popular(suffix_count)

        return result

    def get_unique_tokens(self):
        token_counts = Counter(self.tokens)
        return sorted(token_counts, key=token_counts.get, reverse=True)

    def get_unique_words(self):
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.tokens_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.tokens_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.tokens_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )
        

In [12]:
import os

# book_paths = ["./data/books/" + title for title in os.listdir("./data/books/")][:5]
book_paths = ["./data/books/Heretics.txt", "./data/books/Orthodoxy.txt"]

print(book_paths)

books_dataset = BooksDataset(SEQUENCE_LENGTH, device, book_paths)

['./data/books/Heretics.txt', './data/books/Orthodoxy.txt']


In [8]:
model = LSTMModel(books_dataset, device) 
model.to(device)

LSTMModel(
  (embedding): Embedding(14422, 100)
  (lstm): LSTM(100, 512, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=512, out_features=14422, bias=True)
)

In [None]:
train(books_dataset, model, max_epochs=10)

In [None]:
torch.save(model.state_dict(), './models/books.model')

In [None]:
def predict(dataset, model, words, next_words=100):
    model.eval()

    state_h, state_c = model.init_state(len(words))

    for i in range(0, next_words):
        x = torch.tensor([[dataset.word_to_index[w] for w in words[i:]]])
        y_pred, (state_h, state_c) = model(x, (state_h, state_c))

        last_word_logits = y_pred[0][-1]
        p = torch.nn.functional.softmax(last_word_logits, dim=0).detach().numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(dataset.index_to_word[word_index])

    return words    

## Task 3

In this task you have to create a network which looks at characters of the word and tries to guess whether the word is a noun, a verb, an adjective, and so on. To be more precise: the input is a word (without context), the output is a POS-tag (Part-of-Speech). Since some words are unambiguous, and we have no context, our network is supposed to return the set of possible tags.

The data is taken from Universal Dependencies English corpus, and of course it contains errors, especially because not all possible tags occured in the data.

Train a network (4p) or two networks (+2p) solving this task. Both networks should look at character n-grams occuring in the word. There are two options:

* **Fixed size:** for instance take 2,3, and 4-character suffixes of the word, use them as  features (whith 1-hot encoding). You can also combine prefix and suffix features. Simple, useful trick: when looking at suffixes, add some '_' characters at the beginning of the word to guarantee that shorter words have suffixes of a desired length.

* **Variable size:** take for instance 4-grams (or 4 grams and 3-grams), use Deep Averaging Network. Simple trick: add extra character at the beginning and at the end of the word, to add the information, that ngram occurs at special position ('ed' at the end has slightly different meaning that 'ed' in the middle)


## Task 4

Apply seq2seq model (you can modify the code from this tutorial: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) to compute grapheme to phoneme conversion for English. Train the model on dev_cmu_dict.txt and test it on test_cmu_dict.txt. Report accuracy of your solution using two metrics:
* exact match (how many words are perfectly converted to phonemes)
* exact match without stress (how many words are perfectly converted to phonemes when we remove the information about stress)
