## Task 2 (6 points)

This task is about text generation. You have to:

**A**. Create a text corpus of texts with similar vocabulary (for instance, books from the same genre or written by the same author). The corpus should contain approximately 1M words. You can consider the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), or parts of BookCorpus (https://github.com/soskek/bookcorpus), but feel free to use others. Texts can be in English, Polish, or any other language you know.

**B**. Choose the tokenization procedure. It should have two stages:

1. Word tokenization (you can use nltk.tokenize.word_tokenize, the tokenizer from spaCy, PyTorch, Keras, ...). Test your tokenizer on your corpus and look at the set of tokens containing both letters and special characters. If, in your opinion, some of them should be treated as a sequence of tokens, modify the tokenization procedure.

2. Sub-word tokenization (you can either use an existing procedure, such as WordPiece or SentencePiece, or create something yourself). Here is a simple idea: take the 8K most popular words (W), the 1K most popular suffixes (S), and the 1K most popular prefixes (P). Words in W are their own tokens. A word x outside W should be tokenized as 'p_ _s', where p is the longest prefix of x in P and s is the longest suffix of x in S.
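
The simple prefix/suffix idea can be sketched roughly as below. This is only an illustration under several assumptions, not a reference solution: the names `build_vocab` and `tokenize_word`, the affix length cap `max_affix`, and the `<unk>` fallback for words with no known prefix or suffix are hypothetical and not part of the task statement. The sketch reads `s` as the longest suffix of `x` found in S.

```python
from collections import Counter


def build_vocab(words, n_words=8000, n_affixes=1000, max_affix=4):
    """Collect the most frequent words (W), prefixes (P) and suffixes (S)."""
    word_counts = Counter(words)
    prefix_counts, suffix_counts = Counter(), Counter()
    for word, count in word_counts.items():
        # Only proper affixes are counted: they are shorter than the word.
        for k in range(1, min(max_affix, len(word) - 1) + 1):
            prefix_counts[word[:k]] += count
            suffix_counts[word[-k:]] += count
    W = {w for w, _ in word_counts.most_common(n_words)}
    P = {p for p, _ in prefix_counts.most_common(n_affixes)}
    S = {s for s, _ in suffix_counts.most_common(n_affixes)}
    return W, P, S


def tokenize_word(x, W, P, S):
    """Return [x] for words in W, otherwise the pair ['p_', '_s']."""
    if x in W:
        return [x]
    # Try the longest candidates first, so the first hit is the longest affix.
    lengths = range(len(x) - 1, 0, -1)
    prefix = next((x[:k] for k in lengths if x[:k] in P), None)
    suffix = next((x[-k:] for k in lengths if x[-k:] in S), None)
    if prefix is None or suffix is None:
        # Hypothetical fallback; the task statement does not cover this case.
        return ["<unk>"]
    return [prefix + "_", "_" + suffix]
```

Note that this scheme is lossy: the middle of a word outside W is dropped, so a generator built on it has to map each 'p_ _s' pair back to a real corpus word when producing text.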

**C**. Write a text generation procedure. The procedure should fulfill the following requirements:

1. it should use the RNN language model (trained on sub-word tokens)
2. generated tokens should be presented as text consisting of words (without extra spaces or other extra characters, such as the begin-of-word markers introduced during tokenization)
3. all words in a generated text should belong to the corpus (note that this is not guaranteed by the LSTM)
4. Top-P sampling should be used during generation (see NN-NLP.6, slide X)
5. every token 3-gram in a generated text should be unique
6. *(optionally, +1 point)* all token bigrams in generated texts should occur in the corpus
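
Requirements 4 and 5 can be combined in a single sampling step. The sketch below is a minimal NumPy illustration, not the notebook's implementation: the function names `top_p_filter`, `block_repeated_trigrams` and `sample_next` are hypothetical, and it assumes `probs` is an already normalized probability vector over token ids and that blocking never removes every token.

```python
import numpy as np


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = np.argsort(-probs)              # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Keep everything up to and including the token that crosses top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


def block_repeated_trigrams(probs, history, seen_trigrams):
    """Zero out tokens that would complete an already used token 3-gram."""
    probs = probs.copy()
    if len(history) >= 2:
        last_two = (history[-2], history[-1])
        for a, b, c in seen_trigrams:
            if (a, b) == last_two:
                probs[c] = 0.0
    # Assumes at least one token survives the blocking.
    return probs / probs.sum()


def sample_next(probs, history, seen_trigrams, top_p=0.9, rng=None):
    """Sample one token id, then update history and the set of seen 3-grams."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    probs = block_repeated_trigrams(probs, history, seen_trigrams)
    probs = top_p_filter(probs, top_p)
    token = int(rng.choice(len(probs), p=probs))
    if len(history) >= 2:
        seen_trigrams.add((history[-2], history[-1], token))
    history.append(token)
    return token
```

Blocking is applied before the Top-P filter, so the nucleus is computed over tokens that are still allowed. The optional bigram requirement could be handled the same way, by zeroing every token `c` for which `(history[-1], c)` is not a corpus bigram.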

In [1]:
from __future__ import unicode_literals, print_function, division

import re
from io import open

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk import ngrams
from nltk.tokenize import word_tokenize
from tokenizers import BertWordPieceTokenizer
from torch.utils.data import DataLoader

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
SEQUENCE_LENGTH = 15

In [3]:
def preprocess(path):
    with open(path) as f:
        data = f.read()
    # Skip the first 165 tokens, which hold the Project Gutenberg credits
    tokenized_data = word_tokenize(data)[165:]
    # Lowercase every token and strip all non-alphanumeric characters
    tokenized_data = [re.sub('[^A-Za-z0-9]+', '', w.lower()) for w in tokenized_data]
    # Drop tokens that became empty (pure punctuation)
    tokenized_data = [w for w in tokenized_data if len(w)]
    return tokenized_data

def sub_word_tokenize(data):
    tokenizer = BertWordPieceTokenizer()
    tokenizer.train_from_iterator(data, vocab_size=10000)
    tokenized_data = tokenizer.encode(" ".join(data))
    return tokenizer, tokenized_data

In [4]:
class ShakespeareDataset(torch.utils.data.Dataset):
    def __init__(self, data_ids, sequence_length, device):

        self.words_indexes = data_ids
        self.sequence_length = sequence_length
        self.device = device

    def __len__(self):
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.words_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )

In [5]:
class LSTMModel(nn.Module):
    def __init__(self, n_vocab, device):
        super(LSTMModel, self).__init__()
        self.lstm_size = 512
        self.embedding_dim = 100
        self.num_layers = 2
        self.device = device

        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))

In [6]:
data = preprocess('/home/maria/Documents/NLP/data/assignment_5/pg100.txt')
tokenizer, tokenized_data = sub_word_tokenize(data)

In [7]:
dataset = ShakespeareDataset(tokenized_data.ids, SEQUENCE_LENGTH, device)
model = LSTMModel(10000, device)
model.to(device)
# trained for 20 epochs ~4h?
model.load_state_dict(torch.load('/home/maria/Documents/NLP/data/assignment_5/shakespeare_2.model', map_location=device))

<All keys matched successfully>

In [8]:
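# Map the id of a word's first sub-word piece to the ids of its remaining pieces.
# If several words share a first piece, the continuation seen last wins.
prefixes_sufixes = {}
for word in data:
    e = tokenizer.encode(word)
    if len(e.ids) > 1:
        prefixes_sufixes[e.ids[0]] = e.ids[1:]
# Ids of all WordPiece continuation tokens (those starting with "##")
sufixes_ids = []
for i in range(10000):
    if tokenizer.id_to_token(i)[:2] == "##":
        sufixes_ids.append(i)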

In [9]:
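def top_p_sampling(p, top_p):
    # p already holds probabilities (not logits), so no second softmax is applied
    sorted_probs, sorted_indices = torch.sort(torch.from_numpy(p), descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    sorted_indices_to_remove = cumulative_probs > top_p
    # Shift the mask right so the token that crosses top_p is kept
    # and at least one token always survives
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False
    p[sorted_indices[sorted_indices_to_remove].numpy()] = 0.0
    p = p/p.sum()
    return p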

In [10]:
def predict(model, tokenizer, text, prefixes_sufixes, next_words=15):
    model.eval()

    words = tokenizer.encode(text).ids
    state_h, state_c = model.init_state(len(words))

    # Pending sub-word ids that would complete the word started by the last token
    continuation = []
    existing_3_grams = set(ngrams(words, 3))
    for i in range(next_words):

        # Sliding window with the same length as the prompt
        x = torch.tensor([words[i:]])
        x = x.to(device)

        with torch.no_grad():
            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
        next_from_continuation = -1
        if tokenizer.decode([words[-1]]) not in data and len(continuation) > 0:
            # The last token is not a corpus word on its own: force the next stored piece
            words.append(continuation.pop(0))
        else:
            last_word_logits = y_pred[0][-1]
            p = torch.nn.functional.softmax(last_word_logits, dim=0).detach().cpu().numpy()

            # Suffix tokens are forbidden, except the one that continues the current word
            to_exclude = set(sufixes_ids)
            if len(continuation) > 0:
                next_from_continuation = continuation.pop(0)
                to_exclude = to_exclude - {next_from_continuation}

            # Forbid tokens that would repeat an already seen 3-gram
            to_exclude = list(to_exclude)
            current_2gram = words[-2:]
            for ngram in existing_3_grams:
                if ngram[0] == current_2gram[0] and ngram[1] == current_2gram[1]:
                    to_exclude.append(ngram[2])

            p[np.isin(np.arange(len(p)), to_exclude)] = 0
            p = p/p.sum()
            p = top_p_sampling(p, 0.9)
            word_index = int(np.random.choice(len(last_word_logits), p=p))
            words.append(word_index)
            if not word_index == next_from_continuation:
                continuation = []
            if word_index in prefixes_sufixes:
                # Copy, so that pop() does not mutate the shared dictionary entry
                continuation = list(prefixes_sufixes[word_index])
        existing_3_grams.add(tuple(words[-3:]))
    # Complete the final word if generation stopped in the middle of it
    if tokenizer.decode([words[-1]]) not in data and len(continuation) > 0:
        words.extend(continuation)
    return words, tokenizer.decode(words)

In [11]:
prompt = "to be or not to be"
words, text = predict(model, tokenizer, prompt, prefixes_sufixes, 30)

In [12]:
print(text)
for w in text.split():
    print(w, w in data)

to be or not to be that seed this room nor kneel d duty o err tars water on such car and see poor negligent brethren then white 6 d in the way thick d
to True
be True
or True
not True
to True
be True
that True
seed True
this True
room True
nor True
kneel True
d True
duty True
o True
err True
tars False
water True
on True
such True
car True
and True
see True
poor True
negligent True
brethren True
then True
white True
6 True
d True
in True
the True
way True
thick True
d True


In [13]:
prompt = "our doubts are traitors and make"
words, text = predict(model, tokenizer, prompt, prefixes_sufixes, 30)
print(text)

our doubts are traitors and make a reason follow when i should look my vow i have sat up my prison to the alasterab shore let my proud clear rage smile once more that


In [14]:
prompt = "it never hurts to keep looking for sunshine"
words, text = predict(model, tokenizer, prompt, prefixes_sufixes, 30)
print(text)

it never hurts to keep looking for sunshine of soldier where you are with welcome instateig d here mansion enter the old countess and attendant distur on the ground in his life princess of arc servant
