## Initialization

# Data

In [1]:
train_path = '/dgm_for_text_data/02-21.10way.clean'
valid_path = '/dgm_for_text_data/22.auto.clean'
test_path  = '/dgm_for_text_data/23.auto.clean'

Our data files contain sentences from the Penn Treebank (PTB) data set. Each line holds one sentence stored as a bracketed parse tree. Let's print the first line from the training set to see what we're dealing with!

In [2]:
import os
from nltk import Tree
from nltk.treeprettyprinter import TreePrettyPrinter

def filereader(path:str):
    """
        Opens a PTB data file and yields one line at a time.
    """
    with open(os.getcwd() + path, mode='r') as f:
        for line in f:
            yield line

In [3]:
def convert_to_sentence(line: str):
    """
        Takes in a line from a PTB data file and returns the sentence as a lower-case string.
    """
    tree = Tree.fromstring(line)
    sentence = ' '.join(tree.leaves()).lower()
    return sentence

#### Let's see how our data looks:

In [4]:
line = next(filereader(train_path))
print(f'Original: {line}')
print(f'Parsed: {convert_to_sentence(line)}')

Original: (TOP (S (PP (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP (DT The) (NN Misanthrope)) ('' '') (PP (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S (NP (JJ Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NNP Stage)) (PP (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP (NNP Leisure) (CC &) (NNP Arts)) (-RRB- -RRB-)))) (, ,) (NP (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (PP (IN by) (NP (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP (RB mistakenly)) (VBN attributed) (PP (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))

Parsed: in an oct. 19 review of `` the misanthrope '' at chicago 's goodman theatre -lrb- `` revitalized classics take the stage in windy city , '' leisure & arts -rrb- , the role of celimene , played by kim cattrall , was mistakenly attributed to christina haag .


#### Creating our datasets

We have the data. Now we want to create data loaders so we can shuffle and batch it. For this we will use PyTorch's built-in classes.

In [5]:
# We have a training, validation and test data set. For each, we need a list of sentences.
train_sents = [convert_to_sentence(l) for l in filereader(train_path)]
valid_sents = [convert_to_sentence(l) for l in filereader(valid_path)]
test_sents = [convert_to_sentence(l) for l in filereader(test_path)]

Our models need tensors rather than raw text. We therefore convert each sentence to a tensor in which every word is represented by a number. We also want special tokens, and we limit the vocabulary so that the parameter space stays manageable. To do all this, we use the WordTokenizer class from the *tokenizers.py* file provided by the NLP2 team.

*Note*: We add special tokens such as BOS, EOS and UNK to our sentences. The BOS and EOS tokens are important because they tell the model where a sentence begins and ends. UNK replaces every word that falls outside the vocabulary.

In the code block below, add_special_tokens is set to True, which puts a BOS token at the start and an EOS token at the end of each sentence; they appear as the IDs 1 and 2 in the encoded output. We decode with skip_special_tokens=False, so the special tokens stay visible in the decoded sentences.
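
To make the role of the special tokens concrete, here is a minimal sketch of a word-level tokenizer with a capped vocabulary. `ToyTokenizer` is a hypothetical stand-in, not the NLP2 team's WordTokenizer; the IDs 1, 2 and 3 for BOS, EOS and UNK are taken from the printed output in this notebook, while the `[PAD]` name for ID 0 and the frequency-based vocabulary selection are assumptions.

```python
from collections import Counter

# Special-token IDs chosen to match the printed output in this notebook
# (1 = BOS, 2 = EOS, 3 = UNK). Using ID 0 for "[PAD]" is an assumption.
SPECIALS = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]


class ToyTokenizer:
    """Hypothetical whitespace tokenizer with a capped vocabulary."""

    def __init__(self, sentences, max_vocab_size):
        counts = Counter(w for s in sentences for w in s.split())
        keep = max_vocab_size - len(SPECIALS)
        # Sort by frequency, then alphabetically, so the IDs are deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        words = [w for w, _ in ranked[:keep]]
        self.id_to_word = SPECIALS + words
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}

    def encode(self, sentence, add_special_tokens=True):
        unk = self.word_to_id["[UNK]"]
        # Words outside the vocabulary map to the UNK ID.
        ids = [self.word_to_id.get(w, unk) for w in sentence.split()]
        if add_special_tokens:
            ids = [self.word_to_id["[BOS]"]] + ids + [self.word_to_id["[EOS]"]]
        return ids

    def decode(self, ids, skip_special_tokens=False):
        words = [self.id_to_word[i] for i in ids]
        if skip_special_tokens:
            # In this sketch all four special tokens are dropped, UNK included.
            words = [w for w in words if w not in SPECIALS]
        return " ".join(words)


toy = ToyTokenizer(["the cat sat", "the dog sat"], max_vocab_size=6)
ids = toy.encode("the bird sat")
print(ids)
print(toy.decode(ids))
```

With a vocabulary capped at six entries, only the two most frequent words survive next to the four special tokens, so "bird" is encoded as UNK. The same effect explains the `[UNK]` entries for rare words such as "misanthrope" in the real output.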

In [6]:
from tokenizers import WordTokenizer

# How big we want our vocabulary to be
vocab_size = 10000

# Create and train our tokenizer with a relatively small vocabulary of 10000 words. Credits to the NLP2 team for creating this tokenizer.
tokenizer = WordTokenizer(train_sents, max_vocab_size=vocab_size)

# Check that the tokenizer encodes and decodes our sentences correctly, using the first 5 sentences of the training set.
for sentence in train_sents[:5]:
    tokenized = tokenizer.encode(sentence, add_special_tokens=True)
    sentence_decoded = tokenizer.decode(tokenized, skip_special_tokens=False) 

    print('original: ' + sentence)
    print(f'{"-"*10}')
    print('tokenized: ', tokenized)
    print(f'{"-"*10}')
    print('decoded: ' + sentence_decoded)
    print(f'{"-"*10}')
    print('\n\n')

original: in an oct. 19 review of `` the misanthrope '' at chicago 's goodman theatre -lrb- `` revitalized classics take the stage in windy city , '' leisure & arts -rrb- , the role of celimene , played by kim cattrall , was mistakenly attributed to christina haag .
----------
tokenized:  [1, 4754, 926, 6325, 185, 7745, 6332, 584, 9060, 3, 10, 1161, 2002, 16, 4273, 9063, 22, 584, 3, 3, 8917, 9060, 8556, 4754, 3, 2067, 18, 10, 5363, 8, 1113, 24, 18, 9060, 7826, 6332, 3, 18, 6822, 1744, 5193, 3, 18, 9725, 5944, 1200, 9167, 3, 3, 25, 2]
----------
decoded: [BOS] in an oct. 19 review of `` the [UNK] '' at chicago 's goodman theatre -lrb- `` [UNK] [UNK] take the stage in [UNK] city , '' leisure & arts -rrb- , the role of [UNK] , played by kim [UNK] , was mistakenly attributed to [UNK] [UNK] . [EOS]
----------



original: ms. haag plays elianti .
----------
tokenized:  [1, 6042, 3, 6826, 3, 25, 2]
----------
decoded: [BOS] ms. [UNK] plays [UNK] . [EOS]
----------



original: rolls-royce mo

#### Creating custom pytorch data sets

To work with PyTorch data loaders, we need a custom PyTorch dataset.

In [7]:
from torch.utils.data import Dataset

class PTBDataset(Dataset):
    """
        A custom PTB dataset. 
    """
    def __init__(self, sentences: list, tokenizer: WordTokenizer):
        self.sentences = sentences
        self.tokenizer = tokenizer
    
    def __len__(self):
        """
            Return the length of the dataset.
        """
        return len(self.sentences)

    def __getitem__(self, idx: int):
        """
            Returns a tokenized item at position idx from the dataset.
        """
        item = self.sentences[idx]
        tokenized = self.tokenizer.encode(item, add_special_tokens=True)
        return tokenized

In [8]:
# We instantiate the datasets.
train_set = PTBDataset(train_sents, tokenizer)
valid_set = PTBDataset(valid_sents, tokenizer)
test_set = PTBDataset(test_sents, tokenizer)

# Let's print some information about our datasets
print(f'train/validation/test :: {len(train_set)}/{len(valid_set)}/{len(test_set)}')

train/validation/test :: 39832/1700/2416


Now let's create data loaders that can load, shuffle, and batch our data.
When PyTorch batches our data, it stacks the tensors. However, since our sentences differ in length, their corresponding tensors also have different sizes. To fix this, we pad each sentence in a batch to the length of the longest sentence in that batch. This way, stacking tensors won't be a problem.

In [9]:
from torch.utils.data import DataLoader
import torch

def padded_collate(batch: list):
    """
     Pad each sentence to the length of the longest sentence in the batch
    """
    sentence_lengths = [len(s) for s in batch]
    max_length = max(sentence_lengths)
    padded_batch = [s + [0] * (max_length - len(s)) for s in batch]
    return torch.LongTensor(padded_batch)
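
The padding logic can be checked in isolation. The sketch below is a list-only counterpart of `padded_collate` that runs without torch; `pad_batch` and `padding_mask` are hypothetical helpers, and treating ID 0 as a dedicated padding token is an assumption based on the `[0]` used in `padded_collate`. The mask is one possible way to tell real tokens from padding later on, for example when computing a loss.

```python
PAD_ID = 0  # assumption: ID 0 is reserved for padding, as in padded_collate


def pad_batch(batch):
    """List-only counterpart of padded_collate (hypothetical helper)."""
    max_length = max(len(s) for s in batch)
    return [s + [PAD_ID] * (max_length - len(s)) for s in batch]


def padding_mask(padded):
    """Return 1 for a real token and 0 for padding."""
    return [[int(tok != PAD_ID) for tok in row] for row in padded]


batch = [[1, 6042, 3, 6826, 3, 25, 2], [1, 25, 2]]
padded = pad_batch(batch)
mask = padding_mask(padded)
print(padded)
print(mask)
```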

We want to test whether our data loader works, so we create one for our training set with a tiny batch size of 2 and print the first two batches.

In [10]:
train_loader = DataLoader(train_set, batch_size=2, shuffle=False, collate_fn=padded_collate)

# Small test for a data loader
for di, d in enumerate(train_loader):
    print(d)
    print(d.tolist())
    print(f'{"-"*20}')
    if di == 1:
        break

tensor([[   1, 4754,  926, 6325,  185, 7745, 6332,  584, 9060,    3,   10, 1161,
         2002,   16, 4273, 9063,   22,  584,    3,    3, 8917, 9060, 8556, 4754,
            3, 2067,   18,   10, 5363,    8, 1113,   24,   18, 9060, 7826, 6332,
            3,   18, 6822, 1744, 5193,    3,   18, 9725, 5944, 1200, 9167,    3,
            3,   25,    2],
        [   1, 6042,    3, 6826,    3,   25,    2,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]])
[[1, 4754, 926, 6325, 185, 7745, 6332, 584, 9060, 3, 10, 1161, 2002, 16, 4273, 9063, 22, 584, 3, 3, 8917, 9060, 8556, 4754, 3, 2067, 18, 10, 5363, 8, 1113, 24, 18, 9060, 7826, 6332, 3, 18, 6822, 1744, 5193, 3, 18, 9725, 5944, 1200, 9167, 3, 3, 25, 2], [1, 6042, 3, 6826, 3, 25, 2, 0, 0, 0, 0, 0,

## Defining the Model
We import the model from our models folder. For encoding, this will be our RNNLM, defined in RNNLM.py

In [11]:
from models.RNNLM import RNNLM
embedding_size = 50
hidden_size = 50
# vocab_size is the size of our vocabulary; the embedding and hidden sizes are both 50. These numbers are arbitrary for now, while we are still trying to make the model work.
rnnlm = RNNLM(vocab_size, embedding_size, hidden_size)

In [12]:
import torch
import torch.nn as nn

# Define the loss function (NLLLoss expects log-probabilities as input) and the optimizer
criterion = nn.NLLLoss()
optim = torch.optim.Adam(rnnlm.parameters())

In [35]:
all_losses = []

def train_model_on_batch(model: RNNLM, optim: torch.optim.Optimizer, input_tensor: torch.Tensor):
    optim.zero_grad()
    loss = 0
    # input_tensor is shaped as sentences (= batch) x words
    hidden = model.init_hidden(input_tensor)

    for idx in range(input_tensor.shape[1] - 1):
        current_words = input_tensor[:, idx] # Get the current word for each sentence
        print(current_words)
        print(f'{"-"*20}')
        output, hidden = model(current_words, hidden)
        print(output)
        print(output.size())
        break  # Debugging: stop after the first time step
        # next_words = input_tensor[:, idx + 1]

        # # TODO: Ensure this works
        # local_loss = criterion(torch.log(output), next_words)

        # loss += local_loss

    # loss.backward()
    # optim.step()
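
The loss that the commented-out lines are working towards can be sketched without the model. At each step the target is the token that follows the current one, and padded targets should not count towards the loss. The code below uses plain Python; `next_token_pairs` and `masked_nll` are hypothetical helpers, and the uniform "model" is only a stand-in. In PyTorch, the `ignore_index` argument of `nn.NLLLoss` could play the role of the padding check, assuming ID 0 is used only for padding.

```python
import math

PAD_ID = 0  # assumption: padding positions carry ID 0


def next_token_pairs(row):
    """Pair each token with the token that follows it: (input, target)."""
    return list(zip(row[:-1], row[1:]))


def masked_nll(log_probs, targets):
    """Mean negative log-likelihood over targets that are not padding.

    log_probs[t] is a list of log-probabilities over the vocabulary at step t.
    """
    total, count = 0.0, 0
    for step_log_probs, target in zip(log_probs, targets):
        if target == PAD_ID:
            continue  # padded positions contribute nothing to the loss
        total -= step_log_probs[target]
        count += 1
    return total / count


row = [1, 4, 5, 2, 0, 0]  # [BOS] w4 w5 [EOS] followed by two padding IDs
pairs = next_token_pairs(row)
targets = [t for _, t in pairs]

# A uniform distribution over a 6-word vocabulary, used only as a stand-in.
vocab = 6
uniform = [[math.log(1 / vocab)] * vocab for _ in targets]
loss = masked_nll(uniform, targets)
print(pairs)
print(loss)
```

Only the three non-padding targets enter the average, so the uniform stand-in gives a loss of log(6).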

In [36]:
for batch in train_loader:
    print(batch)
    print(f'{"-"*20}')
    train_model_on_batch(rnnlm, optim, batch)
    break

tensor([[   1, 4754,  926, 6325,  185, 7745, 6332,  584, 9060,    3,   10, 1161,
         2002,   16, 4273, 9063,   22,  584,    3,    3, 8917, 9060, 8556, 4754,
            3, 2067,   18,   10, 5363,    8, 1113,   24,   18, 9060, 7826, 6332,
            3,   18, 6822, 1744, 5193,    3,   18, 9725, 5944, 1200, 9167,    3,
            3,   25,    2],
        [   1, 6042,    3, 6826,    3,   25,    2,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]])
--------------------
tensor([1, 1])
--------------------
tensor([[0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000,  ..., 0.5000, 0.5000, 0.5000]],
       grad_fn=<SoftmaxBackward>)
torch.Size([2, 10000])
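
Every entry in the output above is 0.5, which cannot be a distribution over 10000 words, since each row would then sum to 5000. One possible explanation is that the softmax is normalized across the batch dimension (size 2) instead of the vocabulary dimension; both sentences start with the same BOS token, so their scores would be identical and each pair would split evenly. RNNLM.py is not shown here, so this is a guess rather than a diagnosis. The plain-Python sketch below, with made-up scores, reproduces the pattern.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Two identical rows of made-up scores: a batch of 2 over a toy vocabulary of 4.
scores = [[0.3, -1.2, 0.8, 0.1], [0.3, -1.2, 0.8, 0.1]]

# Normalizing over the vocabulary: each row sums to 1.
over_vocab = [softmax(row) for row in scores]

# Normalizing over the batch: each column sums to 1.
columns = [softmax(list(col)) for col in zip(*scores)]
over_batch = [list(row) for row in zip(*columns)]

print(over_vocab[0])
print(over_batch[0])
```

If this is indeed the cause, it would be worth checking which dimension the softmax in the model normalizes over.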
