## Initialization

# Data

In [16]:
train_path = '/dgm_for_text_data/02-21.10way.clean'
valid_path = '/dgm_for_text_data/22.auto.clean'
test_path  = '/dgm_for_text_data/23.auto.clean'

Our data files contain sentences from the Penn Treebank (PTB) data set, one per line. Each line is the bracketed parse tree of a single sentence. Let's pretty-print the first line of the training set to see what we're dealing with!

In [17]:
import os
from nltk import Tree
from nltk.treeprettyprinter import TreePrettyPrinter

def filereader(path):
    """
        Read a PTB data file line by line
    """
    with open(os.getcwd() + path, mode='r') as f:
        for line in f:
            yield line

In [18]:
def convert_to_sentence(line):
    """
        Converts a line of a PTB data file into a lower case sentence string
    """
    tree = Tree.fromstring(line)
    sentence = ' '.join(tree.leaves()).lower()
    return sentence

In [19]:
line = next(filereader(train_path))
tree = Tree.fromstring(line)
print(line)
print(TreePrettyPrinter(tree))

(TOP (S (PP (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP (DT The) (NN Misanthrope)) ('' '') (PP (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S (NP (JJ Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NNP Stage)) (PP (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP (NNP Leisure) (CC &) (NNP Arts)) (-RRB- -RRB-)))) (, ,) (NP (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (PP (IN by) (NP (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP (RB mistakenly)) (VBN attributed) (PP (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))

[pretty-printed tree output truncated; only the root node TOP survives]

For now, we are only interested in the sentences themselves, which are the leaves of the tree.
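To make "the leaves of the tree" concrete, here is a minimal sketch that pulls the terminal words out of a bracketed parse using only the standard library. The helper `leaves_from_bracketed` is hypothetical and is not how `nltk` implements `Tree.leaves()`; it assumes every leaf is written as `(TAG word)`. The example bracketing is hand-written for illustration and is not copied from the data files.

```python
import re

def leaves_from_bracketed(line):
    """Return the terminal words of a bracketed parse string, left to right."""
    # A leaf looks like "(TAG word)": an opening bracket, a tag, whitespace,
    # then a word without brackets or whitespace, directly followed by ")".
    return re.findall(r"\(\S+\s+([^\s()]+)\)", line)

# Hand-written illustrative parse (an assumption, not a line from the corpus).
example = "(TOP (S (NP (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"

# Join the leaves and lower-case them, mirroring convert_to_sentence.
sentence = " ".join(leaves_from_bracketed(example)).lower()
print(sentence)  # ms. haag plays elianti .
```

In the notebook itself, `Tree.fromstring(line).leaves()` does this job and is the safer choice for real treebank files.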

---



Now we build the data sets we will actually use.

In [20]:
train_sents = [convert_to_sentence(l) for l in filereader(train_path)]
valid_sents = [convert_to_sentence(l) for l in filereader(valid_path)]
test_sents = [convert_to_sentence(l) for l in filereader(test_path)]

In [21]:
from tokenizers import WordTokenizer

# Train your tokenizer. Credits to the NLP2 team for creating this tokenizer
tokenizer = WordTokenizer(train_sents, max_vocab_size=10000)

for sentence in train_sents[:5]:
    tokenized = tokenizer.encode(sentence, add_special_tokens=True)
    sentence_decoded = tokenizer.decode(tokenized, skip_special_tokens=False) 

    print('original: ' + sentence)
    print('tokenized: ', tokenized)
    print('decoded: ' + sentence_decoded)
    print()

original: in an oct. 19 review of `` the misanthrope '' at chicago 's goodman theatre -lrb- `` revitalized classics take the stage in windy city , '' leisure & arts -rrb- , the role of celimene , played by kim cattrall , was mistakenly attributed to christina haag .
tokenized:  [1, 4754, 926, 6325, 185, 7745, 6332, 584, 9060, 3, 10, 1161, 2002, 16, 4273, 9063, 22, 584, 3, 3, 8917, 9060, 8556, 4754, 3, 2067, 18, 10, 5363, 8, 1113, 24, 18, 9060, 7826, 6332, 3, 18, 6822, 1744, 5193, 3, 18, 9725, 5944, 1200, 9167, 3, 3, 25, 2]
decoded: [BOS] in an oct. 19 review of `` the [UNK] '' at chicago 's goodman theatre -lrb- `` [UNK] [UNK] take the stage in [UNK] city , '' leisure & arts -rrb- , the role of [UNK] , played by kim [UNK] , was mistakenly attributed to [UNK] [UNK] . [EOS]

original: ms. haag plays elianti .
tokenized:  [1, 6042, 3, 6826, 3, 25, 2]
decoded: [BOS] ms. [UNK] plays [UNK] . [EOS]

original: rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about
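
The output above shows rare words being replaced by `[UNK]` once the vocabulary is capped. The sketch below illustrates one way a word-level tokenizer with a size limit could behave. `TinyWordTokenizer` is a hypothetical stand-in, not the NLP2 `WordTokenizer`; the special-token ids (0 for padding, 1 for `[BOS]`, 2 for `[EOS]`, 3 for `[UNK]`) are an assumption inferred from the printed ids, and the `[PAD]` name is invented for this example.

```python
from collections import Counter

PAD, BOS, EOS, UNK = "[PAD]", "[BOS]", "[EOS]", "[UNK]"

class TinyWordTokenizer:
    """Hypothetical whitespace tokenizer; not the NLP2 WordTokenizer."""

    def __init__(self, sentences, max_vocab_size):
        counts = Counter(word for s in sentences for word in s.split())
        specials = [PAD, BOS, EOS, UNK]  # assumed ids 0, 1, 2, 3
        # Keep only the most frequent words that fit next to the special tokens.
        keep = max_vocab_size - len(specials)
        words = [w for w, _ in counts.most_common(keep)]
        self.id_to_token = specials + words
        self.token_to_id = {t: i for i, t in enumerate(self.id_to_token)}

    def encode(self, sentence, add_special_tokens=True):
        unk = self.token_to_id[UNK]
        # Any word outside the vocabulary falls back to the [UNK] id.
        ids = [self.token_to_id.get(w, unk) for w in sentence.split()]
        if add_special_tokens:
            ids = [self.token_to_id[BOS]] + ids + [self.token_to_id[EOS]]
        return ids

    def decode(self, ids, skip_special_tokens=False):
        tokens = [self.id_to_token[i] for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in (PAD, BOS, EOS)]
        return " ".join(tokens)

# Vocabulary of 7 entries: 4 special tokens plus the 3 most frequent words.
toy = TinyWordTokenizer(["the play was good", "the play was long"], max_vocab_size=7)
ids = toy.encode("the play was short")
print(ids)
print(toy.decode(ids))  # [BOS] the play was [UNK] [EOS]
```

Because "short" never appears in the toy training sentences, it is encoded as the `[UNK]` id, which is the same effect seen for "misanthrope" and "celimene" above.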

Now we create our own custom data sets in PyTorch.

In [22]:
from torch.utils.data import Dataset

class PTBDataset(Dataset):
    """
        A PTB Dataset
    """
    def __init__(self, sentences, tokenizer):
        self.sentences = sentences
        self.tokenizer = tokenizer
    
    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        item = self.sentences[idx]
        # We make sure to add the special tokens, since our models need to know how to start and end a sentence
        tokenized = self.tokenizer.encode(item, add_special_tokens=True)
        return tokenized

In [23]:
train_set = PTBDataset(train_sents, tokenizer)
valid_set = PTBDataset(valid_sents, tokenizer)
test_set = PTBDataset(test_sents, tokenizer)

# Let's print some information about our datasets
print(f'train/validation/test :: {len(train_set)}/{len(valid_set)}/{len(test_set)}')

train/validation/test :: 39832/1700/2416


Now let's create data loaders that can load, shuffle, and batch our data. Our models expect inputs of equal size, so we pad every sentence in a batch to the length of the longest sentence in that batch. *This may change.*
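
As a small illustration of the padding idea, independent of PyTorch, the sketch below pads lists of token ids to a common length and also builds a mask that marks the real tokens. The helper `pad_with_mask` and the constant `PAD_ID` are hypothetical; the sketch assumes id 0 is reserved for padding. The mask is an addition that this notebook does not use, though models often need something like it to ignore padded positions.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding

def pad_with_mask(batch, pad_id=PAD_ID):
    """Pad token-id lists to a common length and mark the real tokens."""
    max_length = max(len(s) for s in batch)
    # Append pad ids until every sentence reaches the longest length.
    padded = [s + [pad_id] * (max_length - len(s)) for s in batch]
    # 1 marks a real token, 0 marks a padded position.
    mask = [[1] * len(s) + [0] * (max_length - len(s)) for s in batch]
    return padded, mask

padded, mask = pad_with_mask([[1, 6042, 3, 6826, 3, 25, 2], [1, 25, 2]])
print(padded)  # [[1, 6042, 3, 6826, 3, 25, 2], [1, 25, 2, 0, 0, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0]]
```

Padding per batch rather than to a global maximum keeps batches of short sentences small, at the cost of batch shapes that vary from step to step.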

In [39]:
from torch.utils.data import DataLoader
import torch

def padded_collate(batch):
    """
     Pad each sentence to the length of the longest sentence in the batch
    """
    sentence_lengths = [len(s) for s in batch]
    max_length = max(sentence_lengths)
    padded_batch = [s + [0] * (max_length - len(s)) for s in batch]
    return torch.LongTensor(padded_batch)

train_loader = DataLoader(train_set, batch_size=2, shuffle=False, collate_fn=padded_collate)

# Small test for a data loader
t = 0
for d in train_loader:
    print(d)
    print(d.tolist())
    print()
    t += 1
    if t == 2:
        break

tensor([[   1, 4754,  926, 6325,  185, 7745, 6332,  584, 9060,    3,   10, 1161,
         2002,   16, 4273, 9063,   22,  584,    3,    3, 8917, 9060, 8556, 4754,
            3, 2067,   18,   10, 5363,    8, 1113,   24,   18, 9060, 7826, 6332,
            3,   18, 6822, 1744, 5193,    3,   18, 9725, 5944, 1200, 9167,    3,
            3,   25,    2],
        [   1, 6042,    3, 6826,    3,   25,    2,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]])
[[1, 4754, 926, 6325, 185, 7745, 6332, 584, 9060, 3, 10, 1161, 2002, 16, 4273, 9063, 22, 584, 3, 3, 8917, 9060, 8556, 4754, 3, 2067, 18, 10, 5363, 8, 1113, 24, 18, 9060, 7826, 6332, 3, 18, 6822, 1744, 5193, 3, 18, 9725, 5944, 1200, 9167, 3, 3, 25, 2], [1, 6042, 3, 6826, 3, 25, 2, 0, 0, 0, 0, 0,