## Initialization

# Data

In [1]:
train_path = '/dgm_for_text_data/02-21.10way.clean'
valid_path = '/dgm_for_text_data/22.auto.clean'
test_path  = '/dgm_for_text_data/23.auto.clean'

Our data files contain sentences from the Penn Treebank data set. Each line holds one sentence as a bracketed parse tree. Let's pretty-print the first line of the training set to see what we're dealing with!

In [2]:
import os
from nltk import Tree
from nltk.treeprettyprinter import TreePrettyPrinter

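def filereader(path):
    """
        Read a PTB data file line by line
    """
    # The paths above start with '/', so plain concatenation resolves them
    # relative to the current working directory.
    with open(os.getcwd() + path, mode='r') as f:
        for line in f:
            yield line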

In [3]:
def convert_to_sentence(line):
    """
        Converts a line of a PTB data file into a lower-case sentence string
    """
    # Parse the bracketed tree, then join its leaves (the words) with spaces.
    tree = Tree.fromstring(line)
    sentence = ' '.join(tree.leaves()).lower()
    return sentence

In [4]:
line = next(filereader(train_path))
tree = Tree.fromstring(line)
print(line)
print(TreePrettyPrinter(tree))

(TOP (S (PP (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP (DT The) (NN Misanthrope)) ('' '') (PP (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S (NP (JJ Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NNP Stage)) (PP (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP (NNP Leisure) (CC &) (NNP Arts)) (-RRB- -RRB-)))) (, ,) (NP (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (PP (IN by) (NP (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP (RB mistakenly)) (VBN attributed) (PP (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))

TOP  [remainder of the pretty-printed tree truncated]

For now we are only interested in the sentences, which are just the leaves of the tree.
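
To make "the leaves of the tree" concrete, here is a minimal sketch that pulls the words out of a bracketed line without NLTK. It assumes every leaf appears as a `(TAG word)` pair with no nested brackets. The names `LEAF_PATTERN`, `leaves_from_bracketed` and `to_sentence` are hypothetical, and the example tree is hand-written with guessed tags rather than taken from the data files.

```python
import re

# A pre-terminal looks like "(NNP Chicago)": a tag, a space, a word,
# and no nested brackets. Both groups exclude whitespace and parentheses.
LEAF_PATTERN = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def leaves_from_bracketed(line):
    """Return the words (leaves) of a bracketed tree, left to right."""
    return [word for _tag, word in LEAF_PATTERN.findall(line)]

def to_sentence(line):
    """Join the leaves with spaces and lower-case the result."""
    return " ".join(leaves_from_bracketed(line)).lower()

# Hand-written example tree (the tags are an assumption, not copied from the data).
example = "(TOP (S (NP (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(to_sentence(example))  # prints: ms. haag plays elianti .
```

`Tree.fromstring(line).leaves()` remains the more robust choice, since it parses the full structure instead of pattern-matching on the text.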

---



Now we build the data sets we will actually use: one list of lower-cased sentences per split.

In [5]:
train_sents = [convert_to_sentence(l) for l in filereader(train_path)]
valid_sents = [convert_to_sentence(l) for l in filereader(valid_path)]
test_sents = [convert_to_sentence(l) for l in filereader(test_path)]
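
The `WordTokenizer` used in the next cell appears to come from a local `tokenizers` module whose implementation is not shown here. As a rough sketch of what a word-level tokenizer with a capped vocabulary typically does, assuming whitespace splitting and a frequency-ranked vocabulary, consider the hypothetical `TinyWordTokenizer` below. The class name, the special-token list and the id layout are assumptions for illustration, not the notebook's actual implementation.

```python
from collections import Counter

class TinyWordTokenizer:
    """Hypothetical stand-in for a capped-vocabulary word tokenizer."""

    # Assumed special tokens; the real tokenizer's id layout may differ.
    SPECIALS = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]

    def __init__(self, sentences, max_vocab_size=10000):
        # Count whitespace-separated words over the training sentences.
        counts = Counter(w for s in sentences for w in s.split())
        # Keep only the most frequent words that fit next to the specials.
        n_words = max(max_vocab_size - len(self.SPECIALS), 0)
        words = [w for w, _ in counts.most_common(n_words)]
        self.id_to_word = self.SPECIALS + words
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}
        self.unk_id = self.word_to_id["[UNK]"]

    def encode(self, sentence):
        # Words outside the vocabulary all map to the [UNK] id.
        return [self.word_to_id.get(w, self.unk_id) for w in sentence.split()]

    def decode(self, ids):
        return " ".join(self.id_to_word[i] for i in ids)

# Tiny usage example: with room for only two real words, "cat" becomes [UNK].
tok = TinyWordTokenizer(["the cat sat", "the dog sat"], max_vocab_size=6)
ids = tok.encode("the cat sat")
print(tok.decode(ids))  # prints: the [UNK] sat
```

Because every out-of-vocabulary word shares one id, encoding followed by decoding is lossy: rare words such as proper names come back as `[UNK]`.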

In [13]:
from tokenizers import WordTokenizer

# Train your tokenizer.
tokenizer = WordTokenizer(train_sents, max_vocab_size=10000)

for sentence in train_sents[:5]:
    tokenized = tokenizer.encode(sentence, add_special_tokens=False)
    sentence_decoded = tokenizer.decode(tokenized) 

    print('original: ' + sentence)
    print('tokenized: ', tokenized)
    print('decoded: ' + sentence_decoded)
    print()

original: in an oct. 19 review of `` the misanthrope '' at chicago 's goodman theatre -lrb- `` revitalized classics take the stage in windy city , '' leisure & arts -rrb- , the role of celimene , played by kim cattrall , was mistakenly attributed to christina haag .
tokenized:  [1, 4754, 926, 6325, 185, 7745, 6332, 584, 9060, 3, 10, 1161, 2002, 16, 4273, 9063, 22, 584, 3, 3, 8917, 9060, 8556, 4754, 3, 2067, 18, 10, 5363, 8, 1113, 24, 18, 9060, 7826, 6332, 3, 18, 6822, 1744, 5193, 3, 18, 9725, 5944, 1200, 9167, 3, 3, 25, 2]
decoded: in an oct. 19 review of `` the [UNK] '' at chicago 's goodman theatre -lrb- `` [UNK] [UNK] take the stage in [UNK] city , '' leisure & arts -rrb- , the role of [UNK] , played by kim [UNK] , was mistakenly attributed to [UNK] [UNK] .

original: ms. haag plays elianti .
tokenized:  [1, 6042, 3, 6826, 3, 25, 2]
decoded: ms. [UNK] plays [UNK] .

original: rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about 1,200 cars in 1990 .
to