
Tokenizer

WCuddington edited this page Mar 7, 2019 · 1 revision

The tokenizer takes the string produced by an importer and splits it first by line and then by word. It returns a 2D array called words, indexed as words[lineIndex][wordIndex]. This layout preserves not only the individual words but also the line of the file each word occurs on, so line breaks are kept track of as well.
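To make the indexing concrete, here is a small sketch of what a words array could look like and how it might be queried. The sample data is invented for illustration, and representing a blank line as an empty list is an assumption rather than something the tokenizer is documented to do.

```python
# Hypothetical tokenizer output for a three-line file whose middle line is blank.
# Assumption: a blank line shows up as an empty inner list.
words = [
    ["the", "quick", "brown", "fox"],
    [],
    ["jumps", "over", "the", "lazy", "dog"],
]

# words[lineIndex][wordIndex]: second word on the third line.
print(words[2][1])  # over

# Find every occurrence of a word as (lineIndex, wordIndex) pairs.
positions = [
    (line_index, word_index)
    for line_index, line in enumerate(words)
    for word_index, word in enumerate(line)
    if word == "the"
]
print(positions)  # [(0, 0), (2, 2)]
```

Because the outer index is the line number, a match can be reported with its position in the original file without any extra bookkeeping.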


The tokenizer uses the NLTK library to perform the tokenization.

First, though, it strips all punctuation from the imported string, leaving only the plain words. Apostrophes are removed too, so "don't" becomes "dont". It does this using string.translate.
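The punctuation step can be sketched as follows. This assumes Python 3, where the translation table is built with str.maketrans, and it assumes the set of characters removed is string.punctuation; the project's actual character set and the helper name strip_punctuation are not taken from the source.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: string.punctuation is the set being stripped.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(REMOVE_PUNCTUATION)


print(strip_punctuation("Don't stop, it's fine!"))  # Dont stop its fine
```

Newlines are not punctuation, so they survive this step and remain available for the line split that follows. Note that string.punctuation covers ASCII only, so typographic (curly) apostrophes and quotes would pass through unchanged in this sketch.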

Then the punctuation-free string is split on the newline character into an array, string_array, in which each element is one line.

At this point, NLTK tokenizes each line in string_array into words and stores the result in words.
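Putting the steps together, a minimal end-to-end sketch might look like the one below. The page does not say which NLTK tokenizer is used, so WhitespaceTokenizer is an assumption, chosen here because it needs no additional NLTK data downloads; the fallback to str.split exists only so the sketch runs where NLTK is not installed. The function name tokenize is also hypothetical.

```python
import string

try:
    # Assumption: the project's actual NLTK tokenizer may differ.
    from nltk.tokenize import WhitespaceTokenizer
    split_words = WhitespaceTokenizer().tokenize
except ImportError:
    # Fallback so the sketch still runs without NLTK.
    split_words = str.split


def tokenize(imported_string):
    """Return words[lineIndex][wordIndex] for the given string (sketch)."""
    # Step 1: strip punctuation, apostrophes included.
    table = str.maketrans("", "", string.punctuation)
    cleaned = imported_string.translate(table)

    # Step 2: split into lines; string_array is an array of lines.
    string_array = cleaned.split("\n")

    # Step 3: tokenize each line into its words.
    words = [split_words(line) for line in string_array]
    return words


words = tokenize("Hello, world!\n\nDon't panic.")
print(words)  # [['Hello', 'world'], [], ['Dont', 'panic']]
```

In this sketch a blank line becomes an empty inner list, which is one way the line structure of the original file could be preserved in words.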
