Tokenizer

The tokenizer takes the string produced by an importer and splits it, first by line and then by word. It returns a 2D array called words, indexed as words[lineIndex][wordIndex]. This layout records not only each individual word but also the line of the file it occurs on, so line breaks are preserved as well.
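To make the layout concrete, here is a small hand-written sketch of what a words array might look like and how it can be indexed. The sample data and the find_word helper are hypothetical, and representing a blank line as an empty list is an assumption rather than something the tokenizer is confirmed to do.

```python
# Hand-written example of the words[lineIndex][wordIndex] layout.
# This data is illustrative; it was not produced by the real tokenizer.
words = [
    ["the", "quick", "brown", "fox"],  # line 0
    [],                                # line 1: assumed form of a blank line
    ["jumps", "over"],                 # line 2
]

# Index by line first, then by word within that line.
first_word_of_third_line = words[2][0]

# The outer length is the number of lines in the imported text.
line_count = len(words)


def find_word(words, target):
    """Return every (lineIndex, wordIndex) position where target occurs."""
    return [
        (line_index, word_index)
        for line_index, line in enumerate(words)
        for word_index, word in enumerate(line)
        if word == target
    ]
```

Because the line index is kept alongside each word, a lookup such as find_word(words, "over") can report where in the file a word appears, not just that it appears.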

The tokenizer uses the NLTK library to perform the tokenization.

Before tokenizing, it strips all punctuation from the imported string, leaving only the plain words. Apostrophes are removed too, so "don't" becomes "dont". It does this using string.translate.
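The punctuation step might look roughly like the sketch below. It assumes Python 3, where the translation is done with str.translate and a table built by str.maketrans, and it assumes "punctuation" means the ASCII characters in string.punctuation. The function name strip_punctuation is hypothetical.

```python
import string


def strip_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    # The third argument to str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)
```

Newline characters are not in string.punctuation, so line breaks survive this step and are still available for the later split into lines.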

Next, the punctuation-free string is split on the newline character into an array. The resulting string_array variable is an array of lines.

Finally, NLTK tokenizes each line in string_array into words and stores the results in words.
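The last two steps can be sketched as follows. The page does not say which NLTK tokenizer is used, so this sketch assumes WhitespaceTokenizer, which needs no downloaded model data. The function name split_into_words is hypothetical, and the fallback to str.split when NLTK is not installed is added only so the sketch runs anywhere.

```python
try:
    # Assumed choice of tokenizer; the real tokenizer may use a different one.
    from nltk.tokenize import WhitespaceTokenizer
    _split_words = WhitespaceTokenizer().tokenize
except ImportError:
    # Fallback for environments without NLTK: split on runs of whitespace.
    _split_words = str.split


def split_into_words(clean_text):
    """Turn a punctuation-free string into words[lineIndex][wordIndex]."""
    # One entry per line of the original text.
    string_array = clean_text.split("\n")
    # Tokenize each line separately so the line index is preserved.
    words = [_split_words(line) for line in string_array]
    return words
```

Tokenizing line by line, rather than the whole string at once, is what keeps the outer index aligned with line numbers in the imported file.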
