Tokenizer
The tokenizer takes the string created by an importer and splits it by line and by word. It returns a 2D array called words, laid out as words[lineIndex][wordIndex]. This lets us track not only the individual words but also the line of the file each one occurs on, as well as the line breaks.
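To illustrate the layout, here is a small sketch of how a words array might be indexed and searched. The sample data and the find_word helper are hypothetical, and representing a blank line as an empty list is an assumption rather than something this page confirms.

```python
# Hypothetical tokenizer output for a three-line file whose second line is blank.
# Assumption: a blank line is represented as an empty list.
words = [
    ["the", "quick", "brown", "fox"],
    [],
    ["jumps", "over"],
]


def find_word(words, target):
    """Return every (lineIndex, wordIndex) position at which target occurs."""
    return [
        (line_index, word_index)
        for line_index, line in enumerate(words)
        for word_index, word in enumerate(line)
        if word == target
    ]


# words[lineIndex][wordIndex]: first word on the third line.
first_word_of_third_line = words[2][0]

# Lines with no words mark where blank lines sat in the file.
blank_lines = [i for i, line in enumerate(words) if not line]
```

Because the outer index is the line number, a word's position in the file can be recovered without storing any extra metadata alongside it.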
The tokenizer uses the NLTK library to perform the tokenization.
First, though, it strips all punctuation (including apostrophes, so "don't" becomes "dont") from the imported string, leaving only the plain words. It does this using string.translate.
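A minimal sketch of this step, assuming Python 3 and assuming that "punctuation" means the ASCII characters in string.punctuation. The strip_punctuation name is hypothetical; the project's own code may build its translation table differently.

```python
import string

# Translation table that deletes every character in string.punctuation.
# Assumption: only ASCII punctuation is removed; typographic quotes such as
# the curly apostrophe are not part of string.punctuation and would remain.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with ASCII punctuation removed; newlines are left in place."""
    return text.translate(_PUNCTUATION_TABLE)


cleaned = strip_punctuation("Don't stop, believing!")
```

Newlines are not punctuation, so they survive this step and remain available for the line split that follows.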
The punctuation-free string is then split on the newline character into an array, so the string_array variable is an array of lines.
At this point, NLTK tokenizes each line of string_array into words and stores the result in words.
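The three steps can be sketched end to end as follows. This page does not say which NLTK tokenizer the project uses, so WhitespaceTokenizer here is an assumption, chosen because the text is already free of punctuation. The tokenize function name and the fallback to str.split when NLTK is not installed are also hypothetical additions for illustration.

```python
import string

try:
    # Assumption: a whitespace-based NLTK tokenizer stands in for whichever
    # NLTK tokenizer the project actually calls.
    from nltk.tokenize import WhitespaceTokenizer

    split_words = WhitespaceTokenizer().tokenize
except ImportError:
    # Fallback so the sketch still runs without NLTK installed.
    split_words = str.split


def tokenize(imported_string):
    """Return words[lineIndex][wordIndex] for the given text."""
    # Step 1: strip ASCII punctuation, including apostrophes.
    table = str.maketrans("", "", string.punctuation)
    cleaned = imported_string.translate(table)

    # Step 2: split on the newline character to get an array of lines.
    string_array = cleaned.split("\n")

    # Step 3: tokenize each line into words.
    words = [split_words(line) for line in string_array]
    return words


words = tokenize("Don't stop.\n\nKeep going!")
```

In this sketch a blank line yields an empty inner list, which is one way the line breaks in the original file can stay visible in the result.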