# WordTokenizers.jl
(Disclaimer: the package name is misleading; it also provides sentence and subword tokenizers.)

https://github.com/JuliaText/WordTokenizers.jl

Notebooks at https://github.com/Ayushk4/JuliaCon20_Talk


In [1]:
using WordTokenizers

# Sentence Tokenizer (Sentence Splitters / segmenters)
The sentence splitter uses a rule-based approach with a large list of exceptions.

In [2]:
text = "Dr. Jane Doe says leatherback sea turtle is the largest, measuring six or seven feet (2 m) in length at maturity, and three to five feet (1 to 1.5 m) in width, weighing up to 2000 pounds (about 900 kg). Most other species are smaller, being two to four feet in length (0.5 to 1 m) and proportionally less wide. The Flatback turtle is found solely on the northerncoast of Australia."

"Dr. Jane Doe says leatherback sea turtle is the largest, measuring six or seven feet (2 m) in length at maturity, and three to five feet (1 to 1.5 m) in width, weighing up to 2000 pounds (about 900 kg). Most other species are smaller, being two to four feet in length (0.5 to 1 m) and proportionally less wide. The Flatback turtle is found solely on the northerncoast of Australia."

In [3]:
split_sentences(text)

3-element Array{SubString{String},1}:
 "Dr. Jane Doe says leatherback sea turtle is the largest, measuring six or seven feet (2 m) in length at maturity, and three to five feet (1 to 1.5 m) in width, weighing up to 2000 pounds (about 900 kg)."
 "Most other species are smaller, being two to four feet in length (0.5 to 1 m) and proportionally less wide."                                                                                               
 "The Flatback turtle is found solely on the northerncoast of Australia."                                                                                                                                    

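To make the idea of "rules plus exceptions" concrete, here is a toy splitter in plain Julia. It is only a sketch: `toy_split_sentences` and the `ABBREVIATIONS` set are hypothetical names made up for illustration, and the package's actual rules and exception list are much larger than this.

```julia
# Hypothetical, tiny exception list (assumption for illustration only).
const ABBREVIATIONS = Set(["Dr.", "Mr.", "Mrs.", "Prof.", "etc."])

function toy_split_sentences(text::AbstractString)
    sentences = String[]
    current = String[]          # words of the sentence being built
    for word in split(text)
        push!(current, String(word))
        # Rule: a word ending in . ! or ? closes the sentence ...
        ends_with_stop = last(word) in ('.', '!', '?')
        # ... unless it is a known abbreviation (the exception).
        if ends_with_stop && !(String(word) in ABBREVIATIONS)
            push!(sentences, join(current, " "))
            empty!(current)
        end
    end
    # Keep any trailing words that lack a closing punctuation mark.
    isempty(current) || push!(sentences, join(current, " "))
    return sentences
end

println(toy_split_sentences("Dr. Smith arrived. He left early!"))
# prints ["Dr. Smith arrived.", "He left early!"]
```

Without the exception list, the first "sentence" would be just `"Dr."`, which is exactly the kind of error the exceptions are there to prevent.
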
## Word Tokenizers
These tokenizers assume that sentence splitting has already been done.
- Poorman's tokenizer
- Punctuation space tokenizer
- Penn Tokenizer
- Improved Penn Tokenizer
- NLTK Word tokenizer
- Reversible Tokenizer
- TokTok Tokenizer
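
Since these tokenizers expect one sentence at a time, a typical pipeline splits first and tokenizes second. The sketch below runs a few of the listed tokenizers on the same input so their outputs can be compared. The function names (`poormans_tokenize`, `punctuation_space_tokenize`, `penn_tokenize`, `improved_penn_tokenize`, `rev_tokenize`) are assumed from the package README; check them against the version you have installed. The sample text is made up.

```julia
using WordTokenizers

text = "Mr. Smith couldn't attend. He sent a note instead."

# Step 1: sentence splitting.
sentences = split_sentences(text)

# Step 2: word tokenization, one sentence at a time.
# Names are assumed from the README and qualified so that
# the example does not depend on what the package exports.
tokenizers = [
    WordTokenizers.poormans_tokenize,
    WordTokenizers.punctuation_space_tokenize,
    WordTokenizers.penn_tokenize,
    WordTokenizers.improved_penn_tokenize,
    WordTokenizers.rev_tokenize,
]

for tok in tokenizers
    # String(s) converts the SubString returned by split_sentences.
    println(nameof(tok), " => ", [tok(String(s)) for s in sentences])
end
```
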

In [4]:
# WordTokenizers for high-speed tokenization
# 1. nltk_word_tokenizer
# 2. multilingual toktok_tokenizer
# 3. tweet_tokenizer
# 4. plug in external tokenizers, 7+ tokenizers

In [5]:
sentence = "@JohnDoe says: Well, we couldn't have this cliche-ridden, \"Touched by Angel\" to check out #GitHub https://github.com."

"@JohnDoe says: Well, we couldn't have this cliche-ridden, \"Touched by Angel\" to check out #GitHub https://github.com."

In [6]:
split(sentence)

16-element Array{SubString{String},1}:
 "@JohnDoe"           
 "says:"              
 "Well,"              
 "we"                 
 "couldn't"           
 "have"               
 "this"               
 "cliche-ridden,"     
 "\"Touched"          
 "by"                 
 "Angel\""            
 "to"                 
 "check"              
 "out"                
 "#GitHub"            
 "https://github.com."

In [7]:
# Modified version of NLTK's tokenizer
print(WordTokenizers.nltk_word_tokenize(sentence))

["@", "JohnDoe", "says", ":", "Well", ",", "we", "could", "n't", "have", "this", "cliche-ridden", ",", "``", "Touched", "by", "Angel", "''", "to", "check", "out", "#", "GitHub", "https", ":", "/", "/", "github.com", "."]

In [8]:
# Modified version of Jon Dehdari's tok-tok tokenizer
print(WordTokenizers.toktok_tokenize(sentence))

["@", "JohnDoe", "says", ":", "Well", ",", "we", "couldn", "'", "t", "have", "this", "cliche-ridden", ",", "\"", "Touched", "by", "Angel", "\"", "to", "check", "out", "#", "GitHub", "https://github.com", "."]

In [9]:
# Modified version of NLTK's casual tokenizer
print(WordTokenizers.tweet_tokenize(sentence))

["@JohnDoe", "says", ":", "Well", ",", "we", "couldn't", "have", "this", "cliche-ridden", ",", "\"", "Touched", "by", "Angel", "\"", "to", "check", "out", "#GitHub", "https://github.com", "."]

In [10]:
# Obtained using Google Translate
spanish_text = "@JohnDoe dice: Bueno, no podríamos tener este cliché, \"Tocado por Angel\" para ver #GitHub https://github.com."
print(WordTokenizers.toktok_tokenize(spanish_text))

["@", "JohnDoe", "dice", ":", "Bueno", ",", "no", "podríamos", "tener", "este", "cliché", ",", "\"", "Tocado", "por", "Angel", "\"", "para", "ver", "#", "GitHub", "https://github.com", "."]

In [11]:
# Obtained using Google Translate
hindi_text = "@ जॉनडे कहते हैं: ठीक है, हम इस क्लिज्ड, \"एंजेल द्वारा छुआ\" #GitHub https://github.com की जांच कर सकते हैं।"
print(WordTokenizers.toktok_tokenize(hindi_text))

["@", "जॉनडे", "कहते", "हैं", ":", "ठीक", "है", ",", "हम", "इस", "क्लिज्ड", ",", "\"", "एंजेल", "द्वारा", "छुआ", "\"", "#", "GitHub", "https://github.com", "की", "जांच", "कर", "सकते", "हैं", "।"]

## TokenBuffer API
The TokenBuffer API and its supporting utility lexers enable high-speed tokenization that does not rely on regular expressions.

`TokenBuffer` turns a string into a readable stream that tokenizers are built on. Utility lexers such as `spaces` and `number` read characters from the stream into an array of tokens.

Lexers return `true` or `false` to indicate whether they matched at the current position in the input stream, so they can be combined easily with `||`:
`spacesornumber(ts) = spaces(ts) || number(ts)`
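
The boolean-return convention can be illustrated without the package. The toy buffer below is a sketch under stated assumptions, not WordTokenizers' implementation: `MiniBuffer`, `skip_spaces`, `digits_run`, `any_char` and `mini_tokenize` are hypothetical names standing in for `TokenBuffer` and its lexers.

```julia
# Toy stand-in for a token buffer (hypothetical; not the package's type).
mutable struct MiniBuffer
    input::Vector{Char}
    idx::Int                  # position of the next unread character
    buffer::Vector{Char}      # characters of the token being built
    tokens::Vector{String}    # finished tokens
end

MiniBuffer(s::AbstractString) = MiniBuffer(collect(s), 1, Char[], String[])

finished(mb::MiniBuffer) = mb.idx > length(mb.input)

# Move the pending characters, if any, into the token list.
function flush_token!(mb::MiniBuffer)
    if !isempty(mb.buffer)
        push!(mb.tokens, String(mb.buffer))
        empty!(mb.buffer)
    end
    return mb
end

# Lexer: skip a run of whitespace. Returns true only if it matched.
function skip_spaces(mb::MiniBuffer)
    (finished(mb) || !isspace(mb.input[mb.idx])) && return false
    flush_token!(mb)
    while !finished(mb) && isspace(mb.input[mb.idx])
        mb.idx += 1
    end
    return true
end

# Lexer: read a run of digits as one token. Returns true only if it matched.
function digits_run(mb::MiniBuffer)
    (finished(mb) || !isdigit(mb.input[mb.idx])) && return false
    flush_token!(mb)
    start = mb.idx
    while !finished(mb) && isdigit(mb.input[mb.idx])
        mb.idx += 1
    end
    push!(mb.tokens, String(mb.input[start:mb.idx-1]))
    return true
end

# Fallback lexer: append one character to the pending token. Always matches.
function any_char(mb::MiniBuffer)
    push!(mb.buffer, mb.input[mb.idx])
    mb.idx += 1
    return true
end

# Lexers compose with || because each one reports whether it matched.
spaces_or_digits(mb) = skip_spaces(mb) || digits_run(mb)

function mini_tokenize(s::AbstractString)
    mb = MiniBuffer(s)
    while !finished(mb)
        spaces_or_digits(mb) || any_char(mb)
    end
    flush_token!(mb)
    return mb.tokens
end

println(mini_tokenize("abc 123 de45"))
# prints ["abc", "123", "de", "45"]
```

Because each lexer reports whether it consumed input, `||` short-circuits on the first match, and the catch-all lexer runs only when nothing more specific applies. This is why a fallback lexer belongs at the end of the chain.
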


In [12]:
using WordTokenizers: TokenBuffer, isdone, character, spaces, nltk_url1, nltk_url2, nltk_phonenumbers

In [13]:
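function my_tok(input)
   # A URL matches if either of the two URL lexers matches
   urls(ts) = nltk_url1(ts) || nltk_url2(ts)

   ts = TokenBuffer(input)
   while !isdone(ts)
       # Skip whitespace between tokens
       spaces(ts) && continue
       # Try the specific lexers first; character(ts) is the fallback and must come last
       urls(ts) ||
       nltk_phonenumbers(ts) ||
       character(ts)
   end
   return ts.tokens
end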

my_tok("A url https://github.com/JuliaText/WordTokenizers.jl/ and phonenumber +0 (987) - 2344321 #MeaningLess")

7-element Array{String,1}:
 "A"                                              
 "url"                                            
 "https://github.com/JuliaText/WordTokenizers.jl/"
 "and"                                            
 "phonenumber"                                    
 "+0 (987) - 2344321"                             
 "#MeaningLess"                                   

# Statistical (and subword) Tokenizers
More on this later.


----------------------------------