Add Tweet Tokenizer #13

Merged: 45 commits, Jun 6, 2019
Changes from 41 commits
Commits (45)
cd26b75
Add Regex
Ayushk4 Jan 22, 2019
d7532ec
Add function to replace HTML entities
Ayushk4 Jan 25, 2019
5c87dae
Add tweet tokenizer
Ayushk4 Jan 31, 2019
fd927d1
Add docstrings for functions
Ayushk4 Jan 31, 2019
4da422e
Add support for tweet tokenizer
Ayushk4 Feb 2, 2019
2e6b4c2
Update README
Ayushk4 Feb 2, 2019
853331c
Fix bug for optional argurments
Ayushk4 Feb 2, 2019
e3d2fa0
Add dependencies to REQUIRE
Ayushk4 Feb 3, 2019
c06ae26
Minor Code fixes
Ayushk4 Feb 4, 2019
1b65d8e
Improve code clarity
Ayushk4 Feb 4, 2019
320ce4d
Add comments and better variable naming
Ayushk4 Feb 8, 2019
94542ef
Add first series of tests
Ayushk4 Feb 8, 2019
1999e27
Add second series of tests
Ayushk4 Feb 8, 2019
ad94e30
Add tests and fix bugs
Ayushk4 Feb 9, 2019
4ec3f0a
Add final set of tests, fix links,typo
Ayushk4 Feb 9, 2019
164974b
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.jl
Ayushk4 Mar 7, 2019
8007a17
Make Replace entities 30x faster
Ayushk4 Mar 10, 2019
9aac4b2
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Mar 10, 2019
50539ba
Use TokenBuffer to speed up pre_processing functions
Ayushk4 Mar 12, 2019
59f8b0c
Fix indentation and bugs
Ayushk4 Mar 13, 2019
aef0efe
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Apr 11, 2019
cbb01e8
Add regex-free emoticons via TokenBuffer
Ayushk4 Apr 12, 2019
703ebc4
Add ascii arrows and html tags
Ayushk4 Apr 18, 2019
77b505a
Add functions for twitter hashtags and email addresses
Ayushk4 May 17, 2019
7440301
Fix Bugs
Ayushk4 May 18, 2019
6e20d5d
Add functions for twitterusernames and ellipses
Ayushk4 May 18, 2019
ecae2b9
Fix bugs in emailaddresses
Ayushk4 May 19, 2019
7661b8d
Update fast.jl, Support signs (+,-) in numbers
Ayushk4 May 19, 2019
7d6fa21
Switch to TokenBuffer for Tweet Tokenizer
Ayushk4 May 21, 2019
66adbf8
Add TokenBuffer function for nltk's tweet tokenizer - phone numbers
Ayushk4 May 22, 2019
a6de434
Add nltk_url1
Ayushk4 May 24, 2019
b9f0c44
Finish nltk_url1
Ayushk4 May 24, 2019
040368b
Add urls to tweet Tokenizer
Ayushk4 May 24, 2019
697dee4
Remove option of converting to lowercase
Ayushk4 May 24, 2019
927e4b3
Remove regex patterns
Ayushk4 May 24, 2019
75db813
Fix Bugs in tweet tokenizing functions
Ayushk4 May 31, 2019
ce7c74b
Finish nltk url function
Ayushk4 May 31, 2019
d9a019f
Add tests
Ayushk4 Jun 2, 2019
bebe5bd
Fix Bugs in tweet tokenizer
Ayushk4 Jun 2, 2019
e2120ad
Fix indentation
Ayushk4 Jun 3, 2019
1dc5445
Update README for TokenBuffer
Ayushk4 Jun 3, 2019
f18ae44
Update Docs for custom token TokenBuffer tokenizers, functions
Ayushk4 Jun 3, 2019
b0d8dd4
Minor doc changes
Ayushk4 Jun 3, 2019
fcfd107
Clean up code for tweet Tokenizer
Ayushk4 Jun 3, 2019
c7bd296
Change vectors into tuples
Ayushk4 Jun 5, 2019
99 changes: 96 additions & 3 deletions README.md
@@ -65,12 +65,14 @@ The word tokenizers basically assume sentence splitting has already been done.

- **Penn Tokenizer:** (`penn_tokenize`) This is Robert MacIntyre's original tokenizer used for the Penn Treebank. Splits contractions.
- **Improved Penn Tokenizer:** (`improved_penn_tokenize`) NLTK's improved Penn Treebank Tokenizer. Very similar to the original, with some improvements on punctuation and contractions. This matches NLTK's `nltk.tokenize.TreeBankWordTokenizer.tokenize`.
- **NLTK Word tokenizer:** (`nltk_word_tokenize`) NLTK's even more improved version of the Penn Tokenizer. This version has better unicode handling and some other changes. This matches the most commonly used `nltk.word_tokenize`, minus the sentence tokenizing step.

(To me it seems like a weird historical thing that NLTK has two successive variations on improving the Penn tokenizer, but for now I am matching it and having both. See [[NLTK#2005]](https://github.com/nltk/nltk/issues/2005))

- **Reversible Tokenizer:** (`rev_tokenize` and `rev_detokenize`) This tokenizer splits on punctuation, spaces and special symbols. The generated tokens can be de-tokenized back to the state before tokenization with the `rev_detokenize` function.
- **TokTok Tokenizer:** (`toktok_tokenize`) This is a simple, general tokenizer, where the input has one sentence per line; thus only the final period is tokenized. Tok-tok has been tested on, and gives reasonably good results for, English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. **(default tokenizer)**
- **Tweet Tokenizer:** (`tweet_tokenize`) NLTK's casual tokenizer, designed primarily for tweets. Beyond Twitter-specific tokens such as hashtags and usernames, it handles emoticons and other web constructs like HTML entities well. This closely matches NLTK's `nltk.tokenize.TweetTokenizer`; see the short example below.
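
A quick illustrative call (the input is invented and the result described in the comment is indicative rather than verbatim output):

```julia
julia> using WordTokenizers

julia> tweet_tokenize("@JuliaLang WordTokenizers.jl is soooo fast :-) #nlp https://julialang.org")
# Indicatively, usernames, emoticons, hashtags and URLs each survive as a single
# token, e.g. "@JuliaLang", ":-)", "#nlp", "https://julialang.org".
```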


# Sentence Splitters
We currently only have one sentence splitter.
@@ -112,3 +114,94 @@ So
`split(foo, Words)` is the same as `tokenize(foo)`,
and
`split(foo, Sentences)` is the same as `split_sentences(foo)`.
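
A minimal sketch of that equivalence on an invented sentence (it simply restates the mapping above):

```julia
julia> using WordTokenizers

julia> split("Hello, world! Good morning.", Words) == tokenize("Hello, world! Good morning.")
true

julia> split("Hello, world! Good morning.", Sentences) == split_sentences("Hello, world! Good morning.")
true
```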

## Using TokenBuffer API for Custom Tokenizers
We offer a `TokenBuffer` API and supporting utility parsers
for high speed tokenization.

The order in which the parsers are written matters in some cases.

For example, `987-654-3210` matches both as a phone number and as plain numbers, but `number` alone will only match up to `987` and split there.

```julia
julia> using WordTokenizers: TokenBuffer, isdone, character, spaces, nltk_phonenumbers, number

julia> order1(ts) = number(ts) || nltk_phonenumbers(ts)
order1 (generic function with 1 method)

julia> order2(ts) = nltk_phonenumbers(ts) || number(ts)
order2 (generic function with 1 method)

julia> function tokenize1(input)
           ts = TokenBuffer(input)
           while !isdone(ts)
               order1(ts) ||
                   character(ts)
           end
           return ts.tokens
       end
tokenize1 (generic function with 1 method)

julia> function tokenize2(input)
           ts = TokenBuffer(input)
           while !isdone(ts)
               order2(ts) ||
                   character(ts)
           end
           return ts.tokens
       end
tokenize2 (generic function with 1 method)

julia> tokenize1("987-654-3210") # number(ts) || nltk_phonenumbers(ts)
5-element Array{String,1}:
"987"
"-"
"654"
"-"
"3210"

julia> tokenize2("987-654-3210") # nltk_phonenumbers(ts) || number(ts)
1-element Array{String,1}:
"987-654-3210"
```

#### Writing your own TokenBuffer parsers

`TokenBuffer` turns a string into a readable stream, used for building tokenizers.
Utility parsers such as `spaces` and `number` read characters from the
stream and into an array of tokens.

Parsers return `true` or `false` to indicate whether they matched
in the input stream. They can therefore be combined easily, e.g.

    spacesornumber(ts) = spaces(ts) || number(ts)

either skips whitespace or parses a number token, if possible.

The simplest possible tokenizer accepts any `character` with no token breaks:

    function tokenise(input)
        ts = TokenBuffer(input)
        while !isdone(ts)
            character(ts)
        end
        return ts.tokens
    end

    tokenise("foo bar baz") # ["foo bar baz"]

The second simplest splits only on spaces:

    function tokenise(input)
        ts = TokenBuffer(input)
        while !isdone(ts)
            spaces(ts) || character(ts)
        end
        return ts.tokens
    end

    tokenise("foo bar baz") # ["foo", "bar", "baz"]

See `nltk_word_tokenize` for a more advanced example.
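
For a parser written from scratch rather than composed from the built-ins, the sketch below matches `#hashtag`-style tokens. It is only an illustration: the `hashtag` and `tokenise_hashtags` names are invented for this example, and it assumes the package's internal `flush!(ts, token)` helper for emitting a finished token.

```julia
using WordTokenizers: TokenBuffer, isdone, flush!, spaces, character

# Sketch of a hand-written parser: match '#' followed by letters, digits or
# underscores and emit the whole span as one token. On failure it returns
# false and leaves ts.idx untouched, so other parsers can try instead.
function hashtag(ts)
    ts[] == '#' || return false
    i = ts.idx + 1
    while i <= length(ts.input) && (isletter(ts[i]) || isdigit(ts[i]) || ts[i] == '_')
        i += 1
    end
    i > ts.idx + 1 || return false               # require at least one character after '#'
    flush!(ts, String(ts.input[ts.idx:i-1]))     # emit any buffered characters, then the hashtag
    ts.idx = i
    return true
end

function tokenise_hashtags(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) || hashtag(ts) || character(ts)
    end
    return ts.tokens
end

tokenise_hashtags("loving #JuliaLang today")     # expected: ["loving", "#JuliaLang", "today"]
```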
2 changes: 2 additions & 0 deletions REQUIRE
@@ -1 +1,3 @@
julia 0.7
HTML_Entities
StrTables
7 changes: 7 additions & 0 deletions src/WordTokenizers.jl
@@ -1,8 +1,14 @@

module WordTokenizers

using HTML_Entities
using StrTables
using Unicode


export poormans_tokenize, punctuation_space_tokenize,
penn_tokenize, improved_penn_tokenize, nltk_word_tokenize,
tweet_tokenize,
tokenize,
rulebased_split_sentences,
split_sentences,
@@ -16,6 +22,7 @@ include("words/simple.jl")
include("words/nltk_word.jl")
include("words/reversible_tokenize.jl")
include("words/sedbased.jl")
include("words/tweet_tokenizer.jl")
include("sentences/sentence_splitting.jl")
include("words/TokTok.jl")

2 changes: 1 addition & 1 deletion src/split_api.jl
@@ -3,7 +3,7 @@
export Words, Sentences

const tokenizers = [poormans_tokenize, punctuation_space_tokenize,
penn_tokenize, improved_penn_tokenize, nltk_word_tokenize, tweet_tokenize]
const sentence_splitters = [rulebased_split_sentences]

const Words = tokenize
29 changes: 16 additions & 13 deletions src/words/fast.jl
@@ -17,23 +17,23 @@ either skips whitespace or parses a number token, if possible.
The simplest possible tokeniser accepts any `character` with no token breaks:

function tokenise(input)
ts = TokenBuffer(input)
while !isdone(ts)
character(ts)
end
return ts.tokens
end

tokenise("foo bar baz") # ["foo bar baz"]

The second simplest splits only on spaces:

function tokenise(input)
ts = TokenBuffer(input)
while !isdone(ts)
spaces(ts) || character(ts)
end
return ts.tokens
end

tokenise("foo bar baz") # ["foo", "bar", "baz"]
@@ -214,9 +214,13 @@ end

Matches numbers such as `10,000.5`, preserving formatting.
"""
function number(ts, sep = (':', ',', '\'', '.'); check_sign = false)
    i = ts.idx
    if check_sign && ts[] ∈ ['+', '-'] && ( i == 1 || isspace(ts[i-1]))
        i += 1
    end

    i <= length(ts.input) && isdigit(ts[i]) || return false
    while i <= length(ts.input) && (isdigit(ts[i]) ||
            (ts[i] in sep && i < length(ts.input) && isdigit(ts[i+1])))
        i += 1
@@ -225,4 +229,3 @@ function number(ts, sep = (':', ',', '\'', '.'))
    ts.idx = i
    return true
end
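
With the new `check_sign` keyword, a leading `+` or `-` at the start of the input or after whitespace is treated as part of the number. A minimal sketch of how a tokenizer might opt in to this behaviour (the name `tokenise_signed` is illustrative, and the expected output assumes the sign is emitted together with the digits):

```julia
julia> using WordTokenizers: TokenBuffer, isdone, spaces, number, character

julia> function tokenise_signed(input)
           ts = TokenBuffer(input)
           while !isdone(ts)
               spaces(ts) || number(ts, check_sign = true) || character(ts)
           end
           return ts.tokens
       end
tokenise_signed (generic function with 1 method)

julia> tokenise_signed("a drop of -7.5 degrees")
# Expected, roughly: ["a", "drop", "of", "-7.5", "degrees"]
```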
