Add a Twitter tokenizer #3
Comments
I am interested in working on this issue. I searched for a while and mainly came across the following two tweet tokenizers.
Which of the two would be better?
The NLTK one. This package is Apache 2, and so the Tweet NLP license is not compatible.
I was going through the codebase and noticed that sed is used rather than regex matching in Julia. Is that for performance reasons? Should I also stick with a sed-based tokenizer?
We actually generate Julia code based on the sed script. This is done here
I am nearing completion, but I have been stuck on this for a while. I need to decode the
You do? Anyway, StringEncodings.jl should be what you are after.
A certain limited range of numeric character references is interpreted by web browsers as representing characters in the Windows-1252 encoding. One part of the tokenizer is "replacing HTML entities in the text by converting them to their corresponding Unicode characters". This is where I need it. Also, thanks, I will look into StringEncodings.jl.
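To make the entity-replacement step concrete, here is a minimal sketch in Python (the language of the NLTK tokenizer being ported). It is not the NLTK code and not the Julia port. The function name `replace_html_entities` and the regex are illustrative assumptions. It assumes the special range is 0x80 to 0x9F, which is the range browsers remap to Windows-1252, and it leaves unknown or unmappable entities untouched.

```python
import re
from html.entities import name2codepoint

# Decimal (&#128;), hexadecimal (&#x80;) or named (&amp;) entities.
# This pattern is an assumption for illustration, not NLTK's exact regex.
ENTITY_RE = re.compile(
    r"&(?:#(\d{1,7})|#[xX]([0-9a-fA-F]{1,6})|([A-Za-z][A-Za-z0-9]*));"
)


def replace_html_entities(text):
    """Replace HTML entities in text with the characters they stand for."""

    def convert(match):
        dec, hexa, name = match.groups()
        if name is not None:
            codepoint = name2codepoint.get(name)
            # Unknown names are left as they are.
            return chr(codepoint) if codepoint is not None else match.group(0)

        number = int(dec) if dec is not None else int(hexa, 16)

        # References in 0x80-0x9F are treated as Windows-1252 bytes.
        if 0x80 <= number <= 0x9F:
            try:
                return bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (e.g. 0x81) are unassigned in Windows-1252.
                return match.group(0)

        # Anything beyond the Unicode range is left untouched.
        if number > 0x10FFFF:
            return match.group(0)
        return chr(number)

    return ENTITY_RE.sub(convert, text)


print(replace_html_entities("Price: &#128;5 &amp; up"))
```

In a Julia port the same remapping would presumably be done with a Windows-1252 decode from StringEncodings.jl, or with a small hard-coded lookup table for the 0x80 to 0x9F range.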
Oh neat! An emoticon lexer.
Twitter language tends not to work well with normal tokenizers.
There are some Twitter tokenizers around, so we could port one of those.