Add a Twitter tokenizer #3

Closed
oxinabox opened this issue May 8, 2018 · 8 comments

Comments

@oxinabox
Member

oxinabox commented May 8, 2018

Twitter language tends not to work well with normal tokenizers.

There are some Twitter tokenizers around,
so we could port one of those.
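
To illustrate the problem, here is a minimal Python sketch (Python because NLTK, mentioned later in this thread, is a Python library). Both regular expressions are simplified assumptions made up for this example; they are not the patterns used by NLTK or by this package.

```python
import re

# Naive approach: split words from punctuation, as many general tokenizers do.
NAIVE = re.compile(r"\w+|[^\w\s]")

# Tweet-aware approach: try multi-character units first, then fall back.
# These patterns are simplified illustrations, not NLTK's actual patterns.
TWEET_AWARE = re.compile(
    r"""
    https?://\S+          # URLs
    | [@\#]\w+            # @mentions and #hashtags
    | [:;=][-o]?[()DPp]   # a few Western-style emoticons
    | \w+(?:'\w+)?        # words, with an optional apostrophe part
    | [^\w\s]             # any other single non-space character
    """,
    re.VERBOSE,
)


def naive_tokenize(text):
    return NAIVE.findall(text)


def tweet_tokenize(text):
    return TWEET_AWARE.findall(text)


tweet = "@julia_user loving #JuliaLang :-) http://t.co/abc"
print(naive_tokenize(tweet))
print(tweet_tokenize(tweet))
```

The naive version breaks the mention, the hashtag, the emoticon and the URL into fragments such as `'@'`, `'julia_user'`, `':'`, `'-'`, `')'`, while the tweet-aware version returns `['@julia_user', 'loving', '#JuliaLang', ':-)', 'http://t.co/abc']`.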

@Ayushk4
Member

Ayushk4 commented Jan 17, 2019

I am interested in working on this issue. I searched for a while and mainly came across the following two tweet tokenizers.

Which of the two will be better?

@oxinabox
Member Author

The NLTK one.

This package is Apache 2, and so the Tweet NLP license is not compatible.
We have already taken tokenizers from NLTK,
so we can do so again.

@Ayushk4
Member

Ayushk4 commented Jan 18, 2019

I was going through the codebase. I noticed that sed was used rather than regex matching in Julia. Is that for performance reasons? Should I also stick with a sed-based tokenizer?

@oxinabox
Member Author

sed isn't actually being used.

We generate Julia code based on the sed script.
sed is basically being used as a DSL.

This is done here:

function generate_tokenizer_from_sed(sed_script, extended=false)::Expr
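
The "sed as a DSL" idea can be sketched as follows. This is an illustrative Python analogue under stated assumptions, not the package's Julia generator. It handles only `s/pattern/replacement/flags` commands with `/` as the delimiter, it assumes the patterns are already valid Python regular expressions, and the names `parse_sed_substitution`, `compile_sed_script` and the two-rule script are hypothetical.

```python
import re


def parse_sed_substitution(line):
    """Parse one `s/pattern/replacement/flags` command into its parts.

    Only the `/` delimiter is handled; escaped delimiters are not supported.
    """
    if not line.startswith("s/"):
        raise ValueError(f"unsupported sed command: {line!r}")
    pattern, replacement, flags = line[2:].split("/")
    return pattern, replacement, flags


def compile_sed_script(sed_script):
    """Turn a tiny sed script into a Python tokenizer function."""
    steps = []
    for line in sed_script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sed comments
        pattern, replacement, flags = parse_sed_substitution(line)
        # sed writes "the whole match" as &, Python's re module as \g<0>.
        replacement = replacement.replace("&", r"\g<0>")
        count = 0 if "g" in flags else 1  # count=0 means "replace all" in re.sub
        steps.append((re.compile(pattern), replacement, count))

    def tokenize(text):
        # Apply every substitution in script order, then split on whitespace.
        for regex, replacement, count in steps:
            text = regex.sub(replacement, text, count=count)
        return text.split()

    return tokenize


# A hypothetical two-rule script, written with Python-compatible regex syntax.
SCRIPT = r"""
# put spaces around common punctuation
s/[,.!?]/ & /g
# split the contraction n't from its verb
s/(\w)n't\b/\1 n't/g
"""

tokenize = compile_sed_script(SCRIPT)
print(tokenize("I can't stop, really!"))
```

The script is read once and turned into a function, so no `sed` process runs at tokenization time. With the script above, the call prints `['I', 'ca', "n't", 'stop', ',', 'really', '!']`.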

@Ayushk4
Member

Ayushk4 commented Jan 23, 2019

I am nearing completion, but I have been stuck on one thing for a while: I need to decode Windows-1252 (cp1252) encoded text into UTF-8 (Unicode). Any leads on this?

@oxinabox
Member Author

You do?
Where are you encountering this?

Anyway, StringEncodings.jl should be what you are after.

@Ayushk4
Member

Ayushk4 commented Jan 23, 2019

Web browsers interpret a certain limited range of numeric character references as Windows-1252 code points rather than as Unicode code points.

One part of the tokenizer is "replacing HTML entities in the text by converting them to their corresponding Unicode character". This is where I need it.

Also, thanks. I will look into StringEncodings.jl.
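
For reference, the entity-replacement step described above can be sketched in Python. This is a minimal sketch and not NLTK's exact code or the eventual Julia port. The function name `replace_html_entities` and the regular expression are assumptions for illustration, the 0x80 to 0x9F range is the one that gets the Windows-1252 treatment, and Python's built-in `cp1252` codec stands in for what StringEncodings.jl would provide in Julia.

```python
import re
from html.entities import name2codepoint

# Decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
ENTITY = re.compile(r"&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));")


def replace_html_entities(text):
    """Replace HTML entities with the characters they stand for.

    Unknown or invalid entities are left in the text unchanged.
    """

    def convert(match):
        decimal, hexadecimal, name = match.groups()
        if name is not None:
            number = name2codepoint.get(name)  # None for unknown names
        elif decimal is not None:
            number = int(decimal)
        else:
            number = int(hexadecimal, 16)
        if number is None:
            return match.group(0)
        # Numeric references in 0x80..0x9F are treated as Windows-1252 bytes.
        if 0x80 <= number <= 0x9F:
            try:
                return bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)  # a few cp1252 bytes are unassigned
        try:
            return chr(number)
        except (ValueError, OverflowError):
            return match.group(0)  # outside the Unicode range

    return ENTITY.sub(convert, text)


print(replace_html_entities("caf&eacute; costs 3&#128; &amp; up&#x85;"))
```

Here `&#128;` becomes the euro sign and `&#x85;` becomes a horizontal ellipsis, because those numbers are decoded as Windows-1252 bytes, while `&eacute;` and `&amp;` go through the ordinary name lookup.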

@oxinabox oxinabox mentioned this issue Feb 3, 2019
3 tasks
@oxinabox
Member Author

Oh neat! An emoticon lexer.
(Not reviewed)

@oxinabox oxinabox closed this as completed Jun 6, 2019
2 participants