Add a Twitter tokenizer #3

oxinabox · 2018-05-08T07:36:21Z

Twitter text tends not to play well with normal tokenizers.

There are some Twitter tokenizers around,
so we could port one of those.

Ayushk4 · 2019-01-17T16:51:02Z

I am interested in working on this issue. I searched for a while and mainly came across two tweet tokenizers: the one in NLTK and the one from Tweet NLP.

Which of the two would be better?

oxinabox · 2019-01-17T17:07:47Z

The NLTK one.

This package is Apache 2, and so the Tweet NLP license is not compatible.
We have already taken tokenizers from NLTK,
and so we can do so again.

Ayushk4 · 2019-01-18T10:50:41Z

I was going through the codebase. I noticed that sed was used rather than regex matching in Julia. Is that for speed? Should I also stick with a sed-based tokenizer?

oxinabox · 2019-01-18T12:33:04Z

sed isn't actually being used.

We generate Julia code based on the sed script.
sed is basically being used as a DSL.

This is done here, in WordTokenizers.jl/src/words/sedbased.jl (line 9 in 5fad6ff):

function generate_tokenizer_from_sed(sed_script, extended=false)::Expr
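To make the "sed as a DSL" idea concrete, here is a minimal sketch of turning sed substitution lines into a Julia function expression. It is not the code in sedbased.jl: the name `sketch_tokenizer_from_sed` is hypothetical, it only understands lines of the form `s/PATTERN/REPLACEMENT/g` with no `/` inside the pattern or replacement, and it assumes the sed pattern is also valid as a Julia (PCRE) regex.

```julia
# Hypothetical sketch, not the implementation in sedbased.jl.
# Translates `s/PATTERN/REPLACEMENT/g` lines into a Julia function expression
# that applies each substitution in order and then splits on whitespace.
function sketch_tokenizer_from_sed(sed_script::AbstractString, name::Symbol)::Expr
    steps = Expr[]
    for line in split(sed_script, '\n')
        line = strip(line)
        # Skip blank lines and sed comments.
        (isempty(line) || startswith(line, "#")) && continue
        m = match(r"^s/(.*?)/(.*?)/g?$", line)
        m === nothing && error("unsupported sed line: $line")
        pattern = Regex(String(m.captures[1]))
        # sed back-references such as \1 use the same syntax as SubstitutionString.
        subst = SubstitutionString(String(m.captures[2]))
        push!(steps, :(text = replace(text, $pattern => $subst)))
    end
    return quote
        function $name(input::AbstractString)
            text = String(input)
            $(steps...)
            return split(text)
        end
    end
end

# A one-rule script that pads sentence punctuation with spaces.
script = raw"""
# separate punctuation from the neighbouring word
s/([.,!?])/ \1 /g
"""

# Evaluating the generated expression defines `punct_tokenize`.
eval(sketch_tokenizer_from_sed(script, :punct_tokenize))
punct_tokenize("Hello, world!")
```

No sed process is ever started here: the script is only read as text, and the tokenizer that actually runs is ordinary Julia code built from it.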

Ayushk4 · 2019-01-23T17:17:51Z

I am nearing completion, but I have been stuck on one thing for a while: I need to decode Windows-1252 (cp1252) into UTF-8 (Unicode). Any leads on this?

oxinabox · 2019-01-23T18:26:07Z

You do?
Where are you encountering this?

Anyway, StringEncodings.jl should be what you are after.

Ayushk4 · 2019-01-23T18:51:31Z

Web browsers interpret a certain limited range of numeric character references as characters in the Windows-1252 encoding.

One part of the tokenizer is "replacing HTML entities in the text by converting them to their corresponding Unicode character". This is where I need it.

Also, thanks. I will look into StringEncodings.jl.
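As a rough illustration of the entity replacement described above, the sketch below rewrites numeric character references and remaps a few codes in the 0x80 to 0x9F range the way Windows-1252 defines them. The names `CP1252_SUBSET`, `decode_entity` and `replace_numeric_entities` are hypothetical, the table is a deliberately partial subset, and named entities such as `&amp;` are not handled. A full cp1252 decode could come from StringEncodings.jl instead of a hand-written table.

```julia
# Illustrative, partial Windows-1252 table for the 0x80-0x9F range.
const CP1252_SUBSET = Dict{Int,Char}(
    0x80 => '\u20ac',  # euro sign
    0x85 => '\u2026',  # horizontal ellipsis
    0x91 => '\u2018',  # left single quotation mark
    0x92 => '\u2019',  # right single quotation mark
    0x93 => '\u201c',  # left double quotation mark
    0x94 => '\u201d',  # right double quotation mark
    0x96 => '\u2013',  # en dash
    0x97 => '\u2014',  # em dash
    0x99 => '\u2122',  # trade mark sign
)

# Convert one reference such as "&#146;" or "&#x41;" to its character.
# Anything that cannot be decoded is returned unchanged.
function decode_entity(entity::AbstractString)
    body = entity[3:end-1]  # drop the leading "&#" and the trailing ";"
    code = if lowercase(first(body)) == 'x'
        tryparse(Int, body[2:end]; base=16)
    else
        tryparse(Int, body; base=10)
    end
    code === nothing && return String(entity)
    # Codes listed in the table get their Windows-1252 meaning.
    haskey(CP1252_SUBSET, code) && return string(CP1252_SUBSET[code])
    # Leave values that are not Unicode scalar values untouched.
    (code > 0x10ffff || 0xd800 <= code <= 0xdfff) && return String(entity)
    return string(Char(code))
end

# Replace every decimal or hexadecimal numeric character reference in `text`.
function replace_numeric_entities(text::AbstractString)
    return replace(text, r"&#(?:[xX][0-9a-fA-F]+|[0-9]+);" => decode_entity)
end

replace_numeric_entities("It&#146;s 5&#8364; &#x41;")
```

Codes in the 0x80 to 0x9F range that are missing from the partial table fall through to the plain code point, so a real port would need the complete mapping.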

oxinabox · 2019-04-12T20:22:55Z

Oh neat! An emoticon lexer.
(Not reviewed)

This was referenced May 8, 2018:

Add Some sentiment datasets JuliaText/CorpusLoaders.jl#13 (Closed)

Blog post: Sentiment Analysis with SOWE in Flux.jl oxinabox/oxinabox.github.io#3 (Open)

oxinabox mentioned this issue Feb 3, 2019:

Add Tweet Tokenizer #13 (Merged, 3 tasks)

oxinabox closed this as completed Jun 6, 2019
