Add Tweet Tokenizer #13

Merged: 45 commits, Jun 6, 2019

Conversation

@Ayushk4 Ayushk4 commented Feb 3, 2019

The tweet tokenizer has been added.

  • Tokenizer added
  • Documentation
  • Tests
The tokenizer works as follows:
  1. Regular expressions are built for WORD_REGEX (the core tokenizer), HANG_REGEX,
    and EMOTICONS_REGEX.
  2. The function replace_html_entities replaces HTML entities (e.g. "Price: &pound;100" becomes "Price: £100").
  3. Optionally, lengthened strings are shortened ("......" becomes "..." and "waaaaay" becomes "waaay"), and Twitter handles are removed.
  4. The string is tokenized.
  5. preserve_case is set to true by default. If it is set to false,
    the tokenizer downcases everything except emoticons.
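
As a rough illustration of the steps above, here is a minimal sketch in Python rather than the package's Julia. The patterns are heavily simplified stand-ins, not the WORD_REGEX, HANG_REGEX or EMOTICONS_REGEX from this PR, and the function name `tweet_tokenize_sketch` and its keyword arguments `reduce_len` and `strip_handles` are hypothetical names chosen for the example.

```python
import html
import re

# Simplified stand-ins for the PR's regexes (assumptions, not the real patterns).
EMOTICON_RE = re.compile(r"[:;=][\-o*']?[)\](\[dDpP/\\|]")
HANDLE_RE = re.compile(r"(?<![A-Za-z0-9_])@\w{1,15}")
LENGTHENED_RE = re.compile(r"(.)\1{2,}")  # same character three or more times
TOKEN_RE = re.compile(
    r"https?://\S+"            # URLs, kept whole
    + "|" + EMOTICON_RE.pattern  # emoticons before words so ":D" stays whole
    + r"|[@#]?\w+(?:'\w+)?"     # words, handles, hashtags
    + r"|\.\.\."               # ellipsis
    + r"|\S"                   # any other non-space character
)


def tweet_tokenize_sketch(text, preserve_case=True, reduce_len=False,
                          strip_handles=False):
    # Step 2: replace HTML entities such as &pound; with their characters.
    text = html.unescape(text)
    # Step 3 (optional): drop handles and shorten runs to three characters.
    if strip_handles:
        text = HANDLE_RE.sub(" ", text)
    if reduce_len:
        text = LENGTHENED_RE.sub(r"\1\1\1", text)
    # Step 4: tokenize.
    tokens = TOKEN_RE.findall(text)
    # Step 5: downcase everything except emoticons when preserve_case is false.
    if not preserve_case:
        tokens = [t if EMOTICON_RE.fullmatch(t) else t.lower() for t in tokens]
    return tokens


print(tweet_tokenize_sketch("Price: &pound;100"))
# ['Price', ':', '£', '100']
print(tweet_tokenize_sketch("@user Waaaaay cool...... :D", preserve_case=False,
                            reduce_len=True, strip_handles=True))
# ['waaay', 'cool', '...', ':D']
```

The emoticon alternative comes before the word alternative so that ":D" is matched as one token, and the same pattern is reused in step 5 to exempt emoticons from downcasing.
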
I have 2 questions related to this -

codecov-io commented Feb 3, 2019

Codecov Report

Merging #13 into master will decrease coverage by 4.98%.
The diff coverage is 72%.

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
- Coverage   80.86%   75.88%   -4.99%     
==========================================
  Files           9       10       +1     
  Lines         277      651     +374     
==========================================
+ Hits          224      494     +270     
- Misses         53      157     +104

| Impacted Files | Coverage Δ |
|---|---|
| src/split_api.jl | 0% <ø> (ø) ⬆️ |
| src/words/fast.jl | 81.81% <66.66%> (+0.56%) ⬆️ |
| src/words/tweet_tokenizer.jl | 72.04% <72.04%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d877dc...c7bd296.
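
The percentages in the report can be reproduced from its hit and line counts, since coverage is hits divided by lines. The sketch below is only a check of that arithmetic. The helper name `coverage_percent` is hypothetical, and truncating to two decimals (rather than rounding) is an inference from the figures shown, not documented Codecov behaviour.

```python
import math


def coverage_percent(hits, lines):
    # Coverage as a percentage, truncated (not rounded) to two decimals.
    # Truncation is an assumption inferred from the report's 80.86% figure.
    return math.floor(hits * 10000 / lines) / 100


master = coverage_percent(224, 277)   # counts from the "master" column
pr = coverage_percent(494, 651)       # counts from the "#13" column

print(master)                 # 80.86
print(pr)                     # 75.88
print(round(master - pr, 2))  # 4.98

# The counts are internally consistent: hits + misses == lines.
assert 224 + 53 == 277
assert 494 + 157 == 651
```
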

oxinabox commented Feb 3, 2019

for record keeping purposes this is to address #3

oxinabox commented Feb 3, 2019

> Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers will be added, so maybe we can have examples on each.

My feeling is that tokenizers will remain simple enough that they can all go in the readme.
At the point at which this changes, we can look at a solution like Documenter.jl and a documentation page. But I find that, in general, dealing with that is a surprisingly constant source of complexity for a project.

@oxinabox oxinabox left a comment

Cool!
I've just done a first pass for style and basic optimisation.
Looks pretty OK,
but I'll do another, more careful pass once these changes are made
and once tests are written.

All globals need to be declared const.

You seem to use a lot of inner functions which are called only once,
and I am not seeing much gain in clarity from them.
The functionality could be moved to where it is used,
or they could be made outer functions,
as suits.

Ayushk4 commented Jun 1, 2019

Now only the tests and documentation remain :)

Ayushk4 commented Jun 5, 2019

I have made the suggested changes; you may review this PR.

oxinabox commented Jun 5, 2019

Last tiny things then we can merge this.

Ayushk4 commented Jun 5, 2019

The changes have been made and pushed. The CI tests also pass.

oxinabox commented Jun 5, 2019

🎉

@oxinabox oxinabox merged commit db3707b into JuliaText:master Jun 6, 2019
3 participants