Twitter usernames and hashtags that begin with a number are not parsed correctly, e.g.
RT @1310kfkanews: #1310kfkanews
is tokenized with the "@" and "#" as separate tokens.
1 RT rt NN pos2=NNP 3 nmod _ O
2 @ @ SYM pos2=IN 3 punct _ O
3 1310kfkanews 0kfkanews NN pos2=NNS 12 dep _ O
4 : : : pos2=, 12 punct _ O
5 # # NN pos2=SYM 6 compound _ U-CARDINAL
6 1310kfkanews 0kfkanews NN pos2=NNS 10 dep _ U-MONEY
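For comparison, the tokenization I would expect keeps each mention and hashtag as one token. The following is only a minimal sketch of that expected behavior, not NLP4J code: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the sketch assumes a mention or hashtag is simply "@" or "#" followed by word characters.

```python
import re

# Hypothetical pattern, not part of NLP4J. Assumption: a mention or hashtag
# is "@" or "#" followed by word characters (letters, digits, underscore),
# so a leading digit does not split the token.
TOKEN_RE = re.compile(r"[@#]\w+|\w+|[^\w\s]")


def tokenize(text):
    """Return a list of tokens, keeping @mentions and #hashtags whole."""
    return TOKEN_RE.findall(text)


print(tokenize("RT @1310kfkanews: #1310kfkanews"))
# prints ['RT', '@1310kfkanews', ':', '#1310kfkanews']
```

The mention/hashtag alternative is listed first in the pattern so that "@" and "#" are consumed together with the characters that follow, rather than being emitted as standalone symbols.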
I am working on a comparison of tokenizers for microblog texts, and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).
[This issue was previously reported as https://github.com/emorynlp/nlp4j-tokenization/issues/11]