tokenize.lua removes sentences containing special characters #316

incasedo opened this Issue Jun 10, 2017 · 3 comments



incasedo commented Jun 10, 2017

When I used tokenize.lua to process a corpus, I got this result:

Tokenization completed in 2437.165 seconds - 9650000 sentences
Tokenization completed in 1762.637 seconds - 9649999 sentences

After checking, I found that a special character (�) caused the problem.

Please see the attached file s1.txt, line 61:
the 9th international conference on low level measurements of actinides and long - lived radio �nuclides in biological and environ mental samples.

@incasedo incasedo closed this Jun 10, 2017

@incasedo incasedo reopened this Jun 10, 2017



guillaumekln commented Jun 16, 2017

Thanks for reporting and providing the data. I reproduced the issue.



guillaumekln commented Jun 19, 2017

Actually it is more than a special character: it is a null character (byte 0). So I'm more inclined to classify this input as invalid, because even the Lua read primitive fails to read the line completely.
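Since the input is considered invalid, one workaround is to strip NUL bytes from the corpus before running tokenize.lua. This is a sketch of my own (not part of OpenNMT; file paths and the function name are illustrative), reading in binary mode so the bytes survive intact:

```python
# Sketch: remove NUL bytes (byte 0) from a corpus file before tokenization.
# Paths and function name are illustrative, not part of tokenize.lua.
def strip_nul_bytes(in_path, out_path):
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            # Drop every 0x00 byte; all other bytes pass through unchanged.
            dst.write(line.replace(b"\x00", b""))
```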



incasedo commented Jun 21, 2017

Thanks! It would be best to report the offending line number in the corpus, because such characters are very hard to locate.
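Until the tokenizer reports the line number itself, the offending lines can be found with a small external check. A minimal sketch (my own, in Python; the function name is an assumption) that prints the 1-based line numbers containing a NUL byte:

```python
# Sketch (not part of tokenize.lua): return the 1-based line numbers of
# lines that contain a NUL byte, so the bad sentences can be located.
def find_nul_lines(path):
    bad = []
    with open(path, "rb") as f:  # binary mode so NUL bytes are preserved
        for lineno, line in enumerate(f, start=1):
            if b"\x00" in line:
                bad.append(lineno)
    return bad
```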
