
[MRG] Fix for #814 #1176

Merged (1 commit) Mar 2, 2017
Conversation

jayantj
Contributor

@jayantj commented Mar 1, 2017

The current code attempts to decode the characters of each vocab word individually, as ascii.
Instead, the code is changed to do the following: characters are read as raw bytes first, then decoded as utf-8 once the entire word has been read.

Also includes a test fastText model with non-ascii vocab words, and checks vector lookup for the same.

Haven't been able to test the newly released FastText vectors for wiki in various languages since the models are large-ish (~10 GB).

Related issue reported here

@anmolgulati
Contributor

@jayantj This looks good, but should we wait to test this on the new pre-trained models first and then ask Lev to merge? Though I feel it shouldn't make any difference, since you have already added a test for non-ascii words.

@tmylk
Contributor

tmylk commented Mar 2, 2017

Agree with @anmol01gulati. Does it work on https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, @jayantj ?

@tmylk merged commit 4d8333a into piskvorky:develop Mar 2, 2017
@piskvorky
Owner

piskvorky commented Mar 3, 2017

The previous behaviour was not good; how did that pass code review and unit tests? :(

Does fastText always output utf8? Is the encoding really fixed?

@piskvorky
Owner

@jayantj @tmylk the fastText file header is incorrect (states I wrote the code in 2013). Please fix.

@piskvorky
Owner

piskvorky commented Mar 3, 2017

@tmylk use better issue/PR titles, so search engines can index them and users find them, when they hit the same error.

@piskvorky
Owner

piskvorky commented Mar 6, 2017

ping @jayantj: any status on the questions above?

@jayantj
Contributor Author

jayantj commented Mar 7, 2017

The fastText readme says that the training file should contain utf-8 encoded text. That was the rationale behind choosing utf8 as the encoding while deserializing the model.

Looking at the fastText code and experimenting with different encodings though, fastText doesn't really decode the bytes in the training file and simply treats the input file as a stream of bytes.

The only place it does make use of the byte values is while tokenizing the text, where it checks each byte against the whitespace characters (' ', '\n', '\r', '\t', '\v', '\f').

So any encoding that assigns the same byte values to these characters as ascii would produce sane output (utf16 and utf32 don't; many others do). The original encoding information isn't stored anywhere, though.
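A small sketch of that byte-level tokenization, to illustrate why ascii-compatible encodings survive it (my own reconstruction of the behaviour, not fastText's actual code):

```python
# Byte values fastText splits on: ' ', '\n', '\r', '\t', '\v', '\f'
WHITESPACE = {ord(c) for c in ' \n\r\t\v\f'}

def tokenize_bytes(data):
    # Split a byte string on ascii whitespace values without ever
    # decoding it, mimicking fastText's treatment of the input as a
    # raw byte stream.
    tokens, current = [], bytearray()
    for b in data:
        if b in WHITESPACE:
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    if current:
        tokens.append(bytes(current))
    return tokens

# utf-8 multi-byte sequences never contain these whitespace byte
# values, so utf-8 text splits correctly; utf16/utf32 would not.
print(tokenize_bytes('héllo wörld'.encode('utf-8')))
```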

I can think of two ways to handle this:

  1. Assume words in the input file are utf8-encoded (since the FastText readme mentions it, and it might be enforced in the future), and raise a friendlier error when the input isn't valid utf8
  2. Provide an `encoding` parameter allowing the user to specify the input file encoding (defaulting to utf8, of course)
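A minimal sketch combining both ideas (hypothetical helper name and error message, just to make the options concrete):

```python
def decode_word(word_bytes, encoding='utf-8'):
    # Option 2: let the caller pass the encoding; default stays utf-8.
    try:
        return word_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Option 1: replace the bare UnicodeDecodeError with a
        # friendlier, actionable message.
        raise ValueError(
            "Could not decode vocab word as %r; if the model was trained "
            "on text in another encoding, pass it via the `encoding` "
            "parameter." % encoding)

print(decode_word('café'.encode('utf-8')))            # café
print(decode_word('café'.encode('latin-1'), 'latin-1'))  # café
```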

Any thoughts?

@jayantj
Contributor Author

jayantj commented Mar 7, 2017

Re: file header, I'll create a single PR with the updated header and the fix for the above issue, whichever one we go with.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Option 2 sounds like the most flexible.
