
[MRG] Fix for #814 #1176

Merged (1 commit) Mar 2, 2017
Conversation

jayantj
Contributor

@jayantj commented Mar 1, 2017

The current code attempts to decode the characters of each vocab word individually, as ascii.
Instead, the code is changed to do the following: characters are read as raw bytes first, then decoded as utf-8 once the entire word has been read.

Also includes a test fastText model with non-ascii vocab words, and checks vector lookup for the same.

Haven't been able to test the newly released FastText vectors for wiki in various languages since the models are large-ish (~10 GB).

Related issue reported here

@anmolgulati
Contributor

@jayantj This looks good, but should we wait to test this on the new pre-trained models first and then ask Lev to merge? Though I feel it shouldn't make any difference, since you have already added a test for non-ascii words.

@tmylk
Contributor

tmylk commented Mar 2, 2017

Agree with @anmol01gulati. Does it work on https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, @jayantj ?

@tmylk merged commit 4d8333a into piskvorky:develop Mar 2, 2017
@piskvorky
Owner

piskvorky commented Mar 3, 2017

The previous behaviour was not good; how did that pass code review and unit tests? :(

Does fastText always output utf8? Is the encoding really fixed?

@piskvorky
Owner

@jayantj @tmylk the fastText file header is incorrect (states I wrote the code in 2013). Please fix.

@piskvorky
Owner

piskvorky commented Mar 3, 2017

@tmylk use better issue/PR titles, so search engines can index them and users find them, when they hit the same error.

@piskvorky
Owner

piskvorky commented Mar 6, 2017

ping @jayantj: any status on the questions above?

@jayantj
Contributor Author

jayantj commented Mar 7, 2017

The fastText readme says that the training file should contain utf-8 encoded text. That was the rationale behind choosing utf8 as the encoding while deserializing the model.

Looking at the fastText code and experimenting with different encodings though, fastText doesn't really decode the bytes in the training file and simply treats the input file as a stream of bytes.

The only place it does make use of the byte values is while tokenizing the text, where it checks each byte against the whitespace characters (' ', '\n', '\r', '\t', '\v', '\f').

So any encoding that assigns the same byte values to these characters as ascii would produce sane output (utf16 and utf32 don't; many others do). The original encoding information isn't stored anywhere, though.
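A small sketch of that byte-level tokenization, to illustrate why ascii-compatible encodings survive it (my own reconstruction of the behaviour, not fastText's actual code):

```python
# Byte values fastText splits on: ' ', '\n', '\r', '\t', '\v', '\f'
WHITESPACE = {ord(c) for c in ' \n\r\t\v\f'}

def tokenize_bytes(data):
    # Split a byte string on ascii whitespace values without ever
    # decoding it, mimicking fastText's treatment of the input as a
    # raw byte stream.
    tokens, current = [], bytearray()
    for b in data:
        if b in WHITESPACE:
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    if current:
        tokens.append(bytes(current))
    return tokens

# utf-8 multi-byte sequences never contain these whitespace byte
# values, so utf-8 text splits correctly; utf16/utf32 would not.
print(tokenize_bytes('héllo wörld'.encode('utf-8')))
```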

I can think of two ways to handle this:

  1. Assume words in the input file are utf8-encoded (since the FastText readme mentions it, and it might be enforced in the future), and raise a friendlier error when the input isn't valid utf8
  2. Provide an `encoding` parameter allowing the user to specify the input file encoding (defaulting to utf8, of course)
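A minimal sketch combining both ideas (hypothetical helper name and error message, just to make the options concrete):

```python
def decode_word(word_bytes, encoding='utf-8'):
    # Option 2: let the caller pass the encoding; default stays utf-8.
    try:
        return word_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Option 1: replace the bare UnicodeDecodeError with a
        # friendlier, actionable message.
        raise ValueError(
            "Could not decode vocab word as %r; if the model was trained "
            "on text in another encoding, pass it via the `encoding` "
            "parameter." % encoding)

print(decode_word('café'.encode('utf-8')))            # café
print(decode_word('café'.encode('latin-1'), 'latin-1'))  # café
```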

Any thoughts?

@jayantj
Contributor Author

jayantj commented Mar 7, 2017

Re: file header, I'll create a single PR with the updated header and the fix for the above issue, whichever one we go with.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Option 2 sounds like the most flexible.
