
support both old and new fastText model #1319

Merged
merged 19 commits into piskvorky:develop on May 25, 2017

Conversation

@prakhar2b (Contributor) commented May 13, 2017

Fixes #1301. Support both old and new fastText models.

@prakhar2b (Contributor, Author) commented May 13, 2017

Tests are failing because the lee_fasttext test data was trained with the older version of Facebook's fastText; we need to update it too. Also, I'll handle the whitespace error ASAP.

Note: as mentioned in this comment, Facebook made changes last week that introduce two new header variables, magic and version, as well as dictionary pruning. These changes also cause errors for all newly trained models when loaded with the gensim fastText wrapper. This PR modifies the wrapper accordingly.

@piskvorky (Owner) left a comment:

Minor code style nitpick; thanks for the PR!

@@ -256,7 +256,7 @@ def load_binary_data(self, model_binary_file):
             self.load_vectors(f)

     def load_model_params(self, file_handle):
-        (dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
+        (_,_,dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@14i1d')
Owner:

PEP8: space after comma. Outer brackets not needed.
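
For readers skimming the diff: '@12i1d' unpacks twelve native int32s followed by one double (fastText's original Args header), while '@14i1d' reads two extra leading int32s, which the new format uses for the magic and version fields. A minimal standalone sketch, assuming struct_unpack is the wrapper's thin helper over struct.unpack (as it is elsewhere in this file):

import struct

def struct_unpack(file_handle, fmt):
    # Read exactly the number of bytes the format string describes, then unpack.
    num_bytes = struct.calcsize(fmt)
    return struct.unpack(fmt, file_handle.read(num_bytes))

# New-format header: magic and version come first, then the 12 original fields.
# with open('model.bin', 'rb') as f:
#     (magic, version, dim, ws, epoch, minCount, neg, _, loss, model,
#      bucket, minn, maxn, _, t) = struct_unpack(f, '@14i1d')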

@@ -275,7 +275,7 @@ def load_dict(self, file_handle):
         # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
         assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
         assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
-        ntokens, = self.struct_unpack(file_handle, '@q')
+        ntokens,pruneidx_size = self.struct_unpack(file_handle, '@2q')
Owner:

Space after comma.

@@ -289,6 +289,11 @@ def load_dict(self, file_handle):
             assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
             self.wv.vocab[word].count = count

+        for j in range(pruneidx_size):
+            _,_ = self.struct_unpack(file_handle,'@2i')
Owner:

Spaces after commas.

for j in range(pruneidx_size):
    _,_ = self.struct_unpack(file_handle,'@2i')

_ = self.struct_unpack(file_handle,'@?')
Owner:

Ditto (spaces after commas).

@piskvorky (Owner) commented May 14, 2017

And do we continue to support the old format too? Or will everyone's code (and models) break after this change?

Are the fastText datasets that people commonly use still loadable? (FB released models in many languages some time ago, IIRC.)

@prakhar2b (Contributor, Author) commented May 15, 2017

@piskvorky the fastText datasets (pretrained vectors) have been updated to match the recent changes (see this update from 12 days ago). Issue #1301 (about the Hebrew pretrained vectors) was caused by exactly these changes. Issue #1236 (about the French pretrained vectors) is a glitch on Facebook's side, not in gensim's code.
cc @jayantj

@jayantj (Contributor) commented May 15, 2017

@prakhar2b I think we should definitely support loading both the old and new fastText models with the gensim fastText wrapper. fastText seems to have updated the links on their page to the newer models, but there might still be a number of people using the old models.

Also, adding tests for loading both older and newer format models would be good.

@prakhar2b prakhar2b changed the title [WIP] mismatch in bin and vec file resolved. [WIP] support both old and new fastText model May 19, 2017
@prakhar2b (Contributor, Author):

@tmylk @menshikh-iv please look into the branch conflict
cc @jayantj

# Test model successfully loaded from fastText (new format) .vec and .bin files
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file)
vocab_size, model_size = 1763, 10
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size))
Contributor:

Shouldn't this and the subsequent assert statements make use of new_model?

Contributor Author:

self.assertEqual(self.`test_new_model`.wv.syn0.shape, (vocab_size, model_size))

new_model is used here, after the assert statements:

self.model_sanity(new_model)

Contributor:

Yes, so why not for the other assert statements? If I understand correctly, the whole point of the test is to make sure the model we loaded (new_model) has been loaded correctly.

Contributor Author:

test_new_model is defined here on line 38. Only the new model is being checked in this function, testLoadFastTextNewFormat. Please compare it with the function testLoadFastTextFormat to see the context.


def load_vectors(self, file_handle):
    if self.new_format:
        _ = self.struct_unpack(file_handle,'@?')
Contributor:

Can you please add a comment clarifying what this is for?
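
For context: this boolean corresponds to the quant_input flag that fastText writes before the input matrix (the comment added later in this PR, visible further down this page, confirms it). A minimal sketch of the annotated read:

def load_vectors(self, file_handle):
    if self.new_format:
        # bool quant_input in fasttext.cc; read only to advance the file
        # position, since the value itself is unused by the wrapper
        self.struct_unpack(file_handle, '@?')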

logger.warning("Please report to gensim or fastText.")
ntokens = self.struct_unpack(file_handle, '@1q')
if self.new_format:
    pruneidx_size = self.struct_unpack(file_handle, '@q')
Contributor:

pruneidx_size seems to be used in the original FastText code to load dictionaries -

void Dictionary::load(std::istream& in) {
  words_.clear();
  std::fill(word2int_.begin(), word2int_.end(), -1);
  in.read((char*) &size_, sizeof(int32_t));
  in.read((char*) &nwords_, sizeof(int32_t));
  in.read((char*) &nlabels_, sizeof(int32_t));
  in.read((char*) &ntokens_, sizeof(int64_t));
  in.read((char*) &pruneidx_size_, sizeof(int64_t));
  for (int32_t i = 0; i < size_; i++) {
    char c;
    entry e;
    while ((c = in.get()) != 0) {
      e.word.push_back(c);
    }
    in.read((char*) &e.count, sizeof(int64_t));
    in.read((char*) &e.type, sizeof(entry_type));
    words_.push_back(e);
    word2int_[find(e.word)] = i;
  }
  pruneidx_.clear();

  for (int32_t i = 0; i < pruneidx_size_; i++) {
    int32_t first;
    int32_t second;
    in.read((char*) &first, sizeof(int32_t));
    in.read((char*) &second, sizeof(int32_t));
    pruneidx_[first] = second;
  }

  initTableDiscard();
  initNgrams();
}

I'm not sure whether it's present in the models we're loading, though.
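
For comparison, a minimal Python sketch of just the pruneidx portion of that C++ loader, assuming the wrapper's struct_unpack helper; the dict here is illustrative, since the PR under review reads and discards these pairs:

# Mirrors the pruneidx loop in Dictionary::load above:
# pruneidx_size pairs of int32s, mapping first -> second.
pruneidx = {}
for _ in range(pruneidx_size):
    first, second = self.struct_unpack(file_handle, '@2i')
    pruneidx[first] = second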

Contributor Author:

for j in range(pruneidx_size):  
            _,_ = self.struct_unpack(file_handle,'@2i')

Presence (or absence) of these two lines doesn't affect the code.

@jayantj (Contributor), May 22, 2017:

I don't see those two lines in the code. Do you mean pruneidx_size is always zero, and it doesn't matter whether we keep those lines or not?
If yes, we should add an assert statement, since if pruneidx_size isn't zero and we don't read the corresponding bytes, the subsequent bytes are not going to be parsed correctly.

Contributor Author:

It's -1, actually. I added these two lines in the new commit. I think we should not use an assert statement here, as some models might have non-negative values, so adding these two lines should be sufficient.

-    def load_dict(self, file_handle):
-        (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
+    def load_dict(self, file_handle, encoding='utf8'):
+        vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
Contributor:

I'd prefer keeping the changes related to the issue with the French wiki in a separate PR. We don't want those changes to block this PR from being merged.
In general, I think it's good practice to keep fixes for different issues in different branches/PRs.

Contributor Author:

Separated the PRs. Thanks :)

if magic == 793712314: # newer format
    self.new_format = True
    dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t = self.struct_unpack(file_handle, '@12i1d')
else: # older format
Contributor:

Better to set self.new_format = False inside this else, rather than inside initialize_word_vectors

@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file):
             self.load_vectors(f)

     def load_model_params(self, file_handle):
-        (dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
+        magic, v= self.struct_unpack(file_handle, '@2i')
+        if magic == 793712314: # newer format
Contributor:

I think it'd be good to store this value in a global variable. FASTTEXT_MAGIC_HEADER or something similar.
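
A sketch of that suggestion; the constant name is illustrative, and the '@10i1d' continuation for the older format assumes the two ints already consumed were actually dim and ws when no magic header is present:

FASTTEXT_FILEFORMAT_MAGIC = 793712314  # header value marking the newer binary format

def load_model_params(self, file_handle):
    magic, version = self.struct_unpack(file_handle, '@2i')
    if magic == FASTTEXT_FILEFORMAT_MAGIC:  # newer format
        self.new_format = True
        (dim, ws, epoch, minCount, neg, _, loss, model,
         bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
    else:  # older format: the two ints we just read were dim and ws
        self.new_format = False
        dim, ws = magic, version
        (epoch, minCount, neg, _, loss, model,
         bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@10i1d')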

@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file):
             self.load_vectors(f)

     def load_model_params(self, file_handle):
-        (dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
+        magic, v= self.struct_unpack(file_handle, '@2i')
@jayantj (Contributor), May 22, 2017:

Code style: keep one space before and after =.
Also, version would be preferable to v.

@jayantj (Contributor) commented May 22, 2017

Left some comments, but apart from that, looks good!
Thanks for the fix.

What issues are you facing with the branch conflict?

@prakhar2b (Contributor, Author):

@jayantj updated the PR according to the review. The check errors are in the uploaded .vec file.
cc @tmylk

@tmylk (Contributor) commented May 22, 2017

The old fastText tests should still pass; strange to see them broken.
Also, please add .vec to the flake8 ignore list.

@xiaobaozi34:

I pulled the 'mismatch' branch, reinstalled gensim, and re-ran FastText_Tutorial.ipynb, but still got the same error:

AssertionError                            Traceback (most recent call last)
<ipython-input-3-993376a144dd> in <module>()
      7 lee_train_file = data_dir + 'lee_background.cor'
      8 
----> 9 model = FastText.train(ft_home, lee_train_file)
     10 
     11 print(model)

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
    209 
    210         output = utils.check_output(args=cmd)
--> 211         model = cls.load_fasttext_format(output_file)
    212         cls.delete_training_files(output_file)
    213         return model

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    237         model = cls()
    238         model.wv = cls.load_word2vec_format('%s.vec' % model_file, encoding=encoding)
--> 239         model.load_binary_data('%s.bin' % model_file, encoding=encoding)
    240         return model
    241 

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, model_binary_file, encoding)
    254         with utils.smart_open(model_binary_file, 'rb') as f:
    255             self.load_model_params(f)
--> 256             self.load_dict(f, encoding=encoding)
    257             self.load_vectors(f)
    258 

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_dict(self, file_handle, encoding)
    275         (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
    276         # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
--> 277         assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
    278         assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
    279         ntokens, = self.struct_unpack(file_handle, '@q')

AssertionError: mismatch between vocab sizes

@prakhar2b (Contributor, Author) commented May 23, 2017

Ohh, the last PR update was only a one-word change, so in my excitement I entered True instead of False and committed without testing. The old fastText tests are not failing now. But the new fastText models should not have thrown an error anyway.

The only remaining error is about an assigned variable never being used. How should I handle it? It is required. @tmylk

@xiaobaozi34 I checked; it's running perfectly with this PR on my system. Could you please re-run it with logging enabled and show me the logs? Thank you.

         # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
         assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
         assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
-        ntokens, = self.struct_unpack(file_handle, '@q')
+        ntokens = self.struct_unpack(file_handle, '@1q')
Contributor:

Could change this to simply self.struct_unpack(file_handle, '@1q') # number of tokens - unused to remove the unused variable.

            assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
            self.wv.vocab[word].count = count

        if self.new_format:
            for j in range(pruneidx_size):
                _, _ = self.struct_unpack(file_handle, '@2i')
Contributor:

Same as above.

def load_vectors(self, file_handle):
    if self.new_format:
        _ = self.struct_unpack(file_handle, '@?')  # bool quant_input in fasttext.cc
Contributor:

Same as above.

@jayantj (Contributor) commented May 23, 2017

@prakhar2b Left a couple of minor comments re: the failing build; we should be good to go after that. Thanks for taking care of this.

A couple of tips for the future -

  1. For issues like these, it's good to follow TDD - commit a failing test first, then commit the changes required to fix it - look at this comment [WIP] Fix backward incompatibility due to random_state #1327 (comment)
    It ensures the changes correctly fix the issue, and also speeds up the development process for you personally.

  2. It's good practice to ensure commit messages actually reflect all the changes made in the commit - if a commit makes more than one unrelated update, it's usually better to separate it out into multiple commits.

@@ -286,7 +286,7 @@ def load_dict(self, file_handle, encoding='utf8'):
         # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
         assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
         assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
-        ntokens = self.struct_unpack(file_handle, '@1q')
+        self.struct_unpack(file_handle, '@1q') # number of tokens
Contributor:

PEP8: Inline comments should be separated by at least two spaces from the statement

@prakhar2b (Contributor, Author):

Finally! 😄 @jayantj, thanks for the review and tips. Now it feels like doing some real work; it will only get better from here.
cc @tmylk

@prakhar2b prakhar2b changed the title [WIP] support both old and new fastText model support both old and new fastText model May 23, 2017
@menshikh-iv (Contributor):

Nice work @prakhar2b 🥇

@menshikh-iv menshikh-iv merged commit 7b6afc0 into piskvorky:develop May 25, 2017
@matanox commented May 25, 2017

Looking at the recent release train, will this likely go into a new release within a month or so?

@menshikh-iv (Contributor):

@matanster yes, we plan the release for next week.
