Loading fastText binary output to gensim like word2vec #814

phunterlau · 2016-08-05T21:51:17Z

Facebook recently open-sourced fastText (https://github.com/facebookresearch/fastText), which improves on the word2vec SkipGram model. It uses a similar output format for word-vector key-value pairs, and the similarity calculation is about the same, but its binary output format differs from the binary format of the C version of word2vec. Do we want to support loading fastText model output in gensim? Thanks.


gojomo · 2016-08-05T23:38:12Z

Definitely! Reading/writing the fastText word-vector format (the .vec and perhaps consulting part of the .bin) would be an obvious 1st step.

(As a later step, the .bin might include enough extra info for models to continue training… though their classification modes might not map directly to the existing gensim output-layer models.)

gojomo · 2016-08-06T04:31:14Z

It appears the .vec output of fastText is already compatible with the original word2vec.c text format, and readable in gensim by load_word2vec_format(filename, binary=False).
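
For reference, the text layout that makes this work is simple: a header line with the vocabulary size and dimensionality, then one line per word with its space-separated values. Below is a minimal stdlib-only sketch of a reader. `read_word2vec_text` is a hypothetical helper written for illustration, not gensim's loader, and it assumes words contain no spaces.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then one word per line."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip("\n").split(" ")
        # Some writers leave a trailing space after the last value; drop empty fields.
        values = [p for p in parts[1:] if p]
        if len(values) != dim:
            raise ValueError("expected %d values for %r, got %d" % (dim, parts[0], len(values)))
        vectors[parts[0]] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Tiny in-memory sample standing in for a real .vec file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3 \ncat -0.5 0.0 1.5 \n")
vectors = read_word2vec_text(sample)
print(vectors["cat"])
```

In practice `load_word2vec_format(filename, binary=False)` is the call to use; the sketch only shows what that format contains.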

The .bin output, written in parallel (rather than as an alternative format like in word2vec.c), seems to have extra info – such as the vectors for char-ngrams – that wouldn't map directly into gensim models unless/until they're extended with new features. So supporting the load of such info isn't a mere matter of format understanding/translation.

phunterlau · 2016-08-06T09:36:09Z

@gojomo thanks. I can confirm the vec format is compatible with gensim

tmylk · 2016-08-10T06:59:50Z

See FastText comparison notebook in #815

gojomo · 2016-08-19T23:45:17Z

@tmylk - we may want to keep this open for the larger issue of doing something with the .bin output. We might be able to map its weights (and word-frequency info) into gensim's objects, to support continued training, as a small translation-of-values patch.

Loading the buckets-of-subword-vectors, and making them usable for OOV prediction, would require a bit more actual functionality... but would still be practical. Maybe at first, the subword-buckets wind up in a different class – even perhaps a KeyedVectors variant/sibling - which would offer both subword vector lookup, by hashed key, and word-vector-reconstruction (incl. OOV words), by composition of subword vectors. (cc @droudy)
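
A rough sketch of the idea, under stated assumptions: each subword is hashed to a bucket row, and a word vector (including for an OOV word) is built by averaging the rows of its subwords. The FNV-1a hash and the names `fnv1a_32` and `compose_vector` are illustrative choices; this is not claimed to be byte-compatible with fastText's own hashing or with any gensim class.

```python
def fnv1a_32(text):
    """32-bit FNV-1a hash over the UTF-8 bytes of text."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def compose_vector(ngrams, bucket_vectors):
    """Average the bucket rows selected by hashing each n-gram."""
    num_buckets = len(bucket_vectors)
    dim = len(bucket_vectors[0])
    total = [0.0] * dim
    for ngram in ngrams:
        row = bucket_vectors[fnv1a_32(ngram) % num_buckets]
        for i in range(dim):
            total[i] += row[i]
    if not ngrams:
        return total
    return [x / len(ngrams) for x in total]


# Toy bucket matrix: 4 buckets, 2 dimensions (made-up values).
buckets = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
oov_vector = compose_vector(["<fo", "foo", "oo>"], buckets)
print(oov_vector)
```

Because several subwords can land in the same bucket, lookup by hashed key is lossy by design; the bucket count trades memory for collisions.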

tmylk · 2017-01-24T20:31:15Z

Implemented in https://github.com/RaRe-Technologies/gensim/blob/2a70e3a726404cd4230542a35cfd2dc4d63da6f1/gensim/models/wrappers/fasttext.py#L246

AritzBi · 2017-03-01T15:59:05Z

FastText just published pre-trained word vectors for 90 languages trained on Wikipedia. I am trying to load the Spanish, Basque or English models with gensim=1.0.0 and the method FastText.load_fasttext_format but I have the following error:

File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
    self.load_dict(f)
  File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 282, in load_dict
    char = char.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Should I use some other method?

anmolgulati · 2017-03-01T19:44:38Z

cc @jayantj @tmylk We should investigate these new pre-trained models and find out why they fail. We should also add these model files to the tests.

jayantj · 2017-03-01T20:13:03Z

Thanks for reporting this issue. At first glance, it seems like the code makes an assumption that the characters constituting the vocab words can be decoded as ascii. That would be a dangerous assumption to make. Looking into it further.

And yes, adding these model files (or maybe simply models with non-ascii characters, and possibly even utf-16/utf-32 characters) to tests would be a good idea. Will do as soon as I get to the root of this issue.
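
To illustrate the failure mode: decoding one byte at a time breaks on any multi-byte UTF-8 character (0xc3 in the traceback above is the first byte of a two-byte sequence). A minimal sketch of the safer pattern is to collect raw bytes up to the terminator and decode the whole word once. `read_null_terminated` is a hypothetical helper, and this is not necessarily how the actual fix is written.

```python
import io


def read_null_terminated(handle, encoding="utf-8"):
    """Read bytes up to a NUL terminator, then decode them in one step."""
    raw = bytearray()
    while True:
        byte = handle.read(1)
        if not byte:
            raise EOFError("stream ended before the NUL terminator")
        if byte == b"\x00":
            break
        raw += byte
    return raw.decode(encoding)


# Two NUL-terminated words; the first contains a two-byte UTF-8 character.
stream = io.BytesIO("ni\u00f1o".encode("utf-8") + b"\x00" + b"the\x00")
words = [read_null_terminated(stream), read_null_terminated(stream)]
print(words)

# Decoding the lead byte alone fails, which mirrors the reported error.
try:
    b"\xc3".decode("utf-8")
    single_byte_ok = True
except UnicodeDecodeError:
    single_byte_ok = False
```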

jayantj · 2017-03-01T20:28:41Z

Yes, confirming that this is the issue. I'll push a fix for this asap.

jayantj · 2017-03-01T21:28:10Z

Fix pushed as part of #1176
I haven't been able to test loading the new pre-trained models yet, since they are rather large (~10 GB) and the download is taking forever.

AritzBi · 2017-03-02T10:56:54Z

I've just tested @jayantj fix with Spanish and Basque models and they are properly loaded. Thanks for the quick fix!!

evanmiltenburg · 2017-03-02T23:09:52Z

I tried the pre-trained English model, with the following command:

fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')

I get the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-e2cc3eaf9300> in <module>()
      1 # We use the FastText wrapper from Gensim.
      2 # Download the vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
----> 3 fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
      4 
      5 # Alternatively:

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
    236         model = cls()
    237         model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238         model.load_binary_data('%s.bin' % model_file)
    239         return model
    240 

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
    253         with utils.smart_open(model_binary_file, 'rb') as f:
    254             self.load_model_params(f)
--> 255             self.load_dict(f)
    256             self.load_vectors(f)
    257 

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
    280             word = ''
    281             char, = self.struct_unpack(file_handle, '@c')
--> 282             char = char.decode()
    283             # Read vocab word
    284             while char != '\x00':

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

This is with Gensim 1.0.0, freshly installed from PyPi, on OS X. The following code did work (but doesn't load the binary file):

fasttext = Word2Vec.load_word2vec_format('/Users/Emiel/Downloads/wiki.en/wiki.en.vec', binary=False)

Does @jayantj's fix also solve this issue? If so, should I install Gensim from GitHub, or will the patch soon also be on PyPi?

tmylk · 2017-03-03T00:20:09Z

Please install from github for now

tmylk · 2017-03-04T00:34:02Z

Fixed in gensim 1.0.1 available on PyPI

tmylk · 2017-03-06T12:13:32Z

Fixed in #1176

already-taken-m17 · 2017-03-14T16:26:13Z

Hi,
Not sure if this is the right place for my question, but I'll ask anyway.
Does a trained fastText .bin file contain ngram vectors of sizes [3-6] only, or ngram vectors of all sizes? After loading the model with gensim and iterating through the ngrams, I found that ngrams of all sizes are present.
If ngrams of all sizes are present, my other question is which ngrams are used to build the vector of an out-of-vocabulary word: ngrams of sizes [3-6], or all of them?

jayantj · 2017-03-14T16:55:05Z

Hi @already-taken-m17
It depends on the hyperparameters the model was trained with - the default values of min_n and max_n are 3 and 6.
Which model are you loading, and how exactly are you iterating through ngrams?

For out-of-vocabulary words, again, ngrams of sizes [min_n,max_n] are used (and only those ngrams which were present in the ngram vocabulary of the training data)
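
As a small illustration of what "ngrams of sizes [min_n, max_n]" means, here is a sketch that enumerates the character n-grams of a word after wrapping it in the `<` and `>` boundary markers that fastText uses. `char_ngrams` is a hypothetical helper; it works on Python characters and does not filter against a trained ngram vocabulary.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

With the defaults (3 and 6), no n-gram shorter than 3 or longer than 6 characters is produced, so anything outside that range seen while iterating would have to come from somewhere else.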

already-taken-m17 · 2017-03-15T03:47:42Z

Hi @jayantj , thanks for the reply.
I trained the model with the original C++ fastText implementation from the fastText GitHub repo.
I used following command to train:
./fasttext skipgram -input data.txt -output model
This must take default parameters and store ngrams of sizes [3,6].
For iterating through ngrams, I am using model.wv.ngrams after loading the model using fasttext wrapper of gensim.

dimeldo · 2017-03-23T09:55:13Z

@jayantj I'm getting the same error when trying to load fasttext pre-trained "wiki.he.bin" using this command:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, unicode_errors='ignore')

I'm getting this error:

return unicode(text, encoding, errors=errors)
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte

piskvorky · 2017-03-23T09:56:26Z

@jayantj @tmylk we have a new 10TB disk on h2 -- feel free to download these "large fastText files" there, for testing.

evanmiltenburg · 2017-03-23T09:56:49Z

The .bin file is not in word2vec binary format. Use the .vec file and load it with the flag binary=False. Or use the FastText wrapper:

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format(filename_without_extension)

According to the source, this is shorthand for:

from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('FILENAME.vec')
model.load_binary_data('FILENAME.bin')

The binary data is specific to the FastText algorithm.

Edit: in response to @dimeldo.
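
One quick way to tell the two kinds of file apart before choosing a loader: both word2vec formats start with an ASCII header line of the form "vocab_size dim". The check below is a heuristic sketch only; `looks_like_word2vec` is a hypothetical helper, and the second sample is arbitrary binary bytes rather than a real fastText file.

```python
import io


def looks_like_word2vec(handle):
    """Return True if the stream starts with an ASCII '<count> <dim>' header line."""
    first_line = handle.readline(64)
    parts = first_line.split()
    return (
        first_line.endswith(b"\n")
        and len(parts) == 2
        and all(p.isdigit() for p in parts)
    )


w2v_like = io.BytesIO(b"2 3\nthe 0.1 0.2 0.3\n")
other_binary = io.BytesIO(b"\x01\x02\x03\x04\x0c\x00\x00\x00")
print(looks_like_word2vec(w2v_like), looks_like_word2vec(other_binary))
```

If the check fails on a .bin file, that is a hint it is not word2vec binary and should go through the FastText wrapper instead.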

dimeldo · 2017-03-23T10:01:13Z

@evanmiltenburg will definitely try that, thanks.

Edit: It worked well, thanks.

jayantj · 2017-03-23T20:50:48Z

Thanks for the update, good to hear it worked.

jdchoi77 · 2017-03-27T14:09:04Z

I tried to load using the load_fasttext_format function, which takes forever to read the .vec file. It seems like fastText manages to load all vectors by just reading .bin file, which is much faster. Would it be possible to avoid reading the .vec file for loading when there is a .bin file? I'm only asking since I prefer to use gensim for both word2vec and fasttext instead of adapting another library. Thanks.

piskvorky · 2017-04-03T23:42:39Z

@tmylk @jayantj how does that SaleStock load its data? We definitely don't want to be slower than other Python tools/wrappers.

tmylk · 2017-04-04T23:12:43Z

SaleStock is using C++ code closer to original FastText. Created #1261 wishlist issue.

piskvorky · 2017-04-09T02:46:57Z

If it's really that annoyingly slow, we could read the code in C (Cython-compiled) -- seems easy enough.

prakhar2b · 2017-04-10T05:53:30Z

@piskvorky @tmylk Cythonizing would be fine, but as of now, we are first getting info from .vec file and then while reading .bin file, we use assert statement to confirm that there is no mismatch in the info obtained from .vec and .bin file. This seems unnecessary.

jayantj · 2017-04-10T06:43:47Z

Yes, it should be possible to load the model only from the .bin file without having to read the .vec file.
As of now, the .vec file is used to initialize the KeyedVectors instance, which include:

The vocabulary (words and counts)
The vectors for in-vocab words

The .bin file contains the in-vocab words too (loaded in FastText.load_dict), however the word vectors will have to be initialized by making use of the char-ngram vectors. They are not directly present in the .bin file.

Changing this would require non-trivial changes to the FastText class. It could be useful to do some quick profiling to see whether loading the .vec file takes up a significant portion of time before expending effort on changing this behaviour (ideally, for models of different sizes - say 50 MB, 500 MB, 5 GB)
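
A quick way to do that kind of profiling is to time each loading stage separately and compare their shares. The sketch below uses stand-in callables; `time_stages` and the stage names are hypothetical, and real measurements would wrap the actual .vec and .bin loading calls.

```python
import time


def time_stages(stages):
    """Run each (name, callable) pair and return {name: (seconds, share_of_total)}."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    total = sum(timings.values())
    return {
        name: (secs, secs / total if total else 0.0)
        for name, secs in timings.items()
    }


# Stand-ins for the real loading steps (assumption: replace with actual loaders).
report = time_stages([
    ("read_vec", lambda: sum(range(200000))),
    ("read_bin", lambda: sum(range(100000))),
])
for name, (secs, share) in report.items():
    print("%s: %.4fs (%.0f%%)" % (name, secs, share * 100))
```

Running this against models of several sizes would show whether the .vec stage dominates before any redesign is attempted.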

arashsa · 2017-04-24T22:24:20Z

I am having some issues reading fasttext files.

from gensim.models import KeyedVectors
no_model = KeyedVectors.load_word2vec_format('wiki.no/wiki.no.vec')

The above code works, but with that I'm not able to get oov words.

from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('wiki.no/wiki.no.vec')
model.load_binary_data('wiki.no/wiki.no.bin')

With the above code I get the error:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'load_binary_data'

There are no examples in the documentation as to the best way to read a fasttext file and get oov vectors.

piskvorky · 2017-04-25T00:41:53Z

@tmylk if so, please update the docs with your team.

tmylk · 2017-04-27T02:15:38Z

@arashsa The code has been updated since the comment above, please use load_fasttext_format now

arashsa · 2017-05-07T12:30:38Z

@tmylk when I try the load_fasttext_format method I get this error:

AssertionError: mismatch between vocab sizes

tmylk · 2017-05-08T07:34:50Z

@arashsa There is an actual mismatch between the vocab sizes in the .vec and .bin files for French, so it is possible the same mismatch exists for Norwegian. Please report it to FastText.

rajivgrover009 · 2017-12-19T17:56:13Z

Does it support online training?

menshikh-iv · 2017-12-20T09:18:21Z

@rajivgrover009 our fasttext implementation - yes.

rajivgrover009 · 2017-12-20T21:35:19Z

@menshikh-iv I meant continuing to train the pretrained fastText models. Is it possible to load the models pretrained by fastText and continue training them to add use-case-specific vocabulary?

menshikh-iv · 2017-12-21T05:35:58Z

@anmolgulati show me a concrete link please, probably, only main matrices saved, i.e. - you can only use it (but can't continue training).

sauravm8 · 2018-07-15T11:43:45Z

Trying to load Bengali Fastext model, both for .vec and .bin format

For both .bin and .vec formats, while trying with models.KeyedVectors.load_word2vec_format( ):

ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try git clean -xdf (removes all
files not under version control). Otherwise reinstall numpy.

Same error when trying FastText.load_vectors( ) or FastText.load_binary_data( )

menshikh-iv · 2018-07-15T14:02:10Z

@sauravm8 you have issues with numpy installation, resolve it first and reinstall gensim after

for .bin use https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format (this typically contains full model with parameters, ngrams, etc), you can continue training after loading
for .vec use https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format (this contains ONLY word-vectors -> no ngrams + you can't update the model)

sauravm8 · 2018-07-15T14:04:18Z

@menshikh-iv Yes. It was a version conflict between numpy and python. Solved.

harrypotter0 · 2018-07-23T08:09:34Z

@menshikh-iv Hi Sir,
I tried to load a model (faq.model.bin which is trained using fasttext) using gensim wrapper, code I used for loading the model :

import os
from nltk.tokenize import word_tokenize, sent_tokenize
from pprint import pprint
import re
from textblob import TextBlob
import string
from nltk.corpus import stopwords
from gensim.models import Word2Vec

import gensim
from gensim.models import word2vec, KeyedVectors
from threading import Semaphore
import logging

import numpy as np
import os

vector_dim = 300
root_path = os.getcwd()
from nltk.tokenize import word_tokenize
import multiprocessing

def readStr():
    return raw_input().strip()

if __name__ == "__main__":
    from gensim.models.wrappers import FastText
    file1 = open("fasttext_finance.txt","w")
    print "Loading the model"
    model_path = "/home/akash/Documents/nlp-tests/models/finance_model/faq.model.bin"
    model = FastText.load_fasttext_format(model_path)
    # model = gensim.models.fasttext.load_fasttext_format(model_path)
    print(model.most_similar('banks'))

    lis = ["income","maturity","tax","mutual","fund","banks","cash","pf","epf","bankrupt",
            "loans","money","benefit","insurance","debt","advantage","sbi","kotak","shares",
            "food","hotel","retirement","travel","food","health","salary","account","advantage","disadvantage"]

    for word in lis:
        N = 50
        print("Most similar words to {} are :{}\n".format(word,model.most_similar(positive=[word],topn=N)))
        file1.write("Most similar words to {} are :{}\n\n".format(word,model.most_similar(positive=[word],topn=N)))
    file1.close()

and here is the error I receive:

/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py:410: RuntimeWarning: divide by zero encountered in remainder
  ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
Traceback (most recent call last):
  File "testing_merged_all.py", line 31, in <module>
    model = FastText.load_fasttext_format(model_path)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 271, in load_fasttext_format
    model.load_binary_data(encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 297, in load_binary_data
    self.load_vectors(f)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 384, in load_vectors
    self.init_ngrams()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 412, in init_ngrams
    self.wv.syn0_ngrams = self.wv.syn0_ngrams.take(ngram_indices, axis=0)
IndexError: index 2534933 is out of bounds for size 2534933
Is it memory error or something else?

Apart from the above, here is the script I used for training a pretrained model on my own corpus and generated the model which I used in above script for testing it. Here goes the script:

./fasttext supervised \
  -pretrainedVectors /home/akash/Downloads/wiki.en.vec \
  -input output.txt \
  -dim 300 \
  -output faq.model

Have I not trained the pretrained model correctly or where is the error? As the same model(faq.model.bin) is working fine with the pyfasttext library. Please look into this.

menshikh-iv · 2018-07-31T10:12:36Z

@harrypotter0 we don't support "supervised" models, can you share your .bin please, I want to reproduce this issue?
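
One reading of the traceback above: the warning on `ngram_hash % self.bucket` suggests the loaded bucket count was zero, in which case every computed index would point at the row just past the in-vocab vectors. A minimal sketch of a guard that fails early with a clearer message follows; `ngram_row_index` is a hypothetical helper, not part of gensim.

```python
def ngram_row_index(vocab_size, ngram_hash, bucket):
    """Map an n-gram hash to its row, which is stored after the in-vocab word rows."""
    if bucket <= 0:
        raise ValueError(
            "model reports bucket=%d, so it has no subword rows to index" % bucket
        )
    return vocab_size + ngram_hash % bucket


print(ngram_row_index(10, 25, 7))
# 14
```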

Sherriiie · 2018-09-04T17:33:02Z

How can I save a fastText model to .bin and .vec files?

kevin369ml · 2018-09-04T18:02:16Z

@Sherriiie If you download the original pre-trained files from the official fastText website (https://fasttext.cc/docs/en/english-vectors.html) and unzip them, they are .vec files.
Regarding the .bin file, I usually use gensim to convert .vec to .bin like this:
vec_file = gensim.models.KeyedVectors.load_word2vec_format("crawl_300d_2M.vec", binary=False)
vec_file.save_word2vec_format("crawl_300d_2M.bin", binary=True)
Loading is much faster with binary files.

menshikh-iv · 2018-09-05T02:14:52Z

@kevin28520 note (it may be obvious, but to avoid confusion) that a .bin produced in this way is different from the .bin distributed by FB: the proposed .bin will contain ONLY word-vectors (no ngrams), so it is still equivalent to the .vec file distributed by FB.
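
To make the point concrete, here is a stdlib-only sketch of the word2vec binary layout: an ASCII header, then for each word its bytes, a space, and the raw float32 values. There is no slot for ngram vectors or training parameters. The helper names are hypothetical, little-endian floats are an assumption, and this is not gensim's implementation.

```python
import io
import struct


def write_word2vec_binary(handle, vectors):
    """Write {word: [floats]} in word2vec binary layout (little-endian float32)."""
    dim = len(next(iter(vectors.values())))
    handle.write(("%d %d\n" % (len(vectors), dim)).encode("utf-8"))
    for word, vec in vectors.items():
        handle.write(word.encode("utf-8") + b" ")
        handle.write(struct.pack("<%df" % dim, *vec))
        handle.write(b"\n")


def read_word2vec_binary(handle):
    """Read the layout written above back into {word: [floats]}."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = handle.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left after the previous vector
                word += ch
        data = handle.read(4 * dim)
        vectors[word.decode("utf-8")] = list(struct.unpack("<%df" % dim, data))
    return vectors


original = {"the": [0.5, -1.25], "cat": [2.0, 0.0]}
buffer = io.BytesIO()
write_word2vec_binary(buffer, original)
buffer.seek(0)
restored = read_word2vec_binary(buffer)
print(restored)
```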

ghazeefa · 2019-04-25T14:03:30Z

I am using the fastText pre-trained model for the Urdu language: https://fasttext.cc/docs/en/pretrained-vectors.html
Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?

import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText

pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

i got similarity score 1.8220717906951904
when i load .bin file

model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

i got similarity score 0.376111775636673

mpenkov · 2019-04-26T10:22:15Z

@ghazeefa please open a new ticket for your problem

tmylk added wishlist Feature request and removed wishlist Feature request labels Aug 10, 2016

tmylk closed this as completed Aug 10, 2016

schwittlick mentioned this issue Dec 1, 2016

Compare Word2Vec Training with Facebook/fastText Schwittleymani/ECO#180

Open

gojomo reopened this Mar 1, 2017

jayantj mentioned this issue Mar 1, 2017

[MRG] Fix for #814 #1176

Merged

tmylk closed this as completed Mar 6, 2017

tmylk mentioned this issue Apr 4, 2017

Improve FastText loading times #1261

Closed

menshikh-iv added the need info Not enough information for reproduce an issue, need more info from author label Jul 31, 2018

Repository owner locked as off-topic and limited conversation to collaborators Apr 26, 2019
