
Loading fastText binary output to gensim like word2vec #814

Closed
phunterlau opened this issue Aug 5, 2016 · 49 comments
Labels
need info Not enough information to reproduce an issue, need more info from author

Comments

@phunterlau

Facebook's recently open-sourced fastText (https://github.com/facebookresearch/fastText) improves on the word2vec skip-gram model. It uses a similar output format for word/vector key-value pairs, and the similarity calculation is about the same too, but its binary output format differs from that of the C version's word2vec binary format. Do we want to support loading fastText model output in gensim? Thanks.

@gojomo
Collaborator

gojomo commented Aug 5, 2016

Definitely! Reading/writing the fastText word-vector format (the .vec and perhaps consulting part of the .bin) would be an obvious 1st step.

(As a later step, the .bin might include enough extra info for models to continue training… though their classification modes might not map directly to the existing gensim output-layer models.)

@gojomo
Collaborator

gojomo commented Aug 6, 2016

It appears the .vec output of fastText is already compatible with the original word2vec.c text format, and readable in gensim by load_word2vec_format(filename, binary=False).

The .bin output, written in parallel (rather than as an alternative format like in word2vec.c), seems to have extra info – such as the vectors for char-ngrams – that wouldn't map directly into gensim models unless/until they're extended with new features. So supporting the load of such info isn't a mere matter of format understanding/translation.
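For reference, that shared text format is simple: a header line with vocab size and dimensionality, then one word per line followed by its vector components. A minimal standalone parser, as a sketch only (gensim's real loader additionally handles binary mode, encodings, and limits):

```python
import io

def parse_vec(text_stream):
    """Parse the word2vec/fastText text format: a 'vocab_size dim' header,
    then one 'word v1 v2 ... vdim' line per vocabulary word."""
    header = text_stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in text_stream:
        parts = line.rstrip().split(' ')
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim
        vectors[word] = vec
    assert len(vectors) == vocab_size
    return vectors

sample = "2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n"
vecs = parse_vec(io.StringIO(sample))
```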

@phunterlau
Author

@gojomo thanks. I can confirm the .vec format is compatible with gensim.

@tmylk tmylk added wishlist Feature request and removed wishlist Feature request labels Aug 10, 2016
@tmylk
Contributor

tmylk commented Aug 10, 2016

See FastText comparison notebook in #815

@tmylk tmylk closed this as completed Aug 10, 2016
@gojomo
Collaborator

gojomo commented Aug 19, 2016

@tmylk - we may want to keep this open for the larger issue of doing something with the .bin output. We might be able to map its weights (and word-frequency info) into gensim's objects, to support continued training, as a small translation-of-values patch.

Loading the buckets-of-subword-vectors, and making them usable for OOV prediction, would require a bit more actual functionality... but would still be practical. Maybe at first, the subword-buckets wind up in a different class – even perhaps a KeyedVectors variant/sibling - which would offer both subword vector lookup, by hashed key, and word-vector-reconstruction (incl. OOV words), by composition of subword vectors. (cc @droudy)
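A toy sketch of that composition, modeled on fastText's scheme (boundary markers, 32-bit FNV-1a hashing of ngrams into a fixed number of buckets, then averaging the bucket vectors). The sizes and the random table here are illustrative, not fastText's actual parameters:

```python
import random

MIN_N, MAX_N, BUCKETS, DIM = 3, 6, 20, 4  # toy sizes; real models use millions of buckets

def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    # fastText wraps each word in boundary markers before slicing.
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def fnv1a(s):
    # 32-bit FNV-1a, the style of hash fastText uses to map ngrams to buckets.
    h = 2166136261
    for b in s.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

# Stand-in for a trained ngram matrix (syn0_ngrams in the later wrapper).
rng = random.Random(0)
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def oov_vector(word):
    # An OOV word's vector is composed from its hashed subword vectors.
    rows = [ngram_table[fnv1a(g) % BUCKETS] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]
```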

@AritzBi

AritzBi commented Mar 1, 2017

FastText just published pre-trained word vectors for 90 languages trained on Wikipedia. I am trying to load the Spanish, Basque, or English models with gensim==1.0.0 and the method FastText.load_fasttext_format, but I get the following error:

File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
    self.load_dict(f)
  File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 282, in load_dict
    char = char.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Should I use some other method?

@anmolgulati
Contributor

cc @jayantj @tmylk We should investigate these new pre-trained models and why they fail, and also add these model files to the tests.

@jayantj
Contributor

jayantj commented Mar 1, 2017

Thanks for reporting this issue. At first glance, it seems like the code makes an assumption that the characters constituting the vocab words can be decoded as ascii. That would be a dangerous assumption to make. Looking into it further.

And yes, adding these model files (or maybe simply models with non-ascii characters, and possibly even utf-16/utf-32 characters) to tests would be a good idea. Will do as soon as I get to the root of this issue.
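The failure is easy to reproduce without gensim: UTF-8 is a multi-byte encoding, so decoding one byte at a time breaks on any non-ASCII word. A minimal sketch of the bug and the fix (collect the raw bytes up to the NUL terminator, then decode once):

```python
# A NUL-terminated vocab word as it appears in the .bin dict section.
raw = 'café\x00'.encode('utf-8')

# Buggy approach (what the wrapper did): decode each byte as it is read.
failed = False
try:
    for i in range(len(raw)):
        raw[i:i + 1].decode('utf-8')  # 0xc3 is only the first byte of 'é'
except UnicodeDecodeError:
    failed = True

# Fix: accumulate raw bytes until the NUL terminator, then decode once.
word_bytes = bytearray()
for b in raw:
    if b == 0:
        break
    word_bytes.append(b)
word = word_bytes.decode('utf-8')
```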

@gojomo gojomo reopened this Mar 1, 2017
@jayantj
Contributor

jayantj commented Mar 1, 2017

Yes, confirming that this is the issue. I'll push a fix for this asap.

@jayantj
Contributor

jayantj commented Mar 1, 2017

Fix pushed as part of #1176
I haven't been able to test loading the new pre-trained models yet, since they are rather large (~10 GB) and the download is taking forever.

@AritzBi

AritzBi commented Mar 2, 2017

I've just tested @jayantj fix with Spanish and Basque models and they are properly loaded. Thanks for the quick fix!!

@evanmiltenburg

evanmiltenburg commented Mar 2, 2017

I tried the pre-trained English model, with the following command:

fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')

I get the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-e2cc3eaf9300> in <module>()
      1 # We use the FastText wrapper from Gensim.
      2 # Download the vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
----> 3 fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
      4 
      5 # Alternatively:

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
    236         model = cls()
    237         model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238         model.load_binary_data('%s.bin' % model_file)
    239         return model
    240 

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
    253         with utils.smart_open(model_binary_file, 'rb') as f:
    254             self.load_model_params(f)
--> 255             self.load_dict(f)
    256             self.load_vectors(f)
    257 

/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
    280             word = ''
    281             char, = self.struct_unpack(file_handle, '@c')
--> 282             char = char.decode()
    283             # Read vocab word
    284             while char != '\x00':

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

This is with Gensim 1.0.0, freshly installed from PyPi, on OS X. The following code did work (but doesn't load the binary file):

fasttext = Word2Vec.load_word2vec_format('/Users/Emiel/Downloads/wiki.en/wiki.en.vec', binary=False)

Does @jayantj's fix also solve this issue? If so, should I install Gensim from GitHub, or will the patch soon also be on PyPi?

@tmylk
Contributor

tmylk commented Mar 3, 2017

Please install from GitHub for now.

@tmylk
Contributor

tmylk commented Mar 4, 2017

Fixed in gensim 1.0.1 available on PyPI

@tmylk
Contributor

tmylk commented Mar 6, 2017

Fixed in #1176

@tmylk tmylk closed this as completed Mar 6, 2017
@already-taken-m17

Hi,
Not sure if this is the right place to put up my doubt, but asking anyway.
Does the trained .bin file of fastText contain ngram vectors of sizes [3-6] only, or ngram vectors of all sizes? Upon loading the model with gensim and iterating through the ngrams, I found that ngrams of all sizes are present.
If ngrams of all sizes are present, then my other doubt is: which ngrams are used to build the vector of an out-of-vocabulary word, ngrams of sizes [3-6] or all?

@jayantj
Contributor

jayantj commented Mar 14, 2017

Hi @already-taken-m17
It depends on the hyperparameters the model was trained with - the default values of min_n and max_n are 3 and 6.
Which model are you loading, and how exactly are you iterating through ngrams?

For out-of-vocabulary words, again, ngrams of sizes [min_n,max_n] are used (and only those ngrams which were present in the ngram vocabulary of the training data)
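For example, with the defaults min_n=3, max_n=6, the ngrams of "where" (wrapped in the `<`/`>` boundary markers fastText adds) can be listed with a small sketch:

```python
def char_ngrams(word, min_n=3, max_n=6):
    # Default fastText hyperparameters: min_n=3, max_n=6.
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams('where')
# includes '<wh', 'whe', ..., '<where', 'where>'
```

Only those ngrams that also occur in the trained ngram vocabulary contribute to an OOV vector.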

@already-taken-m17

Hi @jayantj, thanks for the reply.
I trained the model with the original C++ implementation from the fastText GitHub repo.
I used the following command to train:
./fasttext skipgram -input data.txt -output model
This should use the default parameters and store ngrams of sizes [3,6].
For iterating through the ngrams, I am using model.wv.ngrams after loading the model with gensim's fastText wrapper.

@dimeldo

dimeldo commented Mar 23, 2017

@jayantj I'm getting the same error when trying to load fasttext pre-trained "wiki.he.bin" using this command:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, unicode_errors='ignore')

I'm getting this error:

return unicode(text, encoding, errors=errors)
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte

@piskvorky
Owner

@jayantj @tmylk we have a new 10TB disk on h2 -- feel free to download these "large fastText files" there, for testing.

@evanmiltenburg

evanmiltenburg commented Mar 23, 2017

The .bin file is not in word2vec binary format. Use the .vec file and load it with the flag binary=False. Or use the FastText wrapper:

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format(filename_without_extension)

According to the source, this is shorthand for:

from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('FILENAME.vec')
model.load_binary_data('FILENAME.bin')

The binary data is specific to the FastText algorithm.

Edit: in response to @dimeldo.

@dimeldo

dimeldo commented Mar 23, 2017

@evanmiltenburg will definitely try that, thanks.

Edit: It worked well, thanks.

@jayantj
Contributor

jayantj commented Mar 23, 2017

Thanks for the update, good to hear it worked.

@jdchoi77

I tried to load using the load_fasttext_format function, which takes forever to read the .vec file. It seems fastText manages to load all vectors by reading just the .bin file, which is much faster. Would it be possible to avoid reading the .vec file when there is a .bin file? I'm only asking since I prefer to use gensim for both word2vec and fastText instead of adopting another library. Thanks.

@piskvorky
Owner

piskvorky commented Apr 3, 2017

@tmylk @jayantj how does that SaleStock load its data? We definitely don't want to be slower than other Python tools/wrappers.

@tmylk
Contributor

tmylk commented Apr 4, 2017

SaleStock is using C++ code closer to original FastText. Created #1261 wishlist issue.

@piskvorky
Owner

If it's really that annoyingly slow, we could read the code in C (Cython-compiled) -- seems easy enough.

@prakhar2b
Contributor

@piskvorky @tmylk Cythonizing would be fine, but as of now we first read the metadata from the .vec file and then, while reading the .bin file, use an assert statement to confirm that there is no mismatch between the info obtained from the .vec and .bin files. This seems unnecessary.

@jayantj
Contributor

jayantj commented Apr 10, 2017

Yes, it should be possible to load the model only from the .bin file without having to read the .vec file.
As of now, the .vec file is used to initialize the KeyedVectors instance, which include:

  1. The vocabulary (words and counts)
  2. The vectors for in-vocab words

The .bin file contains the in-vocab words too (loaded in FastText.load_dict), however the word vectors will have to be initialized by making use of the char-ngram vectors. They are not directly present in the .bin file.

Changing this would require non-trivial changes to the FastText class. It could be useful to do some quick profiling to see whether loading the .vec file takes up a significant portion of time before expending effort on changing this behaviour (ideally, for models of different sizes - say 50 MB, 500 MB, 5 GB)
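A minimal timing harness for that kind of comparison; the parser here is a toy stand-in (in a real profile you would time load_word2vec_format on the .vec against load_binary_data on the .bin):

```python
import io
import time

def timed(label, fn, *args):
    # Crude wall-clock timing, enough for the comparison suggested above.
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

def parse_vec(stream):
    # Toy stand-in for a .vec parse: header, then word-per-line.
    vocab_size, dim = map(int, stream.readline().split())
    return {line.split(' ', 1)[0]: line for line in stream}

sample = io.StringIO("2 3\na 0.1 0.2 0.3\nb 0.4 0.5 0.6\n")
vecs, secs = timed("parse .vec", parse_vec, sample)
```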

@arashsa

arashsa commented Apr 24, 2017

I am having some issues reading fasttext files.

from gensim.models import KeyedVectors
no_model = KeyedVectors.load_word2vec_format('wiki.no/wiki.no.vec')

The above code works, but with that I'm not able to get oov words.

from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('wiki.no/wiki.no.vec')
model.load_binary_data('wiki.no/wiki.no.bin')

With the above code I get the error:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'load_binary_data'

There are no examples in the documentation as to the best way to read a fasttext file and get oov vectors.

@piskvorky
Owner

@tmylk if so, please update the docs with your team.

@tmylk
Contributor

tmylk commented Apr 27, 2017

@arashsa The code has been updated since the comment above; please use load_fasttext_format now.

@arashsa

arashsa commented May 7, 2017

@tmylk when I try the load_fasttext_format method I get this error:

AssertionError: mismatch between vocab sizes

@tmylk
Contributor

tmylk commented May 8, 2017

@arashsa There is an actual mismatch between the vocab sizes in the .vec and .bin files, so it is possible it exists for Norwegian. Please report it to the fastText project.

@rajivgrover009

Does it support online training?

@menshikh-iv
Contributor

@rajivgrover009 our fasttext implementation - yes.

@rajivgrover009

@menshikh-iv I meant continuing to train pretrained fastText models. Is it possible to take pretrained models from fastText and continue training to add use-case-specific vocabulary?

@menshikh-iv
Contributor

@anmolgulati show me a concrete link please; probably only the main matrices are saved, i.e. you can only use it (but can't continue training).

@sauravm8

sauravm8 commented Jul 15, 2018

Trying to load the Bengali fastText model, in both .vec and .bin format.

  1. For both .bin and .vec formats, while trying models.KeyedVectors.load_word2vec_format( ):

ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try git clean -xdf (removes all
files not under version control). Otherwise reinstall numpy.

  2. Same error when trying FastText.load_vectors( ) or FastText.load_binary_data( )

@menshikh-iv
Contributor

@sauravm8 you have issues with your numpy installation; resolve that first, then reinstall gensim.

@sauravm8

@menshikh-iv Yes. It was a version conflict between numpy and Python. Solved.

@harrypotter0

harrypotter0 commented Jul 23, 2018

@menshikh-iv Hi Sir,
I tried to load a model (faq.model.bin, trained using fastText) with the gensim wrapper. The code I used for loading the model:

import os
from nltk.tokenize import word_tokenize, sent_tokenize
from pprint import pprint
import re
from textblob import TextBlob
import string
from nltk.corpus import stopwords
from gensim.models import Word2Vec

import gensim
from gensim.models import word2vec, KeyedVectors
from threading import Semaphore
import logging

import numpy as np
import os

vector_dim = 300
root_path = os.getcwd()
from nltk.tokenize import word_tokenize
import multiprocessing

def readStr():
    return raw_input().strip()

if __name__ == "__main__":
    from gensim.models.wrappers import FastText
    file1 = open("fasttext_finance.txt","w")
    print "Loading the model"
    model_path = "/home/akash/Documents/nlp-tests/models/finance_model/faq.model.bin"
    model = FastText.load_fasttext_format(model_path)
    # model = gensim.models.fasttext.load_fasttext_format(model_path)
    print(model.most_similar('banks'))

    lis = ["income","maturity","tax","mutual","fund","banks","cash","pf","epf","bankrupt",
            "loans","money","benefit","insurance","debt","advantage","sbi","kotak","shares",
            "food","hotel","retirement","travel","food","health","salary","account","advantage","disadvantage"]

    for word in lis:
        N = 50
        print("Most similar words to {} are :{}\n".format(word,model.most_similar(positive=[word],topn=N)))
        file1.write("Most similar words to {} are :{}\n\n".format(word,model.most_similar(positive=[word],topn=N)))
    file1.close()

and here is the error I receive:

/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py:410: RuntimeWarning: divide by zero encountered in remainder
  ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
Traceback (most recent call last):
  File "testing_merged_all.py", line 31, in <module>
    model = FastText.load_fasttext_format(model_path)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 271, in load_fasttext_format
    model.load_binary_data(encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 297, in load_binary_data
    self.load_vectors(f)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 384, in load_vectors
    self.init_ngrams()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 412, in init_ngrams
    self.wv.syn0_ngrams = self.wv.syn0_ngrams.take(ngram_indices, axis=0)
IndexError: index 2534933 is out of bounds for size 2534933

Is it a memory error or something else?

Apart from the above, here is the script I used for training a pretrained model on my own corpus and generated the model which I used in above script for testing it. Here goes the script:

./fasttext supervised \
  -pretrainedVectors /home/akash/Downloads/wiki.en.vec \
  -input output.txt \
  -dim 300 \
  -output faq.model

Have I trained on top of the pretrained vectors incorrectly, or where is the error? The same model (faq.model.bin) works fine with the pyfasttext library. Please look into this.

@menshikh-iv
Contributor

@harrypotter0 we don't support "supervised" models; can you share your .bin please? I want to reproduce this issue.

@menshikh-iv menshikh-iv added the need info Not enough information to reproduce an issue, need more info from author label Jul 31, 2018
@Sherriiie

How can I save a fastText model to .bin and .vec files?

@kevin369ml

kevin369ml commented Sep 4, 2018

@Sherriiie If you download the original pre-trained files from the official fastText website here: https://fasttext.cc/docs/en/english-vectors.html and unzip them, they are .vec files.
Regarding the .bin file, I usually use gensim to transform .vec to .bin like this:

vec_file = gensim.models.KeyedVectors.load_word2vec_format("crawl_300d_2M.vec", binary=False)
vec_file.save_word2vec_format("crawl_300d_2M.bin", binary=True)

Loading is much faster with binary files.

@menshikh-iv
Contributor

menshikh-iv commented Sep 5, 2018

@kevin28520 note (it may be obvious, but to avoid confusion) that a .bin produced this way is different from the .bin distributed by FB: this .bin will contain ONLY word vectors (no ngrams), so it is still equivalent to the .vec file distributed by FB.
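To make that distinction concrete, here is a sketch of the word2vec binary layout that save_word2vec_format emits (simplified; real writers may also delimit records with newlines). Every record is a whole in-vocabulary word, so there is nowhere in this layout to store fastText's ngram matrix, which is why such a .bin cannot provide OOV vectors:

```python
import io
import struct

# Write two word vectors in word2vec's binary layout: an ASCII
# "vocab_size dim" header, then "word " + dim little-endian float32s.
words = {'hello': [0.5, -1.0], 'world': [2.0, 0.25]}
dim = 2
buf = io.BytesIO()
buf.write(f"{len(words)} {dim}\n".encode('utf-8'))
for w, vec in words.items():
    buf.write(w.encode('utf-8') + b' ')
    buf.write(struct.pack(f'<{dim}f', *vec))

# Read it back: word up to the space, then dim float32s.
buf.seek(0)
n, d = map(int, buf.readline().split())
loaded = {}
for _ in range(n):
    wb = bytearray()
    while True:
        c = buf.read(1)
        if c == b' ':
            break
        wb += c
    loaded[wb.decode('utf-8')] = list(struct.unpack(f'<{d}f', buf.read(4 * d)))
```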

@ghazeefa

I am using a fastText pre-trained model for the Urdu language: https://fasttext.cc/docs/en/pretrained-vectors.html
Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?

import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText

pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

I got similarity score 1.8220717906951904.
When I load the .bin file:

model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

I got similarity score 0.376111775636673.

@mpenkov
Collaborator

mpenkov commented Apr 26, 2019

@ghazeefa please open a new ticket for your problem

Repository owner locked as off-topic and limited conversation to collaborators Apr 26, 2019