Loading fastText binary output to gensim like word2vec #814
Comments
Definitely! Reading/writing the fastText word-vector format (the (As a later step, the |
It appears the The |
@gojomo thanks. I can confirm the vec format is compatible with gensim |
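For readers unfamiliar with the .vec layout mentioned above, here is a minimal standard-library sketch of parsing word2vec-style text vectors: a header line with the vocabulary size and dimensionality, then one word per line followed by its components. The function name `read_vec` and the sample data are illustrative assumptions, not gensim's loader.

```python
import io


def read_vec(stream):
    """Parse word2vec-style text vectors from an open text stream.

    Hypothetical helper for illustration; not part of gensim or fastText.
    """
    # Header line: "<vocab_size> <dim>"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        # Lines may end with a trailing space, so strip before splitting.
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Made-up sample data in the same layout.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3 \nworld 0.4 0.5 0.6 \n")
vocab_size, dim, vectors = read_vec(sample)
print(vocab_size, dim, sorted(vectors))
```

Because this layout matches the word2vec text format, a loader written for one can generally read the other; the subword information lives only in the binary file.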
See FastText comparison notebook in #815 |
@tmylk - we may want to keep this open for the larger issue of doing something with the Loading the buckets-of-subword-vectors, and making them usable for OOV prediction, would require a bit more actual functionality... but would still be practical. Maybe at first, the subword-buckets wind up in a different class – even perhaps a KeyedVectors variant/sibling - which would offer both subword vector lookup, by hashed key, and word-vector-reconstruction (incl. OOV words), by composition of subword vectors. (cc @droudy) |
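The lookup-by-hashed-key and reconstruction-by-composition idea described above could be sketched roughly as follows. This is a toy illustration under stated assumptions: it uses plain 32-bit FNV-1a hashing and a simple average, and the names `fnv1a_32`, `compose_vector`, and `bucket_vectors` are hypothetical. fastText's own hashing and composition may differ in details.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash (assumed here as the bucket hash)."""
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def compose_vector(ngrams, bucket_vectors):
    """Average the bucket vectors selected by hashing each n-gram."""
    if not ngrams:
        raise ValueError("need at least one n-gram")
    num_buckets = len(bucket_vectors)
    dim = len(bucket_vectors[0])
    total = [0.0] * dim
    for gram in ngrams:
        # Subword lookup by hashed key.
        row = bucket_vectors[fnv1a_32(gram.encode("utf-8")) % num_buckets]
        for i, value in enumerate(row):
            total[i] += value
    return [value / len(ngrams) for value in total]


# Toy bucket matrix: 8 buckets, 2 dimensions (made-up values).
bucket_vectors = [[float(b), float(b) * 2] for b in range(8)]
oov_vector = compose_vector(["<wh", "whe", "her"], bucket_vectors)
print(oov_vector)
```

Since every n-gram hashes to some bucket, a vector can be built for any string, which is what makes out-of-vocabulary lookup possible.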
FastText just published pre-trained word vectors for 90 languages trained on Wikipedia. I am trying to load the Spanish, Basque, or English models with gensim 1.0.0 using the method FastText.load_fasttext_format, but I get the following error:
Should I use some other method? |
Thanks for reporting this issue. At first glance, it seems like the code makes an assumption that the characters constituting the vocab words can be decoded as ascii. That would be a dangerous assumption to make. Looking into it further. And yes, adding these model files (or maybe simply models with non-ascii characters, and possibly even utf-16/utf-32 characters) to tests would be a good idea. Will do as soon as I get to the root of this issue. |
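To illustrate the decoding problem described above, here is a small sketch that reads null-terminated words from a byte stream and decodes them as UTF-8 rather than ASCII. It assumes vocabulary words are stored as null-terminated byte strings; the helper `read_word` and the sample bytes are hypothetical, and this is not the code from the actual fix.

```python
import io


def read_word(stream):
    """Read bytes up to a null terminator and decode them as UTF-8."""
    raw = bytearray()
    while True:
        byte = stream.read(1)
        if not byte or byte == b"\x00":
            break
        raw += byte
    return raw.decode("utf-8")


# Two made-up vocabulary entries, one with a non-ASCII character.
data = io.BytesIO("niño".encode("utf-8") + b"\x00" + b"etxea\x00")
first = read_word(data)
second = read_word(data)

# Decoding the same bytes as ASCII fails on the non-ASCII character.
try:
    "niño".encode("utf-8").decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(first, second, ascii_ok)
```

Note that the whole byte string is collected before decoding, so multi-byte characters are never split across decode calls.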
Yes, confirming that this is the issue. I'll push a fix for this asap. |
Fix pushed as part of #1176 |
I've just tested @jayantj fix with Spanish and Basque models and they are properly loaded. Thanks for the quick fix!! |
I tried the pre-trained English model, with the following command:
I get the following error:
This is with Gensim 1.0.0, freshly installed from PyPI, on OS X. The following code did work (but doesn't load the binary file):
Does @jayantj's fix also solve this issue? If so, should I install Gensim from GitHub, or will the patch soon also be on PyPI? |
Please install from GitHub for now |
Fixed in gensim 1.0.1 available on PyPI |
Fixed in #1176 |
Hi, |
Hi @already-taken-m17 For out-of-vocabulary words, again, ngrams of sizes [ |
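As a rough illustration of the character n-grams mentioned above, the sketch below wraps a word in boundary markers and lists all substrings in a length range. The function name `char_ngrams` and the default range of 3 to 6 are assumptions for illustration; the sizes used by a given model depend on how it was trained.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List character n-grams of a word wrapped in '<' and '>' markers.

    min_n and max_n are illustrative defaults, not values from this thread.
    """
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", 3, 3)
all_grams = char_ngrams("where")
print(trigrams)
```

The boundary markers let prefixes and suffixes be distinguished from n-grams in the middle of a word.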
Hi @jayantj , thanks for the reply. |
@jayantj I'm getting the same error when trying to load the pre-trained fastText model "wiki.he.bin" using this command:
I'm getting this error:
|
The
According to the source, this is shorthand for:
The binary data is specific to the FastText algorithm. Edit: in response to @dimeldo. |
@evanmiltenburg will definitely try that, thanks. Edit: It worked well, thanks. |
Thanks for the update, good to hear it worked. |
I tried to load using the |
SaleStock is using C++ code closer to original FastText. Created #1261 wishlist issue. |
If it's really that annoyingly slow, we could read the code in C (Cython-compiled) -- seems easy enough. |
@piskvorky @tmylk Cythonizing would be fine, but as of now, we are first getting info from |
Yes, it should be possible to load the model only from the
The Changing this would require non-trivial changes to the |
I am having some issues reading fasttext files.
The above code works, but with that I'm not able to get oov words.
With the above code I get the error:
There are no examples in the documentation as to the best way to read a fasttext file and get oov vectors. |
@tmylk if so, please update the docs with your team. |
@arashsa The code has been updated since the comment above, please use load_fasttext_format now |
@tmylk when I try the load_fasttext_format method I get this error:
|
@arashsa There is an actual mismatch in the sizes of vocab in |
Does it support online training? |
@rajivgrover009 our fasttext implementation - yes. |
@menshikh-iv I meant continuing to train pretrained fastText models. Is it possible to take models pretrained by fastText and continue training them to add use-case-specific vocabulary? |
@anmolgulati please show me a concrete link. Probably only the main matrices are saved, i.e. you can only use the model (but can't continue training). |
Trying to load the Bengali fastText model, in both .vec and .bin formats
|
@sauravm8 you have an issue with your numpy installation; resolve it first, then reinstall gensim
|
@menshikh-iv Yes. It was a version conflict between numpy and python. Solved. |
@menshikh-iv Hi Sir,
and here is the error I receive:
Apart from the above, here is the script I used for training a pretrained model on my own corpus and generated the model which I used in above script for testing it. Here goes the script:
Have I trained the pretrained model incorrectly, or where is the error? The same model (faq.model.bin) works fine with the pyfasttext library. Please look into this. |
@harrypotter0 we don't support "supervised" models, can you share your |
How can I save a fastText model to .bin and .vec files? |
@Sherriiie If you download the |
@kevin28520 note (it may be obvious, but to avoid confusion) that |
I am using the fastText pre-trained model for the Urdu language from https://fasttext.cc/docs/en/pretrained-vectors.html
import gensim.models.keyedvectors as word2vec1
pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
I got a similarity score of 1.8220717906951904
model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
I got a similarity score of 0.376111775636673 |
@ghazeefa please open a new ticket for your problem |
Facebook's recently open-sourced fastText (https://github.com/facebookresearch/fastText) improves the word2vec SkipGram model. It follows a similar output format for word-vector key-value pairs, and the similarity calculation is about the same too, but its binary output format differs from the binary format of the C version of word2vec. Do we want to support loading fastText model output in gensim? Thanks.