Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filename overwrite in FastText.load_fasttext_format #2407

Closed
menshikh-iv opened this issue Mar 7, 2019 · 8 comments
Closed

Filename overwrite in FastText.load_fasttext_format #2407

menshikh-iv opened this issue Mar 7, 2019 · 8 comments
Assignees
Labels
difficulty easy Easy issue: required small fix

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 7, 2019

I download FB pretrained common-crawl model using https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz from https://fasttext.cc/docs/en/crawl-vectors.html page.
When I try to load it, I get an error

Code:

from gensim.models import FastText

ft_cc = FastText.load_fasttext_format("models/cc.en.300.bin.gz", full_model=False)

Stacktrace:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<timed exec> in <module>

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
   1012 
   1013         """
-> 1014         return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
   1015 
   1016     def load_binary_data(self, encoding='utf8'):

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1245     if not model_file.endswith('.bin'):
   1246         model_file += '.bin'
-> 1247     with smart_open(model_file, 'rb') as fin:
   1248         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1249 

~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in smart_open(uri, mode, **kw)
    179         raise TypeError('mode should be a string')
    180 
--> 181     fobj = _shortcut_open(uri, mode, **kw)
    182     if fobj is not None:
    183         return fobj

~/.local/lib/python3.6/site-packages/smart_open/smart_open_lib.py in _shortcut_open(uri, mode, **kw)
    299     #
    300     if six.PY3:
--> 301         return open(parsed_uri.uri_path, mode, buffering=buffering, **open_kwargs)
    302     elif not open_kwargs:
    303         return open(parsed_uri.uri_path, mode, buffering=buffering)

FileNotFoundError: [Errno 2] No such file or directory: 'models/cc.en.300.bin.gz.bin'

Problem in filename overwriting that happens
https://github.com/RaRe-Technologies/gensim/blob/cebc9dbeacddf7689c35aadb8854ae850d4561d2/gensim/models/fasttext.py#L1302-L1303

To avoid that, I should gunzip model file first (that's definitely not a thing that user expect).

Versions:
no matters, all python & gensim version affected, including last release.

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Mar 7, 2019
Repository owner deleted a comment from menshikh-iv Mar 7, 2019
@piskvorky
Copy link
Owner

piskvorky commented Mar 7, 2019

Not necessarily a bug, but I do think we could support this. Especially since we're already using smart_open (I think), so .gz comes for free.

The other option is to inform users through documentation/tutorials that they need to unpack their models first.

@menshikh-iv can you open a PR?

@piskvorky piskvorky removed the bug Issue described a bug label Mar 7, 2019
@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Mar 7, 2019

Not necessarily a bug,

from user PoV - definitely a bug (fasttext try to open a file that user don't specify, instead of passed file and failed on it).

can you open a PR?

No time, sorry

@piskvorky
Copy link
Owner

piskvorky commented Mar 7, 2019

@menshikh-iv what is the if not model_file.endswith('.bin'): condition for? Looks weird to me, to be changing the user input like that.

@menshikh-iv
Copy link
Contributor Author

@piskvorky ask an author of this code, not me (btw @mpenkov isn't an author too, they just refactored it, according to git blame).

I guess that this is for the case when user have 2 files X.bin and X.vec and allow user to specify load_fasttext_format("X") (without extension) and automatically load X.bin (instead of specified X that doesnt exist).

@piskvorky
Copy link
Owner

piskvorky commented Mar 7, 2019

Well asking you, the project maintainer from when these changes happened and were merged, seemed like a good start to understand them :)

If the filename change is not needed (or its reason is undocumented/unknown/outdated), then I propose simply removing it. Silently modifying user input looks like an anti-pattern to me.

Feel free to open a PR when you have the time. We'll have to be careful with updating tutorials too (if they rely on this silent extension promotion).

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Mar 7, 2019

I'm already not a maintainer @piskvorky, this change was introduced 2 years ago (when I only start, just merged PR by Jayant approve), I really remember nothing about it.

I found an "source" of this changes, this code firstly was introduced in #1341 (and after moves to other classes, like FastText).

@jayantj @prakhar2b can you comment that?

@mpenkov
Copy link
Collaborator

mpenkov commented Mar 7, 2019

My understanding is this mirrors FB behavior. You specify model_name, and it creates model_name.bin and model_name.vec.

This feature allows the user to specify model_name only and not worry about the file extension. In my opinion, we’re better off without it.

@prakhar2b
Copy link
Contributor

prakhar2b commented Mar 17, 2019

I made these changes two years ago. There was a bug in facebook's french pretrained vector
mismatch in vec and bin files in french pretrained vector

As a workaround, we made changes in gensim to load fastText models using only bin files. #1341

A lot of changes were made after that in Gensim and fastText by someone else, and this bug has arisen most probably because backward compatibility was not taken care of.

@menshikh-iv @piskvorky

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

No branches or pull requests

4 participants