New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dictionary.save_as_text/load_from_text is dangerous #56

Closed
heuer opened this Issue Oct 11, 2011 · 12 comments

Comments

Projects
None yet
7 participants
@heuer
Copy link

heuer commented Oct 11, 2011

I thought that Dictionary.save_as_text and load_from_text is equivalent to Dictionary.save/load, but it isn't. The text format does not keep the "num_docs" and after loading a Dict from a text, several methods do not work anymore::

>>> dct = Dictionary()
>>> _ = dct.doc2bow(['Hi', 'there'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'all'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'there', 'world'], allow_update=True)
>>> print dct
Dictionary(4 unique tokens)
>>> print dct.num_docs
3
>>> dct.save('./test.dct')
>>> d2 = Dictionary.load('./test.dct')
>>> print d2.num_docs
3
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2
Dictionary(2 unique tokens)
>>> print d2.token2id
{'world': 0, 'all': 1}

Everything works as expected. Now the text version (assuming the dct from above)::

>>> dct.save_as_text('./test.txt')
>>> d2 = Dictionary.load_from_text('./test.txt')
>>> print d2.num_docs
0
>>> print d2 # Everything is fine here, expecting 4 tokens
Dictionary(4 unique tokens)
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2 # Whoops, all tokens are gone, expected 2, see above
Dictionary(0 unique tokens)
>>> print d2.token2id
{}

This behaviour is not documented anywhere and I'd think that save_as_text and load_from_text should lead to a fully functional Dictionary object

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Oct 11, 2011

Yeah, they are not meant to be equivalent (why have 2 equivalent functions?). The text format is meant mostly for human inspection, not as a custom re-implementation of the pickle serialization protocol.

The preferred way to save and load objects in gensim is via save() and load(). Text is for debugging and portability. You're right that the documentation should mention this; I'll fix it (or you open a pull request if you have the words ready :)

@heuer

This comment has been minimized.

Copy link

heuer commented Oct 11, 2011

why have 2 equivalent functions?

Because pickle could be unsafe and the text format can be inspected and modified by humans before it's imported back.

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Oct 11, 2011

Oh, security in gensim, that's a new one!

I'm afraid taking care of that properly would require more effort than just serializing to text in Dictionary. This is the first time I hear of this use case -- users usually run experiments with their own code and data -- so at the moment, I would suggest you override functions you deem unsafe for your scenario yourself. Perhaps setting pickle protocol to 0 (ascii) in save() would be enough for your human inspection requirements? Can you say more about these requirements?

If there are more users with similar needs (or if your solution is sufficiently generic), we can brainstorm some gensim-wide "fix", I'm open to that.

@fannix

This comment has been minimized.

Copy link

fannix commented Apr 12, 2012

I guess I also wrongly made this assumption. IMHO, the name is misleading. If I can use two versions of loading and saving. The UNIX philosophy tell us to go for the text version... :-)

@fannix

This comment has been minimized.

Copy link

fannix commented Apr 12, 2012

I think more suitable name for sav_as_text would be output_mapping or print_mapping

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Apr 12, 2012

You're right.

I'm in favour of save_mapping, or write_mapping -- something that evokes opening files (unlike print_mapping). Can you rename it?

@fannix

This comment has been minimized.

Copy link

fannix commented Apr 12, 2012

I am not sure... It might take a little time for me to be familiar with the code, and only after that I am able to rename it.

@salmoni

This comment has been minimized.

Copy link

salmoni commented Nov 9, 2012

I'm coming in a bit late but would like to help: I find the text version of the dictionary extremely useful because I can manipulate it with different (non-Python) tools. E.g., even a spreadsheet. However, I often need to read the new dictionary back in again. After looking at the code, the problem seems to be that reading in an array doesn't reset the Dictionary's num_docs attribute. This is used in filter_extremes so any call to this method results in zero documents surviving the filter. This also applies when using the load method.

I'm not sure if the filter_extremes method needs to re-calculate num_docs in case further calls are made?

What I've done is a very quick fix but helps filter_extremes work with loaded dictionaries.

@mattsta

This comment has been minimized.

Copy link

mattsta commented Feb 22, 2014

Just got bit by this. Tried to save_as_text then load_from_text, but load doesn't respect the format from the save.

Possible fixes:

  • make load_from_text work. (ideal)
  • rename save_as_text to save_human_readable then remove load_from_text.
  • improve documentation to make it clear _as_text aren't for reusable persistence.
@vlejd

This comment has been minimized.

Copy link
Contributor

vlejd commented Jun 8, 2017

I would like to work on this. I will try to make the load_from_text work, optimally without breaking the backward compatibility.

vlejd added a commit to vlejd/gensim that referenced this issue Jun 8, 2017

Fix Dictionary save_as_text method RaRe-Technologies#56 + fix lint er…
…rors

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.

vlejd added a commit to vlejd/gensim that referenced this issue Jun 8, 2017

Fix Dictionary save_as_text method RaRe-Technologies#56 + fix lint er…
…rors

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.

menshikh-iv added a commit that referenced this issue Jun 15, 2017

Merge pull request #1402 from vlejd/56_fix_dict_save_as_text
Fix Dictionary save_as_text method #56 + fix lint errors
@vlejd

This comment has been minimized.

Copy link
Contributor

vlejd commented Jun 19, 2017

Is this fixed by #1402 ? Do we still need to change the documentation?

@menshikh-iv

This comment has been minimized.

Copy link
Member

menshikh-iv commented Oct 3, 2017

Works fine in 3.0.0

@menshikh-iv menshikh-iv closed this Oct 3, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment