New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Fix Dictionary save_as_text method #56 + fix lint errors #1402

Merged

menshikh-iv merged 5 commits into piskvorky:develop from vlejd:56_fix_dict_save_as_text

Jun 15, 2017

Contributor

vlejd commented Jun 8, 2017 •

edited

Loading

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
Fixed some lint error on the way.


          Fix Dictionary save_as_text method piskvorky#56 + fix lint errors

2351a90

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.

vlejd force-pushed the 56_fix_dict_save_as_text branch from cac6314 to 2351a90 Compare

June 8, 2017 19:23

vlejd added 3 commits

June 8, 2017 21:26


          Merge branch 'develop' into 56_fix_dict_save_as_text

10dc847


          Fix error with sorting


          Fix unicode errors

b1679f0

piskvorky requested a review from menshikh-iv

June 12, 2017 08:38

menshikh-iv suggested changes

View reviewed changes

gensim/test/test_corpora_dictionary.py Outdated

@@ @@ -157,16 +156,15 @@ def testFilterTokens(self): @@
                       d.filter_tokens([0])
                       expected = {'computer': 0, 'eps': 8, 'graph': 10, 'human': 1,
-                              'interface': 2, 'minors': 11, 'response': 3, 'survey': 4,
-                              'system': 5, 'time': 6, 'trees': 9, 'user': 7}
+                                  'interface': 2, 'minors': 11, 'response': 3, 'survey': 4,

Contributor

menshikh-iv Jun 12, 2017

Please use hanging indent (no vertical)

Contributor Author

vlejd Jun 14, 2017

Ok.

gensim/test/test_corpora_dictionary.py Outdated

+                  def test_saveAsText(self):
+                      """`Dictionary` can be saved as textfile. """
+                      tmpf = get_tmpfile('save_dict_test.txt')
+                      small_text = [["prvé", "slovo"],

Contributor

menshikh-iv Jun 12, 2017

Hanging indent

Contributor Author

vlejd Jun 14, 2017

Ok.

gensim/test/test_corpora_dictionary.py Outdated

-                              "The intersection graph of paths in trees",
-                              "Graph minors IV Widths of trees and well quasi ordering",
-                              "Graph minors A survey"]
+                                   "A survey of user opinion of computer system response time",

Contributor

menshikh-iv Jun 12, 2017

Hanging indent

Contributor Author

vlejd Jun 14, 2017

Ok.

gensim/test/test_corpora_dictionary.py Outdated

                       stoplist = set('for a of the and to in'.split())
                       texts = [[word for word in document.lower().split() if word not in stoplist]
-                              for document in documents]
+                               for document in documents]

Contributor

menshikh-iv Jun 12, 2017

Hanging indent

Contributor Author

vlejd Jun 14, 2017

Ok.

gensim/test/test_corpora_dictionary.py Outdated

                       # remove words that appear only once
                       all_tokens = sum(texts, [])
                       tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
                       texts = [[word for word in text if word not in tokens_once]
-                              for text in texts]
+                               for text in texts]

Contributor

menshikh-iv Jun 12, 2017

Hanging indent

Contributor Author

vlejd Jun 14, 2017

Ok. It actually fits on one line.

gensim/test/test_corpora_dictionary.py Outdated

+                          self.assertEqual(serialized_lines[2][1:], "\tdruhé\t2\n")
+                          self.assertEqual(serialized_lines[3][1:], "\tprvé\t1\n")
+                  def test_loadFromText(self):

Contributor

menshikh-iv Jun 12, 2017

Please add case with old file format

Contributor Author

vlejd Jun 14, 2017

The first part is for format, the second is for new. I create two different functions for better verbosity.


          Fix review

76ed41d

Hanging indent and split test_loadFromText to test_loadFromText_legacy.

Contributor

menshikh-iv commented Jun 15, 2017

Thank you @vlejd

menshikh-iv merged commit 09e16cd into piskvorky:develop

This was referenced Jun 19, 2017

Variable num_docs not set after calling load_from_text() in Dictionary #327

Closed

Dictionary.save_as_text/load_from_text is dangerous #56

Closed

piskvorky reviewed

View reviewed changes

Owner

piskvorky left a comment

Thanks for the clean up! I'd ask for one minor fix re. the logging level.

gensim/corpora/dictionary.py

+                                      result.num_docs = int(line.strip())
+                                      continue
+                                  else:
+                                      logging.warning("Text does not contain num_docs on the first line.")

Owner

piskvorky Jun 19, 2017

Is this really a warning?

There's no ambiguity (3 columns vs 1 column), so I think INFO (or even DEBUG) should be sufficient.

Contributor Author

vlejd Jun 20, 2017

The warning is not about number ambiguity, but about the fact that you are using old serialization format.
I put there a warning, because it may cause you errors somewhere down the pipeline (as mentioned by the original issue #56). If it does not make sense, I can change it, but I think warning should be the best.

Owner

piskvorky Jun 21, 2017 •

edited

Loading

Perhaps, but the current message doesn't help the users.

What "text"? Why should it contain "num_docs"? What should the user do about it?

This should be a backward compatible change, so "errors down the pipeline" are not acceptable. CC @menshikh-iv

Contributor Author

vlejd Jun 21, 2017

Something like: "This dictionary was serialized using old format. Please set the dictionary.num_docs manually or some dictionary methods may not work as intended."

Contributor

menshikh-iv commented Jun 22, 2017 •

edited

Loading

@vlejd you PR has a few problems with python >= 3.5 on Windows:

======================================================================
ERROR: `Dictionary` can be loaded from textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 234, in test_loadFromText
    d = Dictionary.load_from_text(tmpf)
  File "C:\Python35-x64\lib\site-packages\gensim\corpora\dictionary.py", line 358, in load_from_text
    line = utils.to_unicode(line)
  File "C:\Python35-x64\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
======================================================================
ERROR: `Dictionary` can be loaded from textfile in legacy format.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 220, in test_loadFromText_legacy
    d = Dictionary.load_from_text(tmpf)
  File "C:\Python35-x64\lib\site-packages\gensim\corpora\dictionary.py", line 358, in load_from_text
    line = utils.to_unicode(line)
  File "C:\Python35-x64\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
======================================================================
FAIL: `Dictionary` can be saved as textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 197, in test_saveAsText
    self.assertEqual(serialized_lines[1][1:], "\tdruh�\t2\n")
AssertionError: '\tdruhé\t2\n' != '\tdruh�\t2\n'
- 	druhé	2
? 	    ^^
+ 	druh�	2
? 	    ^
-------------------- >> begin captured logging << --------------------
gensim.corpora.dictionary: INFO: adding document #0 to Dictionary(0 unique tokens: [])
gensim.corpora.dictionary: INFO: built Dictionary(3 unique tokens: ['druh�', 'prv�', 'slovo']) from 3 documents (total 6 corpus positions)
gensim.corpora.dictionary: INFO: saving dictionary mapping to C:\Users\appveyor\AppData\Local\Temp\1\save_dict_test.txt
--------------------- >> end captured logging << ---------------------
======================================================================

Can you fix this ?

menshikh-iv mentioned this pull request

Windows 2.2.0 issues #1441

Closed

Contributor Author

vlejd commented Jun 22, 2017

Not really. I do not have access to windows. But I can change the test to not use strange characters. The correct solution would be to properly set up the encoding which I can not do without testing on windows (it would be a shot into dark).

vlejd mentioned this pull request

Fix dict save as text #1443

Closed

Owner

piskvorky commented Jun 23, 2017

-1 on masking the error by removing non-ASCII characters.

Contributor

menshikh-iv commented Jun 23, 2017

I'll soon get a laptop with windows and try to understand what the problem is @vlejd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment