Fixes 'not enough arguments for format string' error #3286

gilbertfrancois · 2022-02-04T16:17:25Z

For Doc2Vec, when the vocabulary is built and raw_int and corpus_count are not the same, a warning is logged. However, the log format string expects 4 arguments but only 3 are given, so the logging call itself raises the error shown below.

The fix was to give the difference of raw_int and corpus_count as the 3rd argument.

Part of the stack trace:

--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 373, in getMessage
    msg = msg % self.args
TypeError: not enough arguments for format string
Call stack:
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
...

...
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/gilbert-lab-aKPhFkG7-py3.8/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 993, in _scan_vocab
    logger.warning(
Message: 'Highest int doctag (%i) larger than count of documents (%i). This means at least %i excess, unused slots (%i bytes) will be allocated for vectors.'
Arguments: (60108, 60082, 5200)
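
The failure can be reproduced without gensim, since logging applies plain %-formatting to the message and its arguments. The sketch below is only an illustration: the format string and the three arguments are copied from the trace above, while the helper `try_format` and the variable names are hypothetical and not part of gensim.

```python
# Format string copied from the "Message:" line of the trace above.
MSG = (
    "Highest int doctag (%i) larger than count of documents (%i). "
    "This means at least %i excess, unused slots (%i bytes) will be allocated for vectors."
)


def try_format(msg, args):
    """Apply %-formatting the way logging does; return the error text on failure."""
    try:
        return msg % args
    except TypeError as exc:
        return "TypeError: %s" % exc


# The three arguments from the "Arguments:" line of the trace: one too few.
broken = try_format(MSG, (60108, 60082, 5200))

# Hypothetical fixed call: the difference is passed as the third argument.
highest_doctag, doc_count, n_bytes = 60108, 60082, 5200
fixed = try_format(MSG, (highest_doctag, doc_count, highest_doctag - doc_count, n_bytes))

print(broken)
print(fixed)
```

With four arguments the message formats cleanly and reports the number of excess slots alongside the byte count.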

gojomo

looks good - reflecting original unrealized intent!

piskvorky · 2022-02-05T07:44:20Z

Great. @gilbertfrancois please fix the flake tests (trailing whitespace not allowed) and we're good to merge.

gensim/models/doc2vec.py

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

gilbertfrancois · 2022-02-05T09:18:10Z

Done. Thanks @piskvorky for the suggestions.

gilbertfrancois · 2022-02-05T22:25:28Z

Are some tests failing because of this patch? I can't tell from the logs what the problem is.

piskvorky · 2022-02-06T07:42:45Z

I see segmentation faults in the logs. Although that's weird, for an added log parameter… @gilbertfrancois could you investigate further? Was it my REAL.itemsize suggestion?

@mpenkov what's the way to run tests locally? I cannot find it anymore on our wiki / https://github.com/RaRe-Technologies/gensim/wiki/Developer-page :(

gilbertfrancois · 2022-02-06T12:30:37Z

I was able to set up the environment and I am running the tests now. I'll keep you posted about the progress.

gilbertfrancois · 2022-02-08T17:36:25Z

As far as I can see, the failing tests are not related to the change in this PR. I ran the following experiments using tox -e py37-linux and tox -e py38-linux:

gensim-4.1.2, py37-linux, passes all tests
gensim-4.1.2, py38-linux, passes all tests
gensim-master, py37-linux, waits forever at gensim/test/test_ensemblelda.py::TestEnsembleLda::test_add_models
gensim-master, py38-linux, passes, but very slow

With the current master branch, the tests involving ldamulticore take a long time, and on py37 they never finish. However, when I run the same test directly with unittest instead of pytest and tox, it passes quickly. That gave me the impression that the problem is not in the code but in how the tests are run. Moreover, the code of ldamodel and ldamulticore has hardly changed since release 4.1.2.

I saw that the testing setup has changed since the last release. When I remove coverage measurement from tox.ini, all tests pass again on py37-linux and py38-linux on the master branch.

diff --git a/tox.ini b/tox.ini
index 058b37d9..4719e1af 100644
--- a/tox.ini
+++ b/tox.ini
@@ -43,7 +43,7 @@ exclude_lines =
 ignore_errors = True

 [pytest]
-addopts = -rfxEXs --durations=20 --showlocals --cov=gensim/ --cov-report=xml
+addopts = -rfxEXs --durations=20 --showlocals


 [testenv]

I think it is safe to merge this PR and open a new issue to solve the testing + coverage for ensemble_lda and ldamulticore.

For reference, I've attached the logs (with verbose option) for the different runs.

Fail:
log-gensim-master-tox3.24.5-py37-linux.txt

Success:
log-gensim-master-tox3.24.5-nocov-py37-linux.txt
log-gensim-master-tox3.24.5-nocov-py38-linux.txt
log-gensim4.1.2-tox3.24.4-py37-linux.txt
log-gensim4.1.2-tox3.24.4-py38-linux.txt

piskvorky · 2022-02-08T17:46:21Z

Thanks so much for your investigation @gilbertfrancois!

CC @mpenkov – let's whip the CI testing back into shape. Are the coverage tests good for anything? Do we need them?

gilbertfrancois · 2022-02-08T20:11:24Z

I do think coverage serves a useful purpose, by giving a metric on how much of the written code is tested. But measuring coverage in combination with multiprocessing seems to create problems. I would suggest temporarily disabling coverage, or excluding tests that use multiprocessing (e.g. ensemble_lda, ldamulticore) from it, until a solution is found.

I found this Stack Overflow page that mentions problems with coverage + multiprocessing.

https://stackoverflow.com/questions/28297497/python-code-coverage-and-multiprocessing

Most important is to have good working unittests.

piskvorky · 2022-02-08T21:11:15Z

> by giving a metric on how much of the written code is tested

I personally don't find that particularly useful. But let's wait for @mpenkov 's judgement – he might remember why we added these coverage tests in the first place :)

gilbertfrancois · 2022-02-25T06:20:05Z

@mpenkov I see that you include/exclude platforms and/or python versions for coverage. I'm not sure if that is the right way to fix this problem.

For multiprocessing code, coverage needs to be run with the argument --concurrency=multiprocessing and afterward the command coverage combine should be executed. It is described here in the section Execution. I don't know if it is possible with the pytest-cov plugin.
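
As a rough sketch of the sequence described above, the snippet below builds the two coverage commands (plus a final `coverage report`, which is my addition) and prints them. The helper names `coverage_commands` and `run_all` are hypothetical, and it assumes the `coverage` and `pytest` executables are on the PATH. Nothing is executed unless `dry_run=False` is passed, and whether the same can be achieved through the pytest-cov plugin remains an open question.

```python
import shlex
import subprocess


def coverage_commands(test_target):
    """Build the command sequence described above for a given test target."""
    return [
        # Measure coverage with multiprocessing support enabled.
        ["coverage", "run", "--concurrency=multiprocessing", "-m", "pytest", test_target],
        # Merge the per-process data files into a single data file.
        ["coverage", "combine"],
        # Print the combined report.
        ["coverage", "report"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Test path taken from the hanging test reported earlier in this thread.
run_all(coverage_commands("gensim/test/test_ensemblelda.py"))
```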

mpenkov · 2022-02-25T06:40:22Z

At the moment, I'm just trying to get CI to pass with some consistency. The coverage stuff appears to be what's causing the build fails (mysteriously...).

It sounds like getting coverage to work properly will take more effort, and I'm not sure that this PR is the place to do it.

codecov · 2022-02-25T06:47:01Z

Codecov Report

Merging #3286 (8ed1d07) into develop (7e898f4) will increase coverage by 0.50%.
The diff coverage is n/a.

❗ Current head 8ed1d07 differs from pull request most recent head 3f81cc5. Consider uploading reports for the commit 3f81cc5 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3286      +/-   ##
===========================================
+ Coverage    79.03%   79.53%   +0.50%     
===========================================
  Files           68       68              
  Lines        11781    11781              
===========================================
+ Hits          9311     9370      +59     
+ Misses        2470     2411      -59

Impacted Files	Coverage Δ
gensim/models/doc2vec.py	`81.61% <ø> (ø)`
gensim/models/ldamodel.py	`87.45% <0.00%> (+0.18%)`	⬆️
gensim/utils.py	`71.86% <0.00%> (+0.32%)`	⬆️
gensim/corpora/wikicorpus.py	`93.75% <0.00%> (+6.25%)`	⬆️
gensim/models/ldamulticore.py	`90.58% <0.00%> (+15.29%)`	⬆️
gensim/scripts/segment_wiki.py	`96.61% <0.00%> (+25.42%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e898f4...3f81cc5.

Fixes 'not enough arguments for format string' err

2cd519d

piskvorky requested a review from gojomo February 4, 2022 16:53

piskvorky added the bug, difficulty easy, impact LOW, and reach LOW labels Feb 4, 2022

piskvorky added this to the 4.1.0 milestone Feb 4, 2022

gojomo approved these changes Feb 4, 2022

piskvorky requested changes Feb 5, 2022

gensim/models/doc2vec.py

piskvorky requested changes Feb 5, 2022

gensim/models/doc2vec.py

gilbertfrancois and others added 2 commits February 5, 2022 10:16

Update gensim/models/doc2vec.py

fe20b3a

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

Update gensim/models/doc2vec.py

d82d37c

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

piskvorky requested a review from mpenkov February 8, 2022 17:47

mpenkov added 2 commits February 25, 2022 14:31

run code coverage on Py3.8 Linux only

d6545ab

messing around with tox.ini

3ba44b9

messing around with tox.ini

8ed1d07

mpenkov mentioned this pull request Feb 25, 2022

Get coverage to work properly under Github Actions #3295

Open

mpenkov approved these changes Feb 25, 2022

Update CHANGELOG.md

3f81cc5

mpenkov merged commit 6fc9e38 into piskvorky:develop Feb 25, 2022
