Fixes 'not enough arguments for format string' error #3286

Merged: 7 commits, Feb 25, 2022


gilbertfrancois
Contributor

For Doc2Vec, a warning is logged while building the vocabulary if raw_int and corpus_count are not the same. However, the log format string expects 4 arguments but only 3 are passed, so the logging call itself fails with the error shown below.

The fix is to pass the difference between raw_int and corpus_count as the 3rd argument.

Part of the stack trace:

--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/logging/__init__.py", line 373, in getMessage
    msg = msg % self.args
TypeError: not enough arguments for format string
Call stack:
  File "/home/ubuntu/.pyenv/versions/3.8.10/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
...

...
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/gilbert-lab-aKPhFkG7-py3.8/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 993, in _scan_vocab
    logger.warning(
Message: 'Highest int doctag (%i) larger than count of documents (%i). This means at least %i excess, unused slots (%i bytes) will be allocated for vectors.'
Arguments: (60108, 60082, 5200)
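
The failure can be reproduced outside gensim. The sketch below is a minimal illustration, not the gensim code: it copies the message from the trace, formats it with `msg % args` as `logging` does in `getMessage`, and uses the hypothetical names `render`, `highest_doctag`, `document_count`, and `excess_bytes` for the three logged values. The corrected call inserts the difference of the first two values as the third argument.

```python
# Message copied from the trace above: four %i placeholders.
MESSAGE = (
    "Highest int doctag (%i) larger than count of documents (%i). "
    "This means at least %i excess, unused slots (%i bytes) will be allocated for vectors."
)


def render(args):
    """Format MESSAGE the same way LogRecord.getMessage() does: msg % args."""
    return MESSAGE % args


# Values taken from the "Arguments:" line of the trace (names are hypothetical).
highest_doctag, document_count, excess_bytes = 60108, 60082, 5200

# Three arguments for four placeholders: formatting fails.
try:
    render((highest_doctag, document_count, excess_bytes))
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)

# Four arguments: the difference is supplied as the third one.
fixed = render(
    (highest_doctag, document_count, highest_doctag - document_count, excess_bytes)
)

print(broken_error)
print(fixed)
```

Because `logging` catches formatting errors inside the handler, the bug surfaced as a `--- Logging error ---` report on stderr rather than as an exception in the calling code.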

@piskvorky requested a review from gojomo February 4, 2022 16:53
@piskvorky added the bug, difficulty easy, impact LOW, and reach LOW labels Feb 4, 2022
@piskvorky added this to the 4.1.0 milestone Feb 4, 2022
Collaborator

@gojomo gojomo left a comment

looks good - reflecting original unrealized intent!

@piskvorky
Owner

Great. @gilbertfrancois please fix the flake tests (trailing whitespace not allowed) and we're good to merge.

gilbertfrancois and others added 2 commits February 5, 2022 10:16
Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
@gilbertfrancois
Contributor Author

Done. Thanks @piskvorky for the suggestions.

@gilbertfrancois
Contributor Author

Are some tests failing because of this patch? I can't tell from the logs what the problem is.

@piskvorky
Owner

piskvorky commented Feb 6, 2022

I see segmentation faults in the logs. Although that's weird, for an added log parameter… @gilbertfrancois could you investigate further? Was it my REAL.itemsize suggestion?

@mpenkov what's the way to run tests locally? I cannot find it anymore on our wiki / https://github.com/RaRe-Technologies/gensim/wiki/Developer-page :(

@gilbertfrancois
Contributor Author

I was able to set up the environment and am running the tests now. I'll keep you posted on the progress.

@gilbertfrancois
Contributor Author

gilbertfrancois commented Feb 8, 2022

As far as I can see, the failing tests are not related to the change in this PR. I ran the following experiments using tox -e py37-linux and tox -e py38-linux:

gensim-4.1.2, py37-linux, passes all tests
gensim-4.1.2, py38-linux, passes all tests
gensim-master, py37-linux, waits forever at gensim/test/test_ensemblelda.py::TestEnsembleLda::test_add_models
gensim-master, py38-linux, passes, but very slow

With the current master branch, the tests involving ldamulticore take a long time, and for py37 they never finish. However, when I run the test directly with unittest instead of pytest and tox, it passes quickly. So my impression is that the problem lies not in the code but in the testing method. Moreover, the code of ldamodel and ldamulticore has hardly changed since release 4.1.2.

I saw that the testing setup has changed since the last release. When I remove coverage testing from tox.ini, all tests pass again on py37-linux and py38-linux on the master branch.

diff --git a/tox.ini b/tox.ini
index 058b37d9..4719e1af 100644
--- a/tox.ini
+++ b/tox.ini
@@ -43,7 +43,7 @@ exclude_lines =
 ignore_errors = True

 [pytest]
-addopts = -rfxEXs --durations=20 --showlocals --cov=gensim/ --cov-report=xml
+addopts = -rfxEXs --durations=20 --showlocals


 [testenv]

I think it is safe to merge this PR and open a new issue to fix testing with coverage for ensemble_lda and ldamulticore.

For reference, I've attached the logs (with verbose option) for the different runs.

Fail:
log-gensim-master-tox3.24.5-py37-linux.txt

Success:
log-gensim-master-tox3.24.5-nocov-py37-linux.txt
log-gensim-master-tox3.24.5-nocov-py38-linux.txt
log-gensim4.1.2-tox3.24.4-py37-linux.txt
log-gensim4.1.2-tox3.24.4-py38-linux.txt

@piskvorky
Owner

piskvorky commented Feb 8, 2022

Thanks so much for your investigation @gilbertfrancois!

CC @mpenkov – let's whip the CI testing back into shape. Are the coverage tests good for anything? Do we need them?

@gilbertfrancois
Contributor Author

gilbertfrancois commented Feb 8, 2022

I do think coverage serves a useful purpose, by giving a metric on how much of the written code is tested. But it seems that measuring coverage in combination with multiprocessing is creating problems. I suggest temporarily disabling coverage measurement, or excluding the tests that use multiprocessing (e.g. ensemble_lda, ldamulticore) from it, until a solution is found.

I found this Stack Overflow page that mentions problems with coverage + multiprocessing:

https://stackoverflow.com/questions/28297497/python-code-coverage-and-multiprocessing

The most important thing is to have good, working unit tests.

@piskvorky
Owner

"by giving a metric on how much of the written code is tested"

I personally don't find that particularly useful. But let's wait for @mpenkov 's judgement – he might remember why we added these coverage tests in the first place :)

@gilbertfrancois
Contributor Author

@mpenkov I see that you include/exclude platforms and/or python versions for coverage. I'm not sure if that is the right way to fix this problem.

For multiprocessing code, coverage needs to be run with the argument --concurrency=multiprocessing, and afterward the command coverage combine should be executed. It is described here in the section Execution. I don't know whether this is possible with the pytest-cov plugin.
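
As a rough sketch of that two-step procedure, the snippet below builds the command sequence in Python. The helper names `coverage_steps` and `run_steps` are hypothetical, the test path is only an example taken from the failing run mentioned earlier, and invoking coverage through `python -m coverage` with `-m pytest` is an assumption about how the project would wire it up. Depending on the coverage.py version, the concurrency setting may need to live in the configuration file instead of on the command line.

```python
import subprocess
import sys


def coverage_steps(test_target):
    """Build the commands for measuring coverage of multiprocessing code.

    Step 1 records one data file per process, step 2 merges them,
    step 3 prints the combined report.
    """
    coverage = [sys.executable, "-m", "coverage"]
    return [
        coverage + ["run", "--concurrency=multiprocessing", "-m", "pytest", test_target],
        coverage + ["combine"],
        coverage + ["report"],
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first failure."""
    for step in steps:
        subprocess.run(step, check=True)


# Example target (assumed): the test module that hung under tox.
steps = coverage_steps("gensim/test/test_ensemblelda.py")
for step in steps:
    print(" ".join(step))
```

Calling `run_steps(steps)` would execute the sequence; it is left out here so the sketch has no side effects.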

@mpenkov
Collaborator

mpenkov commented Feb 25, 2022

At the moment, I'm just trying to get CI to pass with some consistency. The coverage stuff appears to be what's causing the build failures (mysteriously...).

It sounds like getting coverage to work properly will take more effort, and I'm not sure that this PR is the place to do it.

@codecov

codecov bot commented Feb 25, 2022

Codecov Report

Merging #3286 (8ed1d07) into develop (7e898f4) will increase coverage by 0.50%.
The diff coverage is n/a.

❗ Current head 8ed1d07 differs from pull request most recent head 3f81cc5. Consider uploading reports for the commit 3f81cc5 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3286      +/-   ##
===========================================
+ Coverage    79.03%   79.53%   +0.50%     
===========================================
  Files           68       68              
  Lines        11781    11781              
===========================================
+ Hits          9311     9370      +59     
+ Misses        2470     2411      -59     
Impacted Files                    Coverage Δ
gensim/models/doc2vec.py          81.61% <ø> (ø)
gensim/models/ldamodel.py         87.45% <0.00%> (+0.18%) ⬆️
gensim/utils.py                   71.86% <0.00%> (+0.32%) ⬆️
gensim/corpora/wikicorpus.py      93.75% <0.00%> (+6.25%) ⬆️
gensim/models/ldamulticore.py     90.58% <0.00%> (+15.29%) ⬆️
gensim/scripts/segment_wiki.py    96.61% <0.00%> (+25.42%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e898f4...3f81cc5.

@mpenkov mpenkov merged commit 6fc9e38 into piskvorky:develop Feb 25, 2022

4 participants