paperless-ngx: classifier training hangs & times out #240591
Comments
a tiny bit more context: The classifier task gets as far as printing these messages in /var/lib/paperless/log/paperless.log:
and then it gets stuck. That last message comes from this spot in the code: https://github.com/paperless-ngx/paperless-ngx/blob/7a464d8a6eff11bcd0100330cb1687da50e196e6/src/documents/classifier.py#L279
...which means it's not even getting stuck on the first call to
I'm seeing the same behavior. The only difference is I am getting stuck at the tags classifier.
It hangs forever here, pegging a single core.
It appears it can get stuck on any classification task. A temporary workaround is to disable classification for some of the less frequently used tags/correspondents/doctypes until it works again.
The likely cause of this bug has been found: OpenBLAS. Building numpy with a different BLAS implementation, e.g. the proprietary MKL, resolves this issue. I'm not sure how to approach upstream about this, as OpenBLAS is about three libraries down the dependency tree.
The nixpkgs manual BLAS/LAPACK section suggests using LD_LIBRARY_PATH to select a different BLAS implementation at runtime:
One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
So far this seems to actually work! I wasn't able to use the services.paperless.extraConfig = {
LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
}; (that will be |
Another thing we could perhaps try is to prevent BLAS from loading at all, because the upstream wheels for numpy apparently don't include any BLAS implementation, and therefore neither does the paperless-ngx docker image.
This is an experimental fix to try and get around an issue with the default BLAS/LAPACK implementation. See [1] for more details. [1]: NixOS/nixpkgs#240591
How well is this working now? I might be experiencing the same issue.
I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.
Should we add this to the paperless module?
FWIW, I've experienced the same problem and upon finding this issue also set
OpenBLAS is on the newest release.
Can confirm that adding MKL fixed the problem I was having. "Classifier file does not exist"
I'm using nixos-23.11 where OpenBLAS is at 0.3.24. But if the problem persists with 0.3.26 from nixos-unstable, just waiting for the update to propagate to the next release obviously won't be the solution, unfortunately.
Instead of using MKL I've tried the following, and it seems to work for me, too:
services.paperless.extraConfig = {
  OPENBLAS_NUM_THREADS = 1;
  OMP_NUM_THREADS = 1;
  GOTO_NUM_THREADS = 1;
};
(It's probably sufficient to set one of them, but I wanted to be sure the setting takes effect no matter what.)
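For anyone who wants to try the same thread limits outside the NixOS module (for example when running a command by hand to reproduce the hang), here is a minimal sketch using only the Python standard library. The helper names `single_threaded_env` and `run_single_threaded` are made up for illustration and are not part of paperless-ngx; the variable names are the three from the config above. The variables have to be in the environment before the child process loads numpy, which is why they are set on the child's environment rather than from inside it.

```python
import os
import subprocess

# The three variables from the workaround above.
THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "GOTO_NUM_THREADS")


def single_threaded_env(base=None):
    """Return a copy of `base` (default: os.environ) with the thread limits set to 1."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = "1"
    return env


def run_single_threaded(cmd):
    """Run `cmd` (a list of arguments) with the thread limits applied."""
    return subprocess.run(cmd, env=single_threaded_env(), check=False)
```

Something like `run_single_threaded(["python3", "manage.py", "document_create_classifier"])` would then run the training command with the limits in place; the exact command and working directory depend on your installation and are an assumption here.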
The culprit is |
Aha:
So maybe forcing the use of an OpenBLAS built with singleThreaded = true might be the proper solution here?
Paperless only uses OpenBLAS via scikit-learn via numpy. These are generic libraries and paperless is not the exclusive user of any of them. The proper solution is to figure out what causes OpenBLAS to spin on
At least FreeBSD appears to run into the same issue, so I don't think it's an obvious packaging issue on our end. Upstream wheels of numpy do not appear to use OpenBLAS at all; perhaps we could also look into whether our numpy using OpenBLAS is necessary, supported and desirable.
Proposed #299008 as a workaround until a proper solution is found.
Fixes NixOS#240591 (cherry picked from commit 70fa188)
Describe the bug
Directly related to this discussion: paperless-ngx/paperless-ngx#2373
paperless-ngx runs an hourly documents.tasks.train_classifier celery beat task. This is supposed to take a few minutes on most systems, but on NixOS (for some users, at least) the task runs forever and is eventually killed by the celery timeout. The affected worker process doesn't respond to SIGTERM or even SIGQUIT; only SIGKILL can interrupt it.
Steps To Reproduce
Steps to reproduce the behavior:
Expected behavior
The classifier training task should finish successfully in a few minutes or less.
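To check the expected behavior by hand, without waiting for the hourly celery task, one option is to run the training command under a small watchdog that also records whether the process reacts to SIGTERM. The sketch below is an illustration only: `run_with_watchdog` is a hypothetical helper, not something paperless-ngx ships, and the command you pass in (for instance the `document_create_classifier` management command) depends on your installation.

```python
import subprocess


def run_with_watchdog(cmd, timeout_s, grace_s=5.0):
    """Run `cmd`; on timeout send SIGTERM, then SIGKILL if that is ignored.

    Returns (status, returncode), where status is one of:
      "finished"   - the process exited on its own within timeout_s
      "terminated" - it exited within grace_s after SIGTERM
      "killed"     - it ignored SIGTERM and needed SIGKILL
    """
    proc = subprocess.Popen(cmd)
    try:
        return "finished", proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass

    proc.terminate()  # SIGTERM on POSIX
    try:
        return "terminated", proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return "killed", proc.wait()
```

A result of "killed" would match the symptom described above (the worker only stops on SIGKILL), while "finished" within a few minutes is the expected outcome.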
Screenshots
Additional context
See paperless-ngx/paperless-ngx#2373 for more background. This appears to be specific to paperless-ngx on NixOS.
Notify maintainers
@lukegb @gador @erikarvstedt @Flakebi
Metadata
Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.
"x86_64-linux"
Linux 5.15.83, NixOS, 23.05 (Stoat), 23.05.git.9790f3242da2M
yes
yes
nix-env (Nix) 2.13.3
""
"home-manager-23.05.tar.gz, nixos-23.05"
/nix/var/nix/profiles/per-user/root/channels/nixos