paperless-ngx: classifier training hangs & times out #240591

Closed
benley opened this issue Jun 29, 2023 · 20 comments · Fixed by #299008

Comments

@benley
Member

benley commented Jun 29, 2023

Describe the bug

Directly related to this discussion: paperless-ngx/paperless-ngx#2373

paperless-ngx runs an hourly documents.tasks.train_classifier celery beat task. This is supposed to take a few minutes on most systems, but on NixOS (for some users, at least) the task runs forever and is eventually killed by the celery timeout. The affected worker process doesn't respond to SIGTERM or even SIGQUIT; only SIGKILL can interrupt it.
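
The signal behaviour described above means that anything supervising the worker has to escalate to SIGKILL. A minimal sketch of that escalation in Python, assuming a POSIX system; `stop_process` and the stand-in child process are hypothetical helpers for illustration, not part of paperless-ngx or celery:

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=1.0):
    """Send SIGTERM, wait briefly, then fall back to SIGKILL.

    Returns the name of the last signal that was sent.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


# Stand-in for the stuck worker: a child process that ignores SIGTERM.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
)
proc.stdout.readline()  # wait until the child has installed its handler
outcome = stop_process(proc)
proc.stdout.close()
print(outcome)
```

A negative `proc.returncode` afterwards indicates that the child was ended by a signal rather than exiting on its own.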

Steps To Reproduce

Steps to reproduce the behavior:

  1. Set up paperless-ngx and add some documents to it
  2. Wait a while(?) or manually trigger the train_classifier task
  3. It times out after 30 minutes, probably.

Expected behavior

The classifier training task should finish successfully in a few minutes or less.

Screenshots

Additional context

See paperless-ngx/paperless-ngx#2373 for more background. This appears to be specific to paperless-ngx on NixOS.

Notify maintainers

@lukegb @gador @erikarvstedt @Flakebi

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

  • system: "x86_64-linux"
  • host os: Linux 5.15.83, NixOS, 23.05 (Stoat), 23.05.git.9790f3242da2M
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.13.3
  • channels(benley): ""
  • channels(root): "home-manager-23.05.tar.gz, nixos-23.05"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos
@benley
Member Author

benley commented Jun 29, 2023

a tiny bit more context:

The classifier task gets as far as printing these messages in /var/lib/paperless/log/paperless.log:

[2023-06-29 15:05:01,974] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] 91 documents, 16 tag(s), 38 correspondent(s), 9 document type(s). 0 storage path(es)
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-06-29 15:05:05,693] [DEBUG] [paperless.classifier] Training tags classifier...
[2023-06-29 15:05:23,120] [DEBUG] [paperless.classifier] Training correspondent classifier...

and then it gets stuck.

That last message comes from this spot in the code: https://github.com/paperless-ngx/paperless-ngx/blob/7a464d8a6eff11bcd0100330cb1687da50e196e6/src/documents/classifier.py#L279
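
To check which stage a given installation gets stuck on, the last `paperless.classifier` line in the log is what matters. A rough sketch of extracting it, assuming the log format shown above; `last_classifier_message` is a hypothetical helper, not something paperless-ngx provides:

```python
import re

# Matches lines such as:
# [2023-06-29 15:05:23,120] [DEBUG] [paperless.classifier] Training correspondent classifier...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[paperless\.classifier\] (?P<msg>.*)$"
)


def last_classifier_message(log_text):
    """Return (timestamp, message) of the last classifier log line, or None."""
    last = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            last = (match.group("ts"), match.group("msg"))
    return last


sample = """\
[2023-06-29 15:05:01,974] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-06-29 15:05:05,693] [DEBUG] [paperless.classifier] Training tags classifier...
[2023-06-29 15:05:23,120] [DEBUG] [paperless.classifier] Training correspondent classifier...
"""

print(last_classifier_message(sample))
```

If the returned message is a "Training ... classifier..." line and nothing follows it until the task times out, the hang is inside that training step.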

@benley
Member Author

benley commented Jun 29, 2023

...which means it's not even getting stuck on the first call to MLPClassifier().fit(), as the tags classifier finishes. It's the second one.

@ryane
Contributor

ryane commented Oct 16, 2023

I'm seeing the same behavior. The only difference is I am getting stuck at the tags classifier.

[2023-10-16 06:10:12,745] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-10-16 06:10:13,159] [DEBUG] [paperless.classifier] 309 documents, 14 tag(s), 1 correspondent(s), 0 document type(s). 1 storage path(es)
[2023-10-16 06:10:13,160] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-10-16 06:10:13,961] [DEBUG] [paperless.classifier] Training tags classifier...

It hangs forever here, pegging a single core.

 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.57, NixOS, 23.11 (Tapir), 23.11.20231011.5e4c2ad`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.17.0`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

@Atemu
Member

Atemu commented Oct 16, 2023

It appears it can get stuck on any classification task. A temporary workaround is to disable classification for some of the lesser-used tags/correspondents/doctypes until it works again.

@Atemu
Member

Atemu commented Dec 22, 2023

The likely cause of this bug has been found: OpenBLAS. Building numpy with a different BLAS implementation, e.g. the proprietary mkl, resolves this issue.

I'm not sure how to start approaching upstream in this as OpenBLAS is like 3 libraries down the dependency tree.

@benley
Member Author

benley commented Jan 29, 2024

> I'm not sure how to start approaching upstream in this as OpenBLAS is like 3 libraries down the dependency tree.

The nixpkgs manual BLAS/LAPACK section suggests using LD_LIBRARY_PATH to select a different BLAS implementation at runtime:

$ LD_LIBRARY_PATH=$(nix-build -A mkl)/lib${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH nix-shell -p octave --run octave

One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
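
The `${LD_LIBRARY_PATH:+:}` expansion in the command above adds the separating colon only when the variable is already set and non-empty, so the resulting path never ends in an empty entry (which the dynamic loader would treat as the current directory). The same logic as a small Python sketch; `prepend_library_path` and the example directories are hypothetical, used only for illustration:

```python
import os


def prepend_library_path(directory, env=None):
    """Return a copy of env with directory placed first in LD_LIBRARY_PATH.

    A colon is inserted only if LD_LIBRARY_PATH already has a value,
    mirroring the shell expansion ${LD_LIBRARY_PATH:+:}.
    """
    new_env = dict(os.environ if env is None else env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = directory + (":" + current if current else "")
    return new_env


# Placeholder paths; in practice the directory would be the lib/ output of mkl.
print(prepend_library_path("/example/mkl/lib", {})["LD_LIBRARY_PATH"])
print(
    prepend_library_path("/example/mkl/lib", {"LD_LIBRARY_PATH": "/opt/lib"})[
        "LD_LIBRARY_PATH"
    ]
)
```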

@benley
Member Author

benley commented Jan 29, 2024

> One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.

So far this seems to actually work! I wasn't able to use the amd-blis library due to missing symbols, but mkl worked:

services.paperless.extraConfig = {
  LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
};

(that will be services.paperless.settings in nixos-unstable; I am running 23.11)

@Atemu
Member

Atemu commented Jan 30, 2024

Another thing we could perhaps try is to prevent BLAS from loading at all, because the upstream wheels for numpy apparently don't include any BLAS implementation, and therefore neither does the paperless-ngx docker image.

ambroisie added a commit to ambroisie/nix-config that referenced this issue Jan 30, 2024
This is an experimental fix to try and get around an issue with the
default BLAS/LAPACK implementation. See [1] for more details.

[1]: NixOS/nixpkgs#240591
@Lyndeno
Contributor

Lyndeno commented Mar 18, 2024

> One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
>
> So far this seems to actually work! I wasn't able to use the amd-blis library due to missing symbols, but mkl worked:
>
> services.paperless.extraConfig = {
>   LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
> };
>
> (that will be services.paperless.settings in nixos-unstable; I am running 23.11)

How well is this working now? I might be experiencing the same issue.

@benley
Member Author

benley commented Mar 18, 2024

> How well is this working now? I might be experiencing the same issue.

I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.

@SuperSandro2000
Member

Should we add this to the paperless module?

@martinholters

FWIW, I've experienced the same problem and upon finding this issue also set LD_LIBRARY_PATH to use MKL, which made the problem go away for me, too.
With MKL being non-free I'm not 100% happy with it, though. Is there any chance this might also be solved by employing a newer OpenBLAS version? I lack the time to dig deeper into this, unfortunately.

@Atemu
Member

Atemu commented Mar 19, 2024

OpenBLAS is on the newest release.

@Lyndeno
Contributor

Lyndeno commented Mar 19, 2024

> > How well is this working now? I might be experiencing the same issue.
>
> I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.

Can confirm that adding MKL fixed the problem I was having (the "Classifier file does not exist" error).

@martinholters

> OpenBLAS is on the newest release.

I'm using nixos-23.11 where OpenBLAS is at 0.3.24. But if the problem persists with 0.3.26 from nixos-unstable, just waiting for the update to propagate to the next release obviously won't be the solution, unfortunately.

Lyndeno added a commit to Lyndeno/nix-config that referenced this issue Mar 20, 2024
@martinholters

Instead of using MKL I've tried the following, and it seems to work for me, too:

services.paperless.extraConfig = {
  OPENBLAS_NUM_THREADS = 1;
  OMP_NUM_THREADS = 1;
  GOTO_NUM_THREADS = 1;
};

(It's probably sufficient to set one of them, but I wanted to be sure the setting takes effect no matter what.)
Can someone confirm this makes the problem go away with OpenBLAS? Or was it just a fluke for me?

@Atemu
Member

Atemu commented Mar 24, 2024

The culprit is OMP_NUM_THREADS. Setting it to 1 works around the issue.
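
For reproducing this outside the NixOS module, the same workaround can be applied by starting the process with the variable already in its environment. A hedged Python sketch, assuming the OpenMP runtime reads `OMP_NUM_THREADS` when it initialises, so the variable has to be set before the numerical libraries are loaded; `single_thread_env` and `run_single_threaded` are hypothetical helpers:

```python
import os
import subprocess
import sys


def single_thread_env(base=None):
    """Return a copy of the environment with OMP_NUM_THREADS forced to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(argv):
    """Run argv in a child process whose environment limits OpenMP to one thread."""
    return subprocess.run(
        argv, env=single_thread_env(), capture_output=True, text=True, check=True
    )


# Demonstration: the child only reports the value it received.
result = run_single_threaded(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
)
print(result.stdout.strip())
```

Setting the variable in a child environment rather than with `os.environ` inside an already running interpreter avoids depending on import order.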

@martinholters

Aha:
"OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1."
And in NixOS, OpenBLAS is configured to set USE_OPENMP=1 on many platforms. So that's why OMP_NUM_THREADS is the one to set. But I've also found this:

# Multi-threaded applications must not call a threaded OpenBLAS
# (the only exception is when an application uses OpenMP as its
# *only* form of multi-threading). See
# https://github.com/xianyi/OpenBLAS/wiki/Faq/4bded95e8dc8aadc70ce65267d1093ca7bdefc4c#multi-threaded
# https://github.com/xianyi/OpenBLAS/issues/2543
# This flag builds a single-threaded OpenBLAS using the flags
# stated in thre.
, singleThreaded ? false

So maybe forcing paperless to use an OpenBLAS built with singleThreaded = true might be the proper solution here?

@Atemu
Member

Atemu commented Mar 24, 2024

Paperless only uses OpenBLAS via scikit-learn via numpy. This is a generic library and paperless is not the exclusive user of any of these.

The proper solution is to figure out what causes OpenBLAS to spin on sched_yield() when multiple OMP threads are used and the reason might be anywhere in the stack. The next step is likely to find a more minimal reproducer as low in this stack as possible and go bother the relevant upstream with it.

At least FreeBSD appears to run into the same issue, so I don't think it's an obvious packaging issue on our end.

Upstream wheels of numpy do not appear to use OpenBLAS at all; perhaps we could also look into whether our numpy using OpenBLAS is necessary, supported and desirable.

Atemu added a commit to Atemu/nixpkgs that referenced this issue Mar 25, 2024
@Atemu
Member

Atemu commented Mar 26, 2024

Proposed #299008 as a workaround until a proper solution is found.

SuperSandro2000 pushed a commit to SuperSandro2000/nixpkgs that referenced this issue Mar 26, 2024
Atemu added a commit to Atemu/nixpkgs that referenced this issue Mar 26, 2024
benley pushed a commit that referenced this issue Mar 26, 2024
SuperSandro2000 pushed a commit to SuperSandro2000/nixpkgs that referenced this issue Mar 27, 2024