Make Document Processing Pipeline More Fault Tolerant #79

Merged 7 commits on Feb 27, 2023

Commits on Feb 20, 2023

  1. d651616

Commits on Feb 21, 2023

  1. Rewrote doc processing pipeline to split PDFs and store them in temporary disk storage, to avoid clogging Redis with huge files and to combat timeout conditions encountered with very large (multi-thousand page) files. Need to add code to make this work properly with local deployments as well as S3. Also need to add tests.
    JSv4 committed Feb 21, 2023
    721706d
  2. ca1f3cf
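
The idea in commit 721706d (split the document, park the pieces on disk, and keep large payloads out of Redis) can be illustrated with a minimal sketch. This is not the code from the PR: `split_to_disk`, `enqueue_pages`, and `drain` are hypothetical names, pages are plain byte strings rather than real PDF pages, and a standard-library `queue.Queue` stands in for the Redis-backed broker. The only point is that the queue carries short file paths while the page data stays on temporary disk storage.

```python
import queue
import tempfile
from pathlib import Path
from typing import List


def split_to_disk(pages: List[bytes], workdir: Path) -> List[Path]:
    """Write each page to its own file and return the file paths."""
    paths = []
    for number, page in enumerate(pages, start=1):
        path = workdir / f"page_{number:05d}.bin"
        path.write_bytes(page)
        paths.append(path)
    return paths


def enqueue_pages(paths: List[Path], broker: queue.Queue) -> List[str]:
    """Put only the path strings on the queue, never the page bytes."""
    messages = [str(path) for path in paths]
    for message in messages:
        broker.put(message)
    return messages


def drain(broker: queue.Queue) -> int:
    """Consume every queued path, read the page from disk, count the bytes."""
    total = 0
    while not broker.empty():
        total += len(Path(broker.get()).read_bytes())
    return total


def demo():
    # Hypothetical stand-in data: three "pages" of different sizes.
    pages = [b"x" * 1000, b"y" * 2000, b"z" * 3000]
    broker = queue.Queue()  # assumption: stands in for the Redis broker
    with tempfile.TemporaryDirectory() as tmp:
        paths = split_to_disk(pages, Path(tmp))
        messages = enqueue_pages(paths, broker)
        queued_bytes = sum(len(message) for message in messages)
        processed_bytes = drain(broker)
    return queued_bytes, processed_bytes
```

Calling `demo()` returns the number of bytes that went through the queue and the number of page bytes read back from disk; the first stays small no matter how large the pages are. As the commit message notes, a real deployment would also have to handle S3 as well as local storage, which this sketch does not attempt.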

Commits on Feb 22, 2023

  1. a8131eb
  2. 42ec035
  3. Local pre-commit had broken and was not running properly. Manually ran all style checks and linting.
    JSv4 committed Feb 22, 2023
    3bd6690

Commits on Feb 26, 2023

  1. Finally have good, robust performance for the document parsing pipeline using page-wise queueing. Tuning Celery workers is an ongoing process. For now, trying concurrency=1 and then scaling the celeryworker instances via docker-compose. May be able to increase concurrency. Should document this tuning process for others with different envs (more cores, fewer cores, whatever).
    JSv4 committed Feb 26, 2023
    e7548c5
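
The page-wise queueing described in commit e7548c5 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's Celery code: `parse_page`, `worker`, and `run_pipeline` are hypothetical names, each thread stands in for one celeryworker instance running with concurrency=1 (it handles one page at a time), and `worker_count` plays the role of the number of instances scaled up via docker-compose.

```python
import queue
import threading
from typing import Dict, List


def parse_page(page_number: int) -> str:
    """Hypothetical placeholder for the real per-page parsing work."""
    return f"parsed page {page_number}"


def worker(tasks: queue.Queue, results: Dict[int, str]) -> None:
    """Process one page at a time until a None sentinel arrives."""
    while True:
        page_number = tasks.get()
        if page_number is None:
            break
        results[page_number] = parse_page(page_number)


def run_pipeline(page_count: int, worker_count: int) -> List[str]:
    """Queue one task per page and fan the tasks out to the workers."""
    tasks: queue.Queue = queue.Queue()
    results: Dict[int, str] = {}
    workers = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(worker_count)
    ]
    for thread in workers:
        thread.start()
    for page_number in range(1, page_count + 1):
        tasks.put(page_number)
    for _ in workers:
        tasks.put(None)  # one shutdown sentinel per worker
    for thread in workers:
        thread.join()
    # Reassemble results in page order, whichever worker finished first.
    return [results[n] for n in range(1, page_count + 1)]
```

Because every page is its own task, a multi-thousand-page file never occupies a single worker for the whole job, and throughput can be adjusted by changing the worker count rather than the per-worker concurrency. The right numbers depend on the environment, which is the tuning the commit message says still needs documenting.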