Make Document Processing Pipeline More Fault Tolerant #79

Merged: 7 commits, Feb 27, 2023

@JSv4 JSv4 commented Feb 22, 2023

Experimenting with some real-world document collections, it's become apparent that document processing cannot be done on a doc-by-doc basis. The variation is too great. Most documents are OK, but some run to thousands of pages or have poor image compression. This creates situations where our Celery workers crash from excessive memory consumption or time out because of extremely long processing times. The new processing pipeline splits PDFs by page and then assigns each page to a worker in the Celery queue. The byte contents of the pages are no longer stored in the worker args, as this consumes memory in Redis. Instead, they're stored on the local FS or S3, depending on whether USE_AWS is set.
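For anyone skimming, here is a rough, standard-library-only sketch of the idea of queueing a reference to each page instead of its bytes. It is not the code in this PR: `store_page`, `enqueue_pages`, and `process_next_page` are hypothetical names, an in-process `queue.Queue` stands in for the Celery/Redis queue, the PDF splitting step is assumed to have already produced one bytes object per page, and only the local-FS branch is shown (the S3 branch selected by USE_AWS is omitted).

```python
import queue
import uuid
from pathlib import Path
from typing import List, Tuple


def store_page(page_bytes: bytes, storage_dir: Path) -> str:
    """Write one page's bytes to temporary storage and return its path."""
    path = storage_dir / f"page-{uuid.uuid4().hex}.pdf"
    path.write_bytes(page_bytes)
    return str(path)


def enqueue_pages(pages: List[bytes], storage_dir: Path, broker: queue.Queue) -> int:
    """Store each page, then enqueue a small message that holds only its path."""
    for page_number, page_bytes in enumerate(pages, start=1):
        # The message carries a short path string, never the page contents.
        broker.put({"page_number": page_number, "path": store_page(page_bytes, storage_dir)})
    return len(pages)


def process_next_page(broker: queue.Queue) -> Tuple[int, int]:
    """Worker side: take one message, load the page from storage, clean up."""
    message = broker.get()
    path = Path(message["path"])
    page_bytes = path.read_bytes()
    path.unlink()  # temporary storage: delete the file once it has been read
    # Placeholder for real per-page processing: report the page number and size.
    return message["page_number"], len(page_bytes)
```

Each queued message stays a few dozen bytes regardless of page size, so a multi-thousand-page file costs the broker very little memory, and each worker only ever loads one page at a time.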

…rary disk storage to avoid clogging Redis with huge files and to combat timeout conditions encountered with very large (multi-thousand page) files. Need to add code to make this work properly with local deployments as well as S3. Also need to add tests.
…sing page-wise queueing. Tuning Celery workers is an ongoing process. For now, trying concurrency=1 and then scaling the celeryworker instances via docker-compose. May be able to increase concurrency. Should document this tuning process for others with different envs (more cores, fewer cores, whatever).

codecov bot commented Feb 26, 2023

Codecov Report

Merging #79 (e7548c5) into main (d349fb3) will increase coverage by 2.13%.
The diff coverage is 46.15%.

@@            Coverage Diff             @@
##             main      #79      +/-   ##
==========================================
+ Coverage   66.77%   68.91%   +2.13%     
==========================================
  Files          47       47              
  Lines        1794     1853      +59     
==========================================
+ Hits         1198     1277      +79     
+ Misses        596      576      -20     
| Impacted Files | Coverage Δ |
|---|---|
| opencontractserver/utils/etl.py | 19.82% <0.00%> (ø) |
| opencontractserver/utils/pdf.py | 39.53% <37.50%> (-5.30%) ⬇️ |
| opencontractserver/tasks/doc_tasks.py | 66.07% <47.42%> (+37.38%) ⬆️ |
| opencontractserver/documents/signals.py | 100.00% <100.00%> (ø) |
| opencontractserver/tasks/__init__.py | 100.00% <100.00%> (ø) |

@JSv4 JSv4 merged commit 44d85ff into main Feb 27, 2023
@JSv4 JSv4 deleted the JSv4/enhance-doc-pipeline branch February 27, 2023 06:11