Make Document Processing Pipeline More Fault Tolerant #79
Experimenting with real-world document collections has made it apparent that document processing can't be done on a doc-by-doc basis: the variation is too great. Most documents are fine, but some run to thousands of pages or have poor image compression. This creates situations where our Celery workers crash from over-consumption of memory, or time out because of extremely long processing times.

The new processing pipeline splits PDFs by page and assigns each page to a worker in the Celery queue. The byte contents of the pages are no longer stored in the worker args, since that consumes memory in Redis. Instead, they're stored to the local FS or S3, depending on whether `USE_AWS` is set.
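
For reviewers, here is a minimal sketch of the fan-out pattern described above, not the exact code in this PR: task names (`split_and_dispatch`, `process_page`), the storage paths, and the `pypdf`-based splitting are all assumptions for illustration.

```python
# Sketch only: names, paths, and the pypdf-based split are illustrative
# assumptions, not this PR's actual implementation.
import io
import os

import boto3
from celery import shared_task
from pypdf import PdfReader, PdfWriter

USE_AWS = os.environ.get("USE_AWS", "").lower() == "true"


def store_page(doc_id: str, page_num: int, data: bytes) -> str:
    """Persist one page's bytes and return a reference to them,
    so only the small reference travels through the Redis broker."""
    key = f"pages/{doc_id}/{page_num}.pdf"
    if USE_AWS:
        # Bucket name is a placeholder.
        boto3.client("s3").put_object(Bucket="doc-pages", Key=key, Body=data)
        return key
    path = os.path.join("/tmp", key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path


@shared_task
def split_and_dispatch(doc_id: str, pdf_path: str) -> None:
    """Split the PDF into single pages and queue one task per page."""
    reader = PdfReader(pdf_path)
    for page_num, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        buf = io.BytesIO()
        writer.write(buf)
        ref = store_page(doc_id, page_num, buf.getvalue())
        # Only the reference is passed as a task arg, keeping Redis lean.
        process_page.delay(doc_id, page_num, ref)


@shared_task
def process_page(doc_id: str, page_num: int, page_ref: str) -> None:
    """Fetch one page by reference and process it. A crash or timeout
    here loses a single page, not the whole document."""
    ...
```

The key design point is that a pathological document now degrades into many small, independently retryable tasks, so one huge or badly compressed PDF can no longer take down a worker or stall the queue.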