Cache intermediate results with diskcache #1509

shemhamforash23 opened this issue May 2, 2025 · 0 comments
Labels
enhancement New feature or request


Requested feature

Sometimes the application fails partway through processing, because of an out-of-memory (OOM) error or for any other reason, and all page batches and image descriptions processed so far are lost.

A few issues arise from this:

  • You have to start processing from scratch and hope it does not fail this time.
  • If you use a remote VLM provider for image descriptions, you spend tokens and money on processing the same data repeatedly.
  • If you are resource constrained, you risk never processing some documents at all because of repeated failures. It's all or nothing.

It would be great to have an option that enables caching of intermediate results: processed pages and image descriptions.
For example, diskcache would be a good fit for this task.

Reasons to implement caching:

  • More reliable processing on resource-constrained systems.
  • Resume from the last failure point, so large documents can eventually be processed after a few restarts.
  • Spend less time, money, and compute on processing the same data repeatedly.
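
To make the request concrete, here is a minimal sketch of the resume-after-failure idea. It is not the project's API: `process_document`, `process_page`, `open_disk_cache`, and the key layout are hypothetical names chosen for illustration. The only assumption about the cache is that it behaves like a mapping (`key in cache`, `cache[key]`, `cache[key] = value`), which `diskcache.Cache` supports.

```python
import hashlib
from typing import Any, Callable, List


def open_disk_cache(directory: str) -> Any:
    """Open a persistent on-disk cache (diskcache is a third-party package)."""
    import diskcache  # pip install diskcache

    return diskcache.Cache(directory)


def process_document(
    document_bytes: bytes,
    page_count: int,
    process_page: Callable[[int], Any],  # hypothetical per-page worker
    cache: Any,  # any mapping-like object, e.g. diskcache.Cache or a dict
) -> List[Any]:
    """Process pages one by one, skipping pages already stored in the cache."""
    # Key on the document content so an edited file does not reuse stale results.
    digest = hashlib.sha256(document_bytes).hexdigest()
    results = []
    for index in range(page_count):
        key = f"{digest}:page:{index}"
        if key in cache:
            # Finished in an earlier (possibly crashed) run: reuse it.
            results.append(cache[key])
            continue
        result = process_page(index)
        # Store the result before moving on, so a later crash keeps this page.
        cache[key] = result
        results.append(result)
    return results
```

With a cache from `open_disk_cache("some/cache/dir")`, a run that crashes on page N would, on restart, reuse pages 0 to N-1 and continue from page N. Image descriptions could be cached the same way, keyed on a hash of the image bytes, so a remote VLM is not billed twice for the same image.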
shemhamforash23 added the enhancement label on May 2, 2025