Pagewise Processing #2

Open
6 tasks
krvoigt opened this issue Dec 7, 2021 · 2 comments

@krvoigt

krvoigt commented Dec 7, 2021

Current situation

Processors iterate over the files in a workspace on their own. While it is possible to restrict processing to a single page or a list/range of pages, the API is designed around processors determining for themselves which pages to process. Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two: when doing pagewise processing with a pageID restriction, the setup in process still happens for every call.

How it should be

The process method should be deprecated and replaced with a process_page method.

Processors should have a setup method that encapsulates all the post-initialization but pre-processing steps necessary for processing.
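
To make the proposal concrete, here is a minimal sketch of what the split could look like. The class names (`PagewiseProcessor`, `UppercaseProcessor`), the `run` helper, and the method signatures are assumptions made up for illustration; this is not the existing OCR-D/core API, and the final design may well differ.

```python
from typing import Dict, List


class PagewiseProcessor:
    """Hypothetical base class; not the actual OCR-D/core API."""

    def __init__(self) -> None:
        self.is_set_up = False

    def setup(self) -> None:
        """One-time, post-initialization work such as loading models."""
        self.is_set_up = True

    def process_page(self, page_id: str) -> str:
        """Process exactly one page; subclasses override this."""
        raise NotImplementedError


class UppercaseProcessor(PagewiseProcessor):
    """Toy processor whose 'model' is just a lookup table."""

    def setup(self) -> None:
        super().setup()
        # Stand-in for an expensive model load.
        self.model: Dict[str, str] = {"p1": "first page", "p2": "second page"}

    def process_page(self, page_id: str) -> str:
        return self.model[page_id].upper()


def run(processor: PagewiseProcessor, page_ids: List[str]) -> List[str]:
    """Caller-driven loop: set up once, then hand over one page at a time."""
    processor.setup()
    return [processor.process_page(page_id) for page_id in page_ids]


results = run(UppercaseProcessor(), ["p1", "p2"])
```

The point of the sketch is that the caller, not the processor, decides which pages are processed and when, while the expensive part lives in `setup` and runs only once.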

Steps

  • Refactor processor code in OCR-D/core to provide entry points for process_page and setup
  • Deprecate process
  • Test
  • Change all the processors
  • Communicate change in Tech Call
  • Reflect changed API in documentation
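
For the "Deprecate process" step, one possible approach is to keep process as a thin shim that warns and then delegates to the new entry points, so that existing processors keep working during the migration. The sketch below uses Python's standard `warnings` module; the `Processor` class, its `page_ids` attribute, and the warning text are hypothetical and only illustrate the idea.

```python
import warnings
from typing import List


class Processor:
    """Hypothetical processor; names are illustrative, not the OCR-D/core API."""

    def __init__(self, page_ids: List[str]) -> None:
        self.page_ids = page_ids
        self.processed: List[str] = []

    def setup(self) -> None:
        """One-time preparation (nothing to do in this toy example)."""

    def process_page(self, page_id: str) -> None:
        """New entry point: handle a single page."""
        self.processed.append(page_id)

    def process(self) -> None:
        """Legacy entry point, kept only as a shim during migration."""
        warnings.warn(
            "process() is deprecated; use setup() and process_page() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.setup()
        for page_id in self.page_ids:
            self.process_page(page_id)


# Record warnings so the deprecation notice can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = Processor(["p1", "p2"])
    legacy.process()
```
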
@krvoigt krvoigt added the Epic label Dec 7, 2021
@paulpestov

Maybe we could describe more what problem we are trying to solve and what users can expect after the implementation.

"Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two..."
For example, why is it useful to make this separation?

PS: I think the purpose behind this feature would normally serve as the epic description (like "reduce processing time by X to meet metric Y"), and one of the actual user stories from that epic would be "as a processor dev, I want to process pages in parallel".

@kba
Member

kba commented Jan 17, 2022

"Setup functionality (like loading models or other data into memory) is intertwined with processing, making it difficult to separate the two...
For example, why is it useful to make this separation?"

It improves performance because setting up the processor can be done just once instead of with every call to process.
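
A toy comparison may help illustrate this. Both classes below are hypothetical stand-ins, and a counter replaces the real (expensive) model load; the sketch only counts how often that load would happen when three pages are processed one call at a time.

```python
class LegacyProcessor:
    """Hypothetical: setup is intertwined with process()."""

    def __init__(self) -> None:
        self.model_loads = 0

    def process(self, page_id: str) -> str:
        # The model is (re)loaded on every call, even for a single page.
        self.model_loads += 1
        return f"processed {page_id}"


class SplitProcessor:
    """Hypothetical: setup is separated from per-page processing."""

    def __init__(self) -> None:
        self.model_loads = 0

    def setup(self) -> None:
        # The model is loaded once, before any page is processed.
        self.model_loads += 1

    def process_page(self, page_id: str) -> str:
        return f"processed {page_id}"


pages = ["p1", "p2", "p3"]

legacy = LegacyProcessor()
for page_id in pages:
    legacy.process(page_id)

split = SplitProcessor()
split.setup()
for page_id in pages:
    split.process_page(page_id)

print(legacy.model_loads, split.model_loads)
```

With the legacy shape the load count grows with the number of calls, whereas with the split shape it stays at one regardless of how many pages follow.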
