additional CLI/METS/ocrd-tool.json specs for quality estimator tools and annotations #172
From a knowledge archaeology point of view it is good enough to have at least some OCR results, rather than losing a whole document with several thousand pages. One could even argue that for really big pages (newspapers, maps, 2°-prints) it is preferable to save the parts of a single page that were recognized well. Anyway, future OCR is expected to perform better than today's, so there will be improvement over time. I really like the idea of alternative workflows, but I'm afraid the Wizardry of Workflow Configuration (WWC) will grow in a manner unintelligible to common mortals like me. The first step, IMHO, is to extend the core CLI in a way that each processor reports its outcome in a unified way, so the core can decide what to do next.
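Purely as an illustration of that idea, a unified per-page outcome report could be as simple as the following sketch (the field names and the tool name are made up here, not an agreed OCR-D format):

```python
# Hypothetical sketch only: one conceivable shape of a unified per-page
# outcome report a processor could hand back to core. None of these
# field names are part of an existing OCR-D specification.
outcome = {
    "processor": "ocrd-example-segment",   # invented tool name
    "page_id": "PHYS_0007",
    "status": "partial",                   # success / partial / failure
    "score": 0.78,                         # the processor's own quality estimate
    "failed_segments": ["region_0012"],    # parts that did not meet the score
}
```

With something of this shape, core could decide per page whether to continue, retry, or skip.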
Yes, that's radical, but doable IMO. We could strive for line-level result aggregation. We would need some way of marking partial failure (or "rejection") in our annotations – e.g. when layout segmentation failed to detect a text region or failed to meet a score across parts of the page – and optionally act on it – e.g. by (removing the failed segments and) running another segmentation processor on top of the partial result (which works as long as processors are annotating incrementally). But if we allow partial failure, then we cannot use binary CLI interfaces anymore (where a processor or test either succeeds or fails)... Some people have proposed standardizing log messages and then parsing these instead, but I would not recommend that approach at all. How about standardizing the processor/evaluator CLI's exit codes, so we can differentiate between all-out success (0), all-out failure (1) and partial failure (2), perhaps even temporary failure (3) (e.g. if the GPU ran OOM)?
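As a minimal sketch of that convention (the constant names and the dispatch helper are assumptions for illustration, not anything core currently provides), a workflow engine could branch on a processor's or evaluator's return code like this:

```python
import subprocess

# Proposed (not yet existing) exit-code convention:
SUCCESS = 0            # all-out success
FAILURE = 1            # all-out failure
PARTIAL_FAILURE = 2    # some pages/segments were rejected
TEMPORARY_FAILURE = 3  # e.g. GPU ran out of memory, worth retrying

def dispatch(cmd):
    """Run one processor/evaluator CLI call and map its exit code to a workflow decision."""
    code = subprocess.run(cmd).returncode
    if code == SUCCESS:
        return "continue"
    if code == PARTIAL_FAILURE:
        return "repair"   # e.g. run another segmenter on top of the partial result
    if code == TEMPORARY_FAILURE:
        return "retry"
    return "abort"
```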
Writing such generalized workflow configurations will become even more challenging, yes. But finding good/general workflows is already hard and requires expert knowledge and experimental data. We need not look at the divide between field experts and users as a drawback: it could also be called division of labour! The more work goes into a workflow configuration, the more versatile and usable it becomes, too. If users had to choose between 100 specialised yet simple workflows (and perhaps writing the 101st themselves) and just 2-3 general but complex ("intelligent") workflows, then I think they would usually prefer the latter.
I gave it some thought, and I do think this "skip strategy" is a valid choice – if done right. Skipping a page must not mean that no page is created in the output file group, though, but that at least a copy of the input annotation is passed through. (That's also the natural generalisation of empty segments within a page.) Thus, the workflow can still re-run the same step with other processors/parameters (as long as they are capable of incremental annotation), but it could also just follow up with the next step. The only reason to force a re-run or to cancel the workflow entirely would then be that too many pages are empty or quality is just too bad on average. And I do agree that core has partial responsibility for this behaviour, because it could call processors page by page, catching their exceptions and falling back to pass-through. Above that, however, we are not talking about core's Python API here, but about the OCR-D workflow engine (whatever that will be). AFAICS
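As a rough sketch of that page-wise fallback (the `process_page` and `pass_through` callables are placeholders, not actual core API), the behaviour could look like this:

```python
# Hypothetical sketch of the "skip strategy": call the processor page by
# page; on failure, pass a copy of the input annotation through to the
# output fileGrp instead of dropping the page entirely.
def run_page_wise(page_ids, process_page, pass_through):
    """process_page(page_id) runs the processor on one page;
    pass_through(page_id) copies the input annotation to the output fileGrp.
    Returns the pages that had to be skipped."""
    skipped = []
    for page_id in page_ids:
        try:
            process_page(page_id)
        except Exception:
            pass_through(page_id)
            skipped.append(page_id)
    # the workflow engine can re-run or cancel if too many pages were skipped
    return skipped
```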
We all know we need some form of quality estimates to control where computation is spent and what workflow steps are used.
External quality control
One might consider this as a problem external to workflow configuration and module implementation. In that interpretation, it remains the workflow engine's job to test results, and to enter alternative pre-defined workflow paths – or give up a workspace – when they fail.
So the user could still influence computation by providing configurations with switches/conditionals. But that influence is rather limited – you could not say where to check, how, or with which models.
Also, module implementers cannot contribute their expert knowledge about how to get good quality estimates.
OCR-D quality control
Alternatively, one might want to model these tests explicitly, defining and managing specialised tools for testing, and configuring their usage in workflows along with the processors themselves.
So – at this point I am repeating a proposal made in discussing #171 – …
How about we introduce a dedicated CLI for OCR-D workflow quality estimators, analogous to OCR-D workflow processors? Modules could bundle their knowledge about what a good result is for a particular processor along with everything else. And specialized modules could provide QE tools by the bunch. Let's call them evaluators for the time being.
We are primarily interested in an evaluator's score, which needs to be compared to some threshold to determine whether the processor's result was "okay" enough to continue processing. That threshold could differ depending on the workflow configuration. Evaluators could also have configurable parameters (like model files) of their own.
An important question is how to deal with page-wise vs. document-wise quality. Do we want to stop if even a single page indicates failure? Or does it take a bad average? Or a prolonged series of bad pages? Or a `min-worst-page` / `min-average` kind of set-up?
Regardless, besides the binary `score > threshold` check, we might also be interested in additional, verbose output on what was analysed and what patterns were found – either as part of logging, or as a structured report. So we must also allow creating an output fileGrp.
Now let's assume a call to an evaluator looks like this: (with its return value indicating success or failure, just as the return value of a processor).
Then we could write dynamic workflows like so:
(borrowing the `sh -e` convention that any non-zero return value causes the workflow to fail, except if it was immediate to a conditional expression)
That would be fully dynamic, but still not allow arbitrary information flow like determining model names. For the latter, some notion of functional CLI would be needed...
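Purely for illustration, here is one way such a processor/evaluator pair with a fallback branch could be wired up, expressed in Python via subprocess (the tool names, fileGrp names and the `threshold` parameter are invented for this sketch; only the usual `-m`/`-I`/`-O`/`-p` CLI options are taken from the existing OCR-D CLI conventions):

```python
import json
import subprocess

# Illustrative sketch only: chain a processor with a (hypothetical)
# evaluator, and fall back to an alternative segmenter if the evaluator
# rejects the result.
def segment_with_evaluation(mets_path):
    def run(tool, *args):
        return subprocess.run([tool, "-m", mets_path, *args]).returncode

    run("ocrd-example-segment", "-I", "OCR-D-BIN", "-O", "OCR-D-SEG")
    # the evaluator scores the segmentation, writes a report fileGrp,
    # and signals score > threshold via its exit code
    rejected = run("ocrd-example-evaluate-segmentation",
                   "-I", "OCR-D-SEG", "-O", "OCR-D-SEG-EVAL",
                   "-p", json.dumps({"threshold": 0.8})) != 0
    if rejected:
        # conditional branch: try another segmenter on the same input
        run("ocrd-other-segment", "-I", "OCR-D-BIN", "-O", "OCR-D-SEG2")
```

In an actual workflow configuration this would simply be a conditional around the evaluator call, per the `sh -e` convention above.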
Originally posted by @bertsky in #171
EDIT: term benchmark → evaluator