RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces #544
Comments
|
I don't know if I missed the point a bit, but I do see two different groups of use cases here:
Should these use case groups maybe be put into two separate processors/tools? |
Yes, probably. Or even task-specific processors ( |
Of interest in this context: https://github.com/tboenig/AletheiaTools |
Another useful operation: Assign |
|
Something related: extract METS/MODS from xml_doc created from OAI-Response like this:
|
Let's keep OAI-PMH in a separate issue, cf. #539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see #453 (comment) |
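For the OAI-PMH case mentioned above, the same extraction can be sketched in Python with the standard library. This is only a minimal sketch: the function name `extract_mets` is hypothetical, and it assumes the METS document sits inside the `metadata` element of a GetRecord response, using the standard OAI-PMH 2.0 and METS namespaces.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI-PMH 2.0 and METS
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "mets": "http://www.loc.gov/METS/",
}


def extract_mets(oai_xml):
    """Return the mets:mets element wrapped in an OAI-PMH GetRecord response.

    Hypothetical helper: assumes the METS root is a direct child of oai:metadata.
    """
    root = ET.fromstring(oai_xml)
    mets = root.find(".//oai:metadata/mets:mets", NS)
    if mets is None:
        raise ValueError("no mets:mets element found in OAI-PMH response")
    return mets
```

The returned element can be serialized with `ET.tostring(mets)` and written out as the workspace's METS file.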
Snippet for METS/MODS fileGrp, using wl/bl approach:

    # XMLNS was not shown in the original snippet; assumed to be the METS namespace map
    XMLNS = {'mets': 'http://www.loc.gov/METS/'}

    def clear_fileGroups(xml_root, black_list=None, white_list=None):
        file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
        if not file_sections:
            raise Exception('invalid xml data !')
        for file_section in file_sections:
            # iterate over a copy, since groups are removed while looping
            for sub_group in list(file_section):
                subgroup_label = sub_group.attrib['USE']
                # decide once, so a group is never removed twice
                # when both lists are given
                drop = bool(black_list) and subgroup_label in black_list
                if white_list and subgroup_label not in white_list:
                    drop = True
                if drop:
                    file_section.remove(sub_group)
                    sanitze_pysical_strctMap(xml_root, subgroup_label)

    def sanitze_pysical_strctMap(xml_root, file_ref):
        # drop mets:fptr entries that point at files of the removed group
        pages = xml_root.findall('.//mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)
        for page in pages:
            # substring match: relies on file IDs containing the fileGrp name
            removals = [fptr for fptr in page if file_ref in fptr.attrib['FILEID']]
            for removal in removals:
                page.remove(removal)
|
Also convenient: re-index all METS-Filegroups after any undesired reference entries were dropped. |
My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works. According to this we are already close, but...
|
I stand corrected: As this example by @stefanCCS – METS and ALTO – shows, |
METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.
Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.
Many users therefore have developed scripts to preprocess input and postprocess output of OCR-D.
OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:
- Filter mets:fileGrp, either by allowlist or denylist, i.e. remove mets:fileGrp and contained mets:file (and files on disk) that are not required anymore
- Rewrite xlink:href to match local conventions
- page:TextEquiv information in PAGE-XML

These are just some ideas, we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.
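To make the xlink:href idea above more concrete, here is a minimal sketch of what such a housekeeping step might look like. It is not an existing OCR-D/core API: the function name `rewrite_hrefs` and its prefix-replacement behaviour are assumptions for illustration, using only the Python standard library.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
# Fully qualified attribute name for xlink:href
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def rewrite_hrefs(mets_root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every mets:FLocat xlink:href.

    Hypothetical helper; returns the number of rewritten references.
    """
    changed = 0
    for flocat in mets_root.iterfind(".//mets:fileSec//mets:FLocat", NS):
        href = flocat.get(XLINK_HREF)
        if href is not None and href.startswith(old_prefix):
            flocat.set(XLINK_HREF, new_prefix + href[len(old_prefix):])
            changed += 1
    return changed
```

A call such as `rewrite_hrefs(root, "file:///data/", "https://example.org/images/")` (both prefixes made up here) would turn local paths into URLs; references that do not start with the old prefix are left untouched.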