ocrd-tesserocr-segment: segmentation fault #182

Closed
jbarth-ubhd opened this issue Jan 17, 2022 · 11 comments

@jbarth-ubhd

With this image:

https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif

and ocrd.sif (a Singularity container) built from the ocrd_all Docker image on Nov 9 10:13 2021, and again on Jan 17 15:11 2022 [UPDATE]

and this workflow:

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace init >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif >>ocrd.log 2>&1 || exit

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models_experimental/historical_french_2020-10-14/*.ckpt.json" >>ocrd.log 2>&1 || exit

I get a segmentation fault:

Core was generated by `/usr/bin/python3 /usr/bin/ocrd-tesserocr-segment -P find_tables false -P shrink'.
Program terminated with signal 11, Segmentation fault.
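
For a batch run over many pages, it can help to tell a crash from an ordinary processor error without inspecting core files. The following is a minimal sketch, assuming a POSIX system; the helper name `died_from_segfault` is hypothetical, and the check for exit status 139 (128 + 11) is an assumption about how a wrapper such as a shell or container runtime might report the signal.

```python
import signal
import subprocess


def died_from_segfault(cmd):
    """Run cmd (a list of arguments) and report whether it ended with SIGSEGV.

    Hypothetical helper for illustration; not part of OCR-D or Tesseract.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # On POSIX, subprocess reports "killed by signal N" as return code -N.
    killed_directly = result.returncode == -signal.SIGSEGV
    # Assumption: an intermediate shell or wrapper may instead exit with 128 + N.
    reported_by_wrapper = result.returncode == 128 + signal.SIGSEGV
    return killed_directly or reported_by_wrapper
```

A caller could pass the full processor invocation as a list, for example `["singularity", "exec", "ocrd.sif", "ocrd-tesserocr-segment", ...]`, and log the page whenever the function returns `True`.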
@jbarth-ubhd
Author

jbarth-ubhd commented Jan 18, 2022

Tried to reproduce this bug with plain tesseract:

tesseract --tessdata-dir tessdata_dir -l Latin OCR-D-005_00001.IMG-BIN.png output --psm 11 -c textord_tabfind_find_tables=0 -c poly_wide_objects_better=0

but I don't know if those options are equivalent to the above.

@jbarth-ubhd
Author

I have approx. 1500 "core.12345" files for 62k TIFs = 2.4 % (!). Dear @stweil, could you prioritize this issue?

@stweil
Contributor

stweil commented Feb 2, 2022

I must try to reproduce it in my environment. That would be easier if the problem also occurred with plain tesseract.

@bertsky
Collaborator

bertsky commented Feb 2, 2022

@jbarth-ubhd poly_allow_detailed_fx and poly_wide_objects_better are a completely different, Tesseract-internal mechanism. (It is only used – as PolygonalCopy – indirectly in a few places, like debugging or equation detection, never for extracting outlines. Since it is not exposed to the API, I have no idea what the quality would be.)

The mechanism used for ocrd_tesserocr's shrink_polygons is explained by its documentation:

When detecting any segments, annotate polygon coordinates instead of bounding box rectangles by projecting the convex hull of all symbols.

If shrink_polygons, then during segmentation (on any level),
query Tesseract for all symbols/glyphs of each segment and calculate
the convex hull for them. Annotate the resulting polygon instead of
the coarse bounding box. (This is more precise and helps avoid
overlaps between neighbours, especially when not segmenting all
levels at once.)
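
To illustrate the idea described in that documentation, here is a minimal sketch of computing a convex hull from glyph bounding boxes. It is not ocrd_tesserocr's code: the function names, the `(x0, y0, x1, y1)` box format, and the use of Andrew's monotone chain algorithm are assumptions made for this example, and the real processor obtains the symbols from Tesseract's iterators.

```python
def convex_hull(points):
    """Return the convex hull vertices of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Z component of the cross product (a - o) x (b - o).
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        # Drop points that would make a non-left turn (or are collinear).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_from_symbol_boxes(boxes):
    """Hull of all corners of (x0, y0, x1, y1) symbol boxes (hypothetical format)."""
    corners = []
    for x0, y0, x1, y1 in boxes:
        corners += [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return convex_hull(corners)
```

For two boxes `(0, 0, 10, 10)` and `(20, 5, 30, 15)`, the resulting polygon has six vertices and follows the glyphs more tightly than the enclosing rectangle from `(0, 0)` to `(30, 15)`, which is the effect the parameter aims for.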

@stweil, the underlying cause is a bug in the iterator (state) functions – but I have no time to work on Tesseract, and my fix has become more difficult to work on after the recent upstream changes.

@stweil
Contributor

stweil commented Feb 2, 2022

Please try this patch for the Tesseract code:

diff --git a/src/ccmain/pageiterator.cpp b/src/ccmain/pageiterator.cpp
index e8d528b6..829a1cd1 100644
--- a/src/ccmain/pageiterator.cpp
+++ b/src/ccmain/pageiterator.cpp
@@ -566,7 +566,14 @@ void PageIterator::Orientation(tesseract::Orientation *orientation,
                                tesseract::WritingDirection *writing_direction,
                                tesseract::TextlineOrder *textline_order,
                                float *deskew_angle) const {
-  BLOCK *block = it_->block()->block;
+  auto *block_res = it_->block();
+  if (block_res == nullptr) {
+    *orientation = ORIENTATION_PAGE_UP;
+    *writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
+    *textline_order = TEXTLINE_ORDER_TOP_TO_BOTTOM;
+    return;
+  }
+  auto *block = block_res->block;
 
   // Orientation
   FCOORD up_in_image(0.0, 1.0);

@stweil
Contributor

stweil commented Feb 3, 2022

@jbarth-ubhd, the latest tesseract git main includes the patch which fixes the segmentation fault. Maybe you want to try it and can report whether it produces usable results for the examples which crashed with the old code. I cannot test it myself without the model historical_french_2020-10-14 which you used for the example.

@jbarth-ubhd
Author

@stweil
Contributor

stweil commented Feb 3, 2022

Thanks. I could run your workflow after updating to the latest tesseract and had no problems.

@kba
Member

kba commented Feb 3, 2022

@jbarth-ubhd The fix @stweil mentioned is also part of the newest ocrd_all release, so please update your docker/singularity image.

@stweil
Contributor

stweil commented Feb 8, 2022

@jbarth-ubhd, can we close this issue?

@jbarth-ubhd
Author

yes.
