Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid) #149

Closed
stweil opened this issue Sep 4, 2020 · 23 comments

Comments

@stweil
Contributor

stweil commented Sep 4, 2020

21:19:10.443 INFO processor.TesserocrSegmentLine - INPUT FILE 65 / phys396119
21:19:10.577 INFO processor.TesserocrSegmentLine - Page 'phys396119' images will use DPI estimated from segmentation
21:19:10.850 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 0 107 at 0 107
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 119, in process
    interline = line_poly.intersection(region_poly)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/geometry/base.py", line 676, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f89253f7c88>
@stweil
Contributor Author

stweil commented Sep 4, 2020

The error occurred with the standard workflow and urn:nbn:de:bsz:180-digad-8419.

@stweil
Contributor Author

stweil commented Sep 4, 2020

The error still occurs with revision 974459e.

@bertsky
Collaborator

bertsky commented Sep 4, 2020

Since Tesseract only gives us bboxes here, the invalid polygon must be from the region. I need to know the exact workflow – what do you mean by standard workflow?

Also, this might be another instance of "won't fix because PAGE coordinates must be correct on the input side" (we cannot make all processors robust to all sorts of coordinate invalidities/inconsistencies). So be prepared to wait for a fix in the page segmenter instead...
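
For readers who want to see what "Self-intersection at or near point" means for a region outline, here is a minimal sketch that looks for two non-adjacent edges of a closed ring that meet, and reports where. It is plain Python and not the check that GEOS or Shapely performs internally; the function name `find_self_intersection` and the sample rings are made up for illustration, and collinear overlaps are deliberately ignored.

```python
from itertools import combinations


def _cross(ax, ay, bx, by):
    """2D cross product of the vectors (ax, ay) and (bx, by)."""
    return ax * by - ay * bx


def find_self_intersection(ring):
    """Return the first point where two non-adjacent edges of the closed
    ring meet, or None if no such point is found.

    `ring` is a list of (x, y) tuples without a repeated closing point.
    Parallel (including collinear) edge pairs are skipped, so this is a
    simplified illustration rather than a full validity check.
    """
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Skip edges that share a vertex (neighbours in the ring).
        if j == i + 1 or (i == 0 and j == n - 1):
            continue
        (p1, p2), (p3, p4) = edges[i], edges[j]
        rx, ry = p2[0] - p1[0], p2[1] - p1[1]
        sx, sy = p4[0] - p3[0], p4[1] - p3[1]
        denom = _cross(rx, ry, sx, sy)
        if denom == 0:
            continue  # parallel edges: ignored in this sketch
        qx, qy = p3[0] - p1[0], p3[1] - p1[1]
        t = _cross(qx, qy, sx, sy) / denom
        u = _cross(qx, qy, rx, ry) / denom
        if 0 <= t <= 1 and 0 <= u <= 1:
            return (p1[0] + t * rx, p1[1] + t * ry)
    return None


# Hypothetical example rings: a plain square and a "bow-tie" with the
# same four corners visited in a crossing order.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
print(find_self_intersection(square))
print(find_self_intersection(bowtie))
```

In a real processor one would more likely ask Shapely directly, for example via the `is_valid` property of the polygon, before calling `intersection`.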

@stweil
Contributor Author

stweil commented Sep 4, 2020

"Standard" means one of the workflows suggested at https://ocr-d.de/en/workflows. I use this script:

#!/bin/bash

set -x
set -e

export LANG=C.UTF-8

URN=urn:nbn:de:bsz:180-digad-8419
METS=https://digi.bib.uni-mannheim.de/mets/$URN

date --iso-8601=seconds

time -p ocrd workspace --directory $URN clone $METS

cd $URN

time -p ocrd process \
  "olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
  "fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT -P from-to \"page alto\"" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""

date --iso-8601=seconds

@bertsky
Collaborator

bertsky commented Sep 4, 2020

Thanks @stweil for the neatly encapsulated script. Unfortunately though, I cannot reproduce the problem. Which versions of ocrd_anybaseocr, ocrd_cis and ocrd_segment have you been running?

@stweil
Contributor Author

stweil commented Sep 5, 2020

I used latest ocrd_all with ocrd_tesserocr updated to latest git release.

@stweil
Contributor Author

stweil commented Sep 5, 2020

A fresh run reproduced the problem ...

All data is available here.

@bertsky
Collaborator

bertsky commented Sep 7, 2020

> I used latest ocrd_all with ocrd_tesserocr updated to latest git release.
> A fresh run reproduced the problem ...
> All data is available here.

I have tried again with (Dockerized) OCR-D/ocrd_all@dd35c37 (built at 2020-08-28T18:02:22Z) and ocrd_tesserocr 5761661 (that's your 974459e plus the release commit) – it runs smoothly.

Perhaps it's an effect of differences between Ubuntu 18.04 (Docker, my host) and Debian (your host) in Shapely's base libraries?

@stweil
Contributor Author

stweil commented Sep 7, 2020

Can you compare the generated files on your side with my data (see link above) to see where they differ?

@stweil
Contributor Author

stweil commented Sep 7, 2020

I'll repeat the test as soon as @kba has finished a new ocrd_all release.

@stweil
Contributor Author

stweil commented Sep 7, 2020

The error still occurs. Tested with ocrd_all branch OCR-D/update-2020-09-07 on Debian buster.

@bertsky
Collaborator

bertsky commented Sep 8, 2020

BTW, your script cannot have worked like that on the previous ocrd_all release (based on core 2.15), because that was not able to cope with OAI-PMH responses. And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone. Also, for ocrd process, I wonder how you avoid OCR-D/core#589 (I have to use OCR-D/core#594).

> Can you compare the generated files on your side with my data (see link above) to see where they differ?

Unfortunately, I have no permissions for your mets.xml. I can download the fileGrp directories (if I ignore robots.txt), though. Looks like your Olena already has slightly different results (barely visible differences), followed by slight (1-2 pixel) differences in the cropping and (below 1°) deskewing. That might explain why the error is not triggered on my host and on the Docker release.
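
A rough way to quantify such 1-2 pixel differences between two runs is to compare the `points` strings of corresponding regions. The sketch below assumes the PAGE-XML convention of space-separated `x,y` pairs; the function names and the two example strings are hypothetical and not taken from either host's output.

```python
def parse_points(points):
    """Parse a PAGE-XML style string 'x1,y1 x2,y2 ...' into (x, y) int tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def max_deviation(points_a, points_b):
    """Largest per-coordinate difference between two outlines.

    Returns None if the outlines have a different number of points,
    because a point-by-point comparison is not meaningful then.
    """
    a, b = parse_points(points_a), parse_points(points_b)
    if len(a) != len(b):
        return None
    return max(
        (max(abs(xa - xb), abs(ya - yb)) for (xa, ya), (xb, yb) in zip(a, b)),
        default=0,
    )


# Hypothetical outlines of the same region from two different hosts.
host_a = "174,1285 955,1356 946,1463 165,1392"
host_b = "175,1285 955,1354 946,1463 165,1393"
print(max_deviation(host_a, host_b))
```

A deviation of a pixel or two is harmless by itself, but it can be enough to turn a barely valid outline into a self-intersecting one.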

@stweil
Contributor Author

stweil commented Sep 8, 2020

> Unfortunately, I have no permissions for your mets.xml.

I am sorry. That's a known problem (see OCR-D/core#403). Access should work now.

@stweil
Contributor Author

stweil commented Sep 12, 2020

> And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone.

I have created FULLTEXT with a different OCR process in the meantime, so the script needs a slight update (either write to a different file group, remove the old FULLTEXT or simply omit that processor).
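
To avoid this kind of clash up front, a script could look at the file groups already present in mets.xml before choosing an output group. The following is only a sketch using the standard library: it assumes the usual METS namespace and `USE` attribute on `mets:fileGrp`, and the helper names and the inline sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Tiny hypothetical METS fragment standing in for a cloned workspace.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="MAX"/>
    <mets:fileGrp USE="FULLTEXT"/>
  </mets:fileSec>
</mets:mets>"""


def existing_file_groups(mets_xml):
    """Return the set of fileGrp USE values found in a METS document string."""
    root = ET.fromstring(mets_xml)
    return {
        grp.get("USE")
        for grp in root.iterfind(".//mets:fileGrp", METS_NS)
        if grp.get("USE")
    }


def pick_output_group(wanted, existing):
    """Return `wanted` if it is free, else the first free name with a numeric suffix."""
    if wanted not in existing:
        return wanted
    n = 2
    while f"{wanted}{n}" in existing:
        n += 1
    return f"{wanted}{n}"


groups = existing_file_groups(SAMPLE_METS)
print(pick_output_group("FULLTEXT", groups))
print(pick_output_group("OCR-D-OCR-TEXT", groups))
```

This covers only the "write to a different file group" option; removing the old group or dropping the processor are the other two.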

@stweil
Contributor Author

stweil commented Sep 13, 2020

@bertsky, I get the same error on another host with Debian bullseye and a local build of Python 3.7.9 for a different book using this script:

#!/bin/bash

set -x
set -e

export LANG=C.UTF-8

PPN=PPN1024726142
METS=http://gei-digital.gei.de/viewer/metsresolver?id=$PPN

date --iso-8601=seconds

time -p ocrd workspace --directory $PPN clone $METS

cd $PPN

time -p ocrd process \
  "olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
  "fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to \"page alto\"" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""

date --iso-8601=seconds

@bertsky
Collaborator

bertsky commented Sep 14, 2020

@stweil could you please repeat from tesserocr-segment-region onwards – after pulling #152 and OCR-D/ocrd_segment#43 (perhaps using --overwrite on the same workspace)?

@stweil
Contributor Author

stweil commented Sep 14, 2020

Here is the result from a fresh run:

10:45:23.889 INFO processor.TesserocrSegmentRegion - Detected region 'region0006': 174,1285 955,1356 946,1463 165,1392 (FLOWING_TEXT)
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200912/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 173, in process
    self._process_page(layout, page, page_image, page_coords, input_file.pageId)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 223, in _process_page
    polygon2 = polygon_for_parent(polygon, page)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 360, in polygon_for_parent
    interp = asPolygon(np.round(interp.exterior.coords))
NameError: name 'np' is not defined

@kba
Member

kba commented Sep 14, 2020

> NameError: name 'np' is not defined

Does

import numpy as np

fix that? Could be, I was too thorough in cleaning up imports in the last round of refactoring...

@bertsky
Collaborator

bertsky commented Sep 14, 2020

Sorry, I had forgotten to include that change in the commit.

@bertsky
Collaborator

bertsky commented Sep 14, 2020

But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)

@bertsky
Collaborator

bertsky commented Sep 14, 2020

> But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)

6bbe873 should suffice.
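
To illustrate the case described in the quote, consider a four-point "bow-tie": no simplification tolerance can help, because every point is a genuine corner, yet visiting the same points in a different order gives a valid outline. The sketch below is not necessarily what 6bbe873 does; it is a plain-Python illustration with hypothetical helper names, and sorting by angle around the centroid only works for roughly star-shaped outlines.

```python
import math


def _cross(o, a, b):
    """Cross product of (a - o) and (b - o); its sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def has_crossing(ring):
    """True if two non-adjacent edges of the closed ring properly cross."""
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex
            (a, b), (c, d) = edges[i], edges[j]
            if (_cross(a, b, c) * _cross(a, b, d) < 0
                    and _cross(c, d, a) * _cross(c, d, b) < 0):
                return True
    return False


def reorder_by_angle(ring):
    """Sort the points by their angle around the centroid of the point set."""
    cx = sum(x for x, _ in ring) / len(ring)
    cy = sum(y for _, y in ring) / len(ring)
    return sorted(ring, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


def repair_ring(ring):
    """Return the ring unchanged if it has no crossing, else try re-ordering.

    Raises ValueError if re-ordering does not remove the crossing.
    """
    if not has_crossing(ring):
        return ring
    candidate = reorder_by_angle(ring)
    if has_crossing(candidate):
        raise ValueError("re-ordering did not yield a simple ring")
    return candidate


# Hypothetical bow-tie: same corners as a square, but in a crossing order.
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
print(repair_ring(bowtie))
```

For outlines that are far from star-shaped, a fallback such as the convex hull of the points would be a safer last resort than angular sorting.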

@stweil
Contributor Author

stweil commented Sep 14, 2020

The workflow for PPN1024726142 now passes, nearly. There is a new problem when creating the ALTO files, which is caused by a negative x coordinate. See issue #153 for more details.

@stweil
Contributor Author

stweil commented Sep 15, 2020

This issue was fixed in the latest code.

@stweil stweil closed this as completed Sep 15, 2020
3 participants