PyPdfium2 produces very fragmented cells, with sub-word level boundaries, in many PDFs.
There is some logic that tries to merge them:
merged_text = "".join(cell.text for cell in group)
merged_bbox = BoundingBox(
    l=min(cell.rect.to_bounding_box().l for cell in group),
    t=min(cell.rect.to_bounding_box().t for cell in group),
    r=max(cell.rect.to_bounding_box().r for cell in group),
    b=max(cell.rect.to_bounding_box().b for cell in group),
)
While this works in most cases, the bounding boxes sometimes overlap and we end up with duplicated characters. We could instead call get_text_bounded on the newly computed BoundingBox:
merged_bbox = BoundingBox(
    l=min(cell.rect.to_bounding_box().l for cell in group),
    t=min(cell.rect.to_bounding_box().t for cell in group),
    r=max(cell.rect.to_bounding_box().r for cell in group),
    b=max(cell.rect.to_bounding_box().b for cell in group),
)
bbox = merged_bbox.to_bottom_left_origin(page_size.height)
merged_text = self.text_page.get_text_bounded(*bbox.as_tuple())
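To illustrate why re-querying the text from the merged box might avoid the duplication, here is a minimal toy model in plain Python. It does not call pypdfium2 or docling: `Box`, `Glyph`, `text_in` and `union` are hypothetical stand-ins, and the glyph coordinates are made up. It only assumes that a bounded text query returns each glyph at most once.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    # Top-left origin, so t < b (hypothetical stand-in, not docling's BoundingBox).
    l: float
    t: float
    r: float
    b: float

    def intersects(self, other: "Box") -> bool:
        # Strict inequalities: boxes that merely touch do not intersect.
        return (self.l < other.r and other.l < self.r
                and self.t < other.b and other.t < self.b)


@dataclass(frozen=True)
class Glyph:
    char: str
    box: Box


def text_in(glyphs: List[Glyph], region: Box) -> str:
    """Toy bounded text query: every glyph touching the region appears once."""
    hits = [g for g in glyphs if g.box.intersects(region)]
    return "".join(g.char for g in sorted(hits, key=lambda g: g.box.l))


def union(boxes: List[Box]) -> Box:
    """Smallest box enclosing all boxes, as in the merged_bbox computation."""
    return Box(
        l=min(b.l for b in boxes),
        t=min(b.t for b in boxes),
        r=max(b.r for b in boxes),
        b=max(b.b for b in boxes),
    )


# "offering" on one line, each glyph 10 units wide (made-up geometry).
glyphs = [Glyph(c, Box(10 * i, 0, 10 * i + 10, 12)) for i, c in enumerate("offering")]

# Two hypothetical cells whose boxes overlap on the two "f" glyphs.
cell_boxes = [Box(0, 0, 30, 12), Box(10, 0, 80, 12)]
cell_texts = [text_in(glyphs, b) for b in cell_boxes]

concatenated = "".join(cell_texts)                # joins per-cell text
requeried = text_in(glyphs, union(cell_boxes))    # one query on the merged box

print(cell_texts, concatenated, requeried)
```

In this model, concatenation repeats the glyphs that sit in the overlap, while a single query over the merged box does not. Whether the real backend behaves the same way for every PDF would still need to be checked against the attached file.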
Bug
loans-leases-with-images.pdf
...
Steps to reproduce
Parse the provided PDF (page 3, for example) and look for "offffering".
One example in the PDF is:
BBox1: off
BBox2: ffering
which leads to: offffering
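When reproducing this, it may help to flag which neighbouring cells overlap before they are merged. The sketch below is a hypothetical diagnostic, not part of docling: cells are simplified to `(text, left, right)` tuples, the coordinates are invented for illustration, and `overlapping_pairs` is an assumed helper name.

```python
from typing import List, Tuple

# (text, left, right) for each cell on one line; a simplification of real cells.
Cell = Tuple[str, float, float]


def overlapping_pairs(cells: List[Cell], tol: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (left_text, right_text, overlap_width) for neighbouring cells
    whose horizontal extents overlap by more than `tol`."""
    ordered = sorted(cells, key=lambda c: c[1])  # sort by left edge
    pairs = []
    for (a_text, _, a_right), (b_text, b_left, _) in zip(ordered, ordered[1:]):
        overlap = a_right - b_left
        if overlap > tol:
            pairs.append((a_text, b_text, overlap))
    return pairs


# Invented coordinates: the first two cells overlap, the third does not.
cells = [("off", 0.0, 30.0), ("ffering", 10.0, 80.0), ("next", 90.0, 130.0)]
flagged = overlapping_pairs(cells)
print(flagged)
```

A small positive `tol` could be used to ignore sub-pixel overlaps, if those turn out to be harmless in practice.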
...
Docling version
2.31
...
Python version
3.12
...