
Releases: docling-project/docling

v2.37.0

16 Jun 11:02

Feature

  • Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) (7d3302c)
  • Support xlsm files (#1520) (df14022)

Fix

  • Pptx line break and space handling (#1664) (f28d23c)
  • asciidoc: Set default size when missing in image directive (#1769) (b886e4d)
  • Handle NoneType error in MsPowerpointDocumentBackend (#1747) (7a275c7)
  • Prov for merged-elems (#1728) (6613b9e)
  • tesseract: Initialize df_osd to avoid uninitialized variable error (#1718) (e979750)
  • Allow custom torch_dtype in vlm models (#1735) (f7f3113)
  • Improve extraction from textboxes in Word docs (#1701) (9dbcb3d)
  • Add WEBP to the list of image file extensions (#1711) (a2b83fe)

Documentation

v2.36.1

04 Jun 11:43

Fix

Documentation

v2.36.0

03 Jun 13:54

Feature

v2.35.0

02 Jun 12:30

Feature

  • Add visualization of bbox on page with html export. (#1663) (b356b33)

Fix

  • Guess HTML content starting with script tag (#1673) (984cb13)
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665) (51d3450)
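
To illustrate the UnicodeDecodeError named in the fix above: 0xd0 is a UTF-8 lead byte that must be followed by a continuation byte in the range 0x80 to 0xBF, so input whose next byte falls outside that range fails at position 0. The sketch below reproduces the message and shows one defensive way to decode. The helper name `decode_text`, the latin-1 fallback, and the sample bytes are assumptions for illustration; they are not taken from the actual change in #1665.

```python
def decode_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to a single-byte codec (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot raise.
        return raw.decode(fallback)


# Hypothetical input: 0xd0 followed by 0xcf, which is not a continuation byte.
sample = b"\xd0\xcf\x11\xe0"

try:
    sample.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

decoded = decode_text(sample)
```

A fallback codec keeps processing alive but may produce mojibake, so detecting the real encoding first is usually preferable when the input is expected to be text.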

Documentation

v2.34.0

22 May 18:44

Feature

  • ocr: Auto-detect rotated pages in Tesseract (#1167) (45265bf)
  • Establish confidence estimation for document and pages (#1313) (9087524)

Fix

  • Fix ZeroDivisionError for cell_bbox.area() (#1636) (c2f595d)
  • integration: Update the Apify Actor integration (#1619) (14d4f5b)

v2.33.0

20 May 19:54

Feature

  • Add textbox content extraction in msword_backend (#1538) (12a0e64)

Fix

  • Fix issue with detecting docx files, and files with upper case extensions (#1609) (f4d9d41)
  • Load_from_doctags static usage (#1617) (0e00a26)
  • Incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) (f2e9c07)
  • pypdfium: Resolve overlapping text when merging bounding boxes (#1549) (98b5eeb)
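
The fix for files with upper case extensions (#1609) points at a common pitfall: comparing a file suffix against a lower-case table without normalizing it first. A minimal sketch of case-insensitive matching follows. The table `EXTENSION_TO_FORMAT` and the function `guess_format` are hypothetical names for illustration and do not reflect Docling's internal format detection.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table, for illustration only.
EXTENSION_TO_FORMAT = {
    ".docx": "docx",
    ".pptx": "pptx",
    ".pdf": "pdf",
    ".webp": "image",
}


def guess_format(filename: str) -> Optional[str]:
    """Return a format label for the file name, ignoring suffix case."""
    suffix = Path(filename).suffix.lower()  # ".DOCX" becomes ".docx"
    return EXTENSION_TO_FORMAT.get(suffix)
```

With this normalization, `guess_format("REPORT.DOCX")` and `guess_format("report.docx")` resolve to the same entry, and a name without a suffix returns `None`.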

v2.32.0

14 May 14:28

Feature

  • Improve parallelization for remote services API calls (#1548) (3a04f2a)
  • Support image/webp file type (#1415) (12dab0a)

Fix

  • ocr: Orig field in TesseractOcrCliModel as str (#1553) (9f8b479)
  • settings: Fix nested settings load via environment variables (#1551) (2efb7a7)

Documentation

  • Add advanced chunking & serialization example (#1589) (9f28abf)

v2.31.2

13 May 10:09

Fix

v2.31.1

12 May 09:44

Fix

  • Add smoldocling in download utils (#1577) (127e386)
  • HTML: Handle row spans in header rows (#1536) (776e7ec)
  • Mime error in document streams (#1523) (f1658ed)
  • Usage of hashlib for FIPS (#1512) (7c70573)
  • Guard against attribute errors in TesseractOcrModel __del__ (#1494) (4ab7e9d)
  • Enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496) (cc45396)
  • Updated the time-recorder label for reading order (#1490) (976e92e)
  • Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459) (94d66a0)

Documentation

v2.31.0

25 Apr 08:28

Feature

  • Add tutorial using Milvus and Docling for RAG pipeline (#1449) (a2fbbba)

Fix

  • html: Handle address, details, and summary tags (#1436) (ed20124)
  • Treat overflowing -v flags as DEBUG (#1419) (8012a3e)
  • codecov: Fix codecov argument and yaml file (#1399) (fa7fc9e)
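
The "-v flags" fix above (#1419) describes clamping: any verbosity count beyond the most verbose level is treated as DEBUG instead of failing. A minimal sketch of that idea with the standard `logging` module follows. The level ordering and the function name `level_for_verbosity` are assumptions for illustration, not the CLI's actual code.

```python
import logging

# Assumed mapping: no flag -> WARNING, -v -> INFO, -vv or more -> DEBUG.
_LEVELS = [logging.WARNING, logging.INFO, logging.DEBUG]


def level_for_verbosity(count: int) -> int:
    """Map a count of -v flags to a logging level, clamping out-of-range counts."""
    index = max(0, min(count, len(_LEVELS) - 1))
    return _LEVELS[index]
```

Clamping the index avoids an `IndexError` for inputs such as `-vvvv` while keeping the mapping table as the single place where levels are defined.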

Documentation