Add text_blocks: paragraph & list grouping of OCR lines #387
Merged
text_regions merges glyphs into lines, but nothing grouped those lines into paragraphs or detected lists; ocr/structure stops at flat rows. This PR adds two functions: group_paragraphs starts a new paragraph wherever the vertical gap between consecutive lines exceeds line_gap_factor × the median line height, and detect_lists recognises bullet and ordinal items by their leading marker and left indent.
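The gap rule described above can be sketched roughly as follows. This is an illustration, not the code merged in this PR: the function name `group_paragraphs_sketch`, the line-dict layout (`{"text": ..., "box": (x, y, w, h)}` with y growing downward), and the default `line_gap_factor=1.5` are all assumptions made for the example.

```python
from statistics import median


def group_paragraphs_sketch(lines, line_gap_factor=1.5):
    """Group line dicts into paragraphs by vertical whitespace.

    Assumed (hypothetical) line layout: {"text": str, "box": (x, y, w, h)},
    with y increasing down the page. The real line dicts may differ.
    """
    if not lines:
        return []
    # Process lines top to bottom.
    ordered = sorted(lines, key=lambda ln: ln["box"][1])
    # Threshold = line_gap_factor x the median line height.
    threshold = line_gap_factor * median(ln["box"][3] for ln in ordered)
    paragraphs = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        # Gap between the bottom of the previous line and the top of this one.
        gap = cur["box"][1] - (prev["box"][1] + prev["box"][3])
        if gap > threshold:
            paragraphs.append([cur])  # large gap: start a new paragraph
        else:
            paragraphs[-1].append(cur)
    return paragraphs


sample = [
    {"text": "First line", "box": (0, 0, 200, 10)},
    {"text": "second line", "box": (0, 12, 200, 10)},
    {"text": "New paragraph", "box": (0, 60, 200, 10)},
]
paragraphs = group_paragraphs_sketch(sample)
```

Using the median rather than the mean line height keeps a single oversized heading line from inflating the threshold.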
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 34 |
| Duplication | 0 |



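Summary
Adds group_paragraphs / detect_lists: groups OCR lines into paragraphs and detects bulleted / numbered lists. text_regions.find_text_lines merges glyphs into lines, but nothing grouped those lines into paragraphs or detected lists; ocr/structure stops at flat rows. group_paragraphs starts a new paragraph wherever the vertical gap exceeds line_gap_factor × the median line height (the standard whitespace-grouping heuristic); detect_lists recognises list items by their leading marker (• / - / * or 1. / 2) / a.) and left indent, and returns {text, marker, indent, box}. Pure standard library, operating on plain line dicts; reuses the box-bounds reader from table_grid_fill. Qt-free.

Five layers
utils/text_blocks/: group_paragraphs, detect_lists. AC_group_paragraphs + AC_detect_lists / MCP ac_group_paragraphs + ac_detect_lists / Script Builder (OCR).

Tests
test_text_blocks_batch.py: large-gap paragraph splitting, single paragraph, bullet / ordinal detection, indent recording, empty input, wiring + facade. 7 passed. ruff / bandit / radon / float-scan / Qt-free all clean.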
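
As a rough illustration of the marker-plus-indent detection that the summary attributes to detect_lists, here is a minimal standard-library sketch. It is not the merged implementation: the name `detect_lists_sketch`, the regular expression, the `(x, y, w, h)` box layout, taking `indent` from the left edge of the box, and stripping the marker from `text` are all assumptions for the example.

```python
import re

# Bullet markers (•, -, *) or ordinals such as "1.", "2)", "a.", "b)",
# followed by whitespace and the item body. Hypothetical pattern.
_MARKER = re.compile(
    r"^\s*(?P<marker>[•\-*]|\d+[.)]|[A-Za-z][.)])\s+(?P<body>.+)$"
)


def detect_lists_sketch(lines):
    """Return one dict per line that starts with a list marker.

    Assumed (hypothetical) line layout: {"text": str, "box": (x, y, w, h)}.
    """
    items = []
    for ln in lines:
        match = _MARKER.match(ln["text"])
        if match is None:
            continue  # not a list item
        items.append(
            {
                "text": match.group("body"),
                "marker": match.group("marker"),
                "indent": ln["box"][0],  # assumption: indent = left x of box
                "box": ln["box"],
            }
        )
    return items


sample = [
    {"text": "Shopping list", "box": (0, 0, 200, 10)},
    {"text": "• apples", "box": (20, 12, 180, 10)},
    {"text": "2) pears", "box": (20, 24, 180, 10)},
    {"text": "a. conference", "box": (40, 36, 160, 10)},
    {"text": "-5 degrees outside", "box": (0, 48, 200, 10)},
]
items = detect_lists_sketch(sample)
```

Requiring whitespace after the marker keeps lines such as "-5 degrees outside" or "e.g. ..." from being mistaken for list items.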