
feat(parser): improve PDF parsing and arXiv HTML extraction #42

Merged
chenhaot merged 7 commits into main from feat/table-figure-parsing on Mar 15, 2026

@joe32140 (Contributor) commented Mar 13, 2026

Summary

  • arXiv HTML: Extract tables inline as markdown and keep figure captions inline as text; drop image links from parser output to reduce noise
  • PDF: Replace the raw pymupdf fallback with pymupdf4llm + pymupdf-layout as the default parser for better reading order, dehyphenation, and table extraction
  • Review quality: Fix paragraph indexing for long table-like paragraphs so quote matching lands on the correct checklist/table section
  • Cleanup: Remove dead parser code and simplify some markdown helper work in the parser path

Validation

  • python -m compileall src tests
  • openaireview review examples/2602.18458v1.pdf --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6
  • openaireview review https://arxiv.org/abs/2602.18458 --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6
  • pytest (not available in this environment)

Notes

  • Image-aware multimodal HTML parsing was explored locally and intentionally dropped from this PR.
  • Untracked local example artifacts under examples/ are not part of the PR.

joe32140 and others added 4 commits March 12, 2026 21:07
- Add _tabular_to_markdown(): converts ltx_tabular elements to markdown
  table syntax with header separator row
- Add _figure_or_table_to_markdown(): converts ltx_figure/ltx_table
  elements to markdown — images as ![alt](absolute_url) with italic
  caption, tables as bold caption + markdown table
- In parse_arxiv_html: pre-process all figures/tables before the main
  extraction loop, replacing them with ltx_para marker divs in-place
  so they appear at the correct document position
- Use exact class matching (not substring) to avoid matching
  ltx_figure_panel and other nested subfigure elements
- Capture final URL after redirects for correct image URL resolution
- Remove ltx_caption from the element regex (captions are now included
  in their parent figure/table blocks)

viz: add responsive image and improved table styles
- .para img: block display, centered, max-width 100%, subtle border
- .para em: italic caption styling
- .para table: scrollable with overflow-x: auto, word-wrap in cells

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
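
The exact class matching and header separator row described in this commit can be illustrated with a minimal sketch. It is not the PR's code: `has_exact_class` and `rows_to_markdown` are hypothetical names, and the sketch assumes the table cells have already been extracted into lists of strings and that the first row is the header.

```python
from typing import List


def has_exact_class(class_attr: str, name: str) -> bool:
    """Return True only if `name` is one of the whitespace-separated class tokens."""
    # A substring test would also match "ltx_figure_panel" when looking for "ltx_figure".
    return name in class_attr.split()


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a markdown table, treating the first row as the header (assumption)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows and escape pipes so cell text cannot break the table layout.
    padded = [
        [cell.strip().replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    lines = ["| " + " | ".join(row) + " |" for row in padded]
    # Markdown needs a separator row directly after the header row.
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(has_exact_class("ltx_figure_panel ltx_align_center", "ltx_figure"))
    print(rows_to_markdown([["Model", "Acc"], ["A", "0.9"]]))
```

Splitting the class attribute into tokens is what keeps nested subfigure elements such as `ltx_figure_panel` from being treated as top-level figures.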
Replace raw pymupdf fallback with pymupdf4llm+layout for correct reading
order, dehyphenation in two-column layouts, GNN-based table extraction, and
cleaner LLM input. Adds _clean_pymupdf4llm_markdown() to strip noisy picture
placeholders and handle <br> separators. Removes dead _parse_pdf_pymupdf()
function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
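
As a rough illustration of the cleanup step, the sketch below drops picture-placeholder lines and replaces `<br>` tags with spaces. It makes two loud assumptions: that a placeholder looks like `**==> picture [W x H] intentionally omitted <==**`, and that turning `<br>` into a space is an acceptable way to handle it. The PR's actual `_clean_pymupdf4llm_markdown()` may do both differently, and `clean_markdown` is a hypothetical name.

```python
import re

# Assumed placeholder marker; the exact text emitted by the PDF converter may differ.
_PICTURE_MARK = "intentionally omitted"
_BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def clean_markdown(md: str) -> str:
    """Drop picture-placeholder lines and replace <br> tags with spaces."""
    kept = []
    for line in md.splitlines():
        # Cheap substring gate first, so strip() only runs on candidate lines.
        if _PICTURE_MARK in line and "picture" in line:
            stripped = line.strip()
            if stripped.startswith("**==>") and stripped.endswith("<==**"):
                continue  # skip the placeholder line entirely
        kept.append(_BR_RE.sub(" ", line))
    return "\n".join(kept)


if __name__ == "__main__":
    sample = (
        "Intro text\n"
        "**==> picture [10 x 20] intentionally omitted <==**\n"
        "| a<br>b | c |"
    )
    print(clean_markdown(sample))
```

Ordinary prose that happens to contain the marker words is kept, because the line must also start and end with the assumed placeholder delimiters.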
- _clean_pymupdf4llm_markdown: gate stripped computation behind fast
  substring check so it only runs on picture-placeholder lines
- _extract_title_from_markdown: merge two split/loop passes into one

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
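
Merging two passes into one, as this commit does for title extraction, might look like the sketch below. The rule used here (prefer the first level-1 heading, otherwise fall back to the first non-empty line) is an assumption for illustration, not the PR's confirmed behavior, and `extract_title` is a hypothetical name.

```python
from typing import Optional


def extract_title(md: str) -> Optional[str]:
    """Return the first level-1 heading, else the first non-empty line, in one pass."""
    first_nonempty = None
    for line in md.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("# "):
            # A heading wins immediately; no second loop over the lines is needed.
            return text[2:].strip()
        if first_nonempty is None:
            first_nonempty = text  # remember the fallback while still scanning
    return first_nonempty


if __name__ == "__main__":
    print(extract_title("\nSome preamble\n# My Paper\nBody"))
```

Tracking the fallback during the same loop avoids splitting and scanning the markdown twice.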
joe32140 requested a review from dangng2004 on March 13, 2026 17:18
joe32140 changed the title from "feat(parser): rich PDF and arXiv HTML parsing with tables and figures" to "feat(parser): improve PDF parsing and arXiv HTML extraction" on Mar 14, 2026
@joe32140 (Contributor, Author) commented

[image]

The parser now handles tables better, and review comments correctly refer to the right table.

chenhaot merged commit 4955daa into main on Mar 15, 2026