feat(parser): improve PDF parsing and arXiv HTML extraction by joe32140 · Pull Request #42 · ChicagoHAI/OpenAIReview

joe32140 · 2026-03-13T17:18:22Z

Summary

arXiv HTML: Extract tables inline as markdown and keep figure captions inline as text; drop image links from parser output to reduce noise
PDF: Replace the raw pymupdf fallback with pymupdf4llm + pymupdf-layout as the default parser for better reading order, dehyphenation, and table extraction
Review quality: Fix paragraph indexing for long table-like paragraphs so quote matching lands on the correct checklist/table section
Cleanup: Remove dead parser code and simplify some markdown helper work in the parser path
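
To make the "Review quality" item concrete, here is a minimal sketch of mapping a reviewer quote back to a paragraph index. It is not the code in this PR: `find_paragraph_index` and `_normalize` are hypothetical names, and the whitespace normalization plus token-overlap fallback is an assumed strategy. It only illustrates why a long multi-line table paragraph needs normalized matching before a quote can land on it.

```python
def _normalize(text):
    """Collapse all whitespace runs (including newlines) and lowercase."""
    return " ".join(text.split()).lower()


def find_paragraph_index(paragraphs, quote):
    """Return the index of the paragraph that best matches `quote`, or None.

    Hypothetical helper: exact normalized substring match first, then a
    token-overlap fallback for quotes the model paraphrased slightly.
    """
    needle = _normalize(quote)
    if not needle:
        return None

    normalized = [_normalize(p) for p in paragraphs]

    # Pass 1: exact substring match on normalized text. This is what lets a
    # quote taken from one row match a table paragraph spanning many lines.
    for i, para in enumerate(normalized):
        if needle in para:
            return i

    # Pass 2: pick the paragraph sharing the most tokens with the quote.
    tokens = set(needle.split())
    best_index, best_score = None, 0
    for i, para in enumerate(normalized):
        score = len(tokens & set(para.split()))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


paragraphs = [
    "Intro text about the method.",
    "| Item | Answer |\n| --- | --- |\n| Limitations   | Yes |\n| Code released | No |",
    "Conclusion.",
]
table_index = find_paragraph_index(paragraphs, "Limitations | Yes")
```

Here `table_index` points at the table paragraph even though the quote's spacing differs from the source row.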

Validation

python -m compileall src tests
openaireview review examples/2602.18458v1.pdf --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6
openaireview review https://arxiv.org/abs/2602.18458 --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6
pytest (not available in this environment)

Notes

Image-aware multimodal HTML parsing was explored locally and intentionally dropped from this PR.
Untracked local example artifacts under examples/ are not part of the PR.

- Add _tabular_to_markdown(): converts ltx_tabular elements to markdown table syntax with header separator row
- Add _figure_or_table_to_markdown(): converts ltx_figure/ltx_table elements to markdown: images as ![alt](absolute_url) with italic caption, tables as bold caption + markdown table
- In parse_arxiv_html: pre-process all figures/tables before the main extraction loop, replacing them with ltx_para marker divs in-place so they appear at the correct document position
- Use exact class matching (not substring) to avoid matching ltx_figure_panel and other nested subfigure elements
- Capture final URL after redirects for correct image URL resolution
- Remove ltx_caption from the element regex (captions are now included in their parent figure/table blocks)

viz: add responsive image and improved table styles

- .para img: block display, centered, max-width 100%, subtle border
- .para em: italic caption styling
- .para table: scrollable with overflow-x: auto, word-wrap in cells

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
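
As a rough illustration of the table conversion and the exact class matching described in this commit message, the sketch below uses only the standard library. It is not the PR's `_tabular_to_markdown()`: the names `has_exact_class`, `TabularParser`, and `tabular_to_markdown` are hypothetical, the first row is assumed to be the header, and nested tables and row/column spans are not handled.

```python
from html.parser import HTMLParser


def has_exact_class(class_attr, name):
    """True only if `name` is one of the whitespace-separated class tokens.

    A substring test would wrongly treat `ltx_figure_panel` as `ltx_figure`.
    """
    return name in (class_attr or "").split()


class TabularParser(HTMLParser):
    """Collect cell text from <table> elements carrying the ltx_tabular class."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text fragments of the current cell, or None
        self._depth = 0     # > 0 while inside a matching table

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get("class")
        if tag == "table" and (self._depth or has_exact_class(class_attr, "ltx_tabular")):
            self._depth += 1
        elif self._depth and tag == "tr":
            self._row = []
        elif self._depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            if self._row is not None:
                # Escape pipes so cell text cannot break the table syntax.
                self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._depth:
            self._depth -= 1


def tabular_to_markdown(html):
    """Render matching tables as one markdown table with a header separator row."""
    parser = TabularParser()
    parser.feed(html)
    rows = parser.rows
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


sample = (
    '<table class="ltx_tabular ltx_align_middle">'
    "<tr><th>Model</th><th>Acc</th></tr>"
    "<tr><td>A</td><td>0.9</td></tr>"
    "</table>"
)
markdown_table = tabular_to_markdown(sample)
```

Splitting the class attribute into tokens is the important design choice here: it keeps nested subfigure panels from being picked up as top-level figures or tables.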

Replace raw pymupdf fallback with pymupdf4llm+layout for correct reading order, 2-column hyphenation, GNN-based table extraction, and cleaner LLM input. Adds _clean_pymupdf4llm_markdown() to strip noisy picture placeholders and handle <br> separators. Removes dead _parse_pdf_pymupdf() function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- _clean_pymupdf4llm_markdown: gate stripped computation behind fast substring check so it only runs on picture-placeholder lines
- _extract_title_from_markdown: merge two split/loop passes into one

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
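
The cleanup step described in the commit messages above can be sketched roughly as follows. This is not the PR's `_clean_pymupdf4llm_markdown()`: `clean_markdown` is a hypothetical name, the placeholder pattern (`==> picture ... <==`, optionally wrapped in bold markers) is an assumption about what pymupdf4llm emits, and replacing `<br>` with a space is only one plausible way to "handle" it.

```python
import re

# Assumed placeholder shape; adjust to what the installed pymupdf4llm emits.
_PICTURE_RE = re.compile(r"^\**==>\s*picture\b.*<==\**$", re.IGNORECASE)
_BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def clean_markdown(markdown):
    """Drop picture-placeholder lines and flatten <br> separators."""
    kept = []
    for line in markdown.splitlines():
        # Fast substring gate: strip() and the regex only run on lines that
        # could be placeholders, so ordinary lines cost one `in` check.
        if "==>" in line and _PICTURE_RE.match(line.strip()):
            continue
        # Assumption: a <br> inside a table cell becomes a single space.
        kept.append(_BR_RE.sub(" ", line))
    return "\n".join(kept)


raw = (
    "# Title\n"
    "**==> picture [200 x 100] intentionally omitted <==**\n"
    "| a<br>b | c |\n"
    "text ==> arrow"
)
cleaned = clean_markdown(raw)
```

The last sample line contains `==>` but is not a placeholder, so it passes the gate, fails the regex, and is kept.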

joe32140 · 2026-03-14T00:09:02Z

It now parses tables better, and review comments correctly refer to the right table.

joe32140 and others added 4 commits March 12, 2026 21:07

chore:update slug

34eb82a

joe32140 requested a review from dangng2004 March 13, 2026 17:18

joe32140 added 2 commits March 13, 2026 16:31

Fix comment indexing for long table paragraphs

4cee03a

Refine arXiv HTML parsing without image links

7251ed5

joe32140 changed the title ~~feat(parser): rich PDF and arXiv HTML parsing with tables and figures~~ feat(parser): improve PDF parsing and arXiv HTML extraction Mar 14, 2026

Fix malformed zero-shot response parsing

58f18fd

chenhaot approved these changes Mar 15, 2026

View reviewed changes

chenhaot merged commit 4955daa into main Mar 15, 2026

chenhaot merged 7 commits into main from feat/table-figure-parsing
