feat(parser): improve PDF parsing and arXiv HTML extraction #42
Merged
- Add _tabular_to_markdown(): converts ltx_tabular elements to markdown table syntax with a header separator row
- Add _figure_or_table_to_markdown(): converts ltx_figure/ltx_table elements to markdown: images as markdown image syntax with an italic caption, tables as a bold caption plus a markdown table
- In parse_arxiv_html: pre-process all figures/tables before the main extraction loop, replacing them in-place with ltx_para marker divs so they appear at the correct document position
- Use exact class matching (not substring matching) to avoid matching ltx_figure_panel and other nested subfigure elements
- Capture the final URL after redirects for correct image URL resolution
- Remove ltx_caption from the element regex (captions are now included in their parent figure/table blocks)

viz: add responsive image and improved table styles
- .para img: block display, centered, max-width 100%, subtle border
- .para em: italic caption styling
- .para table: scrollable with overflow-x: auto, word-wrap in cells

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
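The table conversion and the exact class matching described in this commit can be illustrated with a minimal, stdlib-only sketch. The names `rows_to_markdown` and `has_exact_class` are hypothetical and are not the PR's functions; the sketch assumes the table has already been reduced to a list of rows of cell strings, whereas the PR works on parsed HTML elements.

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings) as a markdown table.

    The first row is treated as the header and is followed by a
    separator row, which markdown requires for a table to render.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cell text cannot break the column layout,
        # and pad short rows so every line has the same column count.
        cells = [cell.replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator] + [fmt(row) for row in body])


def has_exact_class(class_attr, name):
    """Return True only if `name` is one of the whitespace-separated classes.

    A substring test ("ltx_figure" in class_attr) would also accept
    "ltx_figure_panel", which is the false match the commit avoids.
    """
    return name in class_attr.split()
```

For example, `has_exact_class("ltx_figure_panel", "ltx_figure")` is false, while the equivalent substring check would be true.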
Replace raw pymupdf fallback with pymupdf4llm+layout for correct reading order, 2-column hyphenation, GNN-based table extraction, and cleaner LLM input. Adds _clean_pymupdf4llm_markdown() to strip noisy picture placeholders and handle <br> separators. Removes dead _parse_pdf_pymupdf() function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
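A cleanup pass of the kind this commit describes might look like the sketch below. It is not the PR's `_clean_pymupdf4llm_markdown()`: the function name `clean_markdown`, the placeholder pattern, and the choice to turn `<br>` into a space are all assumptions made for illustration, and the real placeholder text emitted by pymupdf4llm may differ.

```python
import re

# Assumed shape of a picture placeholder line; adjust to the real output.
PICTURE_MARK = "==> picture"
PICTURE_LINE = re.compile(
    r"^\**==> picture \[.*?\] intentionally omitted <==\**$"
)


def clean_markdown(md):
    """Drop picture-placeholder lines and flatten <br> separators."""
    kept = []
    for line in md.splitlines():
        # Cheap substring test first, so strip() and the regex only run
        # on the few lines that could be placeholders.
        if PICTURE_MARK in line and PICTURE_LINE.match(line.strip()):
            continue
        # Assumption: <br> marks an in-cell line break; a space keeps
        # the table row on a single markdown line.
        kept.append(line.replace("<br>", " "))
    return "\n".join(kept)
```

Lines that merely mention the marker text in running prose are kept, because the regex must match the whole stripped line.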
- _clean_pymupdf4llm_markdown: gate stripped computation behind fast substring check so it only runs on picture-placeholder lines
- _extract_title_from_markdown: merge two split/loop passes into one

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
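The idea of merging two passes into one can be sketched as follows. The commit does not say what the two passes in `_extract_title_from_markdown` did, so this assumes, purely for illustration, that one pass looked for a top-level `# ` heading and a second pass fell back to the first non-empty line; `extract_title` is a hypothetical name.

```python
def extract_title(md):
    """Find a title in one pass over the markdown lines.

    Assumed policy: prefer the first top-level "# " heading; otherwise
    fall back to the first non-empty line with heading marks removed.
    """
    fallback = ""
    for line in md.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("# "):
            # Preferred result found, so stop scanning immediately.
            return text[2:].strip()
        if not fallback:
            # Remember the first non-empty line in the same loop rather
            # than splitting and scanning the text a second time.
            fallback = text.lstrip("#").strip()
    return fallback
```

The text is split once and the fallback is tracked during the same loop, which is the general shape of the refactor named in the commit.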
chenhaot approved these changes on Mar 15, 2026

Summary
- Replace the `pymupdf` fallback with `pymupdf4llm` + `pymupdf-layout` as the default parser for better reading order, dehyphenation, and table extraction

Validation
- `python -m compileall src tests`
- `openaireview review examples/2602.18458v1.pdf --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6`
- `openaireview review https://arxiv.org/abs/2602.18458 --method zero_shot --provider openrouter --model anthropic/claude-sonnet-4-6`
- `pytest` (not available in this environment)

Notes
- `examples/` are not part of the PR.