docs: add VLM long-document understanding dev note and recipes#579

Merged
nabinchha merged 3 commits into main from nmulepati/docs/vlm-long-document-understanding-dev-note
Apr 28, 2026

Conversation

@nabinchha
Contributor

📋 Summary

Adds a developer note and 9 runnable recipe scripts documenting how we generated ~11.4M synthetic visual QA pairs with Data Designer to improve long-document visual reasoning in Nemotron-3-Nano-Omni-30B-A3B (MMLongBench-Doc: 26% → 57.5%).

🔗 Related Issue

N/A

🔄 Changes

✨ Added

  • Developer note blog post (docs/devnotes/posts/vlm-long-document-understanding.md) covering the iterative pipeline development process, evaluation-driven design, and lessons learned
  • 4 pipeline architecture/hero images in docs/devnotes/posts/assets/vlm-long-document-understanding/
  • 9 self-contained recipe scripts in docs/assets/recipes/vlm_long_doc/ (01 through 09): seed prep, Nemotron-Parse OCR, text QA, page classification, visual QA, single-page QA, multi-page windowed QA, whole-document QA, and frontier judge filtering
  • Recipe documentation pages in docs/recipes/vlm_long_doc/ with download links
  • Recipe card on the main recipes page (docs/recipes/cards.md)
  • Navigation entries in mkdocs.yml for both the dev note and recipe pages
  • Author entries in docs/devnotes/.authors.yml

🧪 Testing

  • N/A — documentation and recipe scripts only; no testable library code changed

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A

Made with Cursor

@nabinchha nabinchha requested a review from a team as a code owner April 28, 2026 15:56
@github-actions
Contributor

Docs preview: https://5a7383c0.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@nabinchha nabinchha merged commit 7c5a722 into main Apr 28, 2026
52 checks passed
@greptile-apps
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR adds a developer note and 9 self-contained recipe scripts documenting how ~11.4M synthetic visual QA pairs were generated with Data Designer to improve long-document VLM reasoning. The documentation and recipe pipeline are well-structured, but two logic bugs were found in the recipe scripts.

  • 01-seed-dataset-preparation.py: adaptive_window_size uses strict < comparisons at every boundary, so documents with exactly 20, 30, 40, 50, or 60 pages silently fall through to the default window size of 2 instead of the intended 4–7.
  • 07-multi-page-windowed-qa-sdg.py: In _inference_params, the non-reasoning branch for Qwen3.5-122B-A10B populates extra_body with temperature=0.7/top_p=0.8 but the outer temperature/top_p variables remain at 1.0/0.95, so ChatCompletionInferenceParams receives the wrong values.
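The second finding is an instance of a general parameter-shadowing hazard: sampling values are written into `extra_body` but the variables actually passed to the params object are never updated. A minimal sketch of the defensive pattern — hypothetical `inference_params` helper, not the recipe's actual `_inference_params` — that derives both from one source of truth so they cannot drift:

```python
def inference_params(reasoning: bool) -> dict:
    """Build sampling params, keeping top-level values and extra_body in sync.

    Hypothetical sketch mirroring the structure described in the review;
    the shipped recipe's implementation may differ.
    """
    if reasoning:
        temperature, top_p = 1.0, 0.95
        extra_body = {"top_k": 20}
    else:
        # Assign the outer variables first, then build extra_body from them,
        # so the advertised values always match what inference actually uses.
        temperature, top_p = 0.7, 0.8
        extra_body = {"temperature": temperature, "top_p": top_p, "top_k": 20}
    return {"temperature": temperature, "top_p": top_p, "extra_body": extra_body}
```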

Confidence Score: 3/5

Two P1 logic bugs in recipe scripts should be fixed before merging.

Two P1 findings are present: a boundary gap that silently produces wrong window sizes in the seed prep script, and mismatched inference parameters in the multi-page recipe. Both affect correctness of generated data when the scripts are run as documented.

docs/assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py (window size boundary) and docs/assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py (inference params mismatch).

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py | Seed prep script with a boundary gap in adaptive_window_size — docs with exactly 20, 30, 40, 50, or 60 pages silently get window size 2 instead of 4–7. |
| docs/assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py | Multi-page windowed QA recipe; non-reasoning branch for Qwen3.5-122B-A10B sets temperature=0.7/top_p=0.8 in extra_body but passes temperature=1.0/top_p=0.95 to ChatCompletionInferenceParams. |
| docs/assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py | Nemotron-Parse OCR pipeline; regex-based bbox parsing and custom column generator look correct. |
| docs/assets/recipes/vlm_long_doc/05-visual-qa-sdg.py | Single-page visual QA pipeline with relevance and correctness judges; no logic issues found. |
| docs/assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py | Frontier judge with 5-rubric scoring; weighted composite score computation and config wiring are correct. |
| docs/assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py | Whole-document QA recipe targeting full-document multi-page reasoning; no issues found. |
| docs/devnotes/posts/vlm-long-document-understanding.md | Developer note blog post; content and structure look accurate. |
| mkdocs.yml | Navigation entries for dev note and recipe pages added correctly. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[01-seed-dataset-preparation.py\nDownload PDFs → render pages → parquet] --> B[seed_per_page.parquet]
    A --> C[seed_windowed.parquet]
    A --> D[seed_whole_document.parquet]

    B --> E[02-nemotron-parse-ocr-sdg.py\nOCR → transcribed_texts]
    B --> F[03-text-qa-sdg.py\nText QA]
    B --> G[04-page-classification-sdg.py\nPage classification]

    G --> H[05-visual-qa-sdg.py\nSingle-page Visual QA]
    B --> I[06-single-page-qa-sdg.py\nAnchored Single-page QA]

    C --> J[07-multi-page-windowed-qa-sdg.py\nMulti-page windowed QA]
    D --> K[08-whole-document-qa-sdg.py\nWhole-document QA]

    H --> L[09-frontier-judge-sdg.py\nFrontier model judge\n5-rubric scoring + weighted composite]
    I --> L
    J --> L
    K --> L
    F --> L
```

Reviews (1): Last reviewed commit: "added links"

Comment on lines +113 to +125

```python
    if n_pages > 10 and n_pages < 20:
        return 3
    elif n_pages > 20 and n_pages < 30:
        return 4
    elif n_pages > 30 and n_pages < 40:
        return 5
    elif n_pages > 40 and n_pages < 50:
        return 6
    elif n_pages > 50 and n_pages < 60:
        return 7
    elif n_pages > 60:
        return 8
    return 2
```

P1 Boundary gap in adaptive_window_size silently returns wrong window size

All of the conditions use strict > and <, so documents with exactly 20, 30, 40, 50, or 60 pages fall through every branch and return the default 2 instead of the expected 4, 5, 6, or 7. For example, a 20-page document gets a window of 2 instead of 4, causing the windowed seed to produce far smaller windows than intended.

Suggested change

```python
    if n_pages > 10 and n_pages <= 20:
        return 3
    elif n_pages > 20 and n_pages <= 30:
        return 4
    elif n_pages > 30 and n_pages <= 40:
        return 5
    elif n_pages > 40 and n_pages <= 50:
        return 6
    elif n_pages > 50 and n_pages <= 60:
        return 7
    elif n_pages > 60:
        return 8
    return 2
```

Comment on lines +96 to +107

```python
            top_p = 0.95
        else:
            extra_body = {
                "temperature": 0.7,
                "top_p": 0.8,
                "top_k": 20,
                "min_p": 0.0,
                "presence_penalty": 1.5,
                "repetition_penalty": 1.0,
            }
            temperature = 1.0
            top_p = 0.95
```

P1 Non-reasoning branch sets mismatched temperature/top_p for Qwen3.5-122B-A10B

In the else (non-reasoning) branch for Qwen/Qwen3.5-122B-A10B, extra_body is populated with temperature=0.7, top_p=0.8, but the outer variables temperature and top_p are left at 1.0 and 0.95 (copied from the reasoning branch). ChatCompletionInferenceParams receives the outer variables, so the actual inference runs at temperature=1.0/top_p=0.95 — the non-reasoning settings in extra_body have no effect.

Suggested change

```python
        else:
            extra_body = {
                "temperature": 0.7,
                "top_p": 0.8,
                "top_k": 20,
                "min_p": 0.0,
                "presence_penalty": 1.5,
                "repetition_penalty": 1.0,
            }
            temperature = 0.7
            top_p = 0.8
```

@github-actions
Contributor

Code Review: PR #579 — docs: add VLM long-document understanding dev note and recipes

Summary

This PR is docs-only: it adds a substantial dev-note blog post (docs/devnotes/posts/vlm-long-document-understanding.md, ~600 lines) narrating how the team generated ~11.4M synthetic visual QA pairs to push Nemotron-3-Nano-Omni-30B-A3B from 26% → 59% on MMLongBench-Doc, plus nine runnable uv-scripted recipes (01–09) covering seed prep, Nemotron-Parse OCR, text QA, page classification, visual QA, single-page QA, multi-page windowed QA, whole-document QA, and a frontier-judge filter. Supporting: 4 hero/pipeline images, 9 thin recipe wrapper pages, a card on docs/recipes/cards.md, and mkdocs.yml nav entries. No library code is touched.

Overall this is a high-quality, well-organized contribution: the dev note is genuinely useful (reads like a lessons-learned write-up rather than a press release), the recipes are self-contained with clear prerequisites and vLLM launch examples, and the SPDX/copyright headers and from __future__ import annotations conventions are consistent with the project style. The review focuses on a handful of accuracy and consistency issues.

Findings

Correctness / Accuracy

  • Results-table formatting errors (docs/devnotes/posts/vlm-long-document-understanding.md, lines ~534–540). Several Jan 14 cells look like they lost a digit:

    • layout: .3150 should be 31.50
    • table: .2634 should be 26.34
    • chart: .3005 should be 30.05
    • image: .2733 should be 27.33; and Jan 29: 3.221 should be 32.21
      These render as leading-zero decimals that break the column's pattern and make the trend chart hard to interpret. Worth a careful second pass on the entire table before publication since it's the centerpiece of the post.
  • Boundary gap in adaptive_window_size (01-seed-dataset-preparation.py:112–131). The ladder uses strict > / < on both sides, so exact boundary page counts (10, 20, 30, 40, 50, 60) all fall through to the default of 2 rather than the surrounding tier. For example, a 20-page document gets window size 2, but a 19-page and a 21-page doc get 3 and 4 respectively. Easy fix: switch to 10 <= n_pages < 20, etc. (or use a small tier list and bisect).

  • Dev-note code snippets diverge from shipped recipes. The embedded snippets in the devnote use column_name="png_path" with ModalityDataType.URL, but every shipped recipe uses column_name="png_images_base64" with ModalityDataType.BASE64. The devnote itself later explains that the team moved away from embedded base64 to file paths because base64 caused 10+ minute DuckDB stalls — so readers who copy the recipes as-is onto a production seed will hit exactly the failure mode the post warns about. Two options:

    1. Add a one-line note at the top of each recipe's docstring: "This recipe expects base64 images in the seed for portability; for production-scale runs, switch png_images_base64 → a local file-path column with ModalityDataType.URL and start vLLM with --allowed-local-media-path."
    2. Or flip the recipes to URL-mode by default and document the base64 path as the fallback. Option 1 is less invasive.
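The tier-list-and-bisect alternative mentioned above can be sketched as follows. This is a hypothetical rewrite, using the thresholds implied by the original ladder and the convention (matching the suggested fix) that boundary counts belong to the lower tier:

```python
import bisect

# Inclusive upper bound of each tier; _WINDOWS has one extra entry for the
# open-ended top tier (>60 pages). Thresholds taken from the original ladder.
_BOUNDS = [10, 20, 30, 40, 50, 60]
_WINDOWS = [2, 3, 4, 5, 6, 7, 8]

def adaptive_window_size(n_pages: int) -> int:
    """Map a page count to a window size with no boundary gaps.

    bisect_left returns the first tier whose upper bound is >= n_pages,
    so every page count, including the exact boundaries, lands in a tier.
    """
    return _WINDOWS[bisect.bisect_left(_BOUNDS, n_pages)]
```

Because the mapping is a lookup rather than a chain of hand-written comparisons, there is no branch to fall through, and adding a tier is a one-line change.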

Consistency / Style

  • os._exit(0) in 01-seed-dataset-preparation.py:311. The comment explains this is to avoid hanging on background threads from datasets/fsspec, which is a reasonable workaround, but it skips interpreter cleanup (atexit, finalizers, buffered I/O). Worth at least setting a non-zero code when per_page_rows was empty (currently the function returns before reaching os._exit, so the process still exits 0 on the "no documents processed" error path at line 292).

  • Recipe wrapper pages under docs/recipes/vlm_long_doc/*.md are minimal — just a download button and a --8<-- include. This matches some existing recipes (e.g., plugin_development) but is thinner than others (e.g., mcp_and_tooluse), and means each wrapper page is effectively a duplicate of the devnote's "Try For Yourself" table. Not blocking, but a 2-3 sentence intro per page ("this recipe is stage N of the pipeline, inputs expected / outputs produced") would make them navigable on their own.

  • Recipe card (docs/recipes/cards.md:5080). Minor: "Nemotron-3-Nano-Omni-30B-A3B's training recipe" is possessive-of-a-possessive; consider "the Nemotron-3-Nano-Omni-30B-A3B training recipe."

  • mkdocs.yml ordering for the new devnote entry (line 5231) correctly follows the "most recent first" comment at the top of the Dev Notes section.
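The `os._exit` concern can be addressed without giving up the hang workaround. A hedged sketch (the `hard_exit` helper name and injectable `_exit` parameter are hypothetical, introduced here to keep the pattern testable):

```python
import os
import sys

def hard_exit(code: int = 0, _exit=os._exit) -> None:
    """Flush buffered output, then terminate the process immediately.

    os._exit skips atexit handlers and daemon-thread joins -- the behavior
    the recipe relies on to avoid hanging on datasets/fsspec background
    threads -- but it also skips flushing buffered stdout/stderr. Flushing
    explicitly first, and passing a non-zero code on error paths, keeps the
    workaround from swallowing output or masking failures.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    _exit(code)

# Usage sketch for the "no documents processed" error path:
# if not per_page_rows:
#     print("error: no documents processed", file=sys.stderr)
#     hard_exit(1)
```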

Security

  • 09-frontier-judge-sdg.py correctly uses the api_key=api_key_env pattern, passing the env-var name as a string — the engine's EnvironmentResolver resolves it via os.environ[secret], so raw keys never land in code or logs. Good.
  • 01-seed-dataset-preparation.py sends User-Agent: Mozilla/5.0 to arbitrary PDF URLs from FinePDFs. This is fine for an example recipe that a user runs locally with their own output dir; just worth noting that a 20s timeout + unvalidated-URL fetch loop is not something to run on a shared runner without scoping.
  • No secrets, keys, or tokens are leaked in the diff.

Test Coverage

N/A — documentation and recipe scripts only. The recipes depend on private/gated vLLM deployments and frontier endpoints, so end-to-end CI is not practical. A lightweight import/syntax-smoke test (python -c "import ast; ast.parse(open(p).read())" per recipe) would at least catch regressions from future Data Designer API breaks, but that's out of scope for this PR.

Verdict

Approve with nits. The content is solid and the recipes follow project conventions. The only item I'd gate on is fixing the malformed cells in the MMLongBench-Doc results table (leading-zero typos like .3150/.2634/.3005/.2733/3.221) — they're prominent and misleading. The adaptive_window_size boundary gap and the base64-vs-file-path consistency note are worth addressing before readers start running the recipes but are fixable post-merge. Everything else is polish.
