docs: add VLM long-document understanding dev note and recipes#579

Merged
nabinchha merged 3 commits into main from nmulepati/docs/vlm-long-document-understanding-dev-note
Apr 28, 2026

Conversation

@nabinchha
Contributor

📋 Summary

Adds a developer note and 9 runnable recipe scripts documenting how we generated ~11.4M synthetic visual QA pairs with Data Designer to improve long-document visual reasoning in Nemotron-3-Nano-Omni-30B-A3B (MMLongBench-Doc: 26% → 57.5%).

🔗 Related Issue

N/A

🔄 Changes

✨ Added

  • Developer note blog post (docs/devnotes/posts/vlm-long-document-understanding.md) covering the iterative pipeline development process, evaluation-driven design, and lessons learned
  • 4 pipeline architecture/hero images in docs/devnotes/posts/assets/vlm-long-document-understanding/
  • 9 self-contained recipe scripts in docs/assets/recipes/vlm_long_doc/ (01 through 09): seed prep, Nemotron-Parse OCR, text QA, page classification, visual QA, single-page QA, multi-page windowed QA, whole-document QA, and frontier judge filtering
  • Recipe documentation pages in docs/recipes/vlm_long_doc/ with download links
  • Recipe card on the main recipes page (docs/recipes/cards.md)
  • Navigation entries in mkdocs.yml for both the dev note and recipe pages
  • Author entries in docs/devnotes/.authors.yml

🧪 Testing

  • N/A — documentation and recipe scripts only; no testable library code changed

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A

Made with Cursor

@nabinchha nabinchha requested a review from a team as a code owner April 28, 2026 15:56
@github-actions
Contributor

Docs preview: https://5a7383c0.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@nabinchha nabinchha merged commit 7c5a722 into main Apr 28, 2026
52 checks passed
@greptile-apps
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR adds a developer note and 9 self-contained recipe scripts documenting how ~11.4M synthetic visual QA pairs were generated with Data Designer to improve long-document VLM reasoning. The documentation and recipe pipeline are well-structured, but two logic bugs were found in the recipe scripts.

  • 01-seed-dataset-preparation.py: adaptive_window_size uses strict < comparisons at every boundary, so documents with exactly 20, 30, 40, 50, or 60 pages silently fall through to the default window size of 2 instead of the intended 4–7.
  • 07-multi-page-windowed-qa-sdg.py: In _inference_params, the non-reasoning branch for Qwen3.5-122B-A10B populates extra_body with temperature=0.7/top_p=0.8 but the outer temperature/top_p variables remain at 1.0/0.95, so ChatCompletionInferenceParams receives the wrong values.
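The second finding is an instance of a general parameter-shadowing hazard: sampling values are written into `extra_body` but the variables actually passed to the params object are never updated. A minimal sketch of the defensive pattern — hypothetical `inference_params` helper, not the recipe's actual `_inference_params` — that derives both from one source of truth so they cannot drift:

```python
def inference_params(reasoning: bool) -> dict:
    """Build sampling params, keeping top-level values and extra_body in sync.

    Hypothetical sketch mirroring the structure described in the review;
    the shipped recipe's implementation may differ.
    """
    if reasoning:
        temperature, top_p = 1.0, 0.95
        extra_body = {"top_k": 20}
    else:
        # Assign the outer variables first, then build extra_body from them,
        # so the advertised values always match what inference actually uses.
        temperature, top_p = 0.7, 0.8
        extra_body = {"temperature": temperature, "top_p": top_p, "top_k": 20}
    return {"temperature": temperature, "top_p": top_p, "extra_body": extra_body}
```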

Confidence Score: 3/5

Two P1 logic bugs in recipe scripts should be fixed before merging.

Two P1 findings are present: a boundary gap that silently produces wrong window sizes in the seed prep script, and mismatched inference parameters in the multi-page recipe. Both affect correctness of generated data when the scripts are run as documented.

docs/assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py (window size boundary) and docs/assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py (inference params mismatch).

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py | Seed prep script with a boundary gap in adaptive_window_size — docs with exactly 20, 30, 40, 50, or 60 pages silently get window size 2 instead of 4–7. |
| docs/assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py | Multi-page windowed QA recipe; non-reasoning branch for Qwen3.5-122B-A10B sets temperature=0.7/top_p=0.8 in extra_body but passes temperature=1.0/top_p=0.95 to ChatCompletionInferenceParams. |
| docs/assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py | Nemotron-Parse OCR pipeline; regex-based bbox parsing and custom column generator look correct. |
| docs/assets/recipes/vlm_long_doc/05-visual-qa-sdg.py | Single-page visual QA pipeline with relevance and correctness judges; no logic issues found. |
| docs/assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py | Frontier judge with 5-rubric scoring; weighted composite score computation and config wiring are correct. |
| docs/assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py | Whole-document QA recipe targeting full-document multi-page reasoning; no issues found. |
| docs/devnotes/posts/vlm-long-document-understanding.md | Developer note blog post; content and structure look accurate. |
| mkdocs.yml | Navigation entries for dev note and recipe pages added correctly. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[01-seed-dataset-preparation.py\nDownload PDFs → render pages → parquet] --> B[seed_per_page.parquet]
    A --> C[seed_windowed.parquet]
    A --> D[seed_whole_document.parquet]

    B --> E[02-nemotron-parse-ocr-sdg.py\nOCR → transcribed_texts]
    B --> F[03-text-qa-sdg.py\nText QA]
    B --> G[04-page-classification-sdg.py\nPage classification]

    G --> H[05-visual-qa-sdg.py\nSingle-page Visual QA]
    B --> I[06-single-page-qa-sdg.py\nAnchored Single-page QA]

    C --> J[07-multi-page-windowed-qa-sdg.py\nMulti-page windowed QA]
    D --> K[08-whole-document-qa-sdg.py\nWhole-document QA]

    H --> L[09-frontier-judge-sdg.py\nFrontier model judge\n5-rubric scoring + weighted composite]
    I --> L
    J --> L
    K --> L
    F --> L
```

Reviews (1): Last reviewed commit: "added links"

Comment on lines +113 to +125

```python
    if n_pages > 10 and n_pages < 20:
        return 3
    elif n_pages > 20 and n_pages < 30:
        return 4
    elif n_pages > 30 and n_pages < 40:
        return 5
    elif n_pages > 40 and n_pages < 50:
        return 6
    elif n_pages > 50 and n_pages < 60:
        return 7
    elif n_pages > 60:
        return 8
    return 2
```

P1 Boundary gap in adaptive_window_size silently returns wrong window size

All of the conditions use strict > and <, so documents with exactly 20, 30, 40, 50, or 60 pages fall through every branch and return the default 2 instead of the expected 4, 5, 6, or 7. For example, a 20-page document gets a window of 2 instead of 4, causing the windowed seed to produce far smaller windows than intended.

Suggested change

```python
    if n_pages > 10 and n_pages <= 20:
        return 3
    elif n_pages > 20 and n_pages <= 30:
        return 4
    elif n_pages > 30 and n_pages <= 40:
        return 5
    elif n_pages > 40 and n_pages <= 50:
        return 6
    elif n_pages > 50 and n_pages <= 60:
        return 7
    elif n_pages > 60:
        return 8
    return 2
```

Comment on lines +96 to +107

```python
            top_p = 0.95
        else:
            extra_body = {
                "temperature": 0.7,
                "top_p": 0.8,
                "top_k": 20,
                "min_p": 0.0,
                "presence_penalty": 1.5,
                "repetition_penalty": 1.0,
            }
            temperature = 1.0
            top_p = 0.95
```

P1 Non-reasoning branch sets mismatched temperature/top_p for Qwen3.5-122B-A10B

In the else (non-reasoning) branch for Qwen/Qwen3.5-122B-A10B, extra_body is populated with temperature=0.7, top_p=0.8, but the outer variables temperature and top_p are left at 1.0 and 0.95 (copied from the reasoning branch). ChatCompletionInferenceParams receives the outer variables, so the actual inference runs at temperature=1.0/top_p=0.95 — the non-reasoning settings in extra_body have no effect.

Suggested change

```python
        else:
            extra_body = {
                "temperature": 0.7,
                "top_p": 0.8,
                "top_k": 20,
                "min_p": 0.0,
                "presence_penalty": 1.5,
                "repetition_penalty": 1.0,
            }
            temperature = 0.7
            top_p = 0.8
```

@github-actions
Contributor

Code Review: PR #579 — docs: add VLM long-document understanding dev note and recipes

Summary

This PR is docs-only: it adds a substantial dev-note blog post (docs/devnotes/posts/vlm-long-document-understanding.md, ~600 lines) narrating how the team generated ~11.4M synthetic visual QA pairs to push Nemotron-3-Nano-Omni-30B-A3B from 26% → 59% on MMLongBench-Doc, plus nine runnable uv-scripted recipes (01–09) covering seed prep, Nemotron-Parse OCR, text QA, page classification, visual QA, single-page QA, multi-page windowed QA, whole-document QA, and a frontier-judge filter. Supporting: 4 hero/pipeline images, 9 thin recipe wrapper pages, a card on docs/recipes/cards.md, and mkdocs.yml nav entries. No library code is touched.

Overall this is a high-quality, well-organized contribution: the dev note is genuinely useful (reads like a lessons-learned write-up rather than a press release), the recipes are self-contained with clear prerequisites and vLLM launch examples, and the SPDX/copyright headers and from __future__ import annotations conventions are consistent with the project style. The review focuses on a handful of accuracy and consistency issues.

Findings

Correctness / Accuracy

  • Results-table formatting errors (docs/devnotes/posts/vlm-long-document-understanding.md, lines ~534–540). Several Jan 14 cells look like they lost a digit:

    • layout: .3150 should be 31.50
    • table: .2634 should be 26.34
    • chart: .3005 should be 30.05
    • image: .2733 should be 27.33; and Jan 29: 3.221 should be 32.21
      These render as leading-zero decimals that break the column's pattern and make the trend chart hard to interpret. Worth a careful second pass on the entire table before publication since it's the centerpiece of the post.
  • Boundary gap in adaptive_window_size (01-seed-dataset-preparation.py:112–131). The ladder uses strict > / < on both sides, so exact boundary page counts (10, 20, 30, 40, 50, 60) all fall through to the default of 2 rather than the surrounding tier. For example, a 20-page document gets window size 2, but a 19-page and a 21-page doc get 3 and 4 respectively. Easy fix: switch to 10 <= n_pages < 20, etc. (or use a small tier list and bisect).

  • Dev-note code snippets diverge from shipped recipes. The embedded snippets in the devnote use column_name="png_path" with ModalityDataType.URL, but every shipped recipe uses column_name="png_images_base64" with ModalityDataType.BASE64. The devnote itself later explains that the team moved away from embedded base64 to file paths because base64 caused 10+ minute DuckDB stalls — so readers who copy the recipes as-is onto a production seed will hit exactly the failure mode the post warns about. Two options:

    1. Add a one-line note at the top of each recipe's docstring: "This recipe expects base64 images in the seed for portability; for production-scale runs, switch png_images_base64 → a local file-path column with ModalityDataType.URL and start vLLM with --allowed-local-media-path."
    2. Or flip the recipes to URL-mode by default and document the base64 path as the fallback. Option 1 is less invasive.
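The tier-list-and-bisect alternative mentioned above can be sketched as follows. This is a hypothetical rewrite, using the thresholds implied by the original ladder and the convention (matching the suggested fix) that boundary counts belong to the lower tier:

```python
import bisect

# Inclusive upper bound of each tier; _WINDOWS has one extra entry for the
# open-ended top tier (>60 pages). Thresholds taken from the original ladder.
_BOUNDS = [10, 20, 30, 40, 50, 60]
_WINDOWS = [2, 3, 4, 5, 6, 7, 8]

def adaptive_window_size(n_pages: int) -> int:
    """Map a page count to a window size with no boundary gaps.

    bisect_left returns the first tier whose upper bound is >= n_pages,
    so every page count, including the exact boundaries, lands in a tier.
    """
    return _WINDOWS[bisect.bisect_left(_BOUNDS, n_pages)]
```

Because the mapping is a lookup rather than a chain of hand-written comparisons, there is no branch to fall through, and adding a tier is a one-line change.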

Consistency / Style

  • os._exit(0) in 01-seed-dataset-preparation.py:311. The comment explains this is to avoid hanging on background threads from datasets/fsspec, which is a reasonable workaround, but it skips interpreter cleanup (atexit, finalizers, buffered I/O). Worth at least setting a non-zero code when per_page_rows was empty (currently the function returns before reaching os._exit, so the process still exits 0 on the "no documents processed" error path at line 292).

  • Recipe wrapper pages under docs/recipes/vlm_long_doc/*.md are minimal — just a download button and a --8<-- include. This matches some existing recipes (e.g., plugin_development) but is thinner than others (e.g., mcp_and_tooluse), and means each wrapper page is effectively a duplicate of the devnote's "Try For Yourself" table. Not blocking, but a 2-3 sentence intro per page ("this recipe is stage N of the pipeline, inputs expected / outputs produced") would make them navigable on their own.

  • Recipe card (docs/recipes/cards.md:5080). Minor: "Nemotron-3-Nano-Omni-30B-A3B's training recipe" is possessive-of-a-possessive; consider "the Nemotron-3-Nano-Omni-30B-A3B training recipe."

  • mkdocs.yml ordering for the new devnote entry (line 5231) correctly follows the "most recent first" comment at the top of the Dev Notes section.
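The `os._exit` concern can be addressed without giving up the hang workaround. A hedged sketch (the `hard_exit` helper name and injectable `_exit` parameter are hypothetical, introduced here to keep the pattern testable):

```python
import os
import sys

def hard_exit(code: int = 0, _exit=os._exit) -> None:
    """Flush buffered output, then terminate the process immediately.

    os._exit skips atexit handlers and daemon-thread joins -- the behavior
    the recipe relies on to avoid hanging on datasets/fsspec background
    threads -- but it also skips flushing buffered stdout/stderr. Flushing
    explicitly first, and passing a non-zero code on error paths, keeps the
    workaround from swallowing output or masking failures.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    _exit(code)

# Usage sketch for the "no documents processed" error path:
# if not per_page_rows:
#     print("error: no documents processed", file=sys.stderr)
#     hard_exit(1)
```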

Security

  • 09-frontier-judge-sdg.py correctly uses the api_key=api_key_env pattern, passing the env-var name as a string — the engine's EnvironmentResolver resolves it via os.environ[secret], so raw keys never land in code or logs. Good.
  • 01-seed-dataset-preparation.py sends User-Agent: Mozilla/5.0 to arbitrary PDF URLs from FinePDFs. This is fine for an example recipe that a user runs locally with their own output dir; just worth noting that a 20s timeout + unvalidated-URL fetch loop is not something to run on a shared runner without scoping.
  • No secrets, keys, or tokens are leaked in the diff.

Test Coverage

N/A — documentation and recipe scripts only. The recipes depend on private/gated vLLM deployments and frontier endpoints, so end-to-end CI is not practical. A lightweight import/syntax-smoke test (python -c "import ast; ast.parse(open(p).read())" per recipe) would at least catch regressions from future Data Designer API breaks, but that's out of scope for this PR.

Verdict

Approve with nits. The content is solid and the recipes follow project conventions. The only item I'd gate on is fixing the malformed cells in the MMLongBench-Doc results table (leading-zero typos like .3150/.2634/.3005/.2733/3.221) — they're prominent and misleading. The adaptive_window_size boundary gap and the base64-vs-file-path consistency note are worth addressing before readers start running the recipes but are fixable post-merge. Everything else is polish.
