Add post-training example with structured-JSON captions by Xuanmeng-Zhang · Pull Request #17 · NVIDIA/cosmos-framework

Xuanmeng-Zhang · 2026-06-04T04:06:13Z

This PR fixes a caption-format mismatch: the post-training example trained on dense captions, while inference used the model’s native structured JSON prompt format. Structured JSON (caption_json) is now the default across captioning, training, and inference, with dense captions kept as a fallback. The example also migrates to the maintained nvidia/BridgeData2-Subset-Synthetic-Captions dataset.

Caption Format & Pipeline

Adds inference/structured_caption.py as the canonical structured-caption schema and serializer.
Shares caption_json_to_prompt between training and inference, making prompts byte-identical.
Updates caption_from_video.py to save both caption.json and caption.txt.
Updates captions_to_sft_jsonl.py to emit both caption_json and caption, add loader-matching filters, and write a summary JSON.
Updates sft_dataset.py so caption_json is the preferred training target, with dense caption preserved as fallback.
Adds inference_prompts_to_json.py to rewrite validation prompts into structured JSON.
Adds video_metadata.py for ffprobe-based media metadata.
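The byte-identical guarantee follows from routing both the loader and inference through a single function. The sketch below illustrates that idea only: the real body of `caption_json_to_prompt` is not shown in this description, so the function name `caption_json_to_prompt_sketch`, the `json.dumps` options, and the caption field names are all assumptions.

```python
import json


def caption_json_to_prompt_sketch(caption_json: dict) -> str:
    """Serialize a structured caption into a prompt string (illustrative only).

    Assumption: one deterministic json.dumps call with fixed options, keeping
    the key order of the input dict.
    """
    return json.dumps(caption_json, ensure_ascii=False, separators=(", ", ": "))


# Hypothetical caption; these field names are placeholders, not the real schema.
caption = {"subject": "robot arm", "action": "picks up a red block"}

train_prompt = caption_json_to_prompt_sketch(caption)  # loader path
infer_prompt = caption_json_to_prompt_sketch(caption)  # inference path

# Both paths call the same function, so the encoded bytes are equal.
same_bytes = train_prompt.encode("utf-8") == infer_prompt.encode("utf-8")
```

Keeping the serializer in one module means a formatting change (for example, different separators) reaches training and inference together rather than drifting apart.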

Dataset Migration

Replaces dense-only nvidia/bridge-v2-subset-synthetic-captions references with nvidia/BridgeData2-Subset-Synthetic-Captions.
Pins the dataset revision that includes caption_json.
Updates docs, launch scripts, dataset registry entries, tests, and H100 staging references.
Documents caption_json as primary and dense caption as backup.
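Pinning a revision could look roughly like the sketch below. It assumes the dataset is hosted on the Hugging Face Hub and fetched with `huggingface_hub.snapshot_download`; the PR description does not state the download mechanism or the revision value, so the `revision` argument is left for the caller to supply and the helper names are hypothetical.

```python
DATASET_REPO = "nvidia/BridgeData2-Subset-Synthetic-Captions"


def pinned_download_kwargs(revision: str) -> dict:
    """Build keyword arguments that pin the dataset to one revision.

    `revision` must be the commit or tag that carries caption_json; the value
    is not given in the PR description, so the caller provides it.
    """
    return {"repo_id": DATASET_REPO, "repo_type": "dataset", "revision": revision}


def download_pinned(revision: str, local_dir: str) -> str:
    """Download the pinned snapshot (assumes huggingface_hub is installed)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(local_dir=local_dir, **pinned_download_kwargs(revision))
```

Pinning matters here because an older revision of the dataset would lack `caption_json`, and the loader would then fall back to the dense caption.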

Validation

Validated on base Cosmos3-Nano using 51 validation clips across T2V, I2V, and V2V. Structured JSON matched or outperformed dense captions on conditioned modes, with no regression observed.

The full example flow was also re-run end to end:

download -> launch shell -> export -> JSON-prompt inference

Tests

Adds 36 unit tests covering:

Structured-caption schema, parsing, and assembly
JSONL conversion
Inference-prompt rewriting
Caption generation
Loader caption selection
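As an illustration of the "loader caption selection" behavior these tests target, the sketch below prefers `caption_json` and falls back to the dense `caption`. It is an assumption-laden stand-in, not the actual `_select_caption` from `sft_dataset.py`: the function name `select_caption_sketch`, the record layout, and the error handling are hypothetical.

```python
import json


def select_caption_sketch(record: dict) -> str:
    """Pick the training caption: structured JSON first, dense text as fallback."""
    structured = record.get("caption_json")
    if isinstance(structured, dict) and structured:
        # JSON path: serialize the dict as-is, with no prose post-processing.
        return json.dumps(structured, ensure_ascii=False)
    dense = record.get("caption")
    if isinstance(dense, str) and dense.strip():
        return dense.strip()
    raise ValueError("record has neither caption_json nor caption")


# Hypothetical records; the field values are placeholders.
both = {"caption_json": {"subject": "robot arm"}, "caption": "A robot arm moves."}
dense_only = {"caption": "A robot arm moves."}

picked_structured = select_caption_sketch(both)
picked_dense = select_caption_sketch(dense_only)
```

A unit test along these lines would assert that the structured caption wins when both fields are present and that the dense caption is used only when `caption_json` is missing or empty.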

The post-training example trained on dense captions while inference uses the model's native structured-JSON prompt format, a misalignment. This makes structured JSON (`caption_json`) the default across captioning, training, and inference, with the dense narrative kept as a backup, and migrates the example onto the maintained `nvidia/BridgeData2-Subset-Synthetic-Captions` dataset.

Caption format & pipeline:
- New `inference/structured_caption.py`: canonical schema + parse/assemble of the two-phase VLM output; `caption_json_to_prompt` is the single serializer shared by the loader and inference for byte-identical train↔infer prompts.
- `caption_from_video.py` saves both `caption.json` (structured) and `caption.txt` (dense); two-phase `video_captioner.txt` emits the full canonical Phase-1 JSON.
- `captions_to_sft_jsonl.py` emits `caption_json` + dense `caption`; adds `--num-video-frames`, loader-matching filters, and a `<output>.summary.json`.
- `sft_dataset.py` `_select_caption`: `caption_json` top priority, dict serialized verbatim (no prose-period/suffix/media mangling on the JSON path); configurable `max_num_tokens` (recipes raise it to 2048 for the longer JSON prompts).
- New `inference_prompts_to_json.py` rewrites val inference prompts to JSON.
- New `video_metadata.py` (ffprobe media fields).

Evaluation:
- New `cosmos_framework/scripts/eval.py`: CPU PSNR/SSIM of generated vs GT video, aggregated per conditioning mode, with `--compare-baseline` for A/B deltas. Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V): JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).

Dataset unification:
- Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry + its tests, H100 staging) from the dense-only `nvidia/bridge-v2-subset-synthetic-captions` to `nvidia/BridgeData2-Subset-Synthetic-Captions` at the revision that carries `caption_json`. Docs gain Format + Evaluate sections.

Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json, caption_from_video, sft_dataset caption selection, eval).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
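The commit message mentions ffprobe-based media fields in `video_metadata.py` without showing them. One way such a helper might be written is sketched below; the selected fields, the function names, and the output layout are assumptions, while the `ffprobe` flags used (`-v error`, `-select_streams`, `-show_entries`, `-of json`) are standard ffprobe options.

```python
import json
import subprocess


def ffprobe_cmd(path: str) -> list:
    """Build an ffprobe command that prints first-video-stream fields as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=width,height,r_frame_rate,nb_frames",
        "-of", "json",
        path,
    ]


def parse_video_stream(ffprobe_json: str) -> dict:
    """Convert ffprobe's JSON output into typed metadata fields."""
    stream = json.loads(ffprobe_json)["streams"][0]
    num, den = stream["r_frame_rate"].split("/")
    nb_frames = stream.get("nb_frames", "N/A")  # some containers do not report it
    return {
        "width": int(stream["width"]),
        "height": int(stream["height"]),
        "fps": int(num) / int(den),
        "num_frames": int(nb_frames) if nb_frames != "N/A" else None,
    }


def probe(path: str) -> dict:
    """Run ffprobe on a file (requires ffprobe on PATH) and parse the result."""
    out = subprocess.run(ffprobe_cmd(path), capture_output=True, text=True, check=True)
    return parse_video_stream(out.stdout)


# Parsing can be exercised without ffprobe by feeding a sample payload.
sample = '{"streams": [{"width": 640, "height": 480, "r_frame_rate": "30/1", "nb_frames": "93"}]}'
meta = parse_video_stream(sample)
```

Separating command construction from parsing keeps the parsing logic testable on machines where ffprobe is not installed.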

lfengad · 2026-06-04T07:43:58Z

@@ -0,0 +1,284 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


It seems we could remove this eval.py file; is the eval part necessary?

Got it, already removed.

lfengad · 2026-06-04T07:47:16Z

+CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth
+video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode**
+(``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's
+``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with


This cosmos3.*** reference should be removed, since it is deprecated.

lfengad · 2026-06-04T07:48:35Z

+def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]:
+    """Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``.
+
+    Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an


The imaginaire4 mentions should be removed throughout the text.

The PSNR/SSIM eval script was added only to verify the structured-JSON captions help during development; it is not needed in the post-training example. Remove cosmos_framework/scripts/eval.py and its test, and drop the 'Evaluate generated videos' section from docs/dataset_jsonl.md. This also removes the deprecated `cosmos3.scripts.eval` reference and the imaginaire4 mentions (both were confined to eval.py). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lfengad · 2026-06-04T12:26:17Z

overall LGTM

foreverlms

LGTM!

yy-code-nv · 2026-06-05T08:23:28Z

+                            # Structured-JSON captions are long; raise the token budget so
+                            # the loader does not truncate them (see sft_dataset.py
+                            # _MAX_NUM_TOKENS). 2048 covers the example set (measured max ~1790).
+                            max_num_tokens=2048,


Should this value be tunable? dataloader_train.max_sequence_length is already a tunable parameter in toml_config; could the same value be used here? Refer to:

cosmos-framework/cosmos_framework/configs/base/vlm/experiment/llava_ov_datapacker_experiment.py

Line 351 in 003d66d

max_seq_len="${dataloader_train.max_tokens}",

Good idea. As suggested, I made this configurable through a new [dataloader_train].max_caption_tokens field, which defaults to 2048. I intentionally kept it separate from max_sequence_length: max_sequence_length caps the packed sequence length (45056), while max_caption_tokens controls per-caption truncation. Reusing 45056 here would effectively disable per-caption truncation. I left VLM unchanged because it already caps length through max_sequence_length.
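The distinction between the two limits can be made concrete with a small sketch. It assumes the `[dataloader_train]` table is available as a plain dict and that truncation is a simple slice over token ids; the helper names are hypothetical, and only the values 2048, 45056, and the measured maximum of about 1790 tokens come from this thread.

```python
DEFAULT_MAX_CAPTION_TOKENS = 2048


def resolve_max_caption_tokens(dataloader_train: dict) -> int:
    """Read max_caption_tokens from the config table, defaulting when omitted."""
    return int(dataloader_train.get("max_caption_tokens", DEFAULT_MAX_CAPTION_TOKENS))


def truncate_caption(token_ids: list, max_caption_tokens: int) -> list:
    """Per-caption truncation: keep at most max_caption_tokens ids."""
    return token_ids[:max_caption_tokens]


# Config with only the packed-sequence cap set; the caption budget is omitted.
cfg = {"max_sequence_length": 45056}
budget = resolve_max_caption_tokens(cfg)

typical = truncate_caption(list(range(1790)), budget)    # fits, left untouched
oversized = truncate_caption(list(range(3000)), budget)  # cut to the budget
# Reusing the packed-sequence cap would leave the oversized caption uncut.
uncut = truncate_caption(list(range(3000)), cfg["max_sequence_length"])
```

With the separate field, a 3000-token caption is cut to 2048 tokens, whereas reusing the 45056 cap would pass it through unchanged.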

yy-code-nv · 2026-06-05T08:30:17Z

+so that partial or slightly-off VLM output still round-trips instead of being
+dropped; the goal is structural validation, not rejection.
+"""
+


Can this class be used by prompt upsampling? I think they should have a similar final JSON structure.

Yes. StructuredCaption is a mirror of the same external_api/t2v_i2v_video_json_schema.json the upsampler loads in prompt_upsampling.py, so they share the field contract. The upsampler currently handles the response as an opaque dict (json.loads → dict, no typed validation), so it doesn't strictly need the class today, but it could adopt it to validate/normalize. Should we make that change here or leave it as a follow-up?
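To make "validate/normalize" concrete, here is a minimal sketch of lenient normalization applied to an opaque dict, in the spirit of "structural validation, not rejection". It is not the actual `StructuredCaption` class: the class name `LenientCaption` and the fields `subject` and `action` are hypothetical placeholders, since the real schema fields are not listed in this thread.

```python
from dataclasses import dataclass, field

KNOWN_FIELDS = ("subject", "action")  # hypothetical; the real schema differs


@dataclass
class LenientCaption:
    subject: str = ""
    action: str = ""
    extras: dict = field(default_factory=dict)  # unknown keys are kept, not dropped

    @classmethod
    def from_dict(cls, raw: dict) -> "LenientCaption":
        """Normalize a partially filled dict instead of rejecting it."""
        return cls(
            subject=str(raw.get("subject") or ""),
            action=str(raw.get("action") or ""),
            extras={k: v for k, v in raw.items() if k not in KNOWN_FIELDS},
        )

    def to_dict(self) -> dict:
        """Round-trip back to a plain dict, including any extra keys."""
        return {"subject": self.subject, "action": self.action, **self.extras}


# A partial response: one known field missing, one unknown field present.
partial = LenientCaption.from_dict({"subject": "robot arm", "camera": "static"})
round_tripped = partial.to_dict()
```

An upsampler that currently passes around the raw `json.loads` result could call such a normalizer on it to get default values for missing fields while still preserving anything the schema does not name.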

yy-code-nv

commented

…n budget as [dataloader_train].max_caption_tokens for VFM SFT recipes. The recipe maps it to the dataset parameter; VLM recipes skip it because packing is capped by max_sequence_length. Set the example TOMLs to 2048, which covers the structured JSON captions measured so far (max ~1790 tokens), while keeping 2048 as the default when omitted. Rename max_num_tokens/_MAX_NUM_TOKENS to max_caption_tokens/_MAX_CAPTION_TOKENS end to end across sft_dataset.py, open_source_dataloader.py, recipes, and caption_from_video.py. This keeps the TOML and dataset naming aligned and avoids confusion with model.max_num_tokens_after_packing.

Xuanmeng-Zhang · 2026-06-06T05:56:24Z

@foreverlms @yy-code-nv Please take a look at your convenience. I added a new configurable [dataloader_train].max_caption_tokens field with 2048 as default. And I kept it separate from max_sequence_length intentionally: max_sequence_length caps the packed sequence length (45056), while max_caption_tokens controls per-caption truncation. I left VLM unchanged because it already caps length through max_sequence_length.

Xuanmeng-Zhang requested a review from lfengad June 4, 2026 04:06

Xuanmeng-Zhang force-pushed the unify-structured-jsonl-captions branch from 17471de to 8eef586 Compare June 4, 2026 04:17

Merge branch 'main' into unify-structured-jsonl-captions

d6b4eb2

lfengad reviewed Jun 4, 2026


lfengad and others added 2 commits June 4, 2026 16:59

Merge branch 'main' into unify-structured-jsonl-captions

ec68974

lfengad requested a review from foreverlms June 4, 2026 12:15

lfengad previously approved these changes Jun 4, 2026


foreverlms requested a review from yy-code-nv June 5, 2026 04:11

foreverlms previously approved these changes Jun 5, 2026


yy-code-nv reviewed Jun 5, 2026


Xuanmeng-Zhang dismissed stale reviews from foreverlms and lfengad via 080b1bb June 6, 2026 05:31

Xuanmeng-Zhang requested review from foreverlms and yy-code-nv June 6, 2026 05:52

Xuanmeng-Zhang wants to merge 5 commits into main from unify-structured-jsonl-captions

4 participants
