Add post-training example with structured-JSON captions #17
Xuanmeng-Zhang wants to merge 5 commits into
The post-training example trained on dense captions while inference uses the model's native structured-JSON prompt format, which is a misalignment. This makes structured JSON (`caption_json`) the default across captioning, training, and inference, with the dense narrative kept as a backup, and migrates the example onto the maintained `nvidia/BridgeData2-Subset-Synthetic-Captions` dataset.

Caption format & pipeline:
- New `inference/structured_caption.py`: canonical schema + parse/assemble of the two-phase VLM output; `caption_json_to_prompt` is the single serializer shared by the loader and inference for byte-identical train↔infer prompts.
- `caption_from_video.py` saves both `caption.json` (structured) and `caption.txt` (dense); two-phase `video_captioner.txt` emits the full canonical Phase-1 JSON.
- `captions_to_sft_jsonl.py` emits `caption_json` + dense `caption`; adds `--num-video-frames`, loader-matching filters, and a `<output>.summary.json`.
- `sft_dataset.py` `_select_caption`: `caption_json` top priority, dict serialized verbatim (no prose-period/suffix/media mangling on the JSON path); configurable `max_num_tokens` (recipes raise it to 2048 for the longer JSON prompts).
- New `inference_prompts_to_json.py` rewrites val inference prompts to JSON.
- New `video_metadata.py` (ffprobe media fields).

Evaluation:
- New `cosmos_framework/scripts/eval.py`: CPU PSNR/SSIM of generated vs GT video, aggregated per conditioning mode, with `--compare-baseline` for A/B deltas. Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V): JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).

Dataset unification:
- Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry + its tests, H100 staging) from the dense-only `nvidia/bridge-v2-subset-synthetic-captions` to `nvidia/BridgeData2-Subset-Synthetic-Captions` at the revision that carries `caption_json`. Docs gain Format + Evaluate sections.
Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json, caption_from_video, sft_dataset caption selection, eval). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
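The "single serializer" idea from the commit message can be illustrated with a minimal sketch. It is not the real `caption_json_to_prompt`: the function name `caption_json_to_prompt_sketch`, the two caption fields, and the choice of `sort_keys=True` with compact separators are all assumptions made for illustration. The point is only that when the loader and inference both call one function with pinned formatting options, the resulting prompts cannot drift apart.

```python
import json

# Hypothetical caption; the field names are illustrative, not the real schema.
EXAMPLE_CAPTION = {
    "subjects": ["robot arm", "blue cup"],
    "actions": "The arm picks up the cup.",
}


def caption_json_to_prompt_sketch(caption_json: dict) -> str:
    """Stand-in for a shared serializer: one dict in, one deterministic string out."""
    # Assumption: pinning every json.dumps option (key order, separators,
    # non-ASCII handling) removes formatting drift between callers.
    return json.dumps(
        caption_json, ensure_ascii=False, sort_keys=True, separators=(",", ":")
    )


def build_training_prompt(sample: dict) -> str:
    # Loader side: reads the structured caption from a JSONL sample.
    return caption_json_to_prompt_sketch(sample["caption_json"])


def build_inference_prompt(caption_json: dict) -> str:
    # Inference side: same function, so the output bytes match.
    return caption_json_to_prompt_sketch(caption_json)


train_prompt = build_training_prompt({"caption_json": EXAMPLE_CAPTION})
# Same content with the keys inserted in the opposite order.
reordered = dict(reversed(list(EXAMPLE_CAPTION.items())))
infer_prompt = build_inference_prompt(reordered)
```

In this sketch, key sorting makes the output independent of dict insertion order. A serializer that preserves the original key order would work equally well, provided both sides receive the same dict.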
17471de to 8eef586
| @@ -0,0 +1,284 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
It seems we could remove this eval.py file? The eval part is not necessary.
Got it, already removed.
| CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth | ||
| video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode** | ||
| (``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's | ||
| ``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with |
This `cosmos3.***` reference should be removed since it is deprecated.
| def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]: | ||
| """Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``. | ||
|
|
||
| Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an |
The imaginaire4 mentions should be removed from all the text.
The PSNR/SSIM eval script was added only to verify the structured-JSON captions help during development; it is not needed in the post-training example. Remove cosmos_framework/scripts/eval.py and its test, and drop the 'Evaluate generated videos' section from docs/dataset_jsonl.md. This also removes the deprecated `cosmos3.scripts.eval` reference and the imaginaire4 mentions (both were confined to eval.py). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Overall LGTM.
| # Structured-JSON captions are long; raise the token budget so | ||
| # the loader does not truncate them (see sft_dataset.py | ||
| # _MAX_NUM_TOKENS). 2048 covers the example set (measured max ~1790). | ||
| max_num_tokens=2048, |
Should this value be tunable? dataloader_train.max_sequence_length is already a tunable parameter in toml_config; could we use the same value here? refer to
Good idea. As suggested, I made this configurable through a new [dataloader_train].max_caption_tokens field, which defaults to 2048. I intentionally kept it separate from max_sequence_length: max_sequence_length caps the packed sequence length (45056), while max_caption_tokens controls per-caption truncation. Reusing 45056 here would effectively disable per-caption truncation. I left VLM unchanged because it already caps length through max_sequence_length.
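To make the distinction between the two limits concrete, here is a rough sketch under stated assumptions. `truncate_caption` and `pack_greedy` are hypothetical helpers written for this illustration, not the loader's actual code, and the greedy packing strategy is assumed. Only the two numbers (2048 and 45056) come from the discussion above.

```python
MAX_CAPTION_TOKENS = 2048     # per-caption cap, from the reply above
MAX_SEQUENCE_LENGTH = 45056   # packed-sequence cap, from the reply above


def truncate_caption(token_ids: list, max_caption_tokens: int) -> list:
    """Hypothetical per-caption truncation: keep at most the first N tokens."""
    return token_ids[:max_caption_tokens]


def pack_greedy(lengths: list, max_sequence_length: int) -> list:
    """Hypothetical greedy packing: group sample lengths so each pack's total
    stays within max_sequence_length."""
    packs, current, used = [], [], 0
    for n in lengths:
        if current and used + n > max_sequence_length:
            packs.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        packs.append(current)
    return packs


# A 5000-token caption (made-up length) under each cap.
long_caption = list(range(5000))
capped = truncate_caption(long_caption, MAX_CAPTION_TOKENS)
uncapped = truncate_caption(long_caption, MAX_SEQUENCE_LENGTH)

# Packing operates on whole samples, not on individual captions.
packs = pack_greedy([20000, 20000, 20000], MAX_SEQUENCE_LENGTH)
```

With the 45056 cap, the 5000-token caption passes through untouched, which is the "effectively disabled" behavior the reply describes.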
| so that partial or slightly-off VLM output still round-trips instead of being | ||
| dropped; the goal is structural validation, not rejection. | ||
| """ | ||
|
|
Can this class be used by prompt upsampling? I think they should have a similar final JSON structure.
Yes. StructuredCaption mirrors the same external_api/t2v_i2v_video_json_schema.json that the upsampler loads in prompt_upsampling.py, so they share the field contract. The upsampler currently handles the response as an opaque dict (json.loads → dict, no typed validation), so it doesn't strictly need the class today, but it could adopt it to validate/normalize. Should we do that here or leave it as a follow-up?
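As a rough picture of what "validate/normalize" could mean for a consumer that currently holds an opaque dict, here is a minimal sketch. `CaptionSketch` and its fields (`summary`, `subjects`, `extras`) are hypothetical and do not reflect the real `StructuredCaption` class or the JSON schema. It only illustrates the lenient, non-rejecting behavior described in the docstring under review.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CaptionSketch:
    # Hypothetical fields; the real contract lives in the JSON schema file.
    summary: str = ""
    subjects: list = field(default_factory=list)
    extras: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "CaptionSketch":
        """Normalize a raw dict instead of rejecting it."""
        subjects = raw.get("subjects") or []
        if isinstance(subjects, str):
            # Tolerate a bare string where a list is expected.
            subjects = [subjects]
        known = {"summary", "subjects"}
        return cls(
            summary=str(raw.get("summary") or ""),
            subjects=list(subjects),
            # Unknown keys are kept so the data still round-trips.
            extras={k: v for k, v in raw.items() if k not in known},
        )


# Slightly-off model output: missing "summary", string instead of list.
raw_response = json.loads('{"subjects": "robot arm", "mood": "calm"}')
normalized = CaptionSketch.from_dict(raw_response)
```

A caller that works with plain dicts today could run its `json.loads` result through a class like this without changing what it accepts, since nothing is dropped.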
…n budget as [dataloader_train].max_caption_tokens for VFM SFT recipes. The recipe maps it to the dataset parameter; VLM recipes skip it because packing is capped by max_sequence_length. Set the example TOMLs to 2048, which covers the structured JSON captions measured so far (max ~1790 tokens), while keeping 2048 as the default when omitted.

Rename max_num_tokens/_MAX_NUM_TOKENS to max_caption_tokens/_MAX_CAPTION_TOKENS end to end across sft_dataset.py, open_source_dataloader.py, recipes, and caption_from_video.py. This keeps the TOML and dataset naming aligned and avoids confusion with model.max_num_tokens_after_packing.
@foreverlms @yy-code-nv Please take a look at your convenience. I added a new configurable
This PR fixes a caption-format mismatch: the post-training example trained on dense captions, while inference used the model’s native structured JSON prompt format. Structured JSON (caption_json) is now the default across captioning, training, and inference, with dense captions kept as a fallback. The example also migrates to the maintained nvidia/BridgeData2-Subset-Synthetic-Captions dataset.
Caption Format & Pipeline
- Adds `inference/structured_caption.py` as the canonical structured-caption schema and serializer.
- Shares `caption_json_to_prompt` between training and inference, making prompts byte-identical.
- Updates `caption_from_video.py` to save both `caption.json` and `caption.txt`.
- Updates `captions_to_sft_jsonl.py` to emit both `caption_json` and `caption`, add loader-matching filters, and write a summary JSON.
- Updates `sft_dataset.py` so `caption_json` is the preferred training target, with dense `caption` preserved as fallback.
- Adds `inference_prompts_to_json.py` to rewrite validation prompts into structured JSON.
- Adds `video_metadata.py` for ffprobe-based media metadata.
Dataset Migration
- Replaces `nvidia/bridge-v2-subset-synthetic-captions` references with `nvidia/BridgeData2-Subset-Synthetic-Captions`.
Validation
Validated on base Cosmos3-Nano using 51 validation clips across T2V, I2V, and V2V. Structured JSON matched or outperformed dense captions on conditioned modes, with no regression observed.
The full example flow was also re-run end to end:
download -> launch shell -> export -> JSON-prompt inference
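The caption-selection rule described under Caption Format & Pipeline (prefer `caption_json`, fall back to the dense `caption`) can be sketched roughly as follows. `select_caption_sketch` is a hypothetical stand-in, not the actual `_select_caption` in `sft_dataset.py`, and the sample fields shown are illustrative.

```python
import json


def select_caption_sketch(sample: dict) -> str:
    """Return the training prompt for one sample: structured JSON if present,
    otherwise the dense caption."""
    caption_json = sample.get("caption_json")
    if isinstance(caption_json, dict) and caption_json:
        # JSON path: serialize the dict as-is, with no prose post-processing.
        return json.dumps(caption_json, ensure_ascii=False)
    # Fallback path: dense narrative caption (assumed to be a plain string).
    return (sample.get("caption") or "").strip()


structured = select_caption_sketch(
    {"caption_json": {"subjects": ["cup"]}, "caption": "A cup on a table."}
)
dense_only = select_caption_sketch({"caption": "  A cup on a table.  "})
```

Samples from the older dense-only data still produce a usable prompt through the fallback branch, which matches the "kept as a fallback" behavior described in the PR summary.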
Tests
Adds 36 unit tests covering: