Add post-training example with structured-JSON captions #17

Open
Xuanmeng-Zhang wants to merge 5 commits into main from unify-structured-jsonl-captions

@Xuanmeng-Zhang Xuanmeng-Zhang commented Jun 4, 2026

This PR fixes a caption-format mismatch: the post-training example trained on dense captions, while inference used the model’s native structured JSON prompt format. Structured JSON (caption_json) is now the default across captioning, training, and inference, with dense captions kept as a fallback. The example also migrates to the maintained nvidia/BridgeData2-Subset-Synthetic-Captions dataset.

Caption Format & Pipeline

  • Adds inference/structured_caption.py as the canonical structured-caption schema and serializer.
  • Shares caption_json_to_prompt between training and inference, making prompts byte-identical.
  • Updates caption_from_video.py to save both caption.json and caption.txt.
  • Updates captions_to_sft_jsonl.py to emit both caption_json and caption, add loader-matching filters, and write a summary JSON.
  • Updates sft_dataset.py so caption_json is the preferred training target, with dense caption preserved as fallback.
  • Adds inference_prompts_to_json.py to rewrite validation prompts into structured JSON.
  • Adds video_metadata.py for ffprobe-based media metadata.
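
To illustrate how a single shared serializer and a preference order could fit together, here is a minimal sketch. The function bodies, the `_sketch` names, and the sample fields (`subject`, `action`) are assumptions for illustration only; the actual `caption_json_to_prompt` and `_select_caption` are not shown in this conversation and may differ.

```python
import json
from collections.abc import Mapping
from typing import Any, Optional


def caption_json_to_prompt_sketch(caption_json: Mapping) -> str:
    """Serialize a structured caption to a prompt string (hypothetical stand-in)."""
    # Explicit separators and ensure_ascii=False pin the output format, and key
    # order is preserved as given, so any caller that uses this one function
    # produces the same bytes for the same dict.
    return json.dumps(dict(caption_json), ensure_ascii=False, separators=(", ", ": "))


def select_caption_sketch(sample: Mapping) -> Optional[str]:
    """Prefer caption_json, fall back to the dense caption, else return None."""
    structured: Any = sample.get("caption_json")
    if isinstance(structured, Mapping) and structured:
        return caption_json_to_prompt_sketch(structured)
    dense: Any = sample.get("caption")
    if isinstance(dense, str) and dense.strip():
        return dense.strip()
    return None


if __name__ == "__main__":
    row = {
        "caption_json": {"subject": "robot arm", "action": "picks up a cup"},
        "caption": "A robot arm picks up a cup.",
    }
    print(select_caption_sketch(row))                      # structured path
    print(select_caption_sketch({"caption": row["caption"]}))  # dense fallback
```

Routing both the loader and the inference script through the same function is what would make the prompts match byte for byte; two separate `json.dumps` calls with different options would not.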

Dataset Migration

  • Replaces dense-only nvidia/bridge-v2-subset-synthetic-captions references with nvidia/BridgeData2-Subset-Synthetic-Captions.
  • Pins the dataset revision that includes caption_json.
  • Updates docs, launch scripts, dataset registry entries, tests, and H100 staging references.
  • Documents caption_json as primary and dense caption as backup.
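
As a rough illustration of pinning a dataset revision, the sketch below builds the arguments for `huggingface_hub.snapshot_download`. The helper names and the idea of refusing an empty revision are assumptions; the actual pinned revision is not stated in this description, so it is left as a caller-supplied value.

```python
from typing import Any, Dict

DATASET_REPO = "nvidia/BridgeData2-Subset-Synthetic-Captions"


def build_download_kwargs(revision: str, local_dir: str) -> Dict[str, Any]:
    """Build snapshot_download arguments for a pinned dataset revision."""
    if not revision:
        # An unpinned download could pick up a revision without caption_json.
        raise ValueError("an explicit dataset revision is required")
    return {
        "repo_id": DATASET_REPO,
        "repo_type": "dataset",
        "revision": revision,
        "local_dir": local_dir,
    }


def download_pinned(revision: str, local_dir: str) -> str:
    """Download the dataset at the given revision and return the local path."""
    # Imported lazily so build_download_kwargs works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(revision, local_dir))
```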

Validation

Validated on base Cosmos3-Nano using 51 validation clips across T2V, I2V, and V2V. Structured JSON matched or outperformed dense captions on conditioned modes, with no regression observed.

The full example flow was also re-run end to end:

download -> launch shell -> export -> JSON-prompt inference

Tests

Adds 36 unit tests covering:

  • Structured-caption schema, parsing, and assembly
  • JSONL conversion
  • Inference-prompt rewriting
  • Caption generation
  • Loader caption selection

@Xuanmeng-Zhang Xuanmeng-Zhang requested a review from lfengad June 4, 2026 04:06
The post-training example trained on dense captions while inference uses the
model's native structured-JSON prompt format — a misalignment. This makes
structured JSON (`caption_json`) the default across captioning, training, and
inference, with the dense narrative kept as a backup, and migrates the example
onto the maintained `nvidia/BridgeData2-Subset-Synthetic-Captions` dataset.

Caption format & pipeline:
- New `inference/structured_caption.py`: canonical schema + parse/assemble of the
  two-phase VLM output; `caption_json_to_prompt` is the single serializer shared
  by the loader and inference for byte-identical train↔infer prompts.
- `caption_from_video.py` saves both `caption.json` (structured) and `caption.txt`
  (dense); two-phase `video_captioner.txt` emits the full canonical Phase-1 JSON.
- `captions_to_sft_jsonl.py` emits `caption_json` + dense `caption`; adds
  `--num-video-frames`, loader-matching filters, and a `<output>.summary.json`.
- `sft_dataset.py` `_select_caption`: `caption_json` top priority, dict serialized
  verbatim (no prose-period/suffix/media mangling on the JSON path); configurable
  `max_num_tokens` (recipes raise it to 2048 for the longer JSON prompts).
- New `inference_prompts_to_json.py` rewrites val inference prompts to JSON.
- New `video_metadata.py` (ffprobe media fields).

Evaluation:
- New `cosmos_framework/scripts/eval.py`: CPU PSNR/SSIM of generated vs GT video,
  aggregated per conditioning mode, with `--compare-baseline` for A/B deltas.
  Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V):
  JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).
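
For readers unfamiliar with the metric, here is a dependency-free sketch of per-clip PSNR averaged per conditioning mode. It is not the `eval.py` from this commit: the function names are hypothetical, clips are represented as flat sequences of 8-bit pixel values, and SSIM is omitted.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def psnr(gt: Sequence[int], pred: Sequence[int], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(gt, pred)) / len(gt)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr_per_mode(
    clips: Iterable[Tuple[str, Sequence[int], Sequence[int]]],
) -> Dict[str, float]:
    """Average per-clip PSNR separately for each mode (e.g. t2v, i2v, v2v)."""
    by_mode: Dict[str, List[float]] = defaultdict(list)
    for mode, gt, pred in clips:
        by_mode[mode].append(psnr(gt, pred))
    return {mode: mean(values) for mode, values in by_mode.items()}
```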

Dataset unification:
- Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry +
  its tests, H100 staging) from the dense-only
  `nvidia/bridge-v2-subset-synthetic-captions` to
  `nvidia/BridgeData2-Subset-Synthetic-Captions` at the revision that carries
  `caption_json`. Docs gain Format + Evaluate sections.

Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json,
caption_from_video, sft_dataset caption selection, eval).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Xuanmeng-Zhang Xuanmeng-Zhang force-pushed the unify-structured-jsonl-captions branch from 17471de to 8eef586 Compare June 4, 2026 04:17
Comment thread cosmos_framework/scripts/eval.py Outdated
@@ -0,0 +1,284 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

It seems we could remove this eval.py file. Is the eval part necessary?

Collaborator Author

@Xuanmeng-Zhang Xuanmeng-Zhang Jun 4, 2026

Got it, already removed.

Comment thread cosmos_framework/scripts/eval.py Outdated
CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth
video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode**
(``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's
``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with
Collaborator

This cosmos3.*** reference should be removed, since it is deprecated.

Comment thread cosmos_framework/scripts/eval.py Outdated
def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]:
"""Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``.

Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an
Collaborator

The imaginaire4 mentions should be removed from all the text.

lfengad and others added 2 commits June 4, 2026 16:59
The PSNR/SSIM eval script was added only to verify the structured-JSON
captions help during development; it is not needed in the post-training
example. Remove cosmos_framework/scripts/eval.py and its test, and drop the
'Evaluate generated videos' section from docs/dataset_jsonl.md.

This also removes the deprecated `cosmos3.scripts.eval` reference and the
imaginaire4 mentions (both were confined to eval.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lfengad lfengad requested a review from foreverlms June 4, 2026 12:15

lfengad commented Jun 4, 2026

overall LGTM

lfengad previously approved these changes Jun 4, 2026
@foreverlms foreverlms requested a review from yy-code-nv June 5, 2026 04:11
foreverlms previously approved these changes Jun 5, 2026

@foreverlms foreverlms left a comment

LGTM!

# Structured-JSON captions are long; raise the token budget so
# the loader does not truncate them (see sft_dataset.py
# _MAX_NUM_TOKENS). 2048 covers the example set (measured max ~1790).
max_num_tokens=2048,
Collaborator

Should this value be tunable? dataloader_train.max_sequence_length is a tunable parameter in toml_config; could this use the same value? refer to

Collaborator Author

Good idea. As suggested, I made this configurable with a new [dataloader_train].max_caption_tokens field, which defaults to 2048. I intentionally kept it separate from max_sequence_length: max_sequence_length caps the packed sequence length (45056), while max_caption_tokens controls per-caption truncation. Reusing 45056 here would effectively disable per-caption truncation. I left VLM unchanged because it already caps length through max_sequence_length.
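
The difference between the two limits can be sketched as follows. This is a simplified illustration, not the loader's code: the function names are hypothetical, token lists stand in for tokenized captions, and the greedy packer ignores the non-caption tokens that a real packed sequence would also contain.

```python
from typing import List


def truncate_caption(tokens: List[int], max_caption_tokens: int = 2048) -> List[int]:
    """Per-caption limit: cut a single caption down to the budget."""
    return tokens[:max_caption_tokens]


def pack(samples: List[List[int]], max_sequence_length: int) -> List[List[int]]:
    """Packed limit: greedily concatenate samples without exceeding the cap."""
    packed: List[List[int]] = []
    current: List[int] = []
    for sample in samples:
        if len(sample) > max_sequence_length:
            raise ValueError("a single sample exceeds max_sequence_length")
        if current and len(current) + len(sample) > max_sequence_length:
            packed.append(current)
            current = []
        current = current + sample
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    captions = [list(range(3000)), list(range(1790)), list(range(500))]
    # With a 2048 budget the 3000-token caption is cut; with 45056 nothing is.
    print([len(truncate_caption(c, 2048)) for c in captions])
    print([len(truncate_caption(c, 45056)) for c in captions])
    print([len(p) for p in pack([truncate_caption(c) for c in captions], 45056)])
```

Because the packed cap is far larger than any single caption, using it as the per-caption budget would never trigger truncation, which is the reason for keeping two separate fields.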

so that partial or slightly-off VLM output still round-trips instead of being
dropped; the goal is structural validation, not rejection.
"""

Collaborator

Can this class be used by the prompt upsampler? I think they should have a similar final JSON structure.

Collaborator Author

Yes. StructuredCaption mirrors the same external_api/t2v_i2v_video_json_schema.json that the upsampler loads in prompt_upsampling.py, so they share the field contract. The upsampler currently handles the response as an opaque dict (json.loads → dict, no typed validation), so it doesn't strictly need the class today, but it could adopt it to validate and normalize. Should we make that change here or leave it as a follow-up?
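
To make the "structural validation, not rejection" idea concrete, here is a small sketch of a lenient typed wrapper that an opaque dict could be passed through. The class name and the fields (`subject`, `action`, `objects`) are hypothetical placeholders; the real field contract lives in the JSON schema file and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping


@dataclass
class StructuredCaptionSketch:
    # Hypothetical fields; the real schema defines its own set.
    subject: str = ""
    action: str = ""
    objects: List[str] = field(default_factory=list)
    extras: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: Mapping[str, Any]) -> "StructuredCaptionSketch":
        """Normalize partial or slightly-off input instead of rejecting it."""
        known = {"subject", "action", "objects"}
        objects = raw.get("objects") or []
        if not isinstance(objects, (list, tuple)):
            objects = [objects]  # tolerate a bare value where a list is expected
        return cls(
            subject=str(raw.get("subject") or ""),
            action=str(raw.get("action") or ""),
            objects=[str(o) for o in objects],
            # Unknown keys are kept so they survive a round trip.
            extras={k: v for k, v in raw.items() if k not in known},
        )

    def to_dict(self) -> Dict[str, Any]:
        out: Dict[str, Any] = {
            "subject": self.subject,
            "action": self.action,
            "objects": list(self.objects),
        }
        out.update(self.extras)
        return out
```

A caller that currently does `json.loads(...)` could wrap the result with `from_dict(...).to_dict()` to get defaults filled in while keeping unrecognized keys.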


@yy-code-nv yy-code-nv left a comment

commented

…n budget as [dataloader_train].max_caption_tokens for VFM SFT recipes. The recipe maps it to the dataset parameter; VLM recipes skip it because packing is capped by max_sequence_length. Set the example TOMLs to 2048, which covers the structured JSON captions measured so far (max ~1790 tokens), while keeping 2048 as the default when omitted.

Rename max_num_tokens/_MAX_NUM_TOKENS to max_caption_tokens/_MAX_CAPTION_TOKENS end to end across sft_dataset.py, open_source_dataloader.py, recipes, and caption_from_video.py. This keeps the TOML and dataset naming aligned and avoids confusion with model.max_num_tokens_after_packing.
@Xuanmeng-Zhang Xuanmeng-Zhang dismissed stale reviews from foreverlms and lfengad via 080b1bb June 6, 2026 05:31
@Xuanmeng-Zhang
Collaborator Author

@foreverlms @yy-code-nv Please take a look at your convenience. I added a new configurable [dataloader_train].max_caption_tokens field with 2048 as the default. I intentionally kept it separate from max_sequence_length: max_sequence_length caps the packed sequence length (45056), while max_caption_tokens controls per-caption truncation. I left VLM unchanged because it already caps length through max_sequence_length.

4 participants