[docs][example] VLM Examples #1531

Merged
SumanthRH merged 18 commits into NovaSky-AI:main from nithinvc:nithinc/geometry-3k-example
Apr 21, 2026

Conversation

@nithinvc
Contributor

@nithinvc nithinvc commented Apr 17, 2026

Summary

Adds two end-to-end multi-turn VLM RL examples (Geometry-3K and VisGym) along with a Vision-Language RL tutorial that documents the shared VLM setup (flags, dataset record shape, local vLLM override). Also wires the SkyRLVLMGymGenerator from #1486 into the main entrypoint behind a config flag so VLM runs can be launched end-to-end from ppo_base_config.yaml.

  • Geometry-3K example (examples/train/geometry3k/) — multi-turn GRPO on hiyouga/geometry3k with Qwen/Qwen3-VL-8B-Instruct. Up to 3 turns per episode; model checks candidate answers with a calc_score tool before committing to a final \boxed{} answer. Binary reward.
  • VisGym example (examples/train/visgym/) — multi-image multi-turn RL where every env step returns a new image observation. Two recipes:
    • run_visgym_from_instruct.sh — vanilla Qwen3-VL-8B-Instruct, keyword actions, task-only reward, KL on.
    • run_visgym_from_sft.sh — starts from a structured <observation>/<justification>/<action> SFT checkpoint with tuple actions and a mixed task+format reward.
  • Docs — new tutorials/vision_language_rl.mdx (shared VLM setup, required flags, dataset shape, support matrix) and example pages for each recipe under examples/. Docs pages include reward curves and a VisGym rollout GIF.
  • Generator wiring — new generator.vision_language_generator config flag. When true, main_base.py constructs SkyRLVLMGymGenerator instead of SkyRLGymGenerator. Defaults to false, no behavior change for existing runs.
  • mm_token_type_ids shim (model_wrapper.py) — transformers v5 expects mm_token_type_ids to be populated at tokenization to distinguish text vs. multimodal tokens, but vLLM doesn't support transformers v5 yet and doesn't return them. Populate here from image_token_id when images are present and the field is missing. Remove once vLLM ships transformers v5 support.
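The shim described in the last bullet can be sketched roughly as follows. This is a minimal illustration with hypothetical field and function names; the real batch layout in model_wrapper.py may differ:

```python
def ensure_mm_token_type_ids(batch, image_token_id):
    """Backfill mm_token_type_ids when vLLM (pre-transformers-v5) omits it.

    Sketch only: marks image-placeholder tokens as multimodal (1) and
    everything else as text (0), but only when images are present and
    the field is missing.
    """
    if "pixel_values" in batch and batch.get("mm_token_type_ids") is None:
        batch["mm_token_type_ids"] = [
            1 if tok == image_token_id else 0 for tok in batch["input_ids"]
        ]
    return batch
```

Once vLLM ships transformers v5 support, the branch should never trigger and the shim can be deleted, as the PR notes.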

Test plan

  • bash examples/train/geometry3k/run_geometry3k.sh trains end-to-end on 1×8×H100; reward curve matches docs/public/images/examples/geometry3k_reward.png.
  • bash examples/train/visgym/run_visgym_from_instruct.sh trains end-to-end on 1×8×H100; reward curve matches docs/public/images/examples/visgym_maze2d_reward.png.
  • MODEL_PATH=/path/to/sft_ckpt bash examples/train/visgym/run_visgym_from_sft.sh trains end-to-end.
  • Existing non-VLM runs unaffected (vision_language_generator: false is the default).
  • Docs build: cd docs && npm run build.
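The vision_language_generator gate referenced in the test plan amounts to a small dispatch in main_base.py. A hedged sketch (class names are from the PR; the config access is illustrative, since the real entrypoint reads a Hydra/OmegaConf config rather than a plain dict):

```python
def select_generator_name(cfg: dict) -> str:
    """Pick the generator class based on the config flag.

    Illustrative only: main_base.py constructs SkyRLVLMGymGenerator or
    SkyRLGymGenerator directly; here we just return the chosen name.
    """
    if cfg.get("generator", {}).get("vision_language_generator", False):
        return "SkyRLVLMGymGenerator"
    return "SkyRLGymGenerator"
```

Because the flag defaults to false, an empty or legacy config falls through to the existing SkyRLGymGenerator path.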

@nithinvc nithinvc force-pushed the nithinc/geometry-3k-example branch from e4cb81b to 53a1dd9 on April 17, 2026 23:09
@nithinvc nithinvc marked this pull request as ready for review April 20, 2026 21:26
@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


@nithinvc nithinvc changed the title [wip][docs][example] VLM Examples [docs][example] VLM Examples Apr 20, 2026
Member

@SumanthRH SumanthRH left a comment


Overall LGTM, next time around let's break such changes up into smaller PRs. Have you run the dataset preparation scripts E2E? Let's make sure we do that.

Member


The YAML is legacy and will be deleted soon. Revert?

Contributor Author


reverted

print(f"Saved full training set ({len(train_dataset)} examples) to {train_parquet_path}")

# Process and save the val split
if "val" in dataset:
Member


The split name should be validation according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation, not val

Contributor Author


fixed


# Process and save the val split
if "val" in dataset:
val_dataset = dataset["val"]
Member


The split name is validation not val according to https://huggingface.co/datasets/hiyouga/geometry3k/viewer/default/validation

Contributor Author


fixed
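The fix boils down to looking up the split under its real name. A small sketch (the helper name is hypothetical, not from the PR):

```python
def get_validation_split(dataset):
    """Return the validation split of a DatasetDict-like mapping.

    hiyouga/geometry3k names the split "validation", not "val"; the
    fallback is only for datasets that use the shorter name.
    """
    for name in ("validation", "val"):
        if name in dataset:
            return dataset[name]
    return None
```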

Comment thread: examples/train/geometry3k/math_utils.py (Outdated)
return None


def grade_answer_verl(solution_str: str, ground_truth: str) -> bool:
Member


Let's instead call this grade_answer_from_boxed and add the reference used (looks like it's VERL) as a comment

Contributor Author


renamed
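A minimal sketch of what a grade_answer_from_boxed helper might look like. Illustrative only: the PR's math_utils.py, adapted from VERL, normalizes LaTeX far more thoroughly than this.

```python
import re


def grade_answer_from_boxed(solution_str: str, ground_truth: str) -> bool:
    """Extract the last \\boxed{...} expression and compare it to the
    ground truth after trivial whitespace normalization. Nested braces
    are not handled in this sketch."""
    match = None
    for match in re.finditer(r"\\boxed\{([^{}]*)\}", solution_str):
        pass  # keep only the final boxed answer
    if match is None:
        return False
    return match.group(1).strip() == ground_truth.strip()
```

Taking the last boxed expression matches the multi-turn setup above, where the model may box intermediate candidates before committing to a final answer.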

# Algorithm
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=8 \
Member


This snippet here says n_samples_per_prompt=8 but then the actual script at examples/train/geometry3k/run_geometry3k.sh says n_samples_per_prompt=4.

Contributor Author


fixed, both are now 4 samples per prompt

Comment on lines +11 to +12
**Local vLLM source override required (temporary).** VLM training needs a newer vLLM than the `vllm==0.19.0` pinned in the root `pyproject.toml`. Until the next vLLM release ships with the multimodal rendering support used by SkyRL's new inference stack, clone vLLM locally and point uv at it by adding one line under `[tool.uv.sources]` in the repo root `pyproject.toml`:

Member


Can you specify the exact commit required? It looks like it needs to be after 80b1823

Contributor Author


done
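For reference, the override discussed in this thread is a single entry in the repo-root pyproject.toml. The local checkout path below is an example, and the checkout must include a commit after 80b1823:

```toml
[tool.uv.sources]
# Local vLLM checkout with the multimodal rendering support
# (commit after 80b1823). Point the path at your own clone.
vllm = { path = "../vllm", editable = true }
```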

@nithinvc
Contributor Author

Overall LGTM, next time around let's break such changes up into smaller PRs. Have you run the dataset preparation scripts E2E? Let's make sure we do that.

Yes, the "validation" key got past me since the training run uses the test split (test.parquet) for evals. I just reran both dataset generation scripts and they produce the correct parquet files.

@SumanthRH SumanthRH merged commit f83573c into NovaSky-AI:main Apr 21, 2026
5 of 7 checks passed