Skip to content

Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#949

Closed
MahanFathi wants to merge 43 commits intomainfrom
mfathi/turing_envs
Closed

Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#949
MahanFathi wants to merge 43 commits intomainfrom
mfathi/turing_envs

Conversation

@MahanFathi
Copy link
Copy Markdown
Contributor

@MahanFathi MahanFathi commented Mar 24, 2026

Important

This PR is a successor to #801, branch feature/nvidia-IF-bench-validators-integrations. The main changes were already there, this branch includes changes we needed to successfully test the environments.

Important note to @bxyu-nvidia: rollout_collection.py was changed in the abovelinked PR, which requires your careful review.

Summary of my changes

  • Dynamic judge URL discovery — Added judge_server_name config field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use of local_vllm_model (which actually spins up vLLM via Ray) as the judge server type.
  • Lazy import fix in profiling.py — Moved gprof2dot/pydot imports inside dump() to avoid ModuleNotFoundError on Ray workers where profiling deps aren't installed.
  • Configurable reward aggregation — Replaced hard-coded all-or-nothing (all) reward with a configurable aggregation_mode supporting all, any, mean, min, and max. Default remains all (no behavior change).
  • Thinking trace stripping — Added helpers to skip type="reasoning" output items and strip <think>/<thinking> tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.
  • Judge prompt restructuring — Reordered LLM_JUDGE_QUESTION_PROMPT to present conversation context before the model response, and replaced fragile JSON output format with robust [[YES]]/[[NO]] bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.
  • Configurable judge sampling parameters — Exposed judge_temperature (default 0.7), judge_top_p (default 0.8), and judge_max_tokens (default 10000) as config fields, replacing previously hardcoded values.
  • Config & docs updates — Base turing_vif.yaml updated with new fields; README documented aggregation_mode.

All changes are backwards-compatible.

Checklist

  • Ran successful experiments with Multichallenge and InverseIFEval datasets on Nano-v3
  • @abukharin-nv is wrapping up his experiments using the same env on CFBench and SysBench
  • Gitlab issues for training datasets have been created (to my knowledge MC and IIFEval are approved by legal)

dhrutisundar-turing and others added 30 commits January 7, 2026 14:52
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Flagging validation issues and writing them into error.json

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
…pport

[IFTL-218] Multi-Lang Support

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
…r-turing/Nvidia-gym-turing into fixes/lang_validator

Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Fixes/lang validator

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Remove the "Where Do Reward Scores Come From?" note that implied custom
verification logic is optional. Also fix tutorial goals to match actual
content and correct the resource server name.

Fixes #776

Signed-off-by: Chris Wing <cwing@nvidia.com>
change tutorial card est time from 45-90 to 30 mins as in the tutorial
itself

#780

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
…t tutorials section (#785)

Signed-off-by: Brian Yu <bxyu@nvidia.com>

Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
5927179

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
…rtifacts

- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
abubakaria56 and others added 12 commits March 2, 2026 19:41
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
…s.json

Rollouts that return a schema_validation or language_compatibility error
are now excluded from results.jsonl, reward profiling, and agent metrics.
TuringVIFVerifyResponse gains a should_skip_rollout flag (set to True on
those two error paths) which rollout_collection.py reads to route the
result into errors.json instead of the main output. On resume_from_cache,
already-errored rollout keys are loaded from errors.json so they are not
re-run. The finish summary now prints successful vs skipped counts.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…th inline comments

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…s_rows_from_config

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…_vif

- Add optional judge_server_name config field to TuringVIFResourcesServerConfig.
  When set, _get_judge_client() discovers the judge URL at runtime via
  get_server_url() from the NeMo Gym server registry instead of using
  a hardcoded judge_base_url. Enables use of local_vllm_model which
  actually spins up a vLLM instance (unlike vllm_model where
  spinup_server was silently ignored).
- Make gprof2dot and pydot imports lazy in profiling.py to avoid
  ModuleNotFoundError on Ray worker nodes.
Replace hard-coded all-or-nothing reward (AND over all checks) with a
configurable aggregation_mode field supporting five modes: all, any,
mean, min, max. Default remains 'all' for backwards compatibility.

The new _aggregate_scores() method converts per-check boolean verdicts
into float scores and combines them according to the configured mode,
enabling denser reward signals (e.g. mean) for RL training.
Add _extract_text_from_response() that finds the last assistant message
(skipping reasoning output items) and strips <think>/<thinking> blocks,
including unpaired closing tags from prompt template boundaries.
Apply _strip_thinking_traces() to judge responses before JSON parsing.

Without this, thinking models' chain-of-thought leaked into fast
validators (inflating word counts, matching keywords in reasoning) and
LLM judge evaluations.
…rams

Switch LLM judge verdict format from fragile JSON parsing to robust
[[YES]]/[[NO]] bracket markers with last-occurrence-wins extraction.
Reorder prompt template to place conversation context before model
response for more natural evaluation order. Add configurable
judge_temperature (0.7), judge_top_p (0.8), and judge_max_tokens
(10000) to TuringVIFResourcesServerConfig.
Resolve conflicts in README.md (take upstream table format, keep turing_vif entry)
and nemo_gym/rollout_collection.py (merge turing_vif skip-rollout logic with
upstream's prompt config, progress tracking, and aggregate metrics features).
@MahanFathi MahanFathi requested a review from bxyu-nvidia March 24, 2026 17:40
@MahanFathi MahanFathi added the resources-server Resources servers (math, code, etc.) label Mar 24, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@MahanFathi MahanFathi closed this Mar 24, 2026
@bxyu-nvidia bxyu-nvidia deleted the mfathi/turing_envs branch March 26, 2026 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

resources-server Resources servers (math, code, etc.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants