Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#949

Closed

MahanFathi wants to merge 43 commits intomainfrom

mfathi/turing_envs

Contributor

MahanFathi commented Mar 24, 2026 •

edited

Loading

Important

This PR is a successor to #801, branch feature/nvidia-IF-bench-validators-integrations. The main changes were already there, this branch includes changes we needed to successfully test the environments.

Important note to @bxyu-nvidia: rollout_collection.py was changed in the abovelinked PR, which requires your careful review.

Summary of my changes

Dynamic judge URL discovery — Added judge_server_name config field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use of local_vllm_model (which actually spins up vLLM via Ray) as the judge server type.
Lazy import fix in profiling.py — Moved gprof2dot/pydot imports inside dump() to avoid ModuleNotFoundError on Ray workers where profiling deps aren't installed.
Configurable reward aggregation — Replaced hard-coded all-or-nothing (all) reward with a configurable aggregation_mode supporting all, any, mean, min, and max. Default remains all (no behavior change).
Thinking trace stripping — Added helpers to skip type="reasoning" output items and strip <think>/<thinking> tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.
Judge prompt restructuring — Reordered LLM_JUDGE_QUESTION_PROMPT to present conversation context before the model response, and replaced fragile JSON output format with robust [[YES]]/[[NO]] bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.
Configurable judge sampling parameters — Exposed judge_temperature (default 0.7), judge_top_p (default 0.8), and judge_max_tokens (default 10000) as config fields, replacing previously hardcoded values.
Config & docs updates — Base turing_vif.yaml updated with new fields; README documented aggregation_mode.

All changes are backwards-compatible.

Checklist

Ran successful experiments with Multichallenge and InverseIFEval datasets on Nano-v3
@abukharin-nv is wrapping up his experiments using the same env on CFBench and SysBench
Gitlab issues for training datasets have been created (to my knowledge MC and IIFEval are approved by legal)

dhrutisundar-turing and others added 30 commits

January 7, 2026 14:52


          integrate VIF validators and add test jsonl files

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Add env.example

2160f2f

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Update How to start.md

adb12b5

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Flagging validation issues and writing them into error.json

8ae2d61

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Added comments

488415d

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Merge pull request #2 from dhrutisundar-turing/validation-flagging

dac8825

Flagging validation issues and writing them into error.json

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          ADD pass criteria support

ab68bb6

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          ADD multi lang support

e473520

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Merge pull request #3 from dhrutisundar-turing/IFTL-218-multi-lang-su…

b847bfe

…pport

[IFTL-218] Multi-Lang Support

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          separated the validators into language folders

b7ae788

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Code cleaning

e360b95

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          verified with all tests in the official guide

42f537c

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Removing unsupported instruction from language validators.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          fixed ascii display for non-english language

7f3f786

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Addinf SPDX header to all python files

9a870f1

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          code cleanup

0def99f

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Merge branch 'fixes/lang_validator' of https://github.com/dhrutisunda…

5062eca

…r-turing/Nvidia-gym-turing into fixes/lang_validator

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          cleaned up licence headers

19ad4d5

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Merge pull request #5 from dhrutisundar-turing/fixes/lang_validator

b5ff81a

Fixes/lang validator

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

a4dc771

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Rollout collection tutorial fixes (#790)

41e8e51

Remove the "Where Do Reward Scores Come From?" note that implied custom
verification logic is optional. Also fix tutorial goals to match actual
content and correct the resource server name.

Fixes #776

Signed-off-by: Chris Wing <cwing@nvidia.com>


          docs: align tutorial time (#791)

644c3b4

change tutorial card est time from 45-90 to 30 mins as in the tutorial
itself

#780

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>


          docs: Move environment best practices from contributing to environmen…

7bda8c4

…t tutorials section (#785)

Signed-off-by: Brian Yu <bxyu@nvidia.com>

Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>


          fix: typos in verifiers agent readme (#755)

c0ebfa7

5927179

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

33d8dd3

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          fix: resolve rollout_collection.py test failures and missing imports

04fc646

- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          style: apply ruff format and update pre-commit hooks

b99f1ac

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          Merge branch 'main' of NVIDIA-NeMo/Gym into feature/nvidia-IF-bench-v…

274e388

…alidators-integrations


          chore: enable coverage for RolloutCollectionHelper and remove stale a…

20f4cd0

…rtifacts

- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          chore: add turing_vif to resource server table in README

9fa969f

Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

abubakaria56 and others added 12 commits

March 2, 2026 19:41


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

300218a


          chore: update gitignore, pre-commit versions, and test formatting

04dae74

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          revert: remove pyenv and gradio entries from .gitignore

065cf3a

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          feat(turing-vif): skip schema/language-error rollouts and write error…

14a8a9b

…s.json

Rollouts that return a schema_validation or language_compatibility error
are now excluded from results.jsonl, reward profiling, and agent metrics.
TuringVIFVerifyResponse gains a should_skip_rollout flag (set to True on
those two error paths) which rollout_collection.py reads to route the
result into errors.json instead of the main output. On resume_from_cache,
already-errored rollout keys are loaded from errors.json so they are not
re-run. The finish summary now prints successful vs skipped counts.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

8c25e20


          docs(rollout_collection): annotate turing_vif skip-rollout changes wi…

da5f71b

…th inline comments

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          refactor(rollout_collection): remove unnecessary sort from _preproces…

fda547b

…s_rows_from_config

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          feat: add judge_server_name for dynamic judge URL discovery in turing…

9db175e

…_vif

- Add optional judge_server_name config field to TuringVIFResourcesServerConfig.
  When set, _get_judge_client() discovers the judge URL at runtime via
  get_server_url() from the NeMo Gym server registry instead of using
  a hardcoded judge_base_url. Enables use of local_vllm_model which
  actually spins up a vLLM instance (unlike vllm_model where
  spinup_server was silently ignored).
- Make gprof2dot and pydot imports lazy in profiling.py to avoid
  ModuleNotFoundError on Ray worker nodes.


          feat(turing_vif): add configurable reward aggregation mode

a0241a5

Replace hard-coded all-or-nothing reward (AND over all checks) with a
configurable aggregation_mode field supporting five modes: all, any,
mean, min, max. Default remains 'all' for backwards compatibility.

The new _aggregate_scores() method converts per-check boolean verdicts
into float scores and combines them according to the configured mode,
enabling denser reward signals (e.g. mean) for RL training.


          fix(turing_vif): strip thinking traces from policy and judge responses

0d67e5b

Add _extract_text_from_response() that finds the last assistant message
(skipping reasoning output items) and strips <think>/<thinking> blocks,
including unpaired closing tags from prompt template boundaries.
Apply _strip_thinking_traces() to judge responses before JSON parsing.

Without this, thinking models' chain-of-thought leaked into fast
validators (inflating word counts, matching keywords in reasoning) and
LLM judge evaluations.


          fix: improve judge prompt robustness and add configurable sampling pa…

70fc126

…rams

Switch LLM judge verdict format from fragile JSON parsing to robust
[[YES]]/[[NO]] bracket markers with last-occurrence-wins extraction.
Reorder prompt template to place conversation context before model
response for more natural evaluation order. Add configurable
judge_temperature (0.7), judge_top_p (0.8), and judge_max_tokens
(10000) to TuringVIFResourcesServerConfig.


          Merge upstream/main into mfathi/turing_envs

6112e6b

Resolve conflicts in README.md (take upstream table format, keep turing_vif entry)
and nemo_gym/rollout_collection.py (merge turing_vif skip-rollout logic with
upstream's prompt config, progress tracking, and aggregate metrics features).

MahanFathi requested a review from bxyu-nvidia

March 24, 2026 17:40

MahanFathi assigned bxyu-nvidia

MahanFathi added the resources-server label

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


          Merge branch 'main' into mfathi/turing_envs

bfba47e

MahanFathi closed this

bxyu-nvidia deleted the mfathi/turing_envs branch

March 26, 2026 22:43

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

resources-server