Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#949
Closed
MahanFathi wants to merge 43 commits intomainfrom
Closed
Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#949MahanFathi wants to merge 43 commits intomainfrom
MahanFathi wants to merge 43 commits intomainfrom
Conversation
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Flagging validation issues and writing them into error.json Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
…pport [IFTL-218] Multi-Lang Support Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
…r-turing/Nvidia-gym-turing into fixes/lang_validator Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Fixes/lang validator Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Remove the "Where Do Reward Scores Come From?" note that implied custom verification logic is optional. Also fix tutorial goals to match actual content and correct the resource server name. Fixes #776 Signed-off-by: Chris Wing <cwing@nvidia.com>
change tutorial card est time from 45-90 to 30 mins as in the tutorial itself #780 Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Christian Munley <cmunley@nvidia.com>
…t tutorials section (#785) Signed-off-by: Brian Yu <bxyu@nvidia.com> Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
5927179 Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
- Extract _preprocess_rows_from_config from duplicate run_from_config - Add missing imports: json, deepcopy, Union, Literal - Add return results to run_from_config - Remove dead _post_coroutine block (undefined server_client reference) Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
…alidators-integrations
…rtifacts - Remove # pragma: no cover from RolloutCollectionHelper class - Drop stale how_to_start.md entry from .gitignore - Delete tracked example_rollouts.jsonl generated artifact Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Auto-generated by update-readme-table pre-commit hook. Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
…s.json Rollouts that return a schema_validation or language_compatibility error are now excluded from results.jsonl, reward profiling, and agent metrics. TuringVIFVerifyResponse gains a should_skip_rollout flag (set to True on those two error paths) which rollout_collection.py reads to route the result into errors.json instead of the main output. On resume_from_cache, already-errored rollout keys are loaded from errors.json so they are not re-run. The finish summary now prints successful vs skipped counts. Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…th inline comments Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…s_rows_from_config Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…_vif - Add optional judge_server_name config field to TuringVIFResourcesServerConfig. When set, _get_judge_client() discovers the judge URL at runtime via get_server_url() from the NeMo Gym server registry instead of using a hardcoded judge_base_url. Enables use of local_vllm_model which actually spins up a vLLM instance (unlike vllm_model where spinup_server was silently ignored). - Make gprof2dot and pydot imports lazy in profiling.py to avoid ModuleNotFoundError on Ray worker nodes.
Replace hard-coded all-or-nothing reward (AND over all checks) with a configurable aggregation_mode field supporting five modes: all, any, mean, min, max. Default remains 'all' for backwards compatibility. The new _aggregate_scores() method converts per-check boolean verdicts into float scores and combines them according to the configured mode, enabling denser reward signals (e.g. mean) for RL training.
Add _extract_text_from_response() that finds the last assistant message (skipping reasoning output items) and strips <think>/<thinking> blocks, including unpaired closing tags from prompt template boundaries. Apply _strip_thinking_traces() to judge responses before JSON parsing. Without this, thinking models' chain-of-thought leaked into fast validators (inflating word counts, matching keywords in reasoning) and LLM judge evaluations.
…rams Switch LLM judge verdict format from fragile JSON parsing to robust [[YES]]/[[NO]] bracket markers with last-occurrence-wins extraction. Reorder prompt template to place conversation context before model response for more natural evaluation order. Add configurable judge_temperature (0.7), judge_top_p (0.8), and judge_max_tokens (10000) to TuringVIFResourcesServerConfig.
Resolve conflicts in README.md (take upstream table format, keep turing_vif entry) and nemo_gym/rollout_collection.py (merge turing_vif skip-rollout logic with upstream's prompt config, progress tracking, and aggregate metrics features).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important
This PR is a successor to #801, branch
feature/nvidia-IF-bench-validators-integrations. The main changes were already there, this branch includes changes we needed to successfully test the environments.Important note to @bxyu-nvidia:
rollout_collection.pywas changed in the abovelinked PR, which requires your careful review.Summary of my changes
judge_server_nameconfig field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use oflocal_vllm_model(which actually spins up vLLM via Ray) as the judge server type.profiling.py— Movedgprof2dot/pydotimports insidedump()to avoidModuleNotFoundErroron Ray workers where profiling deps aren't installed.all) reward with a configurableaggregation_modesupportingall,any,mean,min, andmax. Default remainsall(no behavior change).type="reasoning"output items and strip<think>/<thinking>tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.LLM_JUDGE_QUESTION_PROMPTto present conversation context before the model response, and replaced fragile JSON output format with robust[[YES]]/[[NO]]bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.judge_temperature(default 0.7),judge_top_p(default 0.8), andjudge_max_tokens(default 10000) as config fields, replacing previously hardcoded values.turing_vif.yamlupdated with new fields; README documentedaggregation_mode.All changes are backwards-compatible.
Checklist