
Add NewtonBench Resource Server #650

Merged
bxyu-nvidia merged 11 commits into NVIDIA-NeMo:main from Kelvin0110:cmunley1/newton
Feb 25, 2026

Conversation

@Kelvin0110
Contributor

Contributing To NeMo-Gym (NewtonBench Resource Server)

1) Basic information

i. Description of the environment

A resource server wrapping the NewtonBench benchmark

  • Tasks: 324 scientific law discovery tasks across 12 physics domains.
  • Observation Space: Experimental results (numeric or structured dictionaries) returned after tool use.
  • Tools:
    • run_experiment: Query the environment with specific parameters to receive physical observations.
    • execute_python: (Optional) Python code-assisted discovery for complex data analysis.
  • Server: FastAPI resource server following NeMo Gym conventions.
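To make the interaction loop concrete, here is a toy stand-in for one experiment module (hypothetical: the function signature, parameter names, and the particular law shift are invented for illustration; the real module interfaces live in the NewtonBench repo):

```python
import math

# Hypothetical stand-in for one NewtonBench module: a "shifted" law of
# gravitation whose exponent on r differs from the textbook value.
G = 6.674e-11

def run_experiment(m1: float, m2: float, r: float) -> dict:
    """Return a structured observation for the queried parameters."""
    force = G * m1 * m2 / r ** 3  # shifted exponent the agent must discover
    return {"params": {"m1": m1, "m2": m2, "r": r}, "force": force}
```

The agent's job is to vary the inputs, notice that the force scales as $1/r^3$ rather than $1/r^2$, and submit that law inside `<final_law>` tags.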

ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

  • Law Extraction: Attempts to find a law within <final_law> tags in the assistant's final response.
  • Success Criteria: Evaluates both symbolic equivalence (via an LLM judge) and numeric accuracy (root mean square logarithmic error, RMSLE).
  • Reward Calculation:
    • reward = 0.3 * R_symbolic + 0.7 * R_numeric.
      • $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
      • $R_{numeric} = 1.0 - (2.0 * \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
  • /verify endpoint processes the agent's submission and returns these detailed performance metrics.
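The scoring above can be sketched in a few lines (a sketch of the formulas only: the tag-extraction regex and helper names are assumptions, and the real verifier consults an LLM judge for the symbolic-equivalence flag rather than computing it locally):

```python
import math
import re
from typing import Optional

def extract_law(text: str) -> Optional[str]:
    """Pull the proposed law out of the last <final_law>...</final_law> block."""
    matches = re.findall(r"<final_law>(.*?)</final_law>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def rmsle(pred: list, true: list) -> float:
    """Root mean square logarithmic error between predictions and targets."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for p, t in zip(pred, true))
        / len(pred)
    )

def reward(symbolic_equivalent: bool, rmsle_value: float) -> float:
    r_symbolic = 1.0 if symbolic_equivalent else -1.0
    r_numeric = 1.0 - 2.0 * rmsle_value / (rmsle_value + 3.0)  # in (-1, 1]
    return 0.3 * r_symbolic + 0.7 * r_numeric
```

A numerically perfect but symbolically inequivalent law scores 0.3 · (-1) + 0.7 · 1.0 = 0.4; a fully correct law scores 1.0.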

iii. Description of the prompts/tasks (source + domain)

Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

iv. License information

  • Code: Apache 2.0.
  • Data: Apache 2.0.
  • NewtonBench Benchmark: MIT (Copyright (c) 2025 HKUST-KnowComp).

2) Environment validity check

i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

ii. Resulting rollouts (5 examples)

See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:

  • Agent performs several experiments, analyzes data, and submits a scientific law.
  • Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
  • Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

3) Tests

i. Commands used to run the tests

```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:

  • Session Lifecycle: Successful seeding, error handling for invalid modules, session ending, and background cleanup.
  • Experiment Execution: Dynamic handler registration for each module, basic run_experiment execution, and error handling for uninitialized sessions, mismatched module calls, and related failure modes.
  • Python Sandbox: Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
  • Verification Logic: Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.
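For the sandbox's import restrictions, a minimal AST-based check might look like the following (the blocklist and function name are illustrative, not the server's actual implementation):

```python
import ast

# Illustrative blocklist; the real sandbox's policy may differ.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}

def validate_code(source: str) -> bool:
    """Reject code that imports any module on the blocklist."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BLOCKED_MODULES
                   for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False
    return True
```

Walking the AST rather than grepping the source catches aliased imports (`import os as o`) and `from ... import ...` forms alike.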

4) Reward profiling

Model: Qwen/Qwen3-VL-8B-Thinking

Method:

  • 108 prompts based on version v0 of scientific laws.
  • 4 rollouts per prompt (432 total).
  • Tool calling of run_experiment enabled; the agent loops until it submits a law.

Results:
Overall Metrics

  • Total Rollouts: 432
  • Mean Reward: $\approx$ 0.0675
  • Median Reward: 0.0
  • Min Reward: $\approx$ -0.8786
  • Max Reward: 1.0

Tool Call Statistics

  • Average Tool Calls: 22.95 per rollout
  • Min Tool Calls: 0
  • Max Tool Calls: 1770
  • Correlation (tool calls $\leftrightarrow$ reward): $\approx$ -0.0211 (Weak negative correlation)

Reward Distribution (Buckets)

| Reward Range | Count |
|:---|---:|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |
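The bucket counts follow the usual half-open convention with the top bucket closed; a sketch of how such a histogram can be computed (the helper name is ours, not from the profiling code):

```python
from collections import Counter

def bucket_rewards(rewards, edges):
    """Count rewards into [lo, hi) buckets; the final bucket is [lo, hi]."""
    counts = Counter()
    for r in rewards:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= r < hi or (hi == edges[-1] and r == hi):
                counts[(lo, hi)] += 1
                break
    return counts
```

With edges from -1.0 to 1.0 in steps of 0.2, a reward of exactly 1.0 lands in the closed top bucket rather than falling off the end.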

Performance by Tool Call Count Bins

| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

Key observations:

  • Symbolic Accuracy: A symbolic accuracy of approximately 19.7% and a wide RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
  • Reward Distribution: Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
  • Tool Usage Sweet Spot: Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
  • Diminishing Returns: Performance declines sharply after 50 calls, suggesting that additional tool calls become detrimental and that successful discovery depends on reasoning and hypothesis selection rather than raw data volume.

@Kelvin0110 Kelvin0110 requested a review from a team as a code owner February 5, 2026 07:53
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@cmunley1 cmunley1 self-requested a review February 5, 2026 22:41
@cmunley1
Contributor

cmunley1 commented Feb 5, 2026

can you please merge main?

@Kelvin0110
Contributor Author

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

@cmunley1
Contributor

cmunley1 commented Feb 6, 2026

have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision language model, does anything require vision here (not an issue, just curious) ?


@cmunley1 cmunley1 left a comment

need to pass dco and precommit

@Kelvin0110
Contributor Author

Thanks for checking.
Our resource server doesn't require any vision. We selected Qwen/Qwen3-VL-8B-Thinking because this vision-language model provides stronger pure-text performance than the corresponding non-VL models (e.g., Qwen3-8B-Thinking). Since our tasks involve relatively complex reasoning, using the stronger model helps ensure a more stable and reliable reward distribution.

@newtdes

newtdes commented Feb 10, 2026

Also, for DCO and pre-commit, we will handle those to ensure our pull request passes both checks.

@Kelvin0110 Kelvin0110 force-pushed the cmunley1/newton branch 2 times, most recently from 62303b5 to ae2236a Compare February 16, 2026 12:05
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@Kelvin0110
Contributor Author

The DCO and pre-commit checks are resolved.
For the unit tests, some tests fail because they require NewtonBench to exist in the root directory. Currently, we assume users will clone NewtonBench manually; once it is cloned, all unit tests pass.
Hence, to ensure that the automated unit-test checks pass consistently, should we add NewtonBench as a submodule?

@Kelvin0110 Kelvin0110 requested a review from cmunley1 February 21, 2026 14:05
Kelvin0110 and others added 2 commits February 21, 2026 22:29
Signed-off-by: cmunley1 <cmunley@nvidia.com>
@cmunley1
Contributor

cmunley1 commented Feb 21, 2026

> The DCO and precommit checking are solved. For the Unit tests / Test, some tests fail because they require NewtonBench to exist in the root directory. Currently, we assume that users will manually clone NewtonBench themselves. When NewtonBench is cloned, all unit tests can pass successfully. Hence, to ensure that automated unit test checks pass consistently, should we add NewtonBench as a submodule?

shouldn't need submodule, can you clone during setup_webserver or else pip install?

Maybe like this: a4216fb (I didn't test anything but ng_test here.)
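One way to realize the clone-during-setup suggestion (a minimal sketch, assuming a setup hook that can run Python; the destination path and helper name are ours):

```python
import os
import subprocess

NEWTONBENCH_URL = "https://github.com/HKUST-KnowComp/NewtonBench"

def ensure_newtonbench(dest: str = "NewtonBench") -> str:
    """Clone NewtonBench on first setup if it is not already checked out."""
    if not os.path.isdir(dest):
        subprocess.run(
            ["git", "clone", "--depth", "1", NEWTONBENCH_URL, dest],
            check=True,
        )
    return dest
```

Because the clone is idempotent, the hook can run on every server start without re-downloading an existing checkout.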

@cmunley1 cmunley1 requested a review from bxyu-nvidia February 21, 2026 22:12
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@cmunley1
Contributor

```
Collecting rollouts: 100%|████████████████████████████████████████████| 1/1 [00:29<00:00, 29.57s/it]
{
    "reward": 0.39999999999999997,
    "noise_level": 0.0,
    "rmsle": 0.0,
    "exact_accuracy": 0.0,
    "symbolic_equivalent": 0.0
}
```

Confirming I can run rollouts!

Have you tried running training with NeMo RL?

@Kelvin0110
Contributor Author

Yup, we’ve implemented the automated cloning based on your suggestion, and all server functionalities remain intact and pass the tests. Thanks for the helpful feedback!
For the NeMo RL training, we haven’t run it yet. We’re currently applying for GPU resources for testing that.

@cmunley1
Contributor

i am able to run a few steps


@cmunley1 cmunley1 left a comment

@bxyu-nvidia to review

@bxyu-nvidia bxyu-nvidia merged commit 91f4912 into NVIDIA-NeMo:main Feb 25, 2026
5 checks passed
fsiino-nvidia pushed a commit that referenced this pull request Feb 26, 2026
fsiino-nvidia pushed a commit that referenced this pull request Feb 26, 2026
abubakaria56 pushed a commit to abubakaria56/Gym that referenced this pull request Mar 2, 2026
abubakaria56 pushed a commit to abubakaria56/Gym that referenced this pull request Mar 2, 2026