
Add NewtonBench Resource Server #650

Merged
bxyu-nvidia merged 11 commits into NVIDIA-NeMo:main from Kelvin0110:cmunley1/newton
Feb 25, 2026

Conversation

@Kelvin0110
Contributor

Contributing To NeMo-Gym (NewtonBench Resource Server)

1) Basic information

i. Description of the environment

A resource server wrapping the NewtonBench benchmark

  • Tasks: 324 scientific law discovery tasks across 12 physics domains.
  • Observation Space: Experimental results (numeric or structured dictionaries) returned after tool use.
  • Tools:
    • run_experiment: Query the environment with specific parameters to receive physical observations.
    • execute_python: (Optional) Python code-assisted discovery for complex data analysis.
  • Server: FastAPI resource server following NeMo Gym conventions.
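To make the interaction loop concrete, here is a toy stand-in for one experiment module (hypothetical: the function signature, parameter names, and the particular law shift are invented for illustration; the real module interfaces live in the NewtonBench repo):

```python
import math

# Hypothetical stand-in for one NewtonBench module: a "shifted" law of
# gravitation whose exponent on r differs from the textbook value.
G = 6.674e-11

def run_experiment(m1: float, m2: float, r: float) -> dict:
    """Return a structured observation for the queried parameters."""
    force = G * m1 * m2 / r ** 3  # shifted exponent the agent must discover
    return {"params": {"m1": m1, "m2": m2, "r": r}, "force": force}
```

The agent's job is to vary the inputs, notice that the force scales as $1/r^3$ rather than $1/r^2$, and submit that law inside `<final_law>` tags.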

ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

  • Law Extraction: Attempts to find a law within <final_law> tags in the assistant's final response.
  • Success Criteria: Evaluates both symbolic equivalence (via an LLM judge) and numeric accuracy (root mean square logarithmic error, RMSLE).
  • Reward Calculation:
    • reward = 0.3 * R_symbolic + 0.7 * R_numeric.
      • $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
      • $R_{numeric} = 1.0 - (2.0 * \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
  • /verify endpoint processes the agent's submission and returns these detailed performance metrics.
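The scoring above can be sketched in a few lines (a sketch of the formulas only: the tag-extraction regex and helper names are assumptions, and the real verifier consults an LLM judge for the symbolic-equivalence flag rather than computing it locally):

```python
import math
import re
from typing import Optional

def extract_law(text: str) -> Optional[str]:
    """Pull the proposed law out of the last <final_law>...</final_law> block."""
    matches = re.findall(r"<final_law>(.*?)</final_law>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def rmsle(pred: list, true: list) -> float:
    """Root mean square logarithmic error between predictions and targets."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for p, t in zip(pred, true))
        / len(pred)
    )

def reward(symbolic_equivalent: bool, rmsle_value: float) -> float:
    r_symbolic = 1.0 if symbolic_equivalent else -1.0
    r_numeric = 1.0 - 2.0 * rmsle_value / (rmsle_value + 3.0)  # in (-1, 1]
    return 0.3 * r_symbolic + 0.7 * r_numeric
```

A numerically perfect but symbolically inequivalent law scores 0.3 · (-1) + 0.7 · 1.0 = 0.4; a fully correct law scores 1.0.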

iii. Description of the prompts/tasks (source + domain)

Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

iv. License information

  • Code: Apache 2.0.
  • Data: Apache 2.0.
  • NewtonBench Benchmark: MIT (Copyright (c) 2025 HKUST-KnowComp).

2) Environment validity check

i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

ii. Resulting rollouts (5 examples)

See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:

  • Agent performs several experiments, analyzes data, and submits a scientific law.
  • Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
  • Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

3) Tests

i. Commands used to run the tests

```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:

  • Session Lifecycle: Successful seeding, error handling for invalid modules, session ending, and background cleanup.
  • Experiment Execution: Dynamic handler registration for each module, basic run_experiment execution, and error handling for uninitialized sessions, mismatched module calls, and related failure modes.
  • Python Sandbox: Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
  • Verification Logic: Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.
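For the sandbox's import restrictions, a minimal AST-based check might look like the following (the blocklist and function name are illustrative, not the server's actual implementation):

```python
import ast

# Illustrative blocklist; the real sandbox's policy may differ.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}

def validate_code(source: str) -> bool:
    """Reject code that imports any module on the blocklist."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BLOCKED_MODULES
                   for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False
    return True
```

Walking the AST rather than grepping the source catches aliased imports (`import os as o`) and `from ... import ...` forms alike.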

4) Reward profiling

Model: Qwen/Qwen3-VL-8B-Thinking

Method:

  • 108 prompts based on version v0 of scientific laws.
  • 4 rollouts per prompt (432 total).
  • Tool calling of run_experiment enabled; the agent loops until it submits a law.

Results:
Overall Metrics

  • Total Rollouts: 432
  • Mean Reward: $\approx$ 0.0675
  • Median Reward: 0.0
  • Min Reward: $\approx$ -0.8786
  • Max Reward: 1.0

Tool Call Statistics

  • Average Tool Calls: 22.95 per rollout
  • Min Tool Calls: 0
  • Max Tool Calls: 1770
  • Correlation (tool calls $\leftrightarrow$ reward): $\approx$ -0.0211 (Weak negative correlation)

Reward Distribution (Buckets)

| Reward Range | Count |
|:---|---:|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |
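The bucket counts follow the usual half-open convention with the top bucket closed; a sketch of how such a histogram can be computed (the helper name is ours, not from the profiling code):

```python
from collections import Counter

def bucket_rewards(rewards, edges):
    """Count rewards into [lo, hi) buckets; the final bucket is [lo, hi]."""
    counts = Counter()
    for r in rewards:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= r < hi or (hi == edges[-1] and r == hi):
                counts[(lo, hi)] += 1
                break
    return counts
```

With edges from -1.0 to 1.0 in steps of 0.2, a reward of exactly 1.0 lands in the closed top bucket rather than falling off the end.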

Performance by Tool Call Count Bins

| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

Key observations:

  • Symbolic Accuracy: A symbolic accuracy of approximately 19.7% and a wide RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
  • Reward Distribution: Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
  • Tool Usage Sweet Spot: Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
  • Diminishing Returns: Performance declines sharply after 50 calls, suggesting that additional tool calls become detrimental and that successful discovery depends on reasoning and hypothesis selection rather than raw data volume.

@Kelvin0110 Kelvin0110 requested a review from a team as a code owner February 5, 2026 07:53
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@cmunley1 cmunley1 self-requested a review February 5, 2026 22:41
@cmunley1
Contributor

cmunley1 commented Feb 5, 2026

can you please merge main?

@Kelvin0110
Contributor Author

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

@cmunley1
Contributor

cmunley1 commented Feb 6, 2026

have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision language model, does anything require vision here (not an issue, just curious) ?


@cmunley1 cmunley1 left a comment

need to pass dco and precommit

@Kelvin0110
Contributor Author

Thanks for checking.
Our resource server doesn't require any vision. We selected Qwen/Qwen3-VL-8B-Thinking because this vision-language model provides stronger pure-text performance than the corresponding non-VL models (e.g., Qwen3-8B-Thinking). Since our tasks involve relatively complex reasoning, using the stronger model helps ensure a more stable and reliable reward distribution.

@newtdes

newtdes commented Feb 10, 2026

Also, for DCO and pre-commit, we will handle those to ensure our pull request passes both checks.

@Kelvin0110 Kelvin0110 force-pushed the cmunley1/newton branch 2 times, most recently from 62303b5 to ae2236a Compare February 16, 2026 12:05
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@Kelvin0110
Contributor Author

The DCO and pre-commit checks are resolved.
For the unit tests, some tests fail because they require NewtonBench to exist in the root directory. Currently, we assume users will clone NewtonBench manually; once it is cloned, all unit tests pass.
Hence, to ensure that the automated unit-test checks pass consistently, should we add NewtonBench as a submodule?

@Kelvin0110 Kelvin0110 requested a review from cmunley1 February 21, 2026 14:05
Kelvin0110 and others added 2 commits February 21, 2026 22:29
Signed-off-by: cmunley1 <cmunley@nvidia.com>
@cmunley1
Contributor

cmunley1 commented Feb 21, 2026

> The DCO and precommit checking are solved. For the Unit tests / Test, some tests fail because they require NewtonBench to exist in the root directory. Currently, we assume that users will manually clone NewtonBench themselves. When NewtonBench is cloned, all unit tests can pass successfully. Hence, to ensure that automated unit test checks pass consistently, should we add NewtonBench as a submodule?

shouldn't need submodule, can you clone during setup_webserver or else pip install?

Maybe like this: a4216fb (I didn't test anything but ng_test here.)
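One way to realize the clone-during-setup suggestion (a minimal sketch, assuming a setup hook that can run Python; the destination path and helper name are ours):

```python
import os
import subprocess

NEWTONBENCH_URL = "https://github.com/HKUST-KnowComp/NewtonBench"

def ensure_newtonbench(dest: str = "NewtonBench") -> str:
    """Clone NewtonBench on first setup if it is not already checked out."""
    if not os.path.isdir(dest):
        subprocess.run(
            ["git", "clone", "--depth", "1", NEWTONBENCH_URL, dest],
            check=True,
        )
    return dest
```

Because the clone is idempotent, the hook can run on every server start without re-downloading an existing checkout.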

@cmunley1 cmunley1 requested a review from bxyu-nvidia February 21, 2026 22:12
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@cmunley1
Contributor

```
Collecting rollouts: 100%|████████████████████████████████████████████| 1/1 [00:29<00:00, 29.57s/it]
{
    "reward": 0.39999999999999997,
    "noise_level": 0.0,
    "rmsle": 0.0,
    "exact_accuracy": 0.0,
    "symbolic_equivalent": 0.0
}
```

Confirming I can run rollouts!

Have you tried running training with NeMo RL?

@Kelvin0110
Contributor Author

Yup, we’ve implemented the automated cloning based on your suggestion, and all server functionalities remain intact and pass the tests. Thanks for the helpful feedback!
For the NeMo RL training, we haven’t run it yet. We’re currently applying for GPU resources for testing that.

@cmunley1
Contributor

i am able to run a few steps


@cmunley1 cmunley1 left a comment

@bxyu-nvidia to review

@bxyu-nvidia bxyu-nvidia merged commit 91f4912 into NVIDIA-NeMo:main Feb 25, 2026
5 checks passed
fsiino-nvidia pushed a commit that referenced this pull request Feb 26, 2026
fsiino-nvidia pushed a commit that referenced this pull request Feb 26, 2026
abubakaria56 pushed a commit to abubakaria56/Gym that referenced this pull request Mar 2, 2026
abubakaria56 pushed a commit to abubakaria56/Gym that referenced this pull request Mar 2, 2026