Add NewtonBench Resource Server #650
Conversation
can you please merge main?

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision language model, does anything require vision here (not an issue, just curious)?

DCO is failing, can you try to resolve that? https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#dco-and-commit-signing Also see here https://docs.nvidia.com/nemo/gym/latest/contribute/environments/new-environment.html#contribution-workflow

please also run pre-commit checks like ruff https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#pre-commit-hook-failures
cmunley1 left a comment:
need to pass dco and precommit
Thanks for checking.

Also, for DCO and pre-commit, we will handle those to ensure our pull request passes both checks.
Force-pushed 62303b5 to ae2236a
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Force-pushed ae2236a to 561beb2
The DCO and pre-commit checks are now passing.
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
shouldn't need a submodule; can you clone during setup_webserver or else pip install? Maybe like this: a4216fb (I didn't test anything but ...)
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Confirming I can run rollouts! Have you tried running training with NeMo RL?

Yup, we’ve implemented the automated cloning based on your suggestion, and all server functionalities remain intact and pass the tests. Thanks for the helpful feedback!
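For readers following along, the automated cloning discussed here (fetching NewtonBench during server setup rather than vendoring it as a git submodule) might look like the following minimal Python sketch. The function name, target directory handling, and use of `subprocess` are illustrative assumptions, not the PR's actual implementation:

```python
import subprocess
from pathlib import Path

NEWTONBENCH_REPO = "https://github.com/HKUST-KnowComp/NewtonBench"


def ensure_newtonbench(target_dir: Path) -> Path:
    """Clone NewtonBench on first use instead of vendoring it as a submodule.

    If target_dir already contains a git checkout, reuse it; otherwise do a
    shallow clone to keep setup fast.
    """
    if not (target_dir / ".git").exists():
        subprocess.run(
            ["git", "clone", "--depth", "1", NEWTONBENCH_REPO, str(target_dir)],
            check=True,
        )
    return target_dir
```

A `pip install` directly from the repository URL would be the alternative path mentioned in the review; the clone approach keeps the benchmark's data files on disk where the resource server can read them.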
# Contributing To NeMo-Gym (NewtonBench Resource Server)

## 1) Basic information

### i. Description of the environment

A resource server wrapping the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark.

- **Tasks:** 324 scientific law discovery tasks across 12 physics domains.
- **Observation Space:** Experimental results (numeric or structured dictionaries) returned after tool use.
- **Tools:**
  - `run_experiment`: Query the environment with specific parameters to receive physical observations.
  - `execute_python`: (Optional) Python code-assisted discovery for complex data analysis.
- **Server:** FastAPI resource server following NeMo Gym conventions.

### ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

- **Law Extraction:** Attempts to find a law within `<final_law>` tags in the assistant's final response.
- **Success Criteria:** Evaluates both **symbolic equivalence** (via an LLM judge) and **numeric accuracy** (Root Mean Square Logarithmic Error, RMSLE).
- **Reward Calculation:**
  - `reward = 0.3 * R_symbolic + 0.7 * R_numeric`.
  - $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
  - $R_{numeric} = 1.0 - (2.0 \cdot \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
- The `/verify` endpoint processes the agent's submission and returns these detailed performance metrics.

### iii. Description of the prompts/tasks (source + domain)

**Domain:** Maths (Scientific Law Discovery).

**Source:** Tasks and prompts adapted from the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

### iv. License information

- **Code:** Apache 2.0.
- **Data:** Apache 2.0.
- **NewtonBench Benchmark:** MIT (Copyright (c) 2025 HKUST-KnowComp).

## 2) Environment validity check

### i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

### ii. Resulting rollouts (5 examples)

See `resources_servers/newton_bench/data/example_rollouts.jsonl`.

**Expected behavior:**

- Agent performs several experiments, analyzes data, and submits a scientific law.
- Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
- Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

## 3) Tests

### i. Commands used to run the tests

```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

**Coverage notes:** Resource server tests provide comprehensive coverage of the following areas:

- **Session Lifecycle:** Successful seeding, error handling for invalid modules, session ending, and background cleanup.
- **Experiment Execution:** Dynamic handler registration for each module, basic `run_experiment` execution, and error handling for uninitialized sessions, mismatched module calls, etc.
- **Python Sandbox:** Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
- **Verification Logic:** Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.

## 4) Reward profiling

**Models:** Qwen/Qwen3-VL-8B-Thinking

**Method:**

- 108 prompts based on version v0 of the scientific laws.
- 4 rollouts per prompt (432 total).
- Tool calling of `run_experiment` enabled; the agent loops until law submission.

**Results:**

**Overall Metrics**

- **Total Rollouts:** 432
- **Mean Reward:** $\approx$ 0.0675
- **Median Reward:** 0.0
- **Min Reward:** $\approx$ -0.8786
- **Max Reward:** 1.0

**Tool Call Statistics**

- **Average Tool Calls:** 22.95 per rollout
- **Min Tool Calls:** 0
- **Max Tool Calls:** 1770
- **Correlation (tool calls $\leftrightarrow$ reward):** $\approx$ -0.0211 (weak negative correlation)

**Reward Distribution (Buckets)**

| Reward Range | Count |
|:---|:---|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

**Performance by Tool Call Count Bins**

| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

**Key observations:**

- **Symbolic Accuracy:** Approximately 19.7% symbolic accuracy and a widespread RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
- **Reward Distribution:** Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
- **Tool Usage Sweet Spot:** Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
- **Diminishing Returns:** Performance declines sharply after 50 calls; additional tool calls become detrimental, and successful discovery depends on reasoning and hypothesis selection rather than raw data volume.
---------

Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
