
[RLVMR] Success rate definition in code does not match paper description #34

@luli-git

Description


Hi authors, thank you very much for the insightful work and for releasing your code! While replicating the RLVMR results on ScienceWorld, I found that the success rate metric as implemented in the code differs significantly from the definition stated in the paper. The paper defines success rate as "The percentage of tasks successfully completed by the agent on each evaluation split," but the code uses a far weaker condition that inflates this metric.

Code in Question

In envs.py, the won flag is defined as:

isCompleted = done
info["won"] = isCompleted and info["score"] > 0

And reward is computed as:

def compute_reward(info, multi_modal=False):
    reward = 10.0 * float(info['won'])
    return reward
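
To make this concrete, here is a hypothetical episode (the values are made up for illustration; only the logic matches the code quoted above): the agent hits the step limit after earning 2 points of partial credit, yet it is counted as a win and paid the full success bonus.

# Hypothetical timeout episode; the values are made up for illustration.
def compute_reward(info, multi_modal=False):      # as quoted above
    return 10.0 * float(info["won"])

done = True                 # episode ended at the step limit, not by completing the task
info = {"score": 2}         # 2 out of 100 points of partial credit

isCompleted = done
info["won"] = isCompleted and info["score"] > 0   # evaluates to True
print(info["won"], compute_reward(info))          # True 10.0 -- full success bonus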

The Problem

In ScienceWorld, done=True is triggered by three conditions:

  1. The agent completes the task (score = 100)
  2. The agent hits the step limit (score can be anything from 0 to 99)
  3. The agent fails catastrophically (score < 0, forced termination by ScienceWorld)

The > 0 threshold correctly filters out case 3 (catastrophic failures with negative scores). However, it still counts case 2 (step-limit timeouts) as successes whenever the agent has earned any partial credit at all, no matter how far it is from completing the task (e.g., score = 2 out of 100). A minimal sketch of how each case is classified follows below.
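
Here is the sketch (the episodes are hypothetical; only done and score follow the ScienceWorld conventions above), comparing the current > 0 check with a strict completion check:

# Minimal sketch: how each termination case is classified.
def won_current(done, score):
    # criterion as implemented in envs.py: any positive score at termination
    return done and score > 0

def won_strict(done, score):
    # strict criterion: the task must be fully completed
    return done and score >= 100

episodes = [
    ("task completed",             True, 100),
    ("step limit, partial credit", True, 2),
    ("step limit, no progress",    True, 0),
    ("catastrophic failure",       True, -100),
]

for case, done, score in episodes:
    print(f"{case:30s} current={won_current(done, score)!s:6s} strict={won_strict(done, score)}")

# Only "task completed" succeeds under the strict check, while
# "step limit, partial credit" is also counted by the current one.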

Empirical Evidence

When training with this code, I observed:

  1. Success rate (won) converged to ~100%, which is expected since nearly every terminated episode with even a small positive score counts as a "win".
  2. Meanwhile, the actual score (logged when won=True) decreased to ~2.1 out of 100, indicating the agent was not meaningfully completing tasks.

This confirms that the success rate metric is not measuring task completion.
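
For reference, this is roughly how I tracked the two quantities (the helper below is my own, not from the repo): logging the won flag and the raw ScienceWorld score separately makes the discrepancy visible.

# Rough logging sketch (my own helper, not from the repo).
won_flags, raw_scores = [], []

def log_episode(info):
    # record the success flag and the underlying 0-100 score for each finished episode
    won_flags.append(float(info["won"]))
    raw_scores.append(info["score"])

def summarize():
    n = max(len(won_flags), 1)
    print(f"won rate:           {sum(won_flags) / n:.3f}")
    print(f"mean score (0-100): {sum(raw_scores) / n:.1f}")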

Expected Behavior

A meaningful success rate definition should require actual task completion, e.g.:

info["won"] = isCompleted and info["score"] >= 100 # fully completed

Questions

  1. Is this an intentional design choice, or a bug (e.g., should > 0 be >= 100)?
  2. Were the results reported in the paper evaluated using this same code, or was a different success criterion used during evaluation?
