Question Regarding the Definition of "Intermediate Step" in Process-Level Rewards

Hello,

First, thank you for your great work on the S2R paper. The methodology for teaching models to self-verify and self-correct is very insightful.

I'm writing to ask for clarification on the process-level reward mechanism, as I'm having trouble reconciling the paper's description with how I would expect the reward to behave. The confusion centers on how an intermediate [solve] action (`s_j`) is evaluated for correctness.

The paper states that the reward for `s_j` is based on `V_golden(s_j)`, which appears to compare the output of this intermediate step to the overall ground-truth answer of the problem.

This leads to a confusing scenario. For example, in a multi-step math problem, an intermediate step `s_j` might correctly compute a value such as "the length of side A is 5", while the final ground-truth answer for the problem is "the area is 25".

If the system compares the intermediate result ("5") to the final ground-truth answer ("25"), it would assign a negative reward to a perfectly correct and necessary intermediate step. This seems counter-intuitive for training the model to learn a valid reasoning process.
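
To make the concern concrete, here is a minimal Python sketch of my reading. It is not taken from the S2R code: `v_golden`, `process_reward`, the exact-match comparison, and the +1/-1 reward values are all my own assumptions for illustration.

```python
def v_golden(step_output: str, ground_truth: str) -> bool:
    """Hypothetical check: exact match against the problem's FINAL answer.

    This is my interpretation of the paper's description, not the
    authors' implementation.
    """
    return step_output.strip() == ground_truth.strip()


def process_reward(step_output: str, ground_truth: str) -> float:
    """Assumed reward scheme: +1 if the check passes, -1 otherwise."""
    return 1.0 if v_golden(step_output, ground_truth) else -1.0


ground_truth = "25"        # final answer: "the area is 25"
intermediate_result = "5"  # correct intermediate value: "side A is 5"
final_result = "25"        # output of the last solve step

# Under this reading, the correct intermediate step is penalized,
# and only the step that produces the final answer is rewarded.
print(process_reward(intermediate_result, ground_truth))  # -1.0
print(process_reward(final_result, ground_truth))         # 1.0
```

If `V_golden` works roughly like this, every necessary intermediate calculation would receive a negative reward unless it happens to equal the final answer.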

Could you please clarify how `V_golden(s_j)` is implemented for these intermediate steps? Is there a different mechanism for evaluating the correctness of an intermediate calculation that doesn't rely on comparing it to the final answer?

Thank you for your time and for sharing your valuable research.
