Hello,
First, thank you for your great work on the S2R paper. The methodology for teaching models to self-verify and self-correct is very insightful.
I'm writing to ask for clarification on the process-level reward mechanism, as I'm having trouble reconciling the paper's description with how I would expect a step-level reward to behave. The confusion centers on how an intermediate [solve] action (s_j) is evaluated for correctness.
The paper states that the reward for s_j is based on V_golden(s_j), which appears to compare the output of this intermediate step to the overall ground-truth answer of the problem.
This leads to a confusing scenario. For example, in a multi-step math problem, an intermediate step s_j might be to correctly calculate a value like "the length of side A is 5". However, the final ground-truth answer for the problem might be "the area is 25".
If the system compares the intermediate result ("5") to the final ground-truth answer ("25"), it would assign a negative reward to a perfectly correct and necessary intermediate step. This seems counter-intuitive for training the model to learn a valid reasoning process.
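To make the concern concrete, here is a minimal sketch of my reading of the paper. It is not taken from your code: `v_golden` and `step_reward` are hypothetical names, the exact string match and the +1/-1 reward values are my own assumptions.

```python
def v_golden(step_output: str, ground_truth: str) -> bool:
    """Hypothetical check: does the step's output equal the final answer?"""
    return step_output.strip() == ground_truth.strip()


def step_reward(step_output: str, ground_truth: str) -> int:
    """Assumed reward: +1 when the check passes, -1 otherwise."""
    return 1 if v_golden(step_output, ground_truth) else -1


ground_truth = "25"   # final answer: the area
intermediate = "5"    # correct intermediate result: the length of side A
final = "25"          # final step output

# One reward per step, both compared against the same final answer.
rewards = [
    step_reward(intermediate, ground_truth),
    step_reward(final, ground_truth),
]
print(rewards)
```

Under these assumptions, the correct intermediate step is penalized only because its value differs from the final answer, which is the behavior I would like to confirm or rule out.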
Could you please clarify how V_golden(s_j) is implemented for these intermediate steps? Is there a different mechanism for evaluating the correctness of an intermediate calculation that doesn't rely on comparing it to the final answer?
Thank you for your time and for sharing your valuable research.