From looking at the grpo_env_trainer, it seems that the reward is computed over the whole multi-turn conversation. I'm curious whether enabling reward computation for each conversation prefix "up to turn K" would be helpful. Nonsense warning below :)
The basic idea is to expand a conversation of N turns into N entries, each containing the conversation up to turn K, for K = 1 to N. Currently the reward design calculates, say, the proportion of assistant messages that pass a format check over the whole conversation. There could be scenarios where the model breaks a rule (e.g. reasoning XML tag formatting) at, say, turn 3; in that case it seems helpful to assign a good reward to the copies of the conversation up to turn 1 and turn 2 and a punishment to the copy up to turn 3, to highlight the contrast and show where the mistake is happening. This approach could give more signal by sort of "narrowing the search space": currently we directly optimize a path of length N, so the reward is rather sparse, but with the expansion we get signal even for paths of length 1.
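To make the expansion concrete, here's a minimal sketch of what I have in mind. The `expand_prefixes` and `format_reward` helpers are purely illustrative (not part of the trainer's actual API), and the format check is a toy stand-in for a real rubric:

```python
# Hypothetical sketch of the "prefix expansion" idea. The helper names
# and conversation structure are illustrative, not the trainer's API.

def expand_prefixes(conversation):
    """Expand an N-turn conversation into N entries, each truncated
    right after the K-th assistant turn (K = 1..N)."""
    prefixes = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant":
            prefixes.append(conversation[: i + 1])
    return prefixes

def format_reward(entry):
    """Toy format check: reward 1.0 if every assistant message in the
    prefix contains a <reasoning> tag, else -1.0 as a punishment."""
    assistant_msgs = [m for m in entry if m["role"] == "assistant"]
    ok = all("<reasoning>" in m["content"] for m in assistant_msgs)
    return 1.0 if ok else -1.0

conv = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "<reasoning>a1</reasoning>"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "<reasoning>a2</reasoning>"},
    {"role": "user", "content": "q3"},
    {"role": "assistant", "content": "a3"},  # breaks the rule at turn 3
]

rewards = [format_reward(p) for p in expand_prefixes(conv)]
print(rewards)  # [1.0, 1.0, -1.0]
```

The contrast between the prefix-up-to-turn-2 reward (1.0) and the prefix-up-to-turn-3 reward (-1.0) is exactly the localized signal I'm describing above.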
Also a clarifying question: for multi-turn conversations, how exactly is the learning carried out? Does the model naturally learn to respond well in all of the "up to turn K" prefixes?