From looking at the grpo_env_trainer, it seems that the reward is computed over the whole multi-turn conversation. I'm curious whether enabling reward computation for each conversation prefix "up to turn K" would be helpful. Nonsense warning below :)
The basic idea is to expand a conversation of N turns into N entries, each containing the conversation up to turn K, for K = 1 to N. Currently the reward design calculates, say, the proportion of assistant messages that pass a format check over the whole conversation. There could be scenarios where the model breaks a rule (e.g. reasoning XML tag formatting) at, say, turn 3; in that case it seems helpful to assign a good reward to the copies of the conversation up to turn 1 and turn 2 and a punishment to the copy up to turn 3, to highlight the contrast and show where the mistake is happening. This approach could give more signal by sort of "narrowing the search space": currently we directly optimize a path of length N, so the reward is rather sparse, but with the expansion we get signal even for paths of length 1.
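To make the expansion concrete, here's a minimal sketch of what I have in mind. The `expand_prefixes` and `format_reward` helpers are purely illustrative (not part of the trainer's actual API), and the format check is a toy stand-in for a real rubric:

```python
# Hypothetical sketch of the "prefix expansion" idea. The helper names
# and conversation structure are illustrative, not the trainer's API.

def expand_prefixes(conversation):
    """Expand an N-turn conversation into N entries, each truncated
    right after the K-th assistant turn (K = 1..N)."""
    prefixes = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant":
            prefixes.append(conversation[: i + 1])
    return prefixes

def format_reward(entry):
    """Toy format check: reward 1.0 if every assistant message in the
    prefix contains a <reasoning> tag, else -1.0 as a punishment."""
    assistant_msgs = [m for m in entry if m["role"] == "assistant"]
    ok = all("<reasoning>" in m["content"] for m in assistant_msgs)
    return 1.0 if ok else -1.0

conv = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "<reasoning>a1</reasoning>"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "<reasoning>a2</reasoning>"},
    {"role": "user", "content": "q3"},
    {"role": "assistant", "content": "a3"},  # breaks the rule at turn 3
]

rewards = [format_reward(p) for p in expand_prefixes(conv)]
print(rewards)  # [1.0, 1.0, -1.0]
```

The contrast between the prefix-up-to-turn-2 reward (1.0) and the prefix-up-to-turn-3 reward (-1.0) is exactly the localized signal I'm describing above.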
Also a clarifying question: for multi-turn conversations, how exactly is the learning carried out? Does the model naturally learn to respond well in all of the "up to turn K" prefixes?