Idea: Expanding N-turn conversations into N different "up to k turns" conversations and compute reward separately #34

@richardzhuang0412

Description

From looking at the grpo_env_trainer, it seems that the reward is computed over the whole multi-turn conversation. I'm curious whether enabling reward computation for each conversation "up to turn K" would be helpful. Nonsense warning below :)

The basic idea is to expand a conversation of N turns into N entries, each containing the conversation up to turn K, for K = 1 to N. Currently the reward design calculates, say for a format check, the proportion of assistant messages that pass. There could be scenarios where the model breaks a rule (e.g. reasoning XML tag formatting) at, say, turn 3; in that case it seems helpful to assign a good reward to the copies of the conversation up to turn 1 and turn 2, and a punishment to the copy up to turn 3, to highlight the contrast and where the mistake is happening. This approach could give more signal by narrowing the search space: currently we directly optimize a path of length N, so the reward is rather sparse, but with the expansion we get signal even for paths of length 1.
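To make the idea concrete, here is a minimal sketch of the prefix expansion. The `format_ok` check and the message dicts are hypothetical stand-ins, not the actual grpo_env_trainer API; the real reward function and message schema would differ.

```python
def format_ok(msg: dict) -> bool:
    # Hypothetical format rule: assistant replies must use a reasoning XML tag.
    return "<reasoning>" in msg["content"]

def expand_with_rewards(conversation: list[dict]) -> list[tuple[list[dict], float]]:
    """Expand an N-turn conversation into N 'up to turn K' prefixes,
    each paired with its own reward (here: based on the latest assistant turn)."""
    expanded = []
    prefix: list[dict] = []
    for msg in conversation:
        prefix.append(msg)
        if msg["role"] == "assistant":
            # Reward this prefix on its own: good while the format holds,
            # punished at exactly the turn where the rule is first broken.
            reward = 1.0 if format_ok(msg) else -1.0
            expanded.append((list(prefix), reward))
    return expanded

conv = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "<reasoning>A1</reasoning>"},
    {"role": "user", "content": "Q2"},
    {"role": "assistant", "content": "<reasoning>A2</reasoning>"},
    {"role": "user", "content": "Q3"},
    {"role": "assistant", "content": "A3 with no tags"},
]
# Prefixes up to turns 1 and 2 get +1.0; the prefix up to turn 3,
# where the formatting breaks, gets -1.0.
```

This is the contrast described above: the same trajectory yields three training entries instead of one, and the reward signal localizes the mistake to turn 3 rather than diluting it across the whole conversation.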

Also a clarifying question: for multi-turn conversations, how exactly is the learning carried out? Does the model learn to respond well in all "up to turn K" conversations naturally?
