add gsm8k_with_tool agent_loop as example #1543
Conversation
Force-pushed fe3009e to 0c2c2e9
Force-pushed 0c2c2e9 to 4cf124b
@claude review
```python
                },
                "required": ["answer"],
            },
        },
    },
]
```
Claude: Warning — "required" is nested inside "properties" instead of being a sibling
The JSON Schema for this tool definition is malformed. "required" should be at the same level as "properties", not nested inside it. This may cause the LLM to not understand the tool schema correctly.
Suggested change:

```python
"properties": {
    "answer": {
        "type": "string",
        "description": "The model's answer to the GSM8K math problem; must be digits only",
    },
},
"required": ["answer"],
```
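To make the fix concrete, here is a minimal sketch of the corrected tool schema (the names and description follow the diff above; the surrounding `function` wrapper is an assumption about the full definition). The key point is that `"required"` sits at the same level as `"properties"` inside the object schema:

```python
# Sketch of the corrected tool schema; structure around "parameters" is
# assumed from the diff, not copied from the PR verbatim.
tool_schema = {
    "type": "function",
    "function": {
        "name": "calc_gsm8k_reward",  # assumed tool name, per the diff below
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The model's answer to the GSM8K math problem; must be digits only",
                },
            },
            "required": ["answer"],  # sibling of "properties", not nested inside it
        },
    },
}

params = tool_schema["function"]["parameters"]
assert "required" in params                    # top level of the object schema
assert "required" not in params["properties"]  # the bug was nesting it here
```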
```python
_, function_calls = self.extract_tool_calls(rollout_state)
if not function_calls:
    break
```
```python
tool_messages = []
for function_call in function_calls:
    tool_name = function_call.name
    tool_args = function_call.arguments
    if tool_name == "calc_gsm8k_reward":
        answer = tool_args
        ground_truth = cast(dict, rollout_state.reward_model).get("ground_truth", "")
        function_results = self.calc_gsm8k_reward(answer, ground_truth)
        tool_message = {
            "role": "tool",
            "content": json.dumps({"result": function_results}, ensure_ascii=False),
        }
        tool_messages.append(tool_message)
```
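The dispatch pattern in this hunk can be sketched standalone. This is a hedged, simplified version: the `(name, arguments)` tuples and the reward stand-in are hypothetical placeholders for the real `function_call` objects and `self.calc_gsm8k_reward`:

```python
import json

def calc_gsm8k_reward(answer: str, ground_truth: str) -> float:
    # hypothetical stand-in for the real reward function
    return 1.0 if answer == ground_truth else 0.0

# stand-ins for parsed function calls: (tool_name, tool_args)
function_calls = [("calc_gsm8k_reward", "42")]
ground_truth = "42"

tool_messages = []
for tool_name, tool_args in function_calls:
    if tool_name == "calc_gsm8k_reward":
        result = calc_gsm8k_reward(tool_args, ground_truth)
        # each tool result becomes a "tool"-role chat message with a JSON body
        tool_messages.append({
            "role": "tool",
            "content": json.dumps({"result": result}, ensure_ascii=False),
        })

print(tool_messages)
```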
```python
# Process the tool-call outputs
tools_response_ids = self.tokenizer.apply_chat_template(tool_messages, remove_system_prompt=True)
cur_turn_tokens.extend(tools_response_ids)
final_response_ids.extend(tools_response_ids)
final_logprobs.extend([0.0] * len(tools_response_ids))
final_response_mask.extend([0] * len(tools_response_ids))
```
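The masking convention here is worth spelling out: tool-output tokens were not sampled from the policy, so they get placeholder logprobs of 0.0 and a mask of 0 so they are excluded from the loss. A minimal sketch with made-up token ids:

```python
# Hypothetical tokenized tool message appended after a model turn.
tools_response_ids = [101, 102, 103]

# State after one sampled model turn (values are illustrative).
final_response_ids: list[int] = [7, 8]
final_logprobs: list[float] = [-0.3, -1.2]
final_response_mask: list[int] = [1, 1]   # 1 = sampled by the policy

final_response_ids.extend(tools_response_ids)
final_logprobs.extend([0.0] * len(tools_response_ids))     # placeholders, not real logprobs
final_response_mask.extend([0] * len(tools_response_ids))  # 0 = excluded from the loss

assert len(final_response_ids) == len(final_logprobs) == len(final_response_mask)
```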
```python
tokens = cast(list[int], rollout_state.tokens)
raw_prompt_ids = len(tokens) - len(final_response_mask)
final_tokens = tokens[:raw_prompt_ids]
final_response_ids = tokens[raw_prompt_ids:]
rollout_state.tokens = final_tokens
rollout_state.response_ids = final_response_ids
rollout_state.response_mask = final_response_mask
rollout_state.logprobs = final_logprobs
rollout_state.response = self.tokenizer.decode(rollout_state.response_ids)
assert len(rollout_state.response_ids) == len(rollout_state.response_mask) == len(rollout_state.logprobs), (
```
Claude: Warning — final_response_ids is built up incrementally then overwritten
Lines 141 and 167-168 build final_response_ids by extending it across turns, but then line 174 overwrites it entirely with tokens[raw_prompt_ids:]. This means the carefully accumulated final_response_ids is discarded. This works because tokens is also being extended in lock-step, but the intermediate final_response_ids variable is misleading — it's only used for the length assertion at line 180. Consider simplifying by not accumulating final_response_ids at all, or adding a comment explaining the intent.
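The lock-step invariant the comment relies on can be demonstrated in isolation. This sketch (all values hypothetical) shows why slicing `tokens` at the original prompt length reproduces the accumulated response ids, which is what makes the overwrite safe, if confusing:

```python
prompt = [1, 2, 3]                    # hypothetical prompt token ids
tokens = list(prompt)
final_response_ids: list[int] = []

# Each turn appends the same ids to both lists (model output or tool output).
for turn in ([10, 11], [12], [13, 14, 15]):
    tokens.extend(turn)
    final_response_ids.extend(turn)

raw_prompt_len = len(tokens) - len(final_response_ids)
# The slice recovers exactly the accumulated response ids.
assert tokens[raw_prompt_len:] == final_response_ids
```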
```python
        response_labels.append(-100)
    else:
        response_labels.append(response_ids[idx])
```
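This hunk follows the standard convention of labelling masked-out positions with -100 so the cross-entropy loss ignores them. A self-contained sketch with illustrative ids:

```python
# Illustrative response ids and mask; 0 marks tool-output tokens.
response_ids = [11, 22, 33, 44]
response_mask = [1, 1, 0, 1]

response_labels = []
for idx, mask in enumerate(response_mask):
    if mask == 0:
        response_labels.append(-100)  # ignored by the loss
    else:
        response_labels.append(response_ids[idx])

print(response_labels)  # [11, 22, -100, 44]
```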
```python
# Compute new advantages based on response_mask
actual_advantages: list[float] = [advantages[i].item()] * len(prompt_ids)
for mask in response_mask:
```
Claude: Critical — actual_advantages includes prompt tokens, causing length mismatch with packing
actual_advantages is constructed with len(prompt_ids) entries for the prompt plus len(response_mask) entries for the response, then sliced with [:-1]. This means each element in advantages_list for a sample has length len(prompt_ids) + len(response_mask) - 1 = len(input_ids).
However, in _packing (controller.py), after this PR, advantage_list collects these per-sample lists and for padding appends pad_advantages = [-100] * pad_len. Then it flattens everything. The old code used repeat_interleave with cu_seq_lens_q to expand per-sample scalars to per-token values. The new code assumes each sample's advantage list is already per-token, but the packing code's pad_advantages uses pad_len (number of padding tokens) while the real samples use len(input_ids) tokens.
Please verify that the flattened advantage_flat length matches shifted_labels length (1, pack_max_length) in all cases. The old repeat_interleave approach guaranteed alignment; the new approach requires manual length tracking.
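The length check being requested can be sketched concretely. Under the assumptions stated in the comment (each sample contributes `len(prompt_ids) + len(response_mask) - 1` per-token advantages after the `[:-1]` slice, padding contributes `pad_len` entries of -100), the flattened list must come out to exactly `pack_max_length`:

```python
pack_max_length = 16  # illustrative packing length

# (prompt_ids, response_mask, scalar advantage) for two hypothetical samples
samples = [
    ([1, 2, 3], [0, 1, 1, 1], 0.5),
    ([4, 5], [1, 1, 1], -0.25),
]

advantage_list = []
total_tokens = 0
for prompt_ids, response_mask, adv in samples:
    per_token = [adv] * (len(prompt_ids) + len(response_mask))
    advantage_list.append(per_token[:-1])  # drop last entry, matching the label shift
    total_tokens += len(prompt_ids) + len(response_mask) - 1

pad_len = pack_max_length - total_tokens
advantage_list.append([-100.0] * pad_len)  # padding advantages, as in _packing

advantage_flat = [a for sub in advantage_list for a in sub]
assert len(advantage_flat) == pack_max_length  # the alignment the review asks to verify
```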
```python
continue
```
```diff
- prompt_ids = group[0].prompt_ids
+ prompt_ids = group[0].tokens
```
Claude: Warning — prompt_ids = group[0].tokens semantic change
The old code used group[0].prompt_ids (which was presumably just the prompt portion). The new code uses group[0].tokens, which in the agent loop is set to the full token sequence minus response. Please verify this is semantically equivalent in all code paths, especially for the non-tool (single-turn) agent loop where tokens might still include response tokens at this point.
Claude: Summary — This PR adds a GSM8K tool-calling agent loop for RL training, introducing multi-turn tool interaction support.
Issues: Critical, Warning, Nit
Verdict: REQUEST_CHANGES
Force-pushed 6ca55dd to dcfc945
```python
)

# 3. judger
judger_config = GSM8KRouterJudgerConfig(judger_name="openai/gsm8k")
```
In a follow-up PR, the router will become a generic class rather than one dedicated to a specific judge.
Force-pushed 515e725 to 92970be
```python
    return content, function_calls


async def generate_sample(self, rollout_state: RolloutState) -> RolloutState:
    # NOTE: in practice it is easy to forget to pass sample_params to rollout_state
```
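One defensive pattern for the NOTE above is to give the field a default so a forgotten argument falls back to sane values instead of failing later. This is a hedged sketch: `RolloutState` here is a hypothetical, simplified stand-in for the real class, and the default values are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RolloutState:  # hypothetical, simplified stand-in for the real RolloutState
    prompt: str
    # default_factory supplies per-instance defaults when the caller forgets sample_params
    sample_params: dict = field(default_factory=lambda: {"temperature": 1.0})

state = RolloutState(prompt="2 + 2 = ?")
print(state.sample_params)
```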
* add gsm8k_with_tool agent_loop as example
* fix claude comments
* fix haian comments
* add data_preprocess for gsm8k_with_tool
No description provided.