feat: planning task and megamind agent #679
Conversation
Manipulation O3DE Benchmark

TESTING NEW AGENTS

Testing new agents. The aim is to modify the plan-and-execute agent so it can do better. I test agents on 10 trivial tasks, single repeat. I know 10 scenarios are not the most representative, but more samples would take forever with this many different setups. LLMs were run on CPU. For reference, these are the results of a simple ReAct agent and the standard plan-and-execute agent from https://github.com/langchain-ai/langgraph/blob/main/docs/docs/tutorials/plan-and-execute/plan-and-execute.ipynb.
The standard agent from the link does not work as it is supposed to, because the executor tries to do many steps at once and returns only the last message as past_steps, which confuses the replanner. The replanner does not remove the 1st step, so the executor gets it again, and so on (version c393921). I modified the agent slightly, so that after every executor invoke the 1st step is removed from the plan (66caef9; a sketch of this fix follows below). But timeouts and recursion limits still occurred, as the executor wants to do everything. The main issues I noticed are shown in the attached screenshots.
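A minimal sketch of that plan-trimming fix (the state shape follows the linked tutorial; agent_executor and its binding are assumptions, in the real graph it would be bound with functools.partial so the node keeps LangGraph's single-argument signature):

```python
from typing import Any, List, Tuple

from typing_extensions import TypedDict


class PlanExecuteState(TypedDict):
    input: str
    plan: List[str]
    past_steps: List[Tuple[str, str]]


def execute_step(state: PlanExecuteState, agent_executor: Any) -> dict:
    # Execute only the first remaining step of the plan...
    task = state["plan"][0]
    response = agent_executor.invoke({"messages": [("user", task)]})
    return {
        "past_steps": [(task, response["messages"][-1].content)],
        # ...and drop it, so the replanner never re-issues an executed step.
        "plan": state["plan"][1:],
    }
```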
Solutions to try (screenshots attached):

It helped: the executor now in the majority of cases calls the tool matching the task, and sometimes still calls excessive tools, but less often (version 6d57ad9).
WAREHOUSE TASK

A mock of a warehouse state was created to test agents and models in a scenario somewhat similar to the use case of the demo; check out src/rai_bench/rai_bench/examples/tool_calling_demo.py. There is a state of the env that keeps track of the robot and object positions; based on that state, the tools return output. The tools are very simple mocks of navigation, detection, and manipulation tools.

DEVELOPING AGENTS

I tried to develop a new agent that will perform better in this scenario. I began with the classic plan & execute agent, but it had issues with reflecting on whether a step was done correctly. I also tested the classic supervisor agent, but it lacked planning capabilities. Then I started to modify it by adding and removing different nodes. It resulted in the plan & supervisor agent, which is a hybrid: a planner node makes a plan at the beginning, then the supervisor delegates each step to a certain executor node (like a navigation agent). After the executor node there is a structured output node which assesses whether the step was completed (bool). This info is brought back to the supervisor; if success, go on with the next step, if not, repeat (maybe change something). The problem is that the supervisor doesn't have replan capabilities, which matters when a step depends on what happened in previous steps. https://github.com/RobotecAI/rai/blob/jm/feat/demo-bench/src/rai_core/rai/agents/langchain/planner_supervisor.py

I tried to introduce more nodes to split the decision making even more, but I felt like more nodes made making a good decision on the next step harder. All these conclusions brought me to another idea - MegaMind B)

MegaMind merges the supervisor, planner, and replanner into one node. Instead of making the whole plan at the start, it makes only the first step, delegates it to the proper executor, then the structured output node marks success and provides a brief explanation of what was done. This explanation is then saved and provided to the megamind node as a step already done. Based on that, megamind comes up with the next step. Version f353762c9886e74d30d10a97f8954ae6296fe9af, https://github.com/RobotecAI/rai/blob/jm/feat/demo-bench/src/rai_core/rai/agents/langchain/megamind.py

With a lot of iterations of prompt adjustment, this is the best performing agent so far. To be fair, I tested all previous agents mostly with gpt-4o, but I assume that if an LLM does better with a certain agent, the rest of the models will too. gpt was performing promisingly on MegaMind, so I decided to test other models on this agent. It is worth noting that the executor LLM is qwen3:8b, but that is not a problem, as the tools are very simple.

MEGAMIND TESTING

Running on a GPU with 16GB VRAM; if a model does not fit in the VRAM, it is partially offloaded to CPU.

gpt-4o-mini - quite bad, gets stuck after around 4-5 steps, keeps doing the same steps in a loop.

qwen3:30b-a3b-instruct-2507-q8_0 - better, but sometimes forgets to drop before picking up another object and then starts to make mistakes; it can't salvage the scenario when a mistake was made earlier. Sometimes it manages to do the objective properly, but doesn't know when to stop afterwards. Takes from 8 to 15 sec per step.

gpt-oss:20b (partially on CPU) - slightly better than gpt-4o-mini, but still quite bad. In thinking mode it takes 45-60 seconds per step and sometimes just generates text instead of calling a tool. With thinking switched off it takes 20-60 seconds and uses tools properly. In both cases it starts to struggle after around 8-10 steps. When navigating, it keeps coming up with coords that are slightly off, which is weird (probably prompt related).

qwen3:30b (partially on CPU) - (thinking mode cannot be switched off?) Sometimes it just outputs a think block without the tool call, ending the scenario, so it is hard to evaluate.

qwen3:14b - 4-8 seconds per step, gets stuck after ~10 steps, starts to loop around navigating back and forth.

Overall, all models, when they get stuck (or even qwen3:30b-a3b-instruct-2507-q8_0 when it manages to do the whole sequence properly), keep going until the recursion limit. They do not recognize that the objective was done; maybe it can be fixed via prompts.
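For reference, a minimal sketch of how such a state-backed mock tool can be structured (names are illustrative; the real mocks live in the file linked above):

```python
from typing import Optional, Tuple

from langchain_core.tools import tool


class WarehouseState:
    """Single source of truth; tools only read and update it."""

    def __init__(self) -> None:
        self._robot_position: Tuple[float, float] = (0.0, 0.0)
        self.held_object: Optional[str] = None

    def get_position(self) -> Tuple[float, float]:
        return self._robot_position

    def set_position(self, x: float, y: float) -> None:
        self._robot_position = (x, y)


state = WarehouseState()


@tool
def navigate(x: float, y: float) -> str:
    """Drive the robot base to the given world coordinates."""
    state.set_position(x, y)
    return f"Navigated to ({x}, {y})."


@tool
def where_am_i() -> str:
    """Return the robot's current position."""
    x, y = state.get_position()
    return f"Robot is at ({x}, {y})."
```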
```python
robot_x, robot_y = self.get_position()
# Find which box we're dropping into
for box_id, box_data in self._boxes.items():
    if relative_pos == box_data["relative"]:
```
Based on my observations, this check failed most of the time.
Just out of curiosity, why does this require a relative position? Is it coming from a VLM for the real-world application? For qwen3:8b, you can see the agent rationalize how to deal with this relative position; it was kinda funny.
It is relative to the robot, as manipulation operates on coordinates relative to the arm.
```python
    box_data["world_position"][1] + relative_pos[1],
)
```
```python
self._state["held_object"] = None
```
This is a bit unexpected. Basically, as long as drop_object is called, we consider the object dropped, regardless of whether any box was found to drop it into.
This variable says whether an object is currently held in the gripper, regardless of where it was dropped, or even of whether an object was in the gripper before this function was called.
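A condensed sketch of the semantics in question, rewritten as a free function for brevity (field names are assumptions, not the exact source):

```python
def drop_object(state: dict, boxes: dict, relative_pos: tuple) -> str:
    # Try to find which box we're dropping into (there may be none)
    target_box = None
    for box_id, box_data in boxes.items():
        if relative_pos == box_data["relative"]:
            target_box = box_id
            break
    # held_object only tracks whether the gripper holds something,
    # so it is cleared regardless of where the object landed
    dropped = state["held_object"]
    state["held_object"] = None
    if target_box is None:
        return f"Dropped {dropped}, but not into any box."
    return f"Dropped {dropped} into {target_box}."
```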
```python
[
    SystemMessage(
        content=f"""
Analyze if this task was completed successfully:
```
This is very helpful for the agent to track the progress. As an exploration idea, what would you think about introducing a status_summary_prompt_template that tasks could optionally provide for more structured status tracking? For instance, in the warehouse scenario, such a template might help the agent better understand slot-level progress: whether the slot has been checked, the object picked up and sorted, etc.
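For illustration, such a hypothetical status_summary_prompt_template for the warehouse scenario might look like this (the name and hook are assumptions, not an existing API):

```python
STATUS_SUMMARY_PROMPT_TEMPLATE = """\
Determine success and report progress per slot in this format:
Slot <n>: [NAVIGATED | DETECTED | PICKED UP | DROPPED] \
Object status: <COLOR or UNKNOWN - NEEDS CHECKING>

Below you have messages of the agent doing the task:
"""


def get_status_summary_prompt(default_prompt: str, task) -> str:
    # Fall back to the generic "analyze if this task was completed" prompt
    # when a task does not provide its own template.
    return getattr(task, "status_summary_prompt_template", default_prompt)
```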
That sounds like a good idea. I will test it when I have time, or maybe you want to test it?
See my comments below on introducing planning_prompt.
```python
For example for the navigation tasks include the final location.
For manipulation the coordinates of objects that have been detected, picked up from or dropped to.
Below you have messages of agent doing the task:"""
),
```
BTW, I changed this prompt to "Determine success and provide brief explanation of what happened by slot, for example Slot 2: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING.
Below you have messages of agent doing the task:"
This is the output from qwen3:8b:
Steps that were already done successfully:
1. Slot 1: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING
2. Slot 1: [DETECTED] Object status: BLUE - IDENTIFIED
3. Slot 1: [PICKED UP] Object status: SUCCESSFUL - Object was picked up and transported to the first box.
4. Slot 2: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING.
5. Slot 2: [OBJECT_DETECTED] Object status: RED - NO ACTION REQUIRED
6. Slot 3: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING
7. Slot 3: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING.
8. Slot 3: [PICKED UP] Object status: GREEN - [DROPPED] in second box at x: 5.0, y: 5.0.
9. Slot 4: [NAVIGATED] Object status: UNKNOWN - NEEDS CHECKING
10. Slot 4: [SUCCESS] Object status: GREEN - DROPPED IN SECOND BOX
Oh, that seems very helpful for the megamind node. It has to be adjusted to every scenario though, so I can't put it in the framework.
Yep, the above example was just to experiment with a structured status template and see its effect. It is scenario-dependent, and we could construct it dynamically if the task/scenario provides it.
I tried something similar recently: giving the agent current info about all slots in the prompt every time it was its turn to think, and it worked! I think it can be helpful, as the agent can understand the "map" of the environment better.
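A rough sketch of that idea: regenerate a compact snapshot of all slots and inject it into the prompt before every megamind turn (hypothetical names and state shape):

```python
def render_slot_map(slots: dict) -> str:
    # Compact, always-current "map" injected before each thinking turn
    lines = ["Current environment state:"]
    for slot_id, data in slots.items():
        obj = data.get("object") or "empty/unknown"
        checked = "checked" if data.get("checked") else "not checked yet"
        lines.append(f"- {slot_id}: {obj} ({checked})")
    return "\n".join(lines)


slots = {
    "slot_1": {"object": "blue_cube", "checked": True},
    "slot_2": {"object": None, "checked": False},
}
print(render_slot_map(slots))
# Current environment state:
# - slot_1: blue_cube (checked)
# - slot_2: empty/unknown (not checked yet)
```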
```python
return [
    self.pick_up_object,
    self.drop_object,
    self.ask_vlm,
```
Based on my experiments, adding nav_tool and where_am_i to manipulation_tools improved results with Qwen3:8b. The issue was that once an object is picked up, Megamind was unable to transfer control back to the navigation specialist. This resulted in drop_object being called before navigating to the correct box location.
Yes, I encountered this issue too. I think it is hard for it to complete a sequence like pick, navigate, and drop when these are split across different agents. On the other hand, in a real-world scenario manipulation is a lot harder, and I still think that having a separate agent for that would be better. But I agree, when the tools are simple we can just make a "movement agent" (see the sketch below). I don't know how to make MegaMind better at executing sequences across different subagents. Maybe combining it with status prompts, like "Currently holding: Box", or something like that.
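A sketch of that toolset merge, mirroring the hunk above (the nav_tool and where_am_i names follow the experiment described; the method shape is an assumption):

```python
def manipulation_tools(self) -> list:
    return [
        self.pick_up_object,
        self.drop_object,
        self.ask_vlm,
        # navigation added so pick -> navigate -> drop stays in one specialist
        self.nav_tool,
        self.where_am_i,
    ]
```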
```python
},
)
validators = [
    #### navigate to slot1, detect and pick up 1st object
```
It is very challenging to write validators that work in a non-deterministic and hard scenario like this one.
I agree, but there is an optimal execution flow, and even qwen3 a3b sometimes managed to do it ;p

I refactored megamind to be more generic, like we talked about @boczekbartek. I made both a class and a function, because I didn't know which way is better, so tell me which you like more and I will delete the other.
"step_success": StepSuccess(success=False, explanation=""), | ||
"step_messages": [], | ||
} | ||
megamind_system_prompt=task.get_system_prompt(), |
We could also add a planning_prompt for megamind (example) and use it in structure_output_node to summarize the overall task progress based on the steps done. In my experiments, this helps the decision-making for agents using qwen3 fairly consistently. It would be provided by the bench task (example).
Note, the planning_prompt didn't help the agent with gpt-4o-mini that much; perhaps we need a different prompt for it. Not sure how gpt-4o works for you; for me, the results vary a lot.
I like this idea; I copied it to this branch here: ac60d58. A sketch of the wiring is below.
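A hedged sketch of how the planning_prompt could be threaded through (argument names are assumptions; ac60d58 holds the actual change):

```python
from typing import Optional


def build_status_prompt(
    base_prompt: str, planning_prompt: Optional[str] = None
) -> str:
    # When the bench task provides a planning_prompt, append it so the
    # structured-output node summarizes overall task progress instead of
    # only the last step.
    if planning_prompt is not None:
        return base_prompt + "\n" + planning_prompt
    return base_prompt
```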
```python
self, input: dict[str, Any], outputs: List[Any]
) -> None:
    input["messages"].extend(outputs)
    input["step_messages"].extend(outputs)
```
These will be available to both megamind and the sub-task agents. Do we have concerns about providing "too" much context, i.e. context pollution? ATM, megamind is structured for task planning and the specialists carry out the actual steps. IMO, more focused context could help decision-making :). Do you have any observations on how context affects agent behavior?
These messages are indeed passed to the State, but the megamind LLM does not have them in its prompt, so I'm not sure what you mean by context in this case.
model_name = f"supervisor-{supervisor_name}_executor-{executor_name}" | ||
supervisor_llm = get_llm_for_benchmark(model_name=supervisor_name, vendor="ollama") | ||
executor_llm = get_llm_for_benchmark( | ||
model_name=executor_name, vendor="ollama", reasoning=False |
We'd want to take out reasoning=False. Like you said, we can't turn off qwen3 thinking, even with this. This code will fail for gpt-4o-*.
reasoning=False works for smaller qwen3 models, like 4b or 8b. It doesn't work, for example, with the bigger 30b, I don't know why.
But this is an example script; if you want to use gpt, just replace these lines. Maybe you are right that by default there should be a model which does not use this param, as it might confuse a new user. I applied it here: 2123d25. A sketch of the idea is below.
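A sketch consistent with that change: only pass reasoning=False for models known to honor it (the model allowlist and wrapper are assumptions; get_llm_for_benchmark is the helper used in the script above):

```python
def make_executor_llm(model_name: str, vendor: str):
    kwargs: dict = {"model_name": model_name, "vendor": vendor}
    # reasoning=False is honored by smaller qwen3 variants (4b/8b), ignored
    # by qwen3:30b, and rejected by gpt-4o-*, so pass it selectively.
    if vendor == "ollama" and model_name in ("qwen3:4b", "qwen3:8b"):
        kwargs["reasoning"] = False
    return get_llm_for_benchmark(**kwargs)
```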
Sorry, my bad, I mistakenly closed this PR by clicking the wrong button.
As the code stands now, it looks solid overall! I have two minor suggestions which could be addressed in subsequent PRs:
- The SortTask validators might be too strict for tool calling validation, which causes the scoring to be inaccurate.
- Adding some validation for the megamind decision-making process would be a nice enhancement.
```python
}


class SortTask(Task):
```
An alternative name is SortingTask.
@jmatejcz +1
Applied here: ac60d58.
```python
]

def get_system_prompt(self) -> str:
    return SYSTEM_PROMPT + "\n" + WAREHOUSE_ENVIRONMENT_DESCRIPTION
```
I also added a couple of helper functions for my experiments that might be useful to incorporate, your choice.
- report_sorting_status - outputs the sorted objects from boxes
- get_slot_position - helps the agent get back on track.
Yeah, sure, they will definitely be useful; I added them here: c98a3f7.
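Hedged sketches of the two helpers as described above (the merged versions are in c98a3f7; the exact signatures there may differ):

```python
def report_sorting_status(self) -> str:
    """Output the sorted objects currently in each box."""
    lines = []
    for box_id, box_data in self._boxes.items():
        contents = ", ".join(box_data["contents"]) or "empty"
        lines.append(f"{box_id}: {contents}")
    return "\n".join(lines)


def get_slot_position(self, slot_id: str) -> tuple:
    """Return a slot's world position, so a lost agent can get back on track."""
    return self._slots[slot_id]["world_position"]
```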
Please use the function convention and move the pure langgraph agent to […]. Also it would be good to avoid making nested functions and move […].
Thanks for the info, applied here: 7ce2cf5.
@jmatejcz I left some minor comments - please have a look. Overall the PR looks good and I think it's ready to be merged.
I tested with tool_calling_custom_agent.py and got this score:
TASK SCORE: 0.2, TOTAL TIME: 50.188
```python
)
return {
    "past_steps": [(task, agent_response["messages"][-1].content)],
    # "plan": plan[1:], # removing the step that was executed
```
Remove the comment if it's not needed.
fixed here f365ecd
```python
def execute_step(state: PlanExecuteState):
    """Execute the current step of the plan."""
    # TODO (jmatejcz) should we pass the whole plan or only a single step to the executor?
```
Is this TODO required?
fixed here f365ecd
```python
class PlanExecuteState(ReActAgentState):
    """State for the plan and execute agent."""

    # TODO (jmatejcz) should original_task be replaced with
```
How about this TODO? Change it to NOTE if it's just a mention for the future, or remove it.
fixed here f365ecd
```diff
@@ -0,0 +1,277 @@
+# Copyright (C) 2024 Robotec.AI
```
```diff
- # Copyright (C) 2024 Robotec.AI
+ # Copyright (C) 2025 Robotec.AI
```
fixed here f365ecd
Thank you! I won't change the validation or add more tasks in this PR; it can be done in the future.
Co-authored-by: Julia Jia"
Thank you @Juliaj for the attention and comments on this PR. I'm merging it, but we can continue the discussion here or anywhere you like.
feat: tool calling benchmark unified across types and prompts variety… (#620) feat: basic tasks extension (#644) feat: tool calling custom interfaces tasks extension (#636) feat: tool calling spatial reasoning tasks extension (#637) refactor: remove navigation tasks (#638) refactor: o3de config (#630) refactor(nav2_toolkit): remove unused action_client (#670) fix: manipulaiton bench fixes (#653) docs: rai simbench docs update (#665) feat: planning task and megamind agent (#679) feat: megamind context providers (#687) feat: tool calling bench - manipulation tasks extenstion (#656) chore: resolving conflicts (#690) Co-authored-by: Jakub Matejczyk <58983084+jmatejcz@users.noreply.github.com> Co-authored-by: Julia Jia <juliajster@gmail.com> Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai> Co-authored-by: Pawel Kotowski <pawel.kotowski@oleahealth.ai> Co-authored-by: Brian Tuan <btuan@users.noreply.github.com> Co-authored-by: jmatejcz <jakub.matejczyk@robotec.ai>
Purpose
Add a new task that demands planning, replanning, and remembering.
Develop and test agents that perform better in such tasks than a plain ReAct agent.
Proposed Changes
A new task type for the tool calling benchmark, SortTask; it is complicated and long, so there is no need for more for now.
Added new agents:
It has subagents, like a supervisor, after which there is a structured output attacher which determines whether a step has been done successfully and explains what happened. That is fed back to the megamind node.
The megamind node itself is like a planner, but instead of developing the whole plan once at the start, it develops just the next step, taking into consideration the response from the output attacher, the previous steps, and the overall objective.
It performs better at replanning than the agent designed for that, Plan&Execute. A high-level sketch follows.
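A high-level, runnable sketch of this loop in LangGraph (node and state names are illustrative stubs, not the actual implementation in rai_core):

```python
from typing import List

from langgraph.graph import StateGraph, START, END
from typing_extensions import TypedDict


class MegamindState(TypedDict):
    objective: str
    steps_done: List[str]  # brief explanations from the structured output node
    current_step: str
    finished: bool


def megamind_node(state: MegamindState) -> dict:
    # Decide only the NEXT step from the objective plus steps already done,
    # instead of producing a full plan up front.
    done = len(state["steps_done"])
    if done >= 3:  # stand-in for "objective achieved" detection
        return {"finished": True}
    return {"current_step": f"step {done + 1}", "finished": False}


def executor_node(state: MegamindState) -> dict:
    # A delegated specialist (e.g. a navigation agent) would execute
    # state["current_step"] here; stubbed out for the sketch.
    return {}


def structured_output_node(state: MegamindState) -> dict:
    # Mark success and record a brief explanation of what was done.
    return {"steps_done": state["steps_done"] + [f"completed {state['current_step']}"]}


graph = StateGraph(MegamindState)
graph.add_node("megamind", megamind_node)
graph.add_node("executor", executor_node)
graph.add_node("structured_output", structured_output_node)
graph.add_edge(START, "megamind")
graph.add_conditional_edges(
    "megamind", lambda s: END if s["finished"] else "executor"
)
graph.add_edge("executor", "structured_output")
graph.add_edge("structured_output", "megamind")
app = graph.compile()

print(app.invoke(
    {"objective": "sort objects", "steps_done": [], "current_step": "", "finished": False}
))
```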
Additional small things:
THE RESULTS OF THE TESTS ARE IN THE COMMENTS.
The commits linked in the results could be outdated, as the testing was done some time ago.
Issues
#642
Testing