CodeActAgent to delegate to BrowsingAgent #2103
Nice! The big test for delegates? 😅
Exciting stuff. At least in theory, delegating should work better.
👍👍👍👍 I have been expecting this feature for a long time. First, I want to confirm whether this means that when running
This is awesome! It's clean! I have one general question: what does the calling agent obtain from the subtask agent's execution? Is it just the final text message from the subtask agent (e.g., the browsing agent in this case), or should the calling agent also have access to more detailed information (e.g., HTML observations from the subtask) as part of its final observation?
@@ -101,7 +103,7 @@ def step(self, state: State) -> Action:
             isinstance(prev_action, MessageAction) and prev_action.source != 'user'
         ):
             # agent has responded, task finish.
-            return AgentFinishAction()
+            return AgentFinishAction(outputs={'content': prev_action.content})
Less of a review, more of a question:
Just curious, is this AgentFinishAction's outputs becoming the AgentDelegationObservation for the calling agent? What can the calling agent see after the subtask has finished? Is it just the LLM-generated finishing message, or does it also contain the history, observations, etc.? I am asking because the calling agent might want to browse some websites and expect HTML as the subtask's result, but in this case it would get an LLM summary of it, right?
is this AgentFinishAction's outputs becoming the AgentDelegationObservation for the calling agent?
Yep. It doesn't contain the history. I tried that initially and it didn't work well: CodeActAgent was not able to pass the test_browse_internet test. Likely, CodeActAgent got confused by the history because it didn't know about the browsing commands issued by BrowsingAgent.
We had a discussion around this topic here: #1910 (comment) and we didn't reach a conclusion, so this might change.
I believe there are pros and cons to the parent agent knowing about the child agent's history. For this delegation in particular, though, I don't see much benefit in CodeActAgent knowing about BrowsingAgent's history.
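To make the behaviour described above concrete, here is a minimal sketch of the hand-off. It assumes simplified stand-in classes: `FinishAction`, `DelegateObservation`, and `finish_delegate` are hypothetical names used for illustration only, not OpenDevin's actual API, and the history entries and message are made-up sample values. The point is just that only the child's finish outputs reach the parent, while the child's history stays behind.

```python
from dataclasses import dataclass, field


@dataclass
class FinishAction:
    """Hypothetical stand-in for the child agent's finish action."""
    outputs: dict = field(default_factory=dict)


@dataclass
class DelegateObservation:
    """Hypothetical stand-in for what the parent agent observes."""
    outputs: dict = field(default_factory=dict)


def finish_delegate(child_history: list, finish: FinishAction) -> DelegateObservation:
    # The child's history is deliberately dropped here: the parent only
    # receives a copy of the outputs attached to the finish action.
    return DelegateObservation(outputs=dict(finish.outputs))


# Illustrative sample values only.
child_history = ["goto(start page)", "click(button)"]
finish = FinishAction(outputs={"content": "The page revealed the answer."})
obs = finish_delegate(child_history, finish)
print(obs.outputs)
```

Under this sketch the parent never sees the browsing commands, which matches the reasoning above that an unfamiliar history may confuse the calling agent.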
Got it. I agree that for now we should keep it this way for simplicity and trust the child agent's ability to summarize enough useful information.
As for the use case: if the parent wants the final browser state, i.e., the raw contents (such as HTML or the axtree) after a series of actions, it might be useful to provide the parent agent with the final observation instead of only letting the LLM summarize a message back to it. Of course this has pros and cons, but that's one possible scenario where it could be needed.
Feel free to merge this! This design choice should probably be a separate issue.
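One way to picture the alternative raised here is a finish payload that can optionally carry the last raw observation next to the summary. This is only a sketch under assumptions: `build_finish_outputs` and the `final_observation` key are hypothetical and are not part of this PR, and the HTML strings are placeholder data.

```python
from typing import Dict, List


def build_finish_outputs(summary: str, observations: List[str],
                         include_final_observation: bool = False) -> Dict[str, str]:
    """Build the outputs dict a child agent could attach to its finish action."""
    outputs = {"content": summary}
    # Optionally attach the last raw observation (e.g. HTML or axtree text)
    # so the parent is not limited to the LLM-written summary.
    if include_final_observation and observations:
        outputs["final_observation"] = observations[-1]
    return outputs


# Placeholder observations for illustration.
observations = ["<html>page 1</html>", "<html>page 2</html>"]
summary_only = build_finish_outputs("Clicked the button.", observations)
with_final = build_finish_outputs("Clicked the button.", observations,
                                  include_final_observation=True)
print(sorted(with_final))
```

The trade-off is the one discussed in this thread: raw observations give the parent more to work with, but they also cost context and may need extra prompting to interpret.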
Ideally, we can just save the state (e.g., the full history) of the child agent (i.e., the browsing agent) and keep it around, allowing the parent delegator to ask follow-up questions if needed.
Something like this from the delegator:
# In[1]
browser_agent = BrowserAgent()
response = browser_agent.run("Search google and tell me about OpenDevin")
print(response)
# Out[1]
OpenDevin is a platform for autonomous software engineers, powered by AI and LLMs. Here are some key points about it:
Repository Name: OpenDevin/OpenDevin
Description: OpenDevin: Code Less, Make More
Purpose: A platform for autonomous software engineers, powered by AI and LLMs.
Key Features:
OpenDevin agents collaborate with human developers to write code, fix bugs, and ship features.
The platform runs inside a Docker container.
Getting Started:
Requires the most recent version of Docker (26.0.0).
Works on Linux, Mac OS, or WSL on Windows.
Commands to start the app are provided.
...
# In[2]
response = browser_agent.run("Can you tell me the actual command i need to use to start the APP")
# The answer to this question should actually lie in the sub-task agent's observation history
print(response)
# Out[2]
Oh sure, it is XXX
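The idea in the pseudocode above could be sketched roughly as follows. Everything here is an assumption for illustration: `StatefulChildAgent` and `toy_answer` are hypothetical, and `toy_answer` merely stands in for an LLM call. What it shows is a child agent whose history persists between `run` calls, so a follow-up question is answered with the earlier exchange still in context.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


class StatefulChildAgent:
    """Hypothetical child agent that keeps its history across calls."""

    def __init__(self, answer_fn: Callable[[str, History], str]):
        self.history: History = []
        self._answer_fn = answer_fn  # stand-in for the real LLM call

    def run(self, task: str) -> str:
        # Record the task first so the answer function sees it in context.
        self.history.append(("task", task))
        response = self._answer_fn(task, self.history)
        self.history.append(("response", response))
        return response


def toy_answer(task: str, history: History) -> str:
    # Count tasks seen before the current one to show that state persists.
    earlier = [text for kind, text in history if kind == "task"]
    return f"answering {task!r} with {len(earlier) - 1} earlier task(s) in context"


agent = StatefulChildAgent(toy_answer)
first = agent.run("Search and tell me about the project")
second = agent.run("What is the actual start command?")
print(first)
print(second)
```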
Yeah, when I think more about it, I feel like we should (again) include the child's history. We do need some prompting so that the parent agent doesn't get confused.
Maybe we can use #2021 to condense actions from the child browsing agent?
Yes, it's almost ready for it. It currently identifies the 'chunks' of child events, in order to (insert here: exclude them from parent stream/summarize them in parent stream).
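As a rough illustration of identifying "chunks" of child events, the sketch below groups consecutive events by the agent that produced them and replaces each run of child events with a single placeholder in the parent's view. It is not the implementation from #2021: the `Event` class, its `source` field, and `collapse_child_chunks` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Event:
    source: str   # hypothetical: name of the agent that produced the event
    content: str


def collapse_child_chunks(events: List[Event], parent: str) -> List[Event]:
    """Replace each contiguous run of non-parent events with one placeholder."""
    result: List[Event] = []
    for source, group in groupby(events, key=lambda e: e.source):
        chunk = list(group)
        if source == parent:
            result.extend(chunk)
        else:
            # A real condenser might summarize the chunk here instead.
            note = f"[{len(chunk)} events from {source} condensed]"
            result.append(Event(source=source, content=note))
    return result


events = [
    Event("CodeActAgent", "delegate browsing task"),
    Event("BrowsingAgent", "open page"),
    Event("BrowsingAgent", "click button"),
    Event("CodeActAgent", "continue with result"),
]
condensed = collapse_child_chunks(events, parent="CodeActAgent")
for event in condensed:
    print(event.source, "|", event.content)
```

Swapping the placeholder for a summary, or dropping the chunk entirely, covers the two options mentioned above (summarize in the parent stream or exclude from it).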
@assertion No, it has nothing to do with the browser env. At the moment, a browser env is always launched regardless of the agent.
Yes, this is true right now. I mean that if browse actions are delegated to BrowsingAgent after this PR, starting a browserEnv for other agents like CodeActAgent is no longer reasonable. This is really an optimization that can reduce the number of processes started while an agent is running.
That's a good point. We could probably refactor how the runtime and agents interact. UPDATE: I quickly checked the AgentSkills code and it seems it's only installed for CodeActAgent.
Love this, making it optional can save a lot of resources.
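One possible shape for making the browser environment optional is lazy start-up: the process is created only when some agent first asks for it. The sketch below is an assumption about how that could look, not how the runtime currently works; `LazyBrowserEnv` and its `factory` argument are hypothetical, and a plain `object()` stands in for the real environment.

```python
from typing import Any, Callable, List, Optional


class LazyBrowserEnv:
    """Hypothetical wrapper that starts the browser env only on first use."""

    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._env: Optional[Any] = None

    @property
    def started(self) -> bool:
        return self._env is not None

    def get(self) -> Any:
        # Start the environment on the first call and reuse it afterwards.
        if self._env is None:
            self._env = self._factory()
        return self._env


starts: List[str] = []


def fake_browser_factory() -> object:
    # Stand-in for launching a real browser process.
    starts.append("started")
    return object()


lazy_env = LazyBrowserEnv(fake_browser_factory)
started_before_use = lazy_env.started
env_a = lazy_env.get()
env_b = lazy_env.get()
print(started_before_use, lazy_env.started, len(starts))
```

With this pattern an agent that never browses never pays for the browser process, while a delegated BrowsingAgent triggers the start on its first request.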
LGTM! And this is exciting! Tried a random query and it worked pretty well!
One minor bug: the screenshot will NOT display when the agent is trying to predict the next action (cc @frankxu2004 - any ideas?)
Another minor issue: when it finishes, the frontend repeats the message (once from the BrowsingAgent, once from the delegator). It even shows a little empty box in the middle.
This is exciting!
I am putting this PR on hold so that it doesn't affect current evaluations.
Discussed with @xingyaoww offline and we will put the second issue on hold:
I don't have a good way to solve it right now. If the browsing action is just a subtask of the original big task, then the user won't notice any redundancy, but if the user simply gives a browsing task, then they would indeed feel the agent is repeating itself. We might need to think about the interaction design more carefully so as not to confuse users.
Hello @li-boxuan Diff:
--- /home/runner/work/OpenDevin/OpenDevin/tests/integration/mock/CodeActAgent/test_browse_internet/prompt_003.log 2024-06-07 09:06:47.574055654 +0000
+++ /tmp/tmpezug6yc1 2024-06-07 09:17:54.012024643 +0000
@@ -118,14 +118,13 @@
[8] heading 'The Ultimate Answer'
[9] paragraph ''
StaticText 'Click the button to reveal the answer to life, the universe, and everything.'
- [10] button 'Click me'
+ [10] button 'Click me', clickable
Should be fixed with #2307
Let CodeActAgent delegate to BrowsingAgent for browsing tasks. With this PR, CodeActAgent can also pass the test_browse_internet test, which it couldn't before because that test requires locating a button and clicking on it.
Closes #1945