CodeActAgent to delegate to BrowsingAgent #2103

Merged
li-boxuan merged 2 commits into OpenDevin:main from codeact/delegate-to-browser
Jun 7, 2024

Conversation

li-boxuan
Member

@li-boxuan li-boxuan commented May 28, 2024

Let CodeActAgent delegate to BrowsingAgent for browsing tasks. With this PR, CodeActAgent could also pass the test test_browse_internet - which it couldn't before because that test requires locating a button and clicking on it.
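
The delegation flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `DelegateAction` and `PlainMessage` dataclasses, the `<execute_browse>` tag convention, and the `parse_response` helper are all stand-ins assumed for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DelegateAction:
    """Hypothetical stand-in for an action that hands a task to another agent."""
    agent: str
    inputs: dict = field(default_factory=dict)


@dataclass
class PlainMessage:
    """Hypothetical stand-in for an ordinary message from the agent."""
    content: str


# Assumed tag convention for browsing requests; chosen for this sketch only.
BROWSE_TAG = re.compile(r"<execute_browse>(.*?)</execute_browse>", re.DOTALL)


def parse_response(llm_output: str):
    """Route browsing requests to a child agent instead of handling them inline."""
    match = BROWSE_TAG.search(llm_output)
    if match:
        task = match.group(1).strip()
        return DelegateAction(agent="BrowsingAgent", inputs={"task": task})
    return PlainMessage(content=llm_output)


action = parse_response(
    "<execute_browse>Open localhost:8000 and click the button</execute_browse>"
)
print(action)
```

The point of the sketch is that the parent agent never issues low-level browser commands itself; it only names the child agent and passes a task description.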

Closes #1945

Collaborator

@enyst enyst left a comment

Nice! The big test for delegates? 😅

Exciting stuff. At least in theory, delegating should work better.

@assertion
Contributor

👍👍👍👍 I have been expecting this feature for a long time. First, I want to confirm: does this mean that running CodeActAgent should not launch a browser env? Only BrowsingAgent would do this once all browse actions are changed to this delegate-agent mode. @li-boxuan

Collaborator

@frankxu2004 frankxu2004 left a comment

This is awesome! It's clean! I have one overall question: what does the calling agent obtain from the subtask agent's execution? Is it just the final text message from the subtask agent (e.g. the browsing agent in this case), or should the calling agent also have access to more detailed information (e.g. HTML observations in the subtask) as part of its final observation?

agenthub/codeact_agent/codeact_agent.py
@@ -101,7 +103,7 @@ def step(self, state: State) -> Action:
         isinstance(prev_action, MessageAction) and prev_action.source != 'user'
     ):
         # agent has responded, task finish.
-        return AgentFinishAction()
+        return AgentFinishAction(outputs={'content': prev_action.content})
Collaborator

Less of a review, more of a question:

Just curious, is this AgentFinishAction's outputs becoming the AgentDelegationObservation for the calling agent? What can the calling agent see after the subtask has finished: just this LLM-generated finishing message, or does it also contain history observations, etc.? I am asking because the calling agent might want to browse some websites and expect HTML as the subtask's finishing result, but in this case it would get an LLM summary of it, right?

Member Author

is this AgentFinishAction's outputs becoming the AgentDelegationObservation for the calling agent?

Yep. It doesn't contain the history. I tried that initially and it didn't work well: CodeActAgent was not able to pass the test_browse_internet test. Most likely, CodeActAgent got confused by the history because it didn't know about the browsing commands issued by BrowsingAgent.

We had a discussion around this topic here: #1910 (comment) and we didn't reach a conclusion, so this might change.

I believe there are pros & cons for the parent agent to know about child agent's history. For this delegation in particular, I don't see much benefit for CodeActAgent to know about BrowsingAgent's history though.
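
The behaviour discussed here, where only the child's final outputs reach the parent, can be illustrated with a small sketch. Everything below is assumed for illustration: `FinishAction`, `DelegateObservation`, `run_child`, and `to_parent_observation` are simplified stand-ins and not the project's actual classes, and the child's history is mocked.

```python
from dataclasses import dataclass, field


@dataclass
class FinishAction:
    """Stand-in for the child's final action; carries only its outputs."""
    outputs: dict = field(default_factory=dict)


@dataclass
class DelegateObservation:
    """Stand-in for what the parent sees once the child is done."""
    outputs: dict


def run_child(task: str):
    """Pretend child run: returns (private history, final action)."""
    history = [f"goto page for: {task}", "click('10')", "read answer"]
    return history, FinishAction(outputs={"content": "The answer is 42."})


def to_parent_observation(finish: FinishAction) -> DelegateObservation:
    # Only the final outputs cross the boundary; the child's history is dropped.
    return DelegateObservation(outputs=dict(finish.outputs))


history, finish = run_child("reveal the ultimate answer")
obs = to_parent_observation(finish)
print(obs.outputs["content"])
```

The trade-off is visible in the sketch: the parent stays unaware of the child's browsing commands, but it also loses any raw page content unless the child includes it in its summary.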

Collaborator

Got it. I agree that for now we just keep it this way for simplicity - trust the child agent's ability to summarize enough useful information.

As for the use case, I could imagine that if the parent wants the final browser state, i.e., the raw contents (such as HTML or the axtree) after a series of actions, it might be useful to provide the parent agent with the final observation instead of only letting the LLM summarize a message back to the parent agent. Of course it has pros and cons, but that's one possible scenario where it could be needed.

Feel free to merge this! This design choice should probably be another issue.

Member

Ideally, we can just save the state (e.g., all the history) of the child agent (i.e., the browsing agent) and keep it there, which allows the parent delegator to ask follow-up questions if needed.

Something like this from the delegator:

# In[1]
browser_agent = BrowserAgent()
response = browser_agent.run("Search google and tell me about OpenDevin")
print(response)

# Out[1]
OpenDevin is a platform for autonomous software engineers, powered by AI and LLMs. Here are some key points about it:
Repository Name: OpenDevin/OpenDevin
Description: OpenDevin: Code Less, Make More
Purpose: A platform for autonomous software engineers, powered by AI and LLMs.
Key Features:
OpenDevin agents collaborate with human developers to write code, fix bugs, and ship features.
The platform runs inside a Docker container.
Getting Started:
Requires the most recent version of Docker (26.0.0).
Works on Linux, Mac OS, or WSL on Windows.
Commands to start the app are provided.
...

# In[2]
response = browser_agent.run("Can you tell me the actual command I need to use to start the app")
# The answer to this question should actually lie in the sub-task agent's observation history
print(response)

# Out[2]
Oh sure, it is XXX
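
A runnable toy version of this idea might look like the sketch below. It is an assumption-laden illustration only: `StatefulChildAgent` is a hypothetical class, the observations are hard-coded placeholders, and a simple keyword match stands in for the LLM call that would answer from the stored history.

```python
class StatefulChildAgent:
    """Hypothetical child agent that keeps its observations between calls."""

    def __init__(self):
        self.history: list[str] = []

    def observe(self, text: str) -> None:
        # In a real agent these would be page contents gathered while browsing.
        self.history.append(text)

    def run(self, question: str) -> str:
        # Stand-in for an LLM call: return the latest stored observation
        # that shares at least one word with the question.
        words = {w.lower().strip("?.,") for w in question.split()}
        hits = [obs for obs in self.history if words & set(obs.lower().split())]
        return hits[-1] if hits else "No matching observation."


agent = StatefulChildAgent()
agent.observe("page title: OpenDevin README")
agent.observe("start command: XXX (placeholder)")

# The follow-up is answered from retained history, without browsing again.
print(agent.run("What is the start command?"))
```

Because the child object outlives a single request, the delegator can ask a second question whose answer was never part of the first summary.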

Member Author

Yeah, when I think more about it, I feel like we should (again) include the child's history. We do need some prompting so that the parent agent doesn't get confused.

Member

Maybe we can use #2021 to condense actions from the child browsing agent?

Collaborator

Yes, it's almost ready for it. It currently identifies the 'chunks' of child events, in order to (insert here: exclude them from parent stream/summarize them in parent stream).
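
One way to picture the "chunks of child events" idea is the sketch below. It does not reflect how #2021 is implemented; the `Event` dataclass, its `source` field, and the `condense_child_chunks` helper are assumptions made for the example, and the "summary" is simply the child's last event.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Event:
    source: str   # "parent" or "child"; field names are assumptions
    content: str


def condense_child_chunks(events):
    """Replace each consecutive run of child events with one summary event."""
    condensed = []
    for source, run in groupby(events, key=lambda e: e.source):
        run = list(run)
        if source == "child":
            summary = f"[delegate ran {len(run)} steps] {run[-1].content}"
            condensed.append(Event("parent", summary))
        else:
            condensed.extend(run)
    return condensed


stream = [
    Event("parent", "delegate: find the answer"),
    Event("child", "goto page"),
    Event("child", "click button"),
    Event("child", "done: answer found"),
    Event("parent", "finish"),
]
for event in condense_child_chunks(stream):
    print(event.content)
```

Swapping the summary line for an empty result would give the "exclude from parent stream" variant mentioned above.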

@li-boxuan
Member Author

does this mean that running CodeActAgent should not launch a browser env?

@assertion No it has nothing to do with browser env. At the moment, a browser env is always launched regardless of the agent.

@li-boxuan li-boxuan force-pushed the codeact/delegate-to-browser branch from 1881874 to 6b7d114 May 29, 2024 03:06
@assertion
Contributor

No it has nothing to do with browser env. At the moment, a browser env is always launched regardless of the agent.

Yes, this is true right now. I mean that if browse actions are delegated to BrowsingAgent after this PR, starting a browserEnv for other agents like CodeActAgent is not reasonable.

Well, this is actually an optimization that can reduce the number of processes started while an agent is running.

@li-boxuan
Member Author

li-boxuan commented May 29, 2024

I mean that if browse actions are delegated to BrowsingAgent, starting a browserEnv for other agents like CodeActAgent is not reasonable.

That's a good point. We could probably refactor how the runtime and agents interact. A similar problem is that AgentSkills is always installed even though not every agent uses it. Let me create a thread on Slack so that we can discuss this more.

UPDATE: I quickly checked the AgentSkills code and it seems it's only installed for CodeActAgent.
UPDATE2: Created https://opendevin.slack.com/archives/C06QKSD9UBA/p1716952778225289

@frankxu2004
Collaborator

I mean that if browse actions are delegated to BrowsingAgent, starting a browserEnv for other agents like CodeActAgent is not reasonable.

That's a good point. We could probably refactor how the runtime and agents interact. A similar problem is that AgentSkills is always installed even though not every agent uses it. Let me create a thread on Slack so that we can discuss this more.

UPDATE: I quickly checked the AgentSkills code and it seems it's only installed for CodeActAgent.
UPDATE2: Created https://opendevin.slack.com/archives/C06QKSD9UBA/p1716952778225289

Love this, making it optional can save a lot of resources.
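
The optimization suggested in this exchange could be approached with lazy initialization. The sketch below is only one possible shape, under stated assumptions: `LazyBrowserEnv` and `fake_browser` are hypothetical names, and the factory stands in for whatever actually starts the browser process.

```python
class LazyBrowserEnv:
    """Hypothetical wrapper that defers starting a browser until first use."""

    def __init__(self, factory):
        self._factory = factory  # callable that actually starts the browser
        self._env = None

    @property
    def started(self) -> bool:
        return self._env is not None

    def get(self):
        # Start the environment on first access and reuse it afterwards.
        if self._env is None:
            self._env = self._factory()
        return self._env


launches = []


def fake_browser():
    """Stand-in for an expensive browser start-up."""
    launches.append("started")
    return object()


env = LazyBrowserEnv(fake_browser)
print(env.started)   # nothing has been launched yet
env.get()
env.get()
print(len(launches))
```

With this shape, an agent that never browses never pays the start-up cost, while an agent that does browse still launches the environment only once.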

@li-boxuan li-boxuan changed the title from "CodeactAgent to delegate to BrowsingAgent" to "CodeActAgent to delegate to BrowsingAgent" May 30, 2024
Member

@xingyaoww xingyaoww left a comment

LGTM! And this is exciting! Tried a random query and it worked pretty well!

image

One minor bug: the screenshot will NOT display when the agent is trying to predict the next action (cc @frankxu2004 - any ideas?)

image

Another minor issue: when it finishes, the frontend repeats the message (one from the BrowsingAgent, one from the delegator). There is even a little empty box in the middle:

image

@li-boxuan li-boxuan self-assigned this May 31, 2024
@neubig
Contributor

neubig commented Jun 1, 2024

This is exciting!

@li-boxuan
Member Author

I am putting this PR on hold so that it doesn't affect current evaluations.

@li-boxuan li-boxuan marked this pull request as draft June 2, 2024 22:27
@li-boxuan
Member Author

Discussed with @xingyaoww offline and we will put the second issue on hold:

when it finishes, the frontend repeats the message (one from the BrowsingAgent, one from the delegator). There is even a little empty box in the middle

I don't have a good way to solve it right now. If the browsing action is just a subtask of the original big task, then the user won't notice any redundancy, but if the user simply gives a browsing task, then they would indeed feel that the agent is repeating itself. We might need to think harder about the interaction design so that it doesn't confuse users.

@li-boxuan li-boxuan marked this pull request as ready for review June 7, 2024 07:53
@li-boxuan li-boxuan merged commit 45ce09d into OpenDevin:main Jun 7, 2024
18 checks passed
@li-boxuan li-boxuan deleted the codeact/delegate-to-browser branch June 7, 2024 07:53
@tobitege
Collaborator

tobitege commented Jun 7, 2024

Hello @li-boxuan,
this looks like it is currently causing integration tests to fail:

https://github.com/OpenDevin/OpenDevin/blame/45ce09d70ec9e46ded03df3b168f684e030c6dee/tests/integration/mock/CodeActAgent/test_browse_internet/prompt_003.log#L121

Diff:
--- /home/runner/work/OpenDevin/OpenDevin/tests/integration/mock/CodeActAgent/test_browse_internet/prompt_003.log	2024-06-07 09:06:47.574055654 +0000
+++ /tmp/tmpezug6yc1	2024-06-07 09:17:54.012024643 +0000
@@ -118,14 +118,13 @@
 	[8] heading 'The Ultimate Answer'
 	[9] paragraph ''
 		StaticText 'Click the button to reveal the answer to life, the universe, and everything.'
-	[10] button 'Click me'
+	[10] button 'Click me', clickable

Should be fixed with #2307

Development

Successfully merging this pull request may close these issues.

Incorporate BrowsingAgent's prompts to CodeAct to enable richer interactions with the browser.
8 participants