An OpenAI Gym environment for evaluating the ability of large language models (LLMs; e.g., GPT-4, Claude) to perform long-horizon reasoning and task planning in dynamic multi-agent settings, based on gym-cooking [1].
Demo video: https://www.youtube.com/watch?v=4LmcpkS53Wg
There is a growing area of AI research in which foundation models such as LLMs are used for decision making in complex environments that involve long-horizon reasoning, control, and planning [2]. For instance, Text2Motion [3] uses LLMs to enable robots to solve sequential manipulation tasks. OpenAI's GPT-4 also performs well on theory-of-mind (ToM) tasks [6], which require understanding other agents' beliefs, goals, and mental states.
OvercookedGPT is an interactive 2D game environment in which OpenAI's GPT-4/GPT-3.5-Turbo or Anthropic's Claude generates intertemporal, sequential tasks in a centralized fashion to control multiple agents toward a goal in a simulation (i.e., cooking food in a kitchen). It is based on gym-cooking [1] and was also inspired by overcooked_ai [4] (which is used in [5]). The purpose of this simulator is to evaluate the ability of LLMs to perform long-horizon reasoning and task planning in dynamic multi-agent environments. To this end, in-context learning (i.e., few-shot learning with the prompt engineering methods of chain-of-thought (CoT) prompting and PAL [7]) is used to guide the LLMs to generate a task queue in Python that the simulator executes on the fly. As shown in [8], the reasoning performance of LLMs improves with the complexity of the input prompt: complex prompts achieve better performance than simple ones.
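As an illustration, a generated task queue might look like the following. This is a minimal sketch: the subtask names, tuple structure, and `step` helper are hypothetical and chosen for readability, not OvercookedGPT's actual API.

```python
from collections import deque

# Hypothetical subtasks an LLM might emit for "Make a tomato and lettuce
# salad and deliver it." -- names are illustrative, not the simulator's API.
task_queue = deque([
    ("agent1", "fetch", "tomato"),
    ("agent1", "chop", "tomato"),
    ("agent2", "fetch", "lettuce"),
    ("agent2", "chop", "lettuce"),
    ("agent1", "plate", "salad"),
    ("agent1", "deliver", "salad"),
])

def step(queue):
    """Pop the next subtask; a simulator would dispatch it to the named
    agent on the fly and advance the environment by one timestep."""
    if not queue:
        return None
    agent, action, obj = queue.popleft()
    return f"{agent}: {action} {obj}"

print(step(task_queue))  # agent1: fetch tomato
```

Executing the queue head-first like this is inherently sequential; supporting parallel subtask execution across agents is listed in the TODOs below.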
OvercookedGPT uses in-context learning (ICL) to guide LLMs to generate a task queue that controls multiple agents. But why, and how, can LLMs (whose parameters, including the W_Q, W_K, and W_V matrices of Transformer attention, are "frozen") perform an unseen, non-pretrained language task merely by observing a few demonstrations? [9] shows that LLMs can override semantic priors when presented with demonstration examples (input-label pairs), despite the strong semantic priors the models may hold; this ability is an emergent phenomenon unlocked by model scaling. Moreover, [10] suggests that in ICL, LLMs effectively perform gradient descent through forward computation (whereas explicit finetuning computes gradients by back-propagation); ICL can therefore be viewed as implicit finetuning.
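A toy numerical sketch of the duality argued in [10]: with unnormalized linear attention, attending to demonstration key/value pairs gives arithmetically the same result as applying a rank-1 "implicit weight update" built from those pairs to the query. The numbers below are arbitrary and only illustrate the algebra, not a real Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Key/value pairs as if produced from in-context demonstrations.
K = rng.standard_normal((3, d))   # keys, one row per demonstration
V = rng.standard_normal((3, d))   # values, one row per demonstration
q = rng.standard_normal(d)        # query token

# Unnormalized linear attention over the demonstrations:
# sum_i v_i * (k_i . q)
attn_out = V.T @ (K @ q)

# The same computation viewed as a weight update applied to the query:
# delta_W is a sum of rank-1 outer products, as in gradient descent
# on a linear layer with the demonstrations as training pairs.
delta_W = sum(np.outer(v, k) for v, k in zip(V, K))
update_out = delta_W @ q

assert np.allclose(attn_out, update_out)
```

In this simplified linear-attention view, the frozen weights never change; the demonstrations contribute a data-dependent term that acts like a temporary finetuning update, which is why [10] calls ICL implicit finetuning.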
```shell
python3 -m pip install -U pygame --user
git clone https://github.com/BladeTransformerLLC/OvercookedGPT.git
cd OvercookedGPT
pip3 install -r requirements.txt
```
Set the `OPENAI_API_KEY` environment variable (or put the key string in `utils/chatgpt/openai.json`).
Set the `ANTHROPIC_API_KEY` environment variable (or put the key string in `utils/claude/anthropic.json`).
Start a single-agent simulation (GPT-3.5-Turbo is used by default; specify a model in `utils/chatgpt/openai.json`):

```shell
python3 main.py --num-agents 1 --level partial-divider_salad --gpt
```
(Replace `--gpt` with `--claude` to use Anthropic's Claude.) Enter a task, e.g., "Make a tomato and lettuce salad and deliver it."
Start a multi-agent simulation:

```shell
python3 main.py --num-agents 2 --level partial-divider_salad --gpt
```
Manually control agents with the arrow keys (press 1 or 2 to switch between agents):

```shell
python3 main.py --num-agents 2 --level partial-divider_salad --gpt --manual
```
- Allow simultaneous/parallel subtask execution by multiple agents within the same timestep
- Prevent agents from moving through other agents (make them avoid or wait for each other)
- Evaluate with 3 or more agents in large levels
1. Wu et al., "Too Many Cooks: Bayesian Inference for Coordinating Multi-Agent Collaboration," 2020.
2. Yang et al., "Foundation Models for Decision Making: Problems, Methods, and Opportunities," 2023.
3. Lin et al., "Text2Motion: From Natural Language Instructions to Feasible Plans," 2023.
4. Carroll et al., "On the Utility of Learning about Humans for Human-AI Coordination," 2020.
5. Hong et al., "Learning to Influence Human Behavior with Offline Reinforcement Learning," 2023.
6. Moghaddam & Honey, "Boosting Theory-of-Mind Performance in Large Language Models via Prompting," 2023.
7. Gao et al., "PAL: Program-aided Language Models," 2022.
8. Fu et al., "Complexity-Based Prompting for Multi-Step Reasoning," 2022.
9. Wei et al., "Larger Language Models Do In-Context Learning Differently," 2023.
10. Dai et al., "Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers," 2022.