Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities. To tackle this, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages the zero-shot reasoning ability of large language models (LLMs) and the compelling text-to-image generation ability of diffusion-based models. TIP improves the interaction between the two modalities using a Text-to-Image Bridge and an Image-to-Text Bridge, allowing LLMs to guide textually grounded image plan generation and, in reverse, leveraging descriptions of the image plans to ground the textual plan. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. Our results show compelling human preferences and automatic scores against unimodal and multimodal baselines on WIKIPLAN and RECIPEPLAN in terms of informativeness, temporal coherence, and plan accuracy. Our code and data: https://github.com/YujieLu10/MPP.
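To make the two bridges concrete, below is a minimal sketch of the TIP loop as described in the abstract. The helpers `llm`, `text_to_image`, and `image_to_text` are hypothetical stand-ins for a zero-shot LLM, a diffusion model, and an image captioner; their names, signatures, and the exact prompt wording are assumptions for illustration, not the paper's API.

```python
from typing import Callable, List, Tuple

def tip_plan(
    goal: str,
    n_steps: int,
    llm: Callable[[str], str],               # hypothetical: prompt -> text
    text_to_image: Callable[[str], bytes],   # hypothetical: prompt -> image bytes
    image_to_text: Callable[[bytes], str],   # hypothetical: image -> caption
) -> List[Tuple[str, bytes]]:
    """Generate a multimodal plan of paired (text step, image step)."""
    plan: List[Tuple[str, bytes]] = []
    context = f"Goal: {goal}\n"
    for i in range(1, n_steps + 1):
        # Draft the next textual step, conditioned on the steps so far.
        step_text = llm(context + f"Step {i}:")
        # Text-to-Image Bridge: the LLM rewrites the step into a
        # visually grounded prompt for the diffusion model.
        image_prompt = llm(f"Rewrite as a visual scene description: {step_text}")
        image = text_to_image(image_prompt)
        # Image-to-Text Bridge: caption the generated image and let the
        # LLM revise the step so text and image stay mutually grounded.
        caption = image_to_text(image)
        step_text = llm(
            "Revise this step so it matches the image.\n"
            f"Step: {step_text}\nImage description: {caption}\nRevised step:"
        )
        plan.append((step_text, image))
        context += f"Step {i}: {step_text}\n"
    return plan
```

Passing the models in as callables keeps the sketch self-contained; any concrete LLM, diffusion model, and captioner could be plugged in to reproduce the dual-modality interaction at a high level.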