Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM toward a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but they lack in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks but also boosts VLM task accuracy. With the enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.
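The abstract describes a training recipe rather than code. As a rough illustration only, the toy PyTorch sketch below hints at findings (1) and (3): unfreezing the LLM backbone during pre-training, and re-blending text-only instruction data into the fine-tuning mix. All class, function, and parameter names here (ToyVLM, set_llm_trainable, blend_sft_data, text_ratio) are hypothetical and do not come from the VILA codebase.

```python
# Hypothetical sketch (not VILA's actual code) of two ideas from the abstract:
# (1) unfreezing the LLM during multi-modal pre-training, and
# (2) re-blending text-only instruction data with image-text data during SFT.
import random
import torch.nn as nn


class ToyVLM(nn.Module):
    """Minimal stand-in: a vision encoder, a projector, and an LLM backbone."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm


def set_llm_trainable(model: ToyVLM, trainable: bool) -> None:
    # Finding (1): a frozen LLM gives decent zero-shot results but weak
    # in-context learning, so pre-training unfreezes it (trainable=True).
    for p in model.llm.parameters():
        p.requires_grad = trainable


def blend_sft_data(image_text_samples: list, text_only_samples: list,
                   text_ratio: float = 0.2) -> list:
    # Finding (3): mixing text-only instruction data back into the SFT set
    # both preserves text-only ability and improves VLM benchmark accuracy.
    # `text_ratio` is an illustrative knob, not a number from the paper.
    n_text = int(len(image_text_samples) * text_ratio)
    mixed = image_text_samples + random.sample(
        text_only_samples, min(n_text, len(text_only_samples)))
    random.shuffle(mixed)
    return mixed
```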