# Instruction Tuning for Large Language Models: A Survey


Original paper: [here](https://arxiv.org/abs/2308.10792)

## Introduction

The introduction section of the survey paper on instruction tuning (IT), also known as supervised fine-tuning (SFT), can be broken down into the following parts:

**1. The Rise of Large Language Models (LLMs)**

*   The field of large language models has progressed significantly in recent years.
*   Models such as **GPT-3**, **PaLM**, and **LLaMA** have demonstrated impressive capabilities across various natural language tasks.
*   These models are typically trained to minimize the error in predicting the next word in a sequence.



**2. The Mismatch Between Training Objective and User Objective**

*   A major issue with LLMs is the mismatch between how they are trained and what users want them to do.
*   LLMs are trained on minimizing word prediction error, while users want the models to "follow their instructions helpfully and safely".
*   This creates a gap between the model's objective and the user's objective.



**3. Instruction Tuning (IT) / Supervised Fine-tuning (SFT) as a Solution**

*   To address this mismatch, **instruction tuning (IT)**, also referred to as **supervised fine-tuning (SFT)**, has been proposed.
*   IT/SFT is an effective technique for enhancing the capabilities and controllability of large language models.
*   It involves further training LLMs using (**INSTRUCTION, OUTPUT**) pairs, where INSTRUCTION is the human instruction, and OUTPUT is the desired response.



**4. Benefits of SFT**

*   SFT bridges the gap between the next-word prediction objective and the user's objective of instruction following.
*   SFT allows for more controllable and predictable model behavior, as instructions constrain the model's outputs.
*   SFT is computationally efficient and helps LLMs adapt to specific domains without extensive retraining.



**5. Challenges of SFT**

*   Crafting high-quality instructions that cover desired behaviors is difficult.
*   Existing instruction datasets are often limited in quantity, diversity, and creativity.
*   There are concerns that SFT only improves on tasks heavily supported in the training dataset.
*   Criticism exists that SFT captures surface-level patterns rather than learning the task itself.
*   Improving instruction adherence and handling unexpected model responses remain open research questions.



**6. The Need for Further Research**

*   These challenges highlight the importance of further investigation and analysis to optimize the fine-tuning process and understand the behavior of instruction-tuned LLMs.
*   Research interest in analyzing and discussing LLMs has grown, but instruction tuning itself is rarely the focus of that work.
*   This survey attempts to fill this gap by organizing the most up-to-date knowledge on this quickly advancing field.

## Methodology

**2.1 Instruction Dataset Construction**

*   **Core Elements of an Instruction Dataset**
    *   Each instance in an instruction dataset consists of three key elements:
        *   An **instruction**: This is a natural language text sequence that specifies the task to be performed by the model. For example, "write a thank-you letter to XX for XX" or "write a blog on the topic of XX".
        *   An optional **input**: This provides supplementary information or context that the model may need to fulfill the instruction.
        *   An **anticipated output**: This is the desired response that the model should generate based on the given instruction and input.
*   **Purpose of Instructions**
    *   The instructions serve to guide the model to produce the correct output. This helps to bridge the gap between the model's training objective (next-word prediction) and the user's objective (following instructions).
*   **Types of Datasets**
    *   The survey later details different types of instruction tuning datasets. These are categorized into:
        *   Human-crafted Data
        *   Synthetic Data via Distillation
        *   Synthetic Data via Self-Improvement
*   **Multi-turn Conversational Datasets**
    *   For multi-turn conversational SFT datasets, large language models can play both roles (user and AI assistant) through self-play to generate messages in a conversational format.

This section establishes the fundamental structure of the data used for instruction tuning, highlighting the importance of instructions in guiding the model's behavior. The following sections in the survey will go into detail on the various approaches to creating these datasets.
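
The three-element structure described above can be sketched in Python as follows. This is only an illustration: the class name `InstructionInstance`, the function `render_prompt`, and the `### Instruction:` / `### Input:` / `### Response:` layout are assumptions made for this example, since the survey does not prescribe a specific template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstructionInstance:
    """One instruction-tuning example (hypothetical container, not from the survey)."""
    instruction: str              # natural-language task description
    output: str                   # anticipated response
    input: Optional[str] = None   # optional supplementary context


def render_prompt(instance: InstructionInstance) -> str:
    """Build the text shown to the model; the layout is an assumed convention."""
    parts = [f"### Instruction:\n{instance.instruction}"]
    if instance.input:  # include the input block only when context is provided
        parts.append(f"### Input:\n{instance.input}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


example = InstructionInstance(
    instruction="Summarize the following passage in one sentence.",
    input="Instruction tuning further trains a pretrained model on (instruction, output) pairs.",
    output="Instruction tuning fine-tunes a pretrained model on instruction-response data.",
)
prompt = render_prompt(example)
```

During training, `example.output` would be the text the model is expected to produce after `prompt`.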


**2.2 Instruction Tuning / Supervised Fine-tuning**

*   **The Process:**
    *   This section describes the actual fine-tuning process using the instruction datasets created as described in section 2.1.
    *   A **pretrained model** is taken as the base for fine-tuning.
    *   The fine-tuning is done in a **fully supervised manner**.

*   **Supervised Fine-tuning Details**
    *   Given an **instruction and an optional input**, the model is trained to **predict each token in the output sequentially.**
    *   This process is essentially a sequence generation task where the model learns to generate the desired output by predicting one token at a time.
    *  The training data consists of the (instruction, input, output) triplets described in section 2.1, and the model is trained to minimize the error between its prediction and the correct output sequence.

*   **Key Focus:**
    *   The core of instruction tuning is to **train the model to generate outputs that align with given instructions.**
    *   This fine-tuning process adapts the model's parameters so it can better understand and follow human instructions.

In summary, this section outlines the supervised training process where a pre-trained model is adjusted using instruction datasets to better follow instructions, focusing on generating outputs that match the anticipated output sequence.
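
A minimal sketch of this objective, assuming the loss is the mean negative log-likelihood of the output tokens only, with instruction and input positions masked out. The masking and averaging choices, the function name `sft_loss`, and the toy probabilities are assumptions for illustration, not details taken from the survey.

```python
import math
from typing import Sequence


def sft_loss(token_probs: Sequence[float], is_output: Sequence[bool]) -> float:
    """Mean negative log-likelihood over output tokens only.

    token_probs[i] is the probability the model assigned to the correct
    token at position i; is_output[i] is True for positions that belong
    to the output and False for instruction/input positions.
    """
    if len(token_probs) != len(is_output):
        raise ValueError("token_probs and is_output must have the same length")
    losses = [-math.log(p) for p, keep in zip(token_probs, is_output) if keep]
    if not losses:
        raise ValueError("no output tokens to train on")
    return sum(losses) / len(losses)


# Toy sequence: 3 prompt tokens (ignored) followed by 2 output tokens.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
mask = [False, False, False, True, True]
loss = sft_loss(probs, mask)
```

In a real training loop the probabilities would come from the model's softmax over the vocabulary, and the gradient of this loss would update the model's parameters.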


## Datasets
*   This section details the instruction tuning datasets, which are categorized into three classes:
    1.  Human-crafted Data
    2.  Synthetic Data via Distillation
    3.  Synthetic Data via Self-improvement


### **Human-crafted Data**

*   **Definition:** Human-crafted data refers to datasets that are either **manually annotated or sourced directly from the internet**.
*   **Creation Process:** The creation of these datasets **does not involve machine learning techniques**. It relies on **manual gathering and verification**.
*   **Size:** Human-crafted datasets are generally **smaller** compared to synthetic datasets due to the manual effort required.
*  **General Description**: The datasets include instructions and instances, where the instructions describe the task, and the instances include inputs and outputs related to the task.

The survey then goes into detail on some widely-used human-crafted datasets:



*   **3.1.1 Natural Instructions**
    *   This is a **human-crafted English instruction dataset** consisting of **193,000 instances** from **61 distinct NLP tasks**.
    *   It comprises "**instructions**" and "**instances**".
        *   Each item in the "**instructions**" is a task description with 7 components: **title, definition, things to avoid, emphasis/caution, prompt, positive example, and negative example**.
        *   The "**instances**" are pairs of ("input", "output"), which are the input data and the textual result that correctly follows the given instruction.
    *   The data comes from **existing NLP datasets of 61 tasks**.
    *   The authors created the "**instructions**" by referring to each dataset's annotation instruction file. They then built the "**instances**" by unifying the data instances from all the NLP datasets into ("input", "output") pairs.
*   **3.1.2 P3 (Public Pool of Prompts)**
    *  P3 is an instruction tuning dataset created by **integrating 170 English NLP datasets and 2,052 English prompts**.
    *   **Prompts** (also called task templates) map data from a traditional NLP task (like question answering or text classification) to a natural language input-output pair.
    *   Each instance in P3 has three parts: "**inputs**", "**answer\_choices**", and "**targets**".
        *  "**Inputs**" is a sequence of text that describes the task in natural language.
        *  "**Answer choices**" is a list of possible text responses to the task.
        *  "**Targets**" is the correct text response to the input.
    *   The authors created **PromptSource**, a tool for collaborative creation of high-quality prompts.
    *   The P3 dataset was created by randomly selecting a prompt from PromptSource and transforming each instance into an ("inputs", "answer choices", "targets") triplet.
*   **3.1.3 xP3 (Crosslingual Public Pool of Prompts)**
    *   xP3 is a **multilingual instruction dataset** consisting of **16 diverse natural language tasks in 46 languages**.
    *   Each instance has two parts: "**inputs**" and "**targets**".
        *   "**Inputs**" is a task description in natural language.
        *   "**Targets**" is the textual result that follows the "**inputs**" correctly.
    *   The original data comes from three sources: **the English P3 dataset**, **4 English unseen tasks in P3**, and **30 multilingual NLP datasets**.
    *   The xP3 dataset was built by sampling human-written task templates from PromptSource and filling those templates to transform various NLP tasks into a unified format.
*   **3.1.4 Flan 2021**
    *   Flan 2021 is an **English instruction dataset** created by transforming **62 widely-used NLP benchmarks** into language input-output pairs.
    *   Each instance has "**input**" and "**target**" components.
        *   "**Input**" is a text sequence that describes a task using natural language instruction.
        *   "**Target**" is the correct textual result.
    *   The authors converted conventional NLP datasets into input-target pairs by: manually composing instruction and target templates, and filling templates with data instances from the dataset.
*  **3.1.5 LIMA**
    *   LIMA is an **English instruction dataset** with 1,000 training data instances and a test set with 300 instances.
    *   The train set contains 1,000 ("instruction", "response") pairs. 75% of the training data comes from community question & answer websites, 20% is manually written by a set of the authors, and 5% comes from the Super-Natural Instructions dataset.
    *   The validation set has 50 author-written instances.
    *   The test set contains 300 examples, with 76.7% written by a different set of authors and 23.3% from the Pushshift Reddit Dataset.
*   **3.1.6 Super-Natural Instructions**
    *   Super-Natural Instructions is a **multilingual instruction collection** with 1,616 NLP tasks and 5 million task instances, covering 76 task types and 55 languages.
    *   Each task in the dataset has an "**instruction**" and "**task instances**".
    *   The "**instruction**" has three components: a "**definition**" that describes the task in natural language; "**positive examples**" that are samples of inputs and correct outputs, with a short explanation for each; and "**negative examples**" that are samples of inputs and incorrect outputs, with a short explanation for each.
    *   The "**task instances**" are data instances with textual input and a list of acceptable textual outputs.
    *   The original data comes from existing public NLP datasets, intermediate annotations generated through crowdsourcing, and synthetic tasks transformed from symbolic tasks.
*   **3.1.7 Dolly**
    *   Dolly is an **English instruction dataset** with **15,000 human-generated data instances**, designed to enable LLMs to interact with users like ChatGPT.
    *   The dataset simulates a wide range of human behaviors, covering **7 specific types: open Q&A, closed Q&A, extracting information from Wikipedia, summarizing information from Wikipedia, brainstorming, classification, and creative writing.**
*   **3.1.8 OpenAssistant Conversations**
    * OpenAssistant Conversations is a **human-crafted multilingual assistant-style conversation corpus**, containing 161,443 messages from 66,497 conversation trees in 35 languages.
    * The dataset also includes 461,292 human-annotated quality ratings.
    * Each instance is a conversation tree (CT), where each node is a message generated by roles (prompter, assistant).
    * A CT’s root node is an initial prompt from the prompter, and other nodes are replies from a prompter or an assistant. A path from the root to any node is a valid conversation between prompter and assistant, called a thread.
    * During construction of the OpenAssistant Conversations dataset, inappropriate and offensive conversation trees were filtered out.

In summary, this section describes different human-crafted datasets used for instruction tuning, highlighting their unique characteristics, data sources, and the process of their construction. These datasets are essential for training LLMs to better understand and follow human instructions.
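
The conversation-tree structure described for OpenAssistant Conversations, where any root-to-node path is a thread, can be sketched as below. The `Message` class and the `threads` function are hypothetical names for illustration and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Message:
    """A node in a conversation tree (hypothetical structure for illustration)."""
    role: str                      # "prompter" or "assistant"
    text: str
    replies: List["Message"] = field(default_factory=list)


def threads(root: Message) -> List[List[Tuple[str, str]]]:
    """Return every root-to-node path; each path is one valid thread."""
    result: List[List[Tuple[str, str]]] = []

    def walk(node: Message, path: List[Tuple[str, str]]) -> None:
        current = path + [(node.role, node.text)]
        result.append(current)          # a path from the root to any node is a thread
        for child in node.replies:
            walk(child, current)

    walk(root, [])
    return result


root = Message("prompter", "What is instruction tuning?", [
    Message("assistant", "Further training on (instruction, output) pairs.", [
        Message("prompter", "Why is it useful?"),
    ]),
    Message("assistant", "A form of supervised fine-tuning."),
])
all_threads = threads(root)
```

Because every node ends exactly one thread, a tree with four messages yields four threads, and alternative assistant replies to the same prompt appear as separate threads.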


### **Synthetic Data via Distillation**

*   **General Concept:** Distillation involves transferring knowledge and capabilities from a **highly capable "teacher" model** to a **less complex "student" model**. This enhances both the quality of responses and computational efficiency of the student model.
*   **Process in Synthetic Data Generation:** In this context, data generated by powerful fine-tuned LLMs (like ChatGPT) is used to fine-tune other LLMs.
*   **Goal**: The objective of this approach is to **transfer the knowledge of powerful LLMs to smaller, more efficient models**.

Here’s a more detailed look at the methodology and specific datasets:

*   **General Methodology:**
    *   The process often starts with a **powerful LLM**, such as **GPT-3** or **GPT-4**, which serves as the "teacher" model.
    *   This **teacher model generates synthetic data** by responding to various prompts or instructions.
    *   The generated (instruction, output) pairs are then used to **fine-tune a smaller "student" LLM**, effectively transferring knowledge from the larger model to the smaller one.

*   **Specific Examples of Distillation Datasets:**
    *   **OIG (Open Instruction Generalist)**: This dataset contains **43 million English** instruction-response pairs generated by **ChatGPT**. The source does not provide technical details about this dataset.
    *   **Unnatural Instructions:** This dataset has **240,000 English** instances. It was generated using **InstructGPT**.
    *   **InstructWild**: This dataset contains **104,000 instances**, generated by **ChatGPT**.
    *   **Evol-Instruct / WizardLM:** The WizardLM dataset contains **52,000 English** instances. **ChatGPT** was used to generate this data. **WizardLM** focuses on obtaining diverse, high-quality instructions and responses by using a five-level system of progressively enhancing the complexity of the data generation prompts. It broadens the range of topics through manual expansion to increase data diversity.
    *   **Alpaca**: This dataset consists of **52,000 English** instruction-following examples generated by **InstructGPT**.
    *   **LogiCoT**: This dataset was created using **GPT-4** to generate logical chain-of-thought instruction-tuning data.
    *   **GPT-4-LLM**: This dataset contains **52,000 English and Chinese** instruction-response pairs, which were generated by **GPT-4**.
    *   **Vicuna**: This dataset has 70,000 instances of real user **ChatGPT** conversations, which were used to fine-tune the LLaMA model.
    *   **Baize v1**: This dataset contains **111,500 English** instances, generated by **ChatGPT**.
    *   **UltraChat**: This dataset includes **675,000 English and Chinese** instances generated by **GPT-3 and GPT-4**.
    *   **Guanaco**: This dataset consists of **534,530 multilingual** instances, generated by an unknown version of **GPT**.
    *   **Orca**: This dataset includes **1.5 million English** responses generated by **GPT-3.5 and GPT-4** and is used to instruct smaller language models in logical reasoning.
    *   **ShareGPT**:  This dataset contains **90,000 multilingual** instances from real user **ChatGPT** conversations.
    *  **WildChat**: This dataset consists of **150,000 multilingual** instances of real user **ChatGPT** conversations.
    *   **WizardCoder**: This dataset was generated using LLaMA 2 and was used for code generation.
    *   **Magicoder:** This dataset contains **75,000 or 110,000** instances of code generated by **GPT-3.5**, depending on the specific version.
    *   **WaveCoder**: This dataset was generated using **GPT-4** for code generation.
    *   **Phi-1**: This dataset contains **6 billion tokens** of code and Q&A generated by **GPT-3.5**.
    *   **Phi-1.5**: This dataset contains code and Q&A generated by **GPT-3.5**.
    *   **Nectar**: This dataset has **183,000 English** instances generated by **GPT-4**, and is used for ranking tasks.

*   **Task-Specific Datasets**:
    *   In addition to general-domain datasets, there are efforts aimed at creating task-specific datasets using distillation to mimic LLM competencies in particular domains.
    *   Examples of these include datasets for:
        *   **Code generation**: WizardCoder, Magicoder, and WaveCoder.
        *   **Reasoning and writing**: Phi-1 and Phi-1.5.
        *   **Ranking**: Nectar.

*   **Key Takeaways from Distillation:**
    *   **Transfer of Knowledge**: The method allows for transferring the vast knowledge of large models to smaller ones, making it more practical to use them.
    *   **Computational Efficiency**: It improves the efficiency of smaller models without significantly sacrificing performance.
    *   **Use of Powerful LLMs:** The method relies heavily on powerful models like GPT-3 and GPT-4 as teachers.

In summary, section 3.2 highlights how synthetic data, created through the process of distillation, is used to fine-tune LLMs by leveraging the knowledge of more powerful language models, and it provides a detailed overview of datasets created using this method.
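
The general distillation recipe (a teacher answers instructions, and the resulting pairs become student training data) can be sketched as below. The teacher here is a stub function standing in for a call to a strong LLM; `build_distillation_set` and `toy_teacher` are hypothetical names, and the rule of dropping empty responses is an assumption of this sketch rather than a step described in the survey.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    instructions: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Query the teacher once per instruction and keep non-empty responses."""
    pairs = []
    for instruction in instructions:
        response = teacher(instruction).strip()
        if response:  # drop empty teacher outputs (assumed quality rule)
            pairs.append({"instruction": instruction, "output": response})
    return pairs


def toy_teacher(instruction: str) -> str:
    """Stand-in for a strong LLM; a real teacher would be a model call."""
    if "translate" in instruction.lower():
        return ""  # pretend the teacher produced nothing for this instruction
    return f"[teacher answer to: {instruction}]"


student_training_data = build_distillation_set(
    ["Explain overfitting in one sentence.", "Translate 'hello' to French."],
    toy_teacher,
)
```

The returned list has the same (instruction, output) shape as human-crafted data, so the student can be fine-tuned on it with the same supervised procedure.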


### **Synthetic Data via Self-Improvement**

*   **Core Concept**: This approach focuses on enhancing a pre-trained language model's ability to follow instructions by using the model's own generated data. This is done in a **bootstrapping manner**, where the model learns from its own outputs. The method relies on the model to improve itself through iterative self-generation and refinement of data.

*   **General Process**: The self-improvement process involves several key steps:

    1.  **Seed Data Collection**: The process starts with a small set of high-quality, **human-written tasks** that serve as a starting point. For example, Wang et al. (2022c) began with 175 such tasks.
    2.  **Instruction Generation**: Using the seed data as a few-shot prompt, the model (e.g., a vanilla GPT-3) is prompted to **generate new instructions**. This is done using in-context learning.
    3.  **Response Generation**: For each generated instruction, the model then generates a corresponding response. If it is an output-first task (like writing), the model directly generates the response. However, if it is an input-first task (like reading comprehension), the model first generates the necessary context or input before the response.
    4.  **Filtering**: The generated (instruction, response) pairs are then filtered using a series of rules or models to ensure quality and relevance.
    5.  **Iterative Improvement**: This process can be repeated, where the model learns from the newly generated and filtered data, refining its ability to generate better instructions and responses in each iteration.

*   **Key Considerations**:

    *   **Robust Base Model**: The self-improvement method requires a **strong base LLM** as its foundation. Without a powerful model, there is a risk of limiting the learning process to the model’s original capabilities.
    *   **Potential for Bias and Errors**: The cycle of self-improvement may also amplify any biases and errors present in the base model if not carefully monitored.

*   **Specific Self-Improvement Techniques and Datasets**:

    *   **3.3.1 SPIN (Self-Play Fine-Tuning)**:

        *   **Mechanism**: SPIN is a specialized self-improvement method that uses a **self-play mechanism**.
        *   **Process**: The language model is fine-tuned to distinguish between responses generated from a previous iteration of itself and the desired target data distribution.
        *   **Iterative Adjustment**: The model is iteratively adjusted to better match the target data distribution.
        *  **SPIN dataset**: The SPIN dataset has 49.8K English instances.

    *   **3.3.2 Instruction Back-translation**:

        *   **Approach**: This technique creates instructions for human-gathered texts rather than generating responses to human instructions.
        *   **Process**:
            1.  **Data Gathering**: It begins by collecting unlabeled text from the ClueWeb corpus, along with a small set of human-written (instruction, response) pairs as seed data.
            2.  **Back-translation Model Training**: A back-translation model, based on LLaMA, is trained using the seed data. This model takes the response as input and outputs the corresponding instruction.
            3.  **Instruction Generation**: The collected unlabeled texts are then input into the trained back-translation model to generate (instruction, response) format data.
            4.  **Evaluation Model Training**: An evaluation model, also based on LLaMA, is trained on the seed data to assess the generated (instruction, response) pairs.
            5.  **Filtering and Fine-tuning**: Low-quality pairs are filtered out, and the remaining high-quality data is used to fine-tune LLMs.
        *   **Instruction Back-translation dataset**: This dataset has 502K English instances.

In summary, section 3.3 highlights how models can leverage their own generated data to improve their performance iteratively. This approach includes techniques like self-play and instruction back-translation, each with its own specific method for creating synthetic data used in the fine-tuning process. This section also notes the importance of a strong base model and the potential for biases and errors in the generated data.
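
The bootstrapping loop (seed tasks, instruction generation, filtering, repetition) can be sketched as below. It covers only instruction generation and filtering, not response generation. The generator is a stub standing in for few-shot prompting of an LLM, and the filter uses word-level Jaccard similarity with a 0.7 threshold purely as an assumed, simple near-duplicate check; the names `bootstrap`, `jaccard`, and `toy_generator` are hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two instructions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def bootstrap(
    seed: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int = 2,
    max_similarity: float = 0.7,
) -> List[str]:
    """Grow an instruction pool from seed tasks, rejecting near-duplicates."""
    pool = list(seed)
    for _ in range(rounds):
        for candidate in generate(pool):   # the pool acts as the few-shot prompt
            if all(jaccard(candidate, kept) < max_similarity for kept in pool):
                pool.append(candidate)
    return pool


def toy_generator(pool: List[str]) -> List[str]:
    """Stand-in for prompting an LLM with in-context examples."""
    return [
        "Write a haiku about autumn.",
        pool[0],  # an exact repeat, which the filter should reject
        f"List {len(pool)} uses for a paperclip.",
    ]


seed_tasks = ["Summarize the given article.", "Translate the sentence to German."]
pool = bootstrap(seed_tasks, toy_generator)
```

With this stub, the first round adds two new instructions and the second round adds none, because its candidates are repeats or near-duplicates of instructions already in the pool.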

## Instruction Tuned LLMs

This section provides an overview of various large language models (LLMs) that have been fine-tuned using instruction tuning datasets. Here’s a summary of the key models discussed:

*   **4.1 InstructGPT**:
    *   This model is a 175B parameter model fine-tuned from **GPT-3** using a dataset of human-generated instructions. It was a pivotal early model in instruction tuning, though specific details of the dataset are not provided in this section.

*   **4.2 BLOOMZ**:
    *   Initialized with **BLOOM (176B)**, it is then fine-tuned on the **xP3** dataset, which includes human-generated instructions in 46 languages. The dataset is derived from a collection of English instruction-response pairs and multilingual NLP datasets transformed to fit English instruction templates.
    *   BLOOMZ shows improvements over BLOOM in zero-shot settings for tasks such as coreference resolution, sentence completion, and natural language inference. It also shows a 10% improvement on the HumanEval benchmark and a 9% BLEU improvement on generative tasks.

*   **4.3 FLAN-T5**:
    *   This model is initialized with **T5 (11B)** and fine-tuned on the **FLAN** dataset, which is constructed from 62 datasets of 12 NLP tasks transformed into instruction-following formats.
    *   It uses the JAX-based T5X framework and selects the best model by evaluating on held-out tasks every 2k steps. The fine-tuning cost is 0.2% of T5's pre-training cost.
    *   FLAN-T5 outperforms T5 and achieves results comparable to larger models like PaLM in few-shot settings.

*   **4.4 Alpaca**:
    *   This model is fine-tuned from **LLaMA (7B)** using a 52K instruction dataset.
    *   The instruction dataset is based on the self-instruct method.

*   **4.5 Vicuna**:
    *   Fine-tuned from **LLaMA (13B)** using 70K real-user conversations collected from ShareGPT.
    *   It demonstrates performance improvements over Alpaca (13B) and LLaMA (13B) on a test set and achieves responses that are rated as equal to or better than ChatGPT in 45% of the cases.

*  **4.6 GPT-4-LLM**:
    *   This model is fine-tuned from **LLaMA (7B)** using a dataset generated by **GPT-4**. It is fine-tuned in two steps: supervised fine-tuning and optimization using proximal policy optimization (PPO).
    *  It outperforms Alpaca (7B) and larger models such as Alpaca (13B) and LLaMA (13B) in both automated and human evaluations.

*   **4.7 Claude**:
    *   This model is fine-tuned on an instruction dataset consisting of 52K instructions and responses generated by **GPT-4** using a two-stage process: supervised fine-tuning, followed by optimization with PPO.
    *   Claude generates more helpful and harmless responses compared to its backbone model, with significant improvements in areas like toxicity and instruction following.

*   **4.8 WizardLM**:
    *   Fine-tuned from **LLaMA (7B)** using the **Evol-Instruct** dataset generated by ChatGPT, with 70K instances used for training.
    *   It focuses on obtaining diverse and high-quality instructions from ChatGPT.
    *   WizardLM outperforms Alpaca and Vicuna and is rated as equal to or better than ChatGPT in 67% of test samples in human evaluations.

*   **4.9 ChatGLM2**:
     *   This model is a 6B parameter model that is fine-tuned from **GLM** with an instruction dataset that contains 1.1 tokens.

*   **4.10 LIMA**:
    *   Fine-tuned from **LLaMA (65B)** on a small instruction dataset, based on the hypothesis that a large language model's knowledge is gained during pre-training, with fine-tuning teaching it to respond to user instructions.
     *   It uses a training set with only 1K data instances.

*   **4.11 Others**: This section lists other models, including:
    *   **OPT-IML (175B)**, fine-tuned from **OPT** using the Instruction Meta-Learning (IML) dataset, which contains over 1500 NLP tasks.
    *   **Dolly 2.0 (12B)**, fine-tuned from **Pythia** on the databricks-dolly-15k dataset.
    *   **Falcon-Instruct (40B)**, fine-tuned from **Falcon** on an English dialogue dataset.
    *   **Minotaur (15B)**, fine-tuned from **Starcoder Plus** on open-source instruction datasets such as WizardLM.
    *   **Nous-Hermes (13B)**, fine-tuned from **LLaMA** using a 300k instruction dataset generated by GPT-4.
    *   **TÜLU (6.7B)**, fine-tuned from **OPT** with a mixed instruction dataset.
    *   **YuLan-Chat (13B)**, fine-tuned from **LLaMA** using 250k instructions.
    *   **MOSS (16B)**, whose base model is not specified, fine-tuned with an instruction dataset.
    *   **Airoboros (13B)**, fine-tuned from **LLaMA** with an instruction dataset.
    *   **UltraLM (13B)**, fine-tuned from **LLaMA** with an instruction dataset.

**Key Takeaways from Section 4**

*   **Variety of Base Models**: Instruction tuning is applied to various base LLMs, including GPT-3, BLOOM, T5, LLaMA, OPT, and Pythia.
*   **Diverse Datasets**: The models are fine-tuned using a wide range of datasets, including both human-crafted and synthetically generated data.
*   **Performance Improvements**: Fine-tuning with instruction datasets generally leads to improved performance on various tasks, including instruction following, question answering, and code generation.
*  **Focus on Instruction Following**: Instruction tuning primarily aims to align language models with human instructions and user intentions, enhancing the usability of these models.

Together, the models in section 4 show the range of base models and datasets used in instruction tuning, and the improvements the process yields.


##  Multi-modality Instruction Tuning

### Multi-modality Datasets

This section focuses on datasets used for instruction tuning of multi-modal models, which can process and understand different types of data such as images, text, speech, and video. Here's a breakdown of the key datasets discussed:

*   **MultiInstruct:**
    *   **Description**: This is a multi-modal instruction tuning dataset that includes **62 diverse tasks** in a unified sequence-to-sequence format.
    *   **Modality**: It primarily focuses on **image-text pairs**.
    *   **Task Coverage**: It covers 10 broad categories of tasks derived from 21 existing open-source datasets.
    *   **Instructions**: Each task is accompanied by 5 expert-written instructions.
    *   **Instance Creation**: For existing tasks, input/output pairs are taken from available open-source datasets. For new tasks, **5k to 5M instances** are created by extracting information from existing tasks or reformulating them.
    *   **Impact**: It has demonstrated efficiency in enhancing various transfer learning techniques. For example, fine-tuning the OFA model on MultiInstruct improves zero-shot performance across all unseen tasks.

*   **PMC-VQA:**
    *   **Description**: A large-scale **medical visual question-answering dataset**.
    *   **Modality**: Focuses on **image-text pairs**.
    *   **Size**: It includes **227k image-question pairs** from **149k images**, covering various medical modalities and diseases.
    *   **Task Format**: It can be used for both **open-ended and multiple-choice tasks**.
    *   **Creation Process**: The dataset was created by collecting image-caption pairs from the PMC-OA dataset, using ChatGPT to generate question-answer pairs, and manually verifying a subset for quality.
    *   **Use**: This dataset was used to train the MedVInT model, a generative model for medical visual understanding.

*   **LAMM:**
    *   **Description**: A multi-modal dataset designed for **various vision and point cloud tasks**.
    *   **Modality**: Includes both **image-text and point cloud-text pairs**.
    *   **Tasks**: It includes 9 common image tasks and 3 common point cloud tasks.
    *   **Framework**: The LAMM framework differentiates the encoder, projector, and LLM fine-tuning blocks to avoid modality conflicts.

*   **Vision-Flan:**
    *   **Description**: This is a large-scale, **human-annotated visual instruction tuning dataset**.
    *   **Modality**: Includes **multiple pairs**.
    *   **Size**: Consists of **1,664,261 instances** and more than **200 diverse vision-language tasks** derived from 101 open-source computer vision datasets.
    *   **Task Variety**: Covers a broad spectrum of tasks, including image captioning, visual question-answering, and visual comprehension.
    *   **Features**: Each task includes expertly written instructions and meticulously crafted templates for inputs and outputs.
    *   **Purpose**: Aims to enhance research and application in vision-language model domains, expanding interaction and comprehension between visual and linguistic modalities.

*   **ALLaVA:**
    *   **Description**: An open-source dataset for fine-tuning visual question-answering models.
    *   **Modality**: It uses **image-text pairs**.
    *  **Size**: Features **1.4M entries** including detailed captions, instructions and comprehensive answers from GPT-4V.
    *   **Generation**: High-quality captions and visual question-answers are created by prompting GPT-4V to generate both a caption and a question-answer pair for a single image.
    *   **Benefit**: This method incorporates more visual data, enhancing the model's understanding of both visual and textual elements and reducing hallucinations.

*   **ShareGPT4V:**
    *   **Description**: A collection of highly descriptive **image-text pairs**.
    *   **Modality**: Uses **image-text pairs**.
    *   **Components**: Consists of **100K captions generated by GPT4-Vision** from a variety of images, and **1.2M captions produced by a caption model** trained on that initial set.
    *   **Coverage**: The captions cover various aspects such as global knowledge, object attributes, spatial relationships, and aesthetic evaluations.
    *   **Impact**: The fine-tuned ShareGPT4V-7B model outperforms other 7B-scale multi-modal models across 11 benchmark tests.

**Key Takeaways from Section 5.1**

*   **Diverse Modalities**: These datasets support various multi-modal tasks, including image-text, point cloud-text, and more complex combinations.
*   **Large Scale**: The datasets range in size from thousands to millions of instances, supporting training of robust models.
*   **Focus on Instruction Tuning**: The datasets are designed for instruction tuning, aiming to improve how well models follow instructions across modalities.
*   **Use of LLMs for Data Generation**: Many datasets utilize LLMs like GPT-4 to generate high-quality data, including captions, questions, and answers.
*   **Variety of Tasks**: These datasets cover diverse tasks, from general visual understanding to specialized tasks like medical image analysis.

In short, the datasets in section 5.1 differ widely in modality, size, and purpose, but all of them target instruction following across modalities.


### **Multi-modality Instruction Tuning Models**

This section reviews models that have been adapted for multi-modal tasks through instruction tuning, allowing them to process and generate content involving multiple types of data such as images, text, speech, and video. Here’s a structured overview:

*   **InstructPix2Pix:**
    *   **Model Type:** A conditional diffusion model.
    *   **Base Model:** Fine-tuned from **Stable Diffusion (983M)**.
    *   **Training Data:** Trained on a constructed multi-modal dataset containing over **450K text editing instructions and corresponding images** before and after the edits.
    *   **Data Generation:** The dataset is created by combining the capabilities of **GPT-3** (for generating text edits based on image prompts) and **Stable Diffusion** (for converting text edits into actual image edits).
    *   **Process:** It is trained with a latent diffusion objective on the generated dataset, so it follows image editing instructions rather than requiring a description of the image or an edit layer.
    *   **Evaluation:** It is compared qualitatively with previous works, such as SDEdit and Text2Live, and quantitatively using metrics that measure image consistency and edit quality.

*   **LLaVA:**
    *   **Model Type:** A large multi-modal model.
    *   **Architecture:** Connects the visual encoder of **CLIP (400M)** with the language decoder **LLaMA (7B)**.
    *   **Fine-tuning Data:** Fine-tuned using a generated instructional vision-language dataset consisting of **158K unique language-image instruction-following samples.**
    *   **Data Collection:** The data collection process involves creating conversation, detailed description, and complex reasoning prompts, and then using **GPT-4** to convert image-text pairs into the appropriate instruction-following format.
    *   **Features:** During data generation, images are presented to GPT-4 in textual form, through features such as captions and bounding boxes.
    *   **Performance:** LLaVA achieves an **85.1% relative score compared to GPT-4** on a synthetic multi-modal instruction following dataset and achieves state-of-the-art accuracy on the Science QA dataset when combined with GPT-4.
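
The idea of connecting a visual encoder to a language decoder can be sketched in a few lines. This is only an illustration under stated assumptions: the connector is taken to be a single linear projection, all sizes are toy values, and random arrays stand in for the real CLIP features and LLaMA token embeddings. The names `W_proj` and `build_llm_input` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_vision, d_llm, n_text = 6, 10, 16, 4  # illustrative toy sizes

# Random stand-ins for the vision encoder output and the embedded instruction tokens.
visual_features = rng.normal(size=(n_patches, d_vision))
text_embeddings = rng.normal(size=(n_text, d_llm))

# Assumed connector: one trainable linear map from vision space to LLM embedding space.
W_proj = rng.normal(scale=0.1, size=(d_vision, d_llm))

def build_llm_input(visual_features, text_embeddings, W_proj):
    """Project visual features to the LLM width and prepend them to the text tokens."""
    visual_tokens = visual_features @ W_proj
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

llm_input = build_llm_input(visual_features, text_embeddings, W_proj)
print(llm_input.shape)
```

The language model then consumes `llm_input` as an ordinary token sequence, which is why the same instruction-following training recipe can be reused for image inputs.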

*   **InstructBLIP:**
    *   **Model Type:** A vision-language instruction tuning framework.
    *   **Base Model:** Initialized with a pre-trained **BLIP-2** model, which includes an image encoder, an LLM (FlanT5 or Vicuna), and a Query Transformer (Q-Former).
    *   **Architecture:** The Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder and feeds these as soft prompt inputs to the frozen LLM.
    *   **Key Functionality:** It aims to enhance the zero-shot transfer ability of vision-language models by leveraging the power of instruction tuning.
    *   **Performance:** The smallest InstructBLIP (4B) outperforms Flamingo (80B) on all six shared evaluation datasets with an average relative improvement of 24.8%.

*   **Otter:**
    *   **Model Type**: A multi-modal model.
    *   **Base Model:** Fine-tuned from **OpenFlamingo (9B)**.
    *   **Fine-tuning Strategy:** The language and vision encoders are frozen, and only the Perceiver resampler module, cross-attention layers, and input/output embeddings are fine-tuned.
    *   **Training Data:** Trained on the MIMIC-IT dataset of **2.8M multi-modal instruction-response pairs**.
    *   **Features:** Each MIMIC-IT instance is an image-instruction-answer triplet accompanied by context made up of related image-instruction-answer triplets.
    *   **Performance**: Compared to OpenFlamingo, Otter follows user instructions more accurately and provides more detailed image descriptions.

*   **MultiModal-GPT:**
    *   **Model Type**: A vision and language model for dialogue with humans.
    *   **Base Model:** Based on the **OpenFlamingo** model, but the specific model size is not mentioned.
    *   **Modalities**: It supports image, text and video modalities.

**Key Takeaways from Section 5.2**

*   **Diverse Model Architectures**: The models include diffusion models, encoder-decoder models, and models that combine visual and language encoders.
*   **Use of Pre-trained Models**: Many models are built upon pre-trained models like Stable Diffusion, CLIP, LLaMA, BLIP-2, and OpenFlamingo, taking advantage of prior knowledge.
*   **Instruction Tuning Focus**: These models are fine-tuned using instruction tuning, enhancing their ability to understand and follow instructions across multiple modalities.
*   **Data Generation Techniques**: Several models utilize large language models such as GPT-3 and GPT-4 to generate instruction-based training data, improving the diversity and quality of the training sets.
*   **Emphasis on Multi-Modal Tasks**: These models address diverse multi-modal tasks, including image editing, visual question answering, and multi-modal dialogue.

In short, section 5.2 shows how existing pre-trained models are adapted to multi-modal tasks through instruction tuning, and how each model's base model, training data, and tuning strategy shape what it can do.

## Efficient Tuning Techniques

This section explores various methods to adapt LLMs to downstream tasks by optimizing a small fraction of parameters. These techniques are categorized into addition-based, specification-based, and reparameterization-based approaches.

*   **Introduction to Efficient Tuning**
    *   Efficient fine-tuning techniques adapt LLMs to downstream tasks by optimizing only a small fraction of parameters. They fall into three families: **addition-based, specification-based, and reparameterization-based**.
        *   **Addition-based methods** introduce extra trainable parameters or modules not present in the original model, such as adapter tuning and prompt-based tuning.
        *   **Specification-based methods** specify certain inherent model parameters to be tuned while freezing others, like BitFit which tunes the bias terms of a pre-trained model.
        *   **Reparameterization methods** transform model weights into more parameter-efficient forms for tuning, such as LoRA.
    *   The key hypothesis behind reparameterization methods is that model adaptation is low-rank, so weights can be reparameterized into low-rank factors or a low-dimensional subspace.

*   **7.1 LoRA**
    *   **Low-Rank Adaptation (LoRA)** enables efficient adaptation of LLMs using low-rank updates.
    *   The survey notes that LoRA uses DeepSpeed as its training backbone.
    *   The key idea is that the actual change in LLMs’ weights needed for new task adaptation lies in a **low-dimensional subspace**.
    *   For a pretrained weight matrix *W0*, the adapted weight matrix is modeled as *W0* + *∆W*, where *∆W* is a low rank update. *∆W* is parameterized as *∆W* = *BA*, where *A* and *B* are much smaller trainable matrices.
    *   The rank *r* of *∆W* is chosen to be much smaller than the dimensions of *W0*. Instead of updating *W0* directly, LoRA keeps it frozen and trains only the small matrices *A* and *B*, so the weight update is confined to a low-rank subspace.
    *   This results in far fewer trainable parameters compared to full fine-tuning.
    *   For GPT-3, LoRA reduces the number of trainable parameters by 10,000x and memory usage by 3x compared to full fine-tuning.
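
The low-rank update described above can be sketched with NumPy. This is a minimal illustration rather than a reference implementation: the sizes `d`, `k`, and `r` are toy values, `W0` is a random stand-in for a pretrained weight, and the initialization (zeros for `B`, small Gaussian values for `A`) is an assumption chosen so that *∆W* starts at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes, with r much smaller than d and k

W0 = rng.normal(size=(d, k))             # frozen pretrained weight (random stand-in)
B = np.zeros((d, r))                     # trainable, zero init so delta_W starts at 0
A = rng.normal(scale=0.01, size=(r, k))  # trainable

def lora_forward(x, W0, B, A):
    """Compute (W0 + B @ A) @ x without materializing the d x k update."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(k,))

# At initialization the adapted layer behaves exactly like the frozen layer.
assert np.allclose(lora_forward(x, W0, B, A), W0 @ x)

# Pretend training has moved B away from zero; the update still has rank <= r.
B = rng.normal(size=(d, r))
delta_W = B @ A
rank = np.linalg.matrix_rank(delta_W)

full_params = W0.size          # d * k parameters under full fine-tuning
lora_params = A.size + B.size  # r * (d + k) parameters under LoRA
print(rank, full_params, lora_params)
```

The saving comes from training `r * (d + k)` values instead of `d * k`, so it grows as the weight matrices get larger relative to the chosen rank.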

*   **7.2 HINT**
    *   **HINT** combines the generalization benefits of instruction tuning with efficient on-demand fine-tuning, avoiding repeatedly processing lengthy instructions.
    *   The core of HINT lies in **hypernetworks**, which generate parameter-efficient modules for LLMs adaptation based on natural language instructions and few-shot examples.
    *   The hypernetwork converts instructions and few-shot examples into an encoded instruction and generates adapter and prefix parameters using a pretrained text encoder and cross-attention based parameter generator.
    *   The generated adapters and prefixes are inserted into the backbone model as efficient tuning modules.
    *   At inference, the hypernetwork performs inference only once per task to generate adapted modules. This allows HINT to incorporate long instructions and additional few-shots without increasing compute, unlike regular fine-tuning or input concatenation methods.
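
The "generate once per task, reuse for every input" pattern can be sketched as follows. Everything here is a simplified assumption: `encode_instruction` is a toy stand-in for the pretrained text encoder, the hypernetwork is reduced to two linear maps (`H_down`, `H_up`), only adapter weights are generated (prefixes are omitted), and a single hidden-state vector stands in for the backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck, d_instr = 16, 2, 8  # illustrative toy sizes

# Hypothetical hypernetwork weights: map an instruction encoding to adapter weights.
H_down = rng.normal(scale=0.1, size=(d_bottleneck * d_model, d_instr))
H_up = rng.normal(scale=0.1, size=(d_model * d_bottleneck, d_instr))

calls = {"hypernet": 0}  # counts how often the instruction is processed

def encode_instruction(text):
    """Toy stand-in for a pretrained text encoder: a deterministic pseudo-embedding."""
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).normal(size=(d_instr,))

def generate_adapter(instruction):
    """Run the hypernetwork once for a task and return the generated adapter weights."""
    calls["hypernet"] += 1
    e = encode_instruction(instruction)
    W_down = (H_down @ e).reshape(d_bottleneck, d_model)
    W_up = (H_up @ e).reshape(d_model, d_bottleneck)
    return W_down, W_up

def backbone_with_adapter(h, W_down, W_up):
    """Apply the generated adapter to a hidden state; the instruction is not re-read."""
    return h + W_up @ np.maximum(W_down @ h, 0.0)

# One hypernetwork pass for the task, then many cheap forward passes.
W_down, W_up = generate_adapter("Translate the sentence into French.")
inputs = [rng.normal(size=(d_model,)) for _ in range(5)]
outputs = [backbone_with_adapter(h, W_down, W_up) for h in inputs]
print(calls["hypernet"], len(outputs))
```

Because the instruction is folded into `W_down` and `W_up` up front, its length does not affect the per-input cost, which is the contrast with concatenating the instruction to every input.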

*   **7.3 BitFit**
    *   **BitFit** is a specification-based method that tunes only the bias terms of the pre-trained model and freezes all other parameters. The survey mentions it without a dedicated subsection.
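
A rough sketch of the selection step, assuming parameters are identified by name and that bias terms end in `.bias`. The parameter table below is hypothetical and only meant to show how small the trainable fraction becomes.

```python
# Hypothetical parameter table: name -> number of scalar parameters.
params = {
    "encoder.layer0.attn.weight": 768 * 768,
    "encoder.layer0.attn.bias": 768,
    "encoder.layer0.ffn.weight": 768 * 3072,
    "encoder.layer0.ffn.bias": 3072,
}

def select_bias_terms(param_sizes):
    """Return the names a BitFit-style setup would train: bias terms only."""
    return [name for name in param_sizes if name.endswith(".bias")]

trainable = select_bias_terms(params)
frozen = [name for name in params if name not in trainable]

trainable_count = sum(params[name] for name in trainable)
total_count = sum(params.values())
fraction = trainable_count / total_count
print(trainable, f"{fraction:.4%}")
```

In a real training loop the frozen names would simply be excluded from the optimizer, so no gradients or optimizer state are kept for them.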

*   **7.4 Adapter Tuning**
    *   **Adapter tuning** is a representative addition-based method: it inserts extra trainable parameters or modules that are not present in the original model, while the original weights stay frozen.
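
One common form of such an added module is a small bottleneck layer with a residual connection. The sketch below assumes that design, which the notes above do not specify; the sizes are toy values and the zero initialization of `W_up` is an assumption that makes the adapter start as an identity function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 32, 4  # illustrative toy sizes

# New trainable parameters; the backbone's own weights are left untouched.
W_down = rng.normal(scale=0.02, size=(d_bottleneck, d_model))
W_up = np.zeros((d_model, d_bottleneck))  # zero init: adapter starts as identity

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, apply ReLU, up-project, add the residual."""
    z = np.maximum(W_down @ h, 0.0)
    return h + W_up @ z

h = rng.normal(size=(d_model,))  # stand-in for a hidden state from the frozen model
out = adapter(h, W_down, W_up)
adapter_params = W_down.size + W_up.size
print(out.shape, adapter_params)
```

Only `W_down` and `W_up` would receive gradient updates, which is what keeps the number of trained parameters small.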

*   **7.5 Delta-tuning**
    *   **Delta-tuning** provides optimization and optimal control perspectives for theoretical analysis.
    *   It performs subspace optimization by restricting tuning to a low-dimensional manifold.
    *   The tuned parameters act as optimal controllers guiding model behavior on downstream tasks.

**Key Takeaways from Section 7**

*   **Efficiency Focus:** The primary goal of these techniques is to reduce the computational cost and resources required for adapting LLMs to new tasks.
*   **Parameter Reduction:** These methods achieve efficiency by optimizing only a small subset of parameters, using techniques like low-rank adaptation, hypernetworks, and bias-term tuning.
*   **Three Main Categories**: The efficient fine-tuning techniques can be broadly categorized into addition-based, specification-based, and reparameterization-based methods.
*   **Practical Benefits**: Techniques like LoRA can significantly reduce the number of trainable parameters and memory usage, making it feasible to fine-tune LLMs on less powerful hardware.
*   **Generalization and Customization**: Methods like HINT allow the model to generalize well using instructions and few-shot examples, without the computational costs of repeated processing.

In short, the techniques in section 7 adapt large language models while training only a small share of their parameters, which makes fine-tuning more accessible on limited hardware.