# Building a Multimodal LLaVA Model

This tutorial demonstrates how to build a multimodal model based on the LLaVA architecture by combining a language model and a vision model with a projection layer. LLaVA models can process both text and images, which enables tasks such as image captioning and visual question answering.

## What is LLaVA?

LLaVA is a multimodal architecture that connects a vision encoder to a language model through a projection layer, which maps image features into the language model's embedding space. This lets the model generate text conditioned on both image and text inputs.

## Prerequisites

Before starting, make sure you have installed the ``align-anything`` package.

```bash
# clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# create virtual env
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the `CUDA_HOME` environment variable.

```bash
# We tested on the H800 computing cluster, and this version of CUDA works well.
# You can adjust this version according to the actual situation of the computing cluster.

conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If CUDA is installed in a different location, for example with `nvcc` at `/usr/local/cuda/bin/nvcc`, set the environment variable as follows:

```bash
export CUDA_HOME="/usr/local/cuda"
```

Finally, install `align-anything` by:

```bash
# Separate extras are provided for training and evaluation.
# If you only need the training or the evaluation module,
# install just the corresponding dependencies.
pip install -e .[train] # install the training dependencies
pip install -e .[evaluate] # install the evaluation dependencies

# If you need to install all dependencies, you can use the following command:
pip install -e .[all]
```

## Importing Required Libraries

```python
import os
import torch
from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
)
```

## Helper Function for Model Optimization

The following function ensures that all model parameters are contiguous in memory, which can improve performance.

```python
def make_model_contiguous(module):
    """Make all model parameters contiguous in memory for better performance."""
    for child in module.children():
        make_model_contiguous(child)

    for param in module.parameters(recurse=False):
        param.data = param.data.contiguous()
```

## Setting Up Model Paths

You can customize these paths based on your preferences. By default, we'll use Meta-Llama-3.1-8B-Instruct as the language model and CLIP ViT-Large as the vision model.

```python
# Define model paths
language_model_path = (
    "/PATH/TO/YOUR/Llama-3.1-8B-Instruct"  # You can change this to any compatible language model
)
vision_tower_path = (
    "/PATH/TO/YOUR/clip-vit-large-patch14-336"  # You can change this to any compatible vision model
)
# TODO: change this to your own path
save_path = "/PATH/TO/YOUR/llama_vision"  # Where to save the combined model

# Create the save directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)
```
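Loading an 8B-parameter model takes a while, so it can be worth checking the paths before starting. The following is a minimal sketch, assuming each local checkpoint directory contains a `config.json` file, as Hugging Face checkpoints normally do. The helper name `missing_model_files` is hypothetical and not part of `align-anything` or `transformers`.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Example usage with the variables defined above:
# for path in (language_model_path, vision_tower_path):
#     missing = missing_model_files(path)
#     if missing:
#         raise FileNotFoundError(f"{path} is missing {missing}")
```

An empty list means every required file was found. Extend `required` if you also want to check for tokenizer or preprocessor files.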

## Loading the Base Models

First, we'll load the vision model, language model, tokenizer, and image processor.

```python
# Load the vision model
print("Loading vision model...")
vision_model = AutoModel.from_pretrained(vision_tower_path)

# Load the language model
print("Loading language model...")
language_model = AutoModelForCausalLM.from_pretrained(language_model_path)

# Load the tokenizer for the language model
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(language_model_path)

# Load the image processor for the vision model
print("Loading image processor...")
image_processor = CLIPImageProcessor.from_pretrained(vision_tower_path)
```

## Preparing the Tokenizer

We need to add special tokens to the tokenizer for handling images and padding.

```python
# Add special tokens to the tokenizer
tokenizer.add_special_tokens({'additional_special_tokens': ['<image>', '<unk>', '<pad>']})
tokenizer.pad_token = '<pad>'
tokenizer.unk_token = '<unk>'

print(f"Tokenizer vocabulary size after adding special tokens: {len(tokenizer)}")
```
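To confirm that the new tokens were registered, you can check that each one maps to its own ID rather than falling back to the unknown-token ID. The sketch below relies only on `convert_tokens_to_ids`, `unk_token`, and `unk_token_id`, which Hugging Face tokenizers expose. The helper name `find_unregistered_tokens` is hypothetical.

```python
def find_unregistered_tokens(tokenizer, tokens):
    """Return the tokens that do not map to a dedicated ID in the vocabulary."""
    problems = []
    for token in tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        # The unknown token itself legitimately maps to unk_token_id.
        is_unk_itself = token == tokenizer.unk_token
        if token_id is None or (token_id == tokenizer.unk_token_id and not is_unk_itself):
            problems.append(token)
    return problems


# Example usage with the tokenizer prepared above:
# assert find_unregistered_tokens(tokenizer, ['<image>', '<unk>', '<pad>']) == []
```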

## Creating the LLaVA Configuration

Now we'll create a configuration that combines the vision and language model configurations.

```python
# Extract configurations from the base models
vision_config = vision_model.vision_model.config
language_config = language_model.config

# Create a combined LLaVA configuration
config = LlavaConfig(vision_config, language_config)

# Set the image token index in the configuration
config.image_token_index = tokenizer.convert_tokens_to_ids('<image>')

# Use the original LLaVA chat template, copied from
# https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/chat_template.json
# Note: the string must open with exactly three quotes; a fourth would add a literal '"' to every prompt.
llava_template = """{% for message in messages %}{% if message['role'] != 'system' %}{{ message['role'].upper() + ': '}}{% endif %}{# Render all images first #}{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}{{ '<image>\n' }}{% endfor %}{# Render all text next #}{% if message['role'] != 'assistant' %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{{ content['text'] + ' '}}{% endfor %}{% else %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] + ' '}}{% endgeneration %}{% endfor %}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'ASSISTANT:' }}{% endif %}"""

# Create a processor that combines the image processor and tokenizer
processor = LlavaProcessor(
    image_processor=image_processor,
    tokenizer=tokenizer,
    patch_size=14,
    chat_template=llava_template,
)
```
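The `patch_size=14` setting, together with the 336-pixel input resolution of `clip-vit-large-patch14-336`, determines how many visual embeddings each image contributes to the language model's input. The arithmetic is sketched below, assuming a square input and assuming the vision features are used with the CLS embedding dropped, which is the default feature-selection behaviour in LLaVA. The function name is hypothetical.

```python
def num_image_tokens(image_size, patch_size, drop_cls_token=True):
    """Number of visual embeddings a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # A CLIP-style ViT prepends one CLS embedding to the patch embeddings.
    return num_patches if drop_cls_token else num_patches + 1


print(num_image_tokens(336, 14))                        # 576
print(num_image_tokens(336, 14, drop_cls_token=False))  # 577
```

If you swap in a vision model with a different resolution or patch size, this count changes accordingly, and so does the share of the context window each image occupies.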

## Building the LLaVA Model

Now we'll create the LLaVA model by combining the vision and language models.

```python
# Initialize the LLaVA model with the combined configuration
print("Building LLaVA model...")
model = LlavaForConditionalGeneration(config)

# Assign the pre-trained language model
model.language_model = language_model

# Assign the pre-trained vision encoder. `vision_model` was loaded with AutoModel,
# so it is the full CLIP model (text + vision); only its vision transformer is
# copied into the LLaVA vision tower.
model.vision_tower.vision_model = vision_model.vision_model

# Resize the token embeddings to match the new tokenizer size
model.resize_token_embeddings(len(tokenizer))

# Make the model parameters contiguous for better performance
make_model_contiguous(model)

print("LLaVA model built successfully!")
```
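One way to sanity-check the assembled model is to count parameters per top-level component. The sketch below is a hypothetical helper that works with any object exposing `named_parameters()`, as `torch.nn.Module` does. The exact component names it reports depend on your `transformers` version, so treat the keys as illustrative.

```python
from collections import defaultdict


def parameters_by_component(model, depth=1):
    """Sum parameter counts grouped by the first `depth` parts of each parameter name."""
    totals = defaultdict(int)
    for name, param in model.named_parameters():
        key = ".".join(name.split(".")[:depth])
        totals[key] += param.numel()
    return dict(totals)


# Example usage with the model built above:
# for component, count in parameters_by_component(model).items():
#     print(f"{component}: {count / 1e6:.1f}M parameters")
```

The projection layer should be by far the smallest component. It is also the only one that has not been pre-trained.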

## Saving the Model

Finally, we'll save the combined model and processor to disk.

Before saving the model, you may need to set the `CUDA_HOME` environment variable to avoid a CUDA warning.

```python
import os

# CUDA_HOME is the CUDA root directory, i.e. the parent of the `bin/` folder reported by `which nvcc`
os.environ["CUDA_HOME"] = "/PATH/TO/YOUR/CUDA"
```
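If you are unsure which directory to use, the CUDA root can usually be derived from the location of `nvcc`. The sketch below assumes the conventional `<root>/bin/nvcc` layout. The function name `cuda_home_from_nvcc` is hypothetical.

```python
import shutil
from pathlib import Path


def cuda_home_from_nvcc(nvcc_path):
    """Map `<root>/bin/nvcc` to `<root>`; return None if the layout differs."""
    path = Path(nvcc_path)
    if path.name != "nvcc" or path.parent.name != "bin":
        return None
    return str(path.parent.parent)


nvcc = shutil.which("nvcc")  # None when nvcc is not on PATH
if nvcc is not None:
    print(cuda_home_from_nvcc(nvcc))
```

For example, an `nvcc` located at `/usr/local/cuda/bin/nvcc` corresponds to a `CUDA_HOME` of `/usr/local/cuda`.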

```python
# Save the model and processor
print(f"Saving model to {save_path}...")
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print("Model and processor saved successfully!")
```

## Testing the Model

Let's test our newly created LLaVA model with a sample image.

```python
from PIL import Image
import requests

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
```

First, inspect the prompt and the image.

```python
prompt
```

```python
raw_image
```
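To make the chat template easier to follow, here is a plain-Python approximation of what it is intended to produce: the upper-cased role, then every image placeholder in the message, then the text, and finally the `ASSISTANT:` generation prompt. This is only an illustrative sketch. The function `render_llava_prompt` is hypothetical, and the processor's Jinja template remains the source of truth.

```python
def render_llava_prompt(messages, add_generation_prompt=True):
    """Plain-Python approximation of the LLaVA-1.5 chat template."""
    parts = []
    for message in messages:
        if message["role"] != "system":
            parts.append(message["role"].upper() + ": ")
        # Image placeholders are emitted before any text in the same message.
        for item in message["content"]:
            if item["type"] == "image":
                parts.append("<image>\n")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"] + " ")
    if add_generation_prompt:
        parts.append("ASSISTANT:")
    return "".join(parts)


example = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
print(repr(render_llava_prompt(example)))
# 'USER: <image>\nWhat are these? ASSISTANT:'
```

Note that the image placeholder comes first even though the text entry is listed first in the message content.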

Then, we load the model and processor.

```python
# Load the saved model and processor
loaded_model = LlavaForConditionalGeneration.from_pretrained(save_path)
loaded_processor = LlavaProcessor.from_pretrained(save_path)
```

Next, convert the image and the prompt into PyTorch tensors.

```python
inputs = loaded_processor(images=raw_image, text=prompt, return_tensors='pt')
print('The input has the following keys: ', inputs.keys())
print('Their shapes are: ', {k: v.shape for k, v in inputs.items()})
```

Finally, we can generate the response.

```python
output = loaded_model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode with the processor that was loaded alongside the model
print(loaded_processor.decode(output[0][2:], skip_special_tokens=True))
```

As you can see, the model's output is mostly nonsense. That's because the newly initialized projection layer has not yet been trained on image-text pairs. As shown in the LLaVA paper, the model should be fine-tuned on image-text pairs to produce useful output. A reference script is provided in `scripts/llava_step1.sh`.

The training setup is mainly based on the LLaVA paper.

**Note**: Complete the following steps before training to make sure the process runs smoothly.

1. Download the `liuhaotian/LLaVA-Pretrain` dataset by:

```bash
huggingface-cli download --repo-type dataset --resume-download liuhaotian/LLaVA-Pretrain --local-dir /PATH/TO/YOUR/liuhaotian/LLaVA-Pretrain --local-dir-use-symlinks False
```

2. Unzip the `images.zip` file and place the extracted `images` folder at `/PATH/TO/YOUR/images/`.

3. Set the `COCO_DATA_DIR` environment variable to the path of the `images` folder.
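Once these steps are done, it can help to confirm that the annotation file and the image folder line up before launching a long training run. The sketch below assumes the annotation JSON is a list of records whose `image` field holds a path relative to the image folder. That layout is an assumption about the dataset, not something this tutorial specifies, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(annotation_file, image_root, limit=None):
    """Return relative image paths listed in the annotations but absent on disk."""
    with open(annotation_file, encoding="utf-8") as f:
        records = json.load(f)
    missing = []
    for record in records[:limit]:
        rel_path = record.get("image")  # assumed key; records without it are skipped
        if rel_path and not (Path(image_root) / rel_path).is_file():
            missing.append(rel_path)
    return missing


# Example usage (checks only the first 1000 records to stay fast):
# import os
# missing = find_missing_images(
#     "/PATH/TO/YOUR/liuhaotian/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json",
#     os.environ["COCO_DATA_DIR"],
#     limit=1000,
# )
# print(f"{len(missing)} images missing")
```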

The final script is as follows:

```bash
export COCO_DATA_DIR="/PATH/TO/YOUR/images/"


# Initialize variables
MODEL_NAME_OR_PATH="/PATH/TO/YOUR/MLLMs"
TRAIN_DATASETS="/PATH/TO/YOUR/liuhaotian/LLaVA-Pretrain"
TRAIN_TEMPLATE="LLaVA_Pretrain" # dataset template
TRAIN_DATA_FILES="blip_laion_cc_sbu_558k.json"

OUTPUT_DIR="../outputs/llava_step1" # output dir

# For wandb online logging
export WANDB_API_KEY=""

# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
        --master_port ${MASTER_PORT} \
        --module align_anything.trainers.text_image_to_text.sft \
        --model_name_or_path ${MODEL_NAME_OR_PATH} \
        --train_datasets ${TRAIN_DATASETS} \
        --train_template ${TRAIN_TEMPLATE} \
        --train_split train \
        --train_data_files ${TRAIN_DATA_FILES} \
        --output_dir ${OUTPUT_DIR} \
        --save_total_limit 3 \
        --freeze_vision_tower True \
        --freeze_mm_proj False \
        --freeze_language_model True \
        --epochs 1 \
        --ds_cfgs ds_z2_config.json \
        --learning_rate 1.e-3 \
        --per_device_train_batch_size 16 \
        --gradient_accumulation_steps 32
```
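The batch-size flags in the script combine multiplicatively with the number of GPUs. The sketch below shows the arithmetic. The per-device batch size (16) and the gradient accumulation steps (32) come from the script, while the GPU count of 8 and the dataset size of roughly 558,000 samples (inferred from the file name) are assumptions for illustration.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus):
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps_per_epoch(num_samples, batch_size):
    """Optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


# 16 and 32 come from the script above; 8 GPUs and 558,000 samples are assumptions.
batch = effective_batch_size(16, 32, num_gpus=8)
print(batch)                                      # 4096
print(optimizer_steps_per_epoch(558_000, batch))  # 137
```

If you train on fewer GPUs, you can raise `--gradient_accumulation_steps` to keep the effective batch size unchanged. The exact step count also depends on how the trainer handles the final partial batch.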





## Conclusion

Congratulations! You've successfully built a multimodal LLaVA model by combining a language model with a vision model. This model can now process both text and images, enabling a wide range of multimodal applications.

### Acknowledgements

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [LLaVA Paper](https://arxiv.org/abs/2304.08485)