The goal of this tutorial is to create a custom vision-language model by combining a vision encoder's weights with a language model's weights, adjusting the configuration as needed to keep the two compatible. Specifically, we will demonstrate how to integrate a pre-trained vision encoder (from Qwen2.5-VL-7B-Instruct) into a smaller language model (Qwen2.5-0.5B-Instruct).

Since Qwen already ships a vision-language variant, we can start from its config and adjust the LLM-related fields so that it works with the Qwen2.5-0.5B-Instruct model.
Let's look at the differences between the config.json of the language model (LM) and that of the vision-language model (VLM).
<br>
<img src="/home/jackson/vision-r1/static/diff_1.png" width="400"> <br>
<img src="/home/jackson/vision-r1/static/diff_2.png" width="200">

A few differences stand out:
1. The architectures differ (`Qwen2ForCausalLM` vs. `Qwen2_5_VLForConditionalGeneration`); the latter is the architecture that incorporates vision.
2. The VLM has additional tokens for vision. We need to check whether those tokens exist in the LM's vocabulary and add them if they do not.
3. `hidden_size`, `num_hidden_layers`, `num_attention_heads`, etc. differ.
4. The VLM has an extra `vision_config` key.
5. The `out_hidden_size` of the vision model is 2048 while the `hidden_size` of the LM is 896, so we need to modify the last layer of the vision model to make them compatible.
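
The same comparison can be made programmatically rather than by reading screenshots. The following is a minimal sketch, assuming both configs have been read into plain dicts (for example with `json.load`). `diff_configs` is a hypothetical helper, and the two small dicts are trimmed stand-ins that only hold values mentioned in this tutorial, not the full config files.

```python
ABSENT = "<absent>"  # marker for a key that exists in only one config


def diff_configs(lm_cfg, vlm_cfg):
    """Return {key: (lm_value, vlm_value)} for every top-level key that differs."""
    diff = {}
    for key in sorted(set(lm_cfg) | set(vlm_cfg)):
        lm_value = lm_cfg.get(key, ABSENT)
        vlm_value = vlm_cfg.get(key, ABSENT)
        if lm_value != vlm_value:
            diff[key] = (lm_value, vlm_value)
    return diff


# Trimmed stand-ins, not the real config.json contents.
lm_cfg = {
    "architectures": ["Qwen2ForCausalLM"],
    "vocab_size": 151936,
}
vlm_cfg = {
    "architectures": ["Qwen2_5_VLForConditionalGeneration"],
    "vocab_size": 151936,
    "vision_config": {"out_hidden_size": 2048},
}

for key, (lm_value, vlm_value) in diff_configs(lm_cfg, vlm_cfg).items():
    print(f"{key}: LM={lm_value!r} VLM={vlm_value!r}")
```

Keys with identical values (here `vocab_size`) are skipped, so only the fields that need attention are printed.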

In [1]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    AutoConfig,
    Qwen2_5_VLForConditionalGeneration,
)
from collections import OrderedDict

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

lm = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)



## Checking additional tokens
If we check the Qwen VL model's config, we can see that its `vocab_size` is 151936, the same as the LM's. The config also defines the following special token IDs for vision. Let's check whether the LM's vocabulary contains the same tokens.
`vision_start_token_id`: 151652 <br>
`vision_end_token_id`: 151653 <br>
`vision_token_id`: 151654 <br>
`image_token_id`: 151655 <br>
`video_token_id`: 151656 <br>

In [2]:
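# IDs taken from the VLM config (151652, the vision start token, can be checked the same way)
vision_tokens = [151653, 151654, 151655, 151656]
for vt in vision_tokens:
    # convert_ids_to_tokens is the public equivalent of _convert_id_to_token
    print(tokenizer.convert_ids_to_tokens(vt))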

<|vision_end|>
<|vision_pad|>
<|image_pad|>
<|video_pad|>


It looks like the LM already has those tokens, so we don't need to add any new ones.
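
For a check that does not rely on reading the printed output, the lookup can be turned into an explicit comparison. This is a rough sketch: `find_token_mismatches` is a hypothetical helper, and `fake_vocab` is a stand-in for the tokenizer so that the snippet runs without downloading a model. With a real tokenizer, `tokenizer.convert_ids_to_tokens` could be passed as the lookup function instead.

```python
def find_token_mismatches(id_to_token, expected):
    """Return {token_id: (expected_token, actual_token)} for every mismatch."""
    mismatches = {}
    for token_id, expected_token in expected.items():
        actual_token = id_to_token(token_id)
        if actual_token != expected_token:
            mismatches[token_id] = (expected_token, actual_token)
    return mismatches


# Expected mapping, taken from the IDs and printed tokens above.
expected_vision_tokens = {
    151653: "<|vision_end|>",
    151654: "<|vision_pad|>",
    151655: "<|image_pad|>",
    151656: "<|video_pad|>",
}

# Stand-in lookup table; replace fake_vocab.get with a real tokenizer lookup.
fake_vocab = dict(expected_vision_tokens)
mismatches = find_token_mismatches(fake_vocab.get, expected_vision_tokens)
print(mismatches or "all vision tokens present")
```

An empty result means no tokens need to be added and the embedding matrix does not need resizing.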

## Combine vision encoder with base model
The simplest approach is to create a new config.json that fits both the vision encoder and Qwen2.5-0.5B, use it to instantiate a new model architecture, then load the corresponding pre-trained weights into it.

We can start from the vision model's config.json and modify it.

This is the new config file, with the language-model fields (such as `hidden_size`, `intermediate_size`, etc.) changed to match the LM config:
```json
{
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "vision_start_token_id": 151652,
  "vision_end_token_id": 151653,
  "vision_token_id": 151654,
  "image_token_id": 151655,
  "video_token_id": 151656,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vision_config": {
    "depth": 32,
    "hidden_act": "silu",
    "hidden_size": 1280,
    "intermediate_size": 3420,
    "num_heads": 16,
    "in_chans": 3,
    "out_hidden_size": 896,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "window_size": 112,
    "fullatt_block_indexes": [
      7,
      15,
      23,
      31
    ],
    "tokens_per_second": 2,
    "temporal_patch_size": 2
  },
  "rope_scaling": {
    "type": "mrope",
    "mrope_section": [
      16,
      24,
      24
    ]
  },
  "vocab_size": 151936
}
```
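
A config like the one above could also be derived in code rather than edited by hand. The sketch below assumes both configs are plain dicts. `build_custom_config` and the `LM_KEYS` list are hypothetical, and which fields belong in that list is an assumption based on the values shown above. The sample dicts are trimmed stand-ins.

```python
import copy
import json

# Assumed set of top-level fields that should follow the language model.
LM_KEYS = [
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_hidden_layers",
    "num_key_value_heads",
    "max_window_layers",
    "tie_word_embeddings",
]


def build_custom_config(vlm_cfg, lm_cfg, lm_keys=LM_KEYS):
    """Copy the VLM config, then overwrite the LM-related fields from the LM config."""
    cfg = copy.deepcopy(vlm_cfg)  # leave the input dicts untouched
    for key in lm_keys:
        if key in lm_cfg:
            cfg[key] = lm_cfg[key]
    # The vision encoder must emit features of the LM's hidden size.
    cfg["vision_config"]["out_hidden_size"] = cfg["hidden_size"]
    return cfg


# Trimmed stand-ins using values that appear in this tutorial.
vlm_cfg = {
    "architectures": ["Qwen2_5_VLForConditionalGeneration"],
    "hidden_size": 3584,
    "vision_config": {"hidden_size": 1280, "out_hidden_size": 3584},
}
lm_cfg = {
    "hidden_size": 896,
    "intermediate_size": 4864,
    "num_attention_heads": 14,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
}

custom_cfg = build_custom_config(vlm_cfg, lm_cfg)
print(json.dumps(custom_cfg, indent=2))
```

Note that `vision_config.hidden_size` (the encoder's internal width) is left alone; only `out_hidden_size` has to agree with the LM.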
We can now load the custom model with our new config.json file!

In [None]:
config = AutoConfig.from_pretrained("qwen-0.5-vl-config.json")
custom_model = Qwen2_5_VLForConditionalGeneration._from_config(config=config)

The next step is to save the `vision_encoder` weights from the VLM, along with the LM weights.

In [4]:
# save vision encoder weights
vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/tmp2/jackson/Qwen--Qwen2.5-VL-7B-Instruct",
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)
visual_weights_path = "visual_weights.pth"
torch.save(vlm.visual.state_dict(), visual_weights_path)

Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  1.68it/s]


In [8]:
# save lm weights
lm_weights_path = "lm_weights.pth"
torch.save(lm.state_dict(), lm_weights_path)

Now we can load the saved weights from disk and put them into our custom model.

In [9]:
# load previously saved vision encoder and LM weights
visual_state_dict = torch.load("visual_weights.pth", weights_only=True)
llm_state_dict = torch.load("lm_weights.pth", weights_only=True)
modified_visual_state_dict = OrderedDict(
    (f"visual.{key}", value) for key, value in visual_state_dict.items()
)
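
The `visual.` prefix is needed because a submodule's `state_dict()` reports keys relative to that submodule, while the full model expects them under the attribute name. Here is a small sketch of that renaming plus a check against the target's key names. `add_prefix` and `keys_not_in_target` are hypothetical helpers, and strings stand in for tensors so the snippet runs without a model.

```python
from collections import OrderedDict


def add_prefix(state_dict, prefix):
    """Return a new ordered dict with `prefix.` prepended to every key."""
    return OrderedDict((f"{prefix}.{key}", value) for key, value in state_dict.items())


def keys_not_in_target(state_dict, target_keys):
    """List the keys for which the target model has no matching entry."""
    target = set(target_keys)
    return [key for key in state_dict if key not in target]


# Toy stand-ins: two keys from the merger, with strings in place of tensors.
submodule_sd = OrderedDict([("merger.mlp.2.weight", "W"), ("merger.mlp.2.bias", "b")])
target_keys = ["visual.merger.mlp.2.weight", "visual.merger.mlp.2.bias"]

prefixed = add_prefix(submodule_sd, "visual")
print(keys_not_in_target(submodule_sd, target_keys))  # unprefixed keys do not match
print(keys_not_in_target(prefixed, target_keys))      # prefixed keys all match
```

With real models, `target_keys` would be something like `custom_model.state_dict().keys()`.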

Note that the output size of the vision encoder's last layer does not match the LM's `hidden_size`, so we need to modify that layer. Instead of randomly initializing the weights or adding a new projection layer, we slice the original weights for simplicity: the `out_hidden_size` of Qwen2.5-VL-7B is 3584, and we keep only the first 896 rows (the LM's `hidden_size`) as the new weights.
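
To see what row slicing does to a linear layer, here is a toy sketch with tiny dimensions (8 inputs, 6 outputs reduced to 3) standing in for the real 5120, 3584 and 896. It assumes the last merger layer behaves like a plain `nn.Linear`; the layer names here are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)

full = nn.Linear(8, 6)    # stands in for the original last layer
sliced = nn.Linear(8, 3)  # stands in for the resized last layer

# nn.Linear stores weight as (out_features, in_features),
# so keeping the first rows keeps the first output features.
with torch.no_grad():
    sliced.weight.copy_(full.weight[:3, :])
    sliced.bias.copy_(full.bias[:3])

x = torch.randn(4, 8)
same = torch.allclose(sliced(x), full(x)[:, :3], atol=1e-6)
print(tuple(sliced.weight.shape), same)
```

The sliced layer reproduces the first output features of the original and drops the rest. Those features were trained against a different language model, so they are unlikely to line up with the 0.5B model's embedding space until the combined model is fine-tuned.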

In [10]:
print(modified_visual_state_dict["visual.merger.mlp.2.weight"].shape)
print(modified_visual_state_dict["visual.merger.mlp.2.bias"].shape)
new_output_shape = 896
modified_visual_state_dict["visual.merger.mlp.2.weight"] = modified_visual_state_dict[
    "visual.merger.mlp.2.weight"
][:new_output_shape, :]
modified_visual_state_dict["visual.merger.mlp.2.bias"] = modified_visual_state_dict[
    "visual.merger.mlp.2.bias"
][:new_output_shape]
print("Modified")
print(modified_visual_state_dict["visual.merger.mlp.2.weight"].shape)
print(modified_visual_state_dict["visual.merger.mlp.2.bias"].shape)

torch.Size([3584, 5120])
torch.Size([3584])
Modified
torch.Size([896, 5120])
torch.Size([896])
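
Shapes are worth verifying before loading, because `load_state_dict(..., strict=False)` tolerates missing or unexpected keys but still raises an error when a shared key has a different shape. A minimal sketch of such a check follows, working on plain shape tuples. `find_shape_mismatches` is a hypothetical helper, and the shapes are the ones printed above.

```python
def find_shape_mismatches(source_shapes, target_shapes):
    """Return {key: (source_shape, target_shape)} for shared keys whose shapes differ."""
    return {
        key: (tuple(shape), tuple(target_shapes[key]))
        for key, shape in source_shapes.items()
        if key in target_shapes and tuple(shape) != tuple(target_shapes[key])
    }


# What the custom model expects for a hidden size of 896.
target = {
    "visual.merger.mlp.2.weight": (896, 5120),
    "visual.merger.mlp.2.bias": (896,),
}
# Shapes before and after slicing, as printed above.
before = {
    "visual.merger.mlp.2.weight": (3584, 5120),
    "visual.merger.mlp.2.bias": (3584,),
}
after = {
    "visual.merger.mlp.2.weight": (896, 5120),
    "visual.merger.mlp.2.bias": (896,),
}

print(find_shape_mismatches(before, target))
print(find_shape_mismatches(after, target))
```

With real tensors, the shape dicts could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.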


In [11]:
miss_1 = custom_model.load_state_dict(modified_visual_state_dict, strict=False)
miss_2 = custom_model.load_state_dict(llm_state_dict, strict=False)

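In [12]:
# Each partial load reports every key it did not supply, so a weight is
# truly missing only if it appears in BOTH lists: intersect, do not concatenate.
total_miss = set(miss_1.missing_keys) & set(miss_2.missing_keys)

Check whether any weights are still missing:

In [None]:
# miss_1 (visual load) lists the LM keys that the visual weights did not cover
# miss_2 (LM load) lists the visual keys that the LM weights did not cover
for name, param in custom_model.named_parameters():
    if name in total_miss:
        # covered by neither load, so still randomly initialized
        print(name)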
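
`load_state_dict` returns a result with both `missing_keys` and `unexpected_keys`, and the second list is also worth a look: it names weights in a state dict that the model has no slot for. The sketch below combines several partial loads into one report. `summarize_loads` is a hypothetical helper, `LoadResult` is a stand-in that mirrors those two fields, and the key names are illustrative.

```python
from collections import namedtuple

# Stand-in with the same two fields as the value returned by load_state_dict.
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])


def summarize_loads(*results):
    """Combine partial loads into (keys no load supplied, keys no slot accepted)."""
    still_missing = set(results[0].missing_keys)
    for result in results[1:]:
        still_missing &= set(result.missing_keys)  # missing in every load
    unexpected = set()
    for result in results:
        unexpected |= set(result.unexpected_keys)  # rejected by any load
    return sorted(still_missing), sorted(unexpected)


# Illustrative results: the visual load lacks the LM keys and vice versa.
visual_load = LoadResult(
    missing_keys=["model.embed_tokens.weight", "lm_head.weight"],
    unexpected_keys=[],
)
lm_load = LoadResult(
    missing_keys=["visual.merger.mlp.2.weight"],
    unexpected_keys=[],
)

still_missing, unexpected = summarize_loads(visual_load, lm_load)
print("still missing:", still_missing)
print("unexpected:", unexpected)
```

Two empty lists would indicate that every parameter received a pre-trained value and that no saved weight was silently dropped.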

In [15]:
custom_model.save_pretrained("qwen-0.5-vl-custom")

Done! We have now successfully integrated the pre-trained `vision_encoder` and LM weights, and can proceed to fine-tune the combined model! :)