## Install the necessary packages
1. `pip install git+https://github.com/huggingface/transformers`
2. `pip install qwen-vl-utils`
3. `pip install accelerate`
4. `pip install flash-attn --no-build-isolation` (only if your machine has a GPU)

In [None]:
%pip install git+https://github.com/huggingface/transformers
%pip install qwen-vl-utils
%pip install accelerate
%pip install flash-attn --no-build-isolation

In [1]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

### Load the model
- It'll load in 2 shards

In [2]:
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16, # bfloat16 (brain floating point) stores each weight in 2 bytes, halving memory relative to fp32.
    attn_implementation="flash_attention_2", # FlashAttention-2 reduces attention memory usage as the context length increases.
    device_map="auto",
)
model.device, model.dtype

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

(device(type='cuda', index=0), torch.bfloat16)

Qwen2-VL-2B is a `2.21` billion parameter model.

In [4]:
total_params = sum(p.numel() for p in model.parameters()) / 1e9
print(f"Total number of parameters: {total_params:.2f} billion")

Total number of parameters: 2.21 billion


Let's calculate how much VRAM the model needs.
- The model has 2.21B parameters (as calculated above).
- Each parameter takes 2 bytes in fp16/bf16.

In [5]:
# Calculate the memory needed to load a 2.21 billion parameter model in bf16
# Each parameter in bf16 takes 2 bytes
model_params = 2.21e9  # 2.21 billion parameters
bytes_per_param = 2  # bf16 uses 2 bytes per parameter
total_memory_bytes = model_params * bytes_per_param
total_memory_gb = total_memory_bytes / (1024 ** 3)  # Convert bytes to GB
print(f"Memory needed to load the model in bf16: {total_memory_gb:.2f} GB")

Memory needed to load the model in bf16: 4.12 GB
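
The same arithmetic generalizes to other precisions. The sketch below is a hypothetical helper (`weight_memory_gib` and the `BYTES_PER_PARAM` table are illustrative names, not part of any library) that estimates the memory taken by the weights alone. Activations and the KV cache used during generation come on top of this figure.

```python
# Bytes per parameter for a few common dtypes (standard storage sizes).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


# 2.21e9 is the parameter count measured earlier in this notebook.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(2.21e9, dtype):.2f} GiB")
```

For bf16 this reproduces the 4.12 figure computed above.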


Let's check whether the calculated VRAM matches the actual usage. It does.

In [3]:
if torch.cuda.is_available():
    print(f"GPU VRAM Usage: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")
else:
    print("CUDA is not available. Cannot determine GPU VRAM usage.")

GPU VRAM Usage: 4.12 GB


### Processor
- A processor converts the inputs the model accepts (text, images, video) into the form that can be fed to the model.
- Text-only LLMs have only a tokenizer. Here we have two components:
    - vision processor
    - text processor (tokenizer)

In [6]:
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

preprocessor_config.json:   0%|          | 0.00/347 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.19k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

In [7]:
print(processor)

Qwen2VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2-VL-2B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', 

In [12]:
dir(processor)[33:]

['_upload_modified_files',
 'apply_chat_template',
 'attributes',
 'batch_decode',
 'chat_template',
 'decode',
 'feature_extractor_class',
 'from_args_and_dict',
 'from_pretrained',
 'get_processor_dict',
 'image_processor',
 'image_processor_class',
 'model_input_names',
 'optional_attributes',
 'optional_call_args',
 'prepare_and_validate_optional_call_args',
 'push_to_hub',
 'register_for_auto_class',
 'save_pretrained',
 'to_dict',
 'to_json_file',
 'to_json_string',
 'tokenizer',
 'tokenizer_class',
 'valid_kwargs',
 'validate_init_kwargs']

### Let's see how the processor works!

In [15]:
message = [
    {
        "role": "user",
        "content": "Yo what up!"
    }
]

text_inputs = processor.apply_chat_template(message)
text_inputs

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n'

In [16]:
text_inputs = processor.apply_chat_template(message, add_generation_prompt=True)
text_inputs

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n<|im_start|>assistant\n'
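
To make the template's structure explicit, here is a minimal pure-Python sketch that reproduces the two strings above for plain-text messages. It is not the model's actual chat template (that ships with the processor), and `render_chatml` is a hypothetical name; the sketch only mirrors the layout visible in the outputs: a default system turn, one `<|im_start|>role ... <|im_end|>` block per message, and an optional open assistant turn.

```python
def render_chatml(messages, add_generation_prompt=False,
                  system_prompt="You are a helpful assistant."):
    """Render text-only chat messages in the layout shown in the outputs above."""
    parts = []
    # A default system turn is prepended when the chat does not start with one.
    if messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>\n")
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # The generation prompt opens an assistant turn for the model to complete.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


message = [{"role": "user", "content": "Yo what up!"}]
print(repr(render_chatml(message)))
print(repr(render_chatml(message, add_generation_prompt=True)))
```

The only difference `add_generation_prompt=True` makes is the trailing `<|im_start|>assistant\n`, which cues the model to answer rather than continue the user turn.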

In [33]:
text_inputs = processor.apply_chat_template(message, add_generation_prompt=True, tokenize=True)
text_inputs, len(text_inputs)

([151644,
  8948,
  198,
  2610,
  525,
  264,
  10950,
  17847,
  13,
  151645,
  198,
  151644,
  872,
  198,
  64725,
  1128,
  705,
  0,
  151645,
  198,
  151644,
  77091,
  198],
 23)

In [34]:
processor.tokenizer.decode(text_inputs)

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n<|im_start|>assistant\n'

Now let's add an image to the conversation.

In [18]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
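# Note: this passes `message` (the text-only chat defined earlier), not `messages`,
# so the prompt below contains no image placeholder.
text_inputs = processor.apply_chat_template(message, add_generation_prompt=True)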
text_inputs

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n<|im_start|>assistant\n'

In [21]:
image_inputs, video_inputs = process_vision_info(messages)
image_inputs, video_inputs

([<PIL.Image.Image image mode=RGB size=2044x1372>], None)

In [27]:
image_inputs[0], image_inputs[0].size

(<PIL.Image.Image image mode=RGB size=2044x1372>, (2044, 1372))

In [29]:
inputs = processor(
    text=[text_inputs],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw'])

In [37]:
inputs['pixel_values'].shape

torch.Size([14308, 1176])
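
The shape can be reconstructed from the image processor config printed earlier. The sketch below assumes the processor cuts the 2044x1372 image into `patch_size` x `patch_size` patches and flattens each one into `channels * temporal_patch_size * patch_size**2` values; that reading is an assumption, but it matches both numbers exactly.

```python
# Values taken from the image_processor config printed earlier.
patch_size = 14
temporal_patch_size = 2
channels = 3  # RGB

# Size reported by process_vision_info for this image.
width, height = 2044, 1372

grid_w = width // patch_size    # patches per row
grid_h = height // patch_size   # patches per column
num_patches = grid_h * grid_w
patch_dim = channels * temporal_patch_size * patch_size * patch_size

print((num_patches, patch_dim))  # (14308, 1176)
```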

In [31]:
inputs['input_ids'], inputs['input_ids'].shape

(tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
          151645,    198, 151644,    872,    198,  64725,   1128,    705,      0,
          151645,    198, 151644,  77091,    198]]),
 torch.Size([1, 23]))

In [36]:
processor.tokenizer.decode(inputs['input_ids'].reshape(-1))

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n<|im_start|>assistant\n'

In [38]:
inputs = inputs.to("cuda")

In [39]:
generated_ids = model.generate(**inputs, max_new_tokens=128)

In [42]:
generated_ids, generated_ids.shape

(tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
          151645,    198, 151644,    872,    198,  64725,   1128,    705,      0,
          151645,    198, 151644,  77091,    198,   9707,      0,   2585,    646,
             358,   7789,    498,   3351,     30, 151645]], device='cuda:0'),
 torch.Size([1, 33]))

In [43]:
processor.tokenizer.decode(generated_ids.reshape(-1))

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nYo what up!<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>'
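
The decoded string repeats the whole prompt because `generate` returns the prompt tokens followed by the new ones. A common pattern is to slice off the prompt before decoding. The sketch below shows the idea on plain lists using the IDs printed above; with tensors the equivalent slice would be `generated_ids[:, inputs["input_ids"].shape[1]:]`.

```python
# Prompt token IDs (the 23 tokens of input_ids shown earlier).
prompt_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13,
              151645, 198, 151644, 872, 198, 64725, 1128, 705, 0,
              151645, 198, 151644, 77091, 198]

# Full output of generate: the prompt followed by the newly generated tokens.
generated = prompt_ids + [9707, 0, 2585, 646, 358, 7789, 498, 3351, 30, 151645]

# Keep only what the model produced after the prompt.
new_tokens = generated[len(prompt_ids):]
print(len(generated), len(new_tokens))  # 33 10
```

Passing the trimmed IDs to `processor.tokenizer.decode(..., skip_special_tokens=True)` should then return only the assistant's reply without the trailing `<|im_end|>`.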

In [45]:
from PIL import Image
import requests

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
image.size

(2048, 1365)
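
The raw image is 2048x1365, whereas `process_vision_info` returned 2044x1372 for the same URL earlier. A plausible explanation, assuming each side is rounded to the nearest multiple of `patch_size * merge_size` (14 * 2 = 28), is sketched below. This is a simplified illustration rather than the library's code: it leaves out any rescaling for images outside the `min_pixels`/`max_pixels` range and only checks that this image falls inside it.

```python
def round_to_multiple(value: int, factor: int) -> int:
    """Round value to the nearest multiple of factor."""
    return round(value / factor) * factor


factor = 14 * 2  # patch_size * merge_size from the processor config
min_pixels, max_pixels = 3136, 12845056  # also from the processor config

width, height = 2048, 1365
resized = (round_to_multiple(width, factor), round_to_multiple(height, factor))

# This image is within the pixel budget, so no further rescaling is needed.
assert min_pixels <= resized[0] * resized[1] <= max_pixels
print(resized)  # (2044, 1372)
```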

`<|vision_start|><|image_pad|><|vision_end|>` marks the position where the image embeddings will be inserted.

In [49]:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image",
            },
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation)
text_prompt

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n'

In [58]:
# Tokenize once and reuse the result instead of calling the tokenizer twice.
prompt_ids = processor.tokenizer(text_prompt, return_tensors="pt")['input_ids']
prompt_ids, prompt_ids.shape

(tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
          151645,    198, 151644,    872,    198,  74785,    419,   2168,     13,
          151652, 151655, 151653, 151645,    198]]),
 torch.Size([1, 23]))

In [51]:
inputs = processor(
    text=[text_prompt], images=[image], return_tensors="pt"
)
inputs = inputs.to("cuda")
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw'])

In [53]:
inputs["input_ids"].shape

torch.Size([1, 3599])
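
The jump from 23 to 3599 tokens comes from the processor expanding the single `<|image_pad|>` placeholder into one token per merged image patch. The arithmetic below is a sketch under two assumptions: the processor resizes the 2048x1365 image to 2044x1372 (the size `process_vision_info` reported for the same image earlier), and every `merge_size` x `merge_size` group of patches becomes one token.

```python
# Values taken from the image_processor config printed earlier.
patch_size = 14
merge_size = 2

# Assumed size after the processor's internal resize.
resized_w, resized_h = 2044, 1372

num_patches = (resized_w // patch_size) * (resized_h // patch_size)
image_tokens = num_patches // (merge_size ** 2)

text_tokens = 23  # tokenized prompt length, including one <|image_pad|>
total_tokens = text_tokens - 1 + image_tokens  # the placeholder is replaced

print(image_tokens, total_tokens)  # 3577 3599
```

The result matches the `input_ids` length above, which supports both assumptions.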

In [59]:
output_ids = model.generate(**inputs, max_new_tokens=128)

In [60]:
processor.tokenizer.decode(output_ids.reshape(-1))

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe this image.<|vision_start|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><