![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Qwen2VL.ipynb)

# Import OpenVINO Qwen2VL models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing Qwen2VL models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models, so please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import Qwen2VL models via `Qwen2VL`. These models are usually listed under the `Text Generation` category and have `Qwen2VL` in their labels.
- Reference: [Qwen2VL](https://huggingface.co/docs/transformers/model_doc/qwen2_vl)
- Some [example models](https://huggingface.co/models?search=Qwen2VL)

## 1. Export and Save the HuggingFace model

- Let's install the `transformers` and `openvino` packages along with other dependencies. You don't need `openvino` installed to run Spark NLP, but we need it here to load and save models from HuggingFace.
- The install cell below requires `transformers>=4.45.0`. Newer releases may work as well, but we want you to know which version range this notebook was set up for.

In [1]:
from pathlib import Path
import requests

In [None]:

%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"
%pip install -q  "sentencepiece" "tokenizers>=0.12.1" "transformers>=4.45.0" "gradio>=4.36"
%pip install -q -U --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly openvino-tokenizers openvino openvino-genai
%pip install -q --upgrade huggingface_hub
%pip install -q --upgrade "torch>=2.2.1"
%pip install -q --upgrade qwen-vl-utils

utility_files = ["notebook_utils.py", "cmd_helper.py"]

from pathlib import Path
import requests

if not Path("ov_qwen2_vl.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-vl/ov_qwen2_vl.py")
    # write_text opens, writes, and closes the file in one call
    Path("ov_qwen2_vl.py").write_text(r.text)

if not Path("notebook_utils.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
    Path("notebook_utils.py").write_text(r.text)

### 1.1 Convert the model to OpenVINO

In [2]:
from ov_qwen2_vl import model_selector

model_id = model_selector()

model_id

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino


Dropdown(description='Model:', options=('Qwen/Qwen2-VL-2B-Instruct', 'Qwen/Qwen2-VL-7B-Instruct'), value='Qwen…

In [None]:
print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])

Selected Qwen/Qwen2-VL-2B-Instruct


In [4]:
model_dir

PosixPath('test/Qwen2-VL-2B-Instruct')

In [5]:
from ov_qwen2_vl import convert_qwen2vl_model
import nncf

compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)


⌛ Qwen/Qwen2-VL-2B-Instruct conversion started. Be patient, it may takes some time.
⌛ Load Original model


`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46



✅ Original model successfully loaded
⌛ Convert Input embedding model


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


✅ Input embedding model successfully converted
⌛ Convert Language model




✅ Language model successfully converted
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 15% (1 / 197)               │ 0% (0 / 196)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 85% (196 / 197)             │ 100% (196 / 196)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



✅ Weights compression finished
⌛ Convert Image embedding model
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 1% (1 / 130)                │ 0% (0 / 129)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 99% (129 / 130)             │ 100% (129 / 129)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙



✅ Weights compression finished
✅ Image embedding model successfully converted
✅ Qwen/Qwen2-VL-2B-Instruct model conversion finished. You can find results in test/Qwen2-VL-2B-Instruct
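
To make the `compression_configuration` above more concrete, here is a simplified, pure-Python illustration of asymmetric 4-bit quantization with per-group parameters. It is only a sketch of the general idea behind `INT4_ASYM` with `group_size=128`, not NNCF's actual implementation; all function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group_int4_asym(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of float weights to integer codes in [0, 15] plus (scale, minimum)."""
    w_min, w_max = min(weights), max(weights)
    # 4 bits give 16 levels, i.e. 15 steps between the smallest and largest weight.
    scale = (w_max - w_min) / 15
    if scale == 0:
        scale = 1.0  # constant group: any scale reproduces it exactly
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


def quantize_int4_asym(weights: List[float], group_size: int = 128):
    """Quantize a flat weight list group by group; each group keeps its own scale and minimum."""
    return [
        quantize_group_int4_asym(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


# Example: 256 weights become two groups of 128 codes each.
example = [i / 10 - 3 for i in range(256)]
groups = quantize_int4_asym(example)
print(len(groups), len(groups[0][0]))
```

A smaller group size means more scales and minimums to store but a tighter fit per group. In the configuration above, `ratio=1.0` corresponds to the log's report that 100% of the ratio-defining parameters were compressed to `int4_asym`.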


In [6]:
import torch
import torch.nn as nn

class Qwen2ReshapePatches(nn.Module):
    def __init__(self,
                 temporal_patch_size: int = 2,
                 merge_size: int = 2,
                 patch_size: int = 14
                 ):
        super().__init__()
        self.temporal_patch_size = temporal_patch_size
        self.merge_size = merge_size
        self.patch_size = patch_size

    def forward(self, patches, repetition_factor=1):
        # Repeat the patches along the first dimension
        patches = patches.repeat(repetition_factor, 1, 1, 1)
        channel = patches.shape[1]
        grid_t = patches.shape[0] // self.temporal_patch_size
        resized_height = patches.shape[2]
        resized_width = patches.shape[3]
        grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
        patches = patches.reshape(
            grid_t,
            self.temporal_patch_size,
            channel,
            grid_h // self.merge_size,
            self.merge_size,
            self.patch_size,
            grid_w // self.merge_size,
            self.merge_size,
            self.patch_size,
        )
        patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
        flatten_patches = patches.reshape(
            grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
        )

        return flatten_patches


patch_reshape_model = Qwen2ReshapePatches()

In [7]:
import openvino as ov


ov_model = ov.convert_model(
            patch_reshape_model,
            example_input={
                "patches": torch.ones((1, 3, 1372, 2044), dtype=torch.float32),
                "repetition_factor": torch.tensor(2),
            }
        )

# Save the OpenVINO model
ov.save_model(ov_model, model_dir/"openvino_patch_reshape_model.xml")
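
The example input used for the export above (a `1372 x 2044` image, repeated twice along the first axis) determines the patch grid. The sketch below reproduces only the shape arithmetic of `Qwen2ReshapePatches.forward` in plain Python, with the defaults taken from the class above; the helper name `patch_grid` is hypothetical.

```python
def patch_grid(height, width, frames, patch_size=14, temporal_patch_size=2, channels=3):
    """Return ((grid_t, grid_h, grid_w), (rows, cols)) of the flattened patch matrix."""
    grid_t = frames // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size
    # One row per patch position, one column per value inside a patch.
    rows = grid_t * grid_h * grid_w
    cols = channels * temporal_patch_size * patch_size * patch_size
    return (grid_t, grid_h, grid_w), (rows, cols)


# One image repeated with repetition_factor=2 gives 2 frames.
grid, flat_shape = patch_grid(height=1372, width=2044, frames=2)
print(grid, flat_shape)  # (1, 98, 146) (14308, 1176)
```

The resulting grid `(1, 98, 146)` is the same `grid_thw` value used as the example input when the rotary embedding model is exported.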

In [8]:
from transformers.models.qwen2_vl.modeling_qwen2_vl import VisionRotaryEmbedding
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")


class RotaryEmbedding(nn.Module):

    def __init__(self, embed_dim, spatial_merge_size):
        super().__init__()
        self._rotary_pos_emb = VisionRotaryEmbedding(embed_dim)
        self.spatial_merge_size = spatial_merge_size
    
    def forward(self, grid_thw):
        t, h, w = grid_thw
        pos_ids = []
        # for t, h, w in grid_thw:

        hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
        hpos_ids = hpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        hpos_ids = hpos_ids.permute(0, 2, 1, 3)
        hpos_ids = hpos_ids.flatten()

        wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
        wpos_ids = wpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        wpos_ids = wpos_ids.permute(0, 2, 1, 3)
        wpos_ids = wpos_ids.flatten()
        pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
        pos_ids = torch.cat(pos_ids, dim=0)
        max_grid_size = grid_thw.max()
        rotary_pos_emb_full = self._rotary_pos_emb(max_grid_size)
        rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
        return rotary_pos_emb



vision_rotary_embedding = RotaryEmbedding(config.vision_config.embed_dim // config.vision_config.num_heads // 2, config.vision_config.spatial_merge_size)


In [9]:
import openvino as ov

vision_embedding_ov = ov.convert_model(
    vision_rotary_embedding,
    example_input={
        "grid_thw": torch.tensor([1, 98, 146]),
    }
)

# Save the OpenVINO model
ov.save_model(vision_embedding_ov, model_dir/"openvino_rotary_embeddings_model.xml")
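
The reshape and permute calls in `RotaryEmbedding.forward` reorder the `(row, column)` position ids so that the patches of each `spatial_merge_size x spatial_merge_size` block are adjacent. A plain-Python sketch of that ordering, under the assumption that `h` and `w` are divisible by the merge size, could look like this (the function name is hypothetical):

```python
def merged_order_pos_ids(h, w, merge_size=2):
    """List (row, col) ids in the order produced by reshape + permute(0, 2, 1, 3) + flatten."""
    ids = []
    for block_row in range(h // merge_size):
        for block_col in range(w // merge_size):
            # Walk through the patches inside one merge block, row by row.
            for r in range(merge_size):
                for c in range(merge_size):
                    ids.append((block_row * merge_size + r, block_col * merge_size + c))
    return ids


print(merged_order_pos_ids(2, 4))
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

In the exported module these pairs are then used to index the full rotary table, and the result is repeated `t` times for the temporal axis.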



In [10]:
class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self,image_token_index=151655):
        super().__init__()
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        image_features = vision_embeds
        inputs_embeds = inputs_embeds
        special_image_mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        # image_features = image_features.to(inputs_embeds.dtype)
        final_embedding = inputs_embeds.masked_scatter(special_image_mask, image_features)

        return {
            "inputs_embeds": final_embedding
        }

torch_model_merge = MergeMultiModalInputs()

In [11]:
import openvino as ov

# convert MergeMultiModalInputs to OpenVINO IR
ov_model_merge = ov.convert_model(
    torch_model_merge,
    example_input={
        "vision_embeds": torch.randn((3577, 1536), dtype=torch.float32),
        "inputs_embeds": torch.randn((1, 3602, 1536), dtype=torch.float32),
        "input_ids": torch.randint(0, 151656, (1, 3602), dtype=torch.long),
    }
)
ov.save_model(ov_model_merge, model_dir/"openvino_multimodal_merge_model.xml")
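
`MergeMultiModalInputs` relies on `masked_scatter`, which fills the positions selected by the mask with the source rows in order. The sketch below shows the same idea with plain lists for a single sequence; the function name is hypothetical and the token id `151655` is the default from the class above. For reference, the `3577` vision rows in the example input correspond to `(98 // 2) * (146 // 2)` merged patches.

```python
IMAGE_TOKEN_ID = 151655  # default image_token_index used by MergeMultiModalInputs


def merge_embeddings(input_ids, text_embeds, vision_embeds):
    """Replace the embedding of every image placeholder token with the next vision embedding."""
    vision_iter = iter(vision_embeds)
    merged = []
    for token_id, embed in zip(input_ids, text_embeds):
        # Placeholders consume vision rows in order; other tokens keep their text embedding.
        merged.append(next(vision_iter) if token_id == IMAGE_TOKEN_ID else embed)
    return merged


ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2]
text = [[0.0], [0.1], [0.2], [0.3]]
vision = [[9.0], [8.0]]
print(merge_embeddings(ids, text, vision))  # [[0.0], [9.0], [8.0], [0.3]]
```

The number of placeholder tokens in the prompt has to match the number of vision rows, otherwise there is nothing left to scatter (or rows are left unused).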

### 1.2 Load OpenVINO models

In [7]:
LANGUAGE_MODEL_NAME = "openvino_language_model.xml"
IMAGE_EMBEDDING_NAME = "openvino_vision_embeddings_model.xml"
IMAGE_EMBEDDING_MERGER_NAME = "openvino_vision_embeddings_merger_model.xml"
TEXT_EMBEDDING_NAME = "openvino_text_embeddings_model.xml"
ROTARY_EMBEDDING_NAME = "openvino_rotary_embeddings_model.xml"
PATCH_RESHAPE_NAME = "openvino_patch_reshape_model.xml"

In [None]:
import openvino as ov
import gc

core = ov.Core()
model_path = model_dir

language_model = core.read_model(model_path / LANGUAGE_MODEL_NAME)
compiled_language_model = core.compile_model(language_model, "CPU")
request = compiled_language_model.create_infer_request()

image_embedding = core.compile_model(model_path / IMAGE_EMBEDDING_NAME, "CPU")
image_embedding_merger = core.compile_model(model_path / IMAGE_EMBEDDING_MERGER_NAME, "CPU")
text_embedding = core.compile_model(model_path / TEXT_EMBEDDING_NAME, "CPU")
rotary_embedding = core.compile_model(model_path / ROTARY_EMBEDDING_NAME, "CPU")
patch_reshape = core.compile_model(model_path / PATCH_RESHAPE_NAME, "CPU")


In [15]:
# check if all the models are converted

print("⌛ Check if all models are converted")
language_model_path = model_dir / LANGUAGE_MODEL_NAME
image_embed_path = model_dir / IMAGE_EMBEDDING_NAME
image_merger_path = model_dir / IMAGE_EMBEDDING_MERGER_NAME
text_embed_path = model_dir / TEXT_EMBEDDING_NAME
rotary_embed_path = model_dir / ROTARY_EMBEDDING_NAME
patch_reshape_path = model_dir / PATCH_RESHAPE_NAME




if all(
    [
        language_model_path.exists(),
        image_embed_path.exists(),
        image_merger_path.exists(),
        text_embed_path.exists(),
        rotary_embed_path.exists(),
        patch_reshape_path.exists(),
    ]
):
    print(f"✅ All models are converted. You can find results in {model_dir}")
else:
    print("❌ Not all models are converted. Please check the conversion process")

⌛ Check if all models are converted
✅ All models are converted. You can find results in test/Qwen2-VL-2B-Instruct


### 1.3 Copy assets to the assets folder

In [16]:
assets_dir = model_dir / "assets"
assets_dir.mkdir(exist_ok=True)

# copy all the assets to the assets directory (json files, vocab files, etc.)

import shutil

# copy all json files

for file in model_dir.glob("*.json"):
    shutil.copy(file, assets_dir)

    


In [17]:
!ls -lh {model_dir}

total 1.7G
-rw-rw-r-- 1 prabod prabod  392 Dec 10 06:55 added_tokens.json
drwxrwxr-x 2 prabod prabod 4.0K Dec 10 06:59 assets
-rw-rw-r-- 1 prabod prabod 1.1K Dec 10 06:55 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.2K Dec 10 06:55 config.json
-rw-rw-r-- 1 prabod prabod 1.6M Dec 10 06:55 merges.txt
-rw-rw-r-- 1 prabod prabod 873M Dec 10 06:57 openvino_language_model.bin
-rw-rw-r-- 1 prabod prabod 3.5M Dec 10 06:57 openvino_language_model.xml
-rw-rw-r-- 1 prabod prabod   40 Dec 10 06:58 openvino_multimodal_merge_model.bin
-rw-rw-r-- 1 prabod prabod 9.8K Dec 10 06:58 openvino_multimodal_merge_model.xml
-rw-rw-r-- 1 prabod prabod  132 Dec 10 06:58 openvino_patch_reshape_model.bin
-rw-rw-r-- 1 prabod prabod  24K Dec 10 06:58 openvino_patch_reshape_model.xml
-rw-rw-r-- 1 prabod prabod  132 Dec 10 06:58 openvino_rotary_embeddings_model.bin
-rw-rw-r-- 1 prabod prabod  30K Dec 10 06:58 openvino_rotary_embeddings_model.xml
-rw-rw-r-- 1 prabod prabod 446M Dec 10 06:55 openvino_text_embeddings

In [18]:
!ls -lh {assets_dir}

total 14M
-rw-rw-r-- 1 prabod prabod  392 Dec 10 07:05 added_tokens.json
-rw-rw-r-- 1 prabod prabod 1.1K Dec 10 07:05 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.2K Dec 10 07:05 config.json
-rw-rw-r-- 1 prabod prabod  567 Dec 10 07:05 preprocessor_config.json
-rw-rw-r-- 1 prabod prabod  613 Dec 10 07:05 special_tokens_map.json
-rw-rw-r-- 1 prabod prabod 4.3K Dec 10 07:05 tokenizer_config.json
-rw-rw-r-- 1 prabod prabod  11M Dec 10 07:05 tokenizer.json
-rw-rw-r-- 1 prabod prabod 2.7M Dec 10 07:05 vocab.json


### 1.4 Test the OpenVINO model

In [5]:
import openvino as ov
import torch
from pathlib import Path
core = ov.Core()
device = "CPU"


In [8]:

# Path to the converted model directory from step 1.1; adjust it to your environment.
model_path = Path("/mnt/research/Projects/ModelZoo/QWEN2-VL/test/Qwen2-VL-2B-Instruct")

language_model = core.read_model(model_path / LANGUAGE_MODEL_NAME)
compiled_language_model = core.compile_model(language_model, "CPU")
request = compiled_language_model.create_infer_request()

image_embedding = core.compile_model(model_path / IMAGE_EMBEDDING_NAME, "CPU")
image_embedding_merger = core.compile_model(model_path / IMAGE_EMBEDDING_MERGER_NAME, "CPU")
text_embedding = core.compile_model(model_path / TEXT_EMBEDDING_NAME, "CPU")
rotary_embedding = core.compile_model(model_path / ROTARY_EMBEDDING_NAME, "CPU")
patch_reshape = core.compile_model(model_path / PATCH_RESHAPE_NAME, "CPU")

In [9]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info
import torch
import numpy as np

config = AutoConfig.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", device_map="cpu", 
  trust_remote_code=True, 
  torch_dtype="float16", 
  _attn_implementation='eager'   
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs_new = inputs.to("cpu")


input_ids = inputs_new['input_ids']
pixel_values = inputs_new['pixel_values']
image_grid_thw = inputs_new['image_grid_thw']
current_input_ids = input_ids
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
generated_tokens = []
for i in range(50):
    inputs_embeds = torch.from_numpy(text_embedding(current_input_ids)[0])
    if current_input_ids.shape[-1] > 1:
        hidden_states = torch.from_numpy(image_embedding(pixel_values)[0])
        rotary_pos_emb = torch.cat([torch.from_numpy(rotary_embedding(x)[0]) for x in image_grid_thw], dim=0)
        grid_thw = image_grid_thw
        cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(dim=0, dtype=torch.int32)
        cu_seqlens = torch.nn.functional.pad(cu_seqlens, (1, 0), value=0)
        attention_mask = torch.zeros((1, hidden_states.shape[0], hidden_states.shape[0]), dtype=torch.bool)
        causal_mask = torch.zeros_like(attention_mask, dtype=torch.float32)
        # Use a separate loop variable so the outer step counter `i` is not overwritten.
        for j in range(1, len(cu_seqlens)):
            attention_mask[..., cu_seqlens[j - 1] : cu_seqlens[j], cu_seqlens[j - 1] : cu_seqlens[j]] = True

        causal_mask.masked_fill_(torch.logical_not(attention_mask), float("-inf"))

        image_embeds = torch.from_numpy(image_embedding_merger(
            {
                "hidden_states": hidden_states,
                "rotary_pos_emb": rotary_pos_emb,
                "attention_mask": attention_mask,
            }
        )[0])
        image_mask = input_ids == config.image_token_id
        inputs_embeds[image_mask] = image_embeds

    # Start every step from an empty dict so only the language model inputs are passed on.
    inputs = {}

    if current_input_ids.shape[-1] > 1:
        attention_mask = inputs_new["attention_mask"]
        position_ids = torch.arange(current_input_ids.shape[1], device=current_input_ids.device).view(1, 1, -1).expand(3, current_input_ids.shape[0], -1)
    
    # Prepare inputs for the model
    inputs["inputs_embeds"] = inputs_embeds
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(inputs_embeds.shape[0], dtype=int)
    
    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()
    
    # Get the logits and find the next token
    logits = torch.from_numpy(request.get_tensor("logits").data)
    next_token = logits.argmax(-1)[0][-1]

    # Append the generated token
    generated_tokens.append(next_token)
    
    # Update input_ids with the new token
    current_input_ids = torch.cat([next_token.unsqueeze(0).unsqueeze(0)], dim=-1)
    
    position_ids = torch.tensor(inputs_new["input_ids"].shape[-1] + i).view(1, 1, -1).expand(3, current_input_ids.shape[0], -1)

    inputs["position_ids"] = position_ids


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
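
The vision part of the loop above builds a block-diagonal attention mask from `cu_seqlens`, so that patches only attend to patches of the same frame. A plain-Python sketch of that construction follows; the function names are hypothetical and nested lists stand in for tensors.

```python
def cumulative_seqlens(grid_thw):
    """grid_thw: list of (t, h, w). Returns boundaries [0, n1, n1 + n2, ...], one segment per frame."""
    bounds = [0]
    for t, h, w in grid_thw:
        for _ in range(t):
            # Every frame contributes h * w patch positions.
            bounds.append(bounds[-1] + h * w)
    return bounds


def block_diagonal_mask(bounds):
    """Boolean matrix that is True only where row and column fall in the same segment."""
    n = bounds[-1]
    mask = [[False] * n for _ in range(n)]
    for start, end in zip(bounds, bounds[1:]):
        for r in range(start, end):
            for c in range(start, end):
                mask[r][c] = True
    return mask


bounds = cumulative_seqlens([(1, 1, 2), (1, 2, 1)])
print(bounds)  # [0, 2, 4]
for row in block_diagonal_mask(bounds):
    print(row)
```

With a single image of one frame, as in the test above, there is only one segment and the whole mask is `True`.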



In [10]:
output_text = processor.batch_decode(
    generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print("Question:\n Describe this Image")
print("Answer:")
print("".join(output_text))

Question:
 Describe this Image
Answer:
The Bennett is sitting on the beach, wearing a plaid shirt and black pants. She is smiling and appears to be enjoying her time outdoors. A dog is sitting next to her, wearing a harness and leash. The beach is sandy and the sky
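
The test above follows the usual greedy decoding scheme: the full prompt is processed once, then each newly generated token is fed back on its own at the next position. The sketch below isolates that control flow with a toy scoring function instead of the OpenVINO request. The names `greedy_decode` and `toy_step` are hypothetical, and the optional `eos_id` stop condition is an addition that the notebook loop does not have.

```python
def greedy_decode(prompt_ids, step_fn, max_new_tokens, eos_id=None):
    """Greedy decoding: step_fn(token_ids, positions) returns scores for the next token."""
    generated = []
    current = list(prompt_ids)  # first step: the whole prompt (prefill)
    position = 0
    for _ in range(max_new_tokens):
        positions = list(range(position, position + len(current)))
        logits = step_fn(current, positions)
        # Greedy choice: index of the highest score.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:
            break
        # Later steps feed only the new token, placed right after everything seen so far.
        position += len(current)
        current = [next_id]
    return generated


def toy_step(token_ids, positions):
    """Toy model over a 5-token vocabulary: always predicts (last token + 1) mod 5."""
    logits = [0.0] * 5
    logits[(token_ids[-1] + 1) % 5] = 1.0
    return logits


print(greedy_decode([0, 1], toy_step, max_new_tokens=4))  # [2, 3, 4, 0]
```

In the notebook the role of `step_fn` is played by the stateful language model request, which keeps the key/value cache between calls.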


## 2. Import and Save Qwen2VL in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- This part is straightforward with our setup script.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [1]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


24/11/07 09:56:55 WARN Utils: Your hostname, minotaur resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface eno1)
24/11/07 09:56:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/11/07 09:56:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [None]:
from sparknlp.annotator import *

imageClassifier = Qwen2VLForMultiModal.pretrained() \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

24/11/07 09:57:34 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native15331424460843812197/libtbb.so.2




In [None]:
imageClassifier.write().overwrite().save("Qwen2VL_spark_nlp")

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = Qwen2VLForMultiModal.load("Qwen2VL_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)
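
The prompt string in the cell above is written out by hand, including the chat markers and the image placeholder. If you want to ask different questions, a small helper can assemble the same string. This is a sketch with a hypothetical function name; the template is copied from the literal used above.

```python
def build_qwen2vl_prompt(question, system="You are a helpful assistant."):
    """Assemble the single-image chat prompt used in this notebook."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_qwen2vl_prompt("Describe this image."))
```

The returned string can be passed to `lit(...)` when building `test_df` in place of the hand-written literal.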

In [None]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)

for result in annotations_result:
    print(result["answer"])

image_path: /mnt/research/Projects/ModelZoo/LLAVA/images/image1.jpg
[Annotation(document, 0, 363, This image features a cat comfortably laying inside a cardboard box. The cat appears to be relaxed and enjoying its cozy spot. The scene takes place on a carpeted floor, which adds to the overall warm and inviting atmosphere of the image. The cat's position inside the box creates a sense of security and contentment, making it an endearing and heartwarming scene., Map(), [])]
