![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Qwen2VL.ipynb)

# Import OpenVINO Qwen2VL models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing Qwen2VL models from HuggingFace and importing them into Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is computationally expensive, so we recommend a runtime with more than 32 GB of memory for exporting the quantized model from HuggingFace.
- You can import Qwen2VL models via `Qwen2VL`. These models are usually under `Text Generation` category and have `Qwen2VL` in their labels.
- Reference: [Qwen2VL](https://huggingface.co/docs/transformers/model_doc/qwen2_vl)
- Some [example models](https://huggingface.co/models?search=Qwen2VL)

## 1. Export and Save the HuggingFace model

- Let's install the `transformers` and `openvino` packages along with the other dependencies. You don't need `openvino` installed to use Spark NLP; we only need it here to load and save models from HuggingFace.
- We require `transformers>=4.45.0`, as in the install cell below. This doesn't mean other releases won't work, but we want you to know which version constraints the notebook was run with.
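Once the install cell below has run, it can help to confirm that the environment actually satisfies those minimums. The following is a minimal sketch using only the standard library; the `MINIMUM_VERSIONS` table and the helper names are our own illustration, with values copied from the install cell, and the version parser is deliberately simplified (it ignores pre-release ordering).

```python
from importlib import metadata

# Minimums copied from the install cell; adjust them if you change that cell.
MINIMUM_VERSIONS = {
    "openvino": "2024.4.0",
    "nncf": "2.13.0",
    "transformers": "4.45.0",
    "torch": "2.2.1",
}


def version_tuple(version: str) -> tuple:
    """Turn '2.13.0' or '2.5.1+cu121' into a tuple of ints (simplified parser)."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break  # stop at suffixes such as 'dev20250101'
        numbers.append(int(leading))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(minimums=MINIMUM_VERSIONS) -> dict:
    """Return a short status string for every package in the table."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "not installed"
            continue
        verdict = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        status[name] = f"{installed} ({verdict})"
    return status


for package, message in report().items():
    print(f"{package}: {message}")
```

If a package shows up as "not installed" or "older than ...", rerun the install cell and restart the kernel before continuing.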

In [24]:

%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"
%pip install -q  "sentencepiece" "tokenizers>=0.12.1" "transformers>=4.45.0" "gradio>=4.36" "accelerate>=0.26.0"
%pip install -q -U --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly openvino-tokenizers openvino openvino-genai
%pip install -q --upgrade huggingface_hub
%pip install -q --upgrade "torch>=2.2.1" "torchvision>=0.10.2"
%pip install -q --upgrade qwen-vl-utils
%pip install -q --upgrade ipywidgets

utility_files = ["notebook_utils.py", "cmd_helper.py"]

from pathlib import Path
import requests

# Download the helper scripts from the OpenVINO notebooks repository if they are not present yet
if not Path("ov_qwen2_vl.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-vl/ov_qwen2_vl.py")
    r.raise_for_status()  # fail early instead of saving an error page
    Path("ov_qwen2_vl.py").write_text(r.text)

if not Path("notebook_utils.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
    r.raise_for_status()  # fail early instead of saving an error page
    Path("notebook_utils.py").write_text(r.text)

Note: you may need to restart the kernel to use updated packages.


### 1.1 Convert the model to OpenVINO

In [3]:
from ov_qwen2_vl import model_selector
from pathlib import Path
import requests

model_id = model_selector()

model_id

Dropdown(description='Model:', options=('Qwen/Qwen2-VL-2B-Instruct', 'Qwen/Qwen2-VL-7B-Instruct'), value='Qwen…

In [4]:
print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])

Selected Qwen/Qwen2-VL-2B-Instruct


In [5]:
model_dir

PosixPath('Qwen2-VL-2B-Instruct')

In [6]:
from ov_qwen2_vl import convert_qwen2vl_model
import nncf

compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)


⌛ Qwen/Qwen2-VL-2B-Instruct conversion started. Be patient, it may takes some time.
⌛ Load Original model


Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Original model successfully loaded
⌛ Convert Input embedding model
✅ Input embedding model successfully converted
⌛ Convert Language model


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


✅ Language model successfully converted
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 15% (1 / 197)               │ 0% (0 / 196)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 85% (196 / 197)             │ 100% (196 / 196)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


Output()

✅ Weights compression finished
⌛ Convert Image embedding model
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 1% (1 / 130)                │ 0% (0 / 129)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 99% (129 / 130)             │ 100% (129 / 129)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


Output()

✅ Weights compression finished
✅ Image embedding model successfully converted
✅ Qwen/Qwen2-VL-2B-Instruct model conversion finished. You can find results in Qwen2-VL-2B-Instruct
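To make the compression settings above more concrete, here is a simplified illustration of what group-wise asymmetric 4-bit quantization does to a block of weights. This is not NNCF's implementation, which involves additional logic; the function names are hypothetical and the data is a toy list. As we understand it, `"group_size": 128` means every 128 consecutive weights share one scale and offset, and `"ratio": 1.0` asks for all eligible layers to use INT4, which matches the `100% (196 / 196)` column in the log above.

```python
def quantize_group_int4_asym(weights):
    """Map one group of floats to 4-bit codes (0..15) plus a scale and offset."""
    w_min, w_max = min(weights), max(weights)
    # 4 bits give 16 levels, i.e. 15 steps between the smallest and largest weight.
    scale = (w_max - w_min) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Recover approximate float weights from the 4-bit codes."""
    return [code * scale + w_min for code in codes]


def quantize_groupwise(weights, group_size=128):
    """Quantize a flat list of weights, one (codes, scale, offset) triple per group."""
    return [
        quantize_group_int4_asym(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


# Toy data: 256 weights split into two groups of 128, as with "group_size": 128.
toy_weights = [((i * 37) % 101) / 100 - 0.5 for i in range(256)]
groups = quantize_groupwise(toy_weights, group_size=128)
codes, scale, w_min = groups[0]
restored = dequantize_group(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(toy_weights[:128], restored))
print(f"groups={len(groups)} scale={scale:.4f} max_error={max_error:.4f}")
```

In this sketch the rounding error per weight is at most half a quantization step, so smaller groups (a tighter min/max range per group) trade a little extra storage for lower error.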


In [7]:
import torch
import torch.nn as nn

class Qwen2ReshapePatches(nn.Module):
    def __init__(self,
                 temporal_patch_size: int = 2,
                 merge_size: int = 2,
                 patch_size: int = 14
                 ):
        super().__init__()
        self.temporal_patch_size = temporal_patch_size
        self.merge_size = merge_size
        self.patch_size = patch_size

    def forward(self, patches, repetition_factor=1):
        # Repeat the patches along the first dimension
        patches = patches.repeat(repetition_factor, 1, 1, 1)
        channel = patches.shape[1]
        grid_t = patches.shape[0] // self.temporal_patch_size
        resized_height = patches.shape[2]
        resized_width = patches.shape[3]
        grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
        patches = patches.reshape(
            grid_t,
            self.temporal_patch_size,
            channel,
            grid_h // self.merge_size,
            self.merge_size,
            self.patch_size,
            grid_w // self.merge_size,
            self.merge_size,
            self.patch_size,
        )
        patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
        flatten_patches = patches.reshape(
            grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
        )

        return flatten_patches


patch_reshape_model = Qwen2ReshapePatches()

In [8]:
import openvino as ov


ov_model = ov.convert_model(
            patch_reshape_model,
            example_input={
                "patches": torch.ones((1, 3, 1372, 2044), dtype=torch.float32),
                "repetition_factor": torch.tensor(2),
            }
        )

# Save the OpenVINO model
ov.save_model(ov_model, model_dir/"openvino_patch_reshape_model.xml")
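The shape bookkeeping in `Qwen2ReshapePatches` is easy to lose track of, so here is a small pure-Python sketch that reproduces only the arithmetic, without any tensors. The helper name is ours; the default values are the ones from the class above. For the example input used in the conversion, it yields a grid of `(1, 98, 146)` and `3577` merged positions, which appears to line up with the `grid_thw` and `vision_embeds` example shapes used in the later conversion cells.

```python
def reshape_patches_output_shape(
    input_shape,
    repetition_factor=1,
    temporal_patch_size=2,
    merge_size=2,
    patch_size=14,
):
    """Return ((grid_t, grid_h, grid_w), (rows, cols)) for a (N, C, H, W) input."""
    frames, channel, height, width = input_shape
    frames *= repetition_factor  # mirrors patches.repeat(repetition_factor, 1, 1, 1)

    if frames % temporal_patch_size:
        raise ValueError("frame count must be a multiple of temporal_patch_size")
    if height % (patch_size * merge_size) or width % (patch_size * merge_size):
        raise ValueError("height and width must be multiples of patch_size * merge_size")

    grid_t = frames // temporal_patch_size
    grid_h, grid_w = height // patch_size, width // patch_size
    rows = grid_t * grid_h * grid_w
    cols = channel * temporal_patch_size * patch_size * patch_size
    return (grid_t, grid_h, grid_w), (rows, cols)


# Same example input as the conversion cell: one 3-channel 1372x2044 image, repeated twice.
grid, flat_shape = reshape_patches_output_shape((1, 3, 1372, 2044), repetition_factor=2)
print("grid_thw:", grid)
print("flattened patches:", flat_shape)
print("positions after 2x2 merge:", flat_shape[0] // (2 * 2))
```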

In [9]:
from transformers.models.qwen2_vl.modeling_qwen2_vl import VisionRotaryEmbedding
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")


class RotaryEmbedding(nn.Module):

    def __init__(self, embed_dim, spatial_merge_size):
        super().__init__()
        self._rotary_pos_emb = VisionRotaryEmbedding(embed_dim)
        self.spatial_merge_size = spatial_merge_size
    
    def forward(self, grid_thw):
        t, h, w = grid_thw
        pos_ids = []
        # for t, h, w in grid_thw:

        hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
        hpos_ids = hpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        hpos_ids = hpos_ids.permute(0, 2, 1, 3)
        hpos_ids = hpos_ids.flatten()

        wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
        wpos_ids = wpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        wpos_ids = wpos_ids.permute(0, 2, 1, 3)
        wpos_ids = wpos_ids.flatten()
        pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
        pos_ids = torch.cat(pos_ids, dim=0)
        max_grid_size = grid_thw.max()
        rotary_pos_emb_full = self._rotary_pos_emb(max_grid_size)
        rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
        return rotary_pos_emb



vision_rotary_embedding = RotaryEmbedding(config.vision_config.embed_dim // config.vision_config.num_heads // 2, config.vision_config.spatial_merge_size)


In [10]:
import openvino as ov

vision_embedding_ov = ov.convert_model(
    vision_rotary_embedding,
    example_input={
        "grid_thw": torch.tensor([1, 98, 146]),
    }
)

# Save the OpenVINO model
ov.save_model(vision_embedding_ov, model_dir/"openvino_rotary_embeddings_model.xml")
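The `reshape` and `permute` calls in `RotaryEmbedding.forward` reorder the patch coordinates so that the patches of each `spatial_merge_size` x `spatial_merge_size` window are adjacent. A minimal sketch of that ordering with plain loops, assuming a single frame (`t = 1`) and a hypothetical helper name:

```python
def merged_position_ids(grid_h, grid_w, merge_size=2):
    """List (row, col) patch coordinates in merge-window order."""
    if grid_h % merge_size or grid_w % merge_size:
        raise ValueError("grid dimensions must be multiples of merge_size")
    positions = []
    for block_row in range(grid_h // merge_size):
        for block_col in range(grid_w // merge_size):
            # Visit every patch inside the current merge window before moving on.
            for inner_row in range(merge_size):
                for inner_col in range(merge_size):
                    positions.append(
                        (block_row * merge_size + inner_row,
                         block_col * merge_size + inner_col)
                    )
    return positions


# A tiny 2x4 grid: the left 2x2 window comes first, then the right one.
print(merged_position_ids(2, 4))
```

Each pair is then used to index the precomputed rotary table, once by row and once by column, which is what `rotary_pos_emb_full[pos_ids].flatten(1)` does in the module above.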



In [11]:
class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self,image_token_index=151655):
        super().__init__()
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        image_features = vision_embeds
        inputs_embeds = inputs_embeds
        special_image_mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        # image_features = image_features.to(inputs_embeds.dtype)
        final_embedding = inputs_embeds.masked_scatter(special_image_mask, image_features)

        return {
            "inputs_embeds": final_embedding
        }

torch_model_merge = MergeMultiModalInputs()

In [12]:
import openvino as ov

# convert MergeMultiModalInputs to OpenVINO IR
ov_model_merge = ov.convert_model(
    torch_model_merge,
    example_input={
        "vision_embeds": torch.randn((3577, 1536), dtype=torch.float32),
        "inputs_embeds": torch.randn((1, 3602, 1536), dtype=torch.float32),
        "input_ids": torch.randint(0, 151656, (1, 3602), dtype=torch.long),
    }
)
ov.save_model(ov_model_merge, model_dir/"openvino_multimodal_merge_model.xml")
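Conceptually, `MergeMultiModalInputs` replaces the embedding of every image placeholder token with the next row of vision features, in order. The sketch below shows the same idea on plain Python lists for a single sequence without a batch dimension; the helper name is hypothetical, and it assumes the number of placeholder tokens equals the number of vision rows.

```python
def merge_image_embeddings(input_ids, inputs_embeds, vision_embeds, image_token_index=151655):
    """Swap in vision embeddings wherever the input id is the image placeholder."""
    placeholders = sum(1 for token_id in input_ids if token_id == image_token_index)
    if placeholders != len(vision_embeds):
        raise ValueError("number of image tokens must match number of vision embeddings")

    vision_rows = iter(vision_embeds)
    merged = []
    for token_id, embed in zip(input_ids, inputs_embeds):
        if token_id == image_token_index:
            merged.append(next(vision_rows))  # consume vision rows in order
        else:
            merged.append(embed)  # keep the text embedding untouched
    return merged


# Four tokens with a hidden size of 2; the two middle tokens are image placeholders.
ids = [1, 151655, 151655, 2]
text = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
vision = [[9.0, 9.0], [8.0, 8.0]]
print(merge_image_embeddings(ids, text, vision))
```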

### 1.2 Load OpenVINO models

In [1]:
LANGUAGE_MODEL_NAME = "openvino_language_model.xml"
IMAGE_EMBEDDING_NAME = "openvino_vision_embeddings_model.xml"
IMAGE_EMBEDDING_MERGER_NAME = "openvino_vision_embeddings_merger_model.xml"
TEXT_EMBEDDING_NAME = "openvino_text_embeddings_model.xml"
ROTARY_EMBEDDING_NAME = "openvino_rotary_embeddings_model.xml"
PATCH_RESHAPE_NAME = "openvino_patch_reshape_model.xml"

In [14]:
import openvino as ov
import gc

core = ov.Core()
model_path = model_dir

language_model = core.read_model(model_path / LANGUAGE_MODEL_NAME)
compiled_language_model = core.compile_model(language_model, "CPU")
request = compiled_language_model.create_infer_request()

image_embedding = core.compile_model(model_path / IMAGE_EMBEDDING_NAME, "CPU")
image_embedding_merger = core.compile_model(model_path / IMAGE_EMBEDDING_MERGER_NAME, "CPU")
text_embedding = core.compile_model(model_path / TEXT_EMBEDDING_NAME, "CPU")
rotary_embedding = core.compile_model(model_path / ROTARY_EMBEDDING_NAME, "CPU")
patch_reshape = core.compile_model(model_path / PATCH_RESHAPE_NAME, "CPU")


In [15]:
# check if all the models are converted

print("⌛ Check if all models are converted")
language_model_path = model_dir / LANGUAGE_MODEL_NAME
image_embed_path = model_dir / IMAGE_EMBEDDING_NAME
image_merger_path = model_dir / IMAGE_EMBEDDING_MERGER_NAME
text_embed_path = model_dir / TEXT_EMBEDDING_NAME
rotary_embed_path = model_dir / ROTARY_EMBEDDING_NAME
patch_reshape_path = model_dir / PATCH_RESHAPE_NAME




if all(
    [
        language_model_path.exists(),
        image_embed_path.exists(),
        image_merger_path.exists(),
        text_embed_path.exists(),
        rotary_embed_path.exists(),
        patch_reshape_path.exists(),
    ]
):
    print(f"✅ All models are converted. You can find results in {model_dir}")
else:
    print("❌ Not all models are converted. Please check the conversion process")

⌛ Check if all models are converted
✅ All models are converted. You can find results in Qwen2-VL-2B-Instruct
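The check above only reports success or failure. If something is missing, it is handy to know which file it is. Here is a small sketch, with a hypothetical helper name, that lists the missing files and also looks for the `.bin` weights file that OpenVINO stores next to each `.xml`. We also include `openvino_multimodal_merge_model.xml`, which was saved earlier but is not part of the check above.

```python
from pathlib import Path

# File names taken from the constants and save calls earlier in this notebook.
EXPECTED_XML_FILES = [
    "openvino_language_model.xml",
    "openvino_vision_embeddings_model.xml",
    "openvino_vision_embeddings_merger_model.xml",
    "openvino_text_embeddings_model.xml",
    "openvino_rotary_embeddings_model.xml",
    "openvino_patch_reshape_model.xml",
    "openvino_multimodal_merge_model.xml",
]


def missing_ir_files(model_dir, xml_names):
    """Return the names of expected .xml/.bin files that are not in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for xml_name in xml_names:
        xml_path = model_dir / xml_name
        for path in (xml_path, xml_path.with_suffix(".bin")):
            if not path.exists():
                missing.append(path.name)
    return missing


# The directory name below is the one produced earlier in this notebook.
missing = missing_ir_files("Qwen2-VL-2B-Instruct", EXPECTED_XML_FILES)
print("missing files:", missing if missing else "none")
```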


### 1.3 Copy assets to the assets folder

In [16]:
assets_dir = model_dir / "assets"
assets_dir.mkdir(exist_ok=True)

# copy all the assets to the assets directory (json files, vocab files, etc.)

import shutil

# copy all json files

for file in model_dir.glob("*.json"):
    shutil.copy(file, assets_dir)

    


In [17]:
!ls -lh {model_dir}

total 1.7G
-rw-rw-r-- 1 prabod prabod  392 Feb 13 22:58 added_tokens.json
drwxrwxr-x 2 prabod prabod 4.0K Feb 13 23:03 assets
-rw-rw-r-- 1 prabod prabod 1.1K Feb 13 22:58 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.2K Feb 13 22:58 config.json
-rw-rw-r-- 1 prabod prabod 1.6M Feb 13 22:58 merges.txt
-rw-rw-r-- 1 prabod prabod 873M Feb 13 23:00 openvino_language_model.bin
-rw-rw-r-- 1 prabod prabod 3.4M Feb 13 23:00 openvino_language_model.xml
-rw-rw-r-- 1 prabod prabod   40 Feb 13 23:01 openvino_multimodal_merge_model.bin
-rw-rw-r-- 1 prabod prabod 9.8K Feb 13 23:01 openvino_multimodal_merge_model.xml
-rw-rw-r-- 1 prabod prabod  132 Feb 13 23:00 openvino_patch_reshape_model.bin
-rw-rw-r-- 1 prabod prabod  24K Feb 13 23:00 openvino_patch_reshape_model.xml
-rw-rw-r-- 1 prabod prabod  132 Feb 13 23:00 openvino_rotary_embeddings_model.bin
-rw-rw-r-- 1 prabod prabod  30K Feb 13 23:00 openvino_rotary_embeddings_model.xml
-rw-rw-r-- 1 prabod prabod 446M Feb 13 22:58 openvino_text_embeddings

In [18]:
!ls -lh {assets_dir}

total 14M
-rw-rw-r-- 1 prabod prabod  392 Feb 13 23:03 added_tokens.json
-rw-rw-r-- 1 prabod prabod 1.1K Feb 13 23:03 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.2K Feb 13 23:03 config.json
-rw-rw-r-- 1 prabod prabod  567 Feb 13 23:03 preprocessor_config.json
-rw-rw-r-- 1 prabod prabod  613 Feb 13 23:03 special_tokens_map.json
-rw-rw-r-- 1 prabod prabod 4.3K Feb 13 23:03 tokenizer_config.json
-rw-rw-r-- 1 prabod prabod  11M Feb 13 23:03 tokenizer.json
-rw-rw-r-- 1 prabod prabod 2.7M Feb 13 23:03 vocab.json


## 2. Import and Save Qwen2VL in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- This part is pretty easy with our simple script.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [1]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


24/11/07 09:56:55 WARN Utils: Your hostname, minotaur resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface eno1)
24/11/07 09:56:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/11/07 09:56:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
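from sparknlp.annotator import Qwen2VLTransformer

# model_path is the export directory from step 1 (here: Qwen2-VL-2B-Instruct)
imageClassifier = Qwen2VLTransformer.loadSavedModel(str(model_path), spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")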

25/02/14 00:53:12 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native16473116188009294604/libtbb.so.2




In [None]:
imageClassifier.write().overwrite().save("Qwen2VL_spark_nlp")

                                                                                

In [None]:
!ls -lah Qwen2VL_spark_nlp

total 1.7G
drwxr-xr-x  4 prabod prabod 4.0K Feb 14 00:53 .
drwxr-xr-x 12 prabod root   4.0K Feb 14 00:53 ..
drwxr-xr-x  6 prabod prabod 4.0K Feb 14 00:53 fields
drwxr-xr-x  2 prabod prabod 4.0K Feb 14 00:53 metadata
-rw-r--r--  1 prabod prabod 876M Feb 14 00:53 openvino_language_model.xml
-rw-r--r--  1 prabod prabod 6.9M Feb 14 00:53 .openvino_language_model.xml.crc
-rw-r--r--  1 prabod prabod  11K Feb 14 00:53 openvino_multimodal_merge_model.xml
-rw-r--r--  1 prabod prabod   92 Feb 14 00:53 .openvino_multimodal_merge_model.xml.crc
-rw-r--r--  1 prabod prabod  24K Feb 14 00:53 openvino_patch_reshape_model.xml
-rw-r--r--  1 prabod prabod  200 Feb 14 00:53 .openvino_patch_reshape_model.xml.crc
-rw-r--r--  1 prabod prabod  30K Feb 14 00:53 openvino_rotary_embeddings_model.xml
-rw-r--r--  1 prabod prabod  248 Feb 14 00:53 .openvino_rotary_embeddings_model.xml.crc
-rw-r--r--  1 prabod prabod 446M Feb 14 00:53 openvino_text_embeddings_model.xml
-rw-r--r--  1 prabod prabod 3.5M Feb 14 00:53 .

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = Qwen2VLTransformer.load("Qwen2VL_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)
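The prompt string in the cell above follows a fixed chat layout, and typing it by hand for every question is error-prone. A small helper like the one below can build it; the function name and its parameters are our own, while the special tokens are copied verbatim from the prompt used in this notebook.

```python
def build_qwen2vl_prompt(question, system_message="You are a helpful assistant."):
    """Build a single-image prompt in the layout used in this notebook."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        "<|im_start|>user\n"
        f"<|vision_start|><|image_pad|><|vision_end|>{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_qwen2vl_prompt("Describe this image.")
print(prompt)
```

The result can be passed to `lit(...)` when building `test_df`, or as the text argument of `fullAnnotateImage`.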

In [7]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)

for result in annotations_result:
    print(result["answer"])

image_path: /home/prabod/Projects/spark-nlp/examples/python/transformers/openvino/images/image1.jpg
[Annotation(document, 0, 245, The image shows a cat lying inside a cardboard box. The cat appears to be relaxed and comfortable, with its eyes closed, suggesting it is resting or sleeping. The box is placed on a light-colored carpet, and the background includes a portion of a, Map(), [])]
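The printed output above is the full `Annotation` object. If you only need the generated text, you can read each annotation's `result` field. The sketch below uses a stand-in `FakeAnnotation` tuple (hypothetical, only so the example runs without Spark) and a hypothetical helper name; with a real pipeline you would pass `annotations_result` directly.

```python
from collections import namedtuple

# Stand-in with the same field names as a Spark NLP Annotation, for illustration only.
FakeAnnotation = namedtuple(
    "FakeAnnotation",
    ["annotatorType", "begin", "end", "result", "metadata", "embeddings"],
)


def answers_from(annotations_result, output_col="answer"):
    """Collect the generated text of every annotation in the given output column."""
    return [
        annotation.result
        for row in annotations_result
        for annotation in row[output_col]
    ]


sample_result = [
    {"answer": [FakeAnnotation("document", 0, 4, "a cat", {}, [])]},
]
print(answers_from(sample_result))
```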
