![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Gemma3ForMultiModal.ipynb)

# Import OpenVINO Gemma3 models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and importing Gemma3 models from HuggingFace for use in Spark NLP, with [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4), to enhance the performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high performance CPU inference for models. So please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import Gemma3 models via `Gemma3`. These models are usually under the `Text Generation` category and have `Gemma3` in their labels.
- Reference: [Gemma3](https://huggingface.co/docs/transformers/model_doc/llama#transformers.Gemma3)
- Some [example models](https://huggingface.co/models?search=Gemma3)

## Table of Contents

1. [Setup and Installation](#setup-and-installation)
2. [Model Configuration](#model-configuration)
3. [Model Loading and Preparation](#model-loading-and-preparation)
4. [Model Conversion to OpenVINO](#model-conversion-to-openvino)
5. [Model Quantization](#model-quantization)
6. [Model Merger Implementation](#model-merger-implementation)
7. [Testing OpenVINO Model](#7-testing-openvino-model)


## 1. Setup and Installation

First, let's install all the required dependencies for this notebook.

In [None]:
# Install OpenVINO and NNCF for model optimization
import platform

%pip install -q "torch>=2.1" "torchvision" "Pillow" "gradio>=4.36" "opencv-python" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install  -q -U "openvino>=2025.0.0" "openvino-tokenizers>=2025.0.0" "nncf>=2.15.0"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3" --extra-index-url https://download.pytorch.org/whl/cpu

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0"

### Environment Configuration

Configure the environment to disable tokenizer parallelism for better compatibility.

In [1]:
import os

# Disable tokenizer parallelism to avoid potential issues
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## 2. Model Configuration

Set up the model ID and quantization parameters for the conversion process.

In [2]:
model_ids = [
    "google/gemma-3-4b-it",
    "google/gemma-3-12b-it",
    "google/gemma-3-12b-pt",
    "google/gemma-3-4b-pt",
]

## 3. Model Loading and Preparation

Load the model processor, configuration, and prepare the model for conversion to OpenVINO format.

In [None]:
import shutil
from pathlib import Path
import torch

for model_id in model_ids:
    output_dir = f"./models/int4/{model_id}"
    # check if the model is already optimized
    if not os.path.exists(
        f"{output_dir}/openvino_language_model.xml"
    ) and not os.path.exists(f"{output_dir}/openvino_language_model.bin"):
        !optimum-cli export openvino --model {model_id} --weight-format int4 {output_dir}
    else:
        print(f"Model {model_id} already optimized.")

## 4. Model Conversion to OpenVINO

Define paths for the converted model components and implement conversion utilities.

In [None]:
for model_id in model_ids:
    # change vision embed avg pool to opset1
    # this is a workaround for the issue with the Gemma3 model
    output_dir = f"./models/int4/{model_id}"
    with open(f"{output_dir}/openvino_vision_embeddings_model.xml", "r") as f:
        xml = f.read()
    xml = xml.replace("opset14", "opset1")
    with open(f"{output_dir}/openvino_vision_embeddings_model.xml", "w") as f:
        f.write(xml)

    if not os.path.exists(f"{output_dir}/assets"):
        output_dir = Path(output_dir)
        assets_dir = output_dir / "assets"
        assets_dir.mkdir(exist_ok=True)

        # copy all the assets to the assets directory (json files, vocab files, etc.)
        for file in output_dir.glob("*.json"):
            shutil.copy(file, assets_dir)

In [15]:
def cleanup_torchscript_cache():
    """
    Helper function for removing cached model representation to prevent memory leaks
    during model conversion.
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

## 5. Model Merger Implementation

Implement the model merger to combine text and image components.

In [20]:
import numpy as np
import torch
from transformers import AutoConfig
import openvino as ov
import gc

config = AutoConfig.from_pretrained(model_id)


class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self, image_token_index=config.image_token_index):
        """
        Merge multimodal inputs with the image token index.
        Args:
            image_token_index (int): The token index for the image token.
        """
        super().__init__()
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        image_features = vision_embeds
        inputs_embeds = inputs_embeds
        special_image_mask = (
            (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        )
        # image_features = image_features.to(inputs_embeds.dtype)
        final_embedding = inputs_embeds.masked_scatter(
            special_image_mask, image_features
        )

        return {"final_embedding": final_embedding}

In [21]:
for model_id in model_ids:
    print(f"Converting model {model_id} merger to OpenVINO format...")
    core = ov.Core()
    output_dir = f"./models/int4/{model_id}"
    model_merger_path = f"{output_dir}/openvino_merger_model.xml"
    config = AutoConfig.from_pretrained(model_id)
    multimodal_merger = MergeMultiModalInputs(config.image_token_index)
    with torch.no_grad():
        ov_model = ov.convert_model(
            multimodal_merger,
            example_input={
                "input_ids": torch.ones([2, 1198], dtype=torch.int64),
                "inputs_embeds": torch.ones(
                    [2, 1198, config.text_config.hidden_size], dtype=torch.float32
                ),
                "vision_embeds": torch.ones(
                    [2, config.mm_tokens_per_image, config.text_config.hidden_size],
                    dtype=torch.float32,
                ),
            },
        )
        ov.save_model(ov_model, model_merger_path)
        del ov_model
        cleanup_torchscript_cache()
        gc.collect()

Converting model google/gemma-3-4b-it merger to OpenVINO format...
Converting model google/gemma-3-12b-it merger to OpenVINO format...
Converting model google/gemma-3-12b-pt merger to OpenVINO format...
Converting model google/gemma-3-4b-pt merger to OpenVINO format...


# 7. Testing OpenVINO Model




In [33]:
core = ov.Core()
device = "CPU"

# lets pick the first model
model_id = model_ids[0]
output_dir = f"./models/int4/{model_id}"
output_dir = Path(output_dir)


In [30]:
# paths for the exported models
image_embed_path = output_dir / "openvino_vision_embeddings_model.xml"
language_model_path = output_dir / "openvino_language_model.xml"
text_embeddings_path = output_dir / "openvino_text_embeddings_model.xml"
model_merger_path = output_dir / "openvino_merger_model.xml"

In [31]:
# compile the models
language_model = core.read_model(language_model_path)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_embed_model = core.compile_model(image_embed_path, device)
text_embeddings_model = core.compile_model(text_embeddings_path, device)
multimodal_merger_model = core.compile_model(model_merger_path, device)

In [32]:
from transformers.image_utils import load_image
from transformers import AutoProcessor, TextStreamer
import numpy as np


DEVICE = "cpu"
# Load images
image1 = load_image(
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
)
image2 = load_image(
    "https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg"
)

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs_new = processor(text=prompt, images=[image1], return_tensors="pt")

request = compiled_language_model.create_infer_request()
merge_model_request = multimodal_merger_model.create_infer_request()
# Set the input names
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]
token_type_ids = inputs_new["token_type_ids"]

generation_args = {
    "max_new_tokens": 200,
    "do_sample": False,
    "streamer": TextStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    ),
}
generated_tokens = []

for i in range(generation_args["max_new_tokens"]):
    # Generate input embeds each time
    text_embeds = torch.from_numpy(text_embeddings_model(current_input_ids)[0])
    if current_input_ids.shape[-1] > 1:
        vision_embeds = torch.from_numpy(
            image_embed_model(
                {
                    "pixel_values": pixel_values,
                }
            )[0]
        )
        merge_model_request.start_async(
            {
                "vision_embeds": vision_embeds,
                "inputs_embeds": text_embeds,
                "input_ids": current_input_ids,
            },
            share_inputs=True,
        )
        merge_model_request.wait()
        final_embedding = torch.from_numpy(
            merge_model_request.get_tensor("final_embedding").data
        )
    else:
        final_embedding = text_embeds

    if i > 0:
        inputs = {}
    # Prepare inputs for the model
    inputs["inputs_embeds"] = final_embedding
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    inputs["token_type_ids"] = token_type_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(attention_mask.shape[0], dtype=int)

    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()

    # Get the logits and find the next token
    logits = torch.from_numpy(request.get_tensor("logits").data)
    next_token = logits.argmax(-1)[0][-1]

    # Append the generated token
    generated_tokens.append(next_token)

    # Update input_ids with the new token
    current_input_ids = torch.cat([next_token.unsqueeze(0).unsqueeze(0)], dim=-1)

    # update the attention mask
    attention_mask = torch.cat(
        [attention_mask, torch.ones_like(attention_mask[:, :1])], dim=-1
    )

    # Update inputs for the next iteration
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids[:, -current_input_ids.shape[1] :]
    inputs["position_ids"] = position_ids
    token_type_ids = torch.zeros_like(current_input_ids)

generated_text = processor.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

Here's a detailed description of the image:

**Overall Impression:**

The image presents a wide, scenic view of the Statue of Liberty with the New York City skyline in the background. The scene is bathed in warm, golden sunlight, suggesting either early morning or late afternoon.

**Foreground:**

*   **Statue of Liberty:** The iconic statue dominates the foreground. It stands proudly on a small, rocky island (Liberty Island) with a patch of green grass and a few trees surrounding its base. The statue itself is a vibrant green color, likely due to the oxidation of the copper it's made from. The details of the statue's robes, torch, and crown are clearly visible.
*   **Water:** The water surrounding the island is a deep blue, with gentle ripples reflecting the sunlight.

**Background:**

*   **New York City Skyline:** A dense and impressive skyline of New York City stretches across the background. Numerous skyscrapers of varying heights and


# 8. Import and Save in Spark NLP
- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [23]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [24]:
# import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


In [36]:
imageClassifier = (
    Gemma3ForMultiModal.loadSavedModel(str(output_dir), spark)
    .setInputCols("image_assembler")
    .setOutputCol("answer")
)

25/04/28 03:36:26 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native10904419691103163033/libtbb.so.2




In [37]:
imageClassifier.write().overwrite().save(f"file:///tmp/{model_id}_spark_nlp")

                                                                                

In [39]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}


images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(path=images_path)

test_df = image_df.withColumn(
    "text",
    lit(
        "<bos><start_of_turn>user\nYou are a helpful assistant.\n\n<start_of_image>Describe this image in detail.<end_of_turn>\n<start_of_turn>\n"
    ),
)

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = (
    Gemma3ForMultiModal.load(f"file:///tmp/{model_id}_spark_nlp")
    .setMaxOutputLength(50)
    .setInputCols("image_assembler")
    .setOutputCol("answer")
)

pipeline = Pipeline(
    stages=[
        image_assembler,
        imageClassifier,
    ]
)

model = pipeline.fit(test_df)

In [40]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<bos><start_of_turn>user\nYou are a helpful assistant.\n\n<start_of_image>Describe this image in detail.<end_of_turn>\n<start_of_turn>",
)

for result in annotations_result:
    print(result["answer"])

image_path: /mnt/research/Projects/ModelZoo/Gemma3/images/image1.jpg
[Annotation(document, 0, 222, Okay, here's a detailed description of the image:**Overall Impression:**The image is a cozy and charming shot of a gray tabby cat completely relaxed and content inside a cardboard box. It’s a very peaceful and playful scene, Map(), [])]
