![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_InternVLForMultiModal.ipynb)

# Import OpenVINO InternVL models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing InternVL models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). It converts the model to the OpenVINO format with [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference) and applies INT4 weight quantization to improve performance and efficiency on CPU platforms.

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import InternVL models via `InternVLForMultiModal`. These models are usually listed under the `Image-Text-to-Text` category and have `InternVL` in their names.
- Reference: [InternVL](https://huggingface.co/docs/transformers/model_doc/internvl)
- Some [example models](https://huggingface.co/models?search=InternVL)
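
To give a feel for why lower-precision weights matter, here is a rough back-of-the-envelope sketch. The helper `weight_size_gib` is hypothetical (not part of any library), and the estimate covers raw weights only: it ignores quantization scales, zero points, activations, and the extra memory the export process itself needs.

```python
def weight_size_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in GiB.

    Assumption: every parameter is stored at the same precision and no
    quantization metadata (scales, zero points) is counted.
    """
    return num_params * bits_per_weight / 8 / 1024**3


# Compare common precisions for a hypothetical 1B-parameter model
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_size_gib(1_000_000_000, bits)
    print(f"{label}: ~{size:.2f} GiB of weights for 1B parameters")
```

Real exported files will be somewhat larger, since some layers are usually kept at higher precision.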

## Table of Contents

1. [Setup and Installation](#1-setup-and-installation)
2. [Model Configuration](#2-model-configuration)
3. [Model Loading and Preparation](#3-model-loading-and-preparation)
4. [Model Conversion to OpenVINO](#4-model-conversion-to-openvino)
5. [Model Merger Implementation](#5-model-merger-implementation)
7. [Testing OpenVINO Model](#7-testing-openvino-model)
8. [Import and Save in Spark NLP](#8-import-and-save-in-spark-nlp)


## 1. Setup and Installation

First, let's install all the required dependencies for this notebook.

In [1]:
# Install OpenVINO and NNCF for model optimization
import platform

%pip install -q "transformers>4.36" "torch>=2.1" "torchvision" "einops" "timm" "Pillow" "gradio>=4.36"  --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "nncf>=2.14.0" "datasets"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q -U --pre "openvino>=2025.0" "openvino-tokenizers>=2025.0" "openvino-genai>=2025.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"

Note: you may need to restart the kernel to use updated packages.

### Environment Configuration

Configure the environment to disable tokenizer parallelism for better compatibility.

In [1]:
import os
# Disable tokenizer parallelism to avoid potential issues
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
from pathlib import Path
import types
from typing import Optional, List
import gc
import openvino as ov
from openvino.runtime import opset13
import nncf
import numpy as np
import torch
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable
import torch.nn as nn



## 2. Model Configuration

List the model IDs to convert. Only the first entry is active; uncomment other entries to convert additional models.

In [1]:
model_ids = [
    "OpenGVLab/InternVL3-1B",
    # "OpenGVLab/InternVL3-2B",
    # "OpenGVLab/InternVL3-8B",
    # "OpenGVLab/InternVL3-9B",
    # "OpenGVLab/InternVL3-14B",
    # "OpenGVLab/InternVL2_5-1B",
    # "OpenGVLab/InternVL2_5-2B",
    # "OpenGVLab/InternVL2_5-4B",
    # "OpenGVLab/InternVL2_5-8B",
    # "OpenGVLab/InternVL2-1B",
    # "OpenGVLab/InternVL2-2B",
    # "OpenGVLab/InternVL2-4B",
    # "OpenGVLab/InternVL2-8B",
]

## 3. Model Loading and Preparation

Export each model to the OpenVINO format with `optimum-cli`, compressing the weights to INT4 with AWQ. Models that have already been exported are skipped.

In [None]:
import os
import shutil
from pathlib import Path

for model_id in model_ids:
    output_dir = f"./models/int4/{model_id}"
    # Export unless both language model files (.xml and .bin) are already present
    xml_path = f"{output_dir}/openvino_language_model.xml"
    bin_path = f"{output_dir}/openvino_language_model.bin"
    if not (os.path.exists(xml_path) and os.path.exists(bin_path)):
        !optimum-cli export openvino --model {model_id} --weight-format int4 {output_dir} --trust-remote-code --dataset contextual --awq --num-samples 32
    else:
        print(f"Model {model_id} already optimized.")
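
After the export finishes, it can help to confirm that the files used later in this notebook are present. The sketch below is a hypothetical helper (`missing_export_files` is not part of Optimum or OpenVINO); the file names are the ones this notebook reads, and other export versions may produce a different layout.

```python
from pathlib import Path

# File names taken from the cells of this notebook; treat them as assumptions
# for other optimum-intel versions.
EXPECTED_FILES = (
    "openvino_language_model.xml",
    "openvino_language_model.bin",
    "openvino_text_embeddings_model.xml",
    "openvino_vision_embeddings_model.xml",
)


def missing_export_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that do not exist in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).exists()]
```

For example, `missing_export_files("./models/int4/OpenGVLab/InternVL3-1B")` returns an empty list when all four files exist.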

## 4. Model Conversion to OpenVINO

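The export itself was done in the previous step. Here we collect the exported JSON files (configuration, tokenizer, vocabulary) into an `assets` subdirectory of each model directory and define a cleanup helper used during conversion.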

In [10]:
for model_id in model_ids:
    # Collect the exported JSON files in an assets/ subdirectory of the model directory
    output_dir = Path(f"./models/int4/{model_id}")
    assets_dir = output_dir / "assets"
    if output_dir.exists() and not assets_dir.exists():
        assets_dir.mkdir(exist_ok=True)
        print(f"Creating assets directory at {assets_dir}")

        # copy all the JSON assets (config files, vocab files, etc.) to the assets directory
        for file in output_dir.glob("*.json"):
            shutil.copy(file, assets_dir)

Creating assets directory at models/int4/OpenGVLab/InternVL3-1B/assets
Creating assets directory at models/int4/OpenGVLab/InternVL3-2B/assets
Creating assets directory at models/int4/OpenGVLab/InternVL3-8B/assets
Creating assets directory at models/int4/OpenGVLab/InternVL3-14B/assets
Creating assets directory at models/int4/OpenGVLab/InternVL2_5-1B/assets
Creating assets directory at models/int4/OpenGVLab/InternVL2_5-4B/assets


In [11]:
def cleanup_torchscript_cache():
    """
    Helper function for removing cached model representation to prevent memory leaks
    during model conversion.
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

## 5. Model Merger Implementation

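Implement a small merger module that replaces the image placeholder positions in the text embeddings with the vision embeddings, then convert it to the OpenVINO format.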

In [15]:
import numpy as np
import torch
from transformers import AutoConfig, AutoProcessor
import openvino as ov
import gc

class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self,image_token_index=151648):
        """
        Merge multimodal inputs with the image token index.
        Args:
            image_token_index (int): The token index for the image token.
        """
        super().__init__()
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        # Positions holding the image placeholder token, broadcast over the hidden dimension
        special_image_mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        # Fill those positions, in order, with the vision embeddings
        final_embedding = inputs_embeds.masked_scatter(special_image_mask, vision_embeds)

        return {
            "final_embedding": final_embedding
        }
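
To make the effect of `masked_scatter` in `MergeMultiModalInputs` concrete, here is a minimal pure-Python sketch of the same idea on plain lists. The function name `merge_image_features` and the toy values are hypothetical; it only illustrates the ordering semantics and is not the code that gets converted.

```python
def merge_image_features(input_ids, text_embeds, image_features, image_token_id):
    """Replace the embedding row at each image-placeholder position with the
    next image feature row, in order of appearance."""
    num_slots = sum(1 for token_id in input_ids if token_id == image_token_id)
    if num_slots != len(image_features):
        raise ValueError(f"{num_slots} placeholders but {len(image_features)} image rows")
    features = iter(image_features)
    return [
        next(features) if token_id == image_token_id else row
        for token_id, row in zip(input_ids, text_embeds)
    ]


# Toy example: token id 9 plays the role of the image placeholder
ids = [5, 9, 9, 7]
text = [[0.0, 0.0] for _ in ids]
image = [[1.0, 1.0], [2.0, 2.0]]
merged = merge_image_features(ids, text, image, image_token_id=9)
```

Here `merged` keeps the text rows at positions 0 and 3 and takes the two image rows at positions 1 and 2.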

In [18]:
for model_id in model_ids:
    if os.path.exists(f"./models/int4/{model_id}/openvino_language_model.xml"):
        print(f"Converting model {model_id} merger to OpenVINO format...")
        core = ov.Core()
        output_dir = f"./models/int4/{model_id}"
        model_merger_path = f"{output_dir}/openvino_merger_model.xml"
        config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
        image_size = config.force_image_size or config.vision_config.image_size
        patch_size = config.vision_config.patch_size
        # number of <IMG_CONTEXT> placeholder tokens produced per image tile
        num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
        processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
        IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'
        img_context_token_id = processor.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)

        multimodal_merger = MergeMultiModalInputs(img_context_token_id)
        with torch.no_grad():
            ov_model = ov.convert_model(
                multimodal_merger,
                example_input= {
                    "input_ids": torch.ones([2, 1198], dtype=torch.int64),
                    "inputs_embeds": torch.ones([2, 1198, config.llm_config.hidden_size], dtype=torch.float32),
                    "vision_embeds": torch.ones([2, num_image_token, config.llm_config.hidden_size], dtype=torch.float32),
                }
            )
            ov.save_model(ov_model, model_merger_path)
            del ov_model
            cleanup_torchscript_cache()
            gc.collect()

Converting model OpenGVLab/InternVL3-1B merger to OpenVINO format...
Converting model OpenGVLab/InternVL3-2B merger to OpenVINO format...
Converting model OpenGVLab/InternVL3-8B merger to OpenVINO format...
Converting model OpenGVLab/InternVL3-14B merger to OpenVINO format...
Converting model OpenGVLab/InternVL2_5-1B merger to OpenVINO format...
Converting model OpenGVLab/InternVL2_5-4B merger to OpenVINO format...
Converting model OpenGVLab/InternVL2-1B merger to OpenVINO format...
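
The `num_image_token` formula used in the conversion cell can be checked with a small worked example. The values below (448-pixel tiles, 14-pixel patches, downsample ratio 0.5) are assumptions chosen for illustration; read the actual values from the model's config as the cell above does.

```python
def num_image_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    """Tokens per image tile: patches per side, squared, scaled by the
    squared downsample ratio (same expression as in the conversion cell)."""
    patches_per_side = image_size // patch_size
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


# Assumed example values, not read from a real config
tokens_per_tile = num_image_tokens(image_size=448, patch_size=14, downsample_ratio=0.5)
print(tokens_per_tile)  # (448 // 14) ** 2 * 0.25 = 1024 * 0.25 = 256
```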


## 7. Testing OpenVINO Model

In [19]:
core = ov.Core()
device = "CPU"

# let's pick the first model
model_id = model_ids[0]
output_dir = Path(f"./models/int4/{model_id}")


In [20]:
# paths for the exported models
image_embed_path = output_dir / "openvino_vision_embeddings_model.xml"
language_model_path = output_dir / "openvino_language_model.xml"
text_embeddings_path = output_dir / "openvino_text_embeddings_model.xml"
model_merger_path = output_dir / "openvino_merger_model.xml"

In [22]:
# compile the models
language_model = core.read_model(language_model_path)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_embed_model = core.compile_model(image_embed_path, device)
text_embeddings_model = core.compile_model(text_embeddings_path, device)
multimodal_merger_model = core.compile_model(model_merger_path, device)

In [24]:

import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids (columns, rows) allowed by min_num and max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    print(f"Image size: {image.size}")
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
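
The tiling logic above can be explored without loading any image. The following sketch re-implements only the grid selection from `find_closest_aspect_ratio` and `dynamic_preprocess` in plain Python; `pick_grid` and `tile_count` are hypothetical names introduced here for illustration.

```python
def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Return the (columns, rows) grid whose aspect ratio is closest to the image's."""
    grids = sorted(
        {(c, r) for c in range(1, max_num + 1) for r in range(1, max_num + 1)
         if min_num <= c * r <= max_num},
        key=lambda g: g[0] * g[1],
    )
    aspect, area = width / height, width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On an exact tie, prefer the larger grid only for large enough images
            best = (cols, rows)
    return best


def tile_count(width, height, use_thumbnail=True, **kwargs):
    """Number of tiles produced, including the optional thumbnail tile."""
    cols, rows = pick_grid(width, height, **kwargs)
    blocks = cols * rows
    return blocks + (1 if use_thumbnail and blocks != 1 else 0)


grid = pick_grid(1600, 1067)    # same size as the sample image used below
tiles = tile_count(1600, 1067)
```

For a 1600x1067 image the closest grid is 3 columns by 2 rows, so `load_image` yields 6 tiles plus one thumbnail.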


In [25]:
!mkdir -p images
!wget -O images/image1.jpg https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
!wget -O images/image2.jpg https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg

In [27]:
from transformers import AutoConfig, AutoProcessor
import numpy as np


DEVICE = "cpu"
IMG_START_TOKEN='<img>'
IMG_END_TOKEN='</img>'
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

question = "<|im_start|><image>\nDescribe this image in detail. reply in dot points <|im_end|><|im_start|>assistant\n"

pixel_values = load_image("images/image1.jpg", max_num=12)
num_patches = pixel_values.shape[0]
image_size = config.force_image_size or config.vision_config.image_size
patch_size = config.vision_config.patch_size
# number of <IMG_CONTEXT> placeholder tokens produced per image tile
num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
img_context_token_id = processor.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * num_image_token * num_patches + IMG_END_TOKEN
query = question.replace('<image>', image_tokens, 1)

inputs_new = processor(query,return_tensors="pt")
inputs_new["pixel_values"] = pixel_values

request = compiled_language_model.create_infer_request()
merge_model_request = multimodal_merger_model.create_infer_request()
# Set the input names
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]
generated_tokens = []

for i in range(200):
    # Generate input embeds each time
    text_embeds = torch.from_numpy(
            text_embeddings_model(current_input_ids
            )[0]
        )
    if current_input_ids.shape[-1] > 1:
        vision_embeds = torch.from_numpy(
            image_embed_model(
                {
                    "pixel_values": pixel_values,
                }
            )[0]
        )
        vision_embeds = vision_embeds.reshape(1, -1, config.llm_config.hidden_size)
        merge_model_request.start_async({
            "vision_embeds": vision_embeds,
            "inputs_embeds": text_embeds,
            "input_ids": current_input_ids,
        }, share_inputs=True)
        merge_model_request.wait()
        final_embedding = torch.from_numpy(merge_model_request.get_tensor("final_embedding").data)
    else:
        final_embedding = text_embeds

    
    if i>0:
        inputs = {}
    # Prepare inputs for the model
    inputs["inputs_embeds"] = final_embedding
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    # inputs["token_type_ids"] = token_type_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(attention_mask.shape[0], dtype=int)
    
    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()
    
    # Get the logits and find the next token
    logits = torch.from_numpy(request.get_tensor("logits").data)
    next_token = logits.argmax(-1)[0][-1]
    
    # Append the generated token
    generated_tokens.append(next_token)
    
    # The next step feeds only the newly generated token
    current_input_ids = next_token.reshape(1, 1)
    
    # Extend the attention mask by one position
    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, :1])], dim=-1)

    # Recompute position_ids and keep only the entry for the new token
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids[:, -current_input_ids.shape[1] :]

generated_text = processor.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

Image size: (1600, 1067)
This image showcases the iconic Statue of Liberty on Liberty Island, a symbol of freedom and welcome to immigrants around the world. The statue is prominently featured in the foreground, standing on a stone pedestal surrounded by water. The background reveals a bustling city skyline with numerous tall buildings, including the Empire State Building, suggesting a major metropolitan area. The sky is clear, and the lighting suggests it might be early morning or late afternoon, casting a warm glow over the scene. The water is calm, and there are a few small boats visible, adding to the serene atmosphere. The island is bordered by a grassy area with trees, and the overall setting is picturesque and inviting.
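
The loop above always runs for 200 steps. A common variation stops as soon as an end-of-sequence token is produced. The sketch below shows that pattern on a toy setup in plain Python: `next_token_fn` and `eos_id` are hypothetical stand-ins for the OpenVINO inference call and the tokenizer's end-of-sequence id, so this illustrates the control flow only, not the real model.

```python
from typing import Callable, List


def position_ids_from_mask(attention_mask: List[int]) -> List[int]:
    """cumsum(mask) - 1, with masked-out positions set to 1 (as in the loop above)."""
    ids, running = [], 0
    for m in attention_mask:
        running += m
        ids.append(running - 1 if m else 1)
    return ids


def greedy_decode(next_token_fn: Callable[[List[int], List[int]], int],
                  prompt_ids: List[int], eos_id: int,
                  max_new_tokens: int = 200) -> List[int]:
    """Greedy generation that stops early at eos_id."""
    generated: List[int] = []
    attention_mask = [1] * len(prompt_ids)
    step_input = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Only the positions of the tokens fed in this step are needed
        position_ids = position_ids_from_mask(attention_mask)[-len(step_input):]
        token = next_token_fn(step_input, position_ids)
        if token == eos_id:
            break
        generated.append(token)
        step_input = [token]       # later steps feed just the new token
        attention_mask.append(1)   # the mask grows by one per generated token
    return generated


# Toy "model": always predicts the last input token plus one
tokens = greedy_decode(lambda ids, pos: ids[-1] + 1, prompt_ids=[1, 2, 3], eos_id=7)
```

With this toy function the result is `[4, 5, 6]`, because generation stops when token 7 would be emitted.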


## 8. Import and Save in Spark NLP
- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [23]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [24]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


In [3]:
from pathlib import Path
model_id = model_ids[0]
output_dir = f"./models/int4/{model_id}"
output_dir = Path(output_dir)

In [4]:
from sparknlp.annotator import InternVLForMultiModal

imageClassifier = InternVLForMultiModal \
            .loadSavedModel(str(output_dir),spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

25/05/12 07:55:26 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native17051517669691827340/libtbb.so.2




In [5]:
imageClassifier.write().overwrite().save(f"file:///tmp/{model_id}_spark_nlp")

                                                                                

In [6]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|im_start|><image>\nDescribe this image in detail.<|im_end|><|im_start|>assistant\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = InternVLForMultiModal.load(f"file:///tmp/{model_id}_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)

In [7]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<|im_start|><image>\nDescribe this image in detail.<|im_end|><|im_start|>assistant\n",
)

for result in annotations_result:
    print(result["answer"])

image_path: /mnt/research/Projects/ModelZoo/internVL/images/image1.jpg
[Annotation(document, 0, 227, The image features a gray tabby cat with fluffy fur, lying on its back inside an open cardboard box. The cat appears to be relaxed and content, with its eyes closed and ears perked up. The box is placed on a light-colored carpet, Map(), [])]
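
The same chat-style prompt string appears in both the DataFrame and the `LightPipeline` call. A small helper keeps the two in sync; `build_prompt` is a hypothetical convenience function, and the template is simply the one used in this notebook.

```python
def build_prompt(instruction: str, image_placeholder: str = "<image>") -> str:
    """Wrap an instruction in the chat template used in this notebook."""
    return (
        f"<|im_start|>{image_placeholder}\n"
        f"{instruction}<|im_end|><|im_start|>assistant\n"
    )


prompt = build_prompt("Describe this image in detail.")
```

The resulting `prompt` can be passed to `lit(...)` or to `fullAnnotateImage` in place of the literal string.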


In [None]:
# Optional: upload the zipped model to S3 (assumes the archive already exists and AWS credentials are configured)
for model_id in model_ids:
    ZIP_NAME = f"{model_id.split('/')[-1].replace(' ','_').replace('-','_').lower()}_int4_sn"
    !aws s3 cp /tmp/{model_id}_spark_nlp/{ZIP_NAME}.zip s3://dev.johnsnowlabs.com/prabod/models/{ZIP_NAME}.zip --acl public-read
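
The upload command expects a zip archive inside the saved model directory, but no cell in this notebook creates it. One possible way to build it with the standard library is sketched below. `archive_name` and `zip_saved_model` are hypothetical helpers; the naming rule mirrors the `ZIP_NAME` expression above, and the archive is built outside the directory first so that it does not end up containing itself.

```python
import shutil
from pathlib import Path


def archive_name(model_id: str, suffix: str = "int4_sn") -> str:
    """Same naming rule as ZIP_NAME: last path part, '-' and ' ' to '_', lowercased."""
    base = model_id.split("/")[-1].replace(" ", "_").replace("-", "_").lower()
    return f"{base}_{suffix}"


def zip_saved_model(saved_dir, model_id: str) -> Path:
    """Zip the contents of saved_dir and place the archive inside saved_dir."""
    saved_dir = Path(saved_dir)
    name = archive_name(model_id)
    # Build the archive next to the directory, then move it inside
    tmp_zip = shutil.make_archive(str(saved_dir.parent / name), "zip", root_dir=str(saved_dir))
    target = saved_dir / f"{name}.zip"
    shutil.move(tmp_zip, str(target))
    return target
```

For example, `zip_saved_model(f"/tmp/{model_id}_spark_nlp", model_id)` would create the file that the `aws s3 cp` command looks for.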