![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_E5V.ipynb)

# Import OpenVINO E5V models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and importing E5V models from HuggingFace  for use in Spark NLP, with [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4), to enhance the performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in  `Spark NLP 5.4.0`, enabling high performance CPU inference for models. So please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import E5V models via `E5V`. These models are usually under `Text Generation` category and have `E5V` in their labels.
- Reference: [E5V](https://huggingface.co/docs/transformers/model_doc/llama#transformers.E5V)
- Some [example models](https://huggingface.co/models?search=E5V)

## 1. Export and Save the HuggingFace model

- Let's install `transformers` and `openvino` packages with other dependencies. You don't need `openvino` to be installed for Spark NLP, however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.41.2`. This doesn't mean it won't work with the future release, but we wanted you to know which versions have been tested successfully.

In [1]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# # Install OpenVINO and NNCF for model optimization
import platform

%pip install -q "einops" "torch>2.1" "torchvision" "matplotlib>=3.4" "timm>=0.9.8" "transformers==4.41.2" "pillow" "gradio>=4.19" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q -U --pre "openvino>=2025.0" "openvino-tokenizers>=2025.0" "openvino-genai>=2025.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q "accelerate" "nncf>=2.14.0" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"

In [None]:
!wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/American_Eskimo_Dog.jpg/360px-American_Eskimo_Dog.jpg -O dog.jpg

In [2]:
model_id = "royokong/e5-v"
output_dir = f"./models/int4/{model_id}"

In [3]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
import gc

processor = LlavaNextProcessor.from_pretrained(model_id)
image_encoder_model, input_embedding_model, language_model = None, None, None


class ImageEncoder(torch.nn.Module):
    def __init__(self, config, vision_tower, multi_modal_projector):
        super().__init__()
        self.config = config
        self.vision_tower = vision_tower
        self.multi_modal_projector = multi_modal_projector

    def forward(self, pixel_values):
        batch_size, num_patches, num_channels, height, width = pixel_values.shape
        reshaped_pixel_values = pixel_values.view(
            batch_size * num_patches, num_channels, height, width
        )
        image_features = self.vision_tower(
            reshaped_pixel_values, output_hidden_states=True
        )
        selected_image_feature = image_features.hidden_states[
            self.config.vision_feature_layer
        ]
        if self.config.vision_feature_select_strategy == "default":
            selected_image_feature = selected_image_feature[:, 1:]
        elif self.config.vision_feature_select_strategy == "full":
            selected_image_feature = selected_image_feature
        image_features = self.multi_modal_projector(selected_image_feature)
        return image_features


model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, low_cpu_mem_usage=True
)
model.config.save_pretrained(output_dir)
image_encoder_model = ImageEncoder(
    model.config, model.vision_tower, model.multi_modal_projector
)
input_embedding_model = input_embedding_model = model.get_input_embeddings()
language_model = model.language_model
del model
gc.collect()

  from .autonotebook import tqdm as notebook_tqdm
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [01:20<00:00, 20.18s/it]


111

In [3]:
import openvino as ov
from pathlib import Path

core = ov.Core()
device = "CPU"
# Load the model and convert it to OpenVINO format
output_dir = f"./models/int4/{model_id}"
output_dir = Path(output_dir)


In [5]:
IMAGE_ENCODER_PATH = output_dir / "openvino_vision_embeddings_model.xml"
LANGUAGE_MODEL_PATH = output_dir / "openvino_language_model.xml"
INPUT_EMBEDDING_PATH = output_dir / "openvino_text_embeddings_model.xml"

IMAGE_PACKER_PATH = output_dir / "openvino_image_packer.xml"
MULTIMODAL_MERGER_PATH = output_dir / "openvino_multimodal_merger.xml"

In [6]:
import torch
import openvino as ov
import gc


def cleanup_torchscript_cache():
    """
    Helper for removing cached model representation
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()


if not IMAGE_ENCODER_PATH.exists():
    ov_image_encoder = ov.convert_model(
        image_encoder_model, example_input=torch.zeros((1, 5, 3, 336, 336))
    )
    ov.save_model(ov_image_encoder, IMAGE_ENCODER_PATH)
    del ov_image_encoder
    cleanup_torchscript_cache()

del image_encoder_model
gc.collect()

  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):


7397

In [7]:
llm_input = None

llm_input = input_embedding_model(torch.ones((2, 2), dtype=torch.int64))

if not INPUT_EMBEDDING_PATH.exists():
    ov_input_embeddings_model = ov.convert_model(
        input_embedding_model, example_input=torch.ones((2, 2), dtype=torch.int64)
    )
    ov.save_model(ov_input_embeddings_model, INPUT_EMBEDDING_PATH)
    del ov_input_embeddings_model
    cleanup_torchscript_cache()

del input_embedding_model
gc.collect()

117

In [8]:
from typing import Optional, Tuple, List
from openvino.runtime import opset13
import numpy as np


def model_has_state(ov_model: ov.Model):
    return len(ov_model.get_sinks()) > 0


def model_has_input_output_name(ov_model: ov.Model, name: str):
    """
    Helper function for checking that model has specified input or output name

    Parameters:
      ov_model (ov.Model):
      name (str):
          name of input or output

    Returns:
      True if input or output with requested name exists else False
    """
    return name in sum(
        [list(t.get_names()) for t in ov_model.inputs + ov_model.outputs], []
    )


def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    gather_dim: int,
):
    """
    Fuses reored_cache during generate cycle into ov.Model. Used with stateful models, because we can not modify model state directly.

    Adds a new beam_idx parameter and Gather op per each kv-cache input in a given model.
    Should be run before make_stateful. Implements optimumum's _reorder_cache
    inside the model in the beginning of each iteration.
    Gather works along given gather_dim dimension that may vary from model to model.
    KV-cache inputs are identified based on names in key_value_input_names.
    Append the new beam_idx parameter to not_kv_inputs.

    Parameters:
      ov_model (`ov.Model`):
          openvino model for processing
      not_kv_inputs (`List[str]`):
          list of input nodes in model that not related to past key values
      key_value_input_names (`List[str]`):
          list of names for key value input layers
      gather_dim (int):
          dimension for gathering cache during reorder pass
    """

    if model_has_input_output_name(ov_model, "beam_idx"):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input("inputs_embeds").get_partial_shape()[0]
    beam_idx = opset13.parameter(
        name="beam_idx", dtype=ov.Type.i32, shape=ov.PartialShape([input_batch])
    )
    beam_idx.output(0).get_tensor().add_names({"beam_idx"})  # why list is not accepted?
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])
    # Go over all cache parameters and fuse _reorder_cache with indices provided by the new parameter beam_idx
    for input_name in key_value_input_names:
        parameter_output_port = ov_model.input(input_name)
        consumers = parameter_output_port.get_target_inputs()
        gather = opset13.gather(
            parameter_output_port, beam_idx, opset13.constant(gather_dim)
        )
        for consumer in consumers:
            consumer.replace_source_output(gather.output(0))
    ov_model.validate_nodes_and_infer_types()


def build_state_initializer(ov_model: ov.Model, batch_dim: int):
    """
    Build initialization ShapeOf Expression for all ReadValue ops

    Parameters:
      ov_model (ov.Model):
          openvino model
      batch_dim (int):
          index of dimension corresponding to batch size
    """
    input_ids = ov_model.input("inputs_embeds")
    batch = opset13.gather(
        opset13.shape_of(input_ids, output_type="i64"),
        opset13.constant([0]),
        opset13.constant(0),
    )
    for op in ov_model.get_ops():
        if op.get_type_name() == "ReadValue":
            dims = [dim.min_length for dim in list(op.get_output_partial_shape(0))]
            dims[batch_dim] = batch
            dims = [
                (
                    opset13.constant(np.array([dim], dtype=np.int64))
                    if isinstance(dim, int)
                    else dim
                )
                for dim in dims
            ]
            shape = opset13.concat(dims, axis=0)
            broadcast = opset13.broadcast(
                opset13.constant(0.0, dtype=op.get_output_element_type(0)), shape
            )
            op.set_arguments([broadcast])
    ov_model.validate_nodes_and_infer_types()


def make_stateful(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    key_value_output_names: List[str],
    batch_dim: int,
    num_attention_heads: int,
    num_beams_and_batch: int = None,
):
    """
    Hides kv-cache inputs and outputs inside the model as variables.

    Parameters:
        ov_model (ov.Model):
            openvino model
        not_kv_inputs (`List[str]`):
            list of input nodes in model that not related to past key values
        key_value_input_names (`List[str]`):
            list of names for key value input layers
        key_value_output_names (`List[str]`):
            list of names for key value input layers
        batch_dim (int):
            index of batch dimension in key value layers
        num_attention_heads (int):
            number of attention heads for batch dimension initialization
        num_beams_an_batch (int):
            precalculated number of beams and batch for shapes initialization
    """
    from openvino._offline_transformations import apply_make_stateful_transformation

    input_output_map = {}

    if num_beams_and_batch is not None:
        # Set batch size for input_ids and attention mask to avoid dynamic dimension got propagated from the end of the model back to ReadValue
        for input in not_kv_inputs:
            shape = input.get_partial_shape()
            if shape.rank.get_length() <= 2:  # == 1 for beam_index
                shape[0] = num_beams_and_batch
                input.get_node().set_partial_shape(shape)
    for kv_name_pair in zip(key_value_input_names, key_value_output_names):
        input_output_map[kv_name_pair[0]] = kv_name_pair[1]
        if num_beams_and_batch is not None:
            input = ov_model.input(kv_name_pair[0])
            shape = input.get_partial_shape()
            shape[batch_dim] = num_beams_and_batch * num_attention_heads
            input.get_node().set_partial_shape(shape)

    if num_beams_and_batch is not None:
        # Re-validation model if shapes are altered above
        ov_model.validate_nodes_and_infer_types()

    apply_make_stateful_transformation(ov_model, input_output_map)
    if num_beams_and_batch is None:
        build_state_initializer(ov_model, batch_dim)


def patch_stateful(ov_model):
    key_value_input_names = [key.get_any_name() for key in ov_model.inputs[2:-1]]
    key_value_output_names = [key.get_any_name() for key in ov_model.outputs[1:]]
    not_kv_inputs = [
        input
        for input in ov_model.inputs
        if not any(name in key_value_input_names for name in input.get_names())
    ]
    if not key_value_input_names or not key_value_output_names:
        return
    batch_dim = 0
    num_attention_heads = 1

    fuse_cache_reorder(ov_model, not_kv_inputs, key_value_input_names, batch_dim)
    make_stateful(
        ov_model,
        not_kv_inputs,
        key_value_input_names,
        key_value_output_names,
        batch_dim,
        num_attention_heads,
        None,
    )



In [9]:
import types

make_stateful_model = False
core = ov.Core()
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, low_cpu_mem_usage=True
)
language_model = model.language_model
if not LANGUAGE_MODEL_PATH.exists() or True:

    def forward_wrap(
        self,
        attention_mask,
        position_ids=None,
        past_key_values=None,
        inputs_embeds=None,
    ):
        result = self._orig_forward(
            input_ids=None,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            output_hidden_states=True,
            return_dict=True,
        )
        return result["hidden_states"][-1][:, -1, :]

    model_inputs = ["attention_mask", "position_ids"]
    model_outputs = ["last_hidden_state"]
    model_inputs.append("inputs_embeds")
    language_model.config.torchscript = True
    position_ids = torch.tensor([[2, 3], [2, 3]])
    language_model._orig_forward = language_model.forward
    language_model.forward = types.MethodType(forward_wrap, language_model)
    ov_model = ov.convert_model(
        language_model,
        example_input={
            "inputs_embeds": llm_input,
            "attention_mask": torch.ones((2, 4)),
            "position_ids": position_ids,
        },
    )

    for input, input_name in zip(ov_model.inputs, model_inputs):
        input.get_tensor().set_names({input_name})

    for output, output_name in zip(ov_model.outputs, model_outputs):
        output.get_tensor().set_names({output_name})
    if make_stateful_model:
        patch_stateful(ov_model)
    ov.save_model(ov_model, LANGUAGE_MODEL_PATH)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()

Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.00s/it]
  if sequence_length != 1:
  if a.grad is not None:


In [10]:
import nncf

compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 64,
    "ratio": 1.0,
}
LANGUAGE_MODEL_PATH_INT4 = (
    LANGUAGE_MODEL_PATH.parent / LANGUAGE_MODEL_PATH.name.replace(".xml", "-int4.xml")
)
ov_model = core.read_model(LANGUAGE_MODEL_PATH)
ov_model_compressed = nncf.compress_weights(ov_model, **compression_configuration)
ov.save_model(ov_model_compressed, LANGUAGE_MODEL_PATH_INT4)


INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 1% (1 / 224)                │ 0% (0 / 223)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 99% (223 / 224)             │ 100% (223 / 223)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


In [11]:
import torch
import torch.nn as nn


class UnpadImage(nn.Module):
    def __init__(self):
        super(UnpadImage, self).__init__()

    def forward(self, tensor, original_size, current_size):
        """
        Unpads an image tensor to its original size based on the current size.
        Args:
            tensor (torch.Tensor): The input image tensor of shape (C, H, W).
            original_size (torch.Tensor): The original size of the image tensor as (H, W).
            current_size (torch.Tensor): The current size of the image tensor as (H, W).
        """
        # tensor: (C, H, W)
        original_size = original_size.to(torch.float32)
        original_height, original_width = original_size[0], original_size[1]
        current_height, current_width = current_size[0], current_size[1]

        original_aspect_ratio = original_width / original_height
        current_aspect_ratio = current_width / current_height

        # Comparison
        condition = original_aspect_ratio > current_aspect_ratio

        # Branch 1: vertical padding
        scale_factor_1 = current_width.float() / original_width.float()
        new_height = (original_height.float() * scale_factor_1).int()
        pad_top = ((current_height.float() - new_height) / 2).floor().long()

        # Branch 2: horizontal padding
        scale_factor_2 = current_height.float() / original_height.float()
        new_width = (original_width.float() * scale_factor_2).int()
        pad_left = ((current_width.float() - new_width) / 2).floor().long()

        zero = torch.zeros(1, dtype=pad_top.dtype, device=tensor.device).squeeze(0)

        # Use torch.where to conditionally compute slicing
        y_start = torch.where(condition, pad_top, zero)
        y_end = torch.where(condition, current_height - pad_top, current_height)

        x_start = torch.where(condition, zero, pad_left)
        x_end = torch.where(condition, current_width - pad_left, current_width)
        out = tensor[:, y_start.int() : y_end.int(), x_start.int() : x_end.int()]
        return out  # Remove batch dimension if needed


In [12]:
import math


class PackImageFeatures(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.unpad_image = UnpadImage()
        self.height = config.vision_config.image_size // config.vision_config.patch_size
        self.width = config.vision_config.image_size // config.vision_config.patch_size

    def forward(self, image_feature, image_sizes, num_patch_height, num_patch_width):
        # we image features is a single image features, so we can remove the loop
        base_image_features = image_feature[0]
        features = image_feature[1:]  # Skip the first token
        features = (
            features.view(
                num_patch_height, num_patch_width, self.height, self.width, -1
            )
            .permute(4, 0, 2, 1, 3)
            .contiguous()
            .flatten(1, 2)
            .flatten(2, 3)
        )
        features = self.unpad_image(
            features, image_sizes[0], torch._shape_as_tensor(features)[1:3]
        )
        features = features.flatten(1, 2).transpose(0, 1)
        features = torch.cat([base_image_features, features], dim=0)
        return features.unsqueeze(0)

In [13]:
import torch
import torch.nn as nn


class MergeInputWithImageFeatures(nn.Module):
    def __init__(self, pad_token_id=0, image_token_index=0):
        super().__init__()
        self.pad_token_id = pad_token_id
        self.image_token_index = image_token_index

    def forward(self, image_features, inputs_embeds, input_ids, attention_mask):
        num_images, num_image_patches, embed_dim = image_features.shape
        batch_size, sequence_length = input_ids.shape

        # left_padding = torch.sum(input_ids[:, -1] == self.pad_token_id) == 0  # Removed, not needed now

        special_image_token_mask = input_ids == self.image_token_index  # [B, S]
        num_special_image_tokens = special_image_token_mask.sum(dim=-1)  # [B]

        max_embed_dim = (
            num_special_image_tokens.max() * (num_image_patches - 1)
        ) + sequence_length  # scalar

        batch_indices, non_image_indices = torch.where(
            input_ids != self.image_token_index
        )  # [N], [N]

        # Step 2: Compute new token positions
        new_token_positions = (
            torch.cumsum(special_image_token_mask * (num_image_patches - 1) + 1, dim=-1)
            - 1
        )  # [B, S]

        nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]  # [B]

        # left_padding_flag = (input_ids[:, -1] != self.pad_token_id).to(nb_image_pad.dtype)  # original
        left_padding_flag = (
            input_ids[:, -1] != self.pad_token_id
        ).long()  # more idiomatic torch
        # new_token_positions = new_token_positions + (left_padding_flag[:, None] * nb_image_pad[:, None])  # original
        new_token_positions += (
            left_padding_flag[:, None] * nb_image_pad[:, None]
        )  # updated

        text_to_overwrite = new_token_positions[batch_indices, non_image_indices]  # [N]

        # Step 3: Init final tensors
        final_embedding = torch.zeros(
            batch_size,
            max_embed_dim,
            embed_dim,
            dtype=inputs_embeds.dtype,
            device=inputs_embeds.device,
        )
        final_attention_mask = torch.zeros(
            batch_size,
            max_embed_dim,
            dtype=attention_mask.dtype,
            device=inputs_embeds.device,
        )

        # final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]  # original
        final_embedding.index_put_(
            (batch_indices, text_to_overwrite),
            inputs_embeds[batch_indices, non_image_indices],
        )  # torch native

        # final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]  # original
        final_attention_mask.index_put_(
            (batch_indices, text_to_overwrite),
            attention_mask[batch_indices, non_image_indices],
        )  # torch native

        # Step 5: fill in image features
        image_to_overwrite = (final_embedding == 0).all(dim=-1)  # [B, L]
        image_to_overwrite &= (image_to_overwrite.cumsum(-1) - 1) >= nb_image_pad[
            :, None
        ]  # apply pad cutoff

        flat_image_features = image_features.reshape(-1, embed_dim).to(
            inputs_embeds.device
        )  # [N_img, D]

        # final_embedding[image_to_overwrite] = flat_image_features  # original
        final_embedding[image_to_overwrite] = flat_image_features[
            : image_to_overwrite.sum()
        ]  # safe assignment

        final_attention_mask |= image_to_overwrite  # logical or with existing mask

        position_ids = final_attention_mask.cumsum(-1) - 1
        position_ids = position_ids.masked_fill(final_attention_mask == 0, 1)

        # Step 6: remove pad token embeddings
        batch_pad_indices, pad_token_positions = torch.where(
            input_ids == self.pad_token_id
        )  # [N_pad]
        indices_to_mask = new_token_positions[
            batch_pad_indices, pad_token_positions
        ]  # [N_pad]

        # final_embedding[batch_pad_indices, indices_to_mask] = 0  # original
        final_embedding.index_put_(
            (batch_pad_indices, indices_to_mask),
            torch.zeros_like(final_embedding[batch_pad_indices, indices_to_mask]),
        )  # updated

        return {
            "final_embedding": final_embedding,
            "final_attention_mask": final_attention_mask,
            "position_ids": position_ids,
        }


In [None]:
# compile the models
language_model = core.read_model(LANGUAGE_MODEL_PATH)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_embed_model = core.compile_model(IMAGE_ENCODER_PATH, device)
text_embeddings_model = core.compile_model(INPUT_EMBEDDING_PATH, device)

if IMAGE_PACKER_PATH.exists():
    image_packer_model = core.compile_model(IMAGE_PACKER_PATH, device)
else:
    image_packer_model = None
if MULTIMODAL_MERGER_PATH.exists()
    multimodal_merger_model = core.compile_model(MULTIMODAL_MERGER_PATH, device)
else:
    multimodal_merger_model = None

# multimodal_merger_model = core.compile_model(MODEL_MERGER_PATH, device)

In [15]:
import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import AutoTokenizer, AutoConfig
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

llama3_template = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"

processor = LlavaNextProcessor.from_pretrained("royokong/e5-v")

config = AutoConfig.from_pretrained("royokong/e5-v")
img_prompt = llama3_template.format("<image>\nSummary above image in one word: ")
text_prompt = llama3_template.format("<sent>\nSummary above sentence in one word: ")

images = [Image.open("dog.jpg").convert("RGB")]

for image in images:
    print(f"Image size: {image.size}, Mode: {image.mode}")

texts = ["A dog sitting in the grass."]

text_inputs = processor(
    [text_prompt.replace("<sent>", text) for text in texts],
    return_tensors="pt",
    padding=True,
)
img_inputs = processor(
    [img_prompt] * len(images), images, return_tensors="pt", padding=True
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Image size: (360, 282), Mode: RGB


In [16]:
img_input_ids = img_inputs["input_ids"]
img_attention_mask = img_inputs["attention_mask"]
image_sizes = img_inputs["image_sizes"]
pixel_values = img_inputs["pixel_values"]

text_input_ids = text_inputs["input_ids"]
text_attention_mask = text_inputs["attention_mask"]


In [17]:
image_features = torch.from_numpy(image_embed_model(pixel_values)[0])
image_inputs_embeds = torch.from_numpy(text_embeddings_model(img_input_ids)[0])
text_inputs_embeds = torch.from_numpy(text_embeddings_model(text_input_ids)[0])


In [18]:
image_packer = PackImageFeatures(config)
input_merger = MergeInputWithImageFeatures(
    pad_token_id=processor.tokenizer.pad_token_id,
    image_token_index=config.image_token_index,
)

In [19]:
import numpy as np
from typing import Union, List, Tuple
import torch


def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
    """
    Selects the best resolution from a list of possible resolutions based on the original size.

    This is done by calculating the effective and wasted resolution for each possible resolution.

    The best fit resolution is the one that maximizes the effective resolution and minimizes the wasted resolution.

    Args:
        original_size (tuple):
            The original size of the image in the format (height, width).
        possible_resolutions (list):
            A list of possible resolutions in the format [(height1, width1), (height2, width2), ...].

    Returns:
        tuple: The best fit resolution in the format (height, width).
    """
    original_height, original_width = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")

    for height, width in possible_resolutions:
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = (
            int(original_width * scale),
            int(original_height * scale),
        )
        effective_resolution = min(
            downscaled_width * downscaled_height, original_width * original_height
        )
        wasted_resolution = (width * height) - effective_resolution

        if effective_resolution > max_effective_resolution or (
            effective_resolution == max_effective_resolution
            and wasted_resolution < min_wasted_resolution
        ):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (height, width)

    return best_fit


def image_size_to_num_patches(image_size, grid_pinpoints, patch_size: int):
    """
    Calculate the number of patches after the preprocessing for images of any resolution.

    Args:
        image_size (`Union[torch.LongTensor, np.ndarray, Tuple[int, int]):
            The size of the input image in the format (height, width). ?
        grid_pinpoints (`List`):
            A list containing possible resolutions. Each item in the list should be a tuple or list
            of the form `(height, width)`.
        patch_size (`int`):
            The size of each image patch.

    Returns:
        int: the number of patches
    """
    if not isinstance(grid_pinpoints, list):
        raise ValueError("grid_pinpoints should be a list of tuples or lists")

    # ! VERY IMPORTANT if image_size is tensor, must convert to into tuple, otherwise it will cause wrong calculate
    if not isinstance(image_size, (list, tuple)):
        if not isinstance(image_size, (torch.Tensor, np.ndarray)):
            raise ValueError(
                f"image_size invalid type {type(image_size)} with value {image_size}"
            )
        image_size = image_size.tolist()

    best_resolution = select_best_resolution(image_size, grid_pinpoints)
    height, width = best_resolution
    num_patches = 0
    # consider change to ceil(height/patch_size)*ceil(width/patch_size) + 1
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            num_patches += 1
    # add the base patch
    num_patches += 1
    return num_patches


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """
    Calculate the shape of the image patch grid after the preprocessing for images of any resolution.

    Args:
        image_size (`tuple`):
            The size of the input image in the format (width, height).
        grid_pinpoints (`List`):
            A list containing possible resolutions. Each item in the list should be a tuple or list
            of the form `(height, width)`.
        patch_size (`int`):
            The size of each image patch.

    Returns:
        tuple: The shape of the image patch grid in the format (width, height).
    """
    if not isinstance(grid_pinpoints, list):
        raise ValueError("grid_pinpoints should be a list of tuples or lists")

    # ! VERY IMPORTANT if image_size is tensor, must convert to into tuple, otherwise it will cause wrong calculate
    if not isinstance(image_size, (list, tuple)):
        if not isinstance(image_size, (torch.Tensor, np.ndarray)):
            raise ValueError(
                f"image_size invalid type: {type(image_size)} not valid, should be either list, tuple, np.ndarray or tensor"
            )
        image_size = image_size.tolist()

    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size

In [20]:
num_patch_width, num_patch_height = get_anyres_image_grid_shape(
    image_sizes[0],
    config.image_grid_pinpoints,
    config.vision_config.image_size,
)

In [21]:
packed_image_features = image_packer(
    image_features,
    image_sizes,
    num_patch_height=num_patch_height,
    num_patch_width=num_patch_width
)

In [22]:

if IMAGE_PACKER_PATH.exists():
    IMAGE_PACKER_PATH.unlink()

ov_image_packer = ov.convert_model(
    image_packer,
    example_input={
        "image_feature": image_features,
        "image_sizes": image_sizes,
        "num_patch_height": torch.tensor(num_patch_height, dtype=torch.int64),
        "num_patch_width": torch.tensor(num_patch_width, dtype=torch.int64)
    }
)
ov.save_model(ov_image_packer, IMAGE_PACKER_PATH)

In [23]:
if MULTIMODAL_MERGER_PATH.exists():
    MULTIMODAL_MERGER_PATH.unlink()
ov_multimodal_merger = ov.convert_model(
    input_merger,
    example_input={
        "image_features": packed_image_features,
        "inputs_embeds": image_inputs_embeds,
        "input_ids": img_input_ids,
        "attention_mask": img_attention_mask
    }
)
ov.save_model(ov_multimodal_merger, MULTIMODAL_MERGER_PATH)
cleanup_torchscript_cache()

In [24]:
import shutil
import os
if not os.path.exists(f"{output_dir}/assets"):
    output_dir = Path(output_dir)
    assets_dir = output_dir/"assets"
    assets_dir.mkdir(exist_ok=True)
    processor.save_pretrained(output_dir)
    # copy all the assets to the assets directory (json files, vocab files, etc.)
    for file in output_dir.glob("*.json"):
        shutil.copy(file, assets_dir)


In [27]:
# delete the f32 language model
if LANGUAGE_MODEL_PATH.exists():
    LANGUAGE_MODEL_PATH.unlink()

# delete the f32 language model bin file if exists
if LANGUAGE_MODEL_PATH.with_suffix(".bin").exists():
    LANGUAGE_MODEL_PATH.with_suffix(".bin").unlink()

## 2. Test the Exported model

In [29]:
IMAGE_ENCODER_PATH = output_dir / "openvino_vision_embeddings_model.xml"
LANGUAGE_MODEL_PATH = output_dir / "openvino_language_model-int4.xml"
INPUT_EMBEDDING_PATH = output_dir / "openvino_text_embeddings_model.xml"

IMAGE_PACKER_PATH = output_dir / "openvino_image_packer.xml"
MULTIMODAL_MERGER_PATH = output_dir / "openvino_multimodal_merger.xml"

In [30]:
# compile the models
language_model = core.read_model(LANGUAGE_MODEL_PATH)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_embed_model = core.compile_model(IMAGE_ENCODER_PATH, device)
text_embeddings_model = core.compile_model(INPUT_EMBEDDING_PATH, device)

if IMAGE_PACKER_PATH.exists():
    image_packer_model = core.compile_model(IMAGE_PACKER_PATH, device)
else:
    image_packer_model = None
if MULTIMODAL_MERGER_PATH.exists():
    multimodal_merger_model = core.compile_model(MULTIMODAL_MERGER_PATH, device)
else:
    multimodal_merger_model = None


In [31]:
# use openvino model to pack the image features
packed_image_features = image_packer_model({
    'image_feature': image_features,
    'image_sizes': image_sizes,
    'num_patch_height': torch.tensor(num_patch_height, dtype=torch.int64),
    'num_patch_width': torch.tensor(num_patch_width, dtype=torch.int64)
})[0]
packed_image_features = torch.from_numpy(packed_image_features)

In [32]:
# use openvino model to merge the image features with text features
merger_out = multimodal_merger_model({
        "image_features": packed_image_features,
        "inputs_embeds": image_inputs_embeds,
        "input_ids": img_input_ids,
        "attention_mask": img_attention_mask
    }
)
image_final_embeds = torch.from_numpy(merger_out['final_embedding'])
image_final_attention_mask = torch.from_numpy(merger_out['final_attention_mask'])
image_position_ids = torch.from_numpy(merger_out['position_ids'])

In [33]:
request = compiled_language_model.create_infer_request()
img_input_lm = {
    "inputs_embeds": image_final_embeds.detach().numpy(),
    "attention_mask": image_final_attention_mask.detach().numpy(),
    "position_ids": image_position_ids.detach().numpy(),
}
request.start_async(img_input_lm, share_inputs=True)
request.wait()
img_lm_output = torch.from_numpy(request.get_tensor("last_hidden_state").data)

In [34]:
text_request = compiled_language_model.create_infer_request()
text_position_ids = text_attention_mask.long().cumsum(-1) - 1
text_position_ids.masked_fill_(text_attention_mask == 0, 1)
text_input_lm = {
    "inputs_embeds": text_inputs_embeds.detach().numpy(),
    "attention_mask": text_attention_mask.detach().numpy(),
    "position_ids": text_position_ids.detach().numpy(),
}
text_request.start_async(text_input_lm, share_inputs=True)
text_request.wait()
text_lm_output = torch.from_numpy(text_request.get_tensor("last_hidden_state").data)

In [35]:
import torch.nn.functional as F

txt_embed = F.normalize(text_lm_output, dim=-1)
img_embed = F.normalize(img_lm_output, dim=-1)

print(txt_embed @ img_embed.T)

tensor([[0.7158]])


## 3 Import and Save E5V in Spark NLP

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


In [1]:
model_id = "royokong/e5-v"

In [9]:
e5v_embeddings_sn = E5VEmbeddings \
            .loadSavedModel(str(output_dir),spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

25/06/10 03:45:32 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
25/06/10 03:45:41 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native4021672575912693842/libtbb.so.2




In [40]:
e5v_embeddings_sn.write().overwrite().save(f"file:///tmp/{model_id}_spark_nlp")

                                                                                

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from sparknlp.util import EmbeddingsDataFrameUtils

from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

imagePrompt = "<|start_header_id|>user<|end_header_id|>\n\n<image>\\nSummary above image in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
image_df = spark.read.format("image").option("dropInvalid", True).load(images_path)
test_df = image_df.withColumn("text", lit(imagePrompt))

textPrompt = "<|start_header_id|>user<|end_header_id|>\n\n<sent>\\nSummary above sentence in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
textDesc = "A cat sitting in a box."
nullImageDF = spark.createDataFrame(
    [EmbeddingsDataFrameUtils.emptyImageRow], schema=
    EmbeddingsDataFrameUtils.imageSchema)
textDF = nullImageDF.withColumn("text", lit(textPrompt.replace("<sent>", textDesc)))

test_df = test_df.union(textDF)

imageAssembler = ImageAssembler() \
            .setInputCol("image") \
            .setOutputCol("image_assembler")
e5v = E5VEmbeddings.load(f"file:///tmp/{model_id}_spark_nlp") \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("e5v")
pipeline = Pipeline().setStages([imageAssembler, e5v])
results = pipeline.fit(test_df).transform(test_df)
results.select("e5v.embeddings").show(truncate=True)