![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_SmolVLMTransformer.ipynb)

# Import OpenVINO SmolVLM models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing SmolVLM models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models, so make sure you are running Spark NLP 5.4.0 or later.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import SmolVLM models via `SmolVLM`. These models are usually under the `Text Generation` category and have `SmolVLM` in their labels.
- Reference: [SmolVLM](https://huggingface.co/docs/transformers/model_doc/llama#transformers.SmolVLM)
- Some [example models](https://huggingface.co/models?search=SmolVLM)

## Table of Contents

1. [Setup and Installation](#1-setup-and-installation)
2. [Model Configuration](#2-model-configuration)
3. [Model Loading and Preparation](#3-model-loading-and-preparation)
4. [Model Conversion to OpenVINO](#4-model-conversion-to-openvino)
5. [Model Quantization](#5-model-quantization)
6. [Model Merger Implementation](#6-model-merger-implementation)
7. [Testing OpenVINO Model](#7-testing-openvino-model)
8. [Import and Save in Spark NLP](#8-import-and-save-in-spark-nlp)


## 1. Setup and Installation

First, let's install all the required dependencies for this notebook.

In [1]:
# Install OpenVINO and NNCF for model optimization
%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"

# Install NLP and tokenization libraries
%pip install -q  "sentencepiece" "tokenizers>=0.12.1" "transformers>=4.46.0" "gradio>=4.36"

# Install OpenVINO nightly builds for latest features
%pip install -q -U --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly openvino-tokenizers openvino openvino-genai

# Install HuggingFace Hub and PyTorch
%pip install -q --upgrade huggingface_hub
%pip install -q --upgrade torch
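
After installing, it can help to confirm that the environment actually meets the minimum versions requested in the install cell. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are hypothetical, and the parser deliberately ignores pre-release suffixes such as `.dev` or `rc`.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '2024.4.0' or '2025.0.0.dev1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component (e.g. 'dev1')
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum versions copied from the install cell above.
MINIMUMS = {"openvino": "2024.4.0", "nncf": "2.13.0", "transformers": "4.46.0"}

for name, minimum in MINIMUMS.items():
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{name} {installed}: {status}")
```

Because the nightly wheels are installed last, the reported `openvino` version may be a development build newer than the stated minimum.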

### Environment Configuration

Configure the environment to disable tokenizer parallelism for better compatibility.

In [2]:
import os
# Disable tokenizer parallelism to avoid potential issues
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
from pathlib import Path
import types
from typing import Optional, List
import gc
import openvino as ov
from openvino.runtime import opset13
import nncf
import numpy as np
import torch
from transformers import AutoProcessor, AutoConfig, AutoModelForVision2Seq
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable




## 2. Model Configuration

Set up the model ID and quantization parameters for the conversion process.

In [4]:
model_id = "HuggingFaceTB/SmolVLM-Instruct"

In [5]:
quantization_method = "int4"
output_dir = Path(f"models/{model_id}/{quantization_method}")
output_dir.mkdir(parents=True, exist_ok=True)
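
Since `model_id` contains a slash, the output directory is nested one level deeper than it might first appear. The sketch below lists the files one would expect after the conversion cells in this notebook have run, assuming each `ov.save_model` call writes an `.xml` file with a matching `.bin` file next to it. The `COMPONENTS` list and `expected_files` helper are illustrative names, not part of the notebook.

```python
from pathlib import PurePosixPath

# File stems of the components saved by the conversion cells in this notebook.
COMPONENTS = [
    "image_embed",
    "image_encoder",
    "image_connector",
    "text_embeddings",
    "language_model",
    "lm_head",
    "model_merger",
]


def expected_files(model_id: str, quantization_method: str) -> list:
    """Return the .xml/.bin pairs expected under models/<model_id>/<method>."""
    root = PurePosixPath("models") / model_id / quantization_method
    files = []
    for stem in COMPONENTS:
        files.append(root / f"{stem}.xml")  # graph definition
        files.append(root / f"{stem}.bin")  # weights
    return files


for path in expected_files("HuggingFaceTB/SmolVLM-Instruct", "int4"):
    print(path)
```

The processor and config are saved separately under the `assets` subdirectory of the same root.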

## 3. Model Loading and Preparation

Load the model processor, configuration, and prepare the model for conversion to OpenVINO format.

In [6]:
# Load the processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# load the config
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Change the model to use SDPA attention
# This is a workaround for the model to be compatible with OpenVINO
config.text_config._attn_implementation = "sdpa"


# export the processor and config to output_dir/assets
processor.save_pretrained(output_dir/"assets")
config.save_pretrained(output_dir/"assets")

In [7]:
# Load the model
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True, config=config)
model.eval()
model.model.eval()
__make_16bit_traceable(model)



## 4. Model Conversion to OpenVINO

Define paths for the converted model components and implement conversion utilities.

In [8]:
core = ov.Core()

image_embed_path = output_dir / "image_embed.xml"
image_encoder_path = output_dir / "image_encoder.xml"
image_post_layernorm_path = output_dir / "image_post_layernorm.xml"
image_connector_path = output_dir / "image_connector.xml"
language_model_path = output_dir / "language_model.xml"
model_merger_path = output_dir / "model_merger.xml"
text_embeddings_path = output_dir / "text_embeddings.xml"
lm_head_path = output_dir / "lm_head.xml"
image_model_path = output_dir / "image_model.xml"

In [9]:
def cleanup_torchscript_cache():
    """
    Helper function for removing cached model representation to prevent memory leaks
    during model conversion.
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

### Convert Text Embeddings Component

Convert the text embeddings component of the model to OpenVINO format.

In [10]:
# save text embeddings
with torch.no_grad():
    ov_model = ov.convert_model(
        model.model.text_model.embed_tokens,
        example_input=torch.ones([2, 2], dtype=torch.int64),
    )
    ov.save_model(ov_model, text_embeddings_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()



### Convert Vision Model Components

Convert the vision model components to OpenVINO format.

In [11]:
def forward_wrap(self, pixel_values: torch.FloatTensor, patch_attention_mask: torch.FloatTensor) -> torch.Tensor:
        """
        Forward pass optimized for ONNX/OpenVINO export.

        Args:
            pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, max_im_h, max_im_w)`):
                Pixel values corresponding to the images. `max_im_h` and `max_im_w` are the
                maximum height and width in the batch after padding.
            patch_attention_mask (`torch.BoolTensor` of shape `(batch_size, max_nb_patches_h, max_nb_patches_w)`):
                Mask indicating which patches are actual image patches vs padding.

        Returns:
            `torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`: Patch embeddings
            with added positional embeddings. sequence_length is `max_nb_patches_h * max_nb_patches_w`.
        """
        patch_attention_mask = (patch_attention_mask > 0).to(torch.bool) # Convert to boolean mask

        batch_size, _, max_im_h, max_im_w = pixel_values.shape
        device = pixel_values.device

        # 1. Calculate Patch Embeddings (ONNX-friendly)
        # Shape: (batch_size, embed_dim, max_nb_patches_h, max_nb_patches_w)
        patch_embeds = self.patch_embedding(pixel_values)
        max_nb_patches_h, max_nb_patches_w = patch_embeds.shape[-2:]

        # Shape: (batch_size, max_nb_patches_h * max_nb_patches_w, embed_dim)
        embeddings = patch_embeds.flatten(2).transpose(1, 2)

        # 2. Calculate Positional Embeddings (Vectorized & ONNX-friendly)

        # Determine actual number of patches per image in the batch
        # Cast to float and clamp to a minimum of 1 so the division below is safe.
        # Shape: (batch_size,)
        nb_patches_h = patch_attention_mask.any(dim=2).sum(dim=1).float().clamp(min=1.0)
        nb_patches_w = patch_attention_mask.any(dim=1).sum(dim=1).float().clamp(min=1.0)

        # Create coordinate grids for the *maximum* patch layout
        # Shape: (max_nb_patches_h,) and (max_nb_patches_w,)
        h_indices = torch.arange(max_nb_patches_h, device=device, dtype=torch.float32)
        w_indices = torch.arange(max_nb_patches_w, device=device, dtype=torch.float32)

        # Calculate fractional coordinates relative to the *actual* image dimensions
        # We scale the grid indices by the ratio of max patches to actual patches
        # to get coordinates in the [0, 1) range representative of the original image proportions.
        # Use broadcasting: h_indices (H) / nb_patches_h (B, 1) -> (B, H)
        #                   w_indices (W) / nb_patches_w (B, 1) -> (B, W)
        # No epsilon is added to the denominator: clamp(min=1) above already rules out division by zero.
        # For real (unpadded) patches the resulting range is [0, 1), as bucketize expects.
        frac_coords_h = h_indices.unsqueeze(0) / (nb_patches_h.unsqueeze(1)) # Shape: (batch_size, max_nb_patches_h)
        frac_coords_w = w_indices.unsqueeze(0) / (nb_patches_w.unsqueeze(1)) # Shape: (batch_size, max_nb_patches_w)

        # Bucketize the fractional coordinates using the pre-defined boundaries
        # These map the fractional coordinates to the discrete grid cells of the *reference* embedding table
        # Shape: (batch_size, max_nb_patches_h)
        bucket_coords_h = torch.bucketize(frac_coords_h, self.boundaries, right=True)
        # Shape: (batch_size, max_nb_patches_w)
        bucket_coords_w = torch.bucketize(frac_coords_w, self.boundaries, right=True)

        # Combine bucket coordinates to get position IDs for the reference grid
        # Expand dims for broadcasting:
        # bucket_coords_h (B, H) -> (B, H, 1)
        # bucket_coords_w (B, W) -> (B, 1, W)
        # Result shape: (batch_size, max_nb_patches_h, max_nb_patches_w)
        position_ids_full = (
            bucket_coords_h.unsqueeze(2) * self.num_patches_per_side + bucket_coords_w.unsqueeze(1)
        )

        # Flatten the position IDs and the attention mask
        # Shape: (batch_size, max_nb_patches_h * max_nb_patches_w)
        position_ids_flat = position_ids_full.flatten(1)
        patch_attention_mask_flat = patch_attention_mask.flatten(1) # Shape: (batch_size, max_nb_patches_h * max_nb_patches_w)

        # Use the attention mask to select the calculated position IDs for actual patches
        # and use a default ID (e.g., 0) for padding patches. This mimics the original loop's behavior
        # where IDs were only calculated and assigned for active patches.
        # Using torch.zeros_like ensures the default ID tensor is on the correct device.
        final_position_ids = torch.where(
            patch_attention_mask_flat,
            position_ids_flat,
            torch.zeros_like(position_ids_flat) # Pad with position ID 0
        )

        # 3. Add Positional Embeddings
        # Shape: (batch_size, max_nb_patches_h * max_nb_patches_w, embed_dim)
        pos_embeds = self.position_embedding(final_position_ids)

        # Add to patch embeddings
        embeddings = embeddings + pos_embeds

        return embeddings
# boundaries = torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
# self.register_buffer("boundaries", boundaries, persistent=False)

model.model.vision_model.embeddings._boundaries = torch.arange(1 / model.model.vision_model.embeddings.num_patches_per_side, 1.0, 1 / model.model.vision_model.embeddings.num_patches_per_side)
# Register the boundaries buffer
model.model.vision_model.embeddings.register_buffer("boundaries", model.model.vision_model.embeddings._boundaries, persistent=False)

model.model.vision_model.embeddings._orig_forward = model.model.vision_model.embeddings.forward
# Wrap the forward method
model.model.vision_model.embeddings.forward = types.MethodType(forward_wrap, model.model.vision_model.embeddings)

with torch.no_grad():
    ov_model = ov.convert_model(
        model.model.vision_model.embeddings,
        example_input={
            "pixel_values": torch.ones([13, 3, 384, 384], dtype=torch.float32),
            "patch_attention_mask": torch.ones([13, 27, 27], dtype=torch.int64),
        }
    )
    ov.save_model(ov_model, image_embed_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
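
To make the position-ID computation in the wrapped forward easier to follow, here is a small pure-Python sketch of the same idea for a single image. It assumes `bisect.bisect_right` as a stand-in for `torch.bucketize(..., right=True)`, which follows the same convention (`boundaries[i-1] <= value < boundaries[i]`). The function name `position_ids` is hypothetical.

```python
from bisect import bisect_right


def position_ids(mask, num_patches_per_side):
    """Compute flattened position IDs for one image.

    mask: list of rows of booleans with shape (max_h, max_w); True marks a real patch.
    """
    n = num_patches_per_side
    # Same boundaries as torch.arange(1 / n, 1.0, 1 / n)
    boundaries = [i / n for i in range(1, n)]
    max_h, max_w = len(mask), len(mask[0])

    # Number of rows / columns that contain at least one real patch (at least 1)
    nb_h = max(1, sum(1 for row in mask if any(row)))
    nb_w = max(1, sum(1 for col in zip(*mask) if any(col)))

    # Fractional coordinate of each grid index, bucketized into the reference grid
    bucket_h = [bisect_right(boundaries, i / nb_h) for i in range(max_h)]
    bucket_w = [bisect_right(boundaries, j / nb_w) for j in range(max_w)]

    # Real patches get row * n + column; padding patches get position ID 0
    return [
        bucket_h[i] * n + bucket_w[j] if mask[i][j] else 0
        for i in range(max_h)
        for j in range(max_w)
    ]


# A 3x3 padded grid whose top-left 2x2 block holds the real image patches
example_mask = [
    [True, True, False],
    [True, True, False],
    [False, False, False],
]
print(position_ids(example_mask, num_patches_per_side=4))
```

A 2x2 image is spread across the 4x4 reference grid, so its patches land on every second row and column of the embedding table rather than on the first four entries.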

In [12]:
def forward_wrap(
        self,
        inputs_embeds,
        attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
):
    result = self.encoder(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    last_hidden_state = result[0]
    return self.post_layernorm(last_hidden_state)


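# Keep a reference to the vision model's own forward (not the top-level model's)
model.model.vision_model._orig_forward = model.model.vision_model.forward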
model.model.vision_model.forward = types.MethodType(forward_wrap, model.model.vision_model)
with torch.no_grad():
    ov_model = ov.convert_model(
        model.model.vision_model,
        example_input={
            "inputs_embeds": torch.ones([13, 729, 1152], dtype=torch.float32),
        }
    )
    ov.save_model(ov_model, image_encoder_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
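
The `patch_attention_mask` input of the embeddings component is a patch-level reduction of the pixel-level mask: a patch counts as real if any pixel inside it is real. The testing code later in this notebook does this with `unfold`; the pure-Python sketch below shows the same reduction. The function name `patch_mask` is hypothetical, and the patch size of 14 used in the last line is an assumption chosen because it is consistent with the 384-pixel inputs and 27x27 masks in the example inputs.

```python
def patch_mask(pixel_mask, patch_size):
    """Reduce a (H, W) pixel mask to a (H // patch_size, W // patch_size) patch mask.

    A patch is marked True when at least one pixel inside it is truthy.
    Trailing rows or columns that do not fill a whole patch are dropped.
    """
    rows = len(pixel_mask) // patch_size
    cols = len(pixel_mask[0]) // patch_size
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = (
                pixel_mask[y][x]
                for y in range(r * patch_size, (r + 1) * patch_size)
                for x in range(c * patch_size, (c + 1) * patch_size)
            )
            out_row.append(any(block))
        result.append(out_row)
    return result


# 4x4 pixel mask where only the top-left 2x2 pixels belong to the image
small = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(patch_mask(small, patch_size=2))

# Assumed patch size of 14 on a 384x384 mask gives a 27x27 patch grid
full = [[1] * 384 for _ in range(384)]
print(len(patch_mask(full, patch_size=14)))
```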



In [13]:
with torch.no_grad():
    ov_model = ov.convert_model(
        model.model.connector,
        example_input=torch.ones([13, 729, 1152], dtype=torch.float32)
    )
    ov.save_model(ov_model, image_connector_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
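
The example input shapes used for the three vision components are related by simple arithmetic, sketched below. The patch size of 14 and the scale factor of 3 are assumptions: they are not stated in this notebook, but they are consistent with the shapes it uses (384-pixel images, 27x27 patch masks, 729 tokens into the connector and 81 tokens per image block in the merger). The helper name `vision_token_shapes` is hypothetical.

```python
def vision_token_shapes(image_size, patch_size, scale_factor, embed_dim):
    """Return (patches per side, encoder tokens, connector tokens, merged width)."""
    patches_per_side = image_size // patch_size
    seq_len = patches_per_side ** 2                # tokens entering the encoder
    merged_len = seq_len // scale_factor ** 2      # tokens left after spatial merging
    merged_dim = embed_dim * scale_factor ** 2     # width before projection to the text size
    return patches_per_side, seq_len, merged_len, merged_dim


side, seq_len, merged_len, merged_dim = vision_token_shapes(
    image_size=384, patch_size=14, scale_factor=3, embed_dim=1152
)
print(side, seq_len, merged_len, merged_dim)
```

Under these assumptions the connector reduces each image block from 729 tokens to 81 and then presumably projects the widened features down to the 2048-wide text hidden size expected by the merger and the language model.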



### Convert Language Model Head

Convert the language model head to OpenVINO format.

In [14]:
with torch.no_grad():
    ov_model = ov.convert_model(
        model.lm_head,
        example_input=torch.ones([2, 2, 2048], dtype=torch.float32),
    )
    ov.save_model(ov_model, lm_head_path)
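
Conceptually, the language model head turns the final hidden state of a position into one score per vocabulary entry, and greedy decoding picks the highest score. The sketch below illustrates this with plain lists, assuming a bias-free linear projection; `lm_head_logits` and `greedy_token` are hypothetical names, and the tiny weight matrix is made up for illustration.

```python
def lm_head_logits(hidden, weight):
    """Project a hidden vector of size D onto V vocabulary rows of size D."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def greedy_token(logits):
    """Return the index of the largest logit (first index wins on ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example: hidden size 2, vocabulary size 3
hidden_state = [1.0, 2.0]
toy_weight = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
logits = lm_head_logits(hidden_state, toy_weight)
print(logits, greedy_token(logits))
```

In the exported model the hidden size is 2048, matching the example input of shape `[2, 2, 2048]` used for the conversion.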

## 5. Model Quantization

Implement functions for model state management and quantization.

In [15]:
def model_has_state(ov_model: ov.Model):
    return len(ov_model.get_sinks()) > 0


def model_has_input_output_name(ov_model: ov.Model, name: str):
    """
    Helper function for checking that model has specified input or output name

    Parameters:
      ov_model (ov.Model):
      name (str):
          name of input or output

    Returns:
      True if input or output with requested name exists else False
    """
    return name in sum([list(t.get_names()) for t in ov_model.inputs + ov_model.outputs], [])


def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    gather_dim: int,
):
    """
    Fuses reorder_cache during the generate cycle into ov.Model. Used with stateful models, because we cannot modify model state directly.

    Adds a new beam_idx parameter and a Gather op for each kv-cache input in a given model.
    Should be run before make_stateful. Implements optimum's _reorder_cache
    inside the model at the beginning of each iteration.
    Gather works along given gather_dim dimension that may vary from model to model.
    KV-cache inputs are identified based on names in key_value_input_names.
    Append the new beam_idx parameter to not_kv_inputs.

    Parameters:
      ov_model (`ov.Model`):
          openvino model for processing
      not_kv_inputs (`List[str]`):
          list of input nodes in the model that are not related to past key values
      key_value_input_names (`List[str]`):
          list of names for key value input layers
      gather_dim (int):
          dimension for gathering cache during reorder pass
    """

    if model_has_input_output_name(ov_model, "beam_idx"):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input("inputs_embeds").get_partial_shape()[0]
    beam_idx = opset13.parameter(name="beam_idx", dtype=ov.Type.i32, shape=ov.PartialShape([input_batch]))
    beam_idx.output(0).get_tensor().add_names({"beam_idx"})  # why list is not accepted?
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])
    # Go over all cache parameters and fuse _reorder_cache with indices provided by the new parameter beam_idx
    for input_name in key_value_input_names:
        parameter_output_port = ov_model.input(input_name)
        consumers = parameter_output_port.get_target_inputs()
        gather = opset13.gather(parameter_output_port, beam_idx, opset13.constant(gather_dim))
        for consumer in consumers:
            consumer.replace_source_output(gather.output(0))
    ov_model.validate_nodes_and_infer_types()


def build_state_initializer(ov_model: ov.Model, batch_dim: int):
    """
    Build initialization ShapeOf Expression for all ReadValue ops

    Parameters:
      ov_model (ov.Model):
          openvino model
      batch_dim (int):
          index of dimension corresponding to batch size
    """
    input_ids = ov_model.input("inputs_embeds")
    batch = opset13.gather(
        opset13.shape_of(input_ids, output_type="i64"),
        opset13.constant([0]),
        opset13.constant(0),
    )
    for op in ov_model.get_ops():
        if op.get_type_name() == "ReadValue":
            dims = [dim.min_length for dim in list(op.get_output_partial_shape(0))]
            dims[batch_dim] = batch
            dims = [(opset13.constant(np.array([dim], dtype=np.int64)) if isinstance(dim, int) else dim) for dim in dims]
            shape = opset13.concat(dims, axis=0)
            broadcast = opset13.broadcast(opset13.constant(0.0, dtype=op.get_output_element_type(0)), shape)
            op.set_arguments([broadcast])
    ov_model.validate_nodes_and_infer_types()


def make_stateful(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    key_value_output_names: List[str],
    batch_dim: int,
    num_attention_heads: int,
    num_beams_and_batch: Optional[int] = None,
):
    """
    Hides kv-cache inputs and outputs inside the model as variables.

    Parameters:
        ov_model (ov.Model):
            openvino model
        not_kv_inputs (`List[str]`):
            list of input nodes in the model that are not related to past key values
        key_value_input_names (`List[str]`):
            list of names for key value input layers
        key_value_output_names (`List[str]`):
            list of names for key value output layers
        batch_dim (int):
            index of batch dimension in key value layers
        num_attention_heads (int):
            number of attention heads for batch dimension initialization
        num_beams_and_batch (int):
            precalculated number of beams and batch for shapes initialization
    """
    from openvino._offline_transformations import apply_make_stateful_transformation

    input_output_map = {}

    if num_beams_and_batch is not None:
        # Fix the batch size of input_ids and attention_mask so that a dynamic dimension does not propagate from the end of the model back to ReadValue
        for input in not_kv_inputs:
            shape = input.get_partial_shape()
            if shape.rank.get_length() <= 2:  # == 1 for beam_index
                shape[0] = num_beams_and_batch
                input.get_node().set_partial_shape(shape)
    for kv_name_pair in zip(key_value_input_names, key_value_output_names):
        input_output_map[kv_name_pair[0]] = kv_name_pair[1]
        if num_beams_and_batch is not None:
            input = ov_model.input(kv_name_pair[0])
            shape = input.get_partial_shape()
            shape[batch_dim] = num_beams_and_batch * num_attention_heads
            input.get_node().set_partial_shape(shape)

    if num_beams_and_batch is not None:
        # Re-validation model if shapes are altered above
        ov_model.validate_nodes_and_infer_types()

    apply_make_stateful_transformation(ov_model, input_output_map)
    if num_beams_and_batch is None:
        build_state_initializer(ov_model, batch_dim)


def patch_stateful(ov_model):
    key_value_input_names = [key.get_any_name() for key in ov_model.inputs[2:-1]]
    key_value_output_names = [key.get_any_name() for key in ov_model.outputs[1:]]
    not_kv_inputs = [input for input in ov_model.inputs if not any(name in key_value_input_names for name in input.get_names())]
    if not key_value_input_names or not key_value_output_names:
        return
    batch_dim = 0
    num_attention_heads = 1

    fuse_cache_reorder(ov_model, not_kv_inputs, key_value_input_names, batch_dim)
    make_stateful(
        ov_model,
        not_kv_inputs,
        key_value_input_names,
        key_value_output_names,
        batch_dim,
        num_attention_heads,
        None,
    )
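
Two ideas from the helpers above can be shown without OpenVINO: how cache inputs are paired with cache outputs, and what the fused `Gather` on `beam_idx` does to the cache. This is a simplified sketch with hypothetical helper names (`kv_name_map`, `reorder_cache`); the tensor names follow the `past_key_values.{idx}` / `present.{idx}` scheme that the conversion cell later assigns, and the cache is modelled as a plain list along the batch axis.

```python
def kv_name_map(num_layers):
    """Pair every cache input name with the output that feeds it on the next step."""
    inputs, outputs = [], []
    for idx in range(num_layers):
        inputs += [f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"]
        outputs += [f"present.{idx}.key", f"present.{idx}.value"]
    return dict(zip(inputs, outputs))


def reorder_cache(cache, beam_idx):
    """Gather cache entries along the batch axis, as the fused Gather op does."""
    return [cache[i] for i in beam_idx]


print(kv_name_map(2))

# Three beams; beams 0 and 1 both continue from old beam 2, beam 2 from old beam 0
print(reorder_cache(["beam0", "beam1", "beam2"], [2, 2, 0]))
```

With greedy decoding and a single sequence, `beam_idx` is simply `[0]`, so the gather leaves the cache unchanged.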

### Model Quantization Process

The quantization process reduces the precision of the model weights from floating-point to integer format, which can significantly reduce model size and improve inference speed. We're using INT4 quantization in this notebook.
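
To give a feel for what asymmetric 4-bit, group-wise quantization means, here is a simplified pure-Python sketch. It is an illustration of the general technique only and is not NNCF's actual algorithm: each group of weights gets its own scale and zero point, and every weight is stored as an integer code between 0 and 15. All function names are hypothetical.

```python
def quantize_group(weights):
    """Quantize one group of floats to 4-bit codes with a scale and zero point."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 if hi > lo else 1.0      # 16 levels -> 15 steps
    zero_point = min(15, max(0, round(-lo / scale)))  # code that represents 0.0
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize(weights, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


# A group spanning -8.0 .. 7.0 maps exactly onto the 16 available codes
sample = [float(v) for v in range(-8, 8)]
codes, scale, zero_point = quantize_group(sample)
print(codes, scale, zero_point)
print(dequantize_group(codes, scale, zero_point))
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and zero points; the `group_size` of 128 and `ratio` of 1.0 in the configuration cell mean every eligible layer is compressed to 4 bits with one scale per 128 weights.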

In [16]:
from transformers.cache_utils import DynamicCache

quantization_config = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}
lang_model = model.model.text_model
def forward_wrap(
        self,
        attention_mask,
        position_ids=None,
        past_key_values=None,
        inputs_embeds=None,
    ):
        if past_key_values is not None:
            new_past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        else:
            new_past_key_values = None
        result = self._orig_forward(
            input_ids=None,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=new_past_key_values,
            inputs_embeds=inputs_embeds,
        )
        if past_key_values is not None:
            result["past_key_values"] = result["past_key_values"].to_legacy_cache()
        return tuple(result.values())

lang_model._orig_forward = lang_model.forward
lang_model.forward = types.MethodType(forward_wrap, lang_model)
hidden_size = lang_model.config.hidden_size
llm_input = torch.zeros([2, 2, hidden_size])
pkv = lang_model._orig_forward(
    inputs_embeds=llm_input,
    attention_mask=torch.ones((2, 2), dtype=torch.int64),
)[1].to_legacy_cache()
model_inputs = ["attention_mask", "position_ids"]
model_outputs = ["last_hidden_state"]
for idx in range(len(pkv)):
    model_inputs.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
    model_outputs.extend([f"present.{idx}.key", f"present.{idx}.value"])
model_inputs.append("inputs_embeds")
position_ids = torch.tensor([[2, 3], [2, 3]])
with torch.no_grad():
    ov_model = ov.convert_model(
        lang_model,
        example_input={
            "inputs_embeds": llm_input,
            "attention_mask": torch.ones([2, 4], dtype=torch.int64),
            "past_key_values": pkv,
            "position_ids": position_ids,
        },
    )
for input, input_name in zip(ov_model.inputs, model_inputs):
    input.get_tensor().set_names({input_name})

for output, output_name in zip(ov_model.outputs, model_outputs):
    output.get_tensor().set_names({output_name})
patch_stateful(ov_model)
print("✅ Language model successfully converted")
fp_lang_model_path = language_model_path if quantization_config is None else language_model_path.parent / ("fp_" + language_model_path.name)
ov.save_model(ov_model, fp_lang_model_path)
del ov_model
cleanup_torchscript_cache()
gc.collect()


ov_model = core.read_model(fp_lang_model_path)
print(f"⌛ Weights compression with {quantization_config['mode']} mode started")
c_ov_model = nncf.compress_weights(ov_model, **quantization_config)
print("✅ Weights compression finished")
ov.save_model(c_ov_model, language_model_path)
del c_ov_model
del ov_model
gc.collect()

# delete fp_lang_model_path
fp_lang_model_path.unlink(missing_ok=True)
fp_lang_model_path.with_suffix(".bin").unlink(missing_ok=True)



✅ Language model successfully converted
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 1% (1 / 168)                │ 0% (0 / 167)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 99% (167 / 168)             │ 100% (167 / 167)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


✅ Weights compression finished
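
A rough back-of-the-envelope estimate shows why INT4 compression shrinks the language model so much. The sketch below is an approximation under stated assumptions: it assumes a 16-bit scale and a 4-bit zero point are stored per group, which may not match the exact on-disk layout, and the parameter count in the example is a hypothetical round number rather than a measured value.

```python
def compressed_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_point_bits=4):
    """Effective bits per weight including per-group scale and zero point overhead."""
    return bits + (scale_bits + zero_point_bits) / group_size


def size_in_gib(num_params, bits_per_weight):
    """Convert a parameter count and bit width to a size in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


hypothetical_params = 2_000_000_000  # illustrative only, not the real model size
fp16_size = size_in_gib(hypothetical_params, 16)
int4_size = size_in_gib(hypothetical_params, compressed_bits_per_weight())
print(f"fp16: {fp16_size:.2f} GiB, int4 (group 128): {int4_size:.2f} GiB")
```

The per-group overhead is small at a group size of 128, so the compressed weights come out close to a quarter of their 16-bit size.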


## 6. Model Merger Implementation

Implement the model merger to combine text and image components.

In [17]:
import torch
import torch.nn as nn

class ModelMerger(nn.Module):
    def __init__(self, image_token_id: int, patch_size: int) -> None:
        super().__init__()
        self.image_token_id = image_token_id
        self.patch_size = patch_size

    def forward(self, input_ids: torch.LongTensor, inputs_embeds: torch.Tensor, image_hidden_states: torch.Tensor):
        # input_ids: (B, L)
        # inputs_embeds: (B, L, D)
        # image_hidden_states: (B * num_blocks, patch_size, D)

        B, L = input_ids.shape
        _, _, D = inputs_embeds.shape

        image_mask = (input_ids == self.image_token_id)  # (B, L)

        # Compute indices for image_hidden_states
        image_token_counts = torch.cumsum(image_mask.to(torch.int32), dim=1) - 1
        image_token_counts = torch.where(image_mask, image_token_counts, torch.zeros_like(image_token_counts))

        block_index = image_token_counts // self.patch_size
        local_index = image_token_counts % self.patch_size

        flat_index = block_index * self.patch_size + local_index  # Index into flattened image_hidden_states

        # Flatten image_hidden_states to (B * num_blocks * patch_size, D)
        image_hidden_states_flat = image_hidden_states.view(-1, D)

        # Gather image embeddings for image tokens
        flat_index_expanded = flat_index.unsqueeze(-1).expand(-1, -1, D)
        image_token_embeddings = torch.where(
            image_mask.unsqueeze(-1),
            torch.gather(image_hidden_states_flat, 0, flat_index_expanded.view(-1, D)).view(B, L, D),
            torch.zeros_like(inputs_embeds)
        )

        # Merge
        merged_embeds = torch.where(image_mask.unsqueeze(-1), image_token_embeddings, inputs_embeds)

        return merged_embeds


In [19]:
model_merger = ModelMerger(image_token_id=model.image_token_id, patch_size=model.config.image_seq_len)

ov_model = ov.convert_model(
    model_merger,
    example_input={
        "input_ids": torch.ones([1, 1198], dtype=torch.int64),
        "inputs_embeds": torch.ones([1, 1198, 2048], dtype=torch.float32),
        "image_hidden_states": torch.ones([13, 81, 2048], dtype=torch.float32),
    },
)

ov.save_model(ov_model, model_merger_path)
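
The tensor operations in `ModelMerger` implement a simple rule: the k-th image token in the sequence is replaced by the k-th row of the flattened image hidden states, and all other positions keep their text embedding. A minimal single-sequence sketch of that rule in plain Python follows; `merge_embeddings` is a hypothetical helper and strings stand in for embedding vectors.

```python
def merge_embeddings(input_ids, text_embeds, image_embeds, image_token_id):
    """Replace embeddings at image-token positions with successive image embeddings."""
    merged = []
    next_image = 0  # index of the next unused image embedding
    for token, embed in zip(input_ids, text_embeds):
        if token == image_token_id:
            merged.append(image_embeds[next_image])
            next_image += 1
        else:
            merged.append(embed)
    return merged


# Token 9 plays the role of the image placeholder token in this toy example
ids = [5, 9, 9, 7]
text = ["t0", "t1", "t2", "t3"]
image = ["i0", "i1"]
print(merge_embeddings(ids, text, image, image_token_id=9))
```

During decoding, when a single non-image token is passed in, the merger returns the text embedding unchanged.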

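## 7. Testing OpenVINO Model

Compile the exported components and run a greedy generation loop on a sample image to check that the converted model produces sensible output.
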
In [20]:
core = ov.Core()
device = "CPU"

In [21]:
# paths for the exported models
image_embed_path = output_dir / "image_embed.xml"
image_encoder_path = output_dir / "image_encoder.xml"
image_connector_path = output_dir / "image_connector.xml"
language_model_path = output_dir / "language_model.xml"
model_merger_path = output_dir / "model_merger.xml"
text_embeddings_path = output_dir / "text_embeddings.xml"
lm_head_path = output_dir / "lm_head.xml"

In [22]:
# compile the models
language_model = core.read_model(language_model_path)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_embed = core.compile_model(image_embed_path, device)
image_encoder = core.compile_model(image_encoder_path, device)
image_connector = core.compile_model(image_connector_path, device)
model_merger = core.compile_model(model_merger_path, device)
text_embeddings = core.compile_model(text_embeddings_path, device)
lm_head = core.compile_model(lm_head_path, device)
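
The generation cell that follows derives `position_ids` from the attention mask with a cumulative sum, then overwrites padded positions with 1. The same rule for a single sequence, sketched in plain Python with a hypothetical helper name:

```python
def position_ids_from_mask(attention_mask):
    """Cumulative count of attended tokens minus one; padded positions are set to 1."""
    ids = []
    running = 0
    for m in attention_mask:
        running += m
        ids.append(running - 1 if m else 1)
    return ids


print(position_ids_from_mask([0, 0, 1, 1, 1]))  # left-padded sequence
print(position_ids_from_mask([1, 1, 1]))        # no padding
```

The value assigned to padded positions does not matter for the result, because those positions are excluded by the attention mask.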

In [23]:
from transformers.image_utils import load_image
from transformers import AutoConfig, AutoProcessor, TextStreamer

DEVICE = "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

processor = AutoProcessor.from_pretrained(model_id)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)


# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs_new = processor(text=prompt, images=[image1], return_tensors="pt")


request = compiled_language_model.create_infer_request()
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]
pixel_attention_mask = inputs_new["pixel_attention_mask"]

patch_size = config.vision_config.patch_size

generation_args = {"max_new_tokens": 200, "do_sample": False, "streamer": TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)}
generated_tokens = []

for i in range(generation_args["max_new_tokens"]):
    # Generate input embeds each time
    if current_input_ids.shape[-1] > 1:
        batch_size, num_images, num_channels, height, width = pixel_values.shape
        pixel_values = pixel_values.view(batch_size * num_images, *pixel_values.shape[2:])

        # Remove padding images - padding images are full 0.
        nb_values_per_image = pixel_values.shape[1:].numel()
        real_images_inds = (pixel_values == 0.0).sum(dim=(-1, -2, -3)) != nb_values_per_image

        if not any(real_images_inds):
            # no images, leave one empty image.
            real_images_inds[0] = True

        pixel_values = pixel_values[real_images_inds].contiguous()

        # Handle the vision attention mask
        if pixel_attention_mask is None:
            pixel_attention_mask = torch.ones(
                size=[pixel_values.shape[i] for i in (0, 2, 3)],
                dtype=torch.bool,
                device=pixel_values.device,
            )
        else:
            # Remove padding images from the mask
            pixel_attention_mask = pixel_attention_mask.view(
                batch_size * num_images, *pixel_attention_mask.shape[2:]
            )
            pixel_attention_mask = pixel_attention_mask[real_images_inds].contiguous()

        patches_subgrid = pixel_attention_mask.unfold(dimension=1, size=patch_size, step=patch_size)
        patches_subgrid = patches_subgrid.unfold(dimension=2, size=patch_size, step=patch_size)
        patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()

        hidden_states =  torch.from_numpy(
            image_embed({
                "pixel_values": pixel_values,
                "patch_attention_mask": patch_attention_mask,
            })[0]
        )

        patch_attention_mask = patch_attention_mask.view(batch_size, -1)

        image_hidden_states_before = torch.from_numpy(
            image_encoder({
                "inputs_embeds": hidden_states
            })[0]
        )

        image_hidden_states = torch.from_numpy(
            image_connector({
                "image_hidden_states": image_hidden_states_before,
            })[0]
        )

        text_input_embeds = torch.from_numpy(
            text_embeddings(current_input_ids)[0]
        )

        inputs_embeds = torch.from_numpy(
            model_merger({
                "input_ids": current_input_ids,
                "inputs_embeds": text_input_embeds,
                "image_hidden_states": image_hidden_states,
            })[0]
        )
    else:
        text_input_embeds = torch.from_numpy(
            text_embeddings(current_input_ids)[0]
        )
        inputs_embeds = torch.from_numpy(
            model_merger({
                "input_ids": current_input_ids,
                "inputs_embeds": text_input_embeds,
                "image_hidden_states": image_hidden_states,
            })[0]
        )
    
    inputs["inputs_embeds"] = inputs_embeds
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(inputs_embeds.shape[0], dtype=int)
    
    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()
    
    # Get the logits and find the next token
    last_hidden_state = torch.from_numpy(request.get_tensor("last_hidden_state").data)
    logits = torch.from_numpy(lm_head(last_hidden_state)[0])

    next_token = logits.argmax(-1)[0][-1]

    # Append the generated token
    generated_tokens.append(next_token)
    
    # Feed only the new token on the next step, shaped (1, 1)
    current_input_ids = next_token.reshape(1, 1)
    
    # update the attention mask
    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, :1])], dim=-1)

    # Recompute position ids from the extended mask and keep only the entries for the new token.
    # They are written into `inputs` at the top of the next iteration.
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids[:, -current_input_ids.shape[1]:]

    if next_token == processor.tokenizer.eos_token_id:
        break

generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

 The image features a prominent green statue of liberty in the foreground, standing on a small island in the middle of a body of water. The statue is holding a torch in its right hand. The water is calm and blue, with a few small waves visible. In the background, there is a large cityscape with numerous skyscrapers and buildings, including the Empire State Building and the One World Trade Center. The sky is clear and blue, with a hint of sunlight reflecting off the water. The cityscape is densely packed with high-rise buildings, with some of the buildings having distinctive architectural features such as domes or spires. The image is clear and well-lit, with a focus on the statue and the cityscape.
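
The position-id bookkeeping in the loop above can be easier to follow in isolation. The following is a minimal pure-Python sketch of the same arithmetic (`cumsum(-1) - 1`, then filling padded slots with 1, then keeping only the entries for the newly appended token). The function names `position_ids_from_mask` and `next_step_position_ids` are hypothetical and used only for illustration; the notebook itself performs these steps with torch tensors.

```python
from itertools import accumulate
from typing import List, Tuple


def position_ids_from_mask(attention_mask: List[int], pad_position: int = 1) -> List[int]:
    """Mirror `mask.cumsum(-1) - 1` followed by `masked_fill_(mask == 0, pad_position)`."""
    running_counts = accumulate(attention_mask)
    return [
        count - 1 if keep else pad_position
        for count, keep in zip(running_counts, attention_mask)
    ]


def next_step_position_ids(
    attention_mask: List[int], num_new_tokens: int = 1
) -> Tuple[List[int], List[int]]:
    """Extend the mask by one attended slot and return the position ids of the new tokens only."""
    extended_mask = attention_mask + [1]
    new_positions = position_ids_from_mask(extended_mask)[-num_new_tokens:]
    return extended_mask, new_positions


# A left-padded prompt with three real tokens (illustrative values, not taken from the notebook)
mask = [0, 0, 1, 1, 1]
print(position_ids_from_mask(mask))    # [1, 1, 0, 1, 2]

mask, new_positions = next_step_position_ids(mask)
print(new_positions)                   # [3]
```

Padded slots receive a dummy position because they are masked out of attention anyway; only the positions of real tokens matter to the language model.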


# 8. Import and Save in Spark NLP
- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


24/11/07 09:56:55 WARN Utils: Your hostname, minotaur resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface eno1)
24/11/07 09:56:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/11/07 09:56:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
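from sparknlp.annotator import SmolVLMTransformer

# Load the exported OpenVINO model directory into a Spark NLP annotator
imageClassifier = SmolVLMTransformer \
            .loadSavedModel(str(output_dir), spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")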

25/04/11 05:51:22 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native17302445033292358869/libtbb.so.2




In [5]:
imageClassifier.write().overwrite().save("file:///tmp/SmolVLM_spark_nlp_2")

                                                                                

In [8]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# Download two test images into the ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = SmolVLMTransformer.load("file:///tmp/SmolVLM_spark_nlp_2")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)

In [9]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"
)

for result in annotations_result:
    print(result["answer"])

image_path: /mnt/research/Projects/ModelZoo/SmolVLM/images/image1.jpg
[Annotation(document, 0, 224,  The image features a gray tabby cat lying in a cardboard box. The cat has its eyes closed, suggesting it is relaxed. Its fur is light gray with darker patches, and its paws are visible, including one with a pink toe. The cat, Map(), [])]
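
The prompt string appears twice above, once in the DataFrame column and once in the `fullAnnotateImage` call. As a small sketch, a helper can build it in one place so the two stay in sync. The name `build_smolvlm_prompt` is hypothetical and not part of Spark NLP; the template is copied from the cells above and assumes a single image per prompt.

```python
def build_smolvlm_prompt(question: str) -> str:
    """Wrap a question in the single-image chat template used in this notebook."""
    return f"<|im_start|>User:<image>{question}<end_of_utterance>\nAssistant:"


prompt = build_smolvlm_prompt("Can you describe the image?")
print(repr(prompt))
```

The resulting `prompt` could then be passed to both `lit(...)` and `fullAnnotateImage(...)` in place of the repeated literal.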


In [11]:
ZIP_NAME = "smolvlm_instruct_int4_sn"
!cd /tmp/SmolVLM_spark_nlp_2 && zip -r {ZIP_NAME}.zip .

  adding: model_merger.xml (deflated 90%)
  adding: .model_merger.xml.crc (stored 0%)
  adding: image_connector.xml (deflated 23%)
  adding: .image_encoder.xml.crc (deflated 0%)
  adding: .language_model.xml.crc (deflated 0%)
  adding: fields/ (stored 0%)
  adding: fields/merges/ (stored 0%)
  adding: fields/merges/.part-00017.crc (stored 0%)
  adding: fields/merges/part-00022 (deflated 77%)
  adding: fields/merges/part-00005 (deflated 77%)
  adding: fields/merges/.part-00054.crc (stored 0%)
  adding: fields/merges/.part-00037.crc (stored 0%)
  adding: fields/merges/part-00037 (deflated 77%)
  adding: fields/merges/part-00017 (deflated 78%)
  adding: fields/merges/part-00009 (deflated 78%)
  adding: fields/merges/part-00010 (deflated 78%)
  adding: fields/merges/.part-00039.crc (stored 0%)
  adding: fields/merges/.part-00012.crc (stored 0%)
  adding: fields/merges/part-00040 (deflated 77%)
  adding: fields/merges/.part-00005.crc (stored 0%)
  adding: fields/merges/.part-00019.crc (stor