![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_PaliGemmaForMultiModal.ipynb)

# Import OpenVINO PaliGemma models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and importing PaliGemma models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high performance CPU inference for models. So please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import PaliGemma models via `PaliGemmaForConditionalGeneration`. These models are usually under the `Image-Text-to-Text` category and have `PaliGemma` in their labels.
- Reference: [PaliGemma](https://huggingface.co/docs/transformers/model_doc/paligemma)
- Some [example models](https://huggingface.co/models?search=PaliGemma)

## Table of Contents

1. [Setup and Installation](#setup-and-installation)
2. [Model Configuration](#model-configuration)
3. [Model Loading and Preparation](#model-loading-and-preparation)
4. [Model Conversion to OpenVINO](#model-conversion-to-openvino)
5. [Model Quantization](#model-quantization)
6. [Model Merger Implementation](#model-merger-implementation)
7. [Testing OpenVINO Model](#7-testing-openvino-model)
8. [Import and Save in Spark NLP](#8-import-and-save-in-spark-nlp)


## 1. Setup and Installation

First, let's install all the required dependencies for this notebook.

In [1]:
# Install OpenVINO and NNCF for model optimization
%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"

# Install NLP and tokenization libraries
%pip install -q  "sentencepiece" "tokenizers>=0.12.1" "transformers>=4.46.0" "gradio>=4.36"

# Install OpenVINO nightly builds for latest features
%pip install -q -U --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly openvino-tokenizers openvino openvino-genai

# Install HuggingFace Hub and PyTorch
%pip install -q --upgrade huggingface_hub
%pip install -q --upgrade torch

### Environment Configuration

Configure the environment to disable tokenizer parallelism for better compatibility.

In [1]:
import os
# Disable tokenizer parallelism to avoid potential issues
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
from pathlib import Path
import types
from typing import Optional, List
import gc
import openvino as ov
from openvino.runtime import opset13
import nncf
import numpy as np
import torch
from transformers import AutoProcessor, AutoConfig, PaliGemmaForConditionalGeneration
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable
import torch.nn as nn



## 2. Model Configuration

Set up the model ID and quantization parameters for the conversion process.

In [3]:
model_id = "google/paligemma-3b-mix-224"

In [4]:
quantization_method = "int4"
output_dir = Path(f"models/{model_id}/{quantization_method}")
output_dir.mkdir(parents=True, exist_ok=True)
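Because `model_id` contains a slash, `output_dir` resolves to a nested directory rather than a single folder. The sketch below only illustrates the layout we expect the later cells to produce; the file names are taken from the path definitions in Section 4, and nothing is written to disk here.

```python
from pathlib import Path

model_id = "google/paligemma-3b-mix-224"
quantization_method = "int4"

# The "/" inside the model ID becomes a directory separator.
output_dir = Path(f"models/{model_id}/{quantization_method}")

# OpenVINO IR files saved by the conversion cells (ov.save_model writes an
# .xml file plus a .bin file with the same stem).
ir_names = ["image_encoder", "language_model", "model_merger", "text_embeddings"]
expected_files = [output_dir / f"{name}.xml" for name in ir_names]

print(output_dir.parts)
for path in expected_files:
    print(path.as_posix())
```

The processor and config are saved separately under `output_dir / "assets"` in the next section.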

## 3. Model Loading and Preparation

Load the model processor, configuration, and prepare the model for conversion to OpenVINO format.

In [5]:
# Load the processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, )

# load the config
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Change the model to use SDPA attention
# This is a workaround for the model to be compatible with OpenVINO
config.text_config._attn_implementation = "sdpa"


# export the processor and config to output_dir/assets
processor.save_pretrained(output_dir/"assets")
config.save_pretrained(output_dir/"assets")

In [6]:
# Load the model
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, trust_remote_code=True, config=config)
model.eval()
__make_16bit_traceable(model)



The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.20it/s]


## 4. Model Conversion to OpenVINO

Define paths for the converted model components and implement conversion utilities.

In [7]:
core = ov.Core()

image_encoder_path = output_dir / "image_encoder.xml"
language_model_path = output_dir / "language_model.xml"
model_merger_path = output_dir / "model_merger.xml"
text_embeddings_path = output_dir / "text_embeddings.xml"

In [8]:
def cleanup_torchscript_cache():
    """
    Helper function for removing cached model representation to prevent memory leaks
    during model conversion.
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

### Convert Text Embeddings Component

Convert the text embeddings component of the model to OpenVINO format.

In [9]:
# save text embeddings
with torch.no_grad():
    ov_model = ov.convert_model(
        model.language_model.model.embed_tokens,
        example_input=torch.ones([2, 2], dtype=torch.int64),
    )
    ov.save_model(ov_model, text_embeddings_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()



### Convert Vision Model Components

Convert the vision model components to OpenVINO format.

In [10]:
class VisionEncoder(nn.Module):
    def __init__(self, paligemma_model):
        super().__init__()
        self.config = paligemma_model.config
        self.vision_tower = paligemma_model.vision_tower
        self.multi_modal_projector = paligemma_model.multi_modal_projector

    def forward(self, pixel_values: torch.FloatTensor):
        """
        Obtains the last hidden states of the image from the vision tower and applies the multimodal projection.

        Args:
            pixel_values (`torch.FloatTensor` of shape `(batch_size, channels, height, width)`)
                The tensors corresponding to the input images.
        Returns:
            image_features (`torch.Tensor`): Image feature tensor of shape `(num_images, image_length, embed_dim)`.
        """
        image_outputs = self.vision_tower(pixel_values)
        selected_image_feature = image_outputs.last_hidden_state
        image_features = self.multi_modal_projector(selected_image_feature)
        # Scale the projected features by sqrt(hidden_size) of the text model
        image_features = image_features / (self.config.text_config.hidden_size**0.5)
        return image_features

In [11]:
vision_model = VisionEncoder(model)
with torch.no_grad():
    ov_model = ov.convert_model(
        vision_model,
        example_input=torch.ones([2, 3, config.vision_config.image_size, config.vision_config.image_size], dtype=torch.float32),
    )
    ov.save_model(ov_model, image_encoder_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
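To make the shapes of the image encoder concrete, here is a small arithmetic sketch. The patch size (14) and text hidden size (2048) are assumptions about this checkpoint, not values printed by the notebook; please read the real ones from `config.vision_config.patch_size` and `config.text_config.hidden_size`.

```python
# Assumed values for google/paligemma-3b-mix-224; read the real ones from `config`.
image_size = 224      # config.vision_config.image_size
patch_size = 14       # config.vision_config.patch_size (assumption)
hidden_size = 2048    # config.text_config.hidden_size (assumption)

# The vision tower emits one embedding per non-overlapping image patch.
num_image_tokens = (image_size // patch_size) ** 2

# VisionEncoder.forward divides the projected features by sqrt(hidden_size).
scale = hidden_size ** 0.5

print(num_image_tokens)
print(round(scale, 2))
```

Under these assumptions each image contributes 256 embeddings, and every projected feature is divided by roughly 45.25 before it is merged with the text embeddings.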


## 5. Model Quantization

Implement functions for model state management and quantization.

In [12]:
def model_has_state(ov_model: ov.Model):
    return len(ov_model.get_sinks()) > 0


def model_has_input_output_name(ov_model: ov.Model, name: str):
    """
    Helper function for checking that model has specified input or output name

    Parameters:
      ov_model (ov.Model):
      name (str):
          name of input or output

    Returns:
      True if input or output with requested name exists else False
    """
    return name in sum([list(t.get_names()) for t in ov_model.inputs + ov_model.outputs], [])


def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    gather_dim: int,
):
    """
    Fuses reorder_cache during generate cycle into ov.Model. Used with stateful models, because we cannot modify model state directly.

    Adds a new beam_idx parameter and Gather op per each kv-cache input in a given model.
    Should be run before make_stateful. Implements optimum's _reorder_cache
    inside the model in the beginning of each iteration.
    Gather works along given gather_dim dimension that may vary from model to model.
    KV-cache inputs are identified based on names in key_value_input_names.
    Append the new beam_idx parameter to not_kv_inputs.

    Parameters:
      ov_model (`ov.Model`):
          openvino model for processing
      not_kv_inputs (`List[str]`):
          list of input nodes in model that are not related to past key values
      key_value_input_names (`List[str]`):
          list of names for key value input layers
      gather_dim (int):
          dimension for gathering cache during reorder pass
    """

    if model_has_input_output_name(ov_model, "beam_idx"):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input("inputs_embeds").get_partial_shape()[0]
    beam_idx = opset13.parameter(name="beam_idx", dtype=ov.Type.i32, shape=ov.PartialShape([input_batch]))
    beam_idx.output(0).get_tensor().add_names({"beam_idx"})  # why list is not accepted?
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])
    # Go over all cache parameters and fuse _reorder_cache with indices provided by the new parameter beam_idx
    for input_name in key_value_input_names:
        parameter_output_port = ov_model.input(input_name)
        consumers = parameter_output_port.get_target_inputs()
        gather = opset13.gather(parameter_output_port, beam_idx, opset13.constant(gather_dim))
        for consumer in consumers:
            consumer.replace_source_output(gather.output(0))
    ov_model.validate_nodes_and_infer_types()


def build_state_initializer(ov_model: ov.Model, batch_dim: int):
    """
    Build initialization ShapeOf Expression for all ReadValue ops

    Parameters:
      ov_model (ov.Model):
          openvino model
      batch_dim (int):
          index of dimension corresponding to batch size
    """
    input_ids = ov_model.input("inputs_embeds")
    batch = opset13.gather(
        opset13.shape_of(input_ids, output_type="i64"),
        opset13.constant([0]),
        opset13.constant(0),
    )
    for op in ov_model.get_ops():
        if op.get_type_name() == "ReadValue":
            dims = [dim.min_length for dim in list(op.get_output_partial_shape(0))]
            dims[batch_dim] = batch
            dims = [(opset13.constant(np.array([dim], dtype=np.int64)) if isinstance(dim, int) else dim) for dim in dims]
            shape = opset13.concat(dims, axis=0)
            broadcast = opset13.broadcast(opset13.constant(0.0, dtype=op.get_output_element_type(0)), shape)
            op.set_arguments([broadcast])
    ov_model.validate_nodes_and_infer_types()


def make_stateful(
    ov_model: ov.Model,
    not_kv_inputs: List[str],
    key_value_input_names: List[str],
    key_value_output_names: List[str],
    batch_dim: int,
    num_attention_heads: int,
    num_beams_and_batch: Optional[int] = None,
):
    """
    Hides kv-cache inputs and outputs inside the model as variables.

    Parameters:
        ov_model (ov.Model):
            openvino model
        not_kv_inputs (`List[str]`):
            list of input nodes in model that are not related to past key values
        key_value_input_names (`List[str]`):
            list of names for key value input layers
        key_value_output_names (`List[str]`):
            list of names for key value output layers
        batch_dim (int):
            index of batch dimension in key value layers
        num_attention_heads (int):
            number of attention heads for batch dimension initialization
        num_beams_and_batch (int):
            precalculated number of beams and batch for shapes initialization
    """
    from openvino._offline_transformations import apply_make_stateful_transformation

    input_output_map = {}

    if num_beams_and_batch is not None:
        # Set batch size for input_ids and attention mask to avoid dynamic dimension got propagated from the end of the model back to ReadValue
        for input in not_kv_inputs:
            shape = input.get_partial_shape()
            if shape.rank.get_length() <= 2:  # == 1 for beam_index
                shape[0] = num_beams_and_batch
                input.get_node().set_partial_shape(shape)
    for kv_name_pair in zip(key_value_input_names, key_value_output_names):
        input_output_map[kv_name_pair[0]] = kv_name_pair[1]
        if num_beams_and_batch is not None:
            input = ov_model.input(kv_name_pair[0])
            shape = input.get_partial_shape()
            shape[batch_dim] = num_beams_and_batch * num_attention_heads
            input.get_node().set_partial_shape(shape)

    if num_beams_and_batch is not None:
        # Re-validation model if shapes are altered above
        ov_model.validate_nodes_and_infer_types()

    apply_make_stateful_transformation(ov_model, input_output_map)
    if num_beams_and_batch is None:
        build_state_initializer(ov_model, batch_dim)


def patch_stateful(ov_model):
    key_value_input_names = [key.get_any_name() for key in ov_model.inputs[2:-1]]
    key_value_output_names = [key.get_any_name() for key in ov_model.outputs[1:]]
    not_kv_inputs = [input for input in ov_model.inputs if not any(name in key_value_input_names for name in input.get_names())]
    if not key_value_input_names or not key_value_output_names:
        return
    batch_dim = 0
    num_attention_heads = 1

    fuse_cache_reorder(ov_model, not_kv_inputs, key_value_input_names, batch_dim)
    make_stateful(
        ov_model,
        not_kv_inputs,
        key_value_input_names,
        key_value_output_names,
        batch_dim,
        num_attention_heads,
        None,
    )
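The `Gather` op that `fuse_cache_reorder` inserts can be pictured with plain NumPy. This is only an illustration of the indexing semantics on a toy tensor (the shapes are made up), not the OpenVINO graph itself.

```python
import numpy as np

# Toy KV-cache tensor: (batch_or_beams, heads, seq_len, head_dim) = (3, 1, 2, 2)
cache = np.arange(3 * 1 * 2 * 2, dtype=np.float32).reshape(3, 1, 2, 2)

# beam_idx says which previous beam each new beam continues from.
beam_idx = np.array([2, 0, 0], dtype=np.int32)

# Same indexing as Gather(cache, beam_idx, axis=gather_dim) with gather_dim = 0.
reordered = np.take(cache, beam_idx, axis=0)

# Greedy decoding keeps every sequence in place, i.e. the identity permutation.
identity = np.take(cache, np.arange(cache.shape[0]), axis=0)

print(reordered[:, 0, 0, 0])
print(np.array_equal(identity, cache))
```

This is why the test loop later in the notebook passes `np.arange(batch_size)` as `beam_idx`: with greedy search the cache must not be reordered.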

### Model Quantization Process

The quantization process reduces the precision of the model weights from floating-point to integer format, which can significantly reduce model size and improve inference speed. We're using INT4 quantization in this notebook.
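To give a feel for what group-wise asymmetric INT4 compression does, here is a simplified NumPy sketch. It is not NNCF's implementation: the function names are hypothetical, and for simplicity the per-group offset is stored as a float minimum instead of an integer zero point.

```python
import numpy as np

def quantize_int4_asym(weights: np.ndarray, group_size: int):
    """Quantize a 1-D float array to 4-bit codes with one scale and offset per group."""
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    # 4 bits give 16 levels (codes 0..15); guard against constant groups.
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0
    codes = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes: np.ndarray, scale: np.ndarray, w_min: np.ndarray) -> np.ndarray:
    """Map the 4-bit codes back to approximate float weights."""
    return codes.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)

codes, scale, w_min = quantize_int4_asym(w, group_size=128)
w_hat = dequantize(codes, scale, w_min).reshape(-1)
max_err = float(np.abs(w - w_hat).max())

print(codes.shape, int(codes.max()), max_err)
```

The reconstruction error of each weight is bounded by about half of its group's scale, which is why a smaller `group_size` usually trades a larger model for better accuracy.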

In [14]:
from transformers.cache_utils import DynamicCache

quantization_config = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}
lang_model = model.language_model
def forward_wrap(
        self,
        attention_mask,
        position_ids=None,
        past_key_values=None,
        inputs_embeds=None,
    ):
        if past_key_values is not None:
            new_past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        else:
            new_past_key_values = None
        result = self._orig_forward(
            input_ids=None,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=new_past_key_values,
            inputs_embeds=inputs_embeds,
        )
        if past_key_values is not None:
            result["past_key_values"] = result["past_key_values"].to_legacy_cache()
        return tuple(result.values())

lang_model._orig_forward = lang_model.forward
lang_model.forward = types.MethodType(forward_wrap, lang_model)
hidden_size = lang_model.config.hidden_size
llm_input = torch.zeros([2, 2, hidden_size])
pkv = lang_model._orig_forward(
        inputs_embeds=llm_input,
        attention_mask=torch.ones((2, 2), dtype=torch.int64),
        past_key_values= DynamicCache()
    )[1]
try:
    pkv = pkv.to_legacy_cache()
except Exception as e:
    print(f"Error: {e}")
    legacy_cache = ()
    for layer_idx in range(lang_model.config.num_hidden_layers):
        # Append one (key, value) pair per layer; the trailing comma keeps the pair nested
        legacy_cache += ((pkv.key_cache[layer_idx], pkv.value_cache[layer_idx]),)
    pkv = legacy_cache
model_inputs = ["attention_mask", "position_ids"]
model_outputs = ["logits"]
for idx in range(len(pkv)):
    model_inputs.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
    model_outputs.extend([f"present.{idx}.key", f"present.{idx}.value"])
model_inputs.append("inputs_embeds")
position_ids = torch.tensor([[2, 3], [2, 3]])
with torch.no_grad():
    ov_model = ov.convert_model(
        lang_model,
        example_input={
            "inputs_embeds": llm_input,
            "attention_mask": torch.ones([2, 4], dtype=torch.int64),
            "past_key_values": pkv,
            "position_ids": position_ids,
        },
    )
for input, input_name in zip(ov_model.inputs, model_inputs):
    input.get_tensor().set_names({input_name})

for output, output_name in zip(ov_model.outputs, model_outputs):
    output.get_tensor().set_names({output_name})
patch_stateful(ov_model)
print("✅ Language model successfully converted")
fp_lang_model_path = language_model_path if quantization_config is None else language_model_path.parent / ("fp_" + language_model_path.name)
ov.save_model(ov_model, fp_lang_model_path)
del ov_model
cleanup_torchscript_cache()
gc.collect()


ov_model = core.read_model(fp_lang_model_path)
print(f"⌛ Weights compression with {quantization_config['mode']} mode started")
c_ov_model = nncf.compress_weights(ov_model, **quantization_config)
print("✅ Weights compression finished")
ov.save_model(c_ov_model, language_model_path)
del c_ov_model
del ov_model
gc.collect()

# delete fp_lang_model_path
fp_lang_model_path.unlink(missing_ok=True)
fp_lang_model_path.with_suffix(".bin").unlink(missing_ok=True)



✅ Language model successfully converted
⌛ Weights compression with int4_asym mode started
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int8_asym                 │ 21% (1 / 127)               │ 0% (0 / 126)                           │
├───────────────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ int4_asym                 │ 79% (126 / 127)             │ 100% (126 / 126)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


✅ Weights compression finished
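The table reported by NNCF can be turned into a rough estimate of the average storage cost per weight. This back-of-the-envelope sketch uses only the two percentages from the table and ignores the per-group scales and zero points, so the real files will be somewhat larger.

```python
# Parameter shares reported by NNCF in the table above.
share_int8 = 0.21   # 1 of 127 layers kept in int8_asym
share_int4 = 0.79   # 126 of 127 layers compressed to int4_asym

# Average bits stored per weight, ignoring per-group scales and zero points.
avg_bits = share_int8 * 8 + share_int4 * 4

# Rough size reduction relative to 32-bit and 16-bit floating-point weights.
ratio_vs_fp32 = 32 / avg_bits
ratio_vs_fp16 = 16 / avg_bits

print(f"average bits per weight: {avg_bits:.2f}")
print(f"vs FP32: {ratio_vs_fp32:.2f}x, vs FP16: {ratio_vs_fp16:.2f}x")
```

So even with `ratio` set to 1.0, the single layer that stays in INT8 pulls the average up to about 4.84 bits per weight.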


## 6. Model Merger Implementation

Implement the model merger to combine text and image components.

In [15]:
import numpy as np
import torch

class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self, image_token_index=257152):
        super().__init__()
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        # Mark every embedding slot whose token ID is the image placeholder
        special_image_mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        # Overwrite the placeholder slots with the projected image features
        final_embedding = inputs_embeds.masked_scatter(special_image_mask, vision_embeds)

        return {
            "final_embedding": final_embedding
        }

In [16]:
multimodal_merger = MergeMultiModalInputs(config.image_token_index)
with torch.no_grad():
    ov_model = ov.convert_model(
        multimodal_merger,
        example_input= {
            "input_ids": torch.ones([2, 1198], dtype=torch.int64),
            "inputs_embeds": torch.ones([2, 1198, config.hidden_size], dtype=torch.float32),
            "vision_embeds": torch.ones([2, config.vision_config.num_image_tokens, config.hidden_size], dtype=torch.float32),
        }
    )
    ov.save_model(ov_model, model_merger_path)
    del ov_model
    cleanup_torchscript_cache()
    gc.collect()
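The merger boils down to one `masked_scatter` call. The NumPy sketch below mimics it on a toy sequence so that the effect is easy to see; the tiny shapes and token IDs 5 and 6 are made up for illustration.

```python
import numpy as np

IMAGE_TOKEN = 257152  # default image_token_index of MergeMultiModalInputs

# One sequence: two image placeholders followed by two text tokens.
input_ids = np.array([[IMAGE_TOKEN, IMAGE_TOKEN, 5, 6]])
inputs_embeds = np.zeros((1, 4, 3), dtype=np.float32)                  # text embeddings
vision_embeds = np.arange(1, 7, dtype=np.float32).reshape(1, 2, 3)     # image features

# Boolean mask over every embedding element that belongs to an image token.
mask = np.broadcast_to((input_ids == IMAGE_TOKEN)[..., None], inputs_embeds.shape)

# Like masked_scatter: fill the True positions, in order, with the source values.
final_embedding = inputs_embeds.copy()
final_embedding[mask] = vision_embeds.reshape(-1)

print(final_embedding[0])
```

The number of placeholder tokens in `input_ids` has to match the number of image embeddings, otherwise the scatter has too few or too many source values.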

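## 7. Testing OpenVINO Model

Load the exported OpenVINO models and run a greedy generation loop on a sample image to check the conversion.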




In [17]:
core = ov.Core()
device = "CPU"

In [18]:
# paths for the exported models
image_encoder_path = output_dir / "image_encoder.xml"
language_model_path = output_dir / "language_model.xml"
model_merger_path = output_dir / "model_merger.xml"
text_embeddings_path = output_dir / "text_embeddings.xml"

In [19]:
# compile the models
language_model = core.read_model(language_model_path)
compiled_language_model = core.compile_model(language_model, "AUTO")

image_encoder = core.compile_model(image_encoder_path, device)
model_merger = core.compile_model(model_merger_path, device)
text_embeddings = core.compile_model(text_embeddings_path, device)

In [20]:
from transformers.image_utils import load_image
from transformers import AutoProcessor, TextStreamer

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

processor = AutoProcessor.from_pretrained(model_id)

# Instruct the model to create a caption in English
prompt = "caption en"
inputs_new = processor(text=prompt, images=image1, return_tensors="pt")

request = compiled_language_model.create_infer_request()
merge_model_request = model_merger.create_infer_request()
# Set the input names
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]


generation_args = {"max_new_tokens": 200, "do_sample": False, "streamer": TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)}
generated_tokens = []

for i in range(generation_args["max_new_tokens"]):
    # Generate input embeds each time
    text_embeds = torch.from_numpy(
            text_embeddings(current_input_ids
            )[0]
        )
    if current_input_ids.shape[-1] > 1:
        vision_embeds = torch.from_numpy(
            image_encoder(
                {
                    "pixel_values": pixel_values,
                }
            )[0]
        )
        merge_model_request.start_async({
            "vision_embeds": vision_embeds,
            "inputs_embeds": text_embeds,
            "input_ids": current_input_ids,
        }, share_inputs=True)
        merge_model_request.wait()
        final_embedding = torch.from_numpy(merge_model_request.get_tensor("final_embedding").data)
    else:
        final_embedding = text_embeds

    
    if i>0:
        inputs = {}
    # Prepare inputs for the model
    inputs["inputs_embeds"] = final_embedding
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(attention_mask.shape[0], dtype=int)
    
    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()
    
    # Get the logits and find the next token
    logits = torch.from_numpy(request.get_tensor("logits").data)
    next_token = logits.argmax(-1)[0][-1]
    
    # Append the generated token
    generated_tokens.append(next_token)
    
    # Update input_ids with the new token
    # Feed only the new token on the next step (the stateful model keeps the KV cache)
    current_input_ids = next_token.reshape(1, 1)
    
    # update the attention mask
    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, :1])], dim=-1)

    # Update inputs for the next iteration
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids[:, -current_input_ids.shape[1] :]
    inputs["position_ids"] = position_ids

generated_text = processor.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.


In this image we can see the statue of liberty on the water. In the background we can see the buildings and the sky.
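The position ID bookkeeping in the loop above is easy to get wrong, so here is the same computation in NumPy on a tiny mask. The left-padded mask is a hypothetical example (the notebook's single prompt has no padding), and `position_ids_from_mask` is a helper name introduced only for this sketch.

```python
import numpy as np

# 1 = real token, 0 = padding (left-padded example, for illustration only).
attention_mask = np.array([[0, 1, 1, 1]])

def position_ids_from_mask(mask: np.ndarray) -> np.ndarray:
    pos = mask.cumsum(-1) - 1   # running index of the real tokens
    pos[mask == 0] = 1          # dummy value for padded slots, as in the loop above
    return pos

# Prefill step: one position per prompt token.
prefill_positions = position_ids_from_mask(attention_mask)

# Decode step: the mask grows by one column and only the position of the
# newest token is passed to the stateful language model.
attention_mask = np.concatenate(
    [attention_mask, np.ones((1, 1), dtype=attention_mask.dtype)], axis=-1
)
decode_positions = position_ids_from_mask(attention_mask)[:, -1:]

print(prefill_positions.tolist())
print(decode_positions.tolist())
```

The attention mask always covers the full history, while `inputs_embeds` and `position_ids` cover only the tokens fed in the current step.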


## 8. Import and Save in Spark NLP
- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()






In [23]:
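from sparknlp.annotator import PaliGemmaForMultiModal

# Load the exported OpenVINO model directory into a Spark NLP annotator
imageClassifier = PaliGemmaForMultiModal \
            .loadSavedModel(str(output_dir),spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")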

25/04/14 10:47:16 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native1828305829993043212/libtbb.so.2




In [24]:
imageClassifier.write().overwrite().save(f"file:///tmp/{model_id}_spark_nlp")

                                                                                

In [25]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<image><bos>caption en\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = PaliGemmaForMultiModal.load(f"file:///tmp/{model_id}_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)

                                                                                

In [26]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<image><bos>caption en\n"
)

for result in annotations_result:
    print(result["answer"])

image_path: /mnt/research/Projects/ModelZoo/PaliGemma/images/image1.jpg
[Annotation(document, 0, 34, A cat is laying in a cardboard box., Map(), [])]


In [28]:
ZIP_NAME = f"{model_id.split('/')[-1].replace(' ','_').lower()}_int4_sn"
!cd /tmp/{model_id}_spark_nlp && zip -r {ZIP_NAME}.zip .

  adding: model_merger.xml (deflated 86%)
  adding: .model_merger.xml.crc (stored 0%)
  adding: .image_encoder.xml.crc (deflated 0%)
  adding: .language_model.xml.crc (deflated 0%)
  adding: fields/ (stored 0%)
  adding: fields/merges/ (stored 0%)
  adding: fields/merges/.part-00017.crc (stored 0%)
  adding: fields/merges/part-00022 (deflated 76%)
  adding: fields/merges/part-00005 (deflated 76%)
  adding: fields/merges/.part-00054.crc (stored 0%)
  adding: fields/merges/.part-00037.crc (stored 0%)
  adding: fields/merges/part-00037 (deflated 76%)
  adding: fields/merges/part-00017 (deflated 76%)
  adding: fields/merges/part-00009 (deflated 76%)
  adding: fields/merges/part-00010 (deflated 76%)
  adding: fields/merges/.part-00039.crc (stored 0%)
  adding: fields/merges/.part-00012.crc (stored 0%)
  adding: fields/merges/part-00040 (deflated 76%)
  adding: fields/merges/.part-00005.crc (stored 0%)
  adding: fields/merges/.part-00019.crc (stored 0%)
  adding: fields/merges/part-00021 (de