![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_MLLama.ipynb)

# Import OpenVINO MLLama models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing MLLama models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). It focuses on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import MLLama models via `MLLamaForMultimodal`. These models are usually listed under the `Image-Text-to-Text` category and have `MLLama` in their labels.
- Reference: [MLLama](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md)
- Some [example models](https://huggingface.co/models?search=MLLama)
- The OpenVINO export is taken from [OpenVINO Notebooks](https://github.com/openvinotoolkit/openvino_notebooks/tree/b4a0791/notebooks/mllama-3.2)
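
The version requirement in the first bullet can be checked programmatically. The sketch below is a minimal, standard-library-only illustration: `parse_version` and `meets_minimum` are hypothetical helper names (not part of Spark NLP), and the parser assumes plain dotted numeric versions such as `5.4.0`.

```python
def parse_version(version: str) -> tuple:
    """Turn '5.4.0' into (5, 4, 0); a non-numeric suffix in a component is ignored."""
    parts = []
    for component in version.split("."):
        digits = ""
        for ch in component:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.4.0") -> bool:
    """Return True if `installed` is at least `minimum` (hypothetical helper)."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '5.4' compares equal to '5.4.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Once Spark NLP is installed, the installed version string is available
# from sparknlp.version(), e.g. meets_minimum(sparknlp.version()).
print(meets_minimum("5.5.3"), meets_minimum("5.3.9"))
```

For anything beyond simple dotted versions, a dedicated version-parsing library is a better choice than this hand-rolled comparison.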

## 1. Export and Save the HuggingFace model

- Let's install `transformers`, `openvino`, and the other dependencies. Spark NLP itself does not need `openvino` to be installed; we only need it here to load and save models from HuggingFace.
- We require `transformers>=4.45` (see the install cell below). Later releases may also work, but they have not necessarily been tested.

In [1]:
%pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.14.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -Uq "openvino>=2024.5.0"
%pip install -q --upgrade ipywidgets

import requests
from pathlib import Path

# Helper scripts from the OpenVINO notebooks repository, pinned to commit b4a0791.
base_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/b4a0791"
helper_files = {
    "ov_mllama_helper.py": f"{base_url}/notebooks/mllama-3.2/ov_mllama_helper.py",
    "gradio_helper.py": f"{base_url}/notebooks/mllama-3.2/gradio_helper.py",
    "ov_mllama_compression.py": f"{base_url}/notebooks/mllama-3.2/ov_mllama_compression.py",
    "data_preprocessing.py": f"{base_url}/notebooks/mllama-3.2/data_preprocessing.py",
    "notebook_utils.py": f"{base_url}/utils/notebook_utils.py",
}

# Download each helper only if it is not already present in the working directory.
for filename, url in helper_files.items():
    if not Path(filename).exists():
        r = requests.get(url=url)
        r.raise_for_status()
        Path(filename).write_text(r.text)

Note: you may need to restart the kernel to use updated packages.


### 1.1 Convert the model to OpenVINO

In [1]:
from pathlib import Path
from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OV"


In [2]:
from notebook_utils import device_widget

device = device_widget("CPU", exclude=["NPU"])

device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

In [3]:
model_dir

PosixPath('Llama-3.2-11B-Vision-Instruct/OV')

In [None]:
convert_mllama(model_id, model_dir)

In [None]:
from ov_mllama_compression import compress
from ov_mllama_compression import compression_widgets_helper

compression_scenario, compress_args = compression_widgets_helper()

compression_scenario

In [None]:
compression_kwargs = {key: value.value for key, value in compress_args.items()}

language_model_path = compress(model_dir, **compression_kwargs)

In [None]:
from ov_mllama_compression import vision_encoder_selection_widget

vision_encoder_options = vision_encoder_selection_widget(device.value)

vision_encoder_options

In [None]:
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc

from data_preprocessing import prepare_dataset_vision

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")


if vision_encoder_options.value == "INT8 quantization":
    if not int8_vision_encoder_path.exists():
        calibration_data = prepare_dataset_vision(processor, 100)
        ov_model = core.read_model(fp_vision_encoder_path)
        calibration_dataset = nncf.Dataset(calibration_data)
        quantized_model = nncf.quantize(
            model=ov_model,
            calibration_dataset=calibration_dataset,
            model_type=nncf.ModelType.TRANSFORMER,
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
        )
        ov.save_model(quantized_model, int8_vision_encoder_path)
        del quantized_model
        del ov_model
        del calibration_dataset
        del calibration_data
        gc.collect()

    vision_encoder_path = int8_vision_encoder_path
elif vision_encoder_options.value == "INT8 weights compression":
    if not int8_wc_vision_encoder_path.exists():
        ov_model = core.read_model(fp_vision_encoder_path)
        compressed_model = nncf.compress_weights(ov_model)
        ov.save_model(compressed_model, int8_wc_vision_encoder_path)
    vision_encoder_path = int8_wc_vision_encoder_path
else:
    vision_encoder_path = fp_vision_encoder_path
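
The file naming used in the cell above can be summarised as a small lookup. This is a minimal sketch, assuming the same option strings and file names as the cell; `select_vision_encoder_path` is a hypothetical helper name and it only computes the path, it does not run any quantization.

```python
from pathlib import Path


def select_vision_encoder_path(model_dir: Path, option: str) -> Path:
    """Return the vision encoder file that corresponds to a compression option."""
    fp_path = model_dir / "openvino_vision_encoder.xml"
    # Suffixes mirror the ones used in the quantization cell.
    suffix_by_option = {
        "INT8 quantization": "_int8.xml",
        "INT8 weights compression": "_int8_wc.xml",
    }
    suffix = suffix_by_option.get(option)
    if suffix is None:
        # Any other option falls back to the full-precision encoder.
        return fp_path
    return model_dir / fp_path.name.replace(".xml", suffix)


print(select_vision_encoder_path(Path("Llama-3.2-11B-Vision-Instruct/OV"), "INT8 quantization"))
```

Note that the later cells in this notebook load `openvino_vision_encoder.xml` through `IMAGE_ENCODER_NAME`, so a quantized encoder is only picked up if that constant is pointed at the selected file.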

In [None]:
from transformers import AutoProcessor, AutoConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

import requests
from PIL import Image


question = "What is unusual on this image?"

messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
]
text = processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=text, images=[raw_image], return_tensors="pt")

pixel_values = inputs["pixel_values"]
aspect_ratio_ids = inputs["aspect_ratio_ids"]
aspect_ratio_mask = inputs["aspect_ratio_mask"]

image_inputs = {
    "pixel_values": pixel_values,
    "aspect_ratio_ids": aspect_ratio_ids,
    "aspect_ratio_mask": aspect_ratio_mask,
}


In [None]:
import openvino as ov

core = ov.Core()

IMAGE_ENCODER_NAME = "openvino_vision_encoder.xml"

# model_dir is the directory that holds the converted OpenVINO models
image_encoder = core.compile_model(model_dir / IMAGE_ENCODER_NAME, "CPU")
cross_attn_outputs = [key.get_any_name() for key in image_encoder.outputs if "cross_attn_key_values" in key.get_any_name()]


image_request = image_encoder.create_infer_request()
image_request.start_async([pixel_values, aspect_ratio_ids, aspect_ratio_mask], share_inputs=True)
image_request.wait()
cross_attn_key_values = [image_request.get_tensor(name) for name in cross_attn_outputs]

In [None]:
import numpy as np
import torch

class PreprocessingMasks(torch.nn.Module):
    def __init__(self,):
        super().__init__()

    def forward(
        self,
        cross_attention_mask,
        attention_mask,
        current_input_ids,
        num_vision_tokens,
        past_cross_attn_kv_length
    ):
        dtype=torch.float32
        batch_size, text_total_length, *_ = cross_attention_mask.shape
        cross_attention_mask = cross_attention_mask.repeat_interleave(num_vision_tokens, dim=3)
        cross_attention_mask = cross_attention_mask.view(batch_size, text_total_length, -1)
        cross_attention_mask = cross_attention_mask.unsqueeze(1)

        inverted_cross_attn_mask = (1.0 - cross_attention_mask).to(dtype)
        cross_attention_mask = inverted_cross_attn_mask.masked_fill(inverted_cross_attn_mask.to(torch.bool), torch.finfo(dtype).min)

        # apply full-row bias, which returns a 4D tensor of shape [B, H, S1, 1] whose value is 0 if a full row in the
        # cross-attn mask's last dimension contains only negative infinity values, and 1 otherwise
        negative_inf_value = torch.finfo(dtype).min
        full_text_row_masked_out_mask = (cross_attention_mask != negative_inf_value).any(dim=-1).type_as(cross_attention_mask)[..., None]
        cross_attention_mask *= full_text_row_masked_out_mask

        # if first_pass > 0:
        # past_cross_attn_kv_length = cross_attn_key_values[0].shape[-2]
        past_cross_attn_mask = torch.zeros((*cross_attention_mask.shape[:-1], past_cross_attn_kv_length), dtype=dtype)
        # concatenate both on image-seq-length dimension
        cross_attention_mask_second_pass = torch.cat([past_cross_attn_mask, cross_attention_mask], dim=-1)
        cache_position = (attention_mask.long().cumsum(-1) - 1)[:, -current_input_ids.shape[1] :][0]

        cross_attention_mask_second_pass = cross_attention_mask_second_pass[:, :, cache_position]

        cross_attention_mask = cross_attention_mask[:, :, cache_position]
        full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]

        return {
            "cache_position": cache_position.to(torch.int32),
            "cross_attention_mask_first_pass": cross_attention_mask.to(dtype),
            "cross_attention_mask_second_pass": cross_attention_mask_second_pass.to(dtype),
            "full_text_row_masked_out_mask": full_text_row_masked_out_mask.to(dtype),
        }
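
The `cache_position` line in `forward` is compact, so here is the same arithmetic on plain Python lists. This is a minimal sketch for a single sequence (batch index 0), not the module itself; the function name and the toy masks are illustrative assumptions.

```python
from itertools import accumulate


def cache_position(attention_mask_row, num_current_tokens):
    """Mirror (attention_mask.cumsum(-1) - 1)[:, -n:][0] for one sequence."""
    # The running sum of the mask counts the attended tokens so far;
    # subtracting 1 turns that count into a zero-based position.
    positions = [total - 1 for total in accumulate(attention_mask_row)]
    # Keep only the positions of the tokens fed in the current step.
    return positions[-num_current_tokens:]


# First pass: the whole 5-token prompt is fed at once.
print(cache_position([1, 1, 1, 1, 1], 5))
# Later pass: one new token is appended to a 6-token history.
print(cache_position([1, 1, 1, 1, 1, 1], 1))
```

These positions are then used to index the text dimension of the cross-attention masks, so each step only keeps the mask rows for the tokens being processed.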

In [None]:
preprocessing_masks = PreprocessingMasks()
cross_attention_mask = inputs["cross_attention_mask"]
attention_mask = inputs["attention_mask"]
current_input_ids = inputs["input_ids"]
first_pass = torch.tensor(1)
num_vision_tokens = torch.tensor((config.vision_config.image_size // config.vision_config.patch_size) ** 2 + 1)
past_cross_attn_kv_length = torch.tensor(cross_attn_key_values[0].shape[-2])
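
As a worked example of the `num_vision_tokens` formula above: the image is split into square patches, and one extra token is added per tile. The values 560 and 14 below are assumed example values for `image_size` and `patch_size`; read the real ones from `config.vision_config` for your model.

```python
def vision_tokens_per_tile(image_size: int, patch_size: int) -> int:
    """Compute (image_size // patch_size) ** 2 + 1, as in the cell above."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


# Assumed example values: 560 // 14 = 40 patches per side, 40 * 40 + 1 = 1601.
print(vision_tokens_per_tile(560, 14))
```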

In [None]:
import openvino as ov

ov_model_preprocessing_masks = ov.convert_model(
    preprocessing_masks,
    example_input={
        "cross_attention_mask": cross_attention_mask,
        "attention_mask": attention_mask,
        "current_input_ids": current_input_ids,
        "num_vision_tokens": num_vision_tokens,
        "past_cross_attn_kv_length": past_cross_attn_kv_length,
    }
)

ov.save_model(ov_model_preprocessing_masks, model_dir / "openvino_reshape_model.xml")

### 1.2 Load OpenVINO models

In [6]:
LANGUAGE_MODEL_NAME = "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml"
LANGUAGE_MODEL_NAME_1 = "openvino_language_model.xml"
IMAGE_ENCODER_NAME = "openvino_vision_encoder.xml"
PREPROCESSING_MASKS_NAME = "openvino_reshape_model.xml"

In [None]:
import openvino as ov
import gc

core = ov.Core()
model_path = model_dir

language_model = core.read_model(model_path / LANGUAGE_MODEL_NAME)
compiled_language_model = core.compile_model(language_model, "CPU")
request = compiled_language_model.create_infer_request()

image_encoder = core.compile_model(model_path / IMAGE_ENCODER_NAME,"CPU")
preprocessing_masks = core.compile_model(model_path / PREPROCESSING_MASKS_NAME,"CPU")


In [None]:
# check if all the models are converted

print("⌛ Check if all models are converted")
language_model_path = model_dir / LANGUAGE_MODEL_NAME
# language_model_path_1 = model_dir / LANGUAGE_MODEL_NAME_1
image_encoder_path = model_dir / IMAGE_ENCODER_NAME
preprocessing_masks_path = model_dir / PREPROCESSING_MASKS_NAME

if all(
    [
        language_model_path.exists(),
        # language_model_path_1.exists(),
        image_encoder_path.exists(),
        preprocessing_masks_path.exists(),
    ]
):
    print(f"✅ All models are converted. You can find results in {model_dir}")
else:
    print("❌ Not all models are converted. Please check the conversion process")


⌛ Check if all models are converted
✅ All models are converted. You can find results in /mnt/research/Projects/ModelZoo/LLAMA-3.2-VI/Llama-3.2-11B-Vision-Instruct/OV
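
The check above only looks at the `.xml` files. An OpenVINO IR model is stored as an `.xml` topology file plus a `.bin` weights file, so a slightly stricter variant can report exactly which files are missing. This is a minimal sketch using only `pathlib`; `missing_ir_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_ir_files(model_dir, xml_names):
    """List the .xml and .bin files that are expected but not present."""
    missing = []
    for xml_name in xml_names:
        xml_path = Path(model_dir) / xml_name
        # Each IR model needs both the topology (.xml) and the weights (.bin).
        for path in (xml_path, xml_path.with_suffix(".bin")):
            if not path.exists():
                missing.append(path.name)
    return missing


# Example call with a directory that does not exist: every file is reported.
print(missing_ir_files("no_such_dir", ["openvino_vision_encoder.xml"]))
```

In the notebook it could be called as `missing_ir_files(model_dir, [LANGUAGE_MODEL_NAME, IMAGE_ENCODER_NAME, PREPROCESSING_MASKS_NAME])`; an empty list means everything is in place.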


### 1.3 Copy assets to the assets folder

In [16]:
assets_dir = model_dir / "assets"
assets_dir.mkdir(exist_ok=True)

# copy all the assets to the assets directory (json files, vocab files, etc.)

import shutil

# copy all json files

for file in model_dir.glob("*.json"):
    shutil.copy(file, assets_dir)

    


In [9]:
!ls -lh {model_dir}

total 31G
drwxrwxr-x 2 prabod prabod 4.0K Jan 15 03:09 assets
-rw-rw-r-- 1 prabod prabod 5.0K Dec 12 01:53 chat_template.json
-rw-rw-r-- 1 prabod prabod 5.0K Jan 15 03:06 config.json
-rw-rw-r-- 1 prabod prabod  210 Dec 12 01:53 generation_config.json
-rw-rw-r-- 1 prabod prabod 4.9G Jan 23 01:10 llm_int4_asym_r10_gs64_max_activation_variance_all_layers.bin
-rw-rw-r-- 1 prabod prabod 3.9M Jan 23 01:10 llm_int4_asym_r10_gs64_max_activation_variance_all_layers.xml
-rw-rw-r-- 1 prabod prabod 4.9G Dec 12 04:28 llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin
-rw-rw-r-- 1 prabod prabod 3.9M Dec 12 04:28 llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml
-rw-rw-r-- 1 prabod prabod  19G Dec 12 01:55 openvino_language_model.bin
-rw-rw-r-- 1 prabod prabod 3.0M Dec 12 01:55 openvino_language_model.xml
-rw-rw-r-- 1 prabod prabod   92 Jan 22 05:14 openvino_reshape_model.bin
-rw-rw-r-- 1 prabod prabod  37K Jan 22 05:14 openvino_reshape_model.xml
-rw-rw-r-- 

In [11]:
!ls -lh {model_dir / "assets"}

total 17M
-rw-rw-r-- 1 prabod prabod 5.0K Jan 14 08:10  chat_template.json
-rw-rw-r-- 1 prabod prabod 5.0K Jan 15 03:09 'config copy.json'
-rw-rw-r-- 1 prabod prabod 5.0K Jan 15 03:09  config.json
-rw-rw-r-- 1 prabod prabod  210 Jan 14 08:10  generation_config.json
-rw-rw-r-- 1 prabod prabod  477 Jan 14 08:10  preprocessor_config.json
-rw-rw-r-- 1 prabod prabod  454 Jan 14 08:10  special_tokens_map.json
-rw-rw-r-- 1 prabod prabod  55K Jan 14 08:10  tokenizer_config.json
-rw-rw-r-- 1 prabod prabod  17M Jan 14 08:10  tokenizer.json


## 2. Import and Save MLLama in Spark NLP

- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [1]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


24/11/07 09:56:55 WARN Utils: Your hostname, minotaur resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface eno1)
24/11/07 09:56:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/11/07 09:56:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [16]:
from sparknlp.annotator import MLLamaForMultimodal

imageClassifier = MLLamaForMultimodal.loadSavedModel(str(model_path), spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

25/02/14 02:49:23 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native8030791226413631526/libtbb.so.2




In [17]:
imageClassifier.write().overwrite().save("file:///tmp/MLLama_spark_nlp")

                                                                                

In [18]:
!ls -lah /tmp/MLLama_spark_nlp

total 6.8G
drwxr-xr-x  4 prabod prabod 4.0K Feb 14 02:51 .
drwxr-xr-x 13 prabod root   4.0K Feb 14 02:50 ..
drwxr-xr-x  6 prabod prabod 4.0K Feb 14 02:50 fields
-rw-r--r--  1 prabod prabod 4.9G Feb 14 02:51 llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml
-rw-r--r--  1 prabod prabod  40M Feb 14 02:51 .llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml.crc
drwxr-xr-x  2 prabod prabod 4.0K Feb 14 02:50 metadata
-rw-r--r--  1 prabod prabod  37K Feb 14 02:51 openvino_reshape_model.xml
-rw-r--r--  1 prabod prabod  304 Feb 14 02:51 .openvino_reshape_model.xml.crc
-rw-r--r--  1 prabod prabod 1.8G Feb 14 02:51 openvino_vision_encoder.xml
-rw-r--r--  1 prabod prabod  15M Feb 14 02:51 .openvino_vision_encoder.xml.crc


In [19]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

# download two test images into the ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is unusual on this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = MLLamaForMultimodal.load("file:///tmp/MLLama_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)
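
The prompt string in the cell above is written out by hand. To ask a different question without retyping the special tokens, the same template can be wrapped in a small function. This is a minimal sketch that reproduces the exact string used in this notebook, including the doubled `<|begin_of_text|>` token; `build_mllama_prompt` is a hypothetical helper name, not a Spark NLP API.

```python
def build_mllama_prompt(question: str) -> str:
    """Wrap a question in the single-image chat template used in this notebook."""
    return (
        "<|begin_of_text|><|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"<|image|>{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


print(repr(build_mllama_prompt("What is unusual on this image?")))
```

The result can be passed to `lit(...)` when building `test_df`, or as the text argument of `fullAnnotateImage`.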

In [20]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is unusual on this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

for result in annotations_result:
    print(result["answer"])

image_path: /home/prabod/Projects/spark-nlp/examples/python/transformers/openvino/images/image1.jpg
[Annotation(document, 0, 208, This image depicts a cat lying in a box, on a carpet. The image features a cat lying in a box placed on a carpet. The image features a cat lying in a box placed on a carpet. The image features a cat lying in a, Map(), [])]
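
The printed output above is the raw list of annotations. To keep only the generated text, the `result` attribute of each annotation can be collected. This is a minimal sketch, assuming each row behaves like a dict keyed by output column and each annotation exposes a `result` attribute; `extract_answers` is a hypothetical helper name, and the stand-in objects below exist only to make the example runnable without Spark.

```python
from types import SimpleNamespace


def extract_answers(annotations_result, output_col="answer"):
    """Collect the generated text from every annotation in the output column."""
    return [
        annotation.result
        for row in annotations_result
        for annotation in row[output_col]
    ]


# Stand-in data shaped like the LightPipeline output (illustration only).
fake_result = [{"answer": [SimpleNamespace(result="This image depicts a cat lying in a box.")]}]
print(extract_answers(fake_result))
```

With the real pipeline, `extract_answers(annotations_result)` would return one string per annotation in the `answer` column.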
