![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_LLAVA.ipynb)

# Import OpenVINO LLAVA models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing LLAVA models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying weight-precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is computationally expensive, so we recommend a runtime with more than 32 GB of memory when exporting the quantized model from HuggingFace.
- You can import LLAVA models via `LLAVAForMultiModal`. These models are usually listed under the `Image-Text-to-Text` category and have `LLAVA` in their labels.
- Reference: [LLAVA](https://huggingface.co/docs/transformers/model_doc/llava)
- Some [example models](https://huggingface.co/models?search=LLAVA)
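
To get a feel for why precision matters, the back-of-the-envelope sketch below estimates how much storage the weights alone need at each bit width. The 7-billion parameter count is an approximation for a 7B model and `weight_size_gb` is a hypothetical helper; real model files also hold scales, zero points and any layers left uncompressed, so actual sizes will differ.

```python
def weight_size_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: roughly 7 billion parameters
NUM_PARAMS = 7_000_000_000

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_size_gb(NUM_PARAMS, bits):.1f} GB")
```

The conversion process itself needs headroom beyond these figures, which is consistent with the memory recommendation above.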

## 1. Export and Save the HuggingFace model

- Let's install `transformers` and `openvino` along with the other dependencies. Spark NLP itself doesn't need `openvino` to be installed; we only need it here to load and save models from HuggingFace.
- The commands below set minimum versions (for example `transformers>=4.39.1` and `openvino>=2024.5.0`). Other releases may work too, but these are the constraints this notebook was written against.
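
After installing, it can help to confirm what actually ended up in the environment. The sketch below is one way to do that with the standard library; `parse_version`, `meets_minimum` and `check_package` are hypothetical helpers, and the version parsing is a deliberate simplification (for anything beyond plain dotted numbers, the `packaging` library is the more robust choice).

```python
from importlib import metadata
from itertools import takewhile


def parse_version(text: str) -> tuple:
    """Turn '4.39.1' into (4, 39, 1); local suffixes such as '+cpu' are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> str:
    """Describe whether an installed package satisfies the minimum version."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{name} {installed}: {status}"


for pkg, minimum in [("transformers", "4.39.1"), ("openvino", "2024.5.0"), ("nncf", "2.14.0")]:
    print(check_package(pkg, minimum))
```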

In [None]:

%pip install -q "nncf>=2.14.0" "torch>=2.1" "transformers>=4.39.1" "accelerate" "pillow" "gradio>=4.26" "datasets>=2.14.6" "tqdm" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q -U "openvino>=2024.5.0" "openvino-tokenizers>=2024.5.0" "openvino-genai>=2024.5"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu


Note: you may need to restart the kernel to use updated packages.


In [None]:
from pathlib import Path
import requests

utility_files = ["notebook_utils.py", "cmd_helper.py"]

for utility in utility_files:
    local_path = Path(utility)
    # Download each helper only if it is not already present locally
    if not local_path.exists():
        r = requests.get(
            url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/{local_path.name}",
        )
        r.raise_for_status()
        with local_path.open("w") as f:
            f.write(r.text)

### 1.1 Convert the model to OpenVINO

In [None]:
from cmd_helper import optimum_cli

model_id = "llava-hf/llava-1.5-7b-hf"
model_path = Path(model_id.split("/")[-1]) / "FP16"

if not model_path.exists():
    optimum_cli(model_id, model_path, additional_args={"weight-format": "fp16"})

**Export command:**

`optimum-cli export openvino --model llava-hf/llava-1.5-7b-hf llava-1.5-7b-hf/FP16 --weight-format fp16`
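
The printed command can also be assembled by hand, which is handy when running the export outside the notebook. The sketch below rebuilds it as an argument list; `build_export_command` is a hypothetical helper, and whether the `optimum_cli` utility constructs the command in exactly this way is an assumption.

```python
import shlex


def build_export_command(model_id: str, output_dir: str, weight_format: str) -> list:
    """Return the optimum-cli export invocation as a list of arguments."""
    return [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        output_dir,
        "--weight-format", weight_format,
    ]


command = build_export_command("llava-hf/llava-1.5-7b-hf", "llava-1.5-7b-hf/FP16", "fp16")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the export; keeping it as a list avoids shell-quoting problems with unusual paths.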

Downloading shards: 100%|██████████| 3/3 [00:00<00:00,  3.84it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00,  1.90s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


In [None]:
import shutil
import nncf
import openvino as ov
import gc


compression_mode = "INT4"

core = ov.Core()


def compress_model_weights(precision):
    int4_compression_config = {"mode": nncf.CompressWeightsMode.INT4_ASYM, "group_size": 128, "ratio": 1, "all_layers": True}
    int8_compression_config = {"mode": nncf.CompressWeightsMode.INT8_ASYM}

    compressed_model_path = model_path.parent / precision

    if not compressed_model_path.exists():
        ov_model = core.read_model(model_path / "openvino_language_model.xml")
        compression_config = int4_compression_config if precision == "INT4" else int8_compression_config
        compressed_ov_model = nncf.compress_weights(ov_model, **compression_config)
        ov.save_model(compressed_ov_model, compressed_model_path / "openvino_language_model.xml")
        del compressed_ov_model
        del ov_model
        gc.collect()
        for file_name in model_path.glob("*"):
            if file_name.name in ["openvino_language_model.xml", "openvino_language_model.bin"]:
                continue
            shutil.copy(file_name, compressed_model_path)


compress_model_weights(compression_mode)

INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode   │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int4_asym                 │ 100% (225 / 225)            │ 100% (225 / 225)                       │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
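
To make the `INT4_ASYM` and `group_size` settings more concrete, here is a toy, pure-Python illustration of group-wise asymmetric quantization. It is a simplified sketch and not NNCF's implementation: each group stores a scale and the group minimum as an offset, whereas real implementations typically store an integer zero point. All function names are hypothetical.

```python
def quantize_group(weights, num_bits=4):
    """Map a group of floats to integer codes in [0, 2**num_bits - 1] plus (scale, offset)."""
    levels = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # constant group: any non-zero scale works
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Recover approximate floats from the integer codes."""
    return [offset + c * scale for c in codes]


def quantize_groupwise(weights, group_size):
    """Quantize consecutive chunks of `group_size` weights independently."""
    return [quantize_group(weights[i:i + group_size]) for i in range(0, len(weights), group_size)]


weights = [-0.8, -0.1, 0.05, 0.7, 2.0, 2.1, 2.4, 3.0]
for codes, scale, offset in quantize_groupwise(weights, group_size=4):
    restored = dequantize_group(codes, scale, offset)
    print(codes, [round(x, 3) for x in restored])
```

Because every group gets its own scale and offset, a smaller group size tracks the local weight range more closely at the cost of storing more metadata.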


### 1.2 Load OpenVINO models

In [None]:
model_dir = model_path.parent / compression_mode
language_model = core.read_model(model_dir / "openvino_language_model.xml")
vision_embedding = core.compile_model(model_dir / "openvino_vision_embeddings_model.xml", "AUTO")
text_embedding = core.compile_model(model_dir / "openvino_text_embeddings_model.xml", "AUTO")
compiled_language_model = core.compile_model(language_model, "AUTO")


In [None]:
import requests
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoConfig

config = AutoConfig.from_pretrained(model_path)

processor = AutoProcessor.from_pretrained(
    model_path, patch_size=config.vision_config.patch_size, vision_feature_select_strategy=config.vision_feature_select_strategy
)


def load_image(image_file):
    # "http" also matches "https" URLs; anything else is treated as a local path
    if image_file.startswith("http"):
        response = requests.get(image_file)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image


image_file = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
text_message = "What is unusual on this image?"

image = load_image(image_file)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": text_message},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs_new = processor(images=image, text=prompt, return_tensors="pt")



In [None]:

request = compiled_language_model.create_infer_request()
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]

# Compute the text and vision embeddings for the prompt
text_out = text_embedding(inputs_new["input_ids"])[0]
vision_out = vision_embedding(pixel_values)[0]

In [None]:
import numpy as np
import torch

class MergeMultiModalInputs(torch.nn.Module):
    def __init__(self,image_seq_length=576,image_token_index=32000):
        super().__init__()
        self.image_seq_length = image_seq_length
        self.image_token_index = image_token_index

    def forward(
        self,
        vision_embeds,
        inputs_embeds,
        input_ids,
    ):
        # True wherever input_ids holds the image placeholder token, expanded to the embedding shape
        special_image_mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        # Overwrite the placeholder embeddings with the vision features, in order
        final_embedding = inputs_embeds.masked_scatter(special_image_mask, vision_embeds)

        return {
            "final_embedding": final_embedding
        }
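
The merge step is easier to follow on tiny inputs. The sketch below mimics it with plain Python lists instead of tensors: every row whose token id equals the image placeholder is replaced, in order, by the next row of image features. `merge_image_features` is a hypothetical helper, and raising on a count mismatch is a choice made for this sketch rather than something the module above does.

```python
def merge_image_features(input_ids, text_embeds, image_features, image_token_index):
    """Return a copy of text_embeds with image-token rows replaced by image_features rows."""
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_index]
    if len(positions) != len(image_features):
        raise ValueError(
            f"{len(positions)} image tokens but {len(image_features)} image feature rows"
        )
    merged = [list(row) for row in text_embeds]
    for pos, feature in zip(positions, image_features):
        merged[pos] = list(feature)
    return merged


IMAGE_TOKEN = 32000  # same default as image_token_index above
input_ids = [1, IMAGE_TOKEN, IMAGE_TOKEN, 42]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
image_features = [[9.0, 9.1], [8.0, 8.1]]

merged = merge_image_features(input_ids, text_embeds, image_features, IMAGE_TOKEN)
print(merged)
```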

In [None]:
torch_model_merge = MergeMultiModalInputs(
    image_seq_length=config.image_seq_length,
    image_token_index=config.image_token_index
)

In [None]:
# test the model
inputs_embeds = torch.from_numpy(text_out)
input_ids = inputs_new["input_ids"]
vision_embeds = torch.from_numpy(vision_out)

final_embedding = torch_model_merge(vision_embeds, inputs_embeds, input_ids)

In [None]:
import openvino as ov

# convert MergeMultiModalInputs to OpenVINO IR
ov_model_merge = ov.convert_model(
    torch_model_merge,
    example_input={
        "vision_embeds": torch.from_numpy(vision_out),
        "inputs_embeds": torch.from_numpy(text_out),
        "input_ids": inputs_new["input_ids"],
    }
)
ov.save_model(ov_model_merge, model_dir/"openvino_merge_model.xml")



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
# check if all the models are converted

print("⌛ Check if all models are converted")
lang_model_path = model_dir / "openvino_language_model.xml"
vision_embed_path = model_dir / "openvino_vision_embeddings_model.xml"
text_embed_path = model_dir / "openvino_text_embeddings_model.xml"
merge_model_path = model_dir / "openvino_merge_model.xml"

if all(
    [
        lang_model_path.exists(),
        vision_embed_path.exists(),
        text_embed_path.exists(),
        merge_model_path.exists(),
    ]
):
    print(f"✅ All models are converted. You can find results in {model_dir}")
else:
    print("❌ Not all models are converted. Please check the conversion process")

⌛ Check if all models are converted
✅ All models are converted. You can find results in llava-1.5-7b-hf/INT4
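
When the check fails, it is more useful to know which file is missing. A small variant, sketched below with a hypothetical `missing_files` helper, returns the missing names instead of a single pass/fail result. The demo runs against a temporary directory so it does not depend on the exported model.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = [
    "openvino_language_model.xml",
    "openvino_vision_embeddings_model.xml",
    "openvino_text_embeddings_model.xml",
    "openvino_merge_model.xml",
]


def missing_files(model_dir, required=REQUIRED_FILES):
    """List the required file names that do not exist under model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).exists()]


# Demo on a throwaway directory: create three of the four files, then check
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES[:3]:
        (Path(tmp) / name).touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

In the notebook, calling `missing_files(model_dir)` should return an empty list once all four models have been written.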


### 1.3 Copy assets to the assets folder

In [None]:
assets_dir = model_dir / "assets"
assets_dir.mkdir(exist_ok=True)

# copy all the assets to the assets directory (json files, vocab files, etc.)

import shutil

# copy all json files

for file in model_dir.glob("*.json"):
    shutil.copy(file, assets_dir)

from transformers import AutoConfig

model_id = "llava-hf/llava-1.5-7b-hf"

config = AutoConfig.from_pretrained(model_id)
config.save_pretrained(assets_dir)





In [None]:
!ls -lh {model_dir}

total 4.1G
-rw-rw-r-- 1 prabod prabod   41 Feb 13 05:09 added_tokens.json
drwxrwxr-x 2 prabod prabod 4.0K Feb 13 05:10 assets
-rw-rw-r-- 1 prabod prabod  701 Feb 13 05:09 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.1K Feb 13 05:09 config.json
-rw-rw-r-- 1 prabod prabod  136 Feb 13 05:09 generation_config.json
-rw-rw-r-- 1 prabod prabod 332K Feb 13 05:09 openvino_detokenizer.bin
-rw-rw-r-- 1 prabod prabod  12K Feb 13 05:09 openvino_detokenizer.xml
-rw-rw-r-- 1 prabod prabod 3.2G Feb 13 05:09 openvino_language_model.bin
-rw-rw-r-- 1 prabod prabod 2.9M Feb 13 05:09 openvino_language_model.xml
-rw-rw-r-- 1 prabod prabod   40 Feb 13 05:10 openvino_merge_model.bin
-rw-rw-r-- 1 prabod prabod 9.9K Feb 13 05:10 openvino_merge_model.xml
-rw-rw-r-- 1 prabod prabod 251M Feb 13 05:09 openvino_text_embeddings_model.bin
-rw-rw-r-- 1 prabod prabod 3.1K Feb 13 05:09 openvino_text_embeddings_model.xml
-rw-rw-r-- 1 prabod prabod 1.2M Feb 13 05:09 openvino_tokenizer.bin
-rw-rw-r-- 1 prabod prabod  25K

In [None]:
!ls -lh {assets_dir}


total 3.5M
-rw-rw-r-- 1 prabod prabod   41 Feb 13 05:10 added_tokens.json
-rw-rw-r-- 1 prabod prabod  701 Feb 13 05:10 chat_template.json
-rw-rw-r-- 1 prabod prabod 1.1K Feb 13 05:10 config.json
-rw-rw-r-- 1 prabod prabod  136 Feb 13 05:10 generation_config.json
-rw-rw-r-- 1 prabod prabod  505 Feb 13 05:10 preprocessor_config.json
-rw-rw-r-- 1 prabod prabod  173 Feb 13 05:10 processor_config.json
-rw-rw-r-- 1 prabod prabod  580 Feb 13 05:10 special_tokens_map.json
-rw-rw-r-- 1 prabod prabod 1.5K Feb 13 05:10 tokenizer_config.json
-rw-rw-r-- 1 prabod prabod 3.5M Feb 13 05:10 tokenizer.json


### 1.4 Test the OpenVINO model

In [None]:
import openvino as ov
import torch

core = ov.Core()
device = "CPU"


In [None]:
language_model = core.read_model(model_dir / "openvino_language_model.xml")
vision_embedding = core.compile_model(model_dir / "openvino_vision_embeddings_model.xml", "AUTO")
text_embedding = core.compile_model(model_dir / "openvino_text_embeddings_model.xml", "AUTO")
compiled_language_model = core.compile_model(language_model, "AUTO")
merge_multi_modal = core.compile_model(model_dir / "openvino_merge_model.xml", "AUTO")

In [None]:
generated_tokens = []

from transformers import AutoProcessor, TextStreamer

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is unusual on this image?"},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs_new = processor(images=image, text=prompt, return_tensors="pt")

# inputs_new = processor(prompt, [image], return_tensors="pt")

generation_args = {"max_new_tokens": 50, "do_sample": False, "streamer": TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)}


request = compiled_language_model.create_infer_request()
merge_model_request = merge_multi_modal.create_infer_request()
input_names = {key.get_any_name(): idx for idx, key in enumerate(language_model.inputs)}
inputs = {}
# Set the initial input_ids
current_input_ids = inputs_new["input_ids"]
attention_mask = inputs_new["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
pixel_values = inputs_new["pixel_values"]

for i in range(generation_args["max_new_tokens"]):
    # Generate input embeds each time
    if current_input_ids.shape[-1] > 1:
        vision_embeds = torch.from_numpy(vision_embedding({
            "pixel_values": pixel_values,
        })[0])

    text_embeds = torch.from_numpy(text_embedding(current_input_ids)[0])

    if i == 0:
        merge_model_request.start_async({
            "vision_embeds": vision_embeds,
            "inputs_embeds": text_embeds,
            "input_ids": current_input_ids,
        }, share_inputs=True)
        merge_model_request.wait()
        final_embedding = torch.from_numpy(merge_model_request.get_tensor("final_embedding").data)
    else:
        final_embedding = text_embeds
    if i>0:
        inputs = {}
    # Prepare inputs for the model
    inputs["inputs_embeds"] = final_embedding
    inputs["attention_mask"] = attention_mask
    inputs["position_ids"] = position_ids
    if "beam_idx" in input_names:
        inputs["beam_idx"] = np.arange(attention_mask.shape[0], dtype=int)

    # Start inference
    request.start_async(inputs, share_inputs=True)
    request.wait()

    # Get the logits and find the next token
    logits = torch.from_numpy(request.get_tensor("logits").data)
    next_token = logits.argmax(-1)[0][-1]

    # Append the generated token
    generated_tokens.append(next_token)

    # Feed only the new token on the next step; the attention mask below still covers the full sequence
    current_input_ids = next_token.reshape(1, 1)

    # update the attention mask
    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, :1])], dim=-1)

    # Update inputs for the next iteration
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids[:, -current_input_ids.shape[1] :]
    inputs["position_ids"] = position_ids
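
The bookkeeping in this loop (growing attention mask, position ids derived from it, feeding only the newest token) can be isolated from OpenVINO entirely. The sketch below reproduces it with plain lists and a toy `step_fn` standing in for the language model. All names are hypothetical, and the early stop on an end-of-sequence token is an addition that the loop above does not perform.

```python
def position_ids_from_mask(attention_mask):
    """Cumulative sum of the mask minus one; masked (padding) slots are set to 1."""
    ids, running = [], 0
    for m in attention_mask:
        running += m
        ids.append(running - 1 if m else 1)
    return ids


def greedy_decode(step_fn, prompt_ids, max_new_tokens, eos_token_id=None):
    """Greedy decoding: full prompt on the first step, then one token at a time."""
    attention_mask = [1] * len(prompt_ids)
    current = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        # Keep only the positions of the tokens being fed on this step
        positions = position_ids_from_mask(attention_mask)[-len(current):]
        logits = step_fn(current, positions)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_token)
        if eos_token_id is not None and next_token == eos_token_id:
            break
        current = [next_token]
        attention_mask.append(1)
    return generated


def toy_step(tokens, positions, vocab_size=5):
    """Stand-in model: always prefers (last token + 1) modulo the vocabulary size."""
    preferred = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == preferred else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_step, prompt_ids=[0, 1], max_new_tokens=10, eos_token_id=4))
```

Without an end-of-sequence check, generation always runs for the full `max_new_tokens`, which is why the answer printed further down stops mid-sentence at 50 tokens.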

In [None]:
generated_text = processor.decode(generated_tokens, skip_special_tokens=True)

print(f"Question:\n {text_message}")
print("Answer:")
print(generated_text)

Question:
 What is unusual on this image?
Answer:
Answer:
The unusual aspect of this image is that a cat is lying inside a cardboard box, which is not a typical place for a cat to rest. Cats are known for their curiosity and love for small, enclosed spaces, but in this case


## 2. Import and Save LLAVA in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- This part is pretty easy via our simple script.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()


24/11/07 09:56:55 WARN Utils: Your hostname, minotaur resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface eno1)
24/11/07 09:56:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/11/07 09:56:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pathlib import Path
import os

imageClassifier = LLAVAForMultiModal.loadSavedModel(str(model_dir),spark) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

25/02/13 06:30:15 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native10897903401200889289/libtbb.so.2




In [None]:
imageClassifier.write().overwrite().save("file:///tmp/LLAVA_spark_nlp")

                                                                                

In [None]:


# download two images to test into ./images folder

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("USER: \n <|image|> \n What's this picture about? \n ASSISTANT:\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = LLAVAForMultiModal.load("file:///tmp/LLAVA_spark_nlp")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

model = pipeline.fit(test_df)
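
The prompt string is written out by hand in several places, so a small helper keeps the wording consistent. `build_prompt` and `PROMPT_TEMPLATE` below are hypothetical names; the template itself is copied from the literal used in this notebook, and whether other LLAVA checkpoints expect the same layout is an assumption.

```python
# Template copied from the prompt literal used in this notebook
PROMPT_TEMPLATE = "USER: \n <|image|> \n {question} \n ASSISTANT:\n"


def build_prompt(question: str) -> str:
    """Wrap a user question in the prompt layout used in this notebook."""
    return PROMPT_TEMPLATE.format(question=question.strip())


prompt = build_prompt("What's this picture about?")
print(repr(prompt))
```

The result can be passed to `lit(...)` when building the DataFrame or used as the text argument of `fullAnnotateImage`.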

In [None]:
light_pipeline = LightPipeline(model)
image_path = os.getcwd() + "/images/" + "image1.jpg"
print("image_path: " + image_path)
annotations_result = light_pipeline.fullAnnotateImage(
    image_path,
    "USER: \n <|image|> \n What's this picture about? \n ASSISTANT:\n"
)

for result in annotations_result:
    print(result["answer"])

image_path: file:///home/prabod/Projects/spark-nlp/examples/python/transformers/openvino/images/image1.jpg
[Annotation(document, 0, 207, This image features a cat comfortably laying inside a cardboard box. The cat appears to be relaxed and enjoying its cozy spot. The scene takes place on a carpeted floor, which adds to the overall warm and inv, Map(), [])]
