![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Nomic.ipynb)

# Import OpenVINO Nomic models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and importing Nomic models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and reducing weight precision to improve performance and efficiency on CPU platforms. The example below saves the weights in FP16; INT8 and INT4 compression are available through [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference) but are not applied here.

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is a computationally expensive process, so it is recommended to use a runtime with more than 32GB memory for exporting the quantized model from HuggingFace.
- You can import Nomic models via `NomicEmbeddings`. These are text embedding models, usually listed under the `Sentence Similarity` category, and have `nomic` in their names.
- Some [example models](https://huggingface.co/models?search=Nomic)

## 1. Export and Save the HuggingFace model

- Let's install the `transformers` and `openvino` packages along with the other dependencies. You don't need `openvino` installed to use Spark NLP, but we need it here to load and save models from HuggingFace.
- We lock `transformers` to version `4.43.4`. This doesn't mean it won't work with future releases, but we want you to know which versions have been tested successfully.

In [2]:
!pip install -q --upgrade transformers==4.43.4
!pip install -q --upgrade openvino==2024.3
!pip install -q --upgrade openvino-dev
!pip install -q --upgrade optimum-intel
!pip install -q --upgrade nncf
!pip install -q --upgrade huggingface_hub
!pip install -q --upgrade onnx==1.15.0
!pip install -q --upgrade torch

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.0.dev20240118+cu121 requires torch==2.3.0.dev20240118, but you have torch 2.3.0 which is incompatible.
torchvision 0.18.1 requires torch==2.3.1, but you have torch 2.3.0 which is incompatible.
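
Since the resolver warning above shows that installed versions can drift from what was requested, a quick check of the pinned packages may be useful. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the helper names are illustrative, and only the exact pins from the install cell are listed (`openvino==2024.3` is left out because the installed version string may carry an extra patch component).

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the install cell; unpinned packages are left out.
PINNED = {"transformers": "4.43.4", "onnx": "1.15.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


print(check_pins(PINNED) or "all pinned packages match")
```

An empty result means the pinned packages match the tested versions; anything else lists the expected and found versions side by side.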

In [None]:
from huggingface_hub import notebook_login
notebook_login()

[Optimum Intel](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) is the interface between the Transformers library and the various model optimization and acceleration tools provided by Intel. HuggingFace models loaded with optimum-intel are automatically optimized for OpenVINO, while being compatible with the Transformers API. It also offers the ability to perform weight compression during export.
- To load a HuggingFace model directly for inference/export, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. We can use this to import and export OpenVINO models with `from_pretrained` and `save_pretrained`.
- By setting `export=True`, the source model is converted to OpenVINO IR format on the fly.
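- We'll use the [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model from HuggingFace as an example. In this notebook the export is done in two steps: first to ONNX with `optimum-cli`, then to OpenVINO IR with `ov.convert_model`.
- In addition to the model, we also need to save the tokenizer vocabulary. This is the same for every model: these are assets needed for tokenization inside Spark NLP.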

### First Convert the model to ONNX format

In [3]:
!optimum-cli export onnx --trust-remote-code --task feature-extraction --model nomic-ai/nomic-embed-text-v1 ./onnx_models/nomic-ai/nomic-embed-text-v1

  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
  deprecate("VQEncoderOutput", "0.31", deprecation_message)
  deprecate("VQModel", "0.31", deprecation_message)
Framework not specified. Using pt to export the model.
<All keys matched successfully>
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: SentenceTransformer *****
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> False
  if seqlen > self._seq_len_cached:
  if seqlen > self.max_position_embeddings:
  if (
  assert ro_dim <= x.shape[-1]
Post-processing the exported models...
Deduplicating shared (tied) weights...

Validating ONNX model onnx_models/nomic-ai/nomic-embed-text-v1/model.onnx...
	-[✓] ONNX model output names match reference model (sentence_embedding, token_embeddings)
	- Validating ONNX Model output "token_embeddings":
		-[✓] (2, 16, 768) matches (2, 16, 768)
		-[x] values not close
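
The validation log above ends with `values not close` for `token_embeddings`, which says only that some tolerance was exceeded, not by how much. One rough way to quantify the gap is the largest element-wise difference between the reference and exported outputs. The sketch below is plain Python with made-up numbers; the tolerance values are hypothetical and are not the ones used by `optimum-cli`.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(exported):
        raise ValueError("vectors must have the same length")
    return max(abs(r - e) for r, e in zip(reference, exported))


def values_close(reference, exported, atol=1e-4):
    """True when every element differs by at most atol (a hypothetical tolerance)."""
    return max_abs_diff(reference, exported) <= atol


# Made-up numbers standing in for one row of PyTorch and ONNX outputs.
reference_row = [0.1200, -0.3400, 0.5600]
exported_row = [0.1201, -0.3398, 0.5605]

print(max_abs_diff(reference_row, exported_row))
print(values_close(reference_row, exported_row))             # default tolerance
print(values_close(reference_row, exported_row, atol=1e-3))  # looser tolerance
```

Whether a given difference matters depends on the downstream use; for retrieval, comparing rankings produced by both models is usually more telling than a raw tolerance.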

### Convert the model to OpenVINO format

In [13]:
import openvino as ov
import os

MODEL_NAME = "nomic-ai/nomic-embed-text-v1"

ov_model = ov.convert_model(f"./onnx_models/{MODEL_NAME}/model.onnx")

# Save the model
# create the directory if it does not exist
os.makedirs(f"models/{MODEL_NAME}", exist_ok=True)
ov.save_model(ov_model, f"models/{MODEL_NAME}/openvino_model.xml", compress_to_fp16=True)

EXPORT_PATH = f"models/{MODEL_NAME}"
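
The `compress_to_fp16=True` argument above stores the weights in half precision, which roughly halves the size of the `.bin` file at the cost of a small rounding error per weight. The sketch below is not OpenVINO code; it only illustrates that rounding with the standard library's half-precision `struct` format, using made-up weight values.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def fp16_error(values):
    """Largest absolute rounding error introduced by storing values as FP16."""
    return max(abs(v - to_fp16(v)) for v in values)


# Made-up weights, only to illustrate the size of the rounding error.
weights = [0.1, -0.25, 0.333, 1.0]
print([to_fp16(w) for w in weights])
print(fp16_error(weights))
```

Values such as `-0.25` and `1.0` survive unchanged, while `0.1` is stored as the nearest representable half-precision number.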

### Save the model and tokenizer

In [14]:
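from transformers import AutoTokenizer

ASSETS_PATH = f"./models/{MODEL_NAME}/assets"

# Create the assets directory if it does not exist
os.makedirs(ASSETS_PATH, exist_ok=True)

# This notebook takes the vocabulary from the bert-base-uncased tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Writes vocab.txt into the assets directory
tokenizer.save_vocabulary(ASSETS_PATH)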

('./models/nomic-ai/nomic-embed-text-v1/assets/vocab.txt',)
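
The `vocab.txt` written above is a plain WordPiece vocabulary with one token per line, where the zero-based line index serves as the token id. As a rough illustration of that layout, the sketch below reads such a file into a dictionary. The `load_vocab` helper and the four-token demo file are made up so the example runs without the real assets.

```python
import os
import tempfile


def load_vocab(path):
    """Read a one-token-per-line vocabulary; the line index is the token id."""
    with open(path, encoding="utf-8") as handle:
        return {line.rstrip("\n"): index for index, line in enumerate(handle)}


# Tiny stand-in file so the sketch runs without the real assets directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nhello\nworld\n")
    vocab = load_vocab(demo_path)

print(vocab["hello"], len(vocab))  # prints: 2 4
```

Pointing `load_vocab` at `ASSETS_PATH + "/vocab.txt"` would give the same kind of mapping for the exported vocabulary.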

Once the export is complete, the assets needed for tokenization in Spark NLP must be in the `assets` directory. The cell above already saved `vocab.txt` there.

Let's have a look inside these two directories and see what we are dealing with:

In [15]:
!ls -l {EXPORT_PATH}

total 267996
drwxrwxr-x 2 prabod prabod      4096 Sep  5 06:09 assets
-rw-rw-r-- 1 prabod prabod 273463642 Sep  5 06:24 openvino_model.bin
-rw-rw-r-- 1 prabod prabod    957222 Sep  5 06:24 openvino_model.xml


In [16]:
!ls -l {EXPORT_PATH}/assets

total 228
-rw-rw-r-- 1 prabod prabod 231508 Sep  5 06:24 vocab.txt
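
Before handing the directory to Spark NLP, it can help to confirm that the layout matches the listings above. The following is a minimal sketch; the file names come from those listings, while the `missing_export_files` helper is hypothetical and the demo runs against a throwaway directory rather than the real export.

```python
import os
import tempfile

# File names taken from the two directory listings above.
EXPECTED_FILES = (
    "openvino_model.xml",
    "openvino_model.bin",
    os.path.join("assets", "vocab.txt"),
)


def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected files that are absent under export_path."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(export_path, name))
    ]


# Demo on a throwaway directory that lacks the .bin file on purpose.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "assets"))
    for name in ("openvino_model.xml", os.path.join("assets", "vocab.txt")):
        open(os.path.join(tmp, name), "w").close()
    demo_missing = missing_export_files(tmp)

print(demo_missing)  # prints: ['openvino_model.bin']
```

Calling `missing_export_files(EXPORT_PATH)` on the real export should return an empty list.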


## 2. Import and Save Nomic in Spark NLP

- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `NomicEmbeddings`, which allows us to load the OpenVINO model.
- Most params will be set automatically. They can also be set later, after loading the model in `NomicEmbeddings`, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we started earlier via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.

In [3]:
EXPORT_PATH = "models/nomic-ai/nomic-embed-text-v1"

In [None]:
from sparknlp.annotator import *

# Load the exported OpenVINO model; the output column holds embeddings
Nomic = NomicEmbeddings \
    .loadSavedModel(EXPORT_PATH, spark) \
    .setInputCols(["documents"]) \
    .setOutputCol("nomic")

Let's save it to disk so it is easier to move around and can be used later via the `.load` function

In [None]:
Nomic.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome 😎!

This is your OpenVINO Nomic model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny Nomic model 😊

In [5]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

test_data = spark.createDataFrame([
            [1, "query: how much protein should a female eat"],
            [2, "query: summit define"],
            [3, "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 "
                "is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're "
                "expecting or training for a marathon. Check out the chart below to see how much protein you should "
                "be eating each day.", ],
            [4, "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain :"
                " the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the "
                "leaders of two or more governments."]
        ]).toDF("id", "text")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

nomic = NomicEmbeddings \
            .load(f"{MODEL_NAME}_spark_nlp") \
            .setInputCols(["documents"]) \
            .setOutputCol("nomic")

pipeline = Pipeline().setStages([document_assembler, nomic])
results = pipeline.fit(test_data).transform(test_data)

results.select("nomic.embeddings").show(truncate=False)

24/09/05 06:28:25 WARN SparkContext: The path /mnt/research/Projects/ModelZoo/Nomic/models/nomic-ai/nomic-embed-text-v1/openvino_model.xml has been added already. Overwriting of added paths is not supported in the current version.
24/09/05 06:28:25 WARN SparkContext: The path /mnt/research/Projects/ModelZoo/Nomic/models/nomic-ai/nomic-embed-text-v1/openvino_model.bin has been added already. Overwriting of added paths is not supported in the current version.


                                                                                


That's it! You can now go wild and use hundreds of Nomic models from HuggingFace 🤗 in Spark NLP 🚀
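
As a follow-up, embeddings like the ones produced above are typically compared with cosine similarity, for example to match each query to its most relevant passage. The sketch below is plain Python and assumes the vectors have already been collected from the `nomic.embeddings` column to the driver (not shown). The helper names and the 3-dimensional vectors are made up, while real vectors from this model have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_match(query, passages):
    """Index of the passage vector most similar to the query vector."""
    scores = [cosine_similarity(query, p) for p in passages]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up 3-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9]]
print(best_match(query_vec, passage_vecs))  # prints: 1
```

For large collections the same comparison is better done inside Spark or with a vector index; the loop here is only meant to show the calculation.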
