![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Instructor.ipynb)

# Import OpenVINO Instructor models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and exporting Instructor models from HuggingFace for use in Spark NLP, leveraging the various tools provided in the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) ecosystem.

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- You can import Instructor models from HuggingFace, such as [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base). This notebook exports them for the `feature-extraction` task.

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies, together with `optimum` and `openvino`. You don't need `onnx` to be installed for Spark NLP; we only need it to load and export models from HuggingFace.
- We lock `transformers` on version `4.52.4`. This doesn't mean it won't work with future releases, but we wanted you to know which version has been tested successfully.

In [1]:
!pip install -q --upgrade transformers[onnx]==4.52.4 optimum openvino
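
Since the tested versions matter here, it can be handy to record what actually got installed. The following is a minimal sketch using only the Python standard library; `installed_versions` is a hypothetical helper name introduced for illustration, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


# Report the packages installed by the cell above.
print(installed_versions(["transformers", "optimum", "openvino"]))
```

A `None` entry would suggest the install step did not complete for that package.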


[Optimum Intel](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) is the interface between the Transformers library and the various model optimization and acceleration tools provided by Intel. HuggingFace models loaded with optimum-intel are automatically optimized for OpenVINO, while remaining compatible with the Transformers API.

- We first use the `optimum-cli` tool to export the [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base) model to ONNX format for the `feature-extraction` task.
- Then, we use `ov.convert_model()` to convert the exported ONNX model into OpenVINO Intermediate Representation (IR) format (`.xml` and `.bin`) directly in Python.
- The resulting OpenVINO model ends up in the export directory (`export_onnx/hkunlp/instructor-base`), next to the tokenizer and config files.


In [None]:
MODEL_NAME = "hkunlp/instructor-base"
EXPORT_PATH = f"export_onnx/{MODEL_NAME}"

! optimum-cli export onnx --model {MODEL_NAME} {EXPORT_PATH} --task feature-extraction

Let's move the `spiece.model` file to an `assets` directory:

In [3]:
! mkdir -p {EXPORT_PATH}/assets && mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.model

Now let's convert the ONNX model to OpenVINO format:

In [4]:
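import openvino as ov

# Convert the exported ONNX graph to an OpenVINO model.
model = ov.convert_model(f"{EXPORT_PATH}/model.onnx")

# Write the IR files (openvino_model.xml and openvino_model.bin) straight into
# the export directory, so nothing depends on Colab's /content working directory.
ov.save_model(model, f"{EXPORT_PATH}/openvino_model.xml")

# The ONNX file is no longer needed once the IR files exist.
!rm -rf {EXPORT_PATH}/model.onnx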

In [5]:
!ls {EXPORT_PATH}

assets	     openvino_model.bin  special_tokens_map.json  tokenizer.json
config.json  openvino_model.xml  tokenizer_config.json
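
Before handing this directory to Spark NLP, the layout can also be checked programmatically. This is a minimal sketch that assumes the file names shown in the listing above and the `assets` move from the earlier step; `missing_export_files` and `REQUIRED_FILES` are hypothetical names introduced for illustration, not part of Spark NLP or OpenVINO.

```python
from pathlib import Path

# File names taken from the directory listing above (assumed, not an official contract).
REQUIRED_FILES = ("openvino_model.xml", "openvino_model.bin")


def missing_export_files(export_path):
    """Return the expected entries that are absent from export_path."""
    root = Path(export_path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # The SentencePiece file (spiece.model) was moved into assets/ earlier.
    if not any((root / "assets").glob("*.model")):
        missing.append("assets/*.model")
    return missing
```

Calling `missing_export_files(EXPORT_PATH)` on a complete export should return an empty list.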


## Import and Save InstructorEmbeddings in Spark NLP

- Install and set up Spark NLP in Google Colab
- This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [6]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [7]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 5.5.3
Apache Spark version: 3.5.4


- Let's use the `loadSavedModel` function in `InstructorEmbeddings`, which allows us to load the OpenVINO model.
- Most params will be set automatically. They can also be set later, after loading the model in `InstructorEmbeddings`, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.

In [8]:
from sparknlp.annotator import InstructorEmbeddings

embedding = InstructorEmbeddings.loadSavedModel(
     EXPORT_PATH,
     spark
 )\
  .setInputCols(["document"])\
  .setOutputCol("instructor")

- Let's save it on disk so it is easier to move around and can be loaded later via the `.load` function

In [9]:
embedding.write().overwrite().save("./{}_spark_nlp".format(EXPORT_PATH))

Awesome 😎!

This is your OpenVINO InstructorEmbeddings model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [10]:
! ls -l {EXPORT_PATH}_spark_nlp

total 216600
-rw-r--r-- 1 root root 220997879 Jun 23 00:21 instructor_openvino
-rw-r--r-- 1 root root    791656 Jun 23 00:21 instructor_spp
drwxr-xr-x 2 root root      4096 Jun 23 00:21 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny InstructorEmbeddings model 😊

In [11]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import InstructorEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

instructor_loaded = InstructorEmbeddings.load(f"{EXPORT_PATH}_spark_nlp")\
    .setInputCols(["document"])\
    .setOutputCol("instructor")\
    .setInstruction("Encode This:")


pipeline = Pipeline(stages=[
    document_assembler,
    instructor_loaded
])

data = spark.createDataFrame([[
    'William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist.'
]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.select("instructor.embeddings").show()

+--------------------+
|          embeddings|
+--------------------+
|[[-0.025555575, 0...|
+--------------------+
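
Each row of the output holds one embedding vector per document. A common next step is to compare two such vectors. The following is a minimal sketch in plain Python using cosine similarity; `cosine_similarity` is a hypothetical helper written for illustration, and it is not part of Spark NLP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

Assuming the pipeline above, `result.select("instructor.embeddings").first()[0][0]` should give the vector for the first document, which can then be passed to this helper alongside another document's vector.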



That's it! You can now go wild and use hundreds of InstructorEmbeddings models from HuggingFace 🤗 in Spark NLP 🚀
