![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BGE.ipynb)

# Import ONNX BGE models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support for this annotator was introduced in `Spark NLP 5.2.1`, enabling high-performance inference. Please make sure you have upgraded to the latest Spark NLP release.
- You can import BGE models from HuggingFace, but they have to be in the `Sentence Similarity` category. This means you cannot use BGE models trained or fine-tuned on a specific task such as token or sequence classification.

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` installed to run Spark NLP, but we need it to load and save models from HuggingFace.
- We pin `transformers` to version `4.51.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.

In [None]:
!pip install -q transformers[onnx]==4.51.3 optimum onnx
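Since the notebook is only tested against a pinned version, it can help to confirm what is actually installed before continuing. The following is a minimal sketch using only the standard library; `TESTED_VERSIONS` and `check_versions` are hypothetical names introduced here for illustration, and the version number is the one from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pinned in the install cell above; extend this mapping as needed.
TESTED_VERSIONS = {"transformers": "4.51.3"}


def check_versions(expected):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches


# Print one line per package that differs from the tested version.
for package, (installed, wanted) in check_versions(TESTED_VERSIONS).items():
    print(f"{package}: installed={installed}, tested={wanted}")
```

A mismatch does not necessarily mean the export will fail. It only tells you that you are outside the tested combination.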

- HuggingFace has an extension called Optimum, which offers specialized model inference, including ONNX. We can use it to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) model from HuggingFace as an example and load it as an `ORTModelForFeatureExtraction`, which represents an ONNX model.
- In addition to the BGE model, we also need to save the tokenizer. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.


In [8]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

MODEL_NAME = "BAAI/bge-base-en"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForFeatureExtraction.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(EXPORT_PATH)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

The model BAAI/bge-base-en was already converted to ONNX but got `export=True`, the model will be converted to ONNX once again. Don't forget to save the resulting model with `.save_pretrained()`


('onnx_models/BAAI/bge-base-en/tokenizer_config.json',
 'onnx_models/BAAI/bge-base-en/special_tokens_map.json',
 'onnx_models/BAAI/bge-base-en/vocab.txt',
 'onnx_models/BAAI/bge-base-en/added_tokens.json',
 'onnx_models/BAAI/bge-base-en/tokenizer.json')

Let's have a look inside the export directory and see what we are dealing with:

In [10]:
!ls -l {EXPORT_PATH}

total 426572
-rw-r--r-- 1 root root       696 Jun 10 20:02 config.json
-rw-r--r-- 1 root root 435844616 Jun 10 20:02 model.onnx
-rw-r--r-- 1 root root       695 Jun 10 20:02 special_tokens_map.json
-rw-r--r-- 1 root root      1272 Jun 10 20:02 tokenizer_config.json
-rw-r--r-- 1 root root    711396 Jun 10 20:02 tokenizer.json
-rw-r--r-- 1 root root    231508 Jun 10 20:02 vocab.txt
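If you prefer a programmatic check over reading the listing, a small helper can report which files are missing from the export directory. This is only a sketch: `EXPECTED_FILES` and `missing_export_files` are hypothetical names, and treating these three files (taken from the listing above) as the minimum set is an assumption, not a Spark NLP requirement stated in this notebook.

```python
from pathlib import Path

# File names taken from the directory listing above. Treating them as the
# minimum needed for the later steps is an assumption of this sketch.
EXPECTED_FILES = ("config.json", "model.onnx", "vocab.txt")


def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in export_path."""
    export_dir = Path(export_path)
    return [name for name in expected if not (export_dir / name).is_file()]
```

Calling `missing_export_files(EXPORT_PATH)` right after the export should return an empty list. Run it before `vocab.txt` is moved into `assets`, otherwise that file will be reported as missing.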


Create the `assets` directory and move `vocab.txt` into it (required by Spark NLP):

In [12]:
!mkdir -p {EXPORT_PATH}/assets && mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets/

In [13]:
!ls -l {EXPORT_PATH}/assets

total 228
-rw-r--r-- 1 root root 231508 Jun 10 20:02 vocab.txt


Voila! We have our `vocab.txt` inside the `assets` directory.
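The same step can also be done in plain Python, which avoids depending on shell behavior and is safe to re-run. This is a minimal sketch; `move_vocab_to_assets` is a hypothetical helper name, not part of Spark NLP or `transformers`.

```python
from pathlib import Path
import shutil


def move_vocab_to_assets(export_path):
    """Create <export_path>/assets and move vocab.txt into it.

    Returns the final path of vocab.txt inside the assets directory.
    """
    export_dir = Path(export_path)
    assets_dir = export_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    source = export_dir / "vocab.txt"
    target = assets_dir / "vocab.txt"
    if source.exists():
        shutil.move(str(source), str(target))
    elif not target.exists():
        # Neither location has the file, so the export is incomplete.
        raise FileNotFoundError(f"vocab.txt not found in {export_dir}")
    return target
```

In this notebook it would be called as `move_vocab_to_assets(EXPORT_PATH)`. Calling it a second time is a no-op, because the file is already in place.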

## Import and Save BGE in Spark NLP

- Let's install and set up Spark NLP in Google Colab. For this example, we'll use specific versions of `pyspark` and `spark-nlp` that we've already tested with this transformer model to make sure everything runs smoothly.

In [14]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [15]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `BGEEmbeddings`, which allows us to load the ONNX model.
- Most params will be set automatically. They can also be set later, after loading the model in `BGEEmbeddings`, so don't worry about setting them now.
- `loadSavedModel` accepts two params. The first is the path to the exported model. The second is the SparkSession, which is the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.


In [16]:
from sparknlp.annotator import BGEEmbeddings

BGE = BGEEmbeddings.loadSavedModel(f"{EXPORT_PATH}", spark)\
    .setInputCols(["document"])\
    .setOutputCol("bge")

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [17]:
BGE.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [18]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your ONNX BGE model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [19]:
! ls -l {MODEL_NAME}_spark_nlp

total 425708
-rw-r--r-- 1 root root 435911255 Jun 10 20:09 bge_onnx
drwxr-xr-x 3 root root      4096 Jun 10 20:09 fields
drwxr-xr-x 2 root root      4096 Jun 10 20:09 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BGE model 😊

In [20]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BGEEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

bge_embeddings = BGEEmbeddings.load(f"{MODEL_NAME}_spark_nlp") \
    .setInputCols(["document"]) \
    .setOutputCol("bge")

pipeline = Pipeline(stages=[
    document_assembler,
    bge_embeddings
])

data = spark.createDataFrame([
    ["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist."]
]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.selectExpr("explode(bge.embeddings) as embeddings").show()

+--------------------+
|          embeddings|
+--------------------+
|[-0.03819768, 0.0...|
+--------------------+
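Since BGE models belong to the `Sentence Similarity` category, a natural next step is to compare two embeddings. A common choice is cosine similarity, sketched below in plain Python. `cosine_similarity` is a hypothetical helper written for this example, and it assumes you have already collected the embeddings to the driver as Python lists of floats (for instance by calling `.collect()` on the DataFrame produced by the `selectExpr` above).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions. For large datasets you would normally compute this inside Spark rather than on collected rows. This helper is only meant for quick spot checks.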



That's it! You can now go wild and use hundreds of BGE models from HuggingFace 🤗 in Spark NLP 🚀
