![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DeBERTa.ipynb)

# Import ONNX DeBERTa models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- You can import DeBERTa models from HuggingFace, but they have to be in the `Fill Mask` category. This means you cannot use DeBERTa models trained or fine-tuned on a specific task such as token or sequence classification.

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.48.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.48.3 optimum onnx
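
Since the notebook pins a tested version, it can help to confirm what actually got installed. The snippet below is a minimal sketch using only the standard library; the `TESTED_VERSIONS` dictionary and the helper names are illustrative and not part of Spark NLP or `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version

# Version this notebook reports as tested. Add further pins here if you want
# to check them too (the dictionary itself is just an illustrative helper).
TESTED_VERSIONS = {"transformers": "4.48.3"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_report(tested=TESTED_VERSIONS):
    """Map each package name to a tuple (installed, tested, matches)."""
    report = {}
    for name, wanted in tested.items():
        found = installed_version(name)
        report[name] = (found, wanted, found == wanted)
    return report


for name, (found, wanted, ok) in version_report().items():
    status = "OK" if ok else "differs from the tested version"
    print(f"{name}: installed={found}, tested={wanted} ({status})")
```

A mismatch is not necessarily a problem, since other versions may work as well; the report only tells you whether you are running the combination that was tested.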

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) model from HuggingFace as an example and load it as an `ORTModelForFeatureExtraction`, representing an ONNX model.
- In addition to the DeBERTa model, we also need to save the tokenizer. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [5]:
from transformers import DebertaV2Tokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

MODEL_NAME = "microsoft/deberta-v3-base"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForFeatureExtraction.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(EXPORT_PATH)

tokenizer = DebertaV2Tokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)



('onnx_models/microsoft/deberta-v3-base/tokenizer_config.json',
 'onnx_models/microsoft/deberta-v3-base/special_tokens_map.json',
 'onnx_models/microsoft/deberta-v3-base/spm.model',
 'onnx_models/microsoft/deberta-v3-base/added_tokens.json')

Let's have a look inside the export directory and see what we are dealing with:

In [6]:
!ls -l {EXPORT_PATH}

total 721284
-rw-r--r-- 1 root root        23 Jun 11 22:28 added_tokens.json
-rw-r--r-- 1 root root       803 Jun 11 22:28 config.json
-rw-r--r-- 1 root root 736104888 Jun 11 22:28 model.onnx
-rw-r--r-- 1 root root       286 Jun 11 22:28 special_tokens_map.json
-rw-r--r-- 1 root root   2464616 Jun 11 22:28 spm.model
-rw-r--r-- 1 root root      1315 Jun 11 22:28 tokenizer_config.json


- We need to move the `spm.model` file from the tokenizer into an assets folder, as this is where Spark NLP looks for it.

In [7]:
!mkdir -p {EXPORT_PATH}/assets && mv {EXPORT_PATH}/spm.model {EXPORT_PATH}/assets/

In [8]:
!ls -l {EXPORT_PATH}/assets

total 2408
-rw-r--r-- 1 root root 2464616 Jun 11 22:28 spm.model


Voilà! We have our `spm.model` inside the `assets` directory.
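
If you are not running in a notebook with shell access, the same move can be done in plain Python. This is a minimal sketch using `pathlib` and `shutil`; the helper name `move_to_assets` is hypothetical, and the default file name `spm.model` is taken from the listing above.

```python
import shutil
from pathlib import Path


def move_to_assets(export_dir, filename="spm.model"):
    """Move `filename` from export_dir into export_dir/assets and return the new path."""
    export_dir = Path(export_dir)
    assets_dir = export_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    source = export_dir / filename
    target = assets_dir / filename
    if source.exists():
        shutil.move(str(source), str(target))
    elif not target.exists():
        # Neither location has the file, so there is nothing to move.
        raise FileNotFoundError(f"{filename} not found in {export_dir} or {assets_dir}")
    return target
```

Calling `move_to_assets(EXPORT_PATH)` should leave the folder in the same state as the `mkdir`/`mv` cell. Running it a second time is harmless, because the file is only moved if it is still in the top-level folder.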

## Import and Save DeBERTa in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- For this example we'll use specific versions of `pyspark` and `spark-nlp` that we've already tested with this transformer model to make sure everything runs smoothly.

In [9]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [10]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `DeBertaEmbeddings`, which allows us to load the ONNX model.
- Most params will be set automatically. They can also be set later, after loading the model in `DeBertaEmbeddings`, during runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, i.e. the `spark` variable we created earlier via `sparknlp.start()`.
- `setStorageRef` is very important. When you train a task such as NER or any text classification, Spark NLP uses this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results 😊
- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want!
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
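
Before calling `loadSavedModel` on a local path, it can be worth checking that the export folder has the layout prepared in the previous section. The snippet below is a small sketch for that; the list of expected files is an assumption drawn from the directory listings in this notebook, not an official Spark NLP specification, and the helper name is hypothetical.

```python
from pathlib import Path

# Files this notebook leaves in the export folder for DeBERTa. This is an
# assumption based on the listings above, not an official specification.
EXPECTED_FILES = ("model.onnx", "assets/spm.model")


def missing_export_files(export_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent from export_dir."""
    root = Path(export_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, `missing_export_files(EXPORT_PATH)` returning an empty list suggests the folder is ready to load, while a non-empty list names what still needs to be exported or moved. The check only applies to local paths, not to `HDFS`, `S3`, or `DBFS` locations.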


In [11]:
from sparknlp.annotator import DeBertaEmbeddings

deberta = DeBertaEmbeddings.loadSavedModel(EXPORT_PATH, spark)\
    .setInputCols(["document", "token"])\
    .setOutputCol("deberta")\
    .setCaseSensitive(True)\
    .setStorageRef("deberta-v3-base")

- Let's save it on disk so it is easier to move around and can be loaded later via the `.load` function.

In [12]:
deberta.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [13]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your ONNX DeBERTa model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [14]:
! ls -l {MODEL_NAME}_spark_nlp

total 721380
-rw-r--r-- 1 root root 736217347 Jun 11 22:32 deberta_onnx
-rw-r--r-- 1 root root   2464616 Jun 11 22:32 deberta_spp
drwxr-xr-x 2 root root      4096 Jun 11 22:32 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny DeBERTa model 😊

In [15]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DeBertaEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

deberta_loaded = DeBertaEmbeddings.load(f"{MODEL_NAME}_spark_nlp") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("deberta")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    deberta_loaded
])

data = spark.createDataFrame([[
    "William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist."
]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.selectExpr("explode(deberta.embeddings) as embeddings").show()

+--------------------+
|          embeddings|
+--------------------+
|[0.37620464, -0.5...|
|[0.44112626, -0.3...|
|[0.47031617, 0.05...|
|[-0.31039706, -0....|
|[0.5987083, 0.161...|
|[-0.64938843, 0.5...|
|[0.1911987, -0.47...|
|[0.16888708, -0.2...|
|[-0.07008469, -0....|
|[-0.2978857, 0.30...|
|[0.32503495, 0.23...|
|[-0.4973393, -0.2...|
|[-0.68204045, -0....|
|[-0.0047250018, -...|
|[0.0986197, -0.50...|
|[0.5566538, -0.21...|
|[1.5926999, 0.167...|
|[-0.17912892, -0....|
|[-0.20193331, -0....|
|[1.5722651, -0.10...|
+--------------------+
only showing top 20 rows



That's it! You can now go wild and use hundreds of DeBERTa models from HuggingFace 🤗 in Spark NLP 🚀
