![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBERT.ipynb)

# Import ONNX DistilBERT models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- You can import DistilBERT models from HuggingFace, but they must be in the `Fill Mask` category. This means you cannot use DistilBERT models trained or fine-tuned on a specific task such as token or sequence classification.

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.48.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.

In [3]:
!pip install -q --upgrade transformers[onnx]==4.48.3 optimum onnx
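
Since the tested versions matter here, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


# Report the packages installed by the cell above.
for name, ver in installed_versions(["transformers", "optimum", "onnx"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `transformers` reports something other than `4.48.3`, the rest of the notebook may still work, but that combination is not the one tested here.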

- HuggingFace has an extension called Optimum, which offers specialized model inference, including ONNX. We can use it to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) model from HuggingFace as an example and load it as an `ORTModelForFeatureExtraction`, representing an ONNX model.
- In addition to the DistilBERT model, we also need to save the `DistilBertTokenizer`. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [4]:
from transformers import DistilBertTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

MODEL_NAME = "distilbert-base-cased"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForFeatureExtraction.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(EXPORT_PATH)

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

The model distilbert-base-cased was already converted to ONNX but got `export=True`, the model will be converted to ONNX once again. Don't forget to save the resulting model with `.save_pretrained()`


('onnx_models/distilbert-base-cased/tokenizer_config.json',
 'onnx_models/distilbert-base-cased/special_tokens_map.json',
 'onnx_models/distilbert-base-cased/vocab.txt',
 'onnx_models/distilbert-base-cased/added_tokens.json')

Let's have a look inside this directory and see what we are dealing with:

In [5]:
!ls -l {EXPORT_PATH}

total 254992
-rw-r--r-- 1 root root       545 Jun 13 03:28 config.json
-rw-r--r-- 1 root root 260878383 Jun 13 03:28 model.onnx
-rw-r--r-- 1 root root       125 Jun 13 03:28 special_tokens_map.json
-rw-r--r-- 1 root root      1279 Jun 13 03:28 tokenizer_config.json
-rw-r--r-- 1 root root    213450 Jun 13 03:28 vocab.txt


- We need to move the `vocab.txt` file saved by the tokenizer into an `assets` folder:

In [6]:
!mkdir {EXPORT_PATH}/assets && mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets/

In [None]:
!ls -l {EXPORT_PATH}/assets

total 212
-rw-r--r-- 1 root root 213450 Jun 13 03:28 vocab.txt
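
If you are not running in a notebook, or want the step to be safe to re-run, the same move can be sketched in plain Python. The helper name `move_vocab_to_assets` is hypothetical; it only mirrors the `mkdir` and `mv` shell commands above.

```python
from pathlib import Path
import shutil


def move_vocab_to_assets(export_path):
    """Move <export_path>/vocab.txt into <export_path>/assets/ and return the new path."""
    export_dir = Path(export_path)
    assets_dir = export_dir / "assets"
    # Unlike a plain `mkdir`, this does not fail if the folder already exists.
    assets_dir.mkdir(parents=True, exist_ok=True)

    source = export_dir / "vocab.txt"
    target = assets_dir / "vocab.txt"
    if source.exists():
        shutil.move(str(source), str(target))
    elif not target.exists():
        # Neither location has the file, so the export is incomplete.
        raise FileNotFoundError(f"no vocab.txt found in {export_dir}")
    return target
```

In this notebook it would be called as `move_vocab_to_assets(EXPORT_PATH)`; calling it a second time simply returns the path of the file that was already moved.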


## Import and Save DistilBERT in Spark NLP

- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

- **Optional: Use the latest versions**
  - If you prefer to use the latest versions instead, you can install them with:
    ```bash
    !wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
    ```
  - Note: The latest versions may introduce breaking changes, so you might need to adjust the code accordingly.


In [8]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [9]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `DistilBertEmbeddings`, which allows us to load the ONNX model.
- Most params will be set automatically. They can also be set later at runtime, after loading the model in `DistilBertEmbeddings`, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- `setStorageRef` is very important. When you train a task such as NER or any text classification, we use this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results 😊
- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want!
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
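
Because `loadSavedModel` is pointed at the export directory, a quick check of that directory before loading can save a confusing error later. This is a minimal sketch for local paths only; the helper name `missing_export_files` and the `EXPECTED_FILES` tuple are hypothetical, and the expected layout is an assumption based only on the directory listings shown earlier in this notebook.

```python
from pathlib import Path

# Assumption: these are the files this notebook exported and moved into place.
# Adjust the tuple if your export looks different.
EXPECTED_FILES = ("model.onnx", "assets/vocab.txt")


def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent under export_path."""
    root = Path(export_path)
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, `missing_export_files(EXPORT_PATH)` returns an empty list when both files are in place, and a list such as `["assets/vocab.txt"]` if the move into `assets` was skipped.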


In [10]:
from sparknlp.annotator import DistilBertEmbeddings

distilbert = DistilBertEmbeddings.loadSavedModel(f"{EXPORT_PATH}", spark)\
    .setInputCols(["document",'token'])\
    .setOutputCol("distilbert")\
    .setCaseSensitive(True)\
    .setStorageRef('distilbert_base_cased')

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [11]:
distilbert.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [12]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your ONNX DistilBERT model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [13]:
! ls -l {MODEL_NAME}_spark_nlp

total 254816
-rw-r--r-- 1 root root 260918327 Jun 13 03:33 distilbert_onnx
drwxr-xr-x 3 root root      4096 Jun 13 03:33 fields
drwxr-xr-x 2 root root      4096 Jun 13 03:33 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny DistilBERT model 😊

In [14]:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

distilbert_loaded = DistilBertEmbeddings.load(f"{MODEL_NAME}_spark_nlp")\
    .setInputCols(["document", "token"])\
    .setOutputCol("distilbert")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    distilbert_loaded
])

data = spark.createDataFrame([[
    'William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist.'
]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.selectExpr("explode(distilbert.embeddings) as embeddings").show()

+--------------------+
|          embeddings|
+--------------------+
|[-0.079205066, -0...|
|[-0.008524727, -0...|
|[0.06860065, -0.1...|
|[-0.054616462, -0...|
|[-0.29974172, -0....|
|[0.0888023, -0.20...|
|[0.3460091, -0.49...|
|[-0.0032464452, -...|
|[0.6970492, -0.12...|
|[-0.45469904, 0.2...|
|[0.41657686, 0.18...|
|[0.09718056, 0.17...|
|[0.1975824, 0.163...|
|[0.20108229, -0.1...|
|[0.6448232, 0.120...|
|[0.28228003, -0.2...|
|[0.0451998, 0.476...|
|[0.4543021, 0.238...|
|[0.14045055, 0.05...|
|[0.05716961, 0.29...|
+--------------------+
only showing top 20 rows



That's it! You can now go wild and use hundreds of DistilBERT models from HuggingFace 🤗 in Spark NLP 🚀
