![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb)

## Import ONNX DistilBertForTokenClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `DistilBertForTokenClassification` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import DistilBERT models trained/fine-tuned for token classification via `DistilBertForTokenClassification` or `TFDistilBertForTokenClassification`. These models are usually under the `Token Classification` category and have `distilbert` in their labels.
- Reference: [TFDistilBertForTokenClassification](https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertfortokenclassification)
- Some [example models](https://huggingface.co/models?filter=distilbert&pipeline_tag=token-classification)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.52.4`. This doesn't mean it won't work with future releases.
- DistilBERT uses a WordPiece vocabulary (`vocab.txt`), so unlike ALBERT there is no need to install SentencePiece.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.52.4 optimum onnx

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [elastic/distilbert-base-cased-finetuned-conll03-english](https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english) model from HuggingFace as an example.
- In addition to the exported ONNX model, we also need to save the `DistilBertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.
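To make the role of the tokenizer assets more concrete, here is a minimal sketch of greedy longest-match WordPiece lookup, the general scheme behind a `vocab.txt` file. The tiny vocabulary and the `wordpiece` helper are hypothetical and only for illustration; this is neither the HuggingFace nor the Spark NLP implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match lookup."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: in vocab.txt each line is one token,
# and the line index is used as the token id.
vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "Cup", "##ert", "##ino", "Hawaii"]
vocab = {token: idx for idx, token in enumerate(vocab_lines)}

pieces = wordpiece("Cupertino", vocab)
print(pieces)                      # ['Cup', '##ert', '##ino']
print([vocab[p] for p in pieces])  # [4, 5, 6]
print(wordpiece("Oahu", vocab))    # ['[UNK]']
```

Because the pieces are looked up by exact string, the vocabulary file saved with the tokenizer has to travel together with the model.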

In [3]:
from transformers import DistilBertTokenizer
from optimum.onnxruntime import ORTModelForTokenClassification

MODEL_NAME = 'elastic/distilbert-base-cased-finetuned-conll03-english'
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForTokenClassification.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(ONNX_MODEL)

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(ONNX_MODEL)

('onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english/tokenizer_config.json',
 'onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english/special_tokens_map.json',
 'onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english/vocab.txt',
 'onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english/added_tokens.json')

Let's have a look inside this directory and see what we are dealing with:

In [4]:
!ls -l {ONNX_MODEL}

total 255044
-rw-r--r-- 1 root root       882 Jun 14 00:39 config.json
-rw-r--r-- 1 root root 260928908 Jun 14 00:39 model.onnx
-rw-r--r-- 1 root root       125 Jun 14 00:39 special_tokens_map.json
-rw-r--r-- 1 root root      1279 Jun 14 00:39 tokenizer_config.json
-rw-r--r-- 1 root root    213450 Jun 14 00:39 vocab.txt


- As you can see, `vocab.txt` was saved next to the model. We need to move it into an `assets` folder, which is where Spark NLP will look for it
- We also need the `labels` and their `ids`, which are saved inside the model's config. We will save them inside `labels.txt`

In [5]:
!mkdir -p {ONNX_MODEL}/assets

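# label2id maps each label name to its integer id
labels = ort_model.config.label2id
# Order the label names by their id before writing them out
labels = sorted(labels, key=labels.get)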

with open(f"{ONNX_MODEL}/assets/labels.txt", "w") as f:
    f.write("\n".join(labels))

!mv {ONNX_MODEL}/vocab.txt {ONNX_MODEL}/assets

In [6]:
!ls -lR {ONNX_MODEL}

onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english:
total 254836
drwxr-xr-x 2 root root      4096 Jun 14 00:42 assets
-rw-r--r-- 1 root root       882 Jun 14 00:39 config.json
-rw-r--r-- 1 root root 260928908 Jun 14 00:39 model.onnx
-rw-r--r-- 1 root root       125 Jun 14 00:39 special_tokens_map.json
-rw-r--r-- 1 root root      1279 Jun 14 00:39 tokenizer_config.json

onnx_models/elastic/distilbert-base-cased-finetuned-conll03-english/assets:
total 216
-rw-r--r-- 1 root root     51 Jun 14 00:42 labels.txt
-rw-r--r-- 1 root root 213450 Jun 14 00:39 vocab.txt


In [7]:
!cat {ONNX_MODEL}/assets/labels.txt

O
B-PER
I-PER
B-ORG
I-ORG
B-LOC
I-LOC
B-MISC
I-MISC

Voila! We have our `vocab.txt` and `labels.txt` inside the `assets` directory
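The label step can be sanity-checked with a small standalone sketch. It assumes that the line position in `labels.txt` stands in for the label id, which is why the labels are sorted by id before being written. The `label2id` dict below is a hypothetical stand-in shaped like `config.label2id`, with names and ids mirroring the `labels.txt` printed above, and `ordered_labels` is an illustrative helper, not part of any library.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping in the same shape as `config.label2id`,
# deliberately listed out of order.
label2id = {
    "B-LOC": 5, "O": 0, "I-PER": 2, "B-PER": 1, "I-ORG": 4,
    "B-ORG": 3, "I-LOC": 6, "I-MISC": 8, "B-MISC": 7,
}


def ordered_labels(label2id):
    """Return labels sorted by id, checking that the ids are exactly 0..n-1."""
    ids = sorted(label2id.values())
    if ids != list(range(len(ids))):
        raise ValueError(f"label ids are not contiguous from 0: {ids}")
    return sorted(label2id, key=label2id.get)


labels = ordered_labels(label2id)

# Round trip through a temporary file: line i should hold the label with id i.
with tempfile.TemporaryDirectory() as tmp:
    labels_path = Path(tmp) / "labels.txt"
    labels_path.write_text("\n".join(labels))
    read_back = labels_path.read_text().splitlines()

assert all(label2id[label] == i for i, label in enumerate(read_back))
print(read_back)
```

The contiguity check is a design choice: a gap in the ids would silently shift every later label by one line.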

## Import and Save DistilBertForTokenClassification in Spark NLP


- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.


In [8]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [9]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 5.5.3
Apache Spark version: 3.5.4


- Let's use the `loadSavedModel` function in `DistilBertForTokenClassification`, which allows us to load the ONNX model we exported above
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `DistilBertForTokenClassification` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the exported ONNX model directory, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
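Before pointing `loadSavedModel` at the export directory, it can help to confirm that the files prepared earlier are in place. The sketch below is a plain-Python check. Treating `model.onnx`, `assets/vocab.txt` and `assets/labels.txt` as the required set is an assumption based on the directory listing above, and `missing_export_files` is a hypothetical helper, not a Spark NLP function.

```python
import tempfile
from pathlib import Path

# Files this notebook prepared in the export directory. Treating them as
# the required set is an assumption based on the listing shown earlier.
EXPECTED_FILES = ("model.onnx", "assets/vocab.txt", "assets/labels.txt")


def missing_export_files(model_dir):
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that lacks labels.txt.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "assets").mkdir()
    (root / "model.onnx").touch()
    (root / "assets" / "vocab.txt").touch()
    demo_missing = missing_export_files(root)

print(demo_missing)  # ['assets/labels.txt']
```

In this notebook, `missing_export_files(ONNX_MODEL)` should return an empty list at this point.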



In [10]:
from sparknlp.annotator import DistilBertForTokenClassification

tokenClassifier = DistilBertForTokenClassification.loadSavedModel(
      ONNX_MODEL,
      spark
    )\
    .setInputCols(["document",'token'])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(128)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [11]:
tokenClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))

Let's clean up stuff we don't need anymore

In [12]:
!rm -rf {ONNX_MODEL}

Awesome 😎  !

This is your DistilBertForTokenClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [13]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 254864
-rw-r--r-- 1 root root 260968857 Jun 14 00:45 distilbert_classification_onnx
drwxr-xr-x 4 root root      4096 Jun 14 00:45 fields
drwxr-xr-x 2 root root      4096 Jun 14 00:45 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny DistilBertForTokenClassification model 😊

In [14]:
tokenClassifier_loaded = DistilBertForTokenClassification.load("./{}_spark_nlp_onnx".format(ONNX_MODEL))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

You can see what labels were used to train this model via `getClasses` function:

In [15]:
tokenClassifier_loaded.getClasses()

['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']
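Note that the list above is in a different order from `labels.txt`. To confirm that the loaded model carries the same label set, an order-insensitive comparison is enough. Both lists in this sketch are copied from outputs shown earlier in this notebook; the variable names are illustrative.

```python
# Output of getClasses() as printed above.
classes_from_model = ['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER',
                      'B-MISC', 'B-ORG', 'O', 'B-PER']

# Contents of assets/labels.txt as printed earlier, in id order.
labels_txt = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

same_labels = set(classes_from_model) == set(labels_txt)  # ignores order
same_order = classes_from_model == labels_txt             # position by position

print(same_labels, same_order)  # True False
```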

This is how you can use your loaded classifier model in Spark NLP 🚀 pipeline:

In [16]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier_loaded,
    converter
])

example = spark.createDataFrame([
    ["Barack Obama was born in Hawaii and served as President of the United States."],
    ["Apple Inc. is based in Cupertino and was founded by Steve Jobs."],
    ["Cristiano Ronaldo plays for Al-Nassr and has won multiple Ballon d'Or awards."]
]).toDF("text")

result = pipeline.fit(example).transform(example)
result.select("text", "ner.result").show(truncate=False)

result.selectExpr("explode(ner_chunk) as chunk").selectExpr(
    "chunk.result as text",
    "chunk.metadata['entity'] as entity"
).show(truncate=False)

+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|text                                                                         |result                                                            |
+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|Barack Obama was born in Hawaii and served as President of the United States.|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, O, O, O, B-LOC, I-LOC, O] |
|Apple Inc. is based in Cupertino and was founded by Steve Jobs.              |[B-ORG, I-ORG, I-ORG, O, O, O, B-LOC, O, O, O, O, B-PER, I-PER, O]|
|Cristiano Ronaldo plays for Al-Nassr and has won multiple Ballon d'Or awards.|[B-PER, I-PER, O, O, B-ORG, O, O, O, O, B-ORG, I-MISC, O, O]      |
+-----------------------------------------------------------------------------+------------------------------------------------------------------+
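To see how the BIO tags in the `result` column relate to entity chunks, here is a simplified plain-Python sketch of grouping tags into spans. It is not the `NerConverter` implementation, and `bio_to_chunks` is a hypothetical helper. The tags are copied from the first row above, while the token list is an assumption about how that sentence was split (15 tokens, matching the 15 tags).

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (text, entity_type) chunks."""
    chunks = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the chunk collected so far, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # A B- tag, or an I- tag of a different type, starts a new chunk.
            flush()
            current_tokens, current_type = [token], entity
        elif prefix == "I":
            current_tokens.append(token)  # continue the open chunk
        else:
            flush()  # an "O" tag closes any open chunk
            current_tokens, current_type = [], None
    flush()
    return chunks


# Assumed tokenization of the first example sentence.
tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "and", "served",
          "as", "President", "of", "the", "United", "States", "."]
# Tags copied from the first row of the table above.
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O",
        "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]

chunks = bio_to_chunks(tokens, tags)
print(chunks)
# [('Barack Obama', 'PER'), ('Hawaii', 'LOC'), ('United States', 'LOC')]
```

Starting a new chunk on an `I-` tag whose type differs from the open chunk is one possible policy; other converters may treat such sequences differently.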

That's it! You can now go wild and use hundreds of `DistilBertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀
