![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb)

## Import ONNX AlbertForTokenClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `AlbertForTokenClassification` is only available in `Spark NLP 5.1.1` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import ALBERT models trained/fine-tuned for token classification via `AlbertForTokenClassification` or `TFAlbertForTokenClassification`. These models are usually under the `Token Classification` category and have `albert` in their labels.
- Reference: [TFAlbertForTokenClassification](https://huggingface.co/transformers/model_doc/albert.html#tfalbertfortokenclassification)
- Some [example models](https://huggingface.co/models?filter=albert&pipeline_tag=token-classification)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.48.2`. This doesn't mean it won't work with future releases.
- ALBERT uses SentencePiece, so we will have to install that as well.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.48.2 optimum onnx sentencepiece
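
If you want to confirm that the environment matches the pin before exporting, a minimal sketch along these lines may help. The `PINNED` dictionary and the `installed_version` helper are illustrative names, not part of `transformers` or Spark NLP; the check relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pinned in the install cell above; extend this dict as needed.
PINNED = {"transformers": "4.48.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name, pinned in PINNED.items():
    found = installed_version(name)
    status = "ok" if found == pinned else "differs from pin"
    print(f"{name}: installed={found}, pinned={pinned} ({status})")
```

A mismatch is not necessarily a problem, since other releases may work too; it is just a useful first thing to look at if the export fails.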

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use [HooshvareLab/albert-fa-zwnj-base-v2-ner](https://huggingface.co/HooshvareLab/albert-fa-zwnj-base-v2-ner) model from HuggingFace as an example
- In addition to the ALBERT model, we also need to save the `AlbertTokenizer`. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [17]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForTokenClassification

MODEL_NAME = 'HooshvareLab/albert-fa-zwnj-base-v2-ner'
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForTokenClassification.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(EXPORT_PATH)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

('onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/tokenizer_config.json',
 'onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/special_tokens_map.json',
 'onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/spiece.model',
 'onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/added_tokens.json',
 'onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/tokenizer.json')

Let's have a look inside the export directory and see what we are dealing with:

In [18]:
!ls -l {EXPORT_PATH}

total 47024
-rw-r--r-- 1 root root     1630 Jun  7 07:07 config.json
-rw-r--r-- 1 root root 44875812 Jun  7 07:07 model.onnx
-rw-r--r-- 1 root root      971 Jun  7 07:07 special_tokens_map.json
-rw-r--r-- 1 root root   857476 Jun  7 07:07 spiece.model
-rw-r--r-- 1 root root    19227 Jun  7 07:07 tokenizer_config.json
-rw-r--r-- 1 root root  2381031 Jun  7 07:07 tokenizer.json


- We need to move the `spiece.model` file from the tokenizer into an assets folder, as this is where Spark NLP looks for it when working with models like Albert or other SentencePiece-based tokenizers.
- Additionally, we need to extract the `labels` and their corresponding `ids` from the model's config. This mapping will be saved as `labels.txt` inside the same `assets` folder.

In [19]:
!mkdir {EXPORT_PATH}/assets

labels = ort_model.config.label2id
labels = sorted(labels, key=labels.get)

with open(EXPORT_PATH + '/assets/labels.txt', 'w') as f:
    f.write('\n'.join(labels))

!mv {EXPORT_PATH}/spiece.model {EXPORT_PATH}/assets
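
The `sorted(labels, key=labels.get)` call above orders the label names by their integer id, so line `i` of `labels.txt` holds the label with id `i`. Here is a small sketch of that idea with an extra sanity check that the ids run from 0 without gaps. The `ordered_labels` helper and the toy mapping are hypothetical and only for illustration; they are not the real model config.

```python
def ordered_labels(label2id):
    """Return label names ordered by id, checking that ids are 0..n-1."""
    ids = sorted(label2id.values())
    if ids != list(range(len(label2id))):
        raise ValueError(f"label ids are not contiguous from 0: {ids}")
    # Sorting the dict iterates over its keys; key=label2id.get sorts by id.
    return sorted(label2id, key=label2id.get)


# Toy mapping for illustration only (not taken from the model's config.json).
toy_label2id = {"B-PER": 2, "O": 0, "I-PER": 3, "B-LOC": 1}
toy_labels = ordered_labels(toy_label2id)
print("\n".join(toy_labels))
```

With a real model you would pass `ort_model.config.label2id` instead of the toy dictionary.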

In [20]:
!ls -lR {EXPORT_PATH}

onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner:
total 46188
drwxr-xr-x 2 root root     4096 Jun  7 07:07 assets
-rw-r--r-- 1 root root     1630 Jun  7 07:07 config.json
-rw-r--r-- 1 root root 44875812 Jun  7 07:07 model.onnx
-rw-r--r-- 1 root root      971 Jun  7 07:07 special_tokens_map.json
-rw-r--r-- 1 root root    19227 Jun  7 07:07 tokenizer_config.json
-rw-r--r-- 1 root root  2381031 Jun  7 07:07 tokenizer.json

onnx_models/HooshvareLab/albert-fa-zwnj-base-v2-ner/assets:
total 844
-rw-r--r-- 1 root root    121 Jun  7 07:07 labels.txt
-rw-r--r-- 1 root root 857476 Jun  7 07:07 spiece.model


In [21]:
!cat {EXPORT_PATH}/assets/labels.txt

O
B-DAT
B-EVE
B-FAC
B-LOC
B-MON
B-ORG
B-PCT
B-PER
B-PRO
B-TIM
I-DAT
I-EVE
I-FAC
I-LOC
I-MON
I-ORG
I-PCT
I-PER
I-PRO
I-TIM

Voila! We have our `spiece.model` and `labels.txt` inside the `assets` directory.
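
Before handing the folder to Spark NLP, it can be handy to check the layout programmatically rather than by eye. The sketch below assumes the three files seen in the listing above (`model.onnx`, `assets/spiece.model`, `assets/labels.txt`) are the ones that matter; the `missing_export_files` helper is a hypothetical name, and the demo runs on a throwaway directory rather than the real export.

```python
import tempfile
from pathlib import Path

# Files we expect after the steps above (an assumption based on the listing).
REQUIRED = ("model.onnx", "assets/spiece.model", "assets/labels.txt")


def missing_export_files(export_path, required=REQUIRED):
    """Return the required relative paths that are not present as files."""
    root = Path(export_path)
    return [rel for rel in required if not (root / rel).is_file()]


# Demo on a temporary directory that deliberately lacks spiece.model.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "assets").mkdir()
    (root / "model.onnx").touch()
    (root / "assets" / "labels.txt").touch()
    demo_missing = missing_export_files(root)

print(demo_missing)
```

In the notebook you would call `missing_export_files(EXPORT_PATH)` and expect an empty list.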

## Import and Save AlbertForTokenClassification in Spark NLP


Let's install and set up Spark NLP in Google Colab. For this example, we'll use specific versions of `pyspark` and `spark-nlp` that we've already tested with this transformer model to make sure everything runs smoothly.

If you prefer to use the latest versions, feel free to run:

`!pip install -q pyspark spark-nlp`

Just keep in mind that newer versions might have some changes, so you may need to tweak your code a bit if anything breaks.

In [22]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3

Let's start Spark with Spark NLP included via our simple `start()` function

In [23]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `AlbertForTokenClassification`, which here loads the ONNX model we exported above.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `AlbertForTokenClassification` at runtime, so don't worry about what you set now.
- `loadSavedModel` accepts two params: the first is the path to the exported model directory (`EXPORT_PATH`), and the second is the SparkSession, which is the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.



In [24]:
from sparknlp.annotator import AlbertForTokenClassification

tokenClassifier = AlbertForTokenClassification\
  .loadSavedModel(EXPORT_PATH, spark)\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")\
  .setCaseSensitive(False)\
  .setMaxSentenceLength(128)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [25]:
tokenClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(MODEL_NAME))
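
Note that `MODEL_NAME` contains a slash, so the formatted save path points into a `HooshvareLab/` subdirectory rather than a single flat folder. The following sketch only manipulates the path string with the standard library to make that visible. The flattened variant at the end is an optional alternative of our own, not something the notebook does.

```python
from pathlib import PurePosixPath

MODEL_NAME = "HooshvareLab/albert-fa-zwnj-base-v2-ner"

# Same expression as in the save cell above.
save_path = PurePosixPath("./{}_spark_nlp_onnx".format(MODEL_NAME))
print(save_path.parent)  # directory created by the slash in MODEL_NAME
print(save_path.name)    # actual model folder name

# Optional alternative (hypothetical): flatten the name into one folder.
flat_name = "{}_spark_nlp_onnx".format(MODEL_NAME.replace("/", "_"))
print(flat_name)
```

Either layout works as long as the same path is used later for `.load`.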

Let's clean up stuff we don't need anymore

In [26]:
!rm -rf {EXPORT_PATH}

Awesome 😎  !

This is your AlbertForTokenClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [27]:
! ls -l {MODEL_NAME}_spark_nlp_onnx

total 44680
-rw-r--r-- 1 root root 44882796 Jun  7 07:07 albert_classification_onnx
-rw-r--r-- 1 root root   857476 Jun  7 07:07 albert_spp
drwxr-xr-x 3 root root     4096 Jun  7 07:07 fields
drwxr-xr-x 2 root root     4096 Jun  7 07:07 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny AlbertForTokenClassification model 😊

In [28]:
tokenClassifier_loaded = AlbertForTokenClassification.load("./{}_spark_nlp_onnx".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

You can see what labels were used to train this model via the `getClasses` function:

In [29]:
tokenClassifier_loaded.getClasses()

['I-PCT',
 'B-PRO',
 'I-EVE',
 'B-LOC',
 'I-ORG',
 'B-FAC',
 'B-EVE',
 'B-TIM',
 'I-DAT',
 'B-MON',
 'B-PCT',
 'I-MON',
 'I-LOC',
 'I-FAC',
 'I-PRO',
 'I-TIM',
 'I-PER',
 'B-DAT',
 'B-ORG',
 'O',
 'B-PER']
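
The order printed above differs from the order in `labels.txt`. Judging from this output, `getClasses` should not be relied on for id order, so a set comparison is the safer way to confirm nothing was lost. The sketch below copies both lists from this notebook and compares them; it is plain Python and does not call Spark NLP.

```python
# Entity types as they appear in labels.txt above.
ENTITY_TYPES = ["DAT", "EVE", "FAC", "LOC", "MON", "ORG", "PCT", "PER", "PRO", "TIM"]

# labels.txt order: "O", then all B- tags, then all I- tags.
labels_txt = (
    ["O"]
    + [f"B-{t}" for t in ENTITY_TYPES]
    + [f"I-{t}" for t in ENTITY_TYPES]
)

# Output of getClasses() copied from the cell above.
classes = [
    "I-PCT", "B-PRO", "I-EVE", "B-LOC", "I-ORG", "B-FAC", "B-EVE",
    "B-TIM", "I-DAT", "B-MON", "B-PCT", "I-MON", "I-LOC", "I-FAC",
    "I-PRO", "I-TIM", "I-PER", "B-DAT", "B-ORG", "O", "B-PER",
]

same_order = labels_txt == classes
same_labels = set(labels_txt) == set(classes)
print(f"same order: {same_order}, same labels: {same_labels}")
```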

This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:

In [30]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier_loaded,
    converter
])

example = spark.createDataFrame([
    ["این سریال به صورت رسمی در تاریخ دهم می ۲۰۱۱ توسط شبکه فاکس برای پخش رزرو شد."],
    ["دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد."],
    ["در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند."]
], ["text"])

result = pipeline.fit(example).transform(example)

result.select("text", "ner.result").show(truncate=False)
result.selectExpr("explode(ner_chunk) as chunk").selectExpr(
    "chunk.result as text",
    "chunk.metadata['entity'] as entity"
).show(truncate=False)

+----------------------------------------------------------------------------+--------------------------------------------------------------+
|text                                                                        |result                                                        |
+----------------------------------------------------------------------------+--------------------------------------------------------------+
|این سریال به صورت رسمی در تاریخ دهم می ۲۰۱۱ توسط شبکه فاکس برای پخش رزرو شد.|[O, O, O, O, O, O, O, O, O, O, O, B-ORG, I-ORG, O, O, O, O, O]|
|دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد.                 |[O, O, B-ORG, I-ORG, O, B-LOC, I-LOC, I-LOC, O, O, O]         |
|در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند.            |[O, B-DAT, I-DAT, O, O, B-LOC, O, B-PER, O, O, O, O, O, O]    |
+----------------------------------------------------------------------------+--------------------------------------------------------------+
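
To make the relationship between the per-token tags and entity chunks concrete, here is a simplified, pure-Python sketch of IOB grouping. It illustrates the general idea behind converting `B-`/`I-` tags into chunks and is not Spark NLP's `NerConverter` implementation. The `iob_to_chunks` helper is a hypothetical name, and the token list is our own split of the second example sentence, assumed to match the tokenizer because it yields the same 11 positions as the tags.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag continues the open chunk only if the entity type matches.
        continues = prefix == "I" and entity == current_type
        if not continues and current_tokens:
            chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new chunk (a stray I- tag is treated like B-).
            current_type, current_tokens = entity, [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


# Second example sentence and its tags from the table above.
tokens = ["دفتر", "مرکزی", "شرکت", "کامیکو", "در", "شهر",
          "ساسکاتون", "ساسکاچوان", "قرار", "دارد", "."]
tags = ["O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O", "O"]

chunks = iob_to_chunks(tokens, tags)
for entity, text in chunks:
    print(entity, text)
```

This yields one `ORG` chunk and one `LOC` chunk for that sentence, which is the kind of grouping the `ner_chunk` column is meant to expose.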


That's it! You can now go wild and use hundreds of `AlbertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀
