![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForZeroShotClassification.ipynb)

## Import ONNX AlbertForZeroShotClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `AlbertForZeroShotClassification` is only available in `Spark NLP 5.4.2` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import ALBERT models trained/fine-tuned for sequence classification via `AlbertForSequenceClassification` or `TFAlbertForSequenceClassification`. These models are usually under the `Sequence Classification` category and have `albert` in their labels.
- Reference: [TFAlbertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/albert#transformers.TFAlbertForSequenceClassification)
- Some [example models](https://huggingface.co/models?other=albert&pipeline_tag=zero-shot-classification)
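
Since the annotator only exists from Spark NLP 5.4.2 onward, it can help to check the installed version before going further. The snippet below is a minimal sketch of such a check; `parse_version` and `meets_minimum` are hypothetical helper names, not part of Spark NLP.

```python
def parse_version(version: str) -> tuple:
    """Turn '5.4.2' into (5, 4, 2); trailing non-digits such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.4.2") -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Typical use once Spark NLP is installed (left commented so the sketch runs anywhere):
# import sparknlp
# assert meets_minimum(sparknlp.version()), "AlbertForZeroShotClassification needs Spark NLP >= 5.4.2"
```

The comparison is done on integer tuples rather than on strings, so that `5.10.0` is correctly treated as newer than `5.4.2`.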

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.48.2`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
- `AlbertTokenizer` requires the `SentencePiece` library, so we install that as well.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.48.2 optimum==1.24.0 sentencepiece==0.2.0 tensorflow==2.18.0

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT](https://huggingface.co/DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT) model from HuggingFace as an example and export it to ONNX with `optimum-cli`. The exported model is what Optimum represents as an `ORTModelForSequenceClassification`.
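
For reference, the `from_pretrained` / `save_pretrained` route mentioned above could look roughly like this. It is a sketch, not the path this notebook takes; `export_to_onnx` is a hypothetical wrapper name, and it assumes `optimum` and `onnxruntime` are installed.

```python
def export_to_onnx(model_name: str, output_dir: str) -> None:
    """Load a HuggingFace checkpoint through Optimum and write it out as ONNX."""
    # Imported lazily so the function can be defined before `optimum` is installed.
    from optimum.onnxruntime import ORTModelForSequenceClassification

    # export=True asks Optimum to convert the checkpoint to ONNX while loading it.
    ort_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)

    # Writes the ONNX model and its config.json into output_dir.
    ort_model.save_pretrained(output_dir)


# Example call (downloads the checkpoint, so it is left commented):
# export_to_onnx("DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT",
#                "onnx_models/DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT")
```

Unlike the CLI export, this route does not save tokenizer files; in this notebook that makes little difference, because the tokenizer is saved separately in a later step.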

In [2]:
MODEL_NAME = 'DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT'
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

In [3]:
!optimum-cli export onnx --model {MODEL_NAME} {ONNX_MODEL}

2025-02-03 22:05:32.965016: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1738620333.309496    1261 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1738620333.396602    1261 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-03 22:05:34.062495: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
config.json: 100% 1.77k/1.77k [00:00<00:00, 10.5MB/s]
pytorch_model.bin: 100% 891M/891M [00:11<00:00, 76.2MB/s]
tokenizer_con

Let's have a look inside this directory and see what we are dealing with:

In [4]:
!ls -l {ONNX_MODEL}

total 872500
-rw-r--r-- 1 root root      1799 Feb  3 22:06 config.json
-rw-r--r-- 1 root root 891149939 Feb  3 22:07 model.onnx
-rw-r--r-- 1 root root       970 Feb  3 22:06 special_tokens_map.json
-rw-r--r-- 1 root root      1252 Feb  3 22:06 tokenizer_config.json
-rw-r--r-- 1 root root   2272346 Feb  3 22:06 tokenizer.json


We use the tokenizer of the base model (`albert-xxlarge-v2`) because `DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT` does not ship a SentencePiece model file (`spiece.model`), as the listing above shows.

In [None]:
from transformers import AlbertTokenizer

# The exported checkpoint has no spiece.model, so load the base ALBERT tokenizer instead.
try:
    tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
    print("Tokenizer loaded successfully!")
except OSError as e:
    print(f"Error loading tokenizer: {e}")

try:
    tokenizer.save_pretrained(ONNX_MODEL)
    print("Tokenizer saved successfully!")
except Exception as e:
    print(f"Error saving tokenizer: {e}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/710 [00:00<?, ?B/s]

Tokenizer loaded successfully!
Tokenizer saved successfully!


In [6]:
!ls -l {ONNX_MODEL}

total 873244
-rw-r--r-- 1 root root      1799 Feb  3 22:06 config.json
-rw-r--r-- 1 root root 891149939 Feb  3 22:07 model.onnx
-rw-r--r-- 1 root root       286 Feb  3 22:24 special_tokens_map.json
-rw-r--r-- 1 root root    760289 Feb  3 22:24 spiece.model
-rw-r--r-- 1 root root      1277 Feb  3 22:24 tokenizer_config.json
-rw-r--r-- 1 root root   2272346 Feb  3 22:06 tokenizer.json


In [7]:
!mkdir {ONNX_MODEL}/assets

- As you can see, we need to move `spiece.model` from the tokenizer files into an `assets` folder, which is where Spark NLP will look for it.
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them in `labels.txt`.
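
To make the label/id relationship concrete, here is a small sketch of turning an `id2label` mapping into the one-label-per-line content of `labels.txt`, ordered by id. The helper name `labels_in_id_order` is hypothetical, and the sample mapping is illustrative rather than read from the model's config.

```python
def labels_in_id_order(id2label: dict) -> list:
    """Return label names sorted by numeric id.

    Keys may be ints (as loaded by AutoConfig) or strings (as stored in a raw
    config.json), so they are normalised with int() before sorting.
    """
    ordered = sorted(id2label.items(), key=lambda item: int(item[0]))
    return [label for _, label in ordered]


# Illustrative mapping only; the real one comes from config.id2label.
sample = {"1": "contradiction", "0": "entailment"}
labels_txt = "\n".join(labels_in_id_order(sample))
print(labels_txt)
```

Sorting on the numeric id matters once there are ten or more labels, because string keys would otherwise sort `"10"` before `"2"`.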

In [None]:
from transformers import AutoConfig
import os

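# Human-readable names to write out for label ids 0 and 1.
human_readable_labels = ["entailment", "contradiction"]

config = AutoConfig.from_pretrained(MODEL_NAME)
labels = config.id2label

# Walk the ids in order and skip any id that has no name defined above.
mapped_labels = [human_readable_labels[idx] for idx in sorted(labels.keys()) if idx < len(human_readable_labels)]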

assets_path = os.path.join(ONNX_MODEL, "assets")
os.makedirs(assets_path, exist_ok=True)

labels_file = os.path.join(assets_path, "labels.txt")
with open(labels_file, "w") as f:
    f.write("\n".join(mapped_labels))

print(f"Labels saved to: {labels_file}")

In [9]:
!mv {ONNX_MODEL}/spiece.model {ONNX_MODEL}/assets

Voila! We have our `spiece.model` and `labels.txt` inside the `assets` directory.

In [10]:
!ls -lR {ONNX_MODEL}

onnx_models/DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT:
total 872504
drwxr-xr-x 2 root root      4096 Feb  3 22:24 assets
-rw-r--r-- 1 root root      1799 Feb  3 22:06 config.json
-rw-r--r-- 1 root root 891149939 Feb  3 22:07 model.onnx
-rw-r--r-- 1 root root       286 Feb  3 22:24 special_tokens_map.json
-rw-r--r-- 1 root root      1277 Feb  3 22:24 tokenizer_config.json
-rw-r--r-- 1 root root   2272346 Feb  3 22:06 tokenizer.json

onnx_models/DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT/assets:
total 748
-rw-r--r-- 1 root root     21 Feb  3 22:24 labels.txt
-rw-r--r-- 1 root root 760289 Feb  3 22:24 spiece.model


In [11]:
!cat {ONNX_MODEL}/assets/labels.txt

entailment
contradict
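
Before handing the directory to Spark NLP, a quick check that the expected files are in place can save a confusing error later. This is a minimal sketch; `missing_files` and `REQUIRED` are hypothetical names, and the file list is simply taken from the listing above rather than from Spark NLP's documentation.

```python
from pathlib import Path

# Files this notebook produces and relies on (paths relative to the export directory).
REQUIRED = ("model.onnx", "assets/spiece.model", "assets/labels.txt")


def missing_files(export_dir, required=REQUIRED):
    """Return the relative paths from `required` that are not regular files."""
    root = Path(export_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Typical use in this notebook (left commented so the sketch runs anywhere):
# problems = missing_files(ONNX_MODEL)
# assert not problems, f"Missing before import: {problems}"
```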

## Import and Save AlbertForZeroShotClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab

In [None]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Apache Spark version: 3.5.4


- Let's use the `loadSavedModel` function in `AlbertForZeroShotClassification`, which allows us to load the exported ONNX model
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `AlbertForZeroShotClassification` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the exported model directory (here, the ONNX export), and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

- Let's save it on disk so it is easier to be moved around and also be used later via `.load` function

In [None]:
from sparknlp.annotator import AlbertForZeroShotClassification

zero_shot_classifier = AlbertForZeroShotClassification\
  .loadSavedModel(ONNX_MODEL, spark)\
  .setInputCols(["document", "token"])\
  .setOutputCol("class")\
  .setCaseSensitive(False)\
  .setMaxSentenceLength(128)\
  .setCandidateLabels(["urgent", "mobile", "technology"])

In [17]:
zero_shot_classifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))

Let's clean up stuff we don't need anymore

In [18]:
!rm -rf {ONNX_MODEL}
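
It may not be obvious that removing `{ONNX_MODEL}` leaves the saved Spark NLP model untouched. The sketch below resolves the save path used above and shows that it is a sibling of the export directory, not a child of it. The variable names other than `ONNX_MODEL` are introduced here for illustration.

```python
from pathlib import PurePosixPath

ONNX_MODEL = "onnx_models/DAMO-NLP-SG/zero-shot-classify-SSTuning-ALBERT"

# Same expression as in the save cell above.
saved_path = "./{}_spark_nlp_onnx".format(ONNX_MODEL)

export_dir = PurePosixPath(ONNX_MODEL)
spark_dir = PurePosixPath(saved_path)

# Both directories live under onnx_models/DAMO-NLP-SG ...
same_parent = export_dir.parent == spark_dir.parent
# ... and the Spark NLP model is not nested inside the export directory.
nested = export_dir in spark_dir.parents

print(saved_path)
print("same parent:", same_parent, "| nested in export dir:", nested)
```

Because the suffix `_spark_nlp_onnx` is appended to the last path component, `rm -rf {ONNX_MODEL}` only deletes the intermediate ONNX export.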

Awesome 😎  !

This is your AlbertForZeroShotClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [None]:
zero_shot_classifier_loaded = AlbertForZeroShotClassification.load("./{}_spark_nlp_onnx".format(ONNX_MODEL))\
  .setInputCols(["document", "token"])\
  .setOutputCol("multi_class") \
  .setCandidateLabels(["urgent", "mobile", "technology"])

In [20]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 871156
-rw-r--r-- 1 root root 891286053 Feb  3 22:29 albert_classification_onnx
-rw-r--r-- 1 root root    760289 Feb  3 22:29 albert_spp
drwxr-xr-x 3 root root      4096 Feb  3 22:28 fields
drwxr-xr-x 2 root root      4096 Feb  3 22:28 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny AlbertForZeroShotClassification model 😊

This is how you can use your loaded classifier model in Spark NLP 🚀 pipeline:

In [None]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zero_shot_classifier_loaded
])

example = spark.createDataFrame([
    ["I have a problem with my iPhone that needs to be resolved ASAP!"]
]).toDF("text")

result = pipeline.fit(example).transform(example)
result.select("text", "multi_class.result").show()
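
If you want the predictions as plain Python values instead of a printed table, one option is to collect the selected columns and keep the first label per row. This is a sketch under the assumption that each row exposes `text` and `result` (the column names produced by the `select` above); `first_labels` is a hypothetical helper.

```python
def first_labels(rows):
    """Pair each input text with its first predicted label, or None if there is none."""
    pairs = []
    for row in rows:
        labels = row["result"]  # list of predicted labels for this row
        pairs.append((row["text"], labels[0] if labels else None))
    return pairs


# Typical use in this notebook (needs the Spark session, so it is left commented):
# rows = result.select("text", "multi_class.result").collect()
# print(first_labels(rows))
```

`collect()` brings every row to the driver, so this is only sensible for small result sets such as the single example here.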

That's it! You can now go wild and use hundreds of `AlbertForZeroShotClassification` models from HuggingFace 🤗 in Spark NLP 🚀
