![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForZeroShotClassification.ipynb)

## Import ONNX BertForZeroShotClassification models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high performance inference for models.
- `BertForZeroShotClassification` is only available in `Spark NLP 5.2.4` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import BERT models trained/fine-tuned for zero-shot classification via `BertForSequenceClassification` or `TFBertForSequenceClassification`. These models are usually listed under the `Zero-Shot Classification` category and have `bert` in their labels.
- Reference: [TFBertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification)
- Some [example models](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=bert)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.51.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
- BERT uses a WordPiece vocabulary (`vocab.txt`), so unlike ALBERT it does not need SentencePiece to be installed.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.51.3 optimum onnx
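
Since the notes above pin a tested version, it can help to confirm what actually got installed before exporting. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example and is not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages installed by the pip command above
for name, ver in installed_versions(["transformers", "optimum", "onnx"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

If `transformers` reports something other than `4.51.3`, the export may still work, but that combination is not the one tested in this notebook.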

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [aloxatel/bert-base-mnli](https://huggingface.co/aloxatel/bert-base-mnli) model from HuggingFace as an example and load it as an `ORTModelForSequenceClassification`, representing an ONNX model.
- In addition to the BERT model, we also need to save the `BertTokenizer`. This step is the same for every model: the tokenizer files are the assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [9]:
from transformers import BertTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

MODEL_NAME = "aloxatel/bert-base-mnli"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForSequenceClassification.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(EXPORT_PATH)

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

('onnx_models/aloxatel/bert-base-mnli/tokenizer_config.json',
 'onnx_models/aloxatel/bert-base-mnli/special_tokens_map.json',
 'onnx_models/aloxatel/bert-base-mnli/vocab.txt',
 'onnx_models/aloxatel/bert-base-mnli/added_tokens.json')

Let's have a look inside the export directory and see what we are dealing with:

In [3]:
!ls -l {EXPORT_PATH}

total 427980
-rw-r--r-- 1 root root       767 Jun 10 19:18 config.json
-rw-r--r-- 1 root root 438240610 Jun 10 19:18 model.onnx


- We need to move `vocab.txt` from the tokenizer files into the `assets` folder, which is where Spark NLP will look for it.
- We also need the `labels` and their `ids`, which are saved inside the model's config. We will save them to `labels.txt`, one label per line in id order.

In [10]:
!mkdir -p {EXPORT_PATH}/assets

labels = ort_model.config.id2label
sorted_labels = [label for _, label in sorted(labels.items())]

with open(f"{EXPORT_PATH}/assets/labels.txt", "w") as f:
    f.write("\n".join(sorted_labels))

!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets/

In [11]:
!cat {EXPORT_PATH}/assets/labels.txt

contradiction
entailment
neutral
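
The order of the lines in `labels.txt` follows the class ids, so it is worth being careful about how the ids are sorted. With `ort_model.config.id2label` the keys are integers, but if you read `config.json` yourself with `json.load`, the keys are strings and a plain `sorted` would put `"10"` before `"2"`. The sketch below uses a hypothetical helper, `labels_in_id_order`, and a made-up 11-class mapping to show a sort that works for both cases.

```python
def labels_in_id_order(id2label):
    """Return label names sorted by numeric class id (keys may be int or str)."""
    return [label for _, label in sorted(id2label.items(), key=lambda kv: int(kv[0]))]


# Hypothetical mapping with string keys, as json.load would produce it
string_keyed = {str(i): f"LABEL_{i}" for i in range(11)}

naive = [label for _, label in sorted(string_keyed.items())]
print("naive order:  ", naive[:4])    # lexicographic order of the string keys
print("numeric order:", labels_in_id_order(string_keyed)[:4])
```

For the three-class model used here the two orders are identical, so the cell above is fine as written; the difference only shows up with ten or more classes and string keys.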

In [12]:
!ls -lR {EXPORT_PATH}

onnx_models/aloxatel/bert-base-mnli:
total 427988
drwxr-xr-x 2 root root      4096 Jun 10 19:27 assets
-rw-r--r-- 1 root root       767 Jun 10 19:24 config.json
-rw-r--r-- 1 root root 438240610 Jun 10 19:24 model.onnx
-rw-r--r-- 1 root root       125 Jun 10 19:24 special_tokens_map.json
-rw-r--r-- 1 root root      1272 Jun 10 19:24 tokenizer_config.json

onnx_models/aloxatel/bert-base-mnli/assets:
total 232
-rw-r--r-- 1 root root     32 Jun 10 19:27 labels.txt
-rw-r--r-- 1 root root 231508 Jun 10 19:24 vocab.txt


Voila! We have our `vocab.txt` and `labels.txt` inside the `assets` directory.

## Import and Save BertForZeroShotClassification in Spark NLP


Let's install and set up Spark NLP in Google Colab. For this example, we'll use specific versions of `pyspark` and `spark-nlp` that we've already tested with this transformer model to make sure everything runs smoothly.

If you prefer to use the latest versions, feel free to run:

`!pip install -q pyspark spark-nlp`

Just keep in mind that newer versions might have some changes, so you may need to tweak your code a bit if anything breaks.

In [6]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3


Let's start Spark with Spark NLP included via our simple `start()` function

In [7]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `BertForZeroShotClassification`, which allows us to load the ONNX model we exported above.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForZeroShotClassification` at runtime, so don't worry about what you set now.
- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, that is, the `spark` variable we started earlier via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
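
Before calling `loadSavedModel`, a quick pre-flight check of the export directory can save a confusing error later. This is only a sketch: the expected file list is taken from the directory listing shown earlier in this notebook, and `missing_export_files` is a hypothetical helper, not a Spark NLP function.

```python
from pathlib import Path

# Layout assumed from the `ls -lR` output above; adjust if your export differs
EXPECTED_FILES = ("model.onnx", "assets/vocab.txt", "assets/labels.txt")


def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected files (relative paths) that are absent under export_path."""
    root = Path(export_path)
    return [rel for rel in expected if not (root / rel).is_file()]


missing = missing_export_files("onnx_models/aloxatel/bert-base-mnli")
if missing:
    print("Missing from export directory:", missing)
else:
    print("Export directory looks complete")
```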


In [13]:
from sparknlp.annotator import BertForZeroShotClassification

zero_shot_classifier = BertForZeroShotClassification.loadSavedModel(
      EXPORT_PATH,
      spark
      )\
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])

- Let's save it on disk so that it is easier to move around and can be loaded later via the `.load` function.

In [14]:
zero_shot_classifier.write().overwrite().save("./{}_spark_nlp_onnx".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [15]:
!rm -rf {EXPORT_PATH}

Awesome 😎  !

This is your BertForZeroShotClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [16]:
! ls -l {MODEL_NAME}_spark_nlp_onnx

total 428048
-rw-r--r-- 1 root root 438307619 Jun 10 19:28 bert_classification_onnx
drwxr-xr-x 4 root root      4096 Jun 10 19:28 fields
drwxr-xr-x 2 root root      4096 Jun 10 19:28 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForZeroShotClassification model 😊

In [17]:
zero_shot_classifier_loaded = BertForZeroShotClassification.load("./{}_spark_nlp_onnx".format(MODEL_NAME)) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

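You can see which labels the underlying model was trained on via the `getClasses` function. Note that these are the NLI classes of the model, not the candidate labels we set earlier: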

In [18]:
zero_shot_classifier_loaded.getClasses()

['contradiction', 'entailment', 'neutral']
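
These three classes come from natural language inference (NLI). A common way to turn such a model into a zero-shot classifier, used for instance by the HuggingFace zero-shot pipeline, is to pair the text with one hypothesis per candidate label (for example "This example is urgent.") and then compare the `entailment` scores across labels. The sketch below illustrates only that last step in plain Python. The logits are invented for illustration, `pick_label` is a hypothetical helper, and Spark NLP's internal implementation may differ in its details.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(nli_logits, entailment_index=1):
    """Choose the candidate label whose hypothesis is most strongly entailed.

    nli_logits maps each candidate label to the model's logits in the order
    [contradiction, entailment, neutral], matching the classes shown above.
    """
    labels = list(nli_logits)
    entailment = [nli_logits[label][entailment_index] for label in labels]
    scores = softmax(entailment)  # normalise entailment logits across the labels
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], dict(zip(labels, scores))


# Invented logits for one input text and three candidate labels
example_logits = {
    "urgent": [-1.2, 2.9, 0.1],
    "mobile": [0.3, 1.1, 0.4],
    "travel": [2.5, -2.0, 0.6],
}

best_label, scores = pick_label(example_logits)
print(best_label, scores)
```

This is why the model reports `contradiction`, `entailment` and `neutral` as its classes while the pipeline output contains the candidate labels you supplied.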

This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:

In [19]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zero_shot_classifier_loaded
])

sample_texts = [
    ["I have a problem with my iPhone that needs to be resolved ASAP!!"],
    ["Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."],
    ["I have a phone and I love it!"],
    ["I really want to visit Germany and I am planning to go there next year."],
    ["Let's watch some movies tonight! I am in the mood for a horror movie."],
    ["Have you watched the match yesterday? It was a great game!"],
    ["We need to hurry up and get to the airport. We are going to miss our flight!"]
]

input_df = spark.createDataFrame(sample_texts, ["text"])

model = pipeline.fit(input_df)
results = model.transform(input_df)

results.select("text", "class.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------+------------+
|text                                                                                                          |result      |
+--------------------------------------------------------------------------------------------------------------+------------+
|I have a problem with my iPhone that needs to be resolved ASAP!!                                              |[urgent]    |
|Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.|[technology]|
|I have a phone and I love it!                                                                                 |[mobile]    |
|I really want to visit Germany and I am planning to go there next year.                                       |[travel]    |
|Let's watch some movies tonight! I am in the mood for a horror movie.                                         |[movie

That's it! You can now go wild and use hundreds of `BertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀
