![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_RoBertaForSequenceClassification.ipynb)

## Import ONNX RoBertaForSequenceClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `RoBertaForSequenceClassification` with ONNX is only available in `Spark NLP 5.1.4` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import RoBERTa models trained/fine-tuned for sequence classification via `RobertaForSequenceClassification` or `TFRobertaForSequenceClassification`. These models are usually listed under the `Text Classification` category and have `roberta` in their tags.
- Reference: [TFRobertaForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.TFRobertaForSequenceClassification)
- Some [example models](https://huggingface.co/models?filter=roberta&pipeline_tag=text-classification)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We pin `transformers` to version `4.52.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
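
Once the install cell has run, it can help to confirm which version actually ended up in the environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` and the `TESTED` mapping are illustrative and not part of Spark NLP or `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Versions pinned in this notebook (hypothetical mapping, extend as needed).
TESTED = {"transformers": "4.52.3"}

for name, tested in TESTED.items():
    found = installed_version(name)
    status = "matches" if found == tested else "differs from"
    print(f"{name}: installed={found}, {status} tested version {tested}")
```

A mismatch is not necessarily a problem; it only tells you that you are outside the combination that was tested here.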

In [None]:
!pip install -q --upgrade transformers[onnx]==4.52.3 optimum onnx

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model from HuggingFace as an example and load it as an `ORTModelForSequenceClassification`, representing an ONNX model.
- In addition to the RoBERTa model, we also need to save the tokenizer. This is the same for every model: these are the assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [11]:
from transformers import RobertaTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForSequenceClassification.from_pretrained(MODEL_NAME)
ort_model.save_pretrained(ONNX_MODEL)

tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(ONNX_MODEL)

No ONNX files were found for cardiffnlp/twitter-roberta-base-sentiment-latest, setting `export=True` to convert the model to ONNX. Don't forget to save the resulting model with `.save_pretrained()`
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/tokenizer_config.json',
 'onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/special_tokens_map.json',
 'onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/vocab.json',
 'onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/merges.txt',
 'onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/added_tokens.json')

Let's have a look inside this directory and see what we are dealing with:

In [12]:
!ls -l {ONNX_MODEL}

total 488660
-rw-r--r-- 1 root root       844 Jun 16 16:38 config.json
-rw-r--r-- 1 root root    456318 Jun 16 16:39 merges.txt
-rw-r--r-- 1 root root 498911192 Jun 16 16:39 model.onnx
-rw-r--r-- 1 root root       958 Jun 16 16:39 special_tokens_map.json
-rw-r--r-- 1 root root      1250 Jun 16 16:39 tokenizer_config.json
-rw-r--r-- 1 root root    999355 Jun 16 16:39 vocab.json


- We need to convert `vocab.json` to a plain `vocab.txt` format, as required by Spark NLP.
- Move both `vocab.txt` and `merges.txt` into the assets folder.
- Additionally, we need to extract label-to-ID mappings from the model config and save them as `labels.txt` in the same folder for Spark NLP to use during inference.
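
The `vocab.json` to `vocab.txt` conversion can be sketched in a way that does not depend on the key order of the JSON file. This assumes that the line position in `vocab.txt` is what ties a token to its id, so tokens are written sorted by id. The helper names `vocab_lines` and `write_vocab_txt` and the tiny example vocabulary are made up for illustration.

```python
import json
from pathlib import Path


def vocab_lines(vocab: dict) -> list:
    """Return the tokens ordered by their integer id, one entry per output line."""
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def write_vocab_txt(vocab_json: Path, vocab_txt: Path) -> int:
    """Convert a token->id JSON vocabulary into a plain text file; return the token count."""
    vocab = json.loads(vocab_json.read_text(encoding="utf-8"))
    lines = vocab_lines(vocab)
    vocab_txt.parent.mkdir(parents=True, exist_ok=True)
    vocab_txt.write_text("\n".join(lines), encoding="utf-8")
    return len(lines)


# Made-up vocabulary whose JSON order deliberately differs from the id order.
example = {"</s>": 2, "<s>": 0, "<pad>": 1, "Ġhello": 3}
print(vocab_lines(example))  # ['<s>', '<pad>', '</s>', 'Ġhello']
```

For a `vocab.json` that is already stored in id order, this produces the same file as joining the keys directly.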

In [13]:
import json

!mkdir -p {ONNX_MODEL}/assets && mv {ONNX_MODEL}/merges.txt {ONNX_MODEL}/assets/

with open(f"{ONNX_MODEL}/vocab.json") as f, open(f"{ONNX_MODEL}/assets/vocab.txt", "w") as out:
    out.write("\n".join(json.load(f)))

with open(f"{ONNX_MODEL}/assets/labels.txt", "w") as f:
    f.write("\n".join(ort_model.config.id2label[k] for k in sorted(ort_model.config.id2label)))

In [14]:
!ls -lR {ONNX_MODEL}

onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest:
total 488216
drwxr-xr-x 2 root root      4096 Jun 16 16:39 assets
-rw-r--r-- 1 root root       844 Jun 16 16:38 config.json
-rw-r--r-- 1 root root 498911192 Jun 16 16:39 model.onnx
-rw-r--r-- 1 root root       958 Jun 16 16:39 special_tokens_map.json
-rw-r--r-- 1 root root      1250 Jun 16 16:39 tokenizer_config.json
-rw-r--r-- 1 root root    999355 Jun 16 16:39 vocab.json

onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest/assets:
total 852
-rw-r--r-- 1 root root     25 Jun 16 16:39 labels.txt
-rw-r--r-- 1 root root 456318 Jun 16 16:39 merges.txt
-rw-r--r-- 1 root root 407064 Jun 16 16:39 vocab.txt


In [8]:
!cat {ONNX_MODEL}/assets/labels.txt

negative
neutral
positive

Voila! We now have `vocab.txt`, `merges.txt` and `labels.txt` inside the assets directory.
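
To make the role of `labels.txt` concrete, here is a small sketch of how a line position can be mapped back to a label, assuming that line `i` of the file corresponds to class id `i`. The helper names and the scores are made up for illustration; the keys are converted with `int` because a raw `config.json` stores them as strings, where plain string sorting would put `"10"` before `"2"`.

```python
def labels_in_id_order(id2label: dict) -> list:
    """Order labels by numeric class id; keys may be ints or strings such as "0"."""
    return [label for _, label in sorted(id2label.items(), key=lambda kv: int(kv[0]))]


def predicted_label(scores: list, labels: list) -> str:
    """Return the label whose position matches the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Mapping as it could appear in a config.json, deliberately out of order.
id2label = {"2": "positive", "0": "negative", "1": "neutral"}
labels = labels_in_id_order(id2label)

print(labels)                                    # ['negative', 'neutral', 'positive']
print(predicted_label([0.1, 0.2, 0.7], labels))  # positive
```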

## Import and Save RoBertaForSequenceClassification in Spark NLP


- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [15]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3


Let's start Spark with Spark NLP included via our simple `start()` function

In [16]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 5.5.3
Apache Spark version: 3.5.4


- Let's use the `loadSavedModel` function in `RoBertaForSequenceClassification` to load the exported model. Despite its name, here it is given the ONNX model directory we just prepared rather than a TensorFlow SavedModel.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `RoBertaForSequenceClassification` at runtime, so don't worry about the values you set now.
- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, which is the `spark` variable we started earlier via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
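
Before calling `loadSavedModel`, a quick pre-flight check of the export directory can catch a missing asset early. The sketch below assumes the layout shown in the directory listing above (`model.onnx` plus the three files in `assets`); the helper `missing_export_files` and the `EXPECTED_FILES` tuple are illustrative and not part of Spark NLP.

```python
from pathlib import Path

# Assumed layout, taken from the directory listing earlier in this notebook.
EXPECTED_FILES = (
    "model.onnx",
    "assets/vocab.txt",
    "assets/merges.txt",
    "assets/labels.txt",
)


def missing_export_files(model_dir: str) -> list:
    """Return the expected files (relative paths) that are absent from model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_export_files("onnx_models/cardiffnlp/twitter-roberta-base-sentiment-latest")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Export directory looks complete.")
```

The check only looks at file presence; it does not validate the contents of the ONNX graph or the assets.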

In [17]:
from sparknlp.annotator import RoBertaForSequenceClassification

sequenceClassifier = RoBertaForSequenceClassification.loadSavedModel(
    ONNX_MODEL,
    spark
)\
  .setInputCols(["document", "token"])\
  .setOutputCol("class")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)

- Let's save it on disk so it is easier to move around and can be loaded later via the `.load` function.

In [18]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))

Let's clean up stuff we don't need anymore

In [19]:
!rm -rf {ONNX_MODEL}

Awesome 😎  !

This is your RoBertaForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [20]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 487308
drwxr-xr-x 5 root root      4096 Jun 16 16:42 fields
drwxr-xr-x 2 root root      4096 Jun 16 16:42 metadata
-rw-r--r-- 1 root root 498987456 Jun 16 16:42 roberta_classification_onnx


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny RoBertaForSequenceClassification model 😊

In [21]:
sequenceClassifier_loaded = RoBertaForSequenceClassification.load("./{}_spark_nlp_onnx".format(ONNX_MODEL))\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")

You can see what labels were used to train this model via `getClasses` function:

In [22]:
sequenceClassifier_loaded.getClasses()

['neutral', 'positive', 'negative']

This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:

In [23]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier_loaded
])

data = spark.createDataFrame([
    ["I love you!"],
    ["Kill yourself"]
], ["text"])

result = pipeline.fit(data).transform(data)
result.select("text", "class.result").show(truncate=False)

+-------------+----------+
|text         |result    |
+-------------+----------+
|I love you!  |[negative]|
|Kill yourself|[neutral] |
+-------------+----------+



That's it! You can now go wild and use hundreds of `RoBertaForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀
