![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_MPNetForSequenceClassification.ipynb)

## Import ONNX MPNetForSequenceClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `MPNetForSequenceClassification` is only available in `Spark NLP 5.2.4` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import MPNet models trained/fine-tuned for text classification via `SetFitModel` from the `setfit` package. On HuggingFace, these models are usually under the `Text Classification` category and have `mpnet` in their labels. Other models are currently not supported.
- Some [example models](https://huggingface.co/models?pipeline_tag=text-classification&other=mpnet)
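
The version requirement above can be checked with a small helper. This is only a sketch: the function names are hypothetical, and it assumes version strings in plain `MAJOR.MINOR.PATCH` form (no suffixes such as `rc1`). In a live session the installed version would come from `sparknlp.version()`.

```python
MIN_SPARK_NLP = "5.2.4"  # first release with MPNetForSequenceClassification

def version_tuple(version):
    """Turn '5.2.4' into (5, 2, 4) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split(".")[:3])

def supports_mpnet_classifier(installed):
    """True if the installed Spark NLP version is at least MIN_SPARK_NLP."""
    return version_tuple(installed) >= version_tuple(MIN_SPARK_NLP)

print(supports_mpnet_classifier("5.5.3"))   # True
print(supports_mpnet_classifier("5.1.0"))   # False
print(supports_mpnet_classifier("5.10.0"))  # True (a plain string comparison would get this wrong)
```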

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. Spark NLP itself does not need `onnx` to be installed; we only need it to load and save models from HuggingFace.
- We pin `transformers` to version `4.51.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
- Additionally, we need to install `setfit` to load the model components.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.51.3 setfit

- We'll use [rodekruis/sml-ukr-message-classifier](https://huggingface.co/rodekruis/sml-ukr-message-classifier). As this is not a pure `transformers` model, we need to export the modules separately and combine them.

In [3]:
from setfit import SetFitModel
from transformers import AutoTokenizer

MODEL_NAME = "rodekruis/sml-ukr-message-classifier"
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

model = SetFitModel.from_pretrained(MODEL_NAME)
model.save_pretrained(ONNX_MODEL)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(ONNX_MODEL)

('onnx_models/rodekruis/sml-ukr-message-classifier/tokenizer_config.json',
 'onnx_models/rodekruis/sml-ukr-message-classifier/special_tokens_map.json',
 'onnx_models/rodekruis/sml-ukr-message-classifier/vocab.txt',
 'onnx_models/rodekruis/sml-ukr-message-classifier/added_tokens.json',
 'onnx_models/rodekruis/sml-ukr-message-classifier/tokenizer.json')

## Exporting the Tokenizer

Let's have a look inside the export directory and see what we are dealing with:

In [4]:
!ls -l {ONNX_MODEL}

total 428848
drwxr-xr-x 2 root root      4096 Jun 16 00:07 1_Pooling
drwxr-xr-x 2 root root      4096 Jun 16 00:07 2_Normalize
-rw-r--r-- 1 root root       551 Jun 16 00:07 config.json
-rw-r--r-- 1 root root       205 Jun 16 00:07 config_sentence_transformers.json
-rw-r--r-- 1 root root        53 Jun 16 00:07 config_setfit.json
-rw-r--r-- 1 root root    179487 Jun 16 00:07 model_head.pkl
-rw-r--r-- 1 root root 437967672 Jun 16 00:07 model.safetensors
-rw-r--r-- 1 root root       349 Jun 16 00:07 modules.json
-rw-r--r-- 1 root root      4047 Jun 16 00:07 README.md
-rw-r--r-- 1 root root        53 Jun 16 00:07 sentence_bert_config.json
-rw-r--r-- 1 root root       964 Jun 16 00:07 special_tokens_map.json
-rw-r--r-- 1 root root      1632 Jun 16 00:07 tokenizer_config.json
-rw-r--r-- 1 root root    710932 Jun 16 00:07 tokenizer.json
-rw-r--r-- 1 root root    231536 Jun 16 00:07 vocab.txt


- As you can see, `vocab.txt` sits at the top level. We need to move it into an `assets` folder, which is where Spark NLP will look for it.
- We also need the `labels`. These are not contained in the model itself, so we have to fetch them manually and save them as `labels.txt`.

In [5]:
!mkdir -p {ONNX_MODEL}/assets
!mv {ONNX_MODEL}/vocab.txt {ONNX_MODEL}/assets/
!wget https://huggingface.co/{MODEL_NAME}/raw/main/label_dict.json

import json
with open("label_dict.json") as f:
    labels = json.load(f)

labels = [value for key, value in sorted(labels.items(), key=lambda x: int(x[0]))]

with open(f"{ONNX_MODEL}/assets/labels.txt", "w") as f:
    f.write("\n".join(labels))

--2025-06-16 00:08:46--  https://huggingface.co/rodekruis/sml-ukr-message-classifier/raw/main/label_dict.json
Resolving huggingface.co (huggingface.co)... 3.168.73.111, 3.168.73.129, 3.168.73.106, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 589 [text/plain]
Saving to: ‘label_dict.json’


2025-06-16 00:08:46 (178 MB/s) - ‘label_dict.json’ saved [589/589]



In [6]:
!ls -l {ONNX_MODEL}/assets

total 232
-rw-r--r-- 1 root root    337 Jun 16 00:08 labels.txt
-rw-r--r-- 1 root root 231536 Jun 16 00:07 vocab.txt


In [7]:
!cat {ONNX_MODEL}/assets/labels.txt

ANOMALY
ARMY
CHILDREN
CONNECTIVITY
RC CONNECT WITH RED CROSS
EDUCATION
FOOD
GOODS/SERVICES
HEALTH
CVA INCLUSION
LEGAL
MONEY/BANKING
NFI
OTHER PROGRAMS/NGOS
PARCEL
CVA PAYMENT
PETS
RC PMER/NEW PROGRAMS
CVA PROGRAM INFO
RC PROGRAM INFO
PSS & RFL
CVA REGISTRATION
SENTIMENT
SHELTER
TRANSLATION/LANGUAGE
CAR
TRANSPORT/MOVEMENT
WASH
WORK/JOBS

Voila! We have our `vocab.txt` and `labels.txt` inside the `assets` directory.
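
The label cell above sorts the dictionary keys with `int(...)` rather than as strings. The sketch below shows why that matters once there are ten or more labels. The small mapping is a hypothetical excerpt in the assumed shape of `label_dict.json` (string index to label name), not the real file.

```python
def ordered_labels(label_dict):
    """Return label names ordered by their numeric index."""
    return [name for _, name in sorted(label_dict.items(), key=lambda kv: int(kv[0]))]

# Hypothetical excerpt: keys are string indices, values are label names.
sample = {"10": "LEGAL", "2": "CHILDREN", "0": "ANOMALY", "1": "ARMY"}

numeric = ordered_labels(sample)
lexicographic = [sample[key] for key in sorted(sample)]  # plain string sort

print(numeric)        # ['ANOMALY', 'ARMY', 'CHILDREN', 'LEGAL']
print(lexicographic)  # ['ANOMALY', 'ARMY', 'LEGAL', 'CHILDREN']
```

With a string sort, `"10"` lands before `"2"`, so the line position in `labels.txt` would no longer match the index the classifier head predicts.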

## Combining and exporting the SetFit Modules

The `SetFitModel` is composed of the following components, all of which we need to export:

1. MPNet Embeddings Model
2. Pooling Module
3. Normalization Module
4. Prediction Module
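
To make steps 2 to 4 concrete, here is a minimal pure-Python sketch of what they compute on toy numbers. It assumes mean pooling over non-padding tokens (the actual pooling mode is whatever the model's `1_Pooling` configuration specifies) and a linear prediction head; the token vectors stand in for the MPNet embeddings of step 1. All function names are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, so padding is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def linear_head(vector, coeffs, intercept):
    """logits[c] = dot(vector, coeffs[c]) + intercept[c], one coeffs row per class."""
    return [sum(v * w for v, w in zip(vector, row)) + b
            for row, b in zip(coeffs, intercept)]

# Toy input: 3 tokens (the last is padding), 2-dim embeddings, 2 classes.
token_vectors = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
coeffs = [[1.0, 0.0], [0.0, 1.0]]
intercept = [0.5, -0.5]

pooled = mean_pool(token_vectors, attention_mask)  # [3.0, 4.0]
normed = l2_normalize(pooled)                      # [0.6, 0.8]
logits = linear_head(normed, coeffs, intercept)    # roughly [1.1, 0.3]
print(pooled, normed, logits)
```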

We first wrap these components in a custom torch module so that they can be exported as a single ONNX graph.

In [8]:
import torch
from torch import nn

# Define a custom model class that replicates the SetFit prediction flow
class SentencePredictor(nn.Module):
    def __init__(self, model):
        super().__init__()

        # Extract linear classifier parameters
        self.coeffs = torch.Tensor(model.model_head.coef_)
        self.intercept = torch.Tensor(model.model_head.intercept_)

        # Unpack the transformer backbone and pooling layers
        self.embeddings, self.pooling, self.normalize = model.model_body

    def predict(self, normed_embeddings):
        # Apply linear layer manually
        logits = normed_embeddings @ self.coeffs.T + self.intercept
        return logits

    def forward(self, input_ids, attention_mask):
        features = {"input_ids": input_ids, "attention_mask": attention_mask}
        embeddings_out = self.embeddings(features)
        pooled = self.pooling(embeddings_out)
        normed = self.normalize(pooled)
        logits = self.predict(normed["sentence_embedding"])
        return {"logits": logits}

# Instantiate the model
sp = SentencePredictor(model)

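# Tokenize a small example batch to trace the model with
inputs = model.model_body.tokenize([
    "i loved the spiderman movie!",
    "pineapple on pizza is the worst 🤮"
])

# Export the model to ONNX, passing the tensors in the order forward() expects
torch.onnx.export(
    sp,
    args=(inputs["input_ids"], inputs["attention_mask"]),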
    f=f"{ONNX_MODEL}/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "token_length"},
        "attention_mask": {0: "batch_size", 1: "token_length"},
        "logits": {0: "batch_size"},
    },
)

Now we have the model and all necessary files to import it into Spark NLP!

In [9]:
!ls -lR {ONNX_MODEL}

onnx_models/rodekruis/sml-ukr-message-classifier:
total 854384
drwxr-xr-x 2 root root      4096 Jun 16 00:07 1_Pooling
drwxr-xr-x 2 root root      4096 Jun 16 00:07 2_Normalize
drwxr-xr-x 2 root root      4096 Jun 16 00:08 assets
-rw-r--r-- 1 root root       551 Jun 16 00:07 config.json
-rw-r--r-- 1 root root       205 Jun 16 00:07 config_sentence_transformers.json
-rw-r--r-- 1 root root        53 Jun 16 00:07 config_setfit.json
-rw-r--r-- 1 root root    179487 Jun 16 00:07 model_head.pkl
-rw-r--r-- 1 root root 435970222 Jun 16 00:09 model.onnx
-rw-r--r-- 1 root root 437967672 Jun 16 00:07 model.safetensors
-rw-r--r-- 1 root root       349 Jun 16 00:07 modules.json
-rw-r--r-- 1 root root      4047 Jun 16 00:07 README.md
-rw-r--r-- 1 root root        53 Jun 16 00:07 sentence_bert_config.json
-rw-r--r-- 1 root root       964 Jun 16 00:07 special_tokens_map.json
-rw-r--r-- 1 root root      1632 Jun 16 00:07 tokenizer_config.json
-rw-r--r-- 1 root root    710932 Jun 16 00:07 tokenizer.json
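
Before importing, it can help to confirm that the files prepared in the previous steps are all in place. The helper below is a sketch: `missing_files` is a hypothetical name, and the file list reflects what this notebook creates rather than an official Spark NLP requirement list. The demo runs on a temporary directory so it is self-contained.

```python
import tempfile
from pathlib import Path

# Files this notebook prepares before calling loadSavedModel (not an official list).
REQUIRED_FILES = ["model.onnx", "assets/vocab.txt", "assets/labels.txt"]

def missing_files(export_dir):
    """Return the required files that are absent under export_dir."""
    root = Path(export_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]

# Self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_files(tmp)  # nothing exists yet
    for rel in REQUIRED_FILES:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    after = missing_files(tmp)   # everything is now present

print(before)  # ['model.onnx', 'assets/vocab.txt', 'assets/labels.txt']
print(after)   # []
```

In the notebook itself, `missing_files(ONNX_MODEL)` should return an empty list at this point.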

## Import and Save MPNetForSequenceClassification in Spark NLP


- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [10]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [12]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `MPNetForSequenceClassification`, which allows us to load the ONNX model we just exported.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `MPNetForSequenceClassification` at runtime, so don't worry about what you set them to now.
- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.



In [13]:
from sparknlp.annotator import MPNetForSequenceClassification

sequenceClassifier = (
    MPNetForSequenceClassification.loadSavedModel(ONNX_MODEL, spark)
    .setInputCols(["document", "token"])
    .setOutputCol("label")
)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [14]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))

Let's clean up stuff we don't need anymore

In [15]:
!rm -rf {ONNX_MODEL}

Awesome 😎  !

This is your MPNetForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [16]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 425832
drwxr-xr-x 4 root root      4096 Jun 16 00:17 fields
drwxr-xr-x 2 root root      4096 Jun 16 00:17 metadata
-rw-r--r-- 1 root root 436036881 Jun 16 00:17 mpnet_classification_onnx


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny MPNetForSequenceClassification model 😊

In [17]:
sequenceClassifier_loaded = (
    MPNetForSequenceClassification.load("./{}_spark_nlp_onnx".format(ONNX_MODEL))
    .setInputCols(["document", "token"])
    .setOutputCol("label")
)

You can see what labels were used to train this model via `getClasses` function:

In [18]:
sequenceClassifier_loaded.getClasses()

['GOODS/SERVICES',
 'EDUCATION',
 'SHELTER',
 'OTHER PROGRAMS/NGOS',
 'RC PROGRAM INFO',
 'CVA REGISTRATION',
 'CAR',
 'ARMY',
 'PSS & RFL',
 'CVA PAYMENT',
 'CHILDREN',
 'CONNECTIVITY',
 'CVA INCLUSION',
 'FOOD',
 'HEALTH',
 'TRANSLATION/LANGUAGE',
 'LEGAL',
 'CVA PROGRAM INFO',
 'PETS',
 'MONEY/BANKING',
 'WORK/JOBS',
 'RC CONNECT WITH RED CROSS',
 'PARCEL',
 'TRANSPORT/MOVEMENT',
 'NFI',
 'ANOMALY',
 'RC PMER/NEW PROGRAMS',
 'WASH',
 'SENTIMENT']
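
Note that the list returned above is not in the same order as `labels.txt`, so a quick sanity check should compare the two as sets rather than position by position. A minimal sketch, using a hypothetical helper name and a three-label excerpt from the outputs shown earlier:

```python
def same_label_set(expected, actual):
    """True if both lists hold exactly the same labels, ignoring order."""
    return len(expected) == len(actual) and set(expected) == set(actual)

# Excerpts only: the first lines of labels.txt and the same labels in another order.
from_labels_txt = ["ANOMALY", "ARMY", "CHILDREN"]
from_get_classes = ["ARMY", "CHILDREN", "ANOMALY"]

print(same_label_set(from_labels_txt, from_get_classes))      # True
print(same_label_set(from_labels_txt, ["ARMY", "CHILDREN"]))  # False
```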

This is how you can use your loaded classifier model in Spark NLP 🚀 pipeline:

In [26]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier_loaded
])

data = [
    ("Where can I find food today?", "FOOD"),
    ("I need a safe place to sleep tonight.", "SHELTER"),
    ("My payment didn’t arrive, can you check?", "CVA PAYMENT"),
]

df = spark.createDataFrame(data, ["text", "expected_label"])

result = pipeline.fit(df).transform(df)
result.select("text", "expected_label", "label.result").show(truncate=False)

+----------------------------------------+--------------+-------------+
|text                                    |expected_label|result       |
+----------------------------------------+--------------+-------------+
|Where can I find food today?            |FOOD          |[FOOD]       |
|I need a safe place to sleep tonight.   |SHELTER       |[SHELTER]    |
|My payment didn’t arrive, can you check?|CVA PAYMENT   |[CVA PAYMENT]|
+----------------------------------------+--------------+-------------+
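
For more than a handful of rows, eyeballing the table stops being practical. The sketch below computes a simple accuracy from `(expected_label, result)` pairs; the `accuracy` helper is hypothetical, and the rows are copied from the table above. Because `label.result` is an array column, the first element is taken as the prediction.

```python
def accuracy(rows):
    """Share of rows whose first predicted label equals the expected label."""
    hits = sum(1 for expected, predicted in rows
               if predicted and predicted[0] == expected)
    return hits / len(rows)

# (expected_label, label.result) pairs as shown in the output table.
rows = [
    ("FOOD", ["FOOD"]),
    ("SHELTER", ["SHELTER"]),
    ("CVA PAYMENT", ["CVA PAYMENT"]),
]

print(accuracy(rows))  # 1.0
```

In the notebook, such pairs could be obtained with `result.select("expected_label", "label.result").collect()`, which brings the rows to the driver and is therefore only suitable for small test sets.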



That's it! You can now go wild and use hundreds of `MPNetForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀
