![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Marian.ipynb)

## Import ONNX Marian models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `MarianTransformer` is only available in `Spark NLP 5.2.0` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import Marian models via `MarianMTModel`. These models are usually under the `Text2Text Generation` category and have `marian` in their labels.
- Reference: [MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)
- Some [example models](https://huggingface.co/models?other=marian)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. Spark NLP itself does not need `onnx` to be installed; we need it only to load and save models from HuggingFace.
- We pin `transformers` to version `4.48.3`. This doesn't mean it won't work with future releases.
- We will also need `sentencepiece` for tokenization.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.48.3 optimum onnx

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models.
- We'll use the [Helsinki-NLP/opus-mt-en-bg](https://huggingface.co/Helsinki-NLP/opus-mt-en-bg) model from HuggingFace as an example and export it with `optimum-cli`.
- If we want to optimize the model, a GPU will be needed. Make sure to select the correct runtime.
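
Outside a notebook, the same export can be scripted. The sketch below only assembles the `optimum-cli` command used in the next cell, with the `--task`, `--model`, and `--optimize` flags exactly as they appear in this notebook. The helper name `build_export_command` is hypothetical, and actually running the command assumes `optimum` is installed and the model can be downloaded.

```python
import shlex

def build_export_command(model_name, export_path, optimize=None):
    """Assemble the optimum-cli ONNX export command as an argument list.

    `optimize` is an optional optimization level such as "O2"; leave it as
    None to export without optimization.
    """
    cmd = [
        "optimum-cli", "export", "onnx",
        "--task", "text2text-generation-with-past",
        "--model", model_name,
    ]
    if optimize is not None:
        cmd += ["--optimize", optimize]
    cmd.append(export_path)
    return cmd

cmd = build_export_command("Helsinki-NLP/opus-mt-en-bg", "onnx_models/mt_en_bg_onnx")
print(shlex.join(cmd))

# To actually run the export (assumes optimum is installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when a path contains spaces.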


In [2]:
MODEL_NAME = "Helsinki-NLP/opus-mt-en-bg"
EXPORT_PATH = "onnx_models/mt_en_bg_onnx"

# Export with optimization (O2): uncomment to enable
# !optimum-cli export onnx --task text2text-generation-with-past --model {MODEL_NAME} --optimize O2 {EXPORT_PATH}

# Note: Optimization (O2) may crash ONNX Runtime for T5-based models due to a known bug.
# Workarounds:
# 1. Manually patch ONNX Runtime (onnx_model_bert.py): comment out the head/hidden size assertion.
# 2. Skip optimization and export as-is (recommended for T5):

!optimum-cli export onnx --task text2text-generation-with-past --model {MODEL_NAME} {EXPORT_PATH}


Let's have a look inside the export directory and see what we are dealing with:

In [3]:
!ls -l {EXPORT_PATH}

total 861024
-rw-r--r-- 1 root root      1378 Jun 15 19:00 config.json
-rw-r--r-- 1 root root 229119665 Jun 15 19:01 decoder_model_merged.onnx
-rw-r--r-- 1 root root 228868277 Jun 15 19:00 decoder_model.onnx
-rw-r--r-- 1 root root 216211747 Jun 15 19:01 decoder_with_past_model.onnx
-rw-r--r-- 1 root root 203204410 Jun 15 19:00 encoder_model.onnx
-rw-r--r-- 1 root root       288 Jun 15 19:00 generation_config.json
-rw-r--r-- 1 root root    791438 Jun 15 19:00 source.spm
-rw-r--r-- 1 root root        74 Jun 15 19:00 special_tokens_map.json
-rw-r--r-- 1 root root    999053 Jun 15 19:00 target.spm
-rw-r--r-- 1 root root       849 Jun 15 19:00 tokenizer_config.json
-rw-r--r-- 1 root root   2451253 Jun 15 19:00 vocab.json


- We need to move the SentencePiece models (`*.spm`) from the tokenizer into an `assets` folder, which is where Spark NLP will look for them.
- We also need to process `vocab.json` for the tokenizer vocabulary. The Spark NLP annotator expects a `vocab.txt` with one token per line.

In [4]:
!mkdir -p {EXPORT_PATH}/assets
!mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.spm

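import json

# vocab.json maps each token to its ID; write the tokens out one per line
with open(f"{EXPORT_PATH}/vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

with open(f"{EXPORT_PATH}/assets/vocab.txt", "w", encoding="utf-8") as f:
    for token in vocab:
        print(token, file=f)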

In [5]:
!ls -l {EXPORT_PATH}/assets

total 2528
-rw-r--r-- 1 root root 791438 Jun 15 19:00 source.spm
-rw-r--r-- 1 root root 999053 Jun 15 19:00 target.spm
-rw-r--r-- 1 root root 792353 Jun 15 19:03 vocab.txt
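
As a quick sanity check on the conversion above, the sketch below verifies that `vocab.txt` lists the tokens of `vocab.json` in ascending ID order, one per line. That the line order should follow the token IDs is an assumption about what the annotator expects, not something this notebook states. The helper name `vocab_lines_match_ids` and the toy vocabulary are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def vocab_lines_match_ids(vocab_json_path, vocab_txt_path):
    """Return True if vocab.txt lists the tokens in ascending ID order."""
    with open(vocab_json_path, encoding="utf-8") as f:
        token_to_id = json.load(f)
    with open(vocab_txt_path, encoding="utf-8") as f:
        lines = f.read().split("\n")
    # A trailing newline leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    expected = sorted(token_to_id, key=token_to_id.get)
    return lines == expected

# Demonstration on a throwaway directory with a toy vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    vocab_json = Path(tmp) / "vocab.json"
    vocab_txt = Path(tmp) / "vocab.txt"
    toy_vocab = {"</s>": 0, "<unk>": 1, "▁Rome": 2, "<pad>": 3}
    vocab_json.write_text(json.dumps(toy_vocab), encoding="utf-8")
    vocab_txt.write_text("\n".join(toy_vocab) + "\n", encoding="utf-8")
    in_order = vocab_lines_match_ids(vocab_json, vocab_txt)

print(in_order)
```

In this notebook it would be called as `vocab_lines_match_ids(f"{EXPORT_PATH}/vocab.json", f"{EXPORT_PATH}/assets/vocab.txt")`.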


## Import and Save Marian in Spark NLP

- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [6]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [7]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `MarianTransformer`, which allows us to load the ONNX model
- Most params will be set automatically. They can also be set later at runtime, after the model is loaded in `MarianTransformer`, so don't worry about setting them now
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we started earlier via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move, share, or reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.
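
Before calling `loadSavedModel`, it can help to confirm that the export directory has the shape prepared earlier. The sketch below checks for a set of files using only the standard library. The list in `EXPECTED_FILES` simply mirrors the directory listings shown in this notebook; which of these files Spark NLP actually reads is an assumption here, and the helper name `missing_export_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed layout, copied from the directory listings in this notebook.
EXPECTED_FILES = [
    "encoder_model.onnx",
    "decoder_model.onnx",
    "assets/source.spm",
    "assets/target.spm",
    "assets/vocab.txt",
]

def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected files that are absent under export_path."""
    root = Path(export_path)
    return [name for name in expected if not (root / name).is_file()]

# Demonstration on a throwaway directory that lacks vocab.txt.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "assets").mkdir()
    for name in EXPECTED_FILES[:-1]:
        (root / name).touch()
    missing = missing_export_files(root)

print(missing)  # ['assets/vocab.txt']
```

An empty list from `missing_export_files(EXPORT_PATH)` would suggest the directory is ready to be loaded.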

In [8]:
from sparknlp.annotator import MarianTransformer

marian = MarianTransformer.loadSavedModel(EXPORT_PATH, spark)

Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [9]:
marian.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [10]:
!rm -rf {EXPORT_PATH}

Awesome 😎!

This is your ONNX Marian model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [11]:
! ls -l {MODEL_NAME}_spark_nlp

total 424020
-rw-r--r-- 1 root root 229154794 Jun 15 19:06 decoder.onxx
-rw-r--r-- 1 root root 203235570 Jun 15 19:06 encoder.onxx
-rw-r--r-- 1 root root    791438 Jun 15 19:06 marian_spp_src
-rw-r--r-- 1 root root    999053 Jun 15 19:06 marian_spp_trg
drwxr-xr-x 2 root root      4096 Jun 15 19:06 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny Marian model 😊

In [12]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MarianTransformer
from pyspark.ml import Pipeline

test_data = spark.createDataFrame([
    (1, "Rome (Italian and Latin: Roma [ˈroːma] ⓘ) is the capital city of Italy...")
]).toDF("id", "text")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

marian = MarianTransformer.load(f"{MODEL_NAME}_spark_nlp") \
    .setInputCols(["document"]) \
    .setOutputCol("translation") \
    .setMaxInputLength(512)

pipeline = Pipeline().setStages([document_assembler, marian])
result = pipeline.fit(test_data).transform(test_data)

result.select("translation.result").show(truncate=False)

+----------------------------------------------------+
|result                                              |
+----------------------------------------------------+
|[(Италия: Роми [;rooma] и) е столицата на Италия...]|
+----------------------------------------------------+



That's it! You can now go wild and use hundreds of Marian models from HuggingFace 🤗 in Spark NLP 🚀
