![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb)

# Import ONNX Whisper models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- The Whisper model was introduced in `Spark NLP 5.1.0` and requires Spark version 3.4.1 or later.
- Official models are supported, but not all custom models may work.

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` installed to run Spark NLP; we only need it here to load and export models from HuggingFace.
- We lock `transformers` on version `4.52.3`. This doesn't mean it won't work with future releases, but we want you to know which version has been tested successfully.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.52.3 optimum onnx

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models.
- We'll use the [whisper-tiny](https://huggingface.co/openai/whisper-tiny) model from HuggingFace as an example and export it with the `optimum-cli`.

In [None]:
MODEL_NAME = "openai/whisper-tiny"
EXPORT_PATH = f"export_onnx/{MODEL_NAME}"

! optimum-cli export onnx --model {MODEL_NAME} {EXPORT_PATH}

Let's have a look inside the export directory and see what we are dealing with:

In [3]:
!ls -l {EXPORT_PATH}

total 380152
-rw-r--r-- 1 root root     34604 Jun 26 06:12 added_tokens.json
-rw-r--r-- 1 root root      1327 Jun 26 06:12 config.json
-rw-r--r-- 1 root root 118509614 Jun 26 06:12 decoder_model_merged.onnx
-rw-r--r-- 1 root root 118364554 Jun 26 06:12 decoder_model.onnx
-rw-r--r-- 1 root root 113627714 Jun 26 06:12 decoder_with_past_model.onnx
-rw-r--r-- 1 root root  32894170 Jun 26 06:12 encoder_model.onnx
-rw-r--r-- 1 root root      3742 Jun 26 06:12 generation_config.json
-rw-r--r-- 1 root root    493869 Jun 26 06:12 merges.txt
-rw-r--r-- 1 root root     52666 Jun 26 06:12 normalizer.json
-rw-r--r-- 1 root root       356 Jun 26 06:12 preprocessor_config.json
-rw-r--r-- 1 root root      2194 Jun 26 06:12 special_tokens_map.json
-rw-r--r-- 1 root root    282713 Jun 26 06:12 tokenizer_config.json
-rw-r--r-- 1 root root   3930494 Jun 26 06:12 tokenizer.json
-rw-r--r-- 1 root root   1036584 Jun 26 06:12 vocab.json


We have to move the additional model assets (the `.json` and `.txt` files) into a separate folder so that Spark NLP can load them properly.

In [4]:
!mkdir -p {EXPORT_PATH}/assets && mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.json {EXPORT_PATH}/*.txt

In [5]:
!ls -l {EXPORT_PATH}/assets

total 5724
-rw-r--r-- 1 root root   34604 Jun 26 06:12 added_tokens.json
-rw-r--r-- 1 root root    1327 Jun 26 06:12 config.json
-rw-r--r-- 1 root root    3742 Jun 26 06:12 generation_config.json
-rw-r--r-- 1 root root  493869 Jun 26 06:12 merges.txt
-rw-r--r-- 1 root root   52666 Jun 26 06:12 normalizer.json
-rw-r--r-- 1 root root     356 Jun 26 06:12 preprocessor_config.json
-rw-r--r-- 1 root root    2194 Jun 26 06:12 special_tokens_map.json
-rw-r--r-- 1 root root  282713 Jun 26 06:12 tokenizer_config.json
-rw-r--r-- 1 root root 3930494 Jun 26 06:12 tokenizer.json
-rw-r--r-- 1 root root 1036584 Jun 26 06:12 vocab.json
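
The shell command above relies on GNU `mv -t`, which is not available everywhere (BSD `mv` on macOS, for example, does not have it). As a rough, portable alternative, the same move can be sketched in plain Python. The helper name `move_assets` is hypothetical and not part of Spark NLP or Optimum; it only mirrors what the `mkdir`/`mv` cell does.

```python
import shutil
from pathlib import Path
from typing import List


def move_assets(export_path: str) -> List[str]:
    """Move every *.json and *.txt file in export_path into export_path/assets.

    Hypothetical helper: a sketch of the shell cell above, not a library API.
    Returns the names of the files that were moved, in sorted order.
    """
    export_dir = Path(export_path)
    assets_dir = export_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    # sorted() materialises the directory listing before any file is moved
    for path in sorted(export_dir.iterdir()):
        if path.is_file() and path.suffix in {".json", ".txt"}:
            shutil.move(str(path), str(assets_dir / path.name))
            moved.append(path.name)
    return moved
```

Calling `move_assets(EXPORT_PATH)` in place of the shell cell should leave the `.onnx` files where they are and place the tokenizer and config files under `assets`.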


## Import and Save Whisper in Spark NLP

- Install and set up Spark NLP in Google Colab
- This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [6]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3

Let's start Spark with Spark NLP included via our simple `start()` function:

In [7]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: {}".format(spark.version))

Spark NLP version:  5.5.3
Apache Spark version: 3.5.4


- Let's use the `loadSavedModel` function in `WhisperForCTC`, which allows us to load the ONNX model.
- Most params will be set automatically. They can also be set later, after loading the model in `WhisperForCTC`, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
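
Before calling `loadSavedModel` on a local path, it can help to confirm that the export folder has the layout prepared earlier. The sketch below is only a sanity check under an assumption: the list of expected file names is inferred from the directory listings in this notebook (the three ONNX files that also appear in the saved model, plus the `assets` folder), not from the Spark NLP source. The names `EXPECTED_ONNX` and `missing_export_files` are hypothetical.

```python
from pathlib import Path

# Assumption: inferred from the listings in this notebook, not from Spark NLP itself.
EXPECTED_ONNX = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_export_files(export_path):
    """Return the expected entries that are absent from a local export folder."""
    export_dir = Path(export_path)
    missing = [name for name in EXPECTED_ONNX if not (export_dir / name).is_file()]
    if not (export_dir / "assets").is_dir():
        missing.append("assets/")
    return missing
```

An empty list from `missing_export_files(EXPORT_PATH)` suggests the folder is ready to be loaded; anything else names what to re-export or move first.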

In [8]:
from sparknlp.annotator import WhisperForCTC

whisper = (
    WhisperForCTC.loadSavedModel(f"{EXPORT_PATH}", spark)
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function.

In [9]:
whisper.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [10]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your ONNX Whisper model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [11]:
! ls -l {MODEL_NAME}_spark_nlp

total 258736
-rw-r--r-- 1 root root 118382769 Jun 26 06:15 decoder_model
-rw-r--r-- 1 root root 113645224 Jun 26 06:15 decoder_with_past_model
-rw-r--r-- 1 root root  32899340 Jun 26 06:15 encoder_model
drwxr-xr-x 6 root root      4096 Jun 26 06:15 fields
drwxr-xr-x 2 root root      4096 Jun 26 06:15 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny Whisper model 😊

In [12]:
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/audio/txt/librispeech_asr_0.txt

--2025-06-26 06:15:29--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/audio/txt/librispeech_asr_0.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2199992 (2.1M) [text/plain]
Saving to: ‘librispeech_asr_0.txt’


2025-06-26 06:15:30 (32.0 MB/s) - ‘librispeech_asr_0.txt’ saved [2199992/2199992]



In [13]:
from sparknlp.base import AudioAssembler
from sparknlp.annotator import WhisperForCTC
from pyspark.ml import Pipeline

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

whisper_model = WhisperForCTC.load(f"{MODEL_NAME}_spark_nlp") \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
    audio_assembler,
    whisper_model
])

with open("librispeech_asr_0.txt") as f:
    raw_floats = [float(x) for x in f.read().strip().split("\n")]

df = spark.createDataFrame([[raw_floats]], ["audio_content"])

model = pipeline.fit(df)
result = model.transform(df)

result.selectExpr("text.result[0] as transcription").show(truncate=False)

+----------------------------------------------------------------------------------------+
|transcription                                                                           |
+----------------------------------------------------------------------------------------+
| Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.|
+----------------------------------------------------------------------------------------+
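
The example above reads the audio from a text file with one float sample per line. For your own recordings you will need to produce the same kind of float list yourself. Below is a minimal sketch using only the Python standard library, assuming the input is a mono, 16-bit PCM WAV file. The function name `wav_to_floats` is hypothetical. Whisper checkpoints are trained on 16 kHz audio, and this sketch does not resample, so the returned sample rate should be checked by the caller.

```python
import struct
import wave


def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file.

    Returns (samples, sample_rate), where samples is a list of floats
    scaled into the range [-1.0, 1.0). No resampling is performed.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # WAV PCM data is little-endian; each sample is a signed 16-bit integer
    count = len(frames) // 2
    ints = struct.unpack("<%dh" % count, frames)
    return [s / 32768.0 for s in ints], sample_rate
```

The resulting list can then be passed to Spark in the same way as `raw_floats`, for example `spark.createDataFrame([[samples]], ["audio_content"])`.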



That's it! You can now go wild and use hundreds of Whisper models from HuggingFace 🤗 in Spark NLP 🚀
