![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_VisionEncoderDecoderForImageCaptioning.ipynb)

# Import ONNX VisionEncoderDecoderForImageCaptioning models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.1.0`, enabling high performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.


## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extra and its dependencies. Spark NLP itself doesn't need `onnx` to be installed; we only need it to load and export models from HuggingFace.
- We pin `transformers` to version `4.31.0`. This doesn't mean it won't work with future releases, but we wanted you to know which version has been tested successfully.

In [1]:
!pip install -q --upgrade "transformers[onnx]==4.31.0" optimum
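
After the install finishes, it can be handy to confirm which version actually ended up in the environment. The snippet below is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_tested` are hypothetical and not part of `transformers` or Spark NLP.

```python
from importlib import metadata
from typing import Optional

# Version pinned in the install cell above
TESTED_TRANSFORMERS = "4.31.0"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_tested(package: str, tested: str) -> bool:
    """True only when `package` is installed at exactly the tested version."""
    return installed_version(package) == tested


if __name__ == "__main__":
    print("transformers:", installed_version("transformers"))
    print("matches tested pin:", matches_tested("transformers", TESTED_TRANSFORMERS))
```

A mismatch is not necessarily a problem; it only means you are outside the combination this notebook was tested with.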

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) model from HuggingFace as an example and export it with the `optimum-cli`.

In [2]:
MODEL_NAME = "nlpconnect/vit-gpt2-image-captioning"
EXPORT_PATH = f"export_onnx/{MODEL_NAME}"

In [3]:
! optimum-cli export onnx --model {MODEL_NAME} {EXPORT_PATH}

2024-07-30 10:43:01.658251: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-30 10:43:01.658332: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-30 10:43:01.741181: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Framework not specified. Using pt to export the model.
config.json: 100% 4.61k/4.61k [00:00<00:00, 18.4MB/s]
pytorch_model.bin: 100% 982M/982M [00:10<00:00, 95.9MB/s]
Automatic task detection to image-to-text-with-past.
tokenizer_config.json: 100% 241/241 [00:00<00:00, 1.20MB/s]
vocab.json: 100% 798k/798k [00:00<00:00, 4.53MB/s]
merges.txt: 100% 456k/456k [00:00<

We have to move the additional model assets into a separate folder so that Spark NLP can load them properly.

In [4]:
! mkdir -p {EXPORT_PATH}/assets
! mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.json {EXPORT_PATH}/*.txt
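
If you are not running in a shell-friendly environment (or `mv -t` is unavailable on your platform), the same step can be done in plain Python. This is a minimal sketch; `move_assets` is a hypothetical helper name, and it assumes the assets are exactly the `*.json` and `*.txt` files in the top level of the export folder, as in the shell commands above.

```python
import shutil
from pathlib import Path
from typing import List


def move_assets(export_path: str) -> List[str]:
    """Move every top-level *.json and *.txt file into <export_path>/assets.

    Returns the sorted names of the files that were moved.
    """
    root = Path(export_path)
    assets = root / "assets"
    assets.mkdir(parents=True, exist_ok=True)  # equivalent to `mkdir -p`

    moved = []
    for pattern in ("*.json", "*.txt"):
        # Materialize the matches first so we don't move files while globbing
        for src in sorted(root.glob(pattern)):
            if src.is_file():
                shutil.move(str(src), str(assets / src.name))
                moved.append(src.name)
    return sorted(moved)
```

Calling `move_assets(EXPORT_PATH)` leaves the `.onnx` files where they are and is safe to run twice: the second call simply finds nothing left to move.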

Let's have a look inside these two directories and see what we are dealing with:

In [5]:
!ls -l {EXPORT_PATH}

total 2133548
drwxr-xr-x 2 root root      4096 Jul 30 10:45 assets
-rw-r--r-- 1 root root 615029740 Jul 30 10:45 decoder_model_merged.onnx
-rw-r--r-- 1 root root 613132137 Jul 30 10:44 decoder_model.onnx
-rw-r--r-- 1 root root 613129445 Jul 30 10:44 decoder_with_past_model.onnx
-rw-r--r-- 1 root root 343440610 Jul 30 10:43 encoder_model.onnx


In [6]:
!ls -l {EXPORT_PATH}/assets

total 3312
-rw-r--r-- 1 root root    5038 Jul 30 10:43 config.json
-rw-r--r-- 1 root root     179 Jul 30 10:43 generation_config.json
-rw-r--r-- 1 root root  456318 Jul 30 10:43 merges.txt
-rw-r--r-- 1 root root     378 Jul 30 10:43 preprocessor_config.json
-rw-r--r-- 1 root root     131 Jul 30 10:43 special_tokens_map.json
-rw-r--r-- 1 root root     234 Jul 30 10:43 tokenizer_config.json
-rw-r--r-- 1 root root 2107928 Jul 30 10:43 tokenizer.json
-rw-r--r-- 1 root root  798156 Jul 30 10:43 vocab.json
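
Before moving on, you may want to check programmatically that the export looks like the two listings above. The sketch below simply compares a folder against those file names; `missing_files` is a hypothetical helper, and the lists are copied from the listings rather than from any statement about which files Spark NLP strictly requires.

```python
from pathlib import Path
from typing import List, Sequence

# File names taken from the `ls -l` output above (an assumption for other models)
ROOT_FILES = [
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
    "decoder_model_merged.onnx",
]
ASSET_FILES = [
    "config.json",
    "generation_config.json",
    "merges.txt",
    "preprocessor_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
]


def missing_files(
    export_path,
    root_files: Sequence[str] = ROOT_FILES,
    asset_files: Sequence[str] = ASSET_FILES,
) -> List[str]:
    """Return the expected files (relative, POSIX-style) that are not present."""
    root = Path(export_path)
    expected = [root / name for name in root_files]
    expected += [root / "assets" / name for name in asset_files]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]
```

An empty list from `missing_files(EXPORT_PATH)` means the folder matches the layout shown in this notebook.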


## Import and Save VisionEncoderDecoderForImageCaptioning in Spark NLP

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script
- Additionally, we need to upgrade Spark to version 3.4.1.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash
! pip install -U pyspark==3.4.1

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `VisionEncoderDecoderForImageCaptioning`, which allows us to load the ONNX model
- Most params will be set automatically. They can also be set later, after loading the model in `VisionEncoderDecoderForImageCaptioning` during runtime, so don't worry about setting them now
- `loadSavedModel` accepts two params: the first is the path to the exported model, the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

# Load the ONNX export from EXPORT_PATH (the folder holding the .onnx files and assets/)
imageClassifier = VisionEncoderDecoderForImageCaptioning.loadSavedModel(
     EXPORT_PATH,
     spark
 )\
  .setInputCols(["image_assembler"])\
  .setOutputCol("caption")

- Let's save it to disk so it is easier to move around and can be used later via the `.load` function

In [None]:
imageClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
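
One detail that is easy to miss: `MODEL_NAME` contains a `/`, so the save path above resolves to a nested directory rather than a single folder. The short sketch below just spells out how the string expands; `flat_path` is a hypothetical alternative and is not used anywhere else in this notebook.

```python
from pathlib import PurePosixPath

MODEL_NAME = "nlpconnect/vit-gpt2-image-captioning"

# Same expression as in the save cell above
save_path = "./{}_spark_nlp".format(MODEL_NAME)

# The "/" in MODEL_NAME splits the path into a parent folder and a model folder
parts = PurePosixPath(save_path).parts

# Hypothetical alternative: replace "/" if you prefer a single flat folder
flat_path = "./{}_spark_nlp".format(MODEL_NAME.replace("/", "_"))

print(save_path)
print(parts)
print(flat_path)
```

So the model ends up in `./nlpconnect/vit-gpt2-image-captioning_spark_nlp`, which is the same location the later `ls` and `.load` cells point at.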

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome 😎!

This is your ONNX VisionEncoderDecoderForImageCaptioning model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny VisionEncoderDecoderForImageCaptioning model 😊

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/image/hippopotamus.JPEG
from IPython.display import Image, display
display(Image("hippopotamus.JPEG"))

In [None]:
from pyspark.ml import Pipeline

# Wrap the raw image column so downstream annotators can consume it
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Load the model saved above and configure caption generation
imageCaptioning = VisionEncoderDecoderForImageCaptioning.load("./{}_spark_nlp".format(MODEL_NAME))\
    .setBeamSize(2) \
    .setDoSample(False) \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("caption")

pipeline = Pipeline().setStages([
    image_assembler,
    imageCaptioning
])

test_image = spark.read\
    .format("image")\
    .option("dropInvalid", value = True)\
    .load("./hippopotamus.JPEG")

result = pipeline.fit(test_image).transform(test_image)
result \
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result") \
    .show(truncate = False)
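
The `selectExpr` above uses `reverse(split(image.origin, '/'))[0]` to keep only the file name: it splits the origin URI on `/`, reverses the pieces, and takes the first one, which is the last path segment. Here is the same logic as a plain-Python sketch; `image_name` is a hypothetical helper and the example URI is made up for illustration.

```python
def image_name(origin: str) -> str:
    """Return the last '/'-separated segment of an image origin URI.

    Mirrors the Spark SQL expression reverse(split(origin, '/'))[0].
    """
    pieces = origin.split("/")
    return list(reversed(pieces))[0]  # same as pieces[-1]


if __name__ == "__main__":
    # Hypothetical origin value; the real one depends on where the image was loaded from
    print(image_name("file:///content/hippopotamus.JPEG"))
```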

That's it! You can now go wild and use hundreds of VisionEncoderDecoderForImageCaptioning models from HuggingFace 🤗 in Spark NLP 🚀
