![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_CamemBertForQuestionAnswering.ipynb)

## Import ONNX CamemBertForQuestionAnswering models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- `CamemBertForQuestionAnswering` is only available in `Spark NLP 5.2.0` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import CamemBERT models trained/fine-tuned for question answering via `CamembertForQuestionAnswering` or `TFCamembertForQuestionAnswering`. These models are usually listed under the `Question Answering` category and have `camembert` in their labels.
- Reference: [TFCamembertForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/camembert#transformers.TFCamembertForQuestionAnswering)
- Some [example models](https://huggingface.co/models?other=camembert&pipeline_tag=question-answering&sort=downloads)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.51.3`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
- `CamembertTokenizer` requires the `SentencePiece` library, so we install that as well.

In [3]:
!pip install -q transformers[onnx]==4.51.3 optimum onnx sentencepiece
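
Because only specific versions were tested, it can help to print what actually got installed before exporting. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package list simply mirrors the libraries mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages this notebook relies on for the export step.
for name in ("transformers", "optimum", "onnx", "sentencepiece"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `transformers` reports something other than `4.51.3`, the export may still work, but it is no longer the tested combination.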

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [illuin/camembert-base-fquad](https://huggingface.co/illuin/camembert-base-fquad) model from HuggingFace as an example and load it as an `ORTModelForQuestionAnswering`, representing an ONNX model.
- In addition to the CamemBERT model, we also need to save the `CamembertTokenizer`. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.

In [4]:
from transformers import CamembertTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering

MODEL_NAME = 'illuin/camembert-base-fquad'
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForQuestionAnswering.from_pretrained(MODEL_NAME, export=True)
ort_model.save_pretrained(ONNX_MODEL)

tokenizer = CamembertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(ONNX_MODEL)

Some weights of the model checkpoint at illuin/camembert-base-fquad were not used when initializing CamembertForQuestionAnswering: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('onnx_models/illuin/camembert-base-fquad/tokenizer_config.json',
 'onnx_models/illuin/camembert-base-fquad/special_tokens_map.json',
 'onnx_models/illuin/camembert-base-fquad/sentencepiece.bpe.model',
 'onnx_models/illuin/camembert-base-fquad/added_tokens.json')

Let's have a look inside this directory and see what we are dealing with:

In [5]:
!ls -l {ONNX_MODEL}

total 430936
-rw-r--r-- 1 root root        22 Jun 10 20:25 added_tokens.json
-rw-r--r-- 1 root root       667 Jun 10 20:25 config.json
-rw-r--r-- 1 root root 440450774 Jun 10 20:25 model.onnx
-rw-r--r-- 1 root root    810912 Jun 10 20:25 sentencepiece.bpe.model
-rw-r--r-- 1 root root       354 Jun 10 20:25 special_tokens_map.json
-rw-r--r-- 1 root root      1847 Jun 10 20:25 tokenizer_config.json


- We need to move the `sentencepiece.bpe.model` file saved by the tokenizer into an `assets` folder, as this is where Spark NLP looks for it when working with CamemBERT or other models that use SentencePiece-based tokenizers.

In [6]:
!mkdir {ONNX_MODEL}/assets && mv {ONNX_MODEL}/sentencepiece.bpe.model {ONNX_MODEL}/assets
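
The shell command above can also be written in plain Python, which is convenient outside a notebook. This is only a sketch: `move_into_assets` is a hypothetical helper, not part of Spark NLP or `transformers`.

```python
import shutil
from pathlib import Path


def move_into_assets(model_dir, filename="sentencepiece.bpe.model"):
    """Move `filename` from `model_dir` into `model_dir/assets` and return its new path."""
    model_dir = Path(model_dir)
    assets_dir = model_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)

    source = model_dir / filename
    target = assets_dir / filename
    if source.exists():
        shutil.move(str(source), str(target))
    elif not target.exists():
        # Neither location has the file, so the export is likely incomplete.
        raise FileNotFoundError(f"{filename} not found in {model_dir}")
    return target
```

Calling `move_into_assets(ONNX_MODEL)` would have the same effect as the `mkdir`/`mv` cell, and calling it a second time leaves the file where it is.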

In [7]:
!ls -lR {ONNX_MODEL}

onnx_models/illuin/camembert-base-fquad:
total 430148
-rw-r--r-- 1 root root        22 Jun 10 20:25 added_tokens.json
drwxr-xr-x 2 root root      4096 Jun 10 20:25 assets
-rw-r--r-- 1 root root       667 Jun 10 20:25 config.json
-rw-r--r-- 1 root root 440450774 Jun 10 20:25 model.onnx
-rw-r--r-- 1 root root       354 Jun 10 20:25 special_tokens_map.json
-rw-r--r-- 1 root root      1847 Jun 10 20:25 tokenizer_config.json

onnx_models/illuin/camembert-base-fquad/assets:
total 792
-rw-r--r-- 1 root root 810912 Jun 10 20:25 sentencepiece.bpe.model


Voilà! We now have our `sentencepiece.bpe.model` inside the `assets` directory.

## Import and Save CamemBertForQuestionAnswering in Spark NLP


- Let's install and set up Spark NLP in Google Colab. For this example, we'll use specific versions of `pyspark` and `spark-nlp` that we've already tested with this transformer model to make sure everything runs smoothly.

In [8]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3


Let's start Spark with Spark NLP included via our simple `start()` function

In [9]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4
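
Since `CamemBertForQuestionAnswering` requires Spark NLP 5.2.0 or later, the printed version can also be checked programmatically. The sketch below is a simplified comparison of dotted version strings; `parse_version` and `meets_minimum` are hypothetical helpers, and pre-release suffixes are ignored.

```python
import re


def parse_version(text):
    """Turn a dotted version such as '5.5.3' into a tuple of ints: (5, 5, 3)."""
    numbers = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(current, minimum="5.2.0"):
    """Return True when `current` is at least `minimum`."""
    return parse_version(current) >= parse_version(minimum)


print(meets_minimum("5.5.3"))  # True
print(meets_minimum("5.1.4"))  # False
```

In the notebook this could be called as `meets_minimum(sparknlp.version())`.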


- Let's use the `loadSavedModel` function in `CamemBertForQuestionAnswering`, which allows us to load the ONNX model we just exported.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `CamemBertForQuestionAnswering` at runtime, so don't worry about what you set them to now.
- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
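
Before calling `loadSavedModel`, a quick pre-flight check of the export directory can catch a missing file early. This sketch assumes that the layout shown in the listing above (`model.onnx` plus `assets/sentencepiece.bpe.model`) is what the loader needs; `missing_export_files` is a hypothetical helper.

```python
from pathlib import Path

# Files observed in the export directory above; treating them as required is an assumption.
EXPECTED_FILES = ("model.onnx", "assets/sentencepiece.bpe.model")


def missing_export_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under `model_dir`."""
    root = Path(model_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_export_files(ONNX_MODEL)` suggests the directory is ready to be loaded.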


In [10]:
from sparknlp.annotator import CamemBertForQuestionAnswering

# Load the exported ONNX model directory and configure the annotator's columns.
spanClassifier = CamemBertForQuestionAnswering.loadSavedModel(ONNX_MODEL, spark) \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False) \
    .setMaxSentenceLength(512)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function.

In [11]:
spanClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))

Let's clean up stuff we don't need anymore

In [12]:
!rm -rf {ONNX_MODEL}
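
The cleanup above removes only the export directory. Because `ONNX_MODEL` contains slashes, the Spark NLP model was saved to a sibling directory with the `_spark_nlp_onnx` suffix, not inside the export directory. The sketch below illustrates this with pure path arithmetic; `ONNX_MODEL` is repeated here only so the block is self-contained.

```python
from pathlib import PurePosixPath

ONNX_MODEL = "onnx_models/illuin/camembert-base-fquad"
save_path = "./{}_spark_nlp_onnx".format(ONNX_MODEL)

export_dir = PurePosixPath(ONNX_MODEL)
saved_dir = PurePosixPath(save_path)

print(saved_dir)                            # onnx_models/illuin/camembert-base-fquad_spark_nlp_onnx
print(export_dir in saved_dir.parents)      # False: the saved model is not inside the export dir
print(saved_dir.parent == export_dir.parent)  # True: both live under onnx_models/illuin
```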

Awesome 😎  !

This is your CamemBertForQuestionAnswering model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [13]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 430996
-rw-r--r-- 1 root root 440518118 Jun 10 20:31 camembert_classification_onnx
-rw-r--r-- 1 root root    810912 Jun 10 20:31 camembert_spp
drwxr-xr-x 2 root root      4096 Jun 10 20:31 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny CamemBertForQuestionAnswering model in a Spark NLP 🚀 pipeline!

In [14]:
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import CamemBertForQuestionAnswering
from pyspark.ml import Pipeline

document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier_loaded = CamemBertForQuestionAnswering.load(f"{ONNX_MODEL}_spark_nlp_onnx") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[
    document_assembler,
    spanClassifier_loaded
])

context = "Mon nom est Wolfgang et je vis à Berlin"
question = "Où est-ce que je vis?"
example = spark.createDataFrame([[question, context]]).toDF("question", "context")

model = pipeline.fit(example)
result = model.transform(example)

result.select("question", "answer.result").show(truncate=False)

+---------------------+--------+
|question             |result  |
+---------------------+--------+
|Où est-ce que je vis?|[berlin]|
+---------------------+--------+



That's it! You can now go wild and use hundreds of `CamemBertForQuestionAnswering` models from HuggingFace 🤗 in Spark NLP 🚀
