![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb)

## Import ONNX CamemBertForSequenceClassification models from HuggingFace 🤗  into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in  `Spark NLP 5.0.0`, enabling high performance inference for models.
- `CamemBertForSequenceClassification` is only available in `Spark NLP 5.2.0` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import CamemBERT models trained/fine-tuned for sequence classification via `CamembertForSequenceClassification` or `TFCamembertForSequenceClassification`. These models are usually under the `Sequence Classification` category and have `camembert` in their labels
- Reference: [TFCamembertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/camembert#transformers.TFCamembertForSequenceClassification)
- Some [example models](https://huggingface.co/models?other=camembert&pipeline_tag=text-classification)

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.29.1`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
- `CamembertTokenizer` requires the `SentencePiece` library, so we install that as well
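
Because the tested versions matter here, it can help to print what actually ended up in the environment once the install cell has run. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and the package list simply mirrors the install command in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Package names taken from the pip command used in this notebook.
for name, ver in installed_versions(
    ["transformers", "optimum", "sentencepiece", "tensorflow"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `transformers` reports something other than `4.29.1`, the export may still work, but it is no longer the combination that was tested here.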

In [3]:
!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use the [tblard/tf-allocine](https://huggingface.co/tblard/tf-allocine) model from HuggingFace as an example and load it as an `ORTModelForSequenceClassification`, representing an ONNX model.

In [4]:
from optimum.onnxruntime import ORTModelForSequenceClassification
import tensorflow as tf

MODEL_NAME = 'tblard/tf-allocine'
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForSequenceClassification.from_pretrained(MODEL_NAME, export=True)

# Save the ONNX model
ort_model.save_pretrained(ONNX_MODEL)

Framework not specified. Using tf to export to ONNX.


All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

All the layers of TFCamembertForSequenceClassification were initialized from the model checkpoint at tblard/tf-allocine.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCamembertForSequenceClassification for predictions without further training.


Using the export variant default. Available variants are:
	- default: The default ONNX variant.
`input_shapes` argument is not supported by the Tensorflow ONNX export and will be ignored.
Using framework TensorFlow: 2.11.1
Overriding 1 configuration item(s)
	- use_cache -> False


Let's have a look inside this directory and see what we are dealing with:

In [5]:
!ls -l {ONNX_MODEL}

total 435760
-rw-r--r-- 1 root root       835 Nov  3 19:56 config.json
-rw-r--r-- 1 root root 442966534 Nov  3 19:56 model.onnx
-rw-r--r-- 1 root root    810912 Nov  3 19:56 sentencepiece.bpe.model
-rw-r--r-- 1 root root       241 Nov  3 19:56 special_tokens_map.json
-rw-r--r-- 1 root root       544 Nov  3 19:56 tokenizer_config.json
-rw-r--r-- 1 root root   2418877 Nov  3 19:56 tokenizer.json


- As you can see, we need to move `sentencepiece.bpe.model` from the tokenizer files into an `assets` folder, which is where Spark NLP will look for it
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them to `labels.txt`

In [6]:
!mkdir {ONNX_MODEL}/assets

In [7]:
# get the id2label dictionary (class id -> label name)
labels = ort_model.config.id2label
# sort the labels by class id so line N of labels.txt is the label for id N
labels = [value for key, value in sorted(labels.items())]

with open(ONNX_MODEL + '/assets/labels.txt', 'w') as f:
    f.write('\n'.join(labels))
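
If the model object is no longer in memory, the same labels file can be rebuilt from the saved `config.json`. This is a minimal sketch, assuming the exported `config.json` contains an `id2label` mapping; the helper name `write_labels` is hypothetical. Note that JSON object keys are strings, so the ids are cast to `int` before sorting.

```python
import json
from pathlib import Path


def write_labels(model_dir, labels_file="assets/labels.txt"):
    """Write one label per line, ordered by numeric class id."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    # JSON keys are always strings; cast to int so "10" does not sort before "2".
    id2label = {int(idx): label for idx, label in config["id2label"].items()}
    labels = [id2label[idx] for idx in sorted(id2label)]
    target = model_dir / labels_file
    # Create the assets folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(labels), encoding="utf-8")
    return labels
```

Calling `write_labels(ONNX_MODEL)` would then produce the same `assets/labels.txt` as the cell above, provided the config holds the mapping.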

In [8]:
!mv {ONNX_MODEL}/sentencepiece.bpe.model {ONNX_MODEL}/assets

Voila! We now have `sentencepiece.bpe.model` and `labels.txt` inside the `assets` directory

In [9]:
!ls -lR {ONNX_MODEL}

onnx_models/tblard/tf-allocine:
total 434972
drwxr-xr-x 2 root root      4096 Nov  3 19:56 assets
-rw-r--r-- 1 root root       835 Nov  3 19:56 config.json
-rw-r--r-- 1 root root 442966534 Nov  3 19:56 model.onnx
-rw-r--r-- 1 root root       241 Nov  3 19:56 special_tokens_map.json
-rw-r--r-- 1 root root       544 Nov  3 19:56 tokenizer_config.json
-rw-r--r-- 1 root root   2418877 Nov  3 19:56 tokenizer.json

onnx_models/tblard/tf-allocine/assets:
total 796
-rw-r--r-- 1 root root     17 Nov  3 19:56 labels.txt
-rw-r--r-- 1 root root 810912 Nov  3 19:56 sentencepiece.bpe.model
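
Before handing the directory to Spark NLP, a quick check that the files from the listing above are in place can save a confusing error later. This is a minimal sketch; the file list is taken from this notebook's listing rather than from an official specification, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Files this notebook places in the export directory before loading it.
EXPECTED_FILES = (
    "model.onnx",
    "assets/labels.txt",
    "assets/sentencepiece.bpe.model",
)


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent under model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_files(ONNX_MODEL)` means the layout matches the listing.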


## Import and Save CamemBertForSequenceClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [10]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2023-11-03 19:56:31--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2023-11-03 19:56:31--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’


2023-11-03 19:56:31 (92.1 MB/s) - written to stdout [1191/1191]

Installing PySpark 3.2.3 and Spark NLP 5.1.4
setup Colab for PySpark 3.2.3 and Spark NLP 5

Let's start Spark with Spark NLP included via our simple `start()` function

In [11]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.2.3


- Let's use the `loadSavedModel` function in `CamemBertForSequenceClassification`, which allows us to load the exported ONNX model
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `CamemBertForSequenceClassification` at runtime, so don't worry about what you set now
- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, i.e. the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [12]:
from sparknlp.annotator import *

sequenceClassifier = CamemBertForSequenceClassification.loadSavedModel(
    ONNX_MODEL,
    spark
)\
  .setInputCols(["document", "token"])\
  .setOutputCol("class")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [13]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp_onnx".format(ONNX_MODEL))
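
To see where the saved model ends up, the path expression from the save call can be evaluated on its own. This is a small sketch using the same `MODEL_NAME` and `ONNX_MODEL` values as earlier in the notebook; `SPARK_NLP_MODEL` is a hypothetical variable name introduced only for illustration.

```python
import posixpath

MODEL_NAME = "tblard/tf-allocine"
ONNX_MODEL = f"onnx_models/{MODEL_NAME}"

# Same expression as in the save call; the variable name is illustrative only.
SPARK_NLP_MODEL = "./{}_spark_nlp_onnx".format(ONNX_MODEL)

print(ONNX_MODEL)       # onnx_models/tblard/tf-allocine
print(SPARK_NLP_MODEL)  # ./onnx_models/tblard/tf-allocine_spark_nlp_onnx

# Both directories share the same parent, so they are siblings on disk.
same_parent = (
    posixpath.dirname(posixpath.normpath(SPARK_NLP_MODEL))
    == posixpath.dirname(ONNX_MODEL)
)
print(same_parent)  # True
```

Since the Spark NLP model is a sibling of the export directory rather than a child of it, removing `{ONNX_MODEL}` does not touch the saved model.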

Let's clean up stuff we don't need anymore

In [14]:
!rm -rf {ONNX_MODEL}

Awesome 😎  !

This is your CamemBertForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [15]:
! ls -l {ONNX_MODEL}_spark_nlp_onnx

total 433456
-rw-r--r-- 1 root root 443034301 Nov  3 20:00 camembert_classification_onnx
-rw-r--r-- 1 root root    810912 Nov  3 20:00 camembert_spp
drwxr-xr-x 3 root root      4096 Nov  3 19:58 fields
drwxr-xr-x 2 root root      4096 Nov  3 19:58 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny CamemBertForSequenceClassification model 😊

In [16]:
sequenceClassifier_loaded = CamemBertForSequenceClassification.load("./{}_spark_nlp_onnx".format(ONNX_MODEL))\
  .setInputCols(["document",'token'])\
  .setOutputCol("class")

You can see what labels were used to train this model via `getClasses` function:

In [17]:
# .getClasses was introduced in spark-nlp==3.4.0
sequenceClassifier_loaded.getClasses()

['NEGATIVE', 'POSITIVE']

This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:

In [18]:
from pyspark.ml import Pipeline

from sparknlp.base import *
from sparknlp.annotator import *

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier_loaded
])

# couple of simple examples
example = spark.createDataFrame([["Alad'2 est clairement le meilleur film de l'année 2018."], ["Je m'attendais à mieux de la part de Franck Dubosc !"]]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "class.result").show()

+--------------------+----------+
|                text|    result|
+--------------------+----------+
|Alad'2 est claire...|[POSITIVE]|
|Je m'attendais à ...|[NEGATIVE]|
+--------------------+----------+
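
Once predictions are collected to the driver, they are ordinary Python values and can be summarized without Spark. This is a minimal sketch, assuming each row is a `(text, [label, ...])` pair such as the rows from `result.select("text", "class.result").collect()` would provide; the hard-coded `rows` below are a stand-in for that call, and `count_labels` is a hypothetical helper.

```python
from collections import Counter


def count_labels(rows):
    """Count predicted labels across (text, [label, ...]) pairs."""
    counts = Counter()
    for _text, labels in rows:
        # Each row carries a list of labels, one per classified document.
        counts.update(labels)
    return counts


# Stand-in for collected rows; values copied from the output table above.
rows = [
    ("Alad'2 est clairement le meilleur film de l'année 2018.", ["POSITIVE"]),
    ("Je m'attendais à mieux de la part de Franck Dubosc !", ["NEGATIVE"]),
]
print(count_labels(rows))
```

For large DataFrames, aggregating inside Spark is usually preferable to collecting every row; the sketch is intended for small samples like this one.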



That's it! You can now go wild and use hundreds of `CamemBertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀
