![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_GPT2.ipynb)

## Import ONNX GPT2 models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.
- ONNX support for `TFGPT2Model` is available from `Spark NLP 5.2.0` onward, so please make sure you have upgraded to the latest Spark NLP release.
- You can import GPT2 models via `TFGPT2Model`. These models are usually under the `Text Generation` category and have `GPT2` in their labels.
- This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- Reference: [TFGPT2Model](https://huggingface.co/docs/transformers/en/model_doc/gpt2)
- Some [example models](https://huggingface.co/models?other=GPT2)
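
The version requirement above can be guarded with a few lines of plain Python. This is a minimal sketch: the helper names `parse_version` and `meets_minimum` are hypothetical (they are not part of Spark NLP), and the code assumes dotted numeric versions such as `5.5.3`.

```python
def parse_version(version: str) -> tuple:
    """Turn '5.2.0' into (5, 2, 0); a trailing suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.2.0") -> bool:
    """Return True when `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '5.2' compares equal to '5.2.0'
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

Once Spark NLP is installed, the string returned by `sparknlp.version()` can be passed to `meets_minimum` before attempting the import.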

## Export and Save HuggingFace model

- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock `transformers` on version `4.48.3`. This doesn't mean it won't work with future releases.
- We will also need `sentencepiece` for tokenization.

In [None]:
!pip install -q --upgrade transformers[onnx]==4.48.3 optimum onnx

- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.
- We'll use [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) model from HuggingFace as an example
- In addition to the `GPT2` model, we also need to save the tokenizer. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.
- If we want to optimize the model, a GPU will be needed. Make sure to select the correct runtime.

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.exporters.onnx import main_export

MODEL_NAME = "openai-community/gpt2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

main_export(
    model_name_or_path=MODEL_NAME,
    output=EXPORT_PATH,
    task="text-generation",
    opset=14
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

Found different candidate ONNX initializers (likely duplicate) for the tied weights:
	lm_head.weight: {'onnx::MatMul_3447'}
	transformer.wte.weight: {'transformer.wte.weight'}


('onnx_models/openai-community/gpt2/tokenizer_config.json',
 'onnx_models/openai-community/gpt2/special_tokens_map.json',
 'onnx_models/openai-community/gpt2/vocab.json',
 'onnx_models/openai-community/gpt2/merges.txt',
 'onnx_models/openai-community/gpt2/added_tokens.json',
 'onnx_models/openai-community/gpt2/tokenizer.json')

Let's have a look inside the export directory and see what we are dealing with:

In [7]:
!ls -l {EXPORT_PATH}

total 491232
-rw-r--r-- 1 root root       937 Jun 14 03:03 config.json
-rw-r--r-- 1 root root       119 Jun 14 03:03 generation_config.json
-rw-r--r-- 1 root root    456318 Jun 14 03:04 merges.txt
-rw-r--r-- 1 root root 498186250 Jun 14 03:04 model.onnx
-rw-r--r-- 1 root root        99 Jun 14 03:04 special_tokens_map.json
-rw-r--r-- 1 root root       475 Jun 14 03:04 tokenizer_config.json
-rw-r--r-- 1 root root   3557680 Jun 14 03:04 tokenizer.json
-rw-r--r-- 1 root root    798156 Jun 14 03:04 vocab.json
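
To go one step past the file listing, the sketch below reads a few fields from the exported `config.json`. The helper name `summarize_config` is hypothetical, and the key names (`model_type`, `n_layer`, `n_head`, `n_embd`, `vocab_size`) are assumed to follow the usual GPT2 config layout; any key that is absent comes back as `None`.

```python
import json
from pathlib import Path


def summarize_config(export_path,
                     keys=("model_type", "n_layer", "n_head", "n_embd", "vocab_size")):
    """Return a dict with a handful of fields from <export_path>/config.json."""
    config_file = Path(export_path) / "config.json"
    with config_file.open(encoding="utf-8") as f:
        config = json.load(f)
    # .get() keeps the summary usable even if a key is missing from the file
    return {key: config.get(key) for key in keys}
```

Calling `summarize_config(EXPORT_PATH)` in the notebook gives a compact view of the model that was just exported.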


- We need to organize tokenizer files into an `assets` folder and convert `vocab.json` to `vocab.txt` because Spark NLP requires this format to properly load and use the model.

In [8]:
!mkdir -p {EXPORT_PATH}/assets && mv {EXPORT_PATH}/merges.txt {EXPORT_PATH}/assets/

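import json

# Load the token -> id mapping produced by the tokenizer
with open(f"{EXPORT_PATH}/vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# Write one token per line, ordered by token id
with open(f"{EXPORT_PATH}/assets/vocab.txt", "w", encoding="utf-8") as f:
    f.writelines(f"{token}\n" for token in sorted(vocab, key=vocab.get))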

In [9]:
!ls -l {EXPORT_PATH}/assets

total 848
-rw-r--r-- 1 root root 456318 Jun 14 03:04 merges.txt
-rw-r--r-- 1 root root 406992 Jun 14 03:05 vocab.txt
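
A small sanity check at this point can catch a misplaced or incomplete asset. This is a sketch, not a Spark NLP feature: the helper name `check_export_layout` is hypothetical, and it assumes the layout built above (`model.onnx` and `vocab.json` at the top level, `vocab.txt` and `merges.txt` under `assets`) and that each token in `vocab.txt` should sit on the line whose zero-based position equals its id in `vocab.json`.

```python
import json
from pathlib import Path


def check_export_layout(export_path):
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(export_path)
    problems = []
    for relative in ("model.onnx", "vocab.json", "assets/vocab.txt", "assets/merges.txt"):
        if not (root / relative).is_file():
            problems.append(f"missing file: {relative}")
    if problems:
        return problems

    with (root / "vocab.json").open(encoding="utf-8") as f:
        vocab = json.load(f)
    # newline="\n" makes sure lines are split on "\n" only
    with (root / "assets" / "vocab.txt").open(encoding="utf-8", newline="\n") as f:
        tokens = [line.rstrip("\n") for line in f]

    if len(tokens) != len(vocab):
        problems.append(
            f"vocab.txt has {len(tokens)} lines, vocab.json has {len(vocab)} tokens"
        )
    # Report only the first token whose line position differs from its id
    for position, token in enumerate(tokens):
        if vocab.get(token) != position:
            problems.append(
                f"token {token!r} is on line {position} "
                f"but has id {vocab.get(token)} in vocab.json"
            )
            break
    return problems
```

In the notebook, `check_export_layout(EXPORT_PATH)` should return an empty list before moving on to the Spark NLP import.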


All set! The assets are prepped and ready for Spark NLP. We're good to go.

## Import and Save GPT2 in Spark NLP

- **Install and set up Spark NLP in Google Colab**
  - This example uses specific versions of `pyspark` and `spark-nlp` that have been tested with the transformer model to ensure everything runs smoothly.

In [10]:
!pip install -q pyspark==3.5.4 spark-nlp==5.5.3



Let's start Spark with Spark NLP included via our simple `start()` function

In [11]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  5.5.3
Apache Spark version:  3.5.4


- Let's use the `loadSavedModel` function in `GPT2Transformer`, which allows us to load the ONNX model.
- Most params will be set automatically. They can also be set later, after loading the model in `GPT2Transformer` during runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [12]:
from sparknlp.annotator import GPT2Transformer

gpt2 = GPT2Transformer.loadSavedModel(EXPORT_PATH, spark)\
    .setInputCols(["document"])\
    .setMaxOutputLength(50)\
    .setDoSample(True)\
    .setTopK(50)\
    .setTemperature(0.7)\
    .setBatchSize(5)\
    .setNoRepeatNgramSize(3)\
    .setOutputCol("generation")

Let's save it to disk so it is easier to move around and can be used later via the `.load` function.

In [13]:
gpt2.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [14]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your ONNX GPT2 model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [15]:
! ls -l {MODEL_NAME}_spark_nlp

total 486600
drwxr-xr-x 4 root root      4096 Jun 14 03:10 fields
-rw-r--r-- 1 root root 498262404 Jun 14 03:10 gpt2_onnx
drwxr-xr-x 2 root root      4096 Jun 14 03:10 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GPT2 model 😊

In [None]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import GPT2Transformer
from pyspark.ml import Pipeline

example = spark.createDataFrame([
    ["Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a"]
]).toDF("text")

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

gpt2 = GPT2Transformer.load(f"{MODEL_NAME}_spark_nlp")\
    .setInputCols(["document"])\
    .setOutputCol("generation")\
    .setMaxOutputLength(50)\
    .setDoSample(True)\
    .setTopK(50)\
    .setTemperature(0.7)\
    .setBatchSize(1)\
    .setNoRepeatNgramSize(3)

pipeline = Pipeline().setStages([
    document_assembler,
    gpt2
])

result = pipeline.fit(example).transform(example)
result.select("generation.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a more general task. This approach shows that learning to learn a new task is a matter of learning to master. As described in the]|
+-------------------------------------------------------
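
As the output shows, the generated text starts with the prompt itself. If only the continuation is needed, a small post-processing step can remove it. This is a minimal sketch in plain Python; the helper name `strip_prompt` is hypothetical, and it assumes the prompt is echoed verbatim apart from surrounding whitespace.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Return the generated text without the leading prompt, if the prompt is echoed."""
    text = generated.strip()
    prompt = prompt.strip()
    if prompt and text.startswith(prompt):
        # Drop the echoed prompt and the whitespace that follows it
        return text[len(prompt):].lstrip()
    # Prompt not found at the start: return the text unchanged (only trimmed)
    return text
```

The helper can be applied to each string collected from the `generation.result` column.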

That's it! You can now go wild and use hundreds of GPT2 models from HuggingFace 🤗 in Spark NLP 🚀
