![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Bart.ipynb)

# Import OpenVINO GPT2 models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and exporting GPT2 models from HuggingFace for use in Spark NLP, leveraging the various tools provided in the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) ecosystem.

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.


## 1. Export and Save the HuggingFace model

- Let's install the `transformers` and `openvino` packages along with their dependencies. Spark NLP itself does not need `openvino` to be installed; we only need it here to load and export models from HuggingFace.
- We pin `transformers` to version `4.48.3`. This doesn't mean it won't work with future releases, but we want you to know which versions have been tested successfully.

In [1]:
!pip install -q --upgrade transformers==4.48.3 openvino==2025.0.0 optimum-intel==1.22.0 huggingface-hub


[Optimum Intel](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) is the interface between the Transformers library and the various model optimization and acceleration tools provided by Intel. HuggingFace models loaded with optimum-intel are automatically optimized for OpenVINO, while being compatible with the Transformers API.
- To load a HuggingFace model directly for inference/export, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. We can use this to import and export OpenVINO models with `from_pretrained` and `save_pretrained`.
- By setting `export=True`, the source model is converted to OpenVINO IR format on the fly.
- We'll use the [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) model from HuggingFace and export it to OpenVINO format.
- In addition to the exported OpenVINO model, we also need the tokenizer files. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.
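
The `optimum-cli` command used in the next cell also has a Python counterpart built on the `from_pretrained` / `save_pretrained` calls described above. The following is a minimal sketch, assuming `optimum-intel` and `transformers` are installed as in the first cell. The names `export_path_for` and `export_gpt2_to_openvino` are illustrative helpers, not part of either library, and the exported graph may differ in detail from the CLI export (for example in how the key/value cache is handled), so treat it as an illustration rather than a drop-in replacement.

```python
from pathlib import Path


def export_path_for(model_name: str, root: str = "ov_models") -> str:
    """Build the export directory used in this notebook: <root>/<model_name>."""
    return f"{root}/{model_name}"


def export_gpt2_to_openvino(model_name: str = "openai-community/gpt2") -> str:
    """Convert a HuggingFace causal LM to OpenVINO IR and save it with its tokenizer."""
    # Imported lazily so export_path_for works without optimum-intel installed.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    export_path = export_path_for(model_name)
    Path(export_path).mkdir(parents=True, exist_ok=True)

    # export=True converts the source checkpoint to OpenVINO IR on the fly.
    ov_model = OVModelForCausalLM.from_pretrained(model_name, export=True)
    ov_model.save_pretrained(export_path)

    # Save the tokenizer files (vocab.json, merges.txt, ...) next to the model.
    AutoTokenizer.from_pretrained(model_name).save_pretrained(export_path)
    return export_path
```

Calling `export_gpt2_to_openvino()` downloads the model, so it is best run once in a Colab cell rather than on every notebook restart.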

In [2]:
# Define the model name and export path
MODEL_NAME = "openai-community/gpt2"
EXPORT_PATH = f"ov_models/{MODEL_NAME}"

# Export the pretrained GPT-2 model to OpenVINO format using Optimum
!optimum-cli export openvino --model {MODEL_NAME} --task text-generation {EXPORT_PATH}


Create 'assets' directory required for Spark NLP compatibility

In [3]:
!mkdir {EXPORT_PATH}/assets

In [4]:
! mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.json {EXPORT_PATH}/*.txt
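
The two shell cells above can also be written in plain Python, which is handy outside Colab or on systems where `mv -t` is unavailable. This is a sketch only: `move_tokenizer_assets` is a hypothetical helper, and it assumes, as the shell commands do, that every top-level `*.json` and `*.txt` file in the export directory is a tokenizer or config asset.

```python
import shutil
from pathlib import Path
from typing import List


def move_tokenizer_assets(export_path: str) -> List[str]:
    """Move every top-level *.json and *.txt file into <export_path>/assets."""
    root = Path(export_path)
    assets = root / "assets"
    assets.mkdir(exist_ok=True)  # unlike plain `mkdir`, this does not fail on re-runs

    moved = []
    for pattern in ("*.json", "*.txt"):
        # sorted() materializes the glob before any file is moved.
        for src in sorted(root.glob(pattern)):
            shutil.move(str(src), str(assets / src.name))
            moved.append(src.name)
    return moved
```

In the notebook this would be called as `move_tokenizer_assets(EXPORT_PATH)`; the OpenVINO `.xml` and `.bin` files are left where they are.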

In [5]:
import json

# Load the token-to-id vocabulary from the JSON file
with open(f"{EXPORT_PATH}/assets/vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# Write the vocabulary tokens to vocab.txt, one token per line
with open(f"{EXPORT_PATH}/assets/vocab.txt", "w", encoding="utf-8") as f:
    for token in vocab.keys():
        print(token, file=f)
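
The cell above writes tokens in the order they appear in `vocab.json`. If the line order is meant to mirror token ids (an assumption here, not something this notebook states), sorting by id makes that explicit. A minimal sketch with a hypothetical helper `write_vocab_txt`:

```python
import json


def write_vocab_txt(vocab_json_path: str, vocab_txt_path: str) -> int:
    """Write one token per line, ordered by token id; return the token count."""
    with open(vocab_json_path, encoding="utf-8") as f:
        vocab = json.load(f)  # maps token string -> integer id

    # Sort by id so the line order does not depend on the JSON key order.
    tokens = sorted(vocab, key=vocab.get)

    with open(vocab_txt_path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")
    return len(tokens)
```

For a `vocab.json` that is already stored in id order, this produces the same file as the cell above.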

In [6]:
!ls -l {EXPORT_PATH}/assets

total 5120
-rw-r--r-- 1 root root     936 Mar  5 23:43 config.json
-rw-r--r-- 1 root root     119 Mar  5 23:43 generation_config.json
-rw-r--r-- 1 root root  456318 Mar  5 23:43 merges.txt
-rw-r--r-- 1 root root      99 Mar  5 23:43 special_tokens_map.json
-rw-r--r-- 1 root root     475 Mar  5 23:43 tokenizer_config.json
-rw-r--r-- 1 root root 3557680 Mar  5 23:43 tokenizer.json
-rw-r--r-- 1 root root  798156 Mar  5 23:43 vocab.json
-rw-r--r-- 1 root root  406992 Mar  5 23:44 vocab.txt


## 2. Import and Save GPT2 in Spark NLP

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script
- However, we need to upgrade Spark to a more recent version to use this annotator.

In [7]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash
! pip install -q -U pyspark

Installing PySpark 3.2.3 and Spark NLP 5.5.3
setup Colab for PySpark 3.2.3 and Spark NLP 5.5.3

Let's start Spark with Spark NLP included via our simple `start()` function

In [8]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 5.5.3
Apache Spark version: 3.5.5


- Let's use the `loadSavedModel` function in `GPT2Transformer`, which allows us to load the OpenVINO model.
- Most params will be set automatically. They can also be set later, after loading the model in `GPT2Transformer` at runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
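
Before calling `loadSavedModel`, it can help to confirm that the export directory has the layout built in section 1. The sketch below is a hypothetical pre-flight check for a local path. The file names `openvino_model.xml` and `openvino_model.bin` are what `optimum-cli` typically writes, and treating them together with `assets/vocab.txt` and `assets/merges.txt` as required is an assumption, not a documented Spark NLP contract.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed layout: OpenVINO IR files at the top level, tokenizer assets in assets/.
ASSUMED_REQUIRED_FILES = (
    "openvino_model.xml",
    "openvino_model.bin",
    "assets/vocab.txt",
    "assets/merges.txt",
)


def missing_export_files(
    export_path: str,
    required: Iterable[str] = ASSUMED_REQUIRED_FILES,
) -> List[str]:
    """Return the required relative paths that do not exist under export_path."""
    root = Path(export_path)
    return [rel for rel in required if not (root / rel).is_file()]
```

In the notebook, `missing_export_files(EXPORT_PATH)` would be expected to return an empty list; anything else points at a step in section 1 that did not complete.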

In [9]:
from sparknlp.annotator import GPT2Transformer

gpt2 = GPT2Transformer.loadSavedModel(EXPORT_PATH, spark)\
  .setInputCols(["document"])\
  .setMaxOutputLength(50)\
  .setDoSample(True)\
  .setTopK(50)\
  .setTemperature(0)\
  .setBatchSize(5)\
  .setNoRepeatNgramSize(3)\
  .setOutputCol("generation")

Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [10]:
gpt2.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [11]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your OpenVINO GPT2 model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [12]:
! ls -l {MODEL_NAME}_spark_nlp

total 486576
drwxr-xr-x 4 root root      4096 Mar  5 23:47 fields
-rw-r--r-- 1 root root 498237785 Mar  5 23:47 gpt2_openvino
drwxr-xr-x 2 root root      4096 Mar  5 23:47 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GPT2 model 😊

In [13]:
from sparknlp.base import DocumentAssembler
from pyspark.ml import Pipeline

# Sample text for text generation
test_data = spark.createDataFrame([
    ["Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
     "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness " +
     "of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
     "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
     "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
     "pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens " +
     "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
     "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
     "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
     "learning for NLP, we release our dataset, pre-trained models, and code."]
]).toDF("text")

# Assemble the document from text
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load GPT-2 model for text generation
gpt2 = GPT2Transformer.load(f"{MODEL_NAME}_spark_nlp") \
    .setInputCols(["document"]) \
    .setOutputCol("generation") \
    .setMaxOutputLength(50) \
    .setDoSample(True) \
    .setTopK(50) \
    .setTemperature(0) \
    .setBatchSize(5) \
    .setNoRepeatNgramSize(3)

# Define the NLP pipeline
pipeline = Pipeline().setStages([
    document_assembler,
    gpt2
])

# Run the pipeline on test data
result = pipeline.fit(test_data).transform(test_data)

# Show the generated text
result.select("generation.result").show(truncate=False)
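
`show()` only prints the result. To work with the generated text in Python, the rows can be collected to the driver and flattened. This is a minimal sketch: `collect_generations` is a hypothetical helper, and it assumes each row exposes the `result` array by name, as PySpark `Row` objects do.

```python
from typing import Any, Iterable, List


def collect_generations(rows: Iterable[Any], column: str = "result") -> List[str]:
    """Flatten rows whose `column` holds a list of strings into one flat list."""
    texts: List[str] = []
    for row in rows:
        # Each row carries an array, since one document can yield several annotations.
        texts.extend(row[column])
    return texts
```

With the pipeline above, this would be used as `texts = collect_generations(result.select("generation.result").collect())`. Collecting pulls all rows to the driver, so it only suits small result sets.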



That's it! You can now go wild and use hundreds of GPT2 models from HuggingFace 🤗 in Spark NLP 🚀
