![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_T5.ipynb)

# Import OpenVINO Llama2 models from HuggingFace 🤗 into Spark NLP 🚀

This notebook walks through optimizing and importing Llama2 models from HuggingFace for use in Spark NLP with the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html). The focus is on converting the model to the OpenVINO format and applying precision optimizations (INT8 and INT4) to improve performance and efficiency on CPU platforms using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/inference).

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance CPU inference for models. Please make sure you have upgraded to the latest Spark NLP release.
- Model quantization is computationally expensive, so we recommend a runtime with more than 32GB of memory for exporting the quantized model from HuggingFace.
- You can import Llama models via `LlamaModel`. These models are usually under the `Text Generation` category and have `Llama2` in their labels.
- Reference: [LlamaModel](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
- Some [example models](https://huggingface.co/models?search=Llama2)
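
Before starting the export, it can help to check how much memory the runtime has. The snippet below is a minimal sketch that assumes a Unix-like runtime (such as Colab) where `os.sysconf` is available; the helper name `total_memory_gib` and the `RECOMMENDED_GIB` constant are illustrative and not part of Spark NLP or Optimum Intel.

```python
import os


def total_memory_gib():
    """Return total physical memory in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing or these keys are unsupported on this platform.
        return None
    return page_size * page_count / 1024**3


RECOMMENDED_GIB = 32  # threshold taken from the recommendation above

mem = total_memory_gib()
if mem is None:
    print("Could not determine total memory on this platform.")
elif mem < RECOMMENDED_GIB:
    print(f"Only {mem:.1f} GiB of RAM detected; the quantized export may run out of memory.")
else:
    print(f"{mem:.1f} GiB of RAM detected.")
```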

## 1. Export and Save the HuggingFace model

- Let's install the `transformers` and `openvino` packages along with their dependencies. Spark NLP itself does not need `openvino` installed; we only need it here to load and export the model from HuggingFace.
- We pin `transformers` to version `4.41.2`. Newer releases may also work, but this is the version that has been tested successfully.

In [1]:
!pip install -q --upgrade transformers==4.41.2
!pip install -q --upgrade openvino==2024.1
!pip install -q --upgrade optimum-intel
!pip install -q --upgrade nncf
!pip install -q --upgrade huggingface_hub
!pip install -q --upgrade onnx==1.15.0
!pip install -q --upgrade torch==2.2.1
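
To confirm that the environment matches the tested versions, something along these lines could be run after the install cell. It is a sketch using only the standard library; the `PINNED` dictionary and the `installed_versions` helper are illustrative names, and local build suffixes such as `+cu121` are stripped before comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell above.
PINNED = {"transformers": "4.41.2", "onnx": "1.15.0", "torch": "2.2.1"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(PINNED).items():
    # Drop a local build suffix such as "+cu121" before comparing.
    base = installed.split("+")[0] if installed else None
    status = "ok" if base == PINNED[name] else "differs from tested version"
    print(f"{name}: installed={installed}, tested={PINNED[name]} ({status})")
```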


In [2]:
from huggingface_hub import notebook_login
notebook_login()


[Optimum Intel](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) is the interface between the Transformers library and the various model optimization and acceleration tools provided by Intel. HuggingFace models loaded with optimum-intel are automatically optimized for OpenVINO, while being compatible with the Transformers API. It also offers the ability to perform weight compression during export.
- To load a HuggingFace model directly for inference/export, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. We can use this to import and export OpenVINO models with `from_pretrained` and `save_pretrained`.
- By setting `export=True`, the source model is converted to OpenVINO IR format on the fly.
- We'll use [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from HuggingFace as an example.
- In addition to the `LlamaModel`, we also need to save the tokenizer. This applies to every model: the tokenizer files are the assets needed for tokenization inside Spark NLP.

### Option 1: Exporting to OpenVINO IR in INT8 Precision

Passing `load_in_8bit=True` applies 8-bit quantization to the model weights.

In [3]:
from optimum.intel import OVModelForCausalLM
from transformers import LlamaTokenizer, LlamaConfig

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
EXPORT_PATH = f"./ov_models/int8/{MODEL_NAME}"

ov_model = OVModelForCausalLM.from_pretrained(MODEL_NAME, export=True, load_in_8bit=True)
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
config = LlamaConfig.from_pretrained(MODEL_NAME)

# Save the OpenVINO model
ov_model.save_pretrained(EXPORT_PATH)
tokenizer.save_pretrained(EXPORT_PATH)
config.save_pretrained(EXPORT_PATH)

Framework not specified. Using pt to export the model.


Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
	- use_cache -> True


INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (226 / 226)            │ 100% (226 / 226)                       │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


Compiling the model to CPU ...
Configuration saved in ./ov_models/int8/meta-llama/Llama-2-7b-chat-hf/openvino_config.json
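
To build some intuition for what 8-bit weight quantization does, here is a toy sketch in plain Python. It uses a generic affine (scale plus zero point) scheme over a single list of numbers; it is not NNCF's actual algorithm, which differs in its details, and the function names are made up for illustration.

```python
def quantize_uint8(weights):
    """Map floats to integers in [0, 255] using one scale and one zero point."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all weights are equal
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the 8-bit codes."""
    return [(v - zero_point) * scale for v in q]


weights = [-0.51, -0.02, 0.0, 0.13, 0.76]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", q)
print(f"scale={scale:.5f}, zero_point={zero_point}, max_error={max_error:.5f}")
```

Each weight is stored as one byte instead of two or four, at the cost of a rounding error on the order of half the scale.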


### Option 2: Exporting to OpenVINO IR in INT4 Precision

Alternatively, Optimum Intel also provides [4-bit weight compression](https://huggingface.co/docs/optimum/intel/optimization_ov#4-bit) through the `OVWeightQuantizationConfig` class, which controls the weight quantization parameters. The `ratio` parameter controls the split between 4-bit and 8-bit quantization: if set to 0.8, roughly 80% of the eligible layer weights are quantized to int4 while the remaining 20% are quantized to int8.

In [4]:
from optimum.intel.openvino import OVWeightQuantizationConfig, OVModelForCausalLM
from transformers import LlamaTokenizer, LlamaConfig

MODEL_NAME = 'meta-llama/Llama-2-7b-chat-hf'
EXPORT_PATH = f"./ov_models/int4/{MODEL_NAME}"
q_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.8)

ov_model = OVModelForCausalLM.from_pretrained(MODEL_NAME, export=True, quantization_config=q_config)
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
config = LlamaConfig.from_pretrained(MODEL_NAME)

# Save the OpenVINO model
ov_model.save_pretrained(EXPORT_PATH)
tokenizer.save_pretrained(EXPORT_PATH)
config.save_pretrained(EXPORT_PATH)

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Framework not specified. Using pt to export the model.


Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
	- use_cache -> True


INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 23% (58 / 226)              │ 20% (56 / 224)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 77% (168 / 226)             │ 80% (168 / 224)                        │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙


Configuration saved in ./ov_models/int4/meta-llama/Llama-2-7b-chat-hf/openvino_config.json
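
The bitwidth table above is easy to misread. Going by its column headers, the percentages refer to parameters while the numbers in parentheses are layer counts. The short calculation below, using only figures copied from that table, suggests that 75% of the ratio-defining layers hold the 80% of parameters that went to int4, and that two layers sit outside the ratio-defining set and stay at 8 bits. Which layers those are is not stated in the output.

```python
# Layer counts copied from the NNCF table above: bits -> (layers at this width, total layers).
all_layers = {8: (58, 226), 4: (168, 226)}
ratio_defining = {8: (56, 224), 4: (168, 224)}


def layer_share(counts):
    """Fraction of layers stored at each bit width."""
    return {bits: n / total for bits, (n, total) in counts.items()}


int4_layer_share = layer_share(ratio_defining)[4]
layers_outside_ratio = all_layers[8][0] - ratio_defining[8][0]

print(f"int4 share of ratio-defining layers: {int4_layer_share:.2%}")
print(f"int4 share of all layers: {layer_share(all_layers)[4]:.2%}")
print(f"layers outside the ratio-defining set: {layers_outside_ratio}")
```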


Once the model export and quantization are complete, copy the model assets needed for tokenization in Spark NLP to the `assets` directory.

In [5]:
!mkdir {EXPORT_PATH}/assets
!cp {EXPORT_PATH}/tokenizer.model {EXPORT_PATH}/assets/
!cp {EXPORT_PATH}/config.json {EXPORT_PATH}/assets/
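
The shell cell above fails on `mkdir` if it is run a second time. For a re-runnable variant, the same copy can be sketched in Python with the standard library. The helper name `copy_tokenizer_assets` is hypothetical; it copies the same two files the shell commands copy.

```python
import shutil
from pathlib import Path


def copy_tokenizer_assets(export_path, names=("tokenizer.model", "config.json")):
    """Copy the listed files from export_path into export_path/assets."""
    export_dir = Path(export_path)
    assets_dir = export_dir / "assets"
    # exist_ok=True makes the step safe to repeat.
    assets_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = export_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"expected {source} from the export step")
        shutil.copy2(source, assets_dir / name)
        copied.append(assets_dir / name)
    return copied
```

In the notebook this would be called as `copy_tokenizer_assets(EXPORT_PATH)`.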

Let's have a look inside the export directory and its `assets` subdirectory to see what we are dealing with:

In [6]:
!ls -l {EXPORT_PATH}

total 4141212
drwxr-xr-x 2 root root       4096 Jun  6 16:20 assets
-rw-r--r-- 1 root root        732 Jun  6 16:14 config.json
-rw-r--r-- 1 root root        183 Jun  6 16:14 generation_config.json
-rw-r--r-- 1 root root        449 Jun  6 16:14 openvino_config.json
-rw-r--r-- 1 root root 4236905793 Jun  6 16:14 openvino_model.bin
-rw-r--r-- 1 root root    3159230 Jun  6 16:14 openvino_model.xml
-rw-r--r-- 1 root root        414 Jun  6 16:14 special_tokens_map.json
-rw-r--r-- 1 root root       1830 Jun  6 16:14 tokenizer_config.json
-rw-r--r-- 1 root root     499723 Jun  6 16:14 tokenizer.model
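
As a rough sanity check, the size of `openvino_model.bin` can be compared with a back-of-the-envelope estimate. The listing appears to be the INT4 export, since `EXPORT_PATH` was last set in Option 2. The sketch below rests on several assumptions that are not stated in the notebook: a parameter count of about 6.74 billion for Llama-2-7B, the rounded 77% / 23% split from the bitwidth table, and one 16-bit scale per group of 128 int4 weights.

```python
PARAMS = 6.74e9        # assumption: approximate Llama-2-7B parameter count
INT4_FRACTION = 0.77   # rounded value from the "% all parameters" column
INT8_FRACTION = 0.23   # rounded value from the "% all parameters" column
GROUP_SIZE = 128       # from OVWeightQuantizationConfig(group_size=128)
SCALE_BYTES = 2        # assumption: one 16-bit scale per int4 group

# int4 weights take half a byte each, int8 weights take one byte each.
weight_bytes = PARAMS * (INT4_FRACTION * 0.5 + INT8_FRACTION * 1.0)
scale_bytes = PARAMS * INT4_FRACTION / GROUP_SIZE * SCALE_BYTES
estimate = weight_bytes + scale_bytes

observed = 4_236_905_793  # openvino_model.bin size from the listing above
print(f"estimated {estimate / 1e9:.2f} GB vs observed {observed / 1e9:.2f} GB")
```

The estimate ignores int8 scales, zero points, and any tensors left unquantized, so only approximate agreement should be expected.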


In [7]:
!ls -l {EXPORT_PATH}/assets

total 496
-rw-r--r-- 1 root root    732 Jun  6 17:32 config.json
-rw-r--r-- 1 root root 499723 Jun  6 17:32 tokenizer.model
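
Before handing the directory to Spark NLP, a quick check that the expected files are in place can catch a failed export early. This is a sketch: the `EXPECTED_FILES` tuple is an assumption based on the two listings above, not an official list of what Spark NLP requires, and `missing_export_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout, taken from the directory listings above.
EXPECTED_FILES = (
    "openvino_model.xml",
    "openvino_model.bin",
    "assets/tokenizer.model",
    "assets/config.json",
)


def missing_export_files(export_path, expected=EXPECTED_FILES):
    """Return the expected files that are not present under export_path."""
    root = Path(export_path)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_export_files("./ov_models/int4/meta-llama/Llama-2-7b-chat-hf")
print("missing files:", missing if missing else "none")
```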


## 2. Import and Save Llama2 in Spark NLP

- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `LLAMA2Transformer`, which allows us to load the OpenVINO model.
- Most params will be set automatically. They can also be set later, after loading the model in `LLAMA2Transformer` at runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [10]:
from sparknlp.annotator import *

llama2 = LLAMA2Transformer \
    .loadSavedModel(EXPORT_PATH, spark) \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setTopK(50) \
    .setInputCols(["document"]) \
    .setOutputCol("generation")
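
The cell above sets `setDoSample(False)` together with `setTopK(50)`. As a toy illustration of the difference between greedy decoding and top-k sampling, consider the sketch below. It is not Spark NLP's implementation, and the token scores are invented. Assuming the usual semantics of these options, top-k only comes into play when sampling is enabled, so with sampling disabled the highest-scoring token is always picked.

```python
import random


def greedy_pick(scores):
    """Return the highest-scoring token (decoding without sampling)."""
    return max(scores, key=scores.get)


def top_k_sample(scores, k, rng):
    """Sample one token from the k highest-scoring tokens, weighted by score."""
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [score for _, score in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented next-token scores, for illustration only.
scores = {"benchmarks": 0.55, "tasks": 0.25, "datasets": 0.15, "bananas": 0.05}

print("greedy:", greedy_pick(scores))
print("top-2 sample:", top_k_sample(scores, k=2, rng=random.Random(0)))
```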

Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [11]:
llama2.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")
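
Note that `MODEL_NAME` contains a slash, so the save path above resolves to a nested directory rather than a single folder. The sketch below shows how the path splits. The flat variant at the end is a hypothetical alternative, not something the notebook uses.

```python
from pathlib import PurePosixPath

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

spark_nlp_path = PurePosixPath(f"{MODEL_NAME}_spark_nlp")
print(spark_nlp_path.parent)  # meta-llama
print(spark_nlp_path.name)    # Llama-2-7b-chat-hf_spark_nlp

# Hypothetical alternative: a flat directory name without the organisation folder.
flat_name = MODEL_NAME.replace("/", "_") + "_spark_nlp"
print(flat_name)              # meta-llama_Llama-2-7b-chat-hf_spark_nlp
```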

Let's clean up stuff we don't need anymore

In [12]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your OpenVINO Llama2 model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [13]:
! ls -l {MODEL_NAME}_spark_nlp

total 4141828
drwxr-xr-x 3 root root       4096 Jun  6 16:35 fields
-rw-r--r-- 1 root root 4240712291 Jun  6 16:36 llama2_openvino
-rw-r--r-- 1 root root     499723 Jun  6 16:36 llama2_spp
drwxr-xr-x 2 root root       4096 Jun  6 16:35 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny Llama2 model 😊

In [14]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

test_data = spark.createDataFrame([
    ["Llama 2 outperforms other open language models on many external benchmarks,"]
]).toDF("text")


document_assembler = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

llama2 = LLAMA2Transformer.load(f"{MODEL_NAME}_spark_nlp") \
  .setInputCols(["document"]) \
  .setOutputCol("generation")

pipeline = Pipeline().setStages([document_assembler, llama2])

result = pipeline.fit(test_data).transform(test_data)
result.select("generation.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Llama 2 outperforms other ope

That's it! You can now go wild and use hundreds of Llama2 models from HuggingFace 🤗 in Spark NLP 🚀
