![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/23.0.AutoGGUF_in_SparkNLP.ipynb)

## Colab Setup

In [None]:
!pip install -q pyspark==3.5.0 spark-nlp

In [None]:
import sparknlp

spark = sparknlp.start(gpu=True)

from sparknlp.base import *
from sparknlp.annotator import *

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.5.3
Apache Spark version: 3.5.0


# Import llama.cpp 🦙 models into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- llama.cpp support was introduced in `Spark NLP 5.5.0`, enabling quantized LLM inference on a wide range of devices. Please make sure you have upgraded to the latest Spark NLP release.
- You need to provide your own `.gguf` model files. These can also be models downloaded from [Hugging Face Models](https://huggingface.co/models?library=gguf).

## Download a GGUF Model

Let's download a GGUF model to test it out. For this, we will use [bartowski/Phi-3.5-mini-instruct-GGUF](https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF). It is a 3.8B parameter model that is also available in 4-bit quantization.

We can download the model by selecting the `Q4_K_M` GGUF file from the "Files and versions" tab.

Once downloaded, we can directly import this model into Spark NLP!

In [None]:
EXPORT_PATH = "Phi-3.5-mini-instruct-Q4_K_M.gguf"
! wget "https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/resolve/main/Phi-3.5-mini-instruct-Q4_K_M.gguf?download=true" -O  {EXPORT_PATH}

--2025-04-02 16:44:39--  https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/resolve/main/Phi-3.5-mini-instruct-Q4_K_M.gguf?download=true
Resolving huggingface.co (huggingface.co)... 3.165.160.59, 3.165.160.11, 3.165.160.12, ...
Connecting to huggingface.co (huggingface.co)|3.165.160.59|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/cd/20/cd204ca23871eaf6393e5d738941cb2934512acd406d6a48c70a1ed50034800c/e4165e3a71af97f1b4820da61079826d8752a2088e313af0c7d346796c38eff5?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Phi-3.5-mini-instruct-Q4_K_M.gguf%3B+filename%3D%22Phi-3.5-mini-instruct-Q4_K_M.gguf%22%3B&Expires=1743615879&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0MzYxNTg3OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2NkLzIwL2NkMjA0Y2EyMzg3MWVhZjYzOTNlNWQ3Mzg5NDFjYjI5MzQ1MTJhY2Q0MDZkNmE0OGM3MGExZWQ1MDAzNDgwMGMvZTQxNjVlM2E3MWFmOTdmM
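
Before importing the file, it can be worth checking that the download really is a GGUF file and not, for example, an HTML error page saved under the same name. The sketch below relies on the GGUF layout, where the file starts with the 4-byte magic `GGUF` followed by a 32-bit format version. The helper name `read_gguf_version` is hypothetical and not part of Spark NLP, and the code assumes the common little-endian variant of the format.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF.

    Hypothetical helper for illustration; assumes a little-endian GGUF file.
    """
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte unsigned version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Example call (the file name is the one downloaded above):
# read_gguf_version("Phi-3.5-mini-instruct-Q4_K_M.gguf")
```

If the check raises, the download most likely failed or the URL pointed at something other than the model file.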

In [None]:
from sparknlp.annotator import *

# Import the GGUF file and set inference parameters (they can be changed again after loading)
autoGGUFModel = (
    AutoGGUFModel.loadSavedModel(EXPORT_PATH, spark)
    .setInputCols("document")
    .setOutputCol("completions")
    .setBatchSize(4)
    .setNPredict(20)
    .setNGpuLayers(99)
)

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function.

In [None]:
autoGGUFModel.write().overwrite().save(f"Phi-3.5-mini-instruct-Q4_K_M_spark_nlp")

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your GGUF model, loaded and saved by Spark NLP 🚀

In [None]:
! ls -l Phi-3.5-mini-instruct-Q4_K_M_spark_nlp

total 2337152
drwxr-xr-x 2 root root       4096 Apr  2 16:45 metadata
-rwxr-xr-x 1 root root 2393232672 Apr  2 16:45 Phi-3.5-mini-instruct-Q4_K_M.gguf


🔴 Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GGUF model 😊

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

auto_gguf_model = AutoGGUFModel.load("Phi-3.5-mini-instruct-Q4_K_M_spark_nlp")\
    .setInputCols("document")\
    .setOutputCol("completions")\
    .setBatchSize(4)\
    .setNPredict(100)

pipeline = Pipeline(stages=[document_assembler, auto_gguf_model])

data = spark.createDataFrame([["Patient has hypertension "]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|completions                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------

🔴 Now, we experiment with a question-based prompt

In [None]:
prompt = "What is the indication for the drug Methadone?"
data = spark.createDataFrame([[prompt]]).toDF("text")
data.show(truncate=100)

+----------------------------------------------+
|                                          text|
+----------------------------------------------+
|What is the indication for the drug Methadone?|
+----------------------------------------------+



In [None]:
results = pipeline.fit(data).transform(data).cache()
results.select("completions").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|completions                                                                                                                                                                                                                                                                                                                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
print(results.select("completions").collect()[0].completions[0].result)

Methadone is indicated primarily for the treatment of opioid dependence and for the management of severe chronic pain. Here's a brief explanation of each:

1. Opioid Dependence Treatment: Methadone is a synthetic opioid that acts on the same opioid receptors in the brain as heroin, morphine, and other opioids. It helps to prevent withdrawal symptoms and reduces cravings for


🔴 Now, we test a multiple-choice question prompt

In [None]:
prompt = """
A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus.
Which of the following is the best treatment for this patient?
A: Ampicillin
B: Ceftriaxone
C: Ciprofloxacin
D: Doxycycline
E: Nitrofurantoin
"""

data = spark.createDataFrame([[prompt]]).toDF("text")
data.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|\nA 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She st...|
+----------------------------------------------------------------------------------------------------+



In [None]:
results = pipeline.fit(data).transform(data).cache()

print(results.select("completions").collect()[0].completions[0].result)

The best treatment for a pregnant woman presenting with a urinary tract infection (UTI), specifically pyelonephritis, would be a safe antibiotic that is effective against common urinary pathogens. In this case, Nitrofurantoin (option E) is considered a first-line treatment for uncomplicated UTIs during pregnancy, provided the infection is in the early stage and there are no other contraindications
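
When running many multiple-choice prompts, it helps to pull the chosen letter out of the free-text completion. The following is a minimal sketch, assuming the model phrases its answer as "option X" as it does in the completion above. The helper `extract_option` is hypothetical, and completions that word the answer differently would need a different pattern.

```python
import re

# "option" is matched case-insensitively, the letter itself must be an uppercase A-E
OPTION_PATTERN = re.compile(r"(?i:option)\s+([A-E])\b")


def extract_option(completion):
    """Return the first answer letter mentioned as 'option X', or None if absent."""
    match = OPTION_PATTERN.search(completion)
    return match.group(1) if match else None


text = "In this case, Nitrofurantoin (option E) is considered a first-line treatment"
print(extract_option(text))  # E
```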


That's it! You can now go wild and use hundreds of GGUF models from HuggingFace 🤗 in Spark NLP 🚀

# llama.cpp 🦙 embedding models in Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- Support for llama.cpp embeddings was introduced in `Spark NLP 5.5.1`, enabling quantized LLM inference on a wide range of devices. Please make sure you have upgraded to the latest Spark NLP release.
- You need to provide your own `.gguf` model files. These can also be models downloaded from [Hugging Face Models](https://huggingface.co/models?library=gguf).

## Download a GGUF Model

Let's download a GGUF model to test it out. For this, we will use [nomic-ai/nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF). We can download the model by selecting the Q8_0 GGUF file from the "Files and versions" tab.

Once downloaded, we can directly import this model into Spark NLP!

In [None]:
EXPORT_PATH = "nomic-embed-text-v1.5.Q8_0.gguf"
! wget "https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/{EXPORT_PATH}?download=true" -O  {EXPORT_PATH}

--2025-04-02 21:04:37--  https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf?download=true
Resolving huggingface.co (huggingface.co)... 3.165.160.11, 3.165.160.12, 3.165.160.59, ...
Connecting to huggingface.co (huggingface.co)|3.165.160.11|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/19/39/19396cd98fe8b02e39b1be815db29f6b251fee34fc5d6550db0b478083fdda2f/3e24342164b3d94991ba9692fdc0dd08e3fd7362e0aacc396a9a5c54a544c3b7?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27nomic-embed-text-v1.5.Q8_0.gguf%3B+filename%3D%22nomic-embed-text-v1.5.Q8_0.gguf%22%3B&Expires=1743631478&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0MzYzMTQ3OH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzE5LzM5LzE5Mzk2Y2Q5OGZlOGIwMmUzOWIxYmU4MTVkYjI5ZjZiMjUxZmVlMzRmYzVkNjU1MGRiMGI0NzgwODNmZGRhMmYvM2UyNDM0MjE2NGIzZDk0OTkxYmE5

- Let's use the `loadSavedModel` function in `AutoGGUFEmbeddings`
- Most params will be set automatically. They can also be set later, after loading the model in `AutoGGUFEmbeddings`, at runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the GGUF model file, and the second is the SparkSession, i.e. the `spark` variable we started earlier via `sparknlp.start()`
- We can set the model to embedding mode with `setEmbedding`. Afterwards the model will return the embeddings in the Annotations.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [None]:
from sparknlp.annotator import *

# Import the GGUF file and set inference parameters (they can be changed again after loading)
autoGGUFEmbeddings = (
    AutoGGUFEmbeddings.loadSavedModel(EXPORT_PATH, spark)
    .setInputCols("document")
    .setOutputCol("embeddings")
    .setBatchSize(4)
    .setNGpuLayers(99)
)

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function.

In [None]:
autoGGUFEmbeddings.write().overwrite().save(f"nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp")

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your GGUF model, loaded and saved by Spark NLP 🚀

In [None]:
! ls -l nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp/

total 142728
drwxr-xr-x 2 root root      4096 Apr  2 21:04 metadata
-rwxr-xr-x 1 root root 146146432 Apr  2 21:04 nomic-embed-text-v1.5.Q8_0.gguf


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GGUF model 😊

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

autoGGUFEmbeddings = AutoGGUFEmbeddings.load("nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp")\
    .setInputCols("document")\
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, autoGGUFEmbeddings])

data = spark.createDataFrame([["Diagnosed with diabetes."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, 80)

+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[0.0738504, 0.03141154, -0.15088692, 0.015990928, -0.002257554, 0.019066498,...|
+--------------------------------------------------------------------------------+
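
A common next step with embeddings like these is comparing two texts by cosine similarity. The sketch below uses short toy vectors rather than real model output. In practice the vectors would be the float arrays collected from the `embeddings.embeddings` column, and the function name `cosine_similarity` is just an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two document embeddings
vec_a = [0.1, 0.3, -0.2]
vec_b = [0.2, 0.6, -0.4]
print(cosine_similarity(vec_a, vec_b))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions.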



That's it! You can now go wild and use hundreds of GGUF models from HuggingFace 🤗 in Spark NLP 🚀

# 📜 PromptAssembler with AutoGGUFModel

Let's keep in mind a few things before we start 😊

- llama.cpp support in the form of the `AutoGGUFModel` was introduced in `Spark NLP 5.5.0`, enabling quantized LLM inference on a wide range of devices. Please make sure you have upgraded to the latest Spark NLP release.
- The `PromptAssembler` was introduced in `Spark NLP 5.5.1` to enable the construction of message prompts.

This notebook will show you how you can construct your own message prompts for the AutoGGUFModel.

 🔴 Let's create a `PromptAssembler` and use it to recreate the following conversation between a chatbot and a user:

```
SYSTEM: You are a medical assistant.  
ASSISTANT: Hello there! How can I assist you today?  
USER: I have a headache. What should I do?  
```

First we need to structure our messages in our Spark DataFrame correctly. For each row, the PromptAssembler expects an array of two-tuples. The first field should be the role and the second field the message. We will call this column `messages`.
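
Chat histories are often stored as lists of dictionaries with `role` and `content` keys. A minimal sketch of converting that shape into the two-tuples described above follows. The helper `to_role_content_pairs` and the dictionary layout are assumptions for illustration, not part of Spark NLP.

```python
def to_role_content_pairs(chat):
    """Convert message dicts with 'role' and 'content' keys into (role, content) tuples."""
    return [(message["role"], message["content"]) for message in chat]


chat = [
    {"role": "system", "content": "You are a medical assistant."},
    {"role": "assistant", "content": "Hello there! How can I assist you today?"},
    {"role": "user", "content": "I have a headache. What should I do?"},
]

pairs = to_role_content_pairs(chat)
print(pairs[0])  # ('system', 'You are a medical assistant.')
```

The resulting list can then be wrapped in a single-row DataFrame in the same way as a hand-written list of tuples.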

In [None]:
messages = [
    ("system", "You are a medical assistant."),
    ("assistant", "Hello there! How can I assist you today?"),
    ("user", "I have a headache. What should I do?"),
]
df = spark.createDataFrame([[messages]]).toDF("messages")
df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------+
|messages                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|[{system, You are a medical assistant.}, {assistant, Hello there! How can I assist you today?}, {user, I have a headache. What should I do?}]|
+---------------------------------------------------------------------------------------------------------------------------------------------+



Let's create the PromptAssembler to generate the prompts. We will use the template from [llama3.1 (extracted from the gguf file)](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF?show_file_info=Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf).

By default, the `addAssistant` parameter is set to `True`, so an assistant header will be appended to the end.
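
To make the idea of an appended assistant header concrete, here is a hand-written approximation of the Llama 3 turn layout in plain Python. It is only a sketch: the real chat template also handles the BOS token, tools, and other details, and this is not the code the `PromptAssembler` runs. The function `render_llama3_prompt` and its `add_assistant` argument are hypothetical names that mirror the `addAssistant` parameter.

```python
def render_llama3_prompt(messages, add_assistant=True):
    """Render (role, content) pairs using the Llama 3 header and end-of-turn tokens."""
    parts = []
    for role, content in messages:
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"
        )
    if add_assistant:
        # Open an assistant turn so that generation continues as the assistant
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    ("system", "You are a medical assistant."),
    ("user", "I have a headache. What should I do?"),
]
print(render_llama3_prompt(example))
```

With `add_assistant=False` the prompt ends at the last `<|eot_id|>`, which leaves it up to the model to decide whose turn comes next.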

**Extracting chat_template from Hugging Face**

🔹 Hugging Face Authentication for Restricted Models

Some models (like Meta's Llama 3.1-8B) require authentication to access files.  
Follow these steps to authenticate:

1️⃣ Get your Hugging Face token from: https://huggingface.co/settings/tokens  
2️⃣ Use the token to log in:

   ```python
   from huggingface_hub import login
   login("your_hf_token_here")
   ```



In [None]:
!pip install huggingface_hub

In [None]:
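from huggingface_hub import hf_hub_download
import json

# Download only the tokenizer config, which stores the chat template
config_path = hf_hub_download("meta-llama/Llama-3.1-8B-Instruct", "tokenizer_config.json")

with open(config_path, "r") as f:
    config = json.load(f)

# Fail early instead of passing a placeholder string on as the template
template = config.get("chat_template")
if template is None:
    raise KeyError("chat_template not found in tokenizer_config.json")
print(template)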

{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message + builtin tools #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if builtin_tools is defined or tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{%- if builtin_tools is defined %}
    {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
{%- endif

In [None]:
from sparknlp.base import *

promptAssembler = (
    PromptAssembler()
    .setInputCol("messages")
    .setOutputCol("prompt")
    .setChatTemplate(template)
)

Let's see what the final prompt looks like.

In [None]:
results = promptAssembler.transform(df).cache()

print(results.select("prompt").collect()[0].prompt[0].result)

<|start_header_id|>system<|end_header_id|>

You are a medical assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello there! How can I assist you today?<|eot_id|><|start_header_id|>user<|end_header_id|>

I have a headache. What should I do?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




Now you can feed the prompt to a llama3.1 model loaded with AutoGGUFModel. Depending on your messages, you might need to adjust the chat template or system prompt in the AutoGGUFModel. For example:

```python
from sparknlp.annotator import AutoGGUFModel

autoGGUFModel = (
    AutoGGUFModel.loadSavedModel("path/to/llama3.1", spark)
    .setInputCols("prompt")
    .setOutputCol("completions")
    .setBatchSize(4)
    .setNGpuLayers(99)
    .setUseChatTemplate(False)  # Don't apply the chat template
    .setSystemPrompt(
        "Your system prompt"
    )  # Set custom system prompt if not specified in the messages. Leave empty for default.
)
```