![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/23.1.Native_GGUF_Models_in_SparkNLP.ipynb)

This notebook demonstrates how to integrate and use **GGUF models** within the Spark NLP ecosystem.

## Colab Setup

In [None]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import time
import pandas as pd
import numpy as np

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

from pyspark.sql import functions as F

spark = sparknlp.start(gpu=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 6.2.0
Apache Spark version: 3.4.4


In [None]:
!ls /root/.ivy2/jars | grep com.johnsnowlabs

com.johnsnowlabs.nlp_jsl-llamacpp-gpu-1.0.2-compat-rc1.jar
com.johnsnowlabs.nlp_jsl-openvino-cpu_2.12-0.2.0.jar
com.johnsnowlabs.nlp_spark-nlp-gpu_2.12-6.2.0.jar
com.johnsnowlabs.nlp_tensorflow-gpu_2.12-0.4.4.jar


Setting `sparknlp.start(gpu=True)` downloads all the necessary GPU jars.

Now we can leverage **GGUF** models more effectively.

Let's fetch some files for the later examples from: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/open-source-nlp/data

In [None]:
%%bash
set -e
git clone -q --no-checkout https://github.com/JohnSnowLabs/spark-nlp-workshop.git tmp
cd tmp
git sparse-checkout set reader2doc reader2table reader2image
git checkout -q
mkdir -p /content/files
mv reader2doc reader2table reader2image /content/files/
cd ..
rm -rf tmp

## GGUF: A Compact Binary Format for Efficient Model Inference

[**GGUF**](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) is a modern binary file format developed within the [**GGML**](https://github.com/ggml-org/ggml) ecosystem, the same low-level runtime that powers tools such as [*llama.cpp*](https://github.com/ggerganov/llama.cpp). It was designed to make **loading and executing large models fast and lightweight**, especially on systems that rely on CPUs instead of GPUs.


<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gguf-spec.png" alt="gguf-spec" width="1000">
</p>

<p align="center">
  <em>As the diagram shows, unlike tensor-only file formats such as safetensors, GGUF encodes both the tensors and a standardized set of metadata. Diagram by <a href="https://github.com/mishig25">@mishig25</a> (GGUF v3)
  </em>
</p>


Unlike conventional model files (for example, PyTorch `.bin` or TensorFlow `.pb`), which mostly store raw tensors, **GGUF encodes both the tensor data and rich metadata** that describes the model's structure, tokenizer, architecture, and quantization details.
This self-contained structure allows inference engines to start execution immediately without rebuilding computational graphs or importing heavy Python frameworks.
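To make the "self-contained" idea concrete, here is a minimal sketch of reading the fixed-size header that opens a GGUF file. It assumes the v2/v3 layout (4-byte magic, little-endian `uint32` version, then two `uint64` counts); the helper name `read_gguf_header` and the header bytes are made up for illustration and do not come from a real model.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(stream):
    """Read the fixed-size GGUF header: magic, version, tensor count, metadata KV count."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", stream.read(20))
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header bytes for illustration only (assumption, not a real model file).
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(io.BytesIO(fake_header))
print(header)
```

The metadata key/value pairs and tensor descriptors follow this header in the same file, which is why an engine can load a model without a separate config or tokenizer file.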


### How GGUF Makes Models Smaller and Faster

**GGUF** works by using a method called **quantization**, which means storing model weights (the numbers inside the model) with fewer bits.  
This reduces the model's **size** and makes it **run faster**, especially on CPUs.

For example, a normal 16-bit or 32-bit model might be **tens of gigabytes**, but a quantized GGUF model can be **under 5 GB** with almost the same accuracy.
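As a rough illustration of what "fewer bits" means, the sketch below quantizes a small block of weights to signed 8-bit integers with one shared scale, then restores them. It is a simplified scheme in the spirit of block-wise 8-bit quantization, not the exact layout of any GGUF type; the function names and sample weights are assumptions for this example.

```python
def quantize_block(weights):
    """Quantize one block of floats to signed 8-bit integers plus a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quants = [round(w / scale) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quants]

# Hypothetical weights, chosen only to show the round trip.
block = [0.12, -0.53, 0.98, -0.07, 0.31, 0.0, -0.88, 0.45]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(quants)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight now takes 8 bits plus a small share of the per-block scale, and the round-trip error is bounded by half the scale.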

### The Main Types of Quantization in GGUF

| Type | Bit Precision | What It Means |
|------|----------------|---------------|
| **F32 / F16 / BF16** | 32 or 16 bits | The original high-precision formats. Best for fine-tuning or maximum accuracy. |
| **Q8_K – Q2_K** | 8 to 2 bits | "K-Quant" models. These use different bit levels in different layers to save space while keeping accuracy. |
| **IQ4 / IQ3 / IQ2 / IQ1** | 4 to 1 bits | "Importance Quantization" models. These use smarter compression to keep the most important weights more precise. |

The **K-Quant** models (such as `Q4_K` and `Q2_K`) use **mixed precision across layers**, meaning GGUF adjusts how many bits each layer uses depending on its sensitivity to information loss. Important layers retain higher precision while less critical ones use fewer bits. On average, `Q4_K` models use about **4.5 bits per weight** and `Q2_K` around **2.6 bits**, allowing them to stay accurate while using far less memory.

Building on this, the **IQ (Importance Quantization)** family identifies which weights most influence the model's output and assigns them higher precision, while compressing less important ones more aggressively. Using **importance matrices** and **scaling factors**, IQ quantization achieves extreme compression, sometimes as low as **1.5 bits per weight**, with minimal performance loss. Common IQ variants include:
- **IQ4_XS** → ≈4.25 bpw (near full quality)
- **IQ3_S** → ≈3.44 bpw (balanced)
- **IQ2_S** → ≈2.5 bpw (lightweight)
- **IQ1_M** → ≈1.75 bpw (extremely compact but still suitable for embeddings and chat tasks)
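The bits-per-weight (bpw) figures above translate directly into approximate file sizes. The sketch below does that arithmetic for a hypothetical 7B-parameter model. The parameter count is an assumption for illustration, and real files differ somewhat because they also carry metadata and may mix tensor types.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model (assumption)

# Average bits per weight as quoted in the text above; F16 is 16 bits by definition.
bpw_by_type = {
    "F16": 16,
    "Q4_K": 4.5,
    "IQ4_XS": 4.25,
    "IQ3_S": 3.44,
    "Q2_K": 2.6,
    "IQ2_S": 2.5,
    "IQ1_M": 1.75,
}

for name, bpw in bpw_by_type.items():
    print(f"{name:<7} {bpw:>5} bpw -> about {estimated_size_gb(N_PARAMS, bpw):.2f} GB")
```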


# The GGUF Annotator Ecosystem in SparkNLP

Spark NLP provides **native support for these GGUF models**, which you can leverage through the following annotators:

- `AutoGGUFEmbeddings` → Generate dense sentence embeddings for RAG, clustering, or semantic search tasks.
- `AutoGGUFModel` → Perform chat, Q&A, and text completion using a compact, high-performance language model.
- `AutoGGUFVisionModel` → Extract visual features, perform image-to-text processing, and handle multimodal tasks.
- `AutoGGUFReranker` → Score candidate texts or documents for relevance, improving ranking and retrieval workflows.

These annotators integrate seamlessly into Spark NLP pipelines, allowing you to combine embeddings, generation, ranking, and vision models efficiently in distributed workflows.



> All models for these annotators are available on our [Model Hub](https://sparknlp.org/models).


## AutoGGUFEmbeddings

This is an annotator that generates dense vector embeddings from text using quantized GGUF models. It converts sentences, paragraphs, or documents into numerical representations that capture semantic meaning, making them ideal for similarity search, clustering, or retrieval tasks.

You can use essentially any of the state-of-the-art embedding models you've been working with in PyTorch, ONNX, or OpenVINO, now in GGUF format and directly inside Spark NLP.
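Embeddings are usually compared with cosine similarity. A minimal sketch, using tiny made-up vectors rather than real model output (the vector values and the helper name are assumptions for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embeddings have hundreds or thousands of dimensions.
query = [0.2, 0.1, 0.9]
doc_a = [0.25, 0.05, 0.85]   # points in nearly the same direction as the query
doc_b = [-0.7, 0.6, 0.1]     # points in a different direction

print(f"query vs doc_a: {cosine_similarity(query, doc_a):.3f}")
print(f"query vs doc_b: {cosine_similarity(query, doc_b):.3f}")
```

The same function can be applied to the vectors collected from the `embeddings` column of a pipeline result.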

Docs: https://sparknlp.org/docs/en/annotators#autoggufembeddings


In [None]:
document = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

autoGGUFEmbeddings = (
    AutoGGUFEmbeddings.pretrained("Qwen3_Embedding_0.6B_Q8_0_gguf")
    .setInputCols(["document"])
    .setOutputCol("embeddings")
)

pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])

data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("embeddings.embeddings").show(truncate=80)


+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.0123171685, -0.038418945, -0.005815847, 0.019013459, 0.019893406, -0.019...|
+--------------------------------------------------------------------------------+



### **Exploring Key Parameters for `AutoGGUFEmbeddings`**

<!--

| Parameter       | Description                                                    | Typical Use / Notes                                                                 |
| --------------- | -------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `inputCols`     | Columns containing input text/annotations                      | Set this to the previous annotation column, usually `"document"`      |
| `outputCol`     | Column to store embeddings                                     | Default is `"embeddings"`. Change if needed to avoid overwriting.                   |
| `batchSize`     | Size of each batch during processing                           | Controls memory vs speed tradeoff; larger batches use more RAM/VRAM but are faster. |
| `nUbatch`       | Physical batch size for prompt processing                      | Usually >=32 for optimal BLAS usage; often matches `batchSize`.                     |
| `poolingType`   | How token embeddings are combined into a single vector         | Options: `"mean"`, `"max"`, `"cls"` (default uses model's setting).                 |
| `nCtx`          | Size of the prompt context                                     | Typically set if you expect long inputs; defaults to model max context.             |
| `lazyAnnotator` | Whether embeddings are generated lazily in recursive pipelines | Useful if you don't want immediate computation in large pipelines.                  |
| `useMmap`       | Whether to memory-map model file for faster load               | Can speed up load; may increase page swaps if system memory is limited.             |
| `useMlock`      | Keep model fully in RAM                                        | Prevents swapping; useful on servers with enough RAM to avoid I/O bottlenecks.      |


-->


In [None]:
print(AutoGGUFEmbeddings().explainParams())

batchSize: Size of every batch (undefined)
defragmentationThreshold: Set the KV cache defragmentation threshold (undefined)
flashAttention: Whether to enable Flash Attention (undefined)
gpuSplitMode: Set how to split the model across GPUs (undefined)
grpAttnN: Set the group-attention factor (undefined)
grpAttnW: Set the group-attention width (undefined)
inputCols: previous annotations columns, if renamed (undefined)
lazyAnnotator: Whether this AnnotatorModel acts as lazy in RecursivePipelines (default: False)
mainGpu: Set the main GPU that is used for scratch and small tensors. (undefined)
nBatch: Set the logical batch size for prompt processing (must be >=32 to use BLAS) (default: 512)
nChunks: Set the maximal number of chunks to process (undefined)
nCtx: Set the size of the prompt context (default: 4096)
nGpuLayers: Set the number of layers to store in VRAM (-1 - use default) (default: 99)
nSequences: Set the number of sequences to decode (undefined)
nThreads: Set the number of threa

Understanding `poolingType`

When a model encodes a sequence of tokens, it produces **one embedding vector per token**.  
The `poolingType` parameter controls **how these token embeddings are combined into a single sentence embedding**.

Main types of Pooling

| Pooling Type | How It Works | Effect on Resulting Embedding |
|--------------|--------------|-------------------------------|
| `mean` | Averages all token embeddings element-wise | Captures the **overall semantic content** of the sentence. Produces smooth and balanced embeddings, ideal for general similarity tasks. Less sensitive to individual token variations. |
| `max` | Takes the maximum value across tokens element-wise | Emphasizes the **strongest features** from the sentence. Highlights key tokens and dominant signals, but may overrepresent rare or extreme token values, possibly exaggerating differences. |
| `cls` | Uses the embedding of the `[CLS]` token (or first token) | Represents a **learned summary** of the entire input based on the model's pretraining. Effective for classification tasks but might lose fine-grained information from other tokens and be less stable in clustering. |

Cause and Effect Examples

- **`poolingType = max`**  
  Each embedding dimension is determined by the strongest token feature, causing strong words to dominate the vector. This can exaggerate differences in similarity comparisons between otherwise similar sentences.

- **`poolingType = cls`**  
  Utilizes only the `[CLS]` token embedding as a summary representation, reflecting the model's learned "gist" of the input. Often effective for classification but may lead to coarser embeddings and less stability for semantic clustering.

- **`poolingType = mean`**  
  Generates embeddings by averaging semantic content across all tokens, resulting in smoother similarity scores and more balanced clustering. This approach tends to reflect the sentence's overall meaning best.
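The three strategies above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic on made-up token vectors; the function name `pool_tokens` is hypothetical and is not how the annotator is implemented internally.

```python
def pool_tokens(token_vectors, pooling_type):
    """Combine per-token vectors into one sentence vector."""
    dims = range(len(token_vectors[0]))
    if pooling_type == "mean":
        # Element-wise average over all tokens.
        return [sum(v[d] for v in token_vectors) / len(token_vectors) for d in dims]
    if pooling_type == "max":
        # Element-wise maximum over all tokens.
        return [max(v[d] for v in token_vectors) for d in dims]
    if pooling_type == "cls":
        # Use the first token's vector as the summary.
        return list(token_vectors[0])
    raise ValueError(f"unsupported pooling type: {pooling_type}")

# Three toy 2-dimensional token embeddings (assumption, not model output).
tokens = [[1.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

for kind in ("mean", "max", "cls"):
    print(kind, pool_tokens(tokens, kind))
```

Note how `max` is dominated by the single large value `4.0`, while `mean` spreads its influence across all tokens.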


Why `poolingType` Matters

- **Semantic similarity tasks:** `mean` pooling typically produces the most intuitive and robust embeddings.  
- **Highlighting key tokens or features:** `max` pooling emphasizes important and dominant tokens.  
- **Model-specific summary tasks (e.g., classification):** `cls` pooling works best, especially when aligned with the model's pretraining objectives.  
- **Sequence-focused tasks:** `last` pooling can be useful when sequence-ending tokens carry crucial information.  
- **Token-level tasks:** `none` pooling keeps token embeddings separate.


Choosing the right pooling strategy impacts how sentence embeddings capture meaning and perform in downstream tasks and should be aligned with your specific NLP goals.

In [None]:
autoGGUFEmbeddings.setPoolingType("MEAN")

result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(truncate=80)


+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.012416776, -0.038989745, -0.0057717287, 0.017973736, 0.020065758, -0.019...|
+--------------------------------------------------------------------------------+



Understanding `batchSize`

- **`batchSize`**  
  Determines how many input samples (e.g., sentences or documents) are processed simultaneously in one inference pass. Larger batch sizes improve throughput by leveraging parallelism, making embedding generation faster. However, this requires more RAM or VRAM to hold the data and intermediate computations. If memory is limited, large batch sizes might cause out-of-memory errors or system swapping, which significantly slows processing. Smaller batch sizes use less memory but increase total processing time.

> Start with a batch size that fits your memory and experiment to find the best trade-off between speed and stability.

In [None]:
!wget -q -O news_category_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv

df = pd.read_csv("news_category_train.csv", on_bad_lines='skip', header=0, usecols=[1], names=["text"]).iloc[:1000]
dataframe = spark.createDataFrame(df).cache()
dataframe.show(5, truncate=80)


+--------------------------------------------------------------------------------+
|                                                                            text|
+--------------------------------------------------------------------------------+
| Short sellers, Wall Street's dwindling band of ultra cynics, are seeing gree...|
| Private investment firm Carlyle Group, which has a reputation for making wel...|
| Soaring crude prices plus worries about the economy and the outlook for earn...|
| Authorities have halted oil export flows from the main pipeline in southern ...|
| Tearaway world oil prices, toppling records and straining wallets, present a...|
+--------------------------------------------------------------------------------+
only showing top 5 rows



In [None]:
def bench_embeddings(pipeline, embeddings, df, batch_sizes):
    """Time the embedding pipeline for each batch size and return the measurements."""
    results = []
    data = df.select("text")
    n_rows = data.count()  # count once, outside the timed region
    for b in batch_sizes:
        embeddings.setBatchSize(b).setNUbatch(32)
        start = time.time()
        # collect() forces Spark to actually run the pipeline
        pipeline.fit(data).transform(data).select("embeddings").collect()
        total = time.time() - start
        per_seq = total / n_rows
        print(f"Batch Size {b:<3} | Total: {total:6.2f}s | Per seq: {per_seq:.4f}s")
        results.append((b, total, per_seq))
    return results

bench_results = bench_embeddings(pipeline, autoGGUFEmbeddings, dataframe, [1, 2, 4, 8, 16, 32, 64, 128])


Batch Size 1   | Total:  30.86s | Per seq: 0.0309s
Batch Size 2   | Total:  21.76s | Per seq: 0.0218s
Batch Size 4   | Total:  14.52s | Per seq: 0.0145s
Batch Size 8   | Total:  14.56s | Per seq: 0.0146s
Batch Size 16  | Total:  15.90s | Per seq: 0.0159s
Batch Size 32  | Total:  15.44s | Per seq: 0.0154s
Batch Size 64  | Total:  16.01s | Per seq: 0.0160s
Batch Size 128 | Total:  17.79s | Per seq: 0.0178s


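Batch sizes 4 to 8 are the sweet spot for this dataset: total time drops from 30.86s at batch size 1 to about 14.5s, and larger batches bring no further gain.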
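To read the benchmark more systematically, the timings printed above can be turned into speedups relative to batch size 1. The numbers below are copied from that recorded output; they will differ on other hardware and data.

```python
# Total seconds per batch size, copied from the benchmark output above.
timings = {1: 30.86, 2: 21.76, 4: 14.52, 8: 14.56, 16: 15.90, 32: 15.44, 64: 16.01, 128: 17.79}

baseline = timings[1]
best_batch = min(timings, key=timings.get)  # batch size with the lowest total time

for batch, seconds in timings.items():
    print(f"batch {batch:<3} | {seconds:6.2f}s | speedup vs batch 1: {baseline / seconds:.2f}x")

print(f"fastest batch size in this run: {best_batch}")
```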

Check out [HasLlamaCppInferenceProperties](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/HasLlamaCppInferenceProperties.scala), [HasLlamaCppModelProperties](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/HasLlamaCppModelProperties.scala), or refer to the llama.cpp documentation of [server.cpp](https://github.com/ggerganov/llama.cpp/tree/7d5e8777ae1d21af99d4f95be10db4870720da91/examples/server) for more information!

## AutoGGUFModel

This is an annotator that performs text generation and instruction following using lightweight GGUF-format language models.

Docs: https://sparknlp.org/docs/en/annotators#autoggufmodel

Basic document summarization using `AutoGGUFModel`

In [None]:
from sparknlp.annotator import AutoGGUFModel
from sparknlp.reader.reader2doc import Reader2Doc

reader2doc = Reader2Doc().setContentPath("/content/files/reader2doc")

auto_gguf_model = (
    AutoGGUFModel.pretrained("phi_4_mini_instruct_bf16_gguf", "en")
    .setInputCols(["document"])
    .setOutputCol("completions")
    .setSystemPrompt("You are a helpful assistant. Read the text below and write a clear, concise summary capturing the key ideas, facts, and tone.")
    .setCachePrompt(True)
    .setNPredict(200)
)

pipeline = Pipeline().setStages([
    reader2doc,
    auto_gguf_model
])

empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result = model.transform(empty_df)


phi_4_mini_instruct_bf16_gguf download started this may take some time.
Approximate size to download 5.7 GB
[OK!]


In [None]:
result.select("fileName", "completions.result").show(truncate=False)

+-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName                             |result                                                                                                                                                                                                                                                           

### **Exploring Key Parameters for `AutoGGUFModel`**

| **Parameter**         | **Description**                                             | **Typical Use / Notes**                                              |
| --------------------- | ----------------------------------------------------------- | -------------------------------------------------------------------- |
| **systemPrompt**      | Sets a global instruction or persona for the model          | Controls tone, reasoning depth, and response style                   |
| **cachePrompt**       | Caches the processed prompt so it is not reprocessed        | Speeds up requests that share a prefix, such as the system prompt    |
| **chatTemplate**      | Defines the message formatting for chat-style interactions  | Applied automatically for most chat-tuned GGUF models                |
| **useChatTemplate**   | Toggles use of the chat template                            | Set `True` for multi-turn or role-based conversations                |
| **modelAlias**        | Assigns a shorthand name to the model                       | Useful for logging and managing multiple models in pipelines         |
| **nCtx**              | Maximum token length per input sequence                     | Increase to handle longer inputs; larger values require more memory  |
| **temperature**       | Controls randomness in generation                           | Lower = deterministic; higher (0.7–1.0) = more creative outputs      |
| **topP**              | Nucleus sampling probability threshold                      | Typical range: 0.8–0.95; balances diversity and coherence            |
| **topK**              | Limits sampling to top-K likely tokens                      | Common range: 40–100; influences variability and determinism         |
| **repeatPenalty**     | Penalizes repeated phrases                                  | Values around 1.1–1.2 reduce looping or redundant output             |
| **presencePenalty**   | Discourages reuse of tokens                                 | Encourages introducing new concepts or vocabulary                    |
| **frequencyPenalty**  | Penalizes frequent token repetition                         | Similar to presence penalty; improves output diversity               |
| **stopStrings**       | Defines custom stop tokens or phrases                       | Example: `["User:", "###"]` to stop generation at a marker           |
| **nPredict**          | Sets the maximum number of tokens to generate               | Controls output length; `-1` allows full auto-completion             |
| **grammar**           | Enforces a formal grammar on the output                     | Ideal for structured outputs (e.g., JSON, code, categorical answers) |
| **removeThinkingTag** | Removes internal reasoning tags like `<think>`              | Keeps final outputs clean and display-ready                          |
| **batchSize**         | Number of sequences processed concurrently                  | Balances throughput and memory usage during inference                |
| **flashAttention**    | Enables FlashAttention optimization                         | Improves speed and efficiency on GPU-based inference                 |


The next example covers some of the common parameters.

In [None]:
sys_prompt = """You are Qwen, a helpful, intelligent, and precise AI assistant.
You provide responses that are clear, accurate, and well reasoned.

Behavior guidelines:
- Responses are concise but complete, explaining reasoning when relevant.
- Examples or analogies may be used to clarify complex ideas.
- Ambiguities are acknowledged with possible interpretations and balanced handling.
- Avoid unsupported assumptions or speculation; prioritize factual accuracy.
- Responses should be formatted neatly, using bullet points or code blocks when helpful.
"""

document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

autoGGUFModel = (
    AutoGGUFModel.pretrained("qwen3_4b_bf16_gguf", "en")
    .setInputCols(["document"])
    .setOutputCol("completions")
    .setSystemPrompt(sys_prompt)
    .setCachePrompt(True)
    # .setChatTemplate("")  # the chat template for this model is at https://huggingface.co/Qwen/Qwen3-4B?chat_template=default; it is already applied by default
    # .setUseChatTemplate(True)
    .setNPredict(-1)
    .setNCtx(32768)  # this model has a native context length (nCtx) of 32,768 tokens, or 131,072 with YaRN
    .setBatchSize(4)
    .setNUbatch(32)
)

pipeline = Pipeline().setStages([document_assembler, autoGGUFModel])

# You can disable thinking for Qwen models by adding "/no_think" to your prompt or system prompt; this gives faster responses on simpler tasks
data = spark.createDataFrame([[
    "Explain why gradient clipping is used during neural network training, and give a short example."
]]).toDF("text")


qwen3_4b_bf16_gguf download started this may take some time.
Approximate size to download 6 GB
[OK!]


In [None]:
result = pipeline.fit(data).transform(data)
print(result.select("completions.result").first().result[0])


<think>
Okay, the user is asking why gradient clipping is used in neural network training and wants a short example. Let me start by recalling what gradient clipping is. From what I remember, gradient clipping is a technique used to prevent the gradients from becoming too large during backpropagation. Large gradients can cause problems like vanishing or exploding gradients, which can make training unstable or slow.

So the main reason for using gradient clipping is to maintain the stability of the training process. When gradients are too big, they can cause the weights to update by huge amounts, leading to divergence. Clipping limits the gradient's magnitude, so the updates are more controlled.

Now, I need to explain why this is important. Maybe mention that without clipping, the model might not converge or might oscillate. Also, it helps in preventing the loss from exploding, which can make the training process difficult.

For the example part, let's think of a scenario where gradien

Understanding the `grammar` parameter

The `grammar` parameter lets you constrain model generation using a BNF-like (Backus–Naur Form) syntax definition. This is not natural-language grammar as in English; it is a formal grammar that defines which output patterns are valid (e.g., valid JSON, SQL, or lists).

Say you want the model to only output one of `"yes"` or `"no"`:

In [None]:
grammar = r"""
root ::= "yes" | "no"
"""

autoGGUFModel.setGrammar(grammar)

data = spark.createDataFrame([["Is the statement 'Water boils at 100 degrees Celsius' scientifically correct?"]]).toDF("text")

pipeline = Pipeline().setStages([document_assembler, autoGGUFModel])
result = pipeline.fit(data).transform(data)

result.select("completions.result").show()


+------+
|result|
+------+
| [yes]|
+------+
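The same pattern extends to any fixed set of labels. The sketch below builds such an alternation grammar from a Python list; `choice_grammar` is a hypothetical helper written for this example, and the escaping it applies is an assumption about what literal strings need.

```python
def choice_grammar(options):
    """Build a grammar string whose only valid outputs are the given literals."""
    def quote(text):
        # Escape backslashes and double quotes so each option stays one literal.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return "root ::= " + " | ".join(quote(option) for option in options)

labels = ["positive", "negative", "neutral"]  # example label set (assumption)
label_grammar = choice_grammar(labels)
print(label_grammar)
```

The resulting string can be passed to `setGrammar(...)` in the same way as the hand-written yes/no grammar.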



Check out [HasLlamaCppInferenceProperties](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/HasLlamaCppInferenceProperties.scala), [HasLlamaCppModelProperties](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/HasLlamaCppModelProperties.scala), or refer to the llama.cpp documentation of [server.cpp](https://github.com/ggerganov/llama.cpp/tree/7d5e8777ae1d21af99d4f95be10db4870720da91/examples/server) for more information!

## AutoGGUFVisionModel

This is an annotator that enables multimodal understanding by combining text prompts with visual input. It can analyze images and respond to natural language instructions, extracting structured information, descriptions, or summaries from visual data.

This section builds a multimodal pipeline that takes a medical document image and a guiding text prompt to extract structured information.

In [None]:
!wget -q -O prescription.png https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/healthcare-nlp/data/ocr/prescription_02.png


In [None]:
prompt = """You are an expert medical document parser. Analyze the given image of a prescription or outpatient summary and extract all relevant structured information clearly.

Identify and return the following fields where available:
- Hospital/Clinic Name
- Department
- Patient Information: Name, Age, Sex
- Identifiers: Hospital No, Episode No, Episode Date
- Doctor Details: Name, Department, Designation (if any)
- Consultation Notes or Diagnosis: summary of the patient's condition, symptoms, and relevant findings
- Treatment Plan: list all prescribed medications with full details (drug name, dosage, frequency, and duration)
- Follow-up Instructions: review timeline, tests, or other advice

Preserve the original units, abbreviations, and formatting of medicines as they appear.

Output the extracted data in JSON format, structured like this example:

{
  "hospital_name": "",
  "department": "",
  "patient": {
    "name": "",
    "age": "",
    "sex": ""
  },
  "identifiers": {
    "hospital_no": "",
    "episode_no": "",
    "episode_date": ""
  },
  "doctor": {
    "name": "",
    "department": ""
  },
  "consultation_notes": "",
  "treatment": [
    {
      "drug_name": "",
      "dosage": "",
      "frequency": "",
      "duration": ""
    }
  ],
  "follow_up": ""
}"""


In [None]:
data = ImageAssembler.loadImagesAsBytes(spark, "prescription.png")
data = data.withColumn("prompt", F.lit(prompt))

document_assembler = (
    DocumentAssembler()
    .setInputCol("prompt")
    .setOutputCol("prompt_document")
)

image_assembler = (
    ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")
)

autoGGUFVisionModel = (
    AutoGGUFVisionModel.pretrained("qwen2_vl_2b_instruct_q4_gguf")
    .setInputCols(["prompt_document", "image_assembler"])
    .setOutputCol("completions")
    .setNPredict(-1)
    .setBatchSize(1)
    .setNCtx(32768)
    .setNGpuLayers(-1)
    .setFlashAttention(True)
)

pipeline = Pipeline().setStages([
    document_assembler,
    image_assembler,
    autoGGUFVisionModel
])


qwen2_vl_2b_instruct_q4_gguf download started this may take some time.
Approximate size to download 1.5 GB
[OK!]


In [None]:
result = pipeline.fit(data).transform(data)
print(result.select("completions.result").first().result[0])


```json
[
  {
    "hospital_name": "Department of Rheumatology",
    "department": "Rheumatology",
    "patient": {
      "name": "Ms Rukhsana Shaheen",
      "age": "56 yrs/Female"
    },
    "identifiers": {
      "hospital_no": "MH005990453",
      "episode_no": "02/07/2021 08:31AM",
      "episode_date": "02/07/2021"
    },
    "doctor": {
      "name": "DR Darshan Singh Bhakuni",
      "department": "Rheumatology"
    },
    "consultation_notes": "Video consultation done. Known case of systemic lupus erythematosus and scleroderma overlap with interstitial lung disease on medication.",
    "treatment": [
      {
        "drug_name": "Tab Sildinafil Citrate",
        "dosage": "0.5 mg twice a day after meals",
        "frequency": "",
        "duration": ""
      }
    ],
    "follow_up": "Review after 4 weeks."
]
```
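Model output like this is text, not guaranteed JSON: it may be wrapped in a Markdown code fence, and the recorded output above closes the list without closing the inner object, so it would not parse as-is. A minimal, defensive parsing sketch follows; the helper name and sample values are made up for illustration.

```python
import json

FENCE = "`" * 3  # built programmatically so this example does not nest code fences

def parse_model_json(text):
    """Strip an optional Markdown code fence and parse JSON; return None if invalid."""
    body = text.strip()
    if body.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional language tag).
        body = body.split("\n", 1)[1] if "\n" in body else ""
    if body.rstrip().endswith(FENCE):
        body = body.rstrip()[: -len(FENCE)]
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None

# Synthetic examples (assumption): one well-formed reply, one malformed reply.
good = f'{FENCE}json\n{{"hospital_no": "X123", "follow_up": "Review after 4 weeks."}}\n{FENCE}'
bad = '[{"hospital_no": "X123"]'

print(parse_model_json(good))
print(parse_model_json(bad))
```

When parsing fails, common options are to retry the generation or to constrain the output with the `grammar` parameter described earlier.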


## AutoGGUFReranker

This is an annotator that reorders documents based on how relevant they are to a given query. It assigns each document a `relevance_score`, allowing the most relevant passages to appear first.

This is particularly useful in *RAG systems*, where reranking acts as a second-pass filter that refines search results. After the initial retrieval step (like vector similarity), the reranker reevaluates each document in the context of your query and reorders them by relevance. This adds a bit of latency but delivers far more accurate results. It's especially useful because feeding irrelevant context to an LLM wastes tokens, increases cost, and can lead to hallucinations or incorrect answers.

Docs: https://sparknlp.org/docs/en/annotators#autoggufreranker


In [None]:
document = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

reranker = (
    AutoGGUFReranker.pretrained("bge_reranker_v2_m3_Q4_K_M")
    .setInputCols(["document"])
    .setOutputCol("reranked_documents")
    .setQuery("A man is eating pasta.")
)

finisher = (
    GGUFRankingFinisher()
    .setInputCols("reranked_documents")
    .setOutputCol("finished_reranked_documents")
    .setMinRelevanceScore(0.4) # Drops scores below 0.4
    .setMinMaxScaling(True)
)

pipeline = Pipeline().setStages([document, reranker, finisher])

data = spark.createDataFrame([
    ["A man is eating food."],
    ["A man is eating a piece of bread."],
    ["The girl is carrying a baby."],
    ["A man is riding a horse."]
]).toDF("text")


bge_reranker_v2_m3_Q4_K_M download started this may take some time.
Approximate size to download 396.7 MB
[OK!]


In [None]:
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_reranked_documents) as doc") \
  .selectExpr("doc.metadata['rank'] as rank", "doc.result as text", "doc.metadata['relevance_score'] as score") \
  .show(truncate=False)


+----+---------------------------------+------------------+
|rank|text                             |score             |
+----+---------------------------------+------------------+
|1   |A man is eating food.            |1.0               |
|2   |A man is eating a piece of bread.|0.7244979587924696|
+----+---------------------------------+------------------+
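The effect of min-max scaling followed by a relevance threshold can be sketched in plain Python. This is an assumed reading of what `setMinMaxScaling(True)` and `setMinRelevanceScore(0.4)` do, including the order of the two steps; the raw scores are hypothetical and this is not the finisher's actual implementation.

```python
def min_max_scale(scores):
    """Rescale scores linearly so the lowest becomes 0.0 and the highest 1.0."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores equal: treat every document as equally (fully) relevant.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]

def rank_and_filter(texts, raw_scores, min_score):
    """Scale scores, sort by relevance, then drop entries below the threshold."""
    scaled = min_max_scale(raw_scores)
    ranked = sorted(zip(texts, scaled), key=lambda pair: pair[1], reverse=True)
    return [(rank, text, score)
            for rank, (text, score) in enumerate(ranked, start=1)
            if score >= min_score]

texts = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]
raw_scores = [4.0, 2.5, -3.0, -2.0]  # hypothetical raw reranker scores

for rank, text, score in rank_and_filter(texts, raw_scores, min_score=0.4):
    print(rank, text, round(score, 4))
```

Because scaling is relative to the current candidate set, the top document always scores 1.0, so the threshold expresses relevance relative to the best match rather than an absolute score.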



# Import your own GGUF models into Spark NLP

You can also use your own `.gguf` models with any of the annotators by calling `.loadSavedModel()` instead of `.pretrained()`.

In [None]:
!wget -c https://huggingface.co/MaziyarPanahi/gemma-2b-it-GGUF/resolve/main/gemma-2b-it.Q8_0.gguf \
     -O gemma-2b-it.Q8_0.gguf


In [None]:
AutoGGUFModel\
  .loadSavedModel("gemma-2b-it.Q8_0.gguf", spark)\
  .setInputCols(["document"])\
  .setOutputCol("completions")\
  .write().overwrite().save("gemma_2b_it_q8_0_gguf")
  # ^ Optionally save it on disk so it is easier to be moved around and also be used later via .load function


In [None]:
!ls -lh gemma_2b_it_q8_0_gguf

total 2.5G
-rw-r--r-- 1 root root 2.5G Nov  5 12:46 gemma-2b-it.Q8_0.gguf
drwxr-xr-x 2 root root 4.0K Nov  5 12:46 metadata


Inference with the saved model:

In [None]:
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

auto_gguf_model_loaded = (
    AutoGGUFModel.load("gemma_2b_it_q8_0_gguf")
    .setInputCols(["document"])
    .setOutputCol("completions")
    .setSystemPrompt("You are a story writing assistant.")
    .setCachePrompt(True)
    .setNPredict(-1)
)

pipeline = Pipeline().setStages([
    document_assembler,
    auto_gguf_model_loaded
])

prompt_df = spark.createDataFrame([["Write a story about llamas"]]).toDF("text")

# Run the pipeline on the new prompt, not on the earlier `data` DataFrame
result = pipeline.fit(prompt_df).transform(prompt_df)
print(result.select("completions.result").first().result[0])


The man, weathered and with eyes weary from a long day, sat in his humble apartment, the aroma of a simple stew filling the air. He took a bite of the stew, the flavors slowly mingling on his tongue.

He was a man of few means, but his heart was full. He had worked tirelessly all day, his hands calloused and his muscles aching. He had faced challenges that he had overcome with grit and determination.

As he chewed the stew, he reflected on his journey. He had come a long way from where he started. He had faced poverty and loss, but he had never given up on his dreams. He had always held onto the hope that he could build a better future for himself and his family.

He smiled as he took another bite of the stew, his eyes twinkling with the satisfaction of a hard-earned meal. He knew that life was never easy, but he was grateful for the simple pleasures in life, the small moments of joy that made his heart soar.

As the stew finished cooking, he looked up, a sense of contentment washing o

That's it! You can now go wild and use any GGUF model in Spark NLP 🚀