![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb)

# llama.cpp 🦙 embedding models in Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- Support for llama.cpp embeddings was introduced in `Spark NLP 5.5.1`, enabling quantized LLM inference on a wide range of devices. Please make sure you have upgraded to the latest Spark NLP release.
- You need to supply your own `.gguf` model files. These can be any of the GGUF models available on [Hugging Face Models](https://huggingface.co/models?library=gguf).

## Download a GGUF Model

Let's download a GGUF model to test it out. For this, we will use [nomic-ai/nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF). The model can be downloaded manually by selecting the Q8_0 GGUF file from the "Files and versions" tab, or fetched directly with `wget` as in the cell below.

Once downloaded, we can directly import this model into Spark NLP!

In [None]:
EXPORT_PATH = "nomic-embed-text-v1.5.Q8_0.gguf"
! wget "https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/{EXPORT_PATH}?download=true" -O  {EXPORT_PATH}

--2024-11-02 13:42:45--  https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf?download=true
Resolving huggingface.co (huggingface.co)... 3.160.39.87, 3.160.39.100, 3.160.39.99, ...
Connecting to huggingface.co (huggingface.co)|3.160.39.87|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/19/39/19396cd98fe8b02e39b1be815db29f6b251fee34fc5d6550db0b478083fdda2f/f7af6f66802f4df86eda10fe9bbcfc75c39562bed48ef6ace719a251cf1c2fdb?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27nomic-embed-text-v1.5.Q8_0.gguf%3B+filename%3D%22nomic-embed-text-v1.5.Q8_0.gguf%22%3B&Expires=1730810566&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMDgxMDU2Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzE5LzM5LzE5Mzk2Y2Q5OGZlOGIwMmUzOWIxYmU4MTVkYjI5ZjZiMjUxZmVlMzRmYzVkNjU1MGRiMGI0NzgwODNmZGRhMmYvZjdhZjZmNjY4MDJmNGRmODZlZGExMGZ
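Before importing the file, it can be worth checking that the download really is a GGUF file and not, say, an HTML error page. The sketch below is a minimal check, assuming the GGUF v2/v3 header layout (4-byte `GGUF` magic, little-endian `uint32` version, then two `uint64` counts). The helper name `read_gguf_header` is hypothetical and not part of Spark NLP.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header.

    Assumes the little-endian GGUF v2/v3 layout; hypothetical helper,
    not part of Spark NLP or llama.cpp.
    """
    with open(path, "rb") as fh:
        header = fh.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header(EXPORT_PATH)` after the download should return values that agree with what the llama.cpp loader logs when the model is first used (GGUF version, number of tensors, number of metadata key-value pairs).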

## Import and Save AutoGGUF models in Spark NLP

- Let's install and set up Spark NLP (if running in Google Colab)
- This part is pretty easy via our simple script

In [None]:
# Only execute this if you are on Google Colab
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# Start Spark with Spark NLP and GPU support enabled. If no GPU is available, remove the `gpu=True` parameter.
spark = sparknlp.start(gpu=True)
print(sparknlp.version())
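
If the same notebook has to run on machines with and without a GPU, the `gpu` flag can be chosen at runtime instead of edited by hand. The sketch below is one possible approach, assuming that finding `nvidia-smi` on the `PATH` is a good enough hint that an NVIDIA GPU is present. This is only a heuristic, and both helper names are hypothetical.

```python
import shutil


def gpu_probably_available(which=shutil.which):
    """Heuristic: treat an `nvidia-smi` binary on PATH as a sign of an NVIDIA GPU.

    `which` is injectable so the check can be exercised without real hardware.
    """
    return which("nvidia-smi") is not None


def start_spark_nlp():
    """Start Spark NLP, requesting the GPU build only when a GPU seems present."""
    # Imported lazily so the heuristic above works without Spark NLP installed.
    import sparknlp

    return sparknlp.start(gpu=gpu_probably_available())
```

With this in place, `spark = start_spark_nlp()` would replace the explicit `sparknlp.start(gpu=True)` call.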

- Let's use the `loadSavedModel` function in `AutoGGUFEmbeddings`
- Most params will be set automatically. They can also be set later after loading the model in `AutoGGUFEmbeddings` during runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the GGUF model file, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- We can set the model to embedding mode with `setEmbedding`. Afterwards the model will return the embeddings in the Annotations.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [None]:
from sparknlp.annotator import *

# Load the GGUF file, then set the input/output columns, the batch size,
# and the number of model layers to offload to the GPU
autoGGUFEmbeddings = (
    AutoGGUFEmbeddings.loadSavedModel(EXPORT_PATH, spark)
    .setInputCols("document")
    .setOutputCol("embeddings")
    .setBatchSize(4)
    .setNGpuLayers(99)
)

jsl-llama: Extracted 'libjllama.so' to '/tmp/libjllama.so'


- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [None]:
autoGGUFEmbeddings.write().overwrite().save(f"{EXPORT_PATH}_spark_nlp")

24/11/02 13:48:29 WARN TaskSetManager: Stage 0 contains a task of very large size (1073 KiB). The maximum recommended task size is 1000 KiB.


Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your GGUF model, loaded and saved by Spark NLP 🚀

In [None]:
! ls -l nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp/

total 267872
drwxr-xr-x 2 root root      4096 Nov  2 13:48 metadata
-rwxrwxr-x 1 root root 274290560 Nov  2 13:48 nomic-embed-text-v1.5.Q8_0.gguf


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GGUF model 😊

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

autoGGUFEmbeddings = AutoGGUFEmbeddings.load("nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp")

pipeline = Pipeline().setStages([document_assembler, autoGGUFEmbeddings])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, 80)

24/11/02 13:48:57 WARN SparkContext: The path /home/root/Workspace/scala/spark-nlp/examples/python/llama.cpp/nomic-embed-text-v1.5.Q8_0.gguf_spark_nlp/nomic-embed-text-v1.5.Q8_0.gguf has been added already. Overwriting of added paths is not supported in the current version.
24/11/02 13:48:57 WARN DAGScheduler: Broadcasting large task binary with size 1028.0 KiB
24/11/02 13:48:57 WARN DAGScheduler: Broadcasting large task binary with size 1028.0 KiB
24/11/02 13:48:57 WARN DAGScheduler: Broadcasting large task binary with size 1028.0 KiB
llama_model_loader: loaded meta data with 22 key-value pairs and 112 tensors from /tmp/spark-6de50aee-1059-4698-98e2-db9d68663467/userFiles-932de0e7-9a8f-41f5-9aaf-94bb7406df74/nomic-embed-text-v1.5.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: -

[WARN] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support n_gpu_layers=-1
[INFO] build info build=3534 commit="641f5dd2"
[INFO] system info n_threads=6 n_threads_batch=-1 total_threads=6 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "


llama_kv_cache_init:        CPU KV buffer size =   144.00 MiB
llama_new_context_with_model: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 23.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    23.00 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 1


[INFO] initializing slots n_slots=4
[INFO] new slot id_slot=0 n_ctx_slot=1024
[INFO] new slot id_slot=1 n_ctx_slot=1024
[INFO] new slot id_slot=2 n_ctx_slot=1024
[INFO] new slot id_slot=3 n_ctx_slot=1024
[INFO] model loaded
[INFO] chat template chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
[INFO] slot is processing task id_slot=0 id_task=0
[INFO] kv cache rm [p0, end) id_slot=0 id_task=0 p0=0


[INFO] slot released id_slot=0 id_task=0 n_ctx=4096 n_past=7 n_system_tokens=0 n_cache_tokens=0 truncated=false
[INFO] all slots are idle
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[0.046383496, 0.02353651, -0.12484242, -0.009759982, 0.05522549, -0.01701891...|
+--------------------------------------------------------------------------------+
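
Each row of the `embeddings.embeddings` column holds a list of float vectors, one per document annotation. A common next step is comparing two such vectors. The snippet below is a minimal pure-Python sketch of cosine similarity. The function name is hypothetical, and for large datasets a vectorized or Spark-native implementation would likely be preferable.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero float vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

For example, after `rows = result.select("embeddings.embeddings").collect()`, the vector for the first document is `rows[0][0][0]`, and two such vectors can be passed to `cosine_similarity`.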



                                                                                

That's it! You can now go wild and use hundreds of GGUF models from Hugging Face 🤗 in Spark NLP 🚀
