# Background

This notebook runs several examples that use llama-cpp-python, the Python bindings for llama.cpp.

# Installing the Python bindings

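In [None]:
# CMAKE_ARGS must be set on the same line as pip; set on its own line it is lost
# when that shell exits, and the CUDA build flag never reaches the build.
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python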

# High-level API

In [None]:
from llama_cpp import Llama

llm = Llama(
      model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
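
The call above returns an OpenAI-style completion dictionary rather than a plain string. Below is a minimal sketch of pulling the generated text out of it. The dictionary is a hand-written stand-in with illustrative values, not real model output, and completion_text is a hypothetical helper name.

```python
# Stand-in for the dict returned by llm(...); the values are made up for illustration.
sample_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 6, "total_tokens": 20},
}


def completion_text(output: dict) -> str:
    """Return the text of the first choice in a completion response."""
    return output["choices"][0]["text"]


print(completion_text(sample_output))
```

Because echo=True was used in the cell above, the returned text starts with the prompt itself.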

# Pulling models from Hugging Face Hub

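llama-cpp-python can download models in GGUF format directly from the Hugging Face Hub with the from_pretrained method. This requires the huggingface-hub package (pip install huggingface-hub). For example: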


In [None]:
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
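
The filename argument is a pattern rather than an exact name. Below is a small sketch of how such a pattern selects one file, assuming shell-style glob matching. The file listing is hypothetical and may not match the real repository.

```python
from fnmatch import fnmatch

# Hypothetical file listing for a GGUF repository; the real repository may differ.
repo_files = [
    "qwen2-0_5b-instruct-q4_0.gguf",
    "qwen2-0_5b-instruct-q8_0.gguf",
    "qwen2-0_5b-instruct-fp16.gguf",
    "README.md",
]

pattern = "*q8_0.gguf"
# Keep only the names that match the glob pattern.
matches = [name for name in repo_files if fnmatch(name, pattern)]
print(matches)
```

A pattern that matches more than one file is ambiguous, so it is safest to make it specific enough to select a single quantization.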

# Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using pre-registered chat formats (e.g. chatml, llama-2, gemma) or a custom chat handler object that you provide.

In [None]:
from llama_cpp import Llama
llm = Llama(
      model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf",
      chat_format="chatml"  # SmolLM-Instruct uses a ChatML-style template, as in the cells below
)
llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)
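
To make "formatting the messages into a single prompt" concrete, here is a simplified sketch of a ChatML-style renderer. The to_chatml function is a hypothetical helper written for illustration; it is not the library's own formatter, which handles more cases.

```python
def to_chatml(messages: list[dict]) -> str:
    """Render chat messages as a ChatML-style prompt (simplified sketch)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = to_chatml(messages)
print(prompt)
```

A model trained on a different template expects different delimiters, which is why chat_format should match the model being loaded.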

# JSON and JSON Schema Mode

To constrain chat responses to valid JSON only, or to a specific JSON Schema, use the response_format argument of create_chat_completion.

## JSON Mode

The following example will constrain the response to valid JSON strings only.

In [None]:
from llama_cpp import Llama
llm = Llama(model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
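
In JSON mode the message content is still a string, so it has to be parsed before use. Below is a minimal sketch, assuming the OpenAI-style response layout. The response is a hand-written stand-in, not real model output.

```python
import json

# Stand-in for the value returned by create_chat_completion; content is illustrative.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": '{"winner": "Los Angeles Dodgers", "year": 2020}',
            },
            "finish_reason": "stop",
        }
    ]
}

content = sample_response["choices"][0]["message"]["content"]
# json.loads raises json.JSONDecodeError if the content is not valid JSON.
data = json.loads(content)
print(data["winner"])
```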

## JSON Schema Mode

To constrain the response further to a specific JSON Schema, add the schema to the schema property of the response_format argument.

In [None]:
from llama_cpp import Llama
llm = Llama(model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
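
Even with a schema constraint, it can be worth checking the parsed reply before relying on it. The sketch below is a deliberately limited check covering only required keys and string-typed properties. check_required_strings is a hypothetical helper, and the reply content is illustrative.

```python
import json

schema = {
    "type": "object",
    "properties": {"team_name": {"type": "string"}},
    "required": ["team_name"],
}


def check_required_strings(payload: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means this limited check passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in payload:
            problems.append(f"missing required key: {key}")
    for key, spec in schema.get("properties", {}).items():
        is_string_field = spec.get("type") == "string"
        if key in payload and is_string_field and not isinstance(payload[key], str):
            problems.append(f"{key} should be a string")
    return problems


reply = json.loads('{"team_name": "Los Angeles Dodgers"}')  # illustrative content
print(check_required_strings(reply, schema))
```

For full JSON Schema validation a dedicated validator library would be the better choice; this sketch only shows the idea.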

# Function Calling

The high-level API supports OpenAI-compatible function and tool calling. This is possible through the chat format of the functionary pre-trained models or through the generic chatml-function-calling chat format.

In [None]:
from llama_cpp import Llama
llm = Llama(model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"

        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": [ "name", "age" ]
          }
        }
      }],
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
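
The structured result of a tool call arrives as a JSON string inside the response. Below is a minimal sketch of extracting it, assuming the OpenAI-style tool_calls layout. The response is a hand-written stand-in, and the id value is made up.

```python
import json

# Stand-in for a tool-calling response; the id and values are made up.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}

call = sample_response["choices"][0]["message"]["tool_calls"][0]
# The arguments field is a JSON string, so decode it before use.
arguments = json.loads(call["function"]["arguments"])
print(call["function"]["name"], arguments)
```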

# Speculative Decoding

llama-cpp-python supports speculative decoding, in which a draft model proposes tokens that the main model then verifies.

The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.

Just pass this as a draft model to the Llama class during initialization.

In [None]:
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
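
Prompt lookup decoding needs no second model: the draft comes from text already in the context. The sketch below illustrates the general idea on plain token ID lists. propose_draft and its ngram_size parameter are hypothetical names for illustration, and this is not the library's implementation.

```python
def propose_draft(
    tokens: list[int], ngram_size: int = 2, num_pred_tokens: int = 10
) -> list[int]:
    """Propose draft tokens by finding an earlier copy of the trailing n-gram.

    Simplified sketch of prompt lookup decoding, for illustration only.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search from right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            # Reuse the tokens that followed the earlier occurrence as the draft.
            return tokens[follow:follow + num_pred_tokens]
    return []


history = [5, 6, 7, 8, 9, 5, 6]
print(propose_draft(history, ngram_size=2, num_pred_tokens=3))
```

The main model then checks the proposed tokens and keeps only those it agrees with, so a poor draft costs speed rather than correctness.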

# Embeddings

To generate text embeddings, use create_embedding or embed. Note that you must pass embedding=True to the constructor when creating the model for these to work properly.

In [None]:
import llama_cpp

llm = llama_cpp.Llama(model_path="/kaggle/input/smollm-1.7b-instruct-gguf/gguf/fp16/1/SmolLM-1.7B-Instruct-f16.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
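
A common next step is comparing two embeddings. Below is a minimal sketch that computes cosine similarity, assuming an OpenAI-style response in which each entry holds a single vector (some models may return per-token vectors instead). The response is a stand-in with tiny made-up vectors, and cosine_similarity is a hypothetical helper.

```python
import math

# Stand-in for the create_embedding response; the vectors are tiny made-up examples.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0, 1.0]},
        {"object": "embedding", "index": 1, "embedding": [1.0, 1.0, 0.0]},
    ],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = [item["embedding"] for item in sample_response["data"]]
similarity = cosine_similarity(vectors[0], vectors[1])
print(similarity)
```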