##### Master Degree in Computer Science and Data Science for Economics

# llama.cpp

### Sergio Picascia

The main goal of [llama.cpp](https://github.com/ggml-org/llama.cpp) is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) provides simple Python bindings for llama.cpp. To install the package, run:
```
pip install llama-cpp-python
```

- To install with CUDA support, pass `-DGGML_CUDA=on` through the `CMAKE_ARGS` environment variable when installing:
```
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

- To install with Metal (MPS) support, pass `-DGGML_METAL=on` through the `CMAKE_ARGS` environment variable when installing:
```
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

In [1]:
# Install llama-cpp-python with GPU support
%pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.2.90
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu122/llama_cpp_python-0.2.90-cp312-cp312-linux_x86_64.whl (443.8 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.90)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.2.90


## Key Features of Llama

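* Diverse Sizes: The Llama family includes models of various sizes. Llama 2, for example, ranges from 7 billion to 70 billion parameters, while Llama 3.1 (the version used in this notebook) is available with 8, 70, and 405 billion parameters. This allows developers to choose a model that balances performance and computational cost for their specific use case.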

* Open-Source Nature: By releasing Llama as an open-source project, Meta has enabled a massive community of researchers and developers to build upon, fine-tune, and deploy the models for a wide range of applications.

* Architectural Foundation: The Llama models are built on the Transformer architecture, which is the standard for modern LLMs. They use a number of optimizations to improve performance and efficiency, such as Grouped-Query Attention and SwiGLU activation functions.

* Purpose: Llama models are designed for a variety of tasks, including text generation, summarization, question answering, and coding assistance. Their open nature makes them a popular choice for building custom applications and performing research.

In [3]:
from llama_cpp import Llama

RuntimeError: Failed to load shared library '/usr/local/lib/python3.12/dist-packages/llama_cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory
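
The error above most likely means that the CUDA build of the package was installed on a machine where the NVIDIA driver library (`libcuda.so.1`) is not available, for example a runtime without a GPU. A minimal sketch for checking the installation before going further; the helper name `llama_cpp_status` is hypothetical, and it assumes only that the package exposes `llama_cpp.__version__`:

```python
def llama_cpp_status() -> str:
    """Report whether llama_cpp can be imported and its native library loaded."""
    try:
        import llama_cpp  # importing also loads the compiled libllama library
    except ImportError:
        return "missing: llama-cpp-python is not installed"
    except (RuntimeError, OSError) as exc:
        # e.g. a CUDA wheel on a machine without the NVIDIA driver
        return f"broken: {exc}"
    return f"ok: llama-cpp-python {llama_cpp.__version__}"


print(llama_cpp_status())
```

If the status is `broken`, switching to a GPU runtime or reinstalling the CPU build (`pip install llama-cpp-python` without the extra index) are the two usual ways out.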

In [None]:
# Load the model
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # Hugging Face repository
    filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # model file (8-bit quantization)
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=32768,  # context window size, in tokens
    flash_attn=True,  # use flash attention
    chat_format="llama-3",  # chat template
    verbose=False,  # suppress llama.cpp log output
)

In [None]:
# Output generation
output = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": "What are the planets of the solar system?",
            },
        ],
        temperature=0.7,
)

In [None]:
output

{'id': 'chatcmpl-4febcd29-287f-45c1-be98-baa194c3f2f2',
 'object': 'chat.completion',
 'created': 1745305781,
 'model': '/Users/sergiopicascia/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/bf5b95e96dac0462e2a09145ec66cae9a3f12067/./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'The planets of our solar system, in order from the Sun, are:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nNote: Pluto was previously considered a planet, but in 2006 it was reclassified as a dwarf planet by the International Astronomical Union (IAU).\n\nHere\'s a fun way to remember the order of the planets:\n\n"Mary\'s Violet Eyes Make Jeremy Stay Up Nights"\n\nThe first letter of each word corresponds to the first letter of each planet\'s name!\n\nWould you like to know more about any of the planets?'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'us

In [None]:
output["choices"][0]["message"]

{'role': 'assistant',
 'content': 'The planets of our solar system, in order from the Sun, are:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nNote: Pluto was previously considered a planet, but in 2006 it was reclassified as a dwarf planet by the International Astronomical Union (IAU).\n\nHere\'s a fun way to remember the order of the planets:\n\n"Mary\'s Violet Eyes Make Jeremy Stay Up Nights"\n\nThe first letter of each word corresponds to the first letter of each planet\'s name!\n\nWould you like to know more about any of the planets?'}

In [None]:
output = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": "Explain more easily the previous answer.",
            },
        ],
        temperature=0.7,
)

In [None]:
# The model does not retain the context of the previous conversation, so we need to provide the context again.
output["choices"][0]["message"]

{'role': 'assistant',
 'content': "Since I didn't give an answer previously, let's start fresh.\n\nYou asked me to explain something, but I didn't receive a specific question. Could you please ask me something, and I'll do my best to provide a clear and easy-to-understand explanation?"}

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": "What are the planets of the solar system?",
    },
]

In [None]:
output = llm.create_chat_completion(
        messages=messages,
        temperature=0.7,
)

In [None]:
output["choices"][0]["message"]

{'role': 'assistant',
 'content': 'There are 8 planets in our solar system. Here they are in order from the Sun:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nNote: Pluto was previously considered a planet, but in 2006 it was reclassified as a dwarf planet by the International Astronomical Union (IAU).\n\nWould you like to know more about any of these planets?'}

In [None]:
messages.append(output["choices"][0]["message"])

In [None]:
messages.append({"role": "user", "content": "Order the planets in inverse order."})

In [None]:
output = llm.create_chat_completion(
        messages=messages,
        temperature=0.7,
)

In [None]:
output["choices"][0]["message"]

{'role': 'assistant',
 'content': 'Here are the 8 planets in our solar system in reverse order from the Sun:\n\n1. Neptune\n2. Uranus\n3. Saturn\n4. Jupiter\n5. Mars\n6. Earth\n7. Venus\n8. Mercury'}
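
The append-and-resend pattern used in the cells above can be wrapped in a small helper. This is a sketch, not part of llama-cpp-python: `ChatSession` and `fake_complete` are hypothetical names, and the model is replaced by a stand-in function so the example runs without loading any weights.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Keeps the running message list so every call sees the whole conversation."""

    def __init__(
        self,
        complete: Callable[[List[Message]], dict],
        system_prompt: str = "You are a helpful assistant.",
    ):
        self.complete = complete
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text: str) -> str:
        # Add the user turn, send the full history, then store the reply.
        self.messages.append({"role": "user", "content": user_text})
        output = self.complete(self.messages)
        reply = output["choices"][0]["message"]
        self.messages.append(reply)
        return reply["content"]


def fake_complete(messages: List[Message]) -> dict:
    """Stand-in for the model: reports how many messages it received."""
    content = f"seen {len(messages)} messages"
    return {"choices": [{"message": {"role": "assistant", "content": content}}]}


session = ChatSession(fake_complete)
print(session.ask("What are the planets of the solar system?"))  # seen 2 messages
print(session.ask("Order the planets in inverse order."))  # seen 4 messages
```

With the model loaded as above, the stand-in could be replaced by `lambda m: llm.create_chat_completion(messages=m, temperature=0.7)`.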

In [None]:
# Number of tokens used
output["usage"]

{'prompt_tokens': 144, 'completion_tokens': 49, 'total_tokens': 193}
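
Because the whole history is sent again on every turn, the prompt grows with each exchange and must stay within `n_ctx`. A rough sketch of the arithmetic, using the usage dictionary above and the `n_ctx=32768` set when loading the model; the helper name `remaining_context` is hypothetical:

```python
def remaining_context(usage: dict, n_ctx: int) -> int:
    """Tokens left in the context window after the last exchange."""
    return n_ctx - usage["total_tokens"]


usage = {"prompt_tokens": 144, "completion_tokens": 49, "total_tokens": 193}
print(remaining_context(usage, n_ctx=32768))  # 32575
```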

In [None]:
# How the model tokenizes the input
llm.tokenize("What are the planets of the solar system?".encode("utf-8"))

[128000, 3923, 527, 279, 33975, 315, 279, 13238, 1887, 30]

In [None]:
# Example of streaming output
for output in llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "What are the planets of the solar system?",
        },
    ],
    temperature=0.7,
    stream=True,
):
    print(output["choices"][0]["delta"].get("content", ""), end="")

There are 8 planets in our solar system. Here's a list of them in order from the Sun:

1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune

Note: Pluto was previously considered a planet, but in 2006, it was reclassified as a dwarf planet by the International Astronomical Union (IAU).

Would you like to know more about a specific planet or the solar system in general?
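
When streaming, it is often useful to keep the full text as well as print it. The sketch below assumes the chunk shape seen in the loop above (`chunk["choices"][0]["delta"]`, where the first delta carries only the role and the last one is empty); `collect_stream` and `fake_chunks` are hypothetical names, and the chunks are hand-written so the example runs without a model.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> str:
    """Print each piece as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["delta"].get("content")
        if piece:  # role-only and final chunks carry no content
            parts.append(piece)
            print(piece, end="", flush=True)
    print()
    return "".join(parts)


fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "There are "}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "8 planets."}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
text = collect_stream(fake_chunks)
```

The same function should accept the iterator returned by `llm.create_chat_completion(..., stream=True)` directly.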

In [None]:
# Set the temperature to 0 for deterministic output
for output in llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "What are the planets of the solar system?",
        },
    ],
    temperature=0,
    stream=True,  # yield the response in chunks as it is generated
    # With stream=False (the default), the call returns only after the entire response has been generated.
):
    print(output["choices"][0]["delta"].get("content", ""), end="")

There are 8 planets in our solar system, which are:

1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune

Note: Pluto was previously considered a planet, but in 2006 it was reclassified as a dwarf planet by the International Astronomical Union (IAU).

Here's a fun way to remember the order of the planets:

"My Very Excellent Mother Just Served Us Nachos"

M - Mercury
V - Venus
E - Earth
M - Mars
J - Jupiter
S - Saturn
U - Uranus
N - Neptune

I hope that helps!

In [None]:
# Constrain the output to be valid JSON
for output in llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant answering in JSON format.",
        },
        {
            "role": "user",
            "content": "What are the planets of the solar system?",
        },
    ],
    temperature=0.7,
    stream=True,
    response_format={"type": "json_object"},  # specify the response format
):
    print(output["choices"][0]["delta"].get("content", ""), end="")

{
  "planets": [
    {
      "name": "Mercury",
      "order": 1
    },
    {
      "name": "Venus",
      "order": 2
    },
    {
      "name": "Earth",
      "order": 3
    },
    {
      "name": "Mars",
      "order": 4
    },
    {
      "name": "Jupiter",
      "order": 5
    },
    {
      "name": "Saturn",
      "order": 6
    },
    {
      "name": "Uranus",
      "order": 7
    },
    {
      "name": "Neptune",
      "order": 8
    }
  ]
}

In [None]:
# Constrain the output to follow a specific JSON schema
for output in llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant answering in JSON format.",
        },
        {
            "role": "user",
            "content": "What are the planets of the solar system?",
        },
    ],
    temperature=0.7,
    stream=True,
    response_format={  # specify the response schema
        "type": "json_object",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "planet": {"type": "string"},
                    "distance_from_sun": {"type": "number"},
                },
                "required": ["planet", "distance_from_sun"],
            },
        },
    },
):
    print(output["choices"][0]["delta"].get("content", ""), end="")

[{"planet": "Mercury", "distance_from_sun": 57.9}, {"planet": "Venus", "distance_from_sun": 108.2}, {"planet": "Earth", "distance_from_sun": 149.6}, {"planet": "Mars", "distance_from_sun": 227.9}, {"planet": "Jupiter", "distance_from_sun": 778.3}, {"planet": "Saturn", "distance_from_sun": 1426.7}, {"planet": "Uranus", "distance_from_sun": 2870.9}, {"planet": "Neptune", "distance_from_sun": 4497.0}]
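
Since the response arrives as a string, it still has to be parsed before use. A minimal sketch using only the standard library; `parse_planets` is a hypothetical helper, and `raw` holds the first two entries of the output above rather than a live model response:

```python
import json

raw = (
    '[{"planet": "Mercury", "distance_from_sun": 57.9}, '
    '{"planet": "Venus", "distance_from_sun": 108.2}]'
)


def parse_planets(text: str) -> list:
    """Parse the model output and check it against the requested schema."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("expected each item to be an object")
        if not isinstance(item.get("planet"), str):
            raise ValueError("missing or invalid 'planet'")
        distance = item.get("distance_from_sun")
        if isinstance(distance, bool) or not isinstance(distance, (int, float)):
            raise ValueError("missing or invalid 'distance_from_sun'")
    return data


for entry in parse_planets(raw):
    print(entry["planet"], entry["distance_from_sun"])
```

Re-checking the parsed data is a cheap safeguard, for instance when generation stops early and the JSON is incomplete.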