# Using Llama.cpp for Chat Completion


## Installation & Setup

First, install the `llama-cpp-python` package if you haven't already:

```bash
pip install llama-cpp-python
```

Ensure you have downloaded a compatible GGUF model, for example a quantization from the [TheBloke/Mistral-7B-Instruct-v0.2-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF) repository on Hugging Face.


In [1]:
import os
from llama_cpp import Llama

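# Set the cache directory; HUGGINGFACE_CACHE_DIR must be defined in the environment
# (a KeyError here is clearer than silently building the path "None/gguf")
CACHE_DIR = f"{os.environ['HUGGINGFACE_CACHE_DIR']}/gguf"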

## Running Inference in Python

In [2]:

# Load the Llama model from a local GGUF file
model_path = f"{CACHE_DIR}/mistral-7b-instruct-v0.2.Q2_K.gguf"
llm = Llama(model_path=model_path)


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from \\MYCLOUDEX2ULTRA\research\llm\models/gguf/mistral-7b-instruct-v0.2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
[output truncated]
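
The loader log above appears because `Llama` prints diagnostics while loading. The sketch below shows one way to pass a few constructor options. `verbose`, `n_ctx`, and `n_gpu_layers` are existing `Llama` parameters; the helper names `llama_kwargs` and `load_quiet` and the chosen values are illustrative assumptions, not part of the library.

```python
import os


def llama_kwargs(model_path, n_ctx=4096, n_gpu_layers=0, verbose=False):
    """Collect constructor arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,                # context window to allocate, in tokens
        "n_gpu_layers": n_gpu_layers,  # 0 keeps every layer on the CPU
        "verbose": verbose,            # False silences the llama_model_loader log
    }


def load_quiet(model_path):
    """Load the model if the file exists; return None otherwise."""
    if not os.path.isfile(model_path):
        return None
    # Imported lazily so llama_kwargs can be used without llama-cpp-python installed
    from llama_cpp import Llama
    return Llama(**llama_kwargs(model_path))
```

Calling `load_quiet(model_path)` with the path from the cell above would return a `Llama` instance without the metadata dump. The model metadata reports a 32768-token context length, so `n_ctx` can be raised if memory allows.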

In [3]:

# Example query
response = llm.create_chat_completion(messages=[
    {"role": "user", "content": "How big is the sky?"}
])

# Print the response
print(response)


llama_perf_context_print:        load time =    7371.00 ms
llama_perf_context_print: prompt eval time =    7370.60 ms /    14 tokens (  526.47 ms per token,     1.90 tokens per second)
llama_perf_context_print:        eval time =   13733.07 ms /    95 runs   (  144.56 ms per token,     6.92 tokens per second)
llama_perf_context_print:       total time =   21145.50 ms /   109 tokens


{'id': 'chatcmpl-e5ccbbc1-1bc3-42ea-9a56-6d476fc149d7', 'object': 'chat.completion', 'created': 1739317290, 'model': '\\\\MYCLOUDEX2ULTRA\\research\\llm\\models/gguf/mistral-7b-instruct-v0.2.Q2_K.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': " The size of the sky is not something that can be measured or quantified in the same way that we can measure and describe the size of physical objects. The sky is not a physical object with defined boundaries. It's the expanse above the earth, and it includes the atmosphere, the weather phenomena, and the stars, moon, and sun. It's essentially infinite in size, as it extends beyond our solar system and into the vastness of space."}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 14, 'completion_tokens': 95, 'total_tokens': 109}}
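
Printing the whole dictionary is useful for inspection, but usually only the assistant text and the token counts are needed. A minimal sketch of pulling those fields out, assuming the response has the layout shown in the output above; `extract_reply` is a hypothetical helper name and `sample` is a shortened copy of that output.

```python
def extract_reply(response):
    """Return (assistant_text, total_tokens) from a chat-completion dictionary."""
    choices = response.get("choices", [])
    if not choices:
        return "", 0
    # The model output starts with a space, so strip surrounding whitespace
    text = choices[0]["message"]["content"].strip()
    total = response.get("usage", {}).get("total_tokens", 0)
    return text, total


# Shortened copy of the dictionary printed above (content abbreviated)
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " The size of the sky is not something that can be measured"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 95, "total_tokens": 109},
}

text, total = extract_reply(sample)
print(text)
print("total tokens:", total)
```

With the live object, `extract_reply(response)` returns the full answer instead of the abbreviated sample.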


### Downloading and Using GGUF Models with Llama.from_pretrained

In [5]:
# Alternatively, load the model directly from Hugging Face
llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q2_K.gguf",
    cache_dir=CACHE_DIR
)


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from \\MYCLOUDEX2ULTRA\research\llm\models/gguf\models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF\snapshots\3a6fbf4a41a1d52e415a4958cde6856d34b2db93\.\mistral-7b-instruct-v0.2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
[output truncated]
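
`Llama.from_pretrained` relies on the `huggingface-hub` package to fetch the file, so that package needs to be installed as well. The llama-cpp-python README also shows `filename` given as a glob pattern such as `"*q8_0.gguf"`. The sketch below only illustrates how such a pattern selects one quantization; it uses the standard-library `fnmatch` as an assumption about the matching style, and the file list contains just the two names that appear in this document.

```python
from fnmatch import fnmatch

# File names taken from this document; the real repository lists more quantizations
repo_files = [
    "mistral-7b-instruct-v0.2.Q2_K.gguf",
    "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
]


def match_files(pattern, files):
    """Return the file names that match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


print(match_files("*Q2_K.gguf", repo_files))
print(match_files("*Q4_K_M.gguf", repo_files))
```

A pattern that matches exactly one file is the safest choice, since a pattern like `"*.gguf"` is ambiguous when a repository holds several quantizations.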

In [6]:
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "How does a black hole work?"}
    ]
)
print(response)

llama_perf_context_print:        load time =   15363.27 ms
llama_perf_context_print: prompt eval time =   15362.99 ms /    15 tokens ( 1024.20 ms per token,     0.98 tokens per second)
llama_perf_context_print:        eval time =   54128.24 ms /   388 runs   (  139.51 ms per token,     7.17 tokens per second)
llama_perf_context_print:       total time =   69801.98 ms /   403 tokens


{'id': 'chatcmpl-120bd07c-4187-4543-bb7f-83fc8e9b261e', 'object': 'chat.completion', 'created': 1739317680, 'model': '\\\\MYCLOUDEX2ULTRA\\research\\llm\\models/gguf\\models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF\\snapshots\\3a6fbf4a41a1d52e415a4958cde6856d34b2db93\\.\\mistral-7b-instruct-v0.2.Q2_K.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' A black hole is a region in space where the gravitational pull is so strong that nothing, not even light, can escape. The gravity of a black hole is so intense because matter is squeezed into a very small space.\n\nBlack holes are formed when a massive star collapses under its own gravity at the end of its life. The core collapses in on itself, forming a singularity, which is a point of infinite density and zero volume. The singularity is surrounded by an event horizon, which is the boundary of the black hole from which no escape is possible.\n\nThe intense gravity of a black hole warps the fabric of spacetime ar
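
The `llama_perf_context_print` lines in both runs report throughput as tokens divided by elapsed time. As a quick check of that arithmetic, the sketch below recomputes the rates from the millisecond and token figures copied from the logs above; `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(elapsed_ms, n_tokens):
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# (elapsed ms, tokens) pairs copied from the two perf logs in this document
runs = {
    "sky, prompt eval": (7370.60, 14),
    "sky, eval": (13733.07, 95),
    "black hole, prompt eval": (15362.99, 15),
    "black hole, eval": (54128.24, 388),
}

for label, (ms, n) in runs.items():
    print(f"{label}: {tokens_per_second(ms, n):.2f} tokens/s")
```

The results agree with the logged 1.90, 6.92, 0.98, and 7.17 tokens per second. In both runs the prompt-evaluation time is nearly identical to the reported load time, while generation itself stays at roughly 7 tokens per second.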


## Serving the LLM as a REST API

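To serve the model over HTTP, use `llama-server`, the server binary that ships with llama.cpp itself and is installed separately from the `llama-cpp-python` package. Point it at a local GGUF file (a Q4_K_M quantization in this example) and run the following command: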

```bash
llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf --host 0.0.0.0 --port 8080
```

Then, use the following Python script to send requests to the API:


In [8]:

import requests

# OpenAI-compatible text completion endpoint exposed by llama-server
url = "http://localhost:8080/v1/completions"

# Define the payload: a raw prompt plus sampling settings
payload = {
    "model": "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "prompt": "How big is the sky?",
    "temperature": 0.7,
    "max_tokens": 50
}

# Send a POST request; json= serializes the payload and sets the Content-Type header
try:
    response = requests.post(url, json=payload, timeout=120)

    if response.status_code == 200:
        response_data = response.json()
        choices = response_data.get("choices", [])
        if choices:
            print("Response:", choices[0].get("text", ""))
        else:
            print("No choices found in the response.")
    else:
        print(f"Request failed with status code {response.status_code}: {response.text}")
except requests.RequestException as e:
    # Covers connection errors, timeouts, and malformed responses
    print(f"Error occurred: {e}")


Response: 

In a philosophical or poetic sense, the sky is often considered infinite, as it's the vast expanse above us where stars, planets, and galaxies exist. In reality, however, the size of the observable sky is
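
The answer above stops mid-sentence because `max_tokens` is set to 50. The request also sends a raw prompt to `/v1/completions`. For an instruct model, the chat endpoint `/v1/chat/completions`, which `llama-server` also exposes, takes the same `messages` list used with `create_chat_completion` earlier. A minimal sketch under those assumptions follows; `build_chat_payload` and `ask` are hypothetical helper names, and the host and port are the ones from the command above.

```python
def build_chat_payload(question, temperature=0.7, max_tokens=50):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def ask(question, base_url="http://localhost:8080", timeout=120):
    """Send one question to the chat endpoint and return the assistant text."""
    # Imported here so the payload builder has no third-party dependency
    import requests

    response = requests.post(
        f"{base_url}/v1/chat/completions",
        json=build_chat_payload(question),
        timeout=timeout,
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()["choices"][0]["message"]["content"]
```

With the server running, `print(ask("How big is the sky?"))` would print the reply. Raising `max_tokens` in `build_chat_payload` allows longer answers.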
