# Test llama-cpp-python Installation

***Important:***
This test requires an installed `llama-cpp-python` package and a local model file.

To build with Apple Silicon GPU support (Metal) on macOS:
```
pip uninstall llama-cpp-python # if already installed without Metal support
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
```

For use inside an Apple Silicon Linux VM:
```
CMAKE_ARGS="-DUNAME_M=arm64 -DUNAME_p=arm -DLLAMA_NO_METAL=1" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
```
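Both prerequisites can be checked before loading anything. The following is a minimal sketch, not part of the original notebook: the helper name `check_prerequisites` is hypothetical, and it only verifies that the package can be found and that the model path points at a file. It does not validate the GGUF contents.

```python
import importlib.util
from pathlib import Path
from typing import List


def check_prerequisites(model_path: str, package: str = "llama_cpp") -> List[str]:
    """Return a list of problems; an empty list means both prerequisites are met."""
    problems = []
    # find_spec returns None when a top-level package is not installed
    if importlib.util.find_spec(package) is None:
        problems.append(f"Python package '{package}' is not installed")
    # the model must exist as a regular file (not a directory)
    if not Path(model_path).is_file():
        problems.append(f"model file not found: {model_path}")
    return problems


# Model path taken from the notebook cell below
issues = check_prerequisites("./models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf")
for issue in issues:
    print("MISSING:", issue)
if not issues:
    print("Ready to run the test cells.")
```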

In [1]:
# Llama class initialization

from llama_cpp import Llama

model_path = "./models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf"
n_ctx = 4096        # context size in tokens
logits_all = True   # keep logits for all tokens; used by the logprobs output below

llm = Llama(
    model_path=model_path,
    n_ctx=n_ctx,
    logits_all=logits_all,
    embedding=False,
    n_threads=8,
    n_gpu_layers=0,   # no layers offloaded to the GPU
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  5120, 37632,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 13824,  

## Select Prompts

In [2]:
# GERMAN Language
user_prompt = "Wasser aus einem Krug füllt einen Becher, bis er leer ist. Was ist jetzt leer?"
sys_prompt = "Es folgt eine Aufgabe. Die Aufgabe erklärt auch den Kontext. Ergänze die Antwort, welche die Aufgabe gut erfüllt."
sys_prompt += " ### Aufgabe: {0} ### Antwort:"

### or

In [None]:
# ENGLISH Language
user_prompt = "Water pours from a flask into a cup until it is empty. What became empty?"
sys_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
sys_prompt += " ### Instruction: {0} ### Response:"
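
Instead of running one of the two cells, the templates can be kept side by side and selected by language. This is a small sketch under that assumption; the names `TEMPLATES` and `build_prompt` are hypothetical, and the template strings are copied from the cells above.

```python
# Prompt templates from the two cells above; {0} marks where the user prompt goes.
TEMPLATES = {
    "de": (
        "Es folgt eine Aufgabe. Die Aufgabe erklärt auch den Kontext. "
        "Ergänze die Antwort, welche die Aufgabe gut erfüllt."
        " ### Aufgabe: {0} ### Antwort:"
    ),
    "en": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        " ### Instruction: {0} ### Response:"
    ),
}


def build_prompt(user_prompt: str, lang: str = "en") -> str:
    """Insert the user prompt into the template for the given language."""
    return TEMPLATES[lang].format(user_prompt)


prompt = build_prompt(
    "Water pours from a flask into a cup until it is empty. What became empty?",
    lang="en",
)
print(prompt)
```

Because the templates end with `###`-delimited markers, the generation cell can use `"###"` as a stop string to end the response before the model starts a new section.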

## Test Token-by-Token generation

In [3]:
# Generation Parameters
print_prompt_tokens = False     # Print each prompt token with its ID
logprobs = 0                    # Number of top candidate tokens to report per step (0 = none)
max_tokens = 500
temperature = 0

# --- code ---

# insert prompt into system prompt
prompt = sys_prompt.format(user_prompt)

# tokenize & print the prompt
# TODO: decide whether to show only the user prompt or the full prompt
tokens = llm.tokenize(user_prompt.encode('utf-8'), add_bos=False)

if print_prompt_tokens:
    print("User-Prompt:")
    print(f"  Number of tokens: {len(tokens)}")
    for i, token in enumerate(tokens):
        # a single token may hold only part of a multi-byte character
        s = llm.detokenize([token]).decode('utf-8', errors='replace')
        print(f"  Token[{i}]: '{s}' = {token:4d}")
else:
    print(f"User-Prompt:\n '{llm.detokenize(tokens).decode('utf-8')}'")

# Generate Result
all_lps=[]
print("Response:\n  ", end="'")
for chunk in llm(
    prompt = prompt,
    max_tokens = max_tokens,
    temperature=temperature,
    stop = ["###"],
    logprobs=logprobs,
    stream = True,
    ):
    for choice in chunk['choices']:# type: ignore
        print(choice['text'], end='', flush=True) # type: ignore
        if choice['logprobs']: # type: ignore
            all_lps.append(choice['logprobs']['top_logprobs']) # type: ignore
print("'",flush=True)

# Print logprobs as probabilities (logprobs are natural logarithms)
import math

if logprobs > 0 and all_lps:
    print("Top %d Possible Result Tokens with Probabilities:" % logprobs)
    for lp in all_lps:
        print("  ", {k: round(math.exp(v), 3) for k, v in lp[0].items()})


User-Prompt:
 ' Wasser aus einem Krug füllt einen Becher, bis er leer ist. Was ist jetzt leer?'
Response:
  ' Der Krug ist leer.
'



llama_print_timings:        load time =  4516.24 ms
llama_print_timings:      sample time =     4.08 ms /     9 runs   (    0.45 ms per token,  2204.80 tokens per second)
llama_print_timings: prompt eval time =  4516.17 ms /    69 tokens (   65.45 ms per token,    15.28 tokens per second)
llama_print_timings:        eval time =   615.84 ms /     8 runs   (   76.98 ms per token,    12.99 tokens per second)
llama_print_timings:       total time =  5409.10 ms
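
The per-token and per-second figures in the timing lines follow directly from the elapsed time and the token count. As a quick check of that arithmetic (the helper names are made up for illustration):

```python
def ms_per_token(elapsed_ms: float, n_tokens: int) -> float:
    """Average time spent on one token, in milliseconds."""
    return elapsed_ms / n_tokens


def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput: tokens divided by elapsed time in seconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# prompt eval: 4516.17 ms for 69 tokens
print(round(ms_per_token(4516.17, 69), 2), round(tokens_per_second(4516.17, 69), 2))

# eval (generation): 615.84 ms for 8 runs
print(round(ms_per_token(615.84, 8), 2), round(tokens_per_second(615.84, 8), 2))
```

In this run the prompt evaluation accounts for most of the total time, because the 69 prompt tokens have to be processed before the first of the 8 generated tokens appears.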


## Test Full Inference Run

In [None]:
output = llm(
    prompt = "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens = 48,
    temperature=0.3,
    echo = False,
    stop = ["Q:", "\n"],
    stream = False,
    )

#import json
#print(json.dumps(output, indent=2))

print(output['choices'][0]['text']) # type: ignore
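
With `stream = False` the call returns a single dictionary. The sketch below shows the layout this notebook relies on (`output['choices'][0]['text']`) using a hand-written stand-in. The surrounding fields are an assumption based on the OpenAI-style completion format, all values are placeholders rather than real model output, and `first_choice_text` is a hypothetical helper.

```python
# Stand-in for a non-streaming completion result; every value is a placeholder.
example_output = {
    "id": "cmpl-xxxxxxxx",
    "object": "text_completion",
    "created": 0,
    "model": "./models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf",
    "choices": [
        {
            "text": " <generated text> ",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
}


def first_choice_text(output: dict) -> str:
    """Return the text of the first choice, stripped of surrounding whitespace."""
    choices = output.get("choices", [])
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"].strip()


print(first_choice_text(example_output))
```

Uncommenting the `json.dumps` lines in the cell above prints the full dictionary for a real run, which is the easiest way to confirm the actual field names in the installed version.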

# Documentation

## High-Level Llama Class Initializer
```
__init__(model_path, n_ctx=512, n_parts=-1,
        n_gpu_layers=0, seed=1337, f16_kv=True, logits_all=False,
        vocab_only=False, use_mmap=True, use_mlock=False,
        embedding=False, n_threads=None, n_batch=512,
        last_n_tokens_size=64, lora_base=None, lora_path=None,
        low_vram=False, verbose=True)
```
Load a llama.cpp model from model_path.
### Parameters:
| Name | Type  | Description | Default |
|------|-------|-------------|---------|
|model_path  | str  | Path to the model.  | required|
|n_ctx |  int |  Maximum context size.  | 512 |
|n_parts |  int | Number of parts to split the model into. If -1, the number of parts is automatically determined. |  -1 |
|n_gpu_layers | int | Number of layers to offload to the GPU. | 0 |
|seed |  int  | Random seed. -1 for random.  | 1337 |
|f16_kv |  bool  | Use half-precision for key/value cache.  | True |
|logits_all | bool | Return logits for all tokens, not just the last token. | False|
|vocab_only | bool | Only load the vocabulary no weights.  |False|
|use_mmap | bool | Use mmap if possible. | True|
|use_mlock | bool | Force the system to keep the model in RAM. | False|
|embedding | bool | Embedding mode only. | False|
|n_threads | Optional[int] | Number of threads to use. If None, the number of threads is automatically determined. | None|
|n_batch | int | Maximum number of prompt tokens to batch together when calling llama_eval. | 512|
|last_n_tokens_size | int | Maximum number of tokens to keep in the last_n_tokens deque. | 64|
|lora_base | Optional[str] | Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model. | None|
|lora_path | Optional[str] | Path to a LoRA file to apply to the model. | None|
|verbose | bool | Print verbose output to stderr. | True|
### Returns:
|Type  |Description|
|-------|-----------|
|Llama |A Llama instance.|
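
Since loading a 13B model takes several seconds, it can help to catch misspelled keyword arguments before calling the initializer. The following sketch checks a settings dictionary against the parameter names listed in the signature above. `DOCUMENTED_PARAMS` and `check_init_kwargs` are hypothetical names, and the installed version of the library may accept a different set of parameters.

```python
# Parameter names copied from the __init__ signature documented above.
DOCUMENTED_PARAMS = {
    "model_path", "n_ctx", "n_parts", "n_gpu_layers", "seed", "f16_kv",
    "logits_all", "vocab_only", "use_mmap", "use_mlock", "embedding",
    "n_threads", "n_batch", "last_n_tokens_size", "lora_base", "lora_path",
    "low_vram", "verbose",
}


def check_init_kwargs(kwargs: dict) -> dict:
    """Raise TypeError on unknown or missing names; return kwargs unchanged otherwise."""
    unknown = sorted(set(kwargs) - DOCUMENTED_PARAMS)
    if unknown:
        raise TypeError(f"unknown parameter(s): {', '.join(unknown)}")
    if "model_path" not in kwargs:
        raise TypeError("model_path is required")
    return kwargs


settings = check_init_kwargs({
    "model_path": "./models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf",
    "n_ctx": 4096,
    "logits_all": True,
    "n_threads": 8,
    "n_gpu_layers": 0,
})
print(sorted(settings))
# With llama-cpp-python installed: llm = Llama(**settings)
```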

## High-Level API Inference
```
__call__ / create_completion
       (
        prompt, suffix=None, max_tokens=128, 
        temperature=0.8, top_p=0.95, logprobs=None, echo=False, 
        stop=[], frequency_penalty=0.0, presence_penalty=0.0, 
        repeat_penalty=1.1, top_k=40, stream=False, tfs_z=1.0, 
        mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1, 
        model=None, stopping_criteria=None, 
        logits_processor=None
       )
```
Generate text from a prompt.
### Parameters:
| Name | Type  | Description | Default |
|------|-------|-------------|---------|
|prompt | str | The prompt to generate text from. | required |
|suffix | Optional[str] | A suffix to append to the generated text. If None, no suffix is appended. | None |
|max_tokens | int | The maximum number of tokens to generate. If max_tokens <= 0, the maximum number of tokens to generate is unlimited and depends on n_ctx. | 128 |
|temperature | float | The temperature to use for sampling. | 0.8 |
|top_p | float | The top-p value to use for sampling. | 0.95 |
|logprobs | Optional[int] | The number of logprobs to return. If None, no logprobs are returned. | None |
|echo | bool | Whether to echo the prompt. | False |
|stop | Optional[Union[str, List[str]]] | A list of strings to stop generation when encountered. | [] |
|repeat_penalty | float | The penalty to apply to repeated tokens. | 1.1 |
|top_k | int | The top-k value to use for sampling. | 40 |
|stream | bool | Whether to stream the results. | False |
### Returns:
|Type   |Description|
|-------|-----------|
|Union[Completion, Iterator[CompletionChunk]] | Response object containing the generated text.|
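
With `stream=True` the call yields chunks instead of one result, and each chunk carries a piece of text under `chunk['choices'][i]['text']`, as used in the token-by-token test above. The sketch below collects such chunks into one string. It uses a hand-written generator (`fake_stream`, hypothetical) in place of a real model call, and the way the response is split into pieces is made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def fake_stream() -> Iterator[Dict]:
    """Stand-in for llm(..., stream=True); the piece boundaries are illustrative."""
    for piece in [" Der", " Krug", " ist", " leer", ".", "\n"]:
        yield {"choices": [{"text": piece, "index": 0,
                            "logprobs": None, "finish_reason": None}]}


def collect_stream(chunks: Iterable[Dict]) -> str:
    """Concatenate the text of every choice in every chunk, in arrival order."""
    parts = []
    for chunk in chunks:
        for choice in chunk["choices"]:
            parts.append(choice["text"])
    return "".join(parts)


text = collect_stream(fake_stream())
print(repr(text))
```

Collecting the pieces this way gives the same string a non-streaming call would place in `output['choices'][0]['text']`, while still allowing each piece to be printed as it arrives.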

## Tokenizer
```
tokenize(text, add_bos=True)
```
Tokenize a string.
### Parameters:
| Name | Type  | Description | Default |
|------|-------|-------------|---------|
|text | bytes | The utf-8 encoded string to tokenize. | required |
|add_bos | bool | Whether to prepend the beginning-of-sequence token. | True |
### Returns:
| Type | Description |
|------|-------------|
| List[int] | A list of tokens. |

```
detokenize(tokens)
```
Detokenize a list of tokens.
### Parameters:
| Name | Type  | Description | Default |
|------|-------|-------------|---------|
|tokens | List[int] | The list of tokens to detokenize. | required
### Returns:
| Type | Description |
|------|-------------|
| bytes | The detokenized string. |
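
Both functions work on bytes, not `str`, so text has to be encoded before tokenizing and decoded after detokenizing. The sketch below illustrates that round trip with a deliberately simple stand-in that uses one token per UTF-8 byte. `ByteTokenizer` is hypothetical and is not the llama.cpp tokenizer; it only mirrors the `tokenize` / `detokenize` interface described above.

```python
from typing import List


class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte (not the real vocabulary)."""

    def tokenize(self, text: bytes) -> List[int]:
        return list(text)

    def detokenize(self, tokens: List[int]) -> bytes:
        return bytes(tokens)


tok = ByteTokenizer()

# str -> bytes -> tokens -> bytes -> str
tokens = tok.tokenize("füllt".encode("utf-8"))
round_trip = tok.detokenize(tokens).decode("utf-8")
print(len(tokens), round_trip)

# Decoding token by token can split a multi-byte character such as "ü",
# so errors="replace" avoids a UnicodeDecodeError on the partial bytes.
pieces = [tok.detokenize([t]).decode("utf-8", errors="replace") for t in tokens]
print(pieces)
```

A real tokenizer usually produces far fewer tokens than bytes, but a single token can still end in the middle of a multi-byte character, which is why decoding the full token list at once is the safer default.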