### The llama-server Server
`llama-server` is a program I am quite fond of: it turns a model into a service that users can access through an HTTP API.

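#### 1. The parameters are largely the same as for llama-cli, for example:
* The main differences are the `-cb` and `-np` parameters:
    * `-cb` stands for Continuous Batching: incoming requests are added to the running batch as they arrive, instead of waiting for the whole batch to finish before the next input is processed.
    * `-np` sets how many inputs can be processed at the same time; `-np 4` here means the server handles at most four inputs concurrently. The context set with `-c` is shared among these slots, which is why the log below reports `n_ctx_per_seq = 2048` for `-c 8192 -np 4`.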

In [None]:
%%bash

cd llama.cpp

./build/bin/llama-server \
    -m llama-3-8b-inst.gguf \
    -ngl 99 -c 8192 -fa -cb -np 4 \
    --host 0.0.0.0 --port 8888

build: 1 (a3c3084) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 32



system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8888, http threads: 31
main: loading model
srv    load_model: loading model 'llama-3-8b-inst.gguf'
llama_model_loader: loaded meta data with 27 key-value pairs and 291 tensors from llama-3-8b-inst.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = 5f0b02c75b57c5855da9ae460ce51323ea669d8a
llama_model_loader: - kv   3:      


load_tensors:   CPU_Mapped model buffer size = 15317.02 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.96 MiB
llama_kv_cache_unified:        CPU KV buffer size =  1024.00 MiB
llama_kv_cache_unified: size = 1024.00 MiB (  8192 cells,  32 layers,  4 seqs), K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_context:        CPU compute buffer size =   266.50 MiB
llama_context: graph nodes  = 1031
llama_context: graph splits = 1
comm
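
The `n_ctx_per_seq = 2048` line in the log follows from dividing `-c 8192` across the `-np 4` slots. The sketch below reproduces that arithmetic and shows one way to send several prompts at once so that the slots are actually used. It assumes the server from the command above is reachable at `127.0.0.1:8888`; the helper names (`per_slot_context`, `complete`, `complete_many`) are made up for illustration, and only the standard library is used.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://127.0.0.1:8888"  # assumption: host/port from the command above


def per_slot_context(n_ctx, n_parallel):
    """Context tokens available to each slot when -c is split across -np slots."""
    return n_ctx // n_parallel


def complete(prompt, n_predict=16):
    """Send one blocking /completion request and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        SERVER + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]


def complete_many(prompts, workers=4):
    """Issue the prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))


print(per_slot_context(8192, 4))  # 2048, matching n_ctx_per_seq in the log
```

With the server running, `complete_many(["Hello!", "Hi!", "Hey!", "Yo!"])` would submit four requests together. Setting `workers` above the `-np` value should not help, since the extra requests have to wait for a free slot.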

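#### 2. Once the server is running, you can open the web interface at `http://127.0.0.1:8888/` to interact with the model. This interface is only meant for quick testing; development work usually goes through API calls: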

In [2]:
%%bash

curl -X POST http://localhost:8888/completion \
    -d '{"prompt": "你好!", "n_predict": 16}'
# The response is a JSON object; the generated text is in its "content" field.



{"index":0,"content":" (nǐ hǎo) - Hello!\n\nWelcome to my GitHub page!","tokens":[],"id_slot":0,"stop":true,"model":"gpt-3.5-turbo","tokens_predicted":16,"tokens_evaluated":4,"generation_settings":{"n_predict":16,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":8192,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":16,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format
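
Most of the response is generation settings, so in practice only a few fields are usually read. As a rough sketch, the function below pulls out the fields visible in the output above (`content`, `stop`, `tokens_predicted`, `tokens_evaluated`). The function name and the `sample` string are illustrative; `sample` is a heavily trimmed stand-in for a real response, not actual server output.

```python
import json


def summarize_completion(raw):
    """Extract the commonly used fields from a /completion JSON response."""
    reply = json.loads(raw)
    return {
        "text": reply["content"],                     # generated text
        "prompt_tokens": reply["tokens_evaluated"],   # tokens read from the prompt
        "generated_tokens": reply["tokens_predicted"],
        "finished": reply["stop"],
    }


# Trimmed, hand-written sample using the field names from the response above.
sample = '{"index":0,"content":" Hello!","stop":true,"tokens_predicted":16,"tokens_evaluated":4}'
print(summarize_completion(sample))
```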

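#### 3. You can write a Python program that receives the model output as a stream:
* The `stop` parameter lists the strings at which generation should stop; with the setting below, output ends at the first newline.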

In [3]:
import json
import requests

url = "http://127.0.0.1:8888/completion" 
prompt = "[INST] 什麼是語言模型? [/INST]"

params = {
    "prompt": prompt,
    "stream": True,
    "stop": ["\n", "\n\n"],
}

resp = requests.post(url, json=params, stream=True) 
for chunk in resp.iter_lines():
    if not chunk:
        continue
    # Each streamed line starts with a fixed "data:" prefix; skip those 5 characters
    content = json.loads(chunk[5:])["content"]
    print(end=content, flush=True)
print()

  Language models are artificial intelligence (AI) systems designed to process, generate, and manipulate human language. They are trained on large amounts of text data, such as books, articles, and social media posts, to learn patterns, relationships, and structures of language. Language models can perform various tasks, including:
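
Slicing off five characters works as long as every non-empty line is a `data:` event. A slightly more defensive variant is sketched below: it checks the prefix explicitly and stops once a chunk reports `"stop": true`. The helper names and the `sample` lines are illustrative, and the assumption that each streamed chunk carries a `stop` field should be checked against the server version in use.

```python
import json


def parse_stream_line(line):
    """Decode one streamed line; return None for anything that is not a data event."""
    if not line.startswith(b"data:"):
        return None
    return json.loads(line[len(b"data:"):])


def collect_text(lines):
    """Concatenate the "content" of each event until one reports stop=true."""
    parts = []
    for line in lines:
        event = parse_stream_line(line)
        if event is None:
            continue  # blank separator lines and other non-data lines
        parts.append(event.get("content", ""))
        if event.get("stop"):
            break
    return "".join(parts)


# Hand-written lines in the shape yielded by resp.iter_lines().
sample = [
    b'data: {"content": "Language", "stop": false}',
    b"",
    b'data: {"content": " models", "stop": false}',
    b"",
    b'data: {"content": "", "stop": true}',
]
print(collect_text(sample))  # Language models
```

With a live connection, `collect_text(resp.iter_lines())` would return the full reply as a single string instead of printing it piece by piece.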


* To keep user prompts from getting too long, you can truncate them with the `/tokenize` and `/detokenize` APIs, for example:

In [4]:
url = "http://127.0.0.1:8888/tokenize"
params = {"content": "hello, llama.cpp!"}
resp = requests.post(url, json=params)
tokens = json.loads(resp.text)["tokens"]
print(tokens)  # e.g. [15339, 11, 94776, 7356, 0]

[15339, 11, 94776, 7356, 0]


* Suppose we only need the last three tokens:

In [None]:
url = "http://127.0.0.1:8888/detokenize"

# Detokenize each token on its own to see how the text was split
for i, token in enumerate(tokens):
    params = {"tokens": [token]}
    resp = requests.post(url, json=params)
    content = json.loads(resp.text)["content"]
    print(i, ":", content)

print("------------------------------")

# Keep only the last three tokens and turn them back into text
params = {"tokens": tokens[-3:]}
resp = requests.post(url, json=params)
content = json.loads(resp.text)["content"]
print(content) #  llama.cpp!

0 : hello
1 : ,
2 :  llama
3 : .cpp
4 : !
------------------------------
 llama.cpp!
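
The tokenize, slice, detokenize steps above can be wrapped into one helper. This is a minimal sketch rather than a definitive implementation: `truncate_prompt` and `_post` are hypothetical names, the server address is assumed to be the one used throughout this section, and the tokenizer functions are passed in as parameters so the truncation logic can be tried without a running server.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8888"  # assumption: the server started earlier


def _post(path, payload):
    """POST a JSON payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        SERVER + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokenize(text):
    """Text -> list of token ids via /tokenize."""
    return _post("/tokenize", {"content": text})["tokens"]


def detokenize(tokens):
    """List of token ids -> text via /detokenize."""
    return _post("/detokenize", {"tokens": tokens})["content"]


def truncate_prompt(prompt, max_tokens, tokenize=tokenize, detokenize=detokenize):
    """Keep at most the last `max_tokens` tokens of `prompt`."""
    if max_tokens <= 0:
        return ""
    tokens = tokenize(prompt)
    if len(tokens) <= max_tokens:
        return prompt  # already short enough; skip the detokenize round trip
    return detokenize(tokens[-max_tokens:])
```

With the server running, `truncate_prompt("hello, llama.cpp!", 3)` should reproduce the ` llama.cpp!` result shown above. Keeping the tail rather than the head is a design choice that suits chat-style prompts, where the most recent text matters most.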


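* That completes the combined operation of truncating a prompt to a given length.
* Beyond these APIs, `/embedding` returns sentence vectors for LLM-based retrieval, `/infill` provides the code infilling commonly used with code LLMs, and `/v1/chat/completions` is compatible with the OpenAI API.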
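
As an illustration of the OpenAI-compatible endpoint, the sketch below builds a `/v1/chat/completions` request and reads the reply from `choices[0].message.content`, the standard chat-completions response shape. The function names, the server address, and the `max_tokens` default are assumptions made for the example. Because this endpoint takes a list of messages, the prompt is written without manual markers such as `[INST]`.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8888"  # assumption: the server started earlier


def build_chat_request(user_message, max_tokens=64):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(response):
    """Pull the assistant text out of a chat-completions response dict."""
    return response["choices"][0]["message"]["content"]


def chat(user_message, max_tokens=64):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message, max_tokens)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `chat("What is a language model?")` would return the reply as a string. Since the request and response follow the OpenAI format, an existing OpenAI client can usually be pointed at the server's base URL instead.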