# Local-Llama-Inference - Advanced API Endpoints

Comprehensive guide to all 30+ API endpoints available in local-llama-inference.

## API Categories
1. **Chat & Completions** - Text generation
2. **Embeddings** - Vector representations
3. **Tokenization** - Token operations
4. **Server Management** - Health & status
5. **Advanced** - Infill, reranking, LoRA
6. **Configuration** - Model settings

In [None]:
from local_llama_inference import LlamaServer, LlamaClient, detect_gpus
from pathlib import Path
from huggingface_hub import hf_hub_download
import json

print("✅ Package imported")

## Setup: Download Model and Start Server

In [None]:
# Download model
models_dir = Path.home() / "models"
models_dir.mkdir(exist_ok=True)

model_path = hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
    local_dir=str(models_dir),
)

# Start server
print("🚀 Starting server...")
server = LlamaServer(
    model_path=model_path,
    n_gpu_layers=33,
    n_threads=4,
)
server.start()
server.wait_ready(timeout=60)
print("✅ Server ready\n")

client = LlamaClient()

## 1. Chat & Completions API

In [None]:
print("=" * 60)
print("1. CHAT & COMPLETIONS API")
print("=" * 60)

# 1.1 Basic Chat Completion
print("\n1.1 Chat Completion (Non-streaming)")
response = client.chat(
    messages=[{"role": "user", "content": "Say hello"}],
    temperature=0.7,
    max_tokens=50,
)
print(f"Response: {response['choices'][0]['message']['content']}")

# 1.2 Streaming Chat
print("\n1.2 Chat Completion (Streaming)")
print("Response: ", end="", flush=True)
for chunk in client.stream_chat(
    messages=[{"role": "user", "content": "What is AI?"}],
    max_tokens=50,
):
    token = chunk
    if token:
        print(token, end="", flush=True)
print("\n")

# 1.3 Text Completion
print("1.3 Text Completion")
response = client.complete(
    prompt="The future of AI is",
    max_tokens=50,
)
print(f"Completion: {response['choices'][0].get('text', '')}")

# 1.4 Streaming Completion
print("\n1.4 Streaming Completion")
print("Response: ", end="", flush=True)
for chunk in client.stream_complete(
    prompt="Python is a programming language that",
    max_tokens=40,
):
    token = chunk['choices'][0].get('text', '')
    if token:
        print(token, end="", flush=True)
print("\n")

## 2. Sampling Parameters

In [None]:
print("=" * 60)
print("2. SAMPLING PARAMETERS")
print("=" * 60)

# 2.1 Temperature (creativity)
print("\n2.1 Temperature Control (creativity)")
for temp in [0.1, 0.5, 1.0]:
    response = client.chat(
        messages=[{"role": "user", "content": "Complete: The sky is"}],
        temperature=temp,
        max_tokens=20,
    )
    text = response['choices'][0]['message']['content'][:50]
    print(f"  Temp={temp}: {text}...")

# 2.2 Top-P (nucleus sampling)
print("\n2.2 Top-P Sampling")
response = client.chat(
    messages=[{"role": "user", "content": "Say something"}],
    top_p=0.9,
    max_tokens=30,
)
print(f"Response: {response['choices'][0]['message']['content'][:60]}...")

# 2.3 Top-K Sampling
print("\n2.3 Top-K Sampling")
response = client.chat(
    messages=[{"role": "user", "content": "Tell me a fact"}],
    top_k=40,
    max_tokens=30,
)
print(f"Response: {response['choices'][0]['message']['content'][:60]}...")

# 2.4 Repetition Penalty
print("\n2.4 Repetition Penalty")
response = client.chat(
    messages=[{"role": "user", "content": "Say hello"}],
    repeat_penalty=1.5,
    max_tokens=30,
)
print(f"Response: {response['choices'][0]['message']['content'][:60]}...")

## 3. Embeddings & Tokenization

In [None]:
print("=" * 60)
print("3. EMBEDDINGS & TOKENIZATION")
print("=" * 60)

# 3.1 Generate Embeddings
print("\n3.1 Generate Embeddings")
response = client.embed(input="Hello world")
embedding = response['data'][0]['embedding']
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

# 3.2 Multiple Embeddings
# Use dict-style access, consistent with 3.1 (the response is indexed like a dict there)
print("\n3.2 Multiple Embeddings")
response = client.embed(input=["AI is great", "Machine learning is fun"])
print(f"Generated {len(response['data'])} embeddings")
for i, item in enumerate(response['data']):
    print(f"  Embedding {i}: dimension {len(item['embedding'])}")

# 3.3 Tokenization
print("\n3.3 Tokenize Text")
response = client.tokenize(content="Hello world, how are you?")
print("Text: 'Hello world, how are you?'")
print(f"Tokens: {response.get('tokens', [])}")
print(f"Token count: {len(response.get('tokens', []))}")

# 3.4 Detokenization
print("\n3.4 Detokenize Tokens")
response = client.detokenize(tokens=[1, 2, 3, 4, 5])
print(f"Tokens [1,2,3,4,5] -> '{response.get('content', '')}'")
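
Embedding vectors such as those returned in 3.1 and 3.2 are usually compared with cosine similarity. The following is a minimal standard-library sketch; the three vectors are made-up placeholders standing in for `response['data'][i]['embedding']`, not real model output, and `cosine_similarity` is an illustrative helper rather than part of the package.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Placeholder vectors (assumption): substitute real embeddings from client.embed(...)
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

print("a vs b:", cosine_similarity(vec_a, vec_b))
print("a vs c:", cosine_similarity(vec_a, vec_c))
```

Values near 1 indicate vectors pointing in the same direction, and values near 0 indicate unrelated directions.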

## 4. Server Status & Health

In [None]:
print("=" * 60)
print("4. SERVER STATUS & HEALTH")
print("=" * 60)

# 4.1 Health Check
print("\n4.1 Server Health")
health = client.health()
print(f"Status: {health.get('status', 'unknown')}")

# 4.2 Server Properties
print("\n4.2 Server Properties")
props = client.get_props()
print(json.dumps(props, indent=2)[:500] + "...")

# 4.3 Server Metrics
print("\n4.3 Server Metrics")
try:
    metrics = client.get_metrics()
    print(f"Metrics: {metrics}")
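except Exception as e:
    # Catch Exception rather than using a bare except, so Ctrl+C still interrupts the cell
    print(f"Metrics not available on this server version: {e}")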

## 5. Advanced Features

In [None]:
print("=" * 60)
print("5. ADVANCED FEATURES")
print("=" * 60)

# Each call is wrapped in try/except because support depends on the model and server.
# Catching Exception (not a bare except) keeps Ctrl+C working and shows the reason.

# 5.1 Apply Chat Template
print("\n5.1 Apply Chat Template")
try:
    response = client.apply_template(
        messages=[
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there"},
        ]
    )
    # Dict-style access, consistent with the other responses in this notebook
    print(f"Formatted prompt: {response['prompt'][:100]}...")
except Exception as e:
    print(f"Chat template not available: {e}")

# 5.2 Code Infill (if supported)
print("\n5.2 Code Infill")
try:
    response = client.infill(
        prompt="def hello():\n    print(",
        suffix=")\n    return True",
    )
    print(f"Infilled code: {response}")
except Exception as e:
    print(f"Code infill not available: {e}")

# 5.3 Reranking
print("\n5.3 Reranking")
try:
    response = client.rerank(
        query="What is machine learning?",
        documents=[
            "ML is a subset of AI",
            "Deep learning uses neural networks",
            "The weather is sunny today",
        ]
    )
    print("Reranked results:")
    for result in response['results']:
        print(f"  {result}")
except Exception as e:
    print(f"Reranking not available: {e}")

# 5.4 LoRA Adapters
print("\n5.4 LoRA Adapters")
try:
    response = client.set_lora_adapters(
        lora_adapter=[],  # Empty list to clear
    )
    print(f"LoRA status: {response}")
except Exception as e:
    print(f"LoRA not available: {e}")

## 6. Batch Operations

In [None]:
print("=" * 60)
print("6. BATCH OPERATIONS")
print("=" * 60)

# Process multiple requests
requests = [
    "What is Python?",
    "What is CUDA?",
    "What is machine learning?",
]

print(f"\nProcessing {len(requests)} requests...\n")
results = []
for i, prompt in enumerate(requests, 1):
    response = client.chat(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
    )
    answer = response['choices'][0]['message']['content']
    results.append({"prompt": prompt, "answer": answer})
    print(f"[{i}] {prompt}")
    print(f"    {answer[:80]}...\n")

print(f"Processed {len(results)} requests successfully")
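
The loop in this section sends its prompts one at a time. If the server is configured to handle several requests in parallel, the same batch could be issued concurrently. The sketch below shows one way to do that with the standard library. `run_batch` and `fake_ask` are hypothetical names introduced here; `fake_ask` stands in for a function wrapping `client.chat(...)`, and whether `LlamaClient` is safe to share across threads is an assumption the source does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(ask, prompts, max_workers=4):
    """Call ask(prompt) for every prompt concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        answers = list(pool.map(ask, prompts))
    return [{"prompt": p, "answer": a} for p, a in zip(prompts, answers)]


def fake_ask(prompt):
    """Stand-in (assumption): replace with a function that calls client.chat(...)."""
    return f"(answer to: {prompt})"


results = run_batch(fake_ask, ["What is Python?", "What is CUDA?"])
for item in results:
    print(item["prompt"], "->", item["answer"])
```

Any speedup depends on how many requests the server processes at once; against a server that handles one request at a time, the calls simply queue.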

## 7. Error Handling

In [None]:
print("=" * 60)
print("7. ERROR HANDLING")
print("=" * 60)

# 7.1 Invalid parameters
print("\n7.1 Handling Invalid Parameters")
try:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=-100,  # Invalid: negative
    )
except Exception as e:
    print(f"Error caught: {type(e).__name__}")
    print(f"Message: {str(e)[:100]}")

# 7.2 Valid request succeeds
print("\n7.2 Valid Request")
try:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50,
    )
    print(f"✅ Success: Got response with {response.get('usage', {}).get('completion_tokens', 0)} tokens")
except Exception as e:
    print(f"Error: {e}")
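
Transient failures, such as a server that is still loading the model, are often handled by retrying with exponential backoff instead of failing on the first error. The following is a generic sketch, not part of the package: `call_with_retry` and `flaky` are hypothetical names, and `flaky` simulates a call that fails twice before succeeding.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n, then retry.

    The last error is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Stand-in (assumption) for a client call that fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server not ready")
    return "ok"


delays = []  # record the waits instead of actually sleeping in this demo
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

In real use, it is usually better to catch only the exception types that indicate a temporary problem; retrying a request with invalid parameters, as in 7.1, will fail every time.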

## 8. API Reference Summary

In [None]:
print("=" * 60)
print("API REFERENCE SUMMARY")
print("=" * 60)

api_endpoints = {
    "Chat & Completions": [
        "chat(messages, ...)",
        "stream_chat(messages, ...)",
        "complete(prompt, ...)",
        "stream_complete(prompt, ...)",
    ],
    "Embeddings & Tokens": [
        "embed(input)",
        "tokenize(content)",
        "detokenize(tokens)",
        "apply_template(messages)",
    ],
    "Advanced": [
        "infill(prompt, suffix)",
        "rerank(query, documents)",
        "set_lora_adapters(lora_adapter)",
    ],
    "Server": [
        "health()",
        "get_props()",
        "get_metrics()",
    ],
}

for category, methods in api_endpoints.items():
    print(f"\n{category}:")
    for method in methods:
        print(f"  • {method}")

print("\n" + "=" * 60)
print("See LlamaClient source for complete parameter documentation")
print("=" * 60)

## Stop Server

In [None]:
print("\n🛑 Stopping server...")
server.stop()
print("✅ Done")

## Common Parameters Across Endpoints

### Generation Parameters
- `max_tokens`: Maximum tokens to generate
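- `temperature`: Sampling temperature; 0.0 is deterministic, 1.0 and above is more creative
- `top_p`: Nucleus sampling threshold (0.0-1.0)
- `top_k`: Sample only from the K most likely tokens (typically 0-100)
- `repeat_penalty`: Penalize repetition (typically 1.0-2.0; 1.0 applies no penalty)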
- `frequency_penalty`: Frequency-based penalty
- `presence_penalty`: Presence-based penalty
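
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a toy sketch on a made-up 4-token vocabulary. It illustrates the general idea of each parameter and is not the server's actual sampler; the helper names and the logit values are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0 here
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=1.5)
print("top token prob at T=0.5:", round(sharp[0], 3))
print("top token prob at T=1.5:", round(flat[0], 3))
print("top_k=2 keeps tokens:", sorted(top_k_filter(softmax(logits), 2)))
print("top_p=0.9 keeps tokens:", sorted(top_p_filter(softmax(logits), 0.9)))
```

Lower temperatures concentrate probability on the top token, while `top_k` and `top_p` both shrink the set of candidate tokens before one is drawn.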

### Advanced Parameters
- `n_predict`: Alias for max_tokens
- `seed`: Random seed for reproducibility
- `stop`: Stop sequences (list of strings)
- `logit_bias`: Bias logits for specific tokens
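
As an illustration of what `stop` does, the sketch below cuts a piece of text at the first stop sequence it contains. It assumes the common behavior in which the stop sequence itself is left out of the output; the server applies this during generation, so `truncate_at_stop` is only a hypothetical client-side model of the effect.

```python
def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for seq in stop:
        if not seq:
            continue  # ignore empty stop strings
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Made-up generated text for the demonstration
generated = "Answer: 42\nQuestion: what next?"
print(repr(truncate_at_stop(generated, ["\nQuestion:", "###"])))
```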

## Best Practices

1. **Use streaming** for real-time feedback
2. **Batch requests** to improve throughput
3. **Cache embeddings** for semantic search
4. **Monitor health** for production systems
5. **Handle errors** gracefully
6. **Use appropriate models** for embeddings vs generation
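
Embedding caching (practice 3) can be sketched as a small in-memory store keyed by a hash of the input text. `EmbeddingCache` and `fake_embed` are hypothetical names introduced for this sketch; `fake_embed` stands in for a function that calls `client.embed(...)` and returns the vector.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings in memory, keyed by a hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    """Stand-in (assumption): replace with a call that returns a real embedding."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("Hello world")
second = cache.get("Hello world")  # served from the cache, no second embed call
print(first, "misses:", cache.misses)
```

For semantic search over a fixed document set, this avoids recomputing document embeddings on every query; a persistent store would be needed to keep them across sessions.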

## Next Notebooks

- `06_gpu_detection.ipynb` - Detailed GPU analysis