# Local-Llama-Inference - Streaming Responses

Demonstrates token-by-token streaming for real-time text generation.

## Features
- **Streaming Chat**: See tokens as they're generated
- **Real-time Output**: Perfect for interactive applications
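- **Python Generators**: Iterate over tokens with a plain `for` loop (the client is synchronous, so no async/await is needed)
- **Low Latency**: The first token prints as soon as it is generated, before the full response is complete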
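
As a minimal sketch of the generator behaviour these features rely on, the snippet below uses a hypothetical stand-in, `fake_token_stream`, instead of a real server call. It only illustrates that a generator produces each chunk when the consumer asks for it.

```python
from typing import Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        yield word + " "


stream = fake_token_stream("tokens arrive one at a time")
first = next(stream)    # only the first chunk has been produced at this point
rest = "".join(stream)  # consuming the generator produces the remaining chunks
```

A `for` loop over `client.stream_chat(...)` consumes the stream in the same lazy way, which is why output can be printed before generation finishes.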

In [1]:
# Step 1: import the SDK and the Hugging Face download helper
from local_llama_inference import LlamaServer, LlamaClient, detect_gpus
from pathlib import Path
from huggingface_hub import hf_hub_download

print("✅ Package imported")

✅ Package imported


In [2]:
import inspect
from local_llama_inference import LlamaServer
print(inspect.signature(LlamaServer.__init__))

(self, config: local_llama_inference.config.ServerConfig, binary_path: Optional[str] = None)


## Download Model

In [3]:
#Step 2

from pathlib import Path
from huggingface_hub import hf_hub_download

# Define your local models directory
models_dir = Path.home() / "models"
models_dir.mkdir(exist_ok=True)

# Download the Gemma-3-1b-it Q4_K_M model from the Unsloth repository
# if it is not already present in the local models directory
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir=str(models_dir),
    local_dir_use_symlinks=False  # Ensures the actual file is in your models folder
)

print(f"✅ Gemma 3 Model ready at: {model_path}")



✅ Gemma 3 Model ready at: /home/waqasm86/models/gemma-3-1b-it-Q4_K_M.gguf


In [4]:
print(models_dir)

/home/waqasm86/models


## Start Server

In [5]:
#Step 3

from local_llama_inference import LlamaServer, LlamaClient, ServerConfig

# model_path is already defined from Step 2
print("🚀 Starting server...")

# Create a configuration object with your desired settings
config = ServerConfig(
    model_path=model_path,      # full path to your .gguf file
    n_gpu_layers=1,             # offload 1 layer to GPU (adjust if needed)
    n_threads=1,                 # CPU threads for prompt processing
    # Optional: specify host/port if you want non-default values
    host="127.0.0.1",
    port=8090,
)

# Pass the config object to LlamaServer
server = LlamaServer(config)
server.start()
server.wait_ready(timeout=60)
print("✅ Server ready")

# LlamaClient defaults to http://127.0.0.1:8080, so point it at the port configured above
client = LlamaClient(base_url="http://127.0.0.1:8090")

🚀 Starting server...
✅ Server ready


## Example 1: Basic Streaming Chat

In [13]:
# Step 4: stream_chat() returns Iterator[str] (raw token strings), not chunk objects

client = LlamaClient(base_url="http://127.0.0.1:8090")
print("🤖 Streaming response:\n")
print("Q: Write a haiku about machine learning\nA: ", end="", flush=True)

# stream_chat() yields raw token strings directly
for token in client.stream_chat(
    messages=[{"role": "user", "content": "Write a haiku about machine learning"}],
    max_tokens=100,
):
    # token is a raw string, just print it
    if token:
        print(token, end="", flush=True)
print("\n")
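
For readers curious where the raw token strings might come from, here is a rough sketch of turning OpenAI-style server-sent-event lines into text fragments. It assumes the server emits `data: {...}` chunks with a `choices[0].delta.content` field and a final `data: [DONE]` line. The helper name `iter_tokens_from_sse` and the sample lines are hypothetical, and the source does not show how `stream_chat()` is actually implemented.

```python
import json
from typing import Iterable, Iterator


def iter_tokens_from_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent-event lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample events, not captured from a real server
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
text = "".join(iter_tokens_from_sse(sample_lines))
```

The first sample event carries only a role and no content, which is one reason the loops in this notebook check `if token:` before printing.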


In [11]:
# Use client.chat() for non-streaming (returns dict)
response = client.chat(
    messages=[{"role": "user", "content": "Write a haiku about machine learning"}],
    max_tokens=100
)
# chat() returns a dict (OpenAI-compatible JSON)
print("Assistant:", response["choices"][0]["message"]["content"])
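
Indexing `response["choices"][0]` raises an error if the list is empty or the key is missing. A defensive variant might look like the sketch below. `first_message_content` is a hypothetical helper, and the sample dict is hand-written in the OpenAI-compatible shape rather than taken from a real response.

```python
from typing import Any, Dict, Optional


def first_message_content(response: Dict[str, Any]) -> Optional[str]:
    """Return the first choice's message text, or None if the response has no choices."""
    choices = response.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    return message.get("content")


# Hand-written sample in the OpenAI-compatible shape
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hi"}, "finish_reason": "stop"}
    ]
}
content = first_message_content(sample_response)
```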


In [12]:
# Check available methods in LlamaClient
print("Available methods (no chat_completion in v0.1.0):")
methods = [m for m in dir(LlamaClient) if not m.startswith('_')]
for method in sorted(methods):
    print(f"  - {method}")


In [7]:
# Check what's actually inside LlamaClient
from local_llama_inference import LlamaClient
print("LlamaClient methods:")
print([method for method in dir(LlamaClient) if not method.startswith('_')])

LlamaClient methods:
['apply_template', 'chat', 'close', 'complete', 'detokenize', 'embed', 'erase_slot', 'get_lora_adapters', 'get_metrics', 'get_models', 'get_props', 'get_slots', 'health', 'infill', 'load_model', 'rerank', 'restore_slot', 'save_slot', 'set_lora_adapters', 'set_props', 'stream_chat', 'stream_complete', 'tokenize', 'unload_model']


In [8]:
import local_llama_inference
print(f"SDK version: {local_llama_inference.__version__}")
print(f"Installed at: {local_llama_inference.__file__}")

SDK version: 0.1.0
Installed at: /home/waqasm86/.local/lib/python3.11/site-packages/local_llama_inference/__init__.py


In [9]:
import inspect
print(inspect.getsource(LlamaClient))  # This will print the class definition
# Or for a cleaner view:
help(LlamaClient)

class LlamaClient:
    """Synchronous HTTP client for llama-server REST API."""

    def __init__(
        self,
        base_url: str = "http://127.0.0.1:8080",
        api_key: Optional[str] = None,
        timeout: float = 600.0,
    ):
        """
        Initialize client.

        Args:
            base_url: Base URL of llama-server
            api_key: Optional API key for authentication
            timeout: Request timeout in seconds
        """
        self.base_url = base_url.rstrip("/")
        self._headers = {}
        if api_key:
            self._headers["Authorization"] = f"Bearer {api_key}"
        self._client = httpx.Client(headers=self._headers, timeout=timeout)

    def _request(
        self,
        method: str,
        endpoint: str,
        **kwargs,
    ) -> httpx.Response:
        """Make HTTP request."""
        url = f"{self.base_url}{endpoint}"
        try:
            response = self._client.request(method, url, **kwargs)
            response.raise_for_st

AttributeError: 'LlamaClient' object has no attribute 'chat_completion'

## Example 2: Streaming Text Completion

In [None]:
print("🤖 Streaming completion:\n")
print("Prompt: The future of AI is...\nResponse: ", end="", flush=True)

prompt = "The future of AI is"
# stream_complete() returns Iterator[str] of raw tokens
for token in client.stream_complete(
    prompt=prompt,
    max_tokens=150,
    temperature=0.8,
):
    # token is a raw string
    if token:
        print(token, end="", flush=True)

print("\n")


## Example 3: Measuring Latency

In [None]:
import time

print("⏱️  Measuring token generation latency:\n")

start_time = time.perf_counter()  # monotonic clock, better suited to measuring intervals
token_count = 0
first_token_time = None

print("Prompt: Explain quantum computing in one sentence\nResponse: ", end="", flush=True)

# stream_chat() yields raw token strings
for token in client.stream_chat(
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence"}
    ],
    max_tokens=100,
):
    # token is a raw string
    if token:
        token_count += 1
        if first_token_time is None:
            first_token_time = time.perf_counter() - start_time
        print(token, end="", flush=True)

total_time = time.perf_counter() - start_time

print("\n")
print("\n📊 Latency Metrics:")
if first_token_time is not None:
    print(f"  First token latency: {first_token_time*1000:.1f} ms")
else:
    print("  First token latency: n/a (no tokens received)")
print(f"  Tokens generated: {token_count}")
print(f"  Total time: {total_time:.2f} seconds")
if total_time > 0:
    print(f"  Tokens/second: {token_count/total_time:.2f}")
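
The same measurement can be wrapped in a reusable function that accepts any iterator of strings. This is a sketch: `StreamStats`, `measure_stream`, and the `slow_tokens` stand-in generator are hypothetical names, and the count is of yielded chunks, which may not map one-to-one to model tokens.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamStats:
    text: str
    chunks: int
    first_chunk_s: Optional[float]  # None when the stream produced nothing
    total_s: float

    @property
    def chunks_per_second(self) -> float:
        return self.chunks / self.total_s if self.total_s > 0 else 0.0


def measure_stream(tokens: Iterable[str]) -> StreamStats:
    """Consume a token stream and record time-to-first-chunk and total time."""
    start = time.perf_counter()
    first = None
    parts = []
    for token in tokens:
        if not token:
            continue  # ignore empty chunks, as the notebook loops do
        if first is None:
            first = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return StreamStats("".join(parts), len(parts), first, total)


def slow_tokens() -> Iterator[str]:
    """Stand-in stream with a small artificial delay per chunk."""
    for t in ["a", "", "b", "c"]:
        time.sleep(0.01)
        yield t


stats = measure_stream(slow_tokens())
empty_stats = measure_stream(iter([]))
```

With a running server, the call would be `measure_stream(client.stream_chat(messages=[...], max_tokens=100))`.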


## Example 4: Multiple Streaming Requests in Sequence

In [None]:
print("🔄 Streaming multiple requests sequentially:\n")

prompts = [
    "What is Python?",
    "What is CUDA?",
    "What is GPU computing?",
]

for i, prompt in enumerate(prompts, 1):
    print(f"\n[{i}] Q: {prompt}")
    print("    A: ", end="", flush=True)
    
    # stream_chat() yields raw token strings
    for token in client.stream_chat(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=80,
    ):
        if token:
            print(token, end="", flush=True)
    
    print()  # Newline
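
The loop above handles one prompt at a time. If the prompts are independent, a thread pool could issue them concurrently and collect the full answers, as in the sketch below. This rests on assumptions the source does not confirm: that `LlamaClient` is safe to share across threads and that the server is configured to process requests in parallel. `collect_answers` and `echo_stream` are hypothetical names, and `echo_stream` replaces the real call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def collect_answers(
    prompts: List[str],
    stream_fn: Callable[[str], Iterable[str]],
    max_workers: int = 3,
) -> Dict[str, str]:
    """Run stream_fn for each prompt in a thread pool and join each stream into one string."""

    def run(prompt: str) -> str:
        return "".join(token for token in stream_fn(prompt) if token)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(run, prompts))  # map() preserves input order
    return dict(zip(prompts, answers))


def echo_stream(prompt: str) -> Iterable[str]:
    """Stand-in for a streaming call: yields a few chunks that echo the prompt."""
    for chunk in ("echo:", " ", prompt):
        yield chunk


results = collect_answers(["What is Python?", "What is CUDA?"], echo_stream)
```

Answers are collected rather than printed token by token, because concurrent streams written to the same terminal would interleave.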


## Example 5: Interactive Streaming (Multi-turn)

In [None]:
print("💬 Interactive streaming conversation:\n")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant who speaks concisely."}
]

# Turn 1
user_input = "What is recursion in programming?"
print(f"👤 User: {user_input}")
print("🤖 Assistant: ", end="", flush=True)

messages.append({"role": "user", "content": user_input})
assistant_response = ""

# stream_chat() yields raw token strings
for token in client.stream_chat(
    messages=messages,
    max_tokens=100,
):
    if token:
        assistant_response += token
        print(token, end="", flush=True)

print("\n")
messages.append({"role": "assistant", "content": assistant_response})

# Turn 2
user_input = "Can you give an example?"
print(f"👤 User: {user_input}")
print("🤖 Assistant: ", end="", flush=True)

messages.append({"role": "user", "content": user_input})
assistant_response = ""

# stream_chat() yields raw token strings
for token in client.stream_chat(
    messages=messages,
    max_tokens=100,
):
    if token:
        assistant_response += token
        print(token, end="", flush=True)

print()
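
The two turns above repeat the same steps: append the user message, stream the reply, then append the assistant message. A small helper can capture that pattern. In this sketch, `stream_turn` and `fake_chat_stream` are hypothetical names, and the fake stream replaces the real client so the example runs without a server.

```python
from typing import Callable, Dict, Iterable, List

Message = Dict[str, str]


def stream_turn(
    messages: List[Message],
    user_input: str,
    stream_fn: Callable[[List[Message]], Iterable[str]],
) -> str:
    """Append the user turn, stream and print the reply, then record it in the history."""
    messages.append({"role": "user", "content": user_input})
    parts = []
    for token in stream_fn(messages):
        if token:
            parts.append(token)
            print(token, end="", flush=True)
    print()
    reply = "".join(parts)
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_chat_stream(messages: List[Message]) -> Iterable[str]:
    """Stand-in for a streaming chat call: echoes the latest user message."""
    yield "reply to: "
    yield messages[-1]["content"]


history = [{"role": "system", "content": "You are concise."}]
stream_turn(history, "What is recursion?", fake_chat_stream)
stream_turn(history, "Can you give an example?", fake_chat_stream)
```

With the real client, `stream_fn` could be `lambda msgs: client.stream_chat(messages=msgs, max_tokens=100)`.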


## Stop Server

In [None]:
print("\n🛑 Stopping server...")
server.stop()
print("✅ Done")
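
If a cell raises before this point, the server process may be left running. One way to guard against that is a context manager that always calls `stop()`. The sketch below uses only the `start()`, `wait_ready()`, and `stop()` calls shown earlier. `running` and `FakeServer` are hypothetical names, with `FakeServer` standing in for `LlamaServer` so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def running(server, timeout=60):
    """Start the server, wait until it is ready, and always stop it on exit."""
    server.start()
    try:
        server.wait_ready(timeout=timeout)
        yield server
    finally:
        server.stop()


class FakeServer:
    """Stand-in that records which lifecycle methods were called."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def wait_ready(self, timeout):
        self.events.append("ready")

    def stop(self):
        self.events.append("stop")


fake = FakeServer()
try:
    with running(fake):
        raise RuntimeError("simulated failure mid-stream")
except RuntimeError:
    pass  # the server was still stopped by the finally block
```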

## Key Points

- **Streaming**: Use `stream_chat()` or `stream_complete()` for token-by-token generation
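- **Python Generators**: Iterate over the stream with a `for` loop; each item is a raw token string, not a chunk object
- **Low Latency**: The first token is displayed as soon as it arrives, so users get feedback before generation finishes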
- **Multi-turn**: Build conversation history as you stream
- **Parameters**: Sampling parameters such as `temperature` and `max_tokens` can be passed to the streaming calls

## Next Notebooks

- `03_embeddings.ipynb` - Generate text embeddings
- `04_multi_gpu.ipynb` - Multi-GPU tensor parallelism
- `05_advanced_api.ipynb` - All 30+ API endpoints