# Local-Llama-Inference v0.1.0 - Quick Start

This notebook demonstrates:
1. **Auto-Download of CUDA Binaries** (834 MB from Hugging Face)
2. **GPU Detection** (NVIDIA sm_50+)
3. **Basic Chat with LLM**
4. **Server Management**

## About This Notebook

- **First Run**: Downloads 834 MB of CUDA binaries from Hugging Face (~10-15 minutes)
- **Cached**: Subsequent runs reuse the cached binaries, so the import is near-instant
- **GPU Required**: NVIDIA GPU with compute capability 5.0+ (Maxwell, Pascal, and newer)
- **Models**: Download GGUF models from https://huggingface.co/models?search=gguf
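
The 10-15 minute estimate for the first run corresponds to a link of roughly 7-11 Mbit/s. The sketch below shows the arithmetic for other connection speeds; the helper name `download_minutes` and the listed speeds are illustrative assumptions, and protocol overhead is ignored.

```python
def download_minutes(size_mb: float, link_mbit_per_s: float) -> float:
    """Estimate transfer time in minutes for size_mb megabytes over a link
    of link_mbit_per_s megabits per second (ignores protocol overhead)."""
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    seconds = size_mb * 8 / link_mbit_per_s  # 8 bits per byte
    return seconds / 60


for speed in (10, 50, 100):  # illustrative link speeds in Mbit/s
    print(f"{speed:>3} Mbit/s: {download_minutes(834, speed):.1f} min")
```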

## Step 1: Install Package (if not already installed)

```bash
pip install git+https://github.com/Local-Llama-Inference/Local-Llama-Inference.git@v0.1.0
```

Or from PyPI when available:
```bash
pip install local-llama-inference
```
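
Because importing the package triggers the binary download, it can be useful to check whether it is installed without importing it. The following is a minimal sketch using only the standard library; the helper name `is_installed` is an assumption for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("local_llama_inference"):
    print("local_llama_inference is already installed")
else:
    print("local_llama_inference not found; run one of the pip commands above")
```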

## Step 2: Import Package (Triggers Auto-Download on First Import)

⚠️ **First time only**: This import downloads the CUDA binaries from Hugging Face.

In [None]:
# Import the main classes
from local_llama_inference import (
    LlamaServer,
    LlamaClient,
    detect_gpus,
    suggest_tensor_split,
)

print("✅ Package imported successfully!")
print("✅ CUDA binaries are now downloaded and cached")

## Step 3: Detect Available GPUs

In [None]:
# Detect available GPUs
gpus = detect_gpus()

print(f"\n🎮 Detected {len(gpus)} GPU(s):\n")
for i, gpu in enumerate(gpus):
    print(f"GPU {i}: {gpu['name']}")
    print(f"  Compute Capability: {gpu['compute_capability']}")
    print(f"  VRAM: {gpu['memory_mb']} MB")
    print(f"  UUID: {gpu['uuid']}")
    print()
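
To check the detected GPUs against the sm_50 minimum, the compute capability can be compared as a `(major, minor)` tuple rather than as a string or float. This is a sketch only: it assumes `compute_capability` is reported in a form like `"8.6"`, which the package may or may not do, and the sample entries below are hypothetical data shaped like the dictionaries printed above.

```python
MIN_COMPUTE_CAPABILITY = (5, 0)  # sm_50, the minimum stated for this notebook


def parse_capability(value) -> tuple:
    """Turn '8.6' (or 8.6) into (8, 6) so versions compare correctly."""
    major, _, minor = str(value).partition(".")
    return int(major), int(minor or 0)


def supported_gpus(gpus, minimum=MIN_COMPUTE_CAPABILITY):
    """Keep only the GPUs whose compute capability meets the minimum."""
    return [g for g in gpus if parse_capability(g["compute_capability"]) >= minimum]


# Hypothetical sample data; real values come from detect_gpus().
sample = [
    {"name": "GeForce GTX 780", "compute_capability": "3.5", "memory_mb": 3072},
    {"name": "GeForce RTX 3060", "compute_capability": "8.6", "memory_mb": 12288},
]
for gpu in supported_gpus(sample):
    print(f"{gpu['name']} is supported")
```

Tuple comparison avoids the pitfall where the string `"10.0"` sorts before `"5.0"`.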

## Step 4: Download a GGUF Model

For this example, we'll use a small model. Popular options:
- **Mistral 7B Q4** (4.3 GB): TheBloke/Mistral-7B-Instruct-v0.1-GGUF
- **Phi-2 Q4** (1.4 GB): TheBloke/phi-2-GGUF  
- **Llama 2 7B Q4** (3.8 GB): TheBloke/Llama-2-7B-Chat-GGUF
- **Orca Mini 3B Q4** (1.9 GB): TheBloke/orca_mini-3B-GGUF

In [None]:
from huggingface_hub import hf_hub_download
from pathlib import Path

# Create models directory
models_dir = Path.home() / "models"
models_dir.mkdir(exist_ok=True)

print("📥 Downloading Phi-2 Q4 model (1.4 GB)...")
print("   This may take a few minutes depending on internet speed...\n")

# Download Phi-2 Q4 model (smaller, faster)
model_path = hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
    local_dir=str(models_dir),
)

print(f"✅ Model downloaded to: {model_path}")
print(f"   File size: {Path(model_path).stat().st_size / (1024**3):.2f} GB")
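
A quick sanity check on the downloaded file can catch a truncated or mislabeled download before the server tries to load it. GGUF files start with the four magic bytes `GGUF` followed by a little-endian 32-bit format version. The helper below is a sketch that reads only that header; the name `read_gguf_version` is an assumption for illustration and is not part of the package.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path) -> int:
    """Return the GGUF format version, or raise ValueError if the file
    does not start with the GGUF magic bytes."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

In the notebook, calling `read_gguf_version(model_path)` after the download cell would report the format version of the Phi-2 file.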

## Step 5: Start LLM Server

In [None]:
# Detect GPUs so the startup message can report how many are available
gpus = detect_gpus()

print(f"🚀 Starting LLM server with {len(gpus)} GPU(s)...\n")

# Create the server; tune these values for your model and hardware
server = LlamaServer(
    model_path=model_path,
    n_gpu_layers=33,  # Offload layers to GPU
    n_threads=4,      # CPU threads
    n_ctx=2048,       # Context size
    verbose=False,
)

# Start the server
server.start()
print("⏳ Waiting for server to be ready...")

# Wait for server to be ready
server.wait_ready(timeout=60)
print(f"✅ Server ready at {server.base_url}")
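
Waiting for readiness is typically a poll-with-timeout loop. The sketch below illustrates that general pattern with the standard library only; it is not the package's implementation of `wait_ready`, and `wait_until` and `fake_probe` are hypothetical names used for the demonstration.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Call predicate() every `interval` seconds until it returns True.
    Raises TimeoutError if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Stand-in for a real readiness probe: reports ready on the third check.
checks = {"count": 0}


def fake_probe():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(fake_probe, timeout=5, interval=0.01)
print(f"ready after {checks['count']} checks")
```

Using `time.monotonic()` keeps the deadline correct even if the system clock is adjusted while waiting.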

## Step 6: Create Client and Chat

In [None]:
# Create client
client = LlamaClient()

# Send a chat message
print("🤖 Sending message to LLM...\n")

response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning? Explain in 2-3 sentences."}
    ],
    temperature=0.7,
    max_tokens=200,
)

print("Response:")
print(response.choices[0].message.content)

## Step 7: Multi-Turn Conversation

In [None]:
# Multi-turn conversation
messages = [
    {"role": "system", "content": "You are a helpful programming assistant."},
]

# First turn
user_message = "Explain Python decorators in simple terms."
print(f"👤 User: {user_message}\n")

messages.append({"role": "user", "content": user_message})

response = client.chat_completion(
    messages=messages,
    temperature=0.7,
    max_tokens=300,
)

assistant_message = response.choices[0].message.content
print(f"🤖 Assistant: {assistant_message}\n")

messages.append({"role": "assistant", "content": assistant_message})

# Second turn
user_message = "Can you show me a simple example?"
print(f"👤 User: {user_message}\n")

messages.append({"role": "user", "content": user_message})

response = client.chat_completion(
    messages=messages,
    temperature=0.7,
    max_tokens=300,
)

assistant_message = response.choices[0].message.content
print(f"🤖 Assistant: {assistant_message}")
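
Both turns above repeat the same three steps: append the user message, request a completion, append the reply. One way to factor that out is a small helper, sketched here. `chat_turn` is a hypothetical name, and `EchoClient` is a stand-in that mimics only the `chat_completion(...)` call and the `response.choices[0].message.content` shape used in this notebook, so the example runs without a server.

```python
from types import SimpleNamespace


def chat_turn(client, messages, user_message, **params):
    """Append the user message, request a completion, record the reply
    in `messages`, and return the reply text."""
    messages.append({"role": "user", "content": user_message})
    response = client.chat_completion(messages=messages, **params)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


class EchoClient:
    """Hypothetical stand-in for LlamaClient so the sketch runs offline."""

    def chat_completion(self, messages, **params):
        text = f"(echo) {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


history = [{"role": "system", "content": "You are a helpful programming assistant."}]
demo = EchoClient()
for question in ("Explain Python decorators in simple terms.",
                 "Can you show me a simple example?"):
    print("User:", question)
    print("Assistant:", chat_turn(demo, history, question, temperature=0.7, max_tokens=300))
```

With a running server, passing the real `client` in place of `demo` should work the same way, provided its responses have the shape shown in the cells above.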

## Step 8: Check Server Health

In [None]:
# Check server health
health = client.health()
print("📊 Server Health:")
print(f"  Status: {health.get('status', 'unknown')}")

# Get server properties
props = client.get_props()
settings = props.get('default_generation_settings', {})
print("\n⚙️  Server Properties:")
print(f"  Model: {settings.get('model', 'unknown')}")
print(f"  Context Size: {settings.get('n_ctx', 'unknown')}")
print(f"  GPU Layers: {settings.get('n_gpu_layers', 'unknown')}")

## Step 9: Stop Server (Cleanup)

⚠️ **Important**: Always stop the server when you are done to free GPU memory.

In [None]:
# Stop the server
print("🛑 Stopping server...")
server.stop()
print("✅ Server stopped")
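
If a cell raises an exception before this point, the server keeps running and continues to hold GPU memory. Outside a notebook, one way to guarantee cleanup is a context manager that stops the server in a `finally` block. This is a sketch under assumptions: `running` is a hypothetical helper, and `FakeServer` is a stand-in exposing only the `start()`, `wait_ready(timeout=...)`, and `stop()` calls used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def running(server, timeout=60):
    """Start `server`, wait until it is ready, and always stop it on exit."""
    server.start()
    try:
        server.wait_ready(timeout=timeout)
        yield server
    finally:
        server.stop()


class FakeServer:
    """Hypothetical stand-in exposing the start/wait_ready/stop calls used above."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def wait_ready(self, timeout):
        self.events.append("ready")

    def stop(self):
        self.events.append("stop")


fake = FakeServer()
try:
    with running(fake):
        raise RuntimeError("simulated failure mid-session")
except RuntimeError:
    pass
print(fake.events)
```

The same wrapper could be applied to a real `LlamaServer` instance, assuming it exposes those three methods as shown in Steps 5 and 9.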

## Summary

✅ **You've successfully:**
1. Auto-downloaded CUDA binaries from Hugging Face (one-time setup)
2. Detected your GPU and its capabilities
3. Downloaded a GGUF model
4. Started a local LLM server
5. Created a multi-turn conversation
6. Checked server health
7. Properly cleaned up resources

## Next Steps

- Check out `02_streaming_responses.ipynb` for token-by-token generation
- See `03_embeddings.ipynb` for text embeddings
- Try `04_multi_gpu.ipynb` for multi-GPU tensor parallelism
- Explore `05_advanced_api.ipynb` for all available endpoints

## Resources

- **GitHub**: https://github.com/Local-Llama-Inference/Local-Llama-Inference
- **Models**: https://huggingface.co/models?search=gguf
- **Binaries**: https://huggingface.co/datasets/waqasm86/Local-Llama-Inference/
- **Documentation**: Check README.md in the repository