# Local-Llama-Inference - Multi-GPU Tensor Parallelism

Demonstrates how to distribute a large model across multiple GPUs using tensor parallelism.

## Features
- **GPU Detection**: Automatically detect all available GPUs
- **Tensor Split**: Automatic layer distribution across GPUs
- **NCCL Communication**: Multi-GPU communication via NVIDIA NCCL
- **Optimal Configuration**: Suggestions for best settings
- **Scaling**: Near-linear speedup with more GPUs in the ideal case; inter-GPU communication overhead reduces it in practice

In [None]:
from local_llama_inference import (
    LlamaServer,
    LlamaClient,
    detect_gpus,
    suggest_tensor_split,
)
from pathlib import Path
from huggingface_hub import hf_hub_download

print("✅ Package imported")

## Step 1: Detect Available GPUs

In [None]:
# Detect all GPUs
gpus = detect_gpus()

print(f"🎮 Detected {len(gpus)} GPU(s):\n")

total_vram = 0
for i, gpu in enumerate(gpus):
    print(f"GPU {i}: {gpu['name']}")
    print(f"  Compute Capability: {gpu['compute_capability']}")
    print(f"  VRAM: {gpu['memory_mb']} MB ({gpu['memory_mb']/1024:.2f} GB)")
    print(f"  UUID: {gpu['uuid']}")
    print()
    total_vram += gpu['memory_mb']

print(f"📊 Total VRAM: {total_vram} MB ({total_vram/1024:.2f} GB)")

## Step 2: Get Tensor Split Recommendations

In [None]:
# Get optimal tensor split
tensor_split = suggest_tensor_split(gpus)

print("💡 Recommended Tensor Split:\n")
if isinstance(tensor_split, list):
    for i, split in enumerate(tensor_split):
        print(f"  GPU {i}: {split*100:.1f}%")
else:
    print(f"  Single GPU (only 1 GPU detected)")

print(f"\n📝 Tensor Split Explanation:")
print(f"  - Distributes model layers across GPUs")
print(f"  - Proportional to GPU VRAM")
print(f"  - Requires NVIDIA NCCL for communication")
print(f"  - Linear scaling: 2x GPUs ≈ 2x faster (ideal case)")
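
The "proportional to GPU VRAM" rule can be sketched in a few lines. This is an illustration only: `proportional_split` is a hypothetical helper, not the package's `suggest_tensor_split`, whose actual logic is not shown in this notebook and may differ.

```python
def proportional_split(memory_mb):
    """Return one fraction per GPU, proportional to its VRAM.

    Hypothetical helper for illustration; not part of local_llama_inference.
    """
    total = sum(memory_mb)
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    return [mb / total for mb in memory_mb]


# Example: a 24 GB card paired with an 8 GB card
print(proportional_split([24576, 8192]))
```

With the `gpus` list from Step 1, the input would be `[gpu['memory_mb'] for gpu in gpus]`.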

## Step 3: Download Model

In [None]:
models_dir = Path.home() / "models"
models_dir.mkdir(exist_ok=True)

# For multi-GPU, larger models show better speedup
# Using Phi-2 as example (works on single or multi-GPU)
print("📥 Downloading model...")

model_path = hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
    local_dir=str(models_dir),
)

print(f"✅ Model ready: {model_path}")

## Step 4: Start Server with Multi-GPU Support

In [None]:
print("🚀 Starting server with multi-GPU support...\n")

if len(gpus) > 1:
    # Multi-GPU configuration
    server = LlamaServer(
        model_path=model_path,
        n_gpu_layers=33,
        tensor_split=tensor_split,  # Distribute across GPUs
        n_threads=4,
        verbose=False,
    )
    print(f"📍 Using {len(gpus)} GPUs with tensor parallelism")
else:
    # Single GPU configuration
    server = LlamaServer(
        model_path=model_path,
        n_gpu_layers=33,
        n_threads=4,
        verbose=False,
    )
    print(f"📍 Using 1 GPU")

server.start()
print("⏳ Waiting for server to be ready...")
server.wait_ready(timeout=60)
print(f"✅ Server ready at {server.base_url}")

client = LlamaClient()

## Step 5: Verify GPU Utilization

In [None]:
import subprocess
import time

print("📊 GPU Utilization (nvidia-smi):\n")

try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    # Show the last 15 lines of the output (GPU info)
    lines = result.stdout.split('\n')
    for line in lines[-15:]:
        if line.strip():
            print(line)
except FileNotFoundError:
    # Raised when the nvidia-smi binary is not on PATH
    print("nvidia-smi not available")

## Step 6: Run Inference with Multi-GPU

In [None]:
import time

# Test single inference
prompt = "What are the benefits of using multiple GPUs for machine learning?"

print(f"🤖 Generating response (multi-GPU)...\n")
print(f"Prompt: {prompt}\n")

start_time = time.time()

response = client.chat(
    messages=[
        {"role": "user", "content": prompt}
    ],
    max_tokens=200,
    temperature=0.7,
)

elapsed = time.time() - start_time

answer = response['choices'][0]['message']['content']
print(f"Response: {answer}\n")
print(f"⏱️  Generation time: {elapsed:.2f} seconds")

## Step 7: Benchmark Multi-GPU Inference

In [None]:
import time

# Simple benchmark with multiple prompts
prompts = [
    "Explain Python in one sentence.",
    "What is machine learning?",
    "Describe cloud computing.",
]

print("⏱️  Benchmarking multi-GPU inference...\n")

total_time = 0
total_tokens = 0

for i, prompt in enumerate(prompts, 1):
    start_time = time.time()

    response = client.chat(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )

    elapsed = time.time() - start_time
    # The response is a dict (see Step 6), so look up 'usage' with .get();
    # fall back to max_tokens (100) when no usage data is reported
    usage = response.get('usage')
    tokens = usage.get('completion_tokens', 0) if usage else 100

    total_time += elapsed
    total_tokens += tokens

    print(f"[{i}] Time: {elapsed:.2f}s | {prompt[:40]}...")

avg_time = total_time / len(prompts)
throughput = total_tokens / total_time if total_time > 0 else 0

print(f"\n📊 Benchmark Results:")
print(f"  Total time: {total_time:.2f} seconds")
print(f"  Average per prompt: {avg_time:.2f} seconds")
print(f"  Throughput: {throughput:.1f} tokens/second")
print(f"  Total tokens: {total_tokens}")
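
To compare single-GPU and multi-GPU performance, the same prompts have to be run under each configuration and the throughputs compared. A minimal sketch, assuming you have recorded `(elapsed_seconds, completion_tokens)` pairs for each run; `throughput` and `speedup` are hypothetical helper names, and the numbers below are placeholders rather than measurements.

```python
def throughput(records):
    """Tokens per second over a list of (elapsed_seconds, tokens) pairs."""
    total_time = sum(elapsed for elapsed, _ in records)
    total_tokens = sum(tokens for _, tokens in records)
    return total_tokens / total_time if total_time > 0 else 0.0


def speedup(baseline_records, candidate_records):
    """Ratio of candidate throughput to baseline throughput."""
    base = throughput(baseline_records)
    if base == 0:
        raise ValueError("baseline throughput is zero")
    return throughput(candidate_records) / base


# Illustrative placeholder timings only, not measured results
single_gpu = [(2.0, 100), (2.0, 100)]
multi_gpu = [(1.25, 100), (1.25, 100)]
print(f"Speedup: {speedup(single_gpu, multi_gpu):.2f}x")
```

Restarting the server without `tensor_split` and rerunning the benchmark cell would give the baseline records.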

## Step 8: Multi-Turn Conversation with Multi-GPU

In [None]:
print("💬 Multi-turn conversation with multi-GPU:\n")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."}
]

# Turn 1
user_input = "What is distributed computing?"
print(f"User: {user_input}")
messages.append({"role": "user", "content": user_input})

response = client.chat(
    messages=messages,
    max_tokens=100,
)

assistant_message = response['choices'][0]['message']['content']
print(f"Assistant: {assistant_message}\n")
messages.append({"role": "assistant", "content": assistant_message})

# Turn 2
user_input = "How does it relate to GPUs?"
print(f"User: {user_input}")
messages.append({"role": "user", "content": user_input})

response = client.chat(
    messages=messages,
    max_tokens=100,
)

assistant_message = response['choices'][0]['message']['content']
print(f"Assistant: {assistant_message}")

## Step 9: Stop Server

In [None]:
print("\n🛑 Stopping server...")
server.stop()
print("✅ Done")
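
If an earlier cell raises, `server.stop()` may never run and the model can stay resident in VRAM. One way to guard against that is a small context manager. The sketch below assumes only that the server object exposes `start()` and `stop()`, as used in this notebook; `running` and `FakeServer` are hypothetical names introduced for the demonstration.

```python
from contextlib import contextmanager


@contextmanager
def running(server):
    """Start `server`, yield it, and always stop it on exit."""
    server.start()
    try:
        yield server
    finally:
        server.stop()


class FakeServer:
    """Stand-in used only to demonstrate the pattern without a GPU."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakeServer()
try:
    with running(fake):
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(fake.events)  # stop() ran despite the exception
```

With the real server, the body of the `with` block would hold the `wait_ready` call and the inference code.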

## Key Concepts

### Tensor Parallelism
- **Distribution**: Splits model layers across GPUs
- **Communication**: Uses NVIDIA NCCL for GPU-to-GPU communication
- **Scaling**: Pays off most for very large models, where per-GPU compute outweighs communication overhead
- **Throughput**: Linear improvement with more GPUs (ideal case)
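
Why is linear scaling only the ideal case? One common way to reason about it is Amdahl's law, a general model that is not specific to this package: if only a fraction `p` of the work is spread across `n` GPUs, the remainder (including synchronization) caps the achievable speedup. A small sketch, where the value `0.9` is an illustrative assumption rather than a measurement:

```python
def amdahl_speedup(parallel_fraction, n_gpus):
    """Upper bound on speedup when only part of the work is parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_gpus)


# Assumed 90% parallel fraction, for illustration only
for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): {amdahl_speedup(0.9, n):.2f}x")
```

Only when `parallel_fraction` is 1.0 does the function return exactly `n_gpus`, which is the linear case described above.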

### When to Use Multi-GPU
- Model doesn't fit on a single GPU
- Want faster inference
- Have multiple identical GPUs
- Running production workloads
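
The first criterion can be checked roughly before starting the server. The sketch below is a heuristic under stated assumptions: `fits_on_single_gpu` is a hypothetical helper, and the 20% headroom reserved for the KV cache and runtime buffers is a guess, not a value taken from the package.

```python
def fits_on_single_gpu(model_bytes, vram_mb, headroom=0.2):
    """Rough check: does the model file fit in VRAM after reserving headroom?

    `headroom` is an assumed fraction of VRAM kept free for the KV cache
    and runtime buffers; tune it for your context length and batch size.
    """
    usable_bytes = vram_mb * 1024 * 1024 * (1.0 - headroom)
    return model_bytes <= usable_bytes


# Example: a 2 GiB model file against an 8 GiB and a 2 GiB card
two_gib = 2 * 1024**3
print(fits_on_single_gpu(two_gib, 8192))
print(fits_on_single_gpu(two_gib, 2048))
```

In this notebook the inputs would be `Path(model_path).stat().st_size` and `gpus[0]['memory_mb']`.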

### Performance Considerations
- **PCIe Bandwidth**: GPU-to-GPU communication overhead
- **Network**: Interconnect bandwidth and latency matter if GPUs are on different machines
- **Model Size**: Very large models show better scaling
- **Batch Size**: Larger batches improve GPU utilization

## Monitoring

Use `nvidia-smi` to monitor GPU utilization:
```bash
nvidia-smi dmon  # Real-time monitoring
nvidia-smi       # One-time snapshot
```
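
For monitoring from Python, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output; `parse_gpu_csv` and `query_gpus` are hypothetical helper names, the sample values are made up, and the parser assumes every field is numeric (some GPUs report `[N/A]` for utilization, which this sketch does not handle).

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "used_mb": int(used),
            "total_mb": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Made-up sample in the same shape as the real output
sample = "0, 37, 2048, 24576\n1, 35, 2010, 24576\n"
for row in parse_gpu_csv(sample):
    print(row)
```

Calling `query_gpus()` before and after Step 6 would show whether memory is actually allocated on every GPU.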

## Next Notebooks

- `05_advanced_api.ipynb` - All 30+ API endpoints
- `06_gpu_detection.ipynb` - Detailed GPU information