# Advanced Ollama Configuration

**Optimizing Your Local AI Setup**

---

Welcome to the advanced guide for **Ollama configuration**! This notebook covers model quantization, hardware acceleration, custom Modelfiles, performance tuning, and troubleshooting. By the end of this 10-minute tutorial, you'll be able to tune your local AI environment to get the most out of your hardware.

### 🎯 What You'll Learn

In this advanced tutorial, you will:
- Understand model sizes and quantization levels
- Configure GPU acceleration for faster inference
- Create custom modelfiles for specialized behavior
- Optimize memory usage and performance
- Troubleshoot common issues
- Build production-ready local AI systems

### 📋 Prerequisites

This notebook assumes you've completed **Video 4** and have:
- Ollama installed and running
- Basic understanding of local models
- At least one model downloaded

## 🔍 Step 1: Installation Verification

### Re-checking Your Setup
Let's verify that Ollama is properly installed and provide troubleshooting guidance if needed.

#### 📝 Haven't installed Ollama yet?
See the **Video 4** notebook (`04_local_models_ollama.ipynb`), which contains detailed installation instructions for:
- 🪟 Windows (installer download and setup)
- 🍎 macOS (Homebrew or direct download)
- 🐧 Linux (one-line install script)

The installation typically takes just 2-3 minutes and is a one-time setup.

In [None]:
import platform
import subprocess
import os
import json
import psutil  # For system monitoring

def check_ollama_installation():
    """Check Ollama installation and provide troubleshooting guidance."""
    try:
        # Try to run ollama version command
        result = subprocess.run(['ollama', '--version'], 
                              capture_output=True, text=True, 
                              shell=(platform.system() == 'Windows'))
        
        if result.returncode == 0:
            print("✅ Ollama is installed!")
            print(f"   Version: {result.stdout.strip()}")
            
            # Check if Ollama service is running
            try:
                list_result = subprocess.run(['ollama', 'list'], 
                                           capture_output=True, text=True,
                                           shell=(platform.system() == 'Windows'))
                if list_result.returncode == 0:
                    print("✅ Ollama service is running!")
                else:
                    print("⚠️  Ollama is installed but service might not be running.")
                    print("   Try: 'ollama serve' in a separate terminal")
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                print("⚠️  Could not verify Ollama service status")
                
            return True
        else:
            raise Exception("Ollama command failed")
            
    except Exception:  # FileNotFoundError (not on PATH) or the failure raised above
        print("❌ Ollama is not properly installed or configured.")
        print("\n🔧 Troubleshooting Guide:\n")
        
        system = platform.system()
        
        print("1. Common Issues:")
        print("   • PATH not updated after installation")
        print("   • Service not started automatically")
        print("   • Firewall blocking Ollama")
        
        if system == "Windows":
            print("\n2. Windows-specific fixes:")
            print("   • Restart your terminal/VS Code")
            print("   • Check that the Ollama app is running (system tray icon)")
            print("   • Run as Administrator if needed")
            
        elif system == "Darwin":
            print("\n2. macOS-specific fixes:")
            print("   • Check if Ollama app is running")
            print("   • Homebrew installs: 'brew services restart ollama'")
            print("   • Check Activity Monitor for 'ollama'")
            
        elif system == "Linux":
            print("\n2. Linux-specific fixes:")
            print("   • Check systemd: 'sudo systemctl status ollama'")
            print("   • Start service: 'sudo systemctl start ollama'")
            print("   • Check logs: 'journalctl -u ollama'")
        
        print("\n3. If still not working:")
        print("   • Reinstall Ollama from https://ollama.com")
        print("   • Check GitHub issues: https://github.com/ollama/ollama/issues")
        return False

# Check installation
ollama_ok = check_ollama_installation()

if ollama_ok:
    print("\n🚀 Great! Let's dive into advanced configuration!")
else:
    print("\n⚠️  Please resolve the installation issues before continuing.")

## 📊 Step 2: Understanding Model Quantization

### What is Quantization?
Quantization stores model weights at lower numeric precision, which shrinks memory use and usually speeds up inference at some cost in output quality. Different quantization levels offer different trade-offs:

- **Q8_0**: 8-bit quantization (minimal quality loss)
- **Q5_K_M**: 5-bit quantization (good balance)
- **Q4_0**: 4-bit quantization (faster, some quality loss)
- **Q2_K**: 2-bit quantization (very fast, noticeable quality loss)
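
As a rough rule of thumb, the weight file size is about the parameter count times the bits per weight, divided by 8. The sketch below applies that formula to an 8B model; the function name and the nominal bit widths are illustrative assumptions, and real files run somewhat larger because K-quants mix precisions and store per-block scale factors.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only size: parameters x bits / 8 bytes, in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Nominal bit widths only; treat the results as lower bounds on file size.
for label, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5), ("Q4_0", 4), ("Q2_K", 2)]:
    print(f"{label:<7} ~{estimate_model_size_gb(8, bits):.1f} GB")
```

Leave extra headroom beyond this figure for the context cache and runtime buffers when deciding whether a model fits in RAM or VRAM.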

In [None]:
# Model size comparison
print("📊 MODEL QUANTIZATION GUIDE")
print("=" * 60)

# Model size data (approximate)
model_sizes = [
    {"name": "Llama 3.1 70B", "fp16": 140, "q8": 75, "q5": 48, "q4": 39, "q2": 24},
    {"name": "Llama 3.1 8B", "fp16": 16, "q8": 8.5, "q5": 5.5, "q4": 4.5, "q2": 3},
    {"name": "Mistral 7B", "fp16": 14, "q8": 7.5, "q5": 4.8, "q4": 3.9, "q2": 2.5},
    {"name": "Phi-3 3.8B", "fp16": 7.6, "q8": 4, "q5": 2.6, "q4": 2.1, "q2": 1.3}
]

print("\n📏 Model Sizes (GB) by Quantization Level:\n")
print(f"{'Model':<15} {'FP16':<8} {'Q8':<8} {'Q5':<8} {'Q4':<8} {'Q2':<8}")
print("-" * 55)

for model in model_sizes:
    print(f"{model['name']:<15} {model['fp16']:<8} {model['q8']:<8} "
          f"{model['q5']:<8} {model['q4']:<8} {model['q2']:<8}")

print("\n💡 Recommendations:")
print("   • Q8: Best quality, use if you have RAM")
print("   • Q5: Great balance for most users")
print("   • Q4: Good for limited RAM systems")
print("   • Q2: Only for very constrained systems")

# Download specific quantization
print("\n📥 To download specific quantizations:")
print("   ollama pull llama3.1:8b-instruct-q4_0")
print("   ollama pull mistral:7b-instruct-q5_K_M")

## 🚀 Step 3: GPU Acceleration Setup

### Maximizing Performance with GPU
Ollama automatically uses GPU if available, but let's check and optimize your setup.
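
One way to confirm where a loaded model actually sits is the server's `/api/ps` endpoint, which lists running models. The sketch below assumes each entry carries `size` and `size_vram` fields (check this against your Ollama version) and that the server listens on the default local address; the helper names are made up for illustration.

```python
import json
import urllib.request

def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0

def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch currently loaded models from the /api/ps endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])

def report_gpu_usage() -> None:
    """Print one line per loaded model; needs a running Ollama server."""
    for entry in loaded_models():
        print(f"{entry.get('name')}: {gpu_fraction(entry):.0%} in VRAM")
```

Call `report_gpu_usage()` shortly after sending a prompt, while the model is still loaded. A value well below 100% suggests part of the model was offloaded to the CPU.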

In [None]:
def check_gpu_support():
    """Check GPU support and provide optimization tips."""
    print("🎮 GPU ACCELERATION CHECK")
    print("=" * 60)
    
    system = platform.system()
    
    # Check for NVIDIA GPU
    try:
        nvidia_result = subprocess.run(['nvidia-smi'], 
                                     capture_output=True, text=True,
                                     shell=(system == 'Windows'))
        
        if nvidia_result.returncode == 0:
            print("✅ NVIDIA GPU detected!")
            print("\n📊 GPU Information:")
            # Parse basic info from nvidia-smi
            lines = nvidia_result.stdout.split('\n')
            for line in lines:
                if 'NVIDIA' in line and 'Driver' in line:
                    print(f"   {line.strip()}")
            
            print("\n⚡ GPU Optimization Tips:")
            print("   • Ollama will automatically use your GPU")
            print("   • For multiple GPUs: Set CUDA_VISIBLE_DEVICES")
            print("   • Monitor GPU usage with 'nvidia-smi -l 1'")
            return True
            
    except OSError:  # nvidia-smi is not installed or not executable
        pass
    
    # Check for AMD GPU (ROCm)
    try:
        if system == "Linux":
            rocm_result = subprocess.run(['rocm-smi'], 
                                       capture_output=True, text=True)
            if rocm_result.returncode == 0:
                print("✅ AMD GPU detected (ROCm)!")
                print("   Ollama supports AMD GPUs on Linux")
                return True
    except OSError:  # rocm-smi is not installed or not executable
        pass
    
    # Check for Apple Silicon
    if system == "Darwin":
        try:
            # Check if running on Apple Silicon
            arch_result = subprocess.run(['uname', '-m'], 
                                       capture_output=True, text=True)
            if 'arm64' in arch_result.stdout:
                print("✅ Apple Silicon detected!")
                print("   • Ollama uses Metal for GPU acceleration")
                print("   • Unified memory architecture benefits")
                print("   • Excellent performance for local models")
                return True
        except OSError:  # uname could not be executed
            pass
    
    print("❌ No GPU acceleration detected")
    print("\n💡 CPU-Only Optimization:")
    print("   • Use smaller models (3B-7B parameters)")
    print("   • Use Q4 or Q5 quantization")
    print("   • Close other applications to free RAM")
    print("   • Consider cloud models for heavy tasks")
    return False

# Check GPU support
has_gpu = check_gpu_support()

# Memory recommendations
print("\n💾 MEMORY RECOMMENDATIONS")
print("=" * 60)
print("\nMinimum RAM for smooth operation:")
print("   • 3B models: 8GB RAM (4GB with Q4)")
print("   • 7B models: 16GB RAM (8GB with Q4)")
print("   • 13B models: 32GB RAM (16GB with Q4)")
print("   • 70B models: 64GB+ RAM (32GB with Q2)")

## 🔧 Step 4: Custom Modelfiles

### Creating Specialized Models
Modelfiles let you create custom versions of models with specific parameters, prompts, and behaviors.
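
If you maintain several variants, generating the Modelfile text from data can be less error-prone than editing strings by hand. This is a minimal sketch: `render_modelfile` is a hypothetical helper, and it only covers the `FROM`, `PARAMETER`, and `SYSTEM` instructions.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    # Triple quotes let the system prompt span several lines in a Modelfile.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"

text = render_modelfile(
    "mistral",
    "You are a careful code reviewer.",
    {"temperature": 0.2, "top_p": 0.9},
)
print(text)
```

Write the result to a file and pass it to `ollama create <name> -f <file>` to build the custom model.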

In [None]:
# Example Modelfile creation
print("🔧 CUSTOM MODELFILE EXAMPLES")
print("=" * 60)

# Create a custom coding assistant modelfile
# Outer string uses ''' so the Modelfile's own """ blocks don't end it early
coding_modelfile = '''
# Coding Assistant Modelfile
FROM mistral

# Set custom parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

# System prompt for coding
SYSTEM """You are an expert programmer with deep knowledge of multiple programming languages.
You provide clean, efficient, and well-commented code.
Always explain your code and suggest best practices."""

# Template for conversations
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
'''

# Save modelfile
with open('coding_assistant.modelfile', 'w') as f:
    f.write(coding_modelfile)

print("📝 Created: coding_assistant.modelfile")
print("\nTo create this model:")
print("   ollama create coding-assistant -f coding_assistant.modelfile")

# Creative writing modelfile
# Outer string uses ''' so the Modelfile's own """ block doesn't end it early
creative_modelfile = '''
# Creative Writing Assistant
FROM llama3.2

# Creative parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0

SYSTEM """You are a creative writing assistant with a vivid imagination.
You help craft engaging stories, develop characters, and create immersive worlds.
Be descriptive, creative, and original in your responses."""
'''

# Fast response modelfile
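# Raw ''' string: the inner """ block no longer ends the string early, and the
# "\n\n" stop sequence stays as typed instead of becoming real line breaks
fast_modelfile = r'''
# Fast Response Assistant
FROM phi3:mini

# Speed-optimized parameters
PARAMETER temperature 0.1
PARAMETER num_predict 150
PARAMETER stop "User:"
PARAMETER stop "\n\n"

SYSTEM """You are a quick-response assistant.
Provide concise, direct answers without unnecessary elaboration."""
'''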

print("\n📚 Key Modelfile Parameters:")
print("   • temperature: Controls randomness (lower = more deterministic)")
print("   • top_p: Nucleus sampling threshold")
print("   • repeat_penalty: Reduces repetition")
print("   • num_predict: Max tokens to generate")
print("   • num_ctx: Context window size")
print("   • num_gpu: Number of layers on GPU")

## 🎛️ Step 5: Performance Tuning

### Optimizing Ollama for Your System
Let's explore environment variables and settings that can dramatically improve performance.
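
When you start the server yourself rather than through a system service, the settings can be passed as the child process's environment. The sketch below assumes `ollama` is on your PATH and that no other Ollama server already occupies the port; the helper names and the example values are illustrative.

```python
import os
import subprocess

def build_ollama_env(overrides: dict) -> dict:
    """Copy the current environment and layer Ollama settings on top."""
    env = dict(os.environ)
    env.update({key: str(value) for key, value in overrides.items()})
    return env

def start_server(overrides: dict) -> subprocess.Popen:
    """Launch `ollama serve` as a child process with the given settings."""
    return subprocess.Popen(["ollama", "serve"], env=build_ollama_env(overrides))

# Example values only; pick numbers that suit your RAM and workload.
settings = {"OLLAMA_NUM_PARALLEL": 2, "OLLAMA_KEEP_ALIVE": "10m"}
env = build_ollama_env(settings)
print({key: env[key] for key in settings})
```

`start_server(settings)` is deliberately not called here, since it would spawn a long-running process. Environment values must be strings, which is why the helper converts them.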

In [None]:
print("🎛️ PERFORMANCE TUNING GUIDE")
print("=" * 60)

# Environment variables for optimization
env_vars = [
    {
        "var": "OLLAMA_NUM_PARALLEL",
        "default": "version-dependent (1 in older releases)",
        "desc": "Number of parallel requests per model",
        "tip": "Set to 2-4 for multi-user scenarios"
    },
    {
        "var": "OLLAMA_MAX_LOADED_MODELS",
        "default": "version-dependent (1 in older releases)",
        "desc": "Max models kept in memory",
        "tip": "Increase if switching between models frequently"
    },
    {
        "var": "OLLAMA_KEEP_ALIVE",
        "default": "5m",
        "desc": "How long to keep models loaded",
        "tip": "Set to '-1' to keep models always loaded"
    },
    {
        "var": "OLLAMA_HOST",
        "default": "127.0.0.1:11434",
        "desc": "API endpoint",
        "tip": "Change for network access"
    }
]

print("\n🔧 Environment Variables:\n")
for var in env_vars:
    print(f"📌 {var['var']}")
    print(f"   Default: {var['default']}")
    print(f"   Purpose: {var['desc']}")
    print(f"   💡 Tip: {var['tip']}")
    print()

# Platform-specific setup
system = platform.system()

print("\n⚙️ Platform-Specific Setup:")
print("=" * 40)

if system == "Windows":
    print("\n🪟 Windows:")
    print("   Set environment variables in System Properties")
    print("   Or use PowerShell:")
    print("   $env:OLLAMA_NUM_PARALLEL = '2'")
    
elif system == "Darwin":
    print("\n🍎 macOS:")
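    print("   For 'ollama serve' started from a shell, add to ~/.zshrc or ~/.bash_profile:")
    print("   export OLLAMA_NUM_PARALLEL=2")
    print("   export OLLAMA_KEEP_ALIVE=-1")
    print("   For the menu bar app: launchctl setenv OLLAMA_NUM_PARALLEL 2, then restart the app")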
    
elif system == "Linux":
    print("\n🐧 Linux:")
    print("   Edit /etc/systemd/system/ollama.service:")
    print("   Environment='OLLAMA_NUM_PARALLEL=2'")
    print("   Then: sudo systemctl daemon-reload && sudo systemctl restart ollama")

# Memory optimization tips
print("\n💾 Memory Optimization:")
print("=" * 40)
print("\n1. Model Loading Strategy:")
print("   • Single model: Keep loaded for speed")
print("   • Multiple models: Use shorter keep_alive")
print("   • Limited RAM: Load on demand only")

print("\n2. Context Window Tuning:")
print("   • Default: 2048 tokens in older releases (newer releases may differ)")
print("   • Reduce for faster responses")
print("   • Increase for longer conversations")
print("   • Set in modelfile: PARAMETER num_ctx 4096")

## 🏃 Step 6: Benchmarking Your Setup

### Measuring Performance
Let's create a simple benchmark to test your system's capabilities.
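
Word counts only approximate token counts, so for a tighter figure you can read the counters the Ollama REST API returns with a non-streaming `/api/generate` call. The sketch below assumes the reply includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), and that the server runs on the default local address; the function names are illustrative.

```python
import json
import urllib.request

def tokens_per_second(result: dict) -> float:
    """Generation throughput from Ollama's own counters (durations are in ns)."""
    duration_ns = result.get("eval_duration", 0)
    return result.get("eval_count", 0) * 1e9 / duration_ns if duration_ns else 0.0

def generate(model: str, prompt: str, base_url: str = "http://localhost:11434") -> dict:
    """Call /api/generate without streaming and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)

# Offline check with made-up numbers: 50 tokens generated in 2 seconds.
sample = {"eval_count": 50, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # prints 25.0 tokens/sec
```

With a server running, `tokens_per_second(generate("phi3:mini", "What is 2+2?"))` measures generation speed alone, excluding model load and prompt processing time.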

In [None]:
from strands import Agent
from strands.models.ollama import OllamaModel  # the Ollama provider lives in its own submodule
import time
import statistics

def benchmark_model(model_name, num_runs=3):
    """Benchmark a model's performance."""
    print(f"\n🏃 Benchmarking {model_name}...")
    
    try:
        # Create agent; host points at the default local Ollama endpoint
        agent = Agent(
            model=OllamaModel(host="http://localhost:11434", model_id=model_name),
            system_prompt="You are a benchmark assistant. Be concise."
        )
        
        # Test prompts
        prompts = [
            "What is 2+2?",
            "Write a haiku about AI.",
            "Explain quantum computing in one sentence."
        ]
        
        times = []
        tokens_per_second = []
        
        for i, prompt in enumerate(prompts):
            print(f"   Test {i+1}: {prompt[:30]}...")
            
            start = time.perf_counter()  # monotonic clock, better suited to timing
            response = agent(prompt)
            end = time.perf_counter()
            
            duration = end - start
            times.append(duration)
            
            # Estimate tokens (rough)
            response_length = len(str(response).split())
            tps = response_length / duration
            tokens_per_second.append(tps)
            
            print(f"      Time: {duration:.2f}s, ~{tps:.1f} tokens/sec")
        
        # Calculate statistics
        avg_time = statistics.mean(times)
        avg_tps = statistics.mean(tokens_per_second)
        
        print(f"\n   📊 Results for {model_name}:")
        print(f"      Average response time: {avg_time:.2f}s")
        print(f"      Average tokens/second: {avg_tps:.1f}")
        
        return avg_time, avg_tps
        
    except Exception as e:
        print(f"   ❌ Error benchmarking {model_name}: {e}")
        return None, None

print("🏃 SYSTEM BENCHMARK")
print("=" * 60)

# Get system info
try:
    import psutil
    print("\n💻 System Information:")
    print(f"   CPU: {psutil.cpu_count()} cores")
    print(f"   RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
    print(f"   Available RAM: {psutil.virtual_memory().available / (1024**3):.1f} GB")
except ImportError:
    print("   Install psutil for system info: pip install psutil")

# List available models for benchmarking
print("\n📋 Available models for benchmarking:")
try:
    result = subprocess.run(['ollama', 'list'], 
                          capture_output=True, text=True,
                          shell=(platform.system() == 'Windows'))
    if result.returncode == 0:
        print(result.stdout)
        
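        print("\n🚀 Starting benchmark (this may take a minute)...")
        
        # Benchmark phi3:mini if installed (it's fast); otherwise the first listed model
        rows = result.stdout.strip().split('\n')[1:]  # skip the header row
        installed = [row.split()[0] for row in rows if row.strip()]
        if "phi3:mini" in installed:
            benchmark_model("phi3:mini")
        elif installed:
            benchmark_model(installed[0])
        else:
            print("   No models installed. Run: ollama pull phi3:mini")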
        
except Exception as e:
    print(f"❌ Benchmark error: {e}")

print("\n💡 Performance Tips Based on Results:")
print("   • <1s response: Excellent for real-time apps")
print("   • 1-3s response: Good for interactive use")
print("   • >3s response: Consider smaller model or GPU")

## 🔨 Step 7: Advanced Troubleshooting

### Common Issues and Solutions
Let's diagnose and fix common Ollama problems.
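
Right after starting or restarting the service, the API can take a moment to come up, so a single probe may report a false failure. A minimal polling sketch using only the standard library follows; the function names, retry count, and delay are arbitrary choices for illustration.

```python
import time
import urllib.request

def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the /api/tags endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, timeouts, and refused connections all derive from OSError
        return False

def wait_for_ollama(base_url: str = "http://localhost:11434",
                    attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll the server a few times, e.g. right after starting the service."""
    for attempt in range(attempts):
        if is_ollama_up(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_for_ollama(attempts=10)` can gate the rest of a startup script so later cells do not run against a server that is still initializing.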

In [None]:
def diagnose_ollama():
    """Comprehensive Ollama diagnostics."""
    print("🔨 OLLAMA DIAGNOSTICS")
    print("=" * 60)
    
    issues_found = []
    
    # 1. Check Ollama version
    print("\n1️⃣ Checking Ollama version...")
    try:
        result = subprocess.run(['ollama', '--version'], 
                              capture_output=True, text=True,
                              shell=(platform.system() == 'Windows'))
        if result.returncode == 0:
            print(f"   ✅ {result.stdout.strip()}")
        else:
            issues_found.append("Ollama not in PATH")
            print("   ❌ Ollama not found in PATH")
    except Exception:
        issues_found.append("Cannot execute ollama command")
        print("   ❌ Cannot execute ollama command")
    
    # 2. Check service status
    print("\n2️⃣ Checking Ollama service...")
    try:
        import requests
        response = requests.get('http://localhost:11434/api/tags', timeout=2)
        if response.status_code == 200:
            print("   ✅ Ollama service is running")
        else:
            issues_found.append("Ollama service not responding")
            print("   ❌ Ollama service not responding")
    except Exception:  # connection errors, timeouts, or requests not installed
        issues_found.append("Cannot connect to Ollama API")
        print("   ❌ Cannot connect to Ollama API (http://localhost:11434)")
    
    # 3. Check models
    print("\n3️⃣ Checking installed models...")
    try:
        result = subprocess.run(['ollama', 'list'], 
                              capture_output=True, text=True,
                              shell=(platform.system() == 'Windows'))
        if result.returncode == 0:
            models = result.stdout.strip().split('\n')
            if len(models) > 1:  # Header + at least one model
                print(f"   ✅ {len(models)-1} models installed")
            else:
                issues_found.append("No models installed")
                print("   ❌ No models installed")
    except Exception:
        issues_found.append("Cannot list models")
        print("   ❌ Cannot list models")
    
    # 4. Check disk space
    print("\n4️⃣ Checking disk space...")
    try:
        import shutil
        total, used, free = shutil.disk_usage("/")
        free_gb = free / (1024**3)
        if free_gb > 10:
            print(f"   ✅ {free_gb:.1f} GB free")
        else:
            issues_found.append(f"Low disk space: {free_gb:.1f} GB")
            print(f"   ⚠️  Low disk space: {free_gb:.1f} GB")
    except OSError:
        print("   ⚠️  Could not check disk space")
    
    # Provide solutions
    if issues_found:
        print("\n\n🔧 SOLUTIONS:")
        print("=" * 40)
        
        for issue in issues_found:
            print(f"\n❌ Issue: {issue}")
            
            if "PATH" in issue or "Cannot execute" in issue:
                print("   Solution: Restart your terminal/IDE after installation")
            elif "service" in issue or "API" in issue:
                print("   Solution: Start Ollama service")
                if platform.system() == "Windows":
                    print("   - Start the Ollama app from the Start menu")
                elif platform.system() == "Darwin":
                    print("   - Open Ollama from Applications")
                else:
                    print("   - Run: sudo systemctl start ollama")
            elif "No models" in issue:
                print("   Solution: Download a model")
                print("   - Run: ollama pull llama3.2")
            elif "disk space" in issue:
                print("   Solution: Free up disk space")
                print("   - Models need 3-10 GB each")
    else:
        print("\n✅ All diagnostics passed! Ollama is ready to use.")
    
    return len(issues_found) == 0

# Run diagnostics
ollama_healthy = diagnose_ollama()

if not ollama_healthy:
    print("\n💡 Fix the issues above before proceeding with advanced configuration.")

## 🌐 Step 8: Network Configuration

### Sharing Your Local AI
Want to access Ollama from other devices or share with teammates? Let's configure network access.
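
Once the server is exposed, a quick check from another device is to list its models over HTTP. The sketch below assumes the `/api/tags` reply has a `models` list whose entries carry a `name` field; the helper names are illustrative, and the host you pass must be the server's LAN address (`0.0.0.0` is a bind address, not something clients connect to).

```python
import json
import urllib.request

def model_names(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [entry["name"] for entry in tags_payload.get("models", [])]

def remote_models(host: str, port: int = 11434, timeout: float = 5.0) -> list:
    """List the models served by an Ollama instance on another machine."""
    with urllib.request.urlopen(f"http://{host}:{port}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))

# Offline check with a hand-written payload shaped like the API's reply.
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "phi3:mini"}]}
print(model_names(sample))
```

If `remote_models("<server-ip>")` times out while the same call works on the server itself, the usual suspects are the `OLLAMA_HOST` binding or a firewall rule.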

In [None]:
print("🌐 NETWORK CONFIGURATION")
print("=" * 60)

print("\n🔧 To enable network access:")
print("\n1. Set OLLAMA_HOST environment variable:")
print("   OLLAMA_HOST=0.0.0.0:11434")

print("\n2. Restart Ollama service")

print("\n3. Access from other devices:")
print("   http://YOUR_IP_ADDRESS:11434")

print("\n⚠️  Security Considerations:")
print("   • Only expose on trusted networks")
print("   • Consider using a reverse proxy")
print("   • Implement authentication if needed")
print("   • Use firewall rules to restrict access")

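# Get local IP address
try:
    import socket
    # gethostbyname(hostname) often resolves to 127.0.0.1, so ask the OS which
    # address it would use for outbound traffic (a UDP connect sends no packets)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))
        local_ip = s.getsockname()[0]
    print(f"\n💻 Your local IP: {local_ip}")
    print(f"   After configuration, access at: http://{local_ip}:11434")
except OSError:
    print("\n💻 Could not determine local IP address")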

## 🎯 Step 9: Production Best Practices

### Building Reliable Local AI Systems
Here are essential practices for production deployments.
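
Two of the reliability practices in this step, graceful error handling and model fallbacks, can be combined in a small wrapper. The sketch below is one possible shape, not a prescribed pattern: `call_with_fallback` and `fake_call` are hypothetical names, and the stand-in function simulates a failure so the example runs without a server.

```python
from typing import Callable, Sequence

def call_with_fallback(models: Sequence[str], call: Callable[[str], str]) -> tuple:
    """Try each model in order; return (model, reply) for the first that succeeds."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # broad on purpose: any failure triggers the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))

# Stand-in for a real inference call; it fails for one model to show the fallback.
def fake_call(model: str) -> str:
    if model == "llama3.1:70b":
        raise TimeoutError("simulated timeout")
    return f"answer from {model}"

print(call_with_fallback(["llama3.1:70b", "phi3:mini"], fake_call))
```

In a real deployment, `call` would wrap your inference request, and the collected error messages would go to your logs so the monitoring practices listed here can pick them up.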

In [None]:
print("🎯 PRODUCTION BEST PRACTICES")
print("=" * 60)

production_tips = {
    "🔒 Security": [
        "Keep Ollama updated regularly",
        "Use firewall rules for network access",
        "Monitor access logs",
        "Run with minimal privileges"
    ],
    "⚡ Performance": [
        "Use GPU acceleration when available",
        "Choose appropriate quantization levels",
        "Monitor memory usage",
        "Implement request queuing"
    ],
    "🛡️ Reliability": [
        "Set up automatic restarts",
        "Monitor service health",
        "Implement graceful error handling",
        "Use model fallbacks"
    ],
    "📊 Monitoring": [
        "Track response times",
        "Monitor token usage",
        "Log errors and warnings",
        "Set up alerts for issues"
    ]
}

for category, tips in production_tips.items():
    print(f"\n{category}")
    for tip in tips:
        print(f"   • {tip}")

print("\n\n📝 Example Production Setup:")
print("="*40)
print("\n1. systemd service with auto-restart")
print("2. Nginx reverse proxy with SSL")
print("3. Prometheus metrics collection")
print("4. Grafana dashboard for monitoring")
print("5. Automated model updates")

## 🎉 Congratulations!

### 🏆 What You've Mastered
In this advanced tutorial, you've learned:
- ✅ Model quantization and optimization
- ✅ GPU acceleration configuration
- ✅ Custom modelfile creation
- ✅ Performance tuning techniques
- ✅ Advanced troubleshooting
- ✅ Production deployment practices

### 🚀 Your Local AI Arsenal

You now have the skills to:
- Optimize models for your hardware
- Create specialized AI assistants
- Diagnose and fix common issues
- Deploy production-ready local AI

### 💡 Advanced Tips

1. **Experiment with Quantization**: Find the sweet spot between quality and speed
2. **Create Model Libraries**: Build a collection of specialized models
3. **Automate Everything**: Scripts for model updates and health checks
4. **Join the Community**: Share your modelfiles and learn from others

### 📚 Further Resources

- [Ollama Documentation](https://ollama.com/docs)
- [Ollama GitHub](https://github.com/ollama/ollama)
- [Model Library](https://ollama.com/library)
- [Community Discord](https://discord.gg/ollama)

### 🌟 Final Thoughts

With these advanced skills, you're ready to build sophisticated local AI systems that are:
- **Fast**: Optimized for your hardware
- **Private**: Your data stays yours
- **Reliable**: Production-ready configurations
- **Flexible**: Custom models for any use case

Keep experimenting, and remember: the best AI is the one that works for YOUR needs!

Happy building with your supercharged local AI setup! 🚀