# Multi-Model LLM Consensus

This notebook demonstrates advanced LLM orchestration with multi-model consensus strategies:

1. Single model queries
2. Multi-model voting
3. Weighted consensus
4. Chain-of-thought reasoning
5. Self-consistency sampling
6. Ensemble confidence analysis
7. Real-world use case: security threat analysis

## Prerequisites

```bash
pip install requests numpy pandas matplotlib seaborn
```

In [None]:
import requests
import json
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any
from collections import Counter

# Configuration
API_KEY = "your-api-key-here"
BASE_URL = "http://localhost:8080"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Styling
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Configuration loaded")

## 1. Single Model Query

Send a prompt to a single model and inspect the response text, token usage, cost, and latency.

In [None]:
def query_llm(prompt: str, model: str = "gpt-4", temperature: float = 0.7, max_tokens: int = 200) -> Dict:
    """Query a single LLM model."""
    request = {
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    
    # A timeout keeps the notebook from hanging if the service is unreachable
    response = requests.post(
        f"{BASE_URL}/api/v1/llm/query",
        headers=headers,
        json=request,
        timeout=60
    )
    
    if response.status_code == 200:
        return response.json()['data']
    else:
        raise RuntimeError(f"Error {response.status_code}: {response.text}")

# Example query
prompt = """Analyze the following cybersecurity threat:

An attacker has gained access to a web server through an unpatched vulnerability.
They have uploaded a reverse shell and are attempting lateral movement.

What are the top 3 immediate actions to take?"""

result = query_llm(prompt, model="gpt-4", temperature=0.3)

print("Single Model Query Result:\n")
print(f"Model: {result['model_used']}")
print(f"Response:\n{result['text']}\n")
print(f"Tokens Used: {result['tokens_used']}")
print(f"Cost: ${result['cost_usd']:.4f}")
print(f"Latency: {result['latency_ms']:.0f}ms")

## 2. Multi-Model Voting

Query multiple models and aggregate responses using majority voting.
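
To make the aggregation step concrete, here is a minimal local sketch of majority voting over short answers. It is an illustration only: `normalize` and `majority_vote` are hypothetical helpers, and the notebook does not document how the `/api/v1/llm/consensus` endpoint actually normalizes or counts responses.

```python
from collections import Counter
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Strip whitespace and trailing punctuation, then lowercase (assumed normalization)."""
    return text.strip().strip(".!?").strip().lower()


def majority_vote(responses: List[str]) -> Tuple[str, float, Dict[str, int]]:
    """Return (winning answer, agreement rate, vote distribution)."""
    if not responses:
        raise ValueError("responses must not be empty")
    votes = Counter(normalize(r) for r in responses)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(responses), dict(votes)


answers = ["Canberra", "canberra.", "Sydney", " Canberra "]
winner, agreement, distribution = majority_vote(answers)
print(winner, f"{agreement:.0%}", distribution)
# canberra 75% {'canberra': 3, 'sydney': 1}
```

Normalizing before counting matters because models often differ only in casing or punctuation; without it, every response could count as a distinct vote.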

In [None]:
# Multi-model consensus query
consensus_request = {
    "prompt": """Question: What is the capital of Australia?
    
Provide only the city name, no explanation.""",
    "models": [
        {"name": "gpt-4", "weight": 1.0},
        {"name": "gpt-3.5-turbo", "weight": 0.8},
        {"name": "claude-3", "weight": 1.0},
        {"name": "palm-2", "weight": 0.9}
    ],
    "strategy": "majority_vote",
    "temperature": 0.1,
    "max_tokens": 50
}

response = requests.post(
    f"{BASE_URL}/api/v1/llm/consensus",
    headers=headers,
    json=consensus_request
)

if response.status_code == 200:
    result = response.json()['data']
    
    print("Multi-Model Consensus Results:\n")
    print(f"Strategy: {result['strategy']}")
    print(f"Consensus Answer: {result['consensus_text']}")
    print(f"Confidence: {result['confidence']:.2%}\n")
    
    print("Individual Model Responses:")
    print(f"{'Model':<20} {'Response':<30} {'Latency (ms)':<15} {'Cost ($)'}")
    print("=" * 85)
    
    for resp in result['individual_responses']:
        print(f"{resp['model']:<20} {resp['text']:<30} {resp['latency_ms']:>10.0f}     {resp['cost_usd']:>8.4f}")
    
    print(f"\n{'='*85}")
    print(f"Total Cost: ${result['total_cost_usd']:.4f}")
    print(f"Total Time: {result['total_time_ms']:.0f}ms")
    print(f"Agreement Rate: {result['agreement_rate']:.1%}")
    
    # Visualize consensus
    if 'vote_distribution' in result:
        votes = result['vote_distribution']
        plt.figure(figsize=(10, 5))
        # Convert dict views to lists so the call works across matplotlib versions
        plt.bar(list(votes.keys()), list(votes.values()), color='steelblue', edgecolor='black')
        plt.xlabel('Response')
        plt.ylabel('Votes')
        plt.title('Multi-Model Vote Distribution')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
else:
    print(f"Error: {response.status_code}")
    print(response.text)

## 3. Weighted Consensus

Use weighted voting based on model reliability for specific task types.
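
As a rough illustration of how weights change the outcome, the sketch below sums weights per distinct answer and picks the largest total. It assumes short, categorical answers; `weighted_vote` is a hypothetical helper, and how the service combines free-form outputs such as code under `weighted_average` is not specified in this notebook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(responses: List[Dict]) -> Tuple[str, float]:
    """Pick the answer with the largest total weight.

    Each item is {"text": str, "weight": float}. The returned confidence
    is the winner's share of the total weight.
    """
    totals: Dict[str, float] = defaultdict(float)
    for item in responses:
        totals[item["text"].strip().lower()] += item["weight"]
    total_weight = sum(totals.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight


# Illustrative labels only; the weights mirror the request in this section
responses = [
    {"text": "expand", "weight": 1.5},  # gpt-4
    {"text": "expand", "weight": 1.3},  # claude-3
    {"text": "dp", "weight": 2.0},      # codex
    {"text": "dp", "weight": 1.0},      # palm-2
]
winner, confidence = weighted_vote(responses)
print(winner, f"{confidence:.1%}")
```

Here an unweighted vote would tie 2 to 2; the weights break the tie in favor of the answer backed by the more heavily weighted models.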

In [None]:
# Technical/coding question with model specialization
coding_prompt = """Write a Python function to find the longest palindromic substring.
Provide only the function definition, no explanation."""

weighted_request = {
    "prompt": coding_prompt,
    "models": [
        {"name": "gpt-4", "weight": 1.5},          # Strong at coding
        {"name": "claude-3", "weight": 1.3},       # Strong at reasoning
        {"name": "codex", "weight": 2.0},          # Specialized for code
        {"name": "palm-2", "weight": 1.0}          # General purpose
    ],
    "strategy": "weighted_average",
    "temperature": 0.2,
    "max_tokens": 300
}

response = requests.post(
    f"{BASE_URL}/api/v1/llm/consensus",
    headers=headers,
    json=weighted_request
)

if response.status_code == 200:
    result = response.json()['data']
    
    print("Weighted Consensus Results:\n")
    print(f"Consensus Response:\n{result['consensus_text']}\n")
    print(f"Weighted Confidence: {result['confidence']:.2%}")
    print(f"Total Cost: ${result['total_cost_usd']:.4f}")
    print(f"Total Time: {result['total_time_ms']:.0f}ms\n")
    
    # Visualize model weights and contributions
    models = [r['model'] for r in result['individual_responses']]
    weights = [r['weight'] for r in result['individual_responses']]
    contributions = [r['contribution'] for r in result['individual_responses']]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Model weights
    ax1.barh(models, weights, color='skyblue', edgecolor='black')
    ax1.set_xlabel('Weight')
    ax1.set_title('Model Weights')
    ax1.grid(axis='x', alpha=0.3)
    
    # Contributions to consensus
    ax2.barh(models, contributions, color='lightcoral', edgecolor='black')
    ax2.set_xlabel('Contribution to Consensus')
    ax2.set_title('Model Contributions')
    ax2.grid(axis='x', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print(f"Error: {response.status_code}")
    print(response.text)

## 4. Chain-of-Thought Reasoning

Use chain-of-thought prompting for complex reasoning tasks.
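
The orbital-velocity problem used in this section has a closed-form answer, so the models' step-by-step reasoning can be checked against a reference value. The sketch below assumes the standard circular-orbit relation v = sqrt(G·M/r) and a gravitational constant of 6.674e-11 m³ kg⁻¹ s⁻², which the prompt itself does not state; `circular_orbit_velocity_km_s` is a hypothetical helper name.

```python
import math

G = 6.674e-11            # gravitational constant in m^3 kg^-1 s^-2 (assumed, not in the prompt)
EARTH_MASS = 5.97e24     # kg, as given in the prompt
EARTH_RADIUS_KM = 6371   # km, as given in the prompt


def circular_orbit_velocity_km_s(altitude_km: float) -> float:
    """Speed of a circular orbit at the given altitude: v = sqrt(G * M / r)."""
    r_m = (EARTH_RADIUS_KM + altitude_km) * 1000.0  # orbital radius in metres
    return math.sqrt(G * EARTH_MASS / r_m) / 1000.0  # convert m/s to km/s


reference = circular_orbit_velocity_km_s(500)
print(f"Reference orbital velocity: {reference:.2f} km/s")  # roughly 7.6 km/s
```

A model whose final figure differs substantially from this value has most likely made a unit or radius error, for example using the altitude instead of the distance from Earth's center.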

In [None]:
# Complex reasoning problem
cot_prompt = """Problem: A satellite is in a circular orbit 500km above Earth's surface.
Earth's radius is 6,371km and mass is 5.97 × 10^24 kg.
Calculate the orbital velocity in km/s.

Think step-by-step and show your reasoning."""

cot_request = {
    "prompt": cot_prompt,
    "models": [
        {"name": "gpt-4", "weight": 1.0},
        {"name": "claude-3", "weight": 1.0}
    ],
    "strategy": "chain_of_thought",
    "temperature": 0.3,
    "max_tokens": 500
}

response = requests.post(
    f"{BASE_URL}/api/v1/llm/consensus",
    headers=headers,
    json=cot_request
)

if response.status_code == 200:
    result = response.json()['data']
    
    print("Chain-of-Thought Consensus:\n")
    print(f"{'='*80}")
    print(result['consensus_text'])
    print(f"{'='*80}\n")
    
    print(f"Reasoning Quality Score: {result['reasoning_quality']:.2f}/10")
    print(f"Confidence: {result['confidence']:.2%}")
    print(f"\nIndividual Model Reasoning Paths:")
    
    for i, resp in enumerate(result['individual_responses'], 1):
        print(f"\n--- Model {i}: {resp['model']} ---")
        print((resp['text'][:300] + "...") if len(resp['text']) > 300 else resp['text'])
else:
    print(f"Error: {response.status_code}")
    print(response.text)

## 5. Self-Consistency Sampling

Sample several responses from a single model at a higher temperature, then select the answer that the samples agree on most.
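
One way to pick the "most consistent" free-text sample is to choose the one that is, on average, most similar to all the others. The sketch below does this with `difflib.SequenceMatcher` from the standard library. It is an assumption for illustration: `most_consistent` is a hypothetical helper, and the similarity metric behind the service's `similarity_to_consensus` field is not documented here.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_consistent(samples: List[str]) -> Dict:
    """Select the sample with the highest mean similarity to the other samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if len(samples) == 1:
        return {"consensus_text": samples[0], "similarities": [1.0]}

    def mean_similarity(i: int) -> float:
        others = [similarity(samples[i], s) for j, s in enumerate(samples) if j != i]
        return sum(others) / len(others)

    best = max(range(len(samples)), key=mean_similarity)
    return {
        "consensus_text": samples[best],
        "similarities": [similarity(samples[best], s) for s in samples],
    }


samples = [
    "Fill the 5, pour into the 3, empty the 3, pour the 2 into the 3, fill the 5, top up the 3.",
    "Fill the 5, pour into the 3, empty the 3, pour the remaining 2 into the 3, fill the 5, top up the 3.",
    "Fill the 3 and pour it into the 5 twice, leaving 1 in the 3; empty the 5 and repeat.",
]
picked = most_consistent(samples)
print(picked["consensus_text"])
print([f"{s:.2f}" for s in picked["similarities"]])
```

Character-level similarity is a crude proxy. For answers that end in a single value, extracting that value and taking a majority vote over the samples is usually more reliable.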

In [None]:
# Question with multiple valid reasoning paths
sc_prompt = """Question: If you have a 3-gallon jug and a 5-gallon jug,
how can you measure exactly 4 gallons of water?

Provide your answer as a sequence of steps."""

sc_request = {
    "prompt": sc_prompt,
    "model": "gpt-4",
    "strategy": "self_consistency",
    "num_samples": 5,
    "temperature": 0.8,  # Higher temperature for diversity
    "max_tokens": 300
}

response = requests.post(
    f"{BASE_URL}/api/v1/llm/consensus",
    headers=headers,
    json=sc_request
)

if response.status_code == 200:
    result = response.json()['data']
    
    print("Self-Consistency Sampling Results:\n")
    print(f"Most Consistent Answer:\n{result['consensus_text']}\n")
    print(f"Consistency Score: {result['consistency_score']:.2%}")
    print(f"Number of Unique Approaches: {result['num_unique_answers']}")
    print(f"\nAll Generated Samples:\n")
    
    for i, sample in enumerate(result['samples'], 1):
        print(f"{'='*80}")
        print(f"Sample {i} (Similarity: {sample['similarity_to_consensus']:.2%}):")
        print(f"{'='*80}")
        print(sample['text'])
        print()
    
    # Visualize consistency
    similarities = [s['similarity_to_consensus'] for s in result['samples']]
    
    plt.figure(figsize=(10, 5))
    plt.bar(range(1, len(similarities)+1), similarities, color='mediumseagreen', edgecolor='black')
    plt.axhline(y=np.mean(similarities), color='red', linestyle='--', 
                label=f'Mean: {np.mean(similarities):.2%}')
    plt.xlabel('Sample Number')
    plt.ylabel('Similarity to Consensus')
    plt.title('Self-Consistency Sample Similarity')
    plt.ylim(0, 1)
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print(f"Error: {response.status_code}")
    print(response.text)

## 6. Ensemble Confidence Analysis

Analyze confidence and agreement across multiple models and strategies.
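
For intuition about what agreement and confidence can measure, the sketch below derives both from a list of short answers: agreement as the share of answers matching the most common one, and confidence as one minus the normalized Shannon entropy of the vote distribution. These definitions are assumptions chosen for illustration; `ensemble_metrics` is a hypothetical helper, and the service's own `agreement_rate` and `confidence` fields may be computed differently.

```python
import math
from collections import Counter
from typing import Dict, List


def ensemble_metrics(answers: List[str]) -> Dict[str, float]:
    """Agreement rate and an entropy-based confidence for a list of short answers."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    agreement = counts.most_common(1)[0][1] / n

    # Shannon entropy of the vote distribution, normalized by its maximum, log(n)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_entropy = math.log(n) if n > 1 else 1.0
    confidence = 1.0 - entropy / max_entropy
    return {"agreement": agreement, "confidence": confidence}


unanimous = ensemble_metrics(["O(log n)", "O(log n)", "O(log n)"])
split = ensemble_metrics(["O(log n)", "O(n)", "O(log n)"])
scattered = ensemble_metrics(["O(log n)", "O(n)", "O(1)"])
print(unanimous, split, scattered, sep="\n")
```

Full agreement yields a confidence of 1, and a completely scattered vote yields a value near 0. Agreement alone cannot distinguish a 2-1 split from a 2-1-1-1 split as clearly as the entropy term does.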

In [None]:
# Comparative analysis across strategies
def compare_strategies(prompt: str) -> pd.DataFrame:
    """Compare different consensus strategies on the same prompt."""
    strategies = ['majority_vote', 'weighted_average', 'best_of_n', 'unanimous']
    results = []
    
    for strategy in strategies:
        request = {
            "prompt": prompt,
            "models": [
                {"name": "gpt-4", "weight": 1.0},
                {"name": "gpt-3.5-turbo", "weight": 0.8},
                {"name": "claude-3", "weight": 1.0}
            ],
            "strategy": strategy,
            "temperature": 0.3,
            "max_tokens": 100
        }
        
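        try:
            response = requests.post(
                f"{BASE_URL}/api/v1/llm/consensus",
                headers=headers,
                json=request,
                timeout=60
            )
            
            if response.status_code == 200:
                data = response.json()['data']
                answer = data['consensus_text']
                results.append({
                    'Strategy': strategy,
                    'Confidence': data['confidence'],
                    'Agreement': data.get('agreement_rate', 0),
                    'Cost': data['total_cost_usd'],
                    'Time (ms)': data['total_time_ms'],
                    # Add an ellipsis only when the answer is actually truncated
                    'Answer': (answer[:50] + '...') if len(answer) > 50 else answer
                })
            else:
                # Report failed strategies instead of dropping them silently
                print(f"Error with {strategy}: HTTP {response.status_code}")
        except Exception as e:
            print(f"Error with {strategy}: {e}")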
    
    return pd.DataFrame(results)

# Test prompt
test_prompt = """What is the time complexity of binary search?
Provide answer as Big-O notation only."""

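comparison_df = compare_strategies(test_prompt)

# Stop here if every request failed; the plots below need at least one row
if comparison_df.empty:
    raise RuntimeError("No strategy returned a result; check BASE_URL and API_KEY")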

print("Strategy Comparison:\n")
print(comparison_df.to_string(index=False))
print()

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Confidence
axes[0, 0].bar(comparison_df['Strategy'], comparison_df['Confidence'], 
               color='skyblue', edgecolor='black')
axes[0, 0].set_ylabel('Confidence')
axes[0, 0].set_title('Confidence by Strategy')
axes[0, 0].set_ylim(0, 1)
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(axis='y', alpha=0.3)

# Agreement
axes[0, 1].bar(comparison_df['Strategy'], comparison_df['Agreement'], 
               color='lightcoral', edgecolor='black')
axes[0, 1].set_ylabel('Agreement Rate')
axes[0, 1].set_title('Agreement by Strategy')
axes[0, 1].set_ylim(0, 1)
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(axis='y', alpha=0.3)

# Cost
axes[1, 0].bar(comparison_df['Strategy'], comparison_df['Cost'], 
               color='lightgreen', edgecolor='black')
axes[1, 0].set_ylabel('Cost (USD)')
axes[1, 0].set_title('Cost by Strategy')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(axis='y', alpha=0.3)

# Latency
axes[1, 1].bar(comparison_df['Strategy'], comparison_df['Time (ms)'], 
               color='plum', edgecolor='black')
axes[1, 1].set_ylabel('Latency (ms)')
axes[1, 1].set_title('Latency by Strategy')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Recommendations
print("\nStrategy Recommendations:")
print("=" * 60)
best_confidence = comparison_df.loc[comparison_df['Confidence'].idxmax()]
print(f"Highest Confidence: {best_confidence['Strategy']} ({best_confidence['Confidence']:.2%})")

best_agreement = comparison_df.loc[comparison_df['Agreement'].idxmax()]
print(f"Best Agreement: {best_agreement['Strategy']} ({best_agreement['Agreement']:.2%})")

lowest_cost = comparison_df.loc[comparison_df['Cost'].idxmin()]
print(f"Lowest Cost: {lowest_cost['Strategy']} (${lowest_cost['Cost']:.4f})")

fastest = comparison_df.loc[comparison_df['Time (ms)'].idxmin()]
print(f"Fastest: {fastest['Strategy']} ({fastest['Time (ms)']:.0f}ms)")

## 7. Real-World Use Case: Security Threat Analysis

Apply multi-model consensus to analyze security threats.

In [None]:
# Complex security scenario
security_prompt = """Analyze this security incident:

Logs show:
- Multiple failed SSH login attempts from 203.0.113.42
- Successful login after 127 attempts
- Immediate privilege escalation attempt
- Creation of new user account 'sysbackup'
- Download of 2.3GB of data from /var/www/
- Connection to external IP 198.51.100.15:4444

Provide:
1. Threat classification (severity 1-10)
2. Attack vector analysis
3. Top 3 immediate response actions
4. Top 3 preventive measures for future"""

security_request = {
    "prompt": security_prompt,
    "models": [
        {"name": "gpt-4", "weight": 1.5},      # Strong reasoning
        {"name": "claude-3", "weight": 1.5},   # Strong at analysis
        {"name": "palm-2", "weight": 1.0}      # Diverse perspective
    ],
    "strategy": "weighted_average",
    "temperature": 0.2,  # Low temp for factual analysis
    "max_tokens": 600
}

response = requests.post(
    f"{BASE_URL}/api/v1/llm/consensus",
    headers=headers,
    json=security_request
)

if response.status_code == 200:
    result = response.json()['data']
    
    print("Security Threat Analysis - Multi-Model Consensus\n")
    print("=" * 80)
    print(result['consensus_text'])
    print("=" * 80)
    print(f"\nAnalysis Confidence: {result['confidence']:.2%}")
    print(f"Model Agreement: {result['agreement_rate']:.1%}")
    print(f"Total Analysis Time: {result['total_time_ms']:.0f}ms")
    print(f"Total Cost: ${result['total_cost_usd']:.4f}")
    
    # Show model-by-model comparison
    print("\n" + "="*80)
    print("Individual Model Assessments:")
    print("="*80)
    
    for resp in result['individual_responses']:
        print(f"\n{resp['model']} (Weight: {resp['weight']}, Cost: ${resp['cost_usd']:.4f}):")
        print("-" * 80)
        print((resp['text'][:400] + "...") if len(resp['text']) > 400 else resp['text'])
else:
    print(f"Error: {response.status_code}")
    print(response.text)

## Summary

This tutorial demonstrated advanced LLM orchestration:

1. ✓ Single model queries with performance metrics
2. ✓ Multi-model voting for improved accuracy
3. ✓ Weighted consensus based on task-specific expertise
4. ✓ Chain-of-thought reasoning for complex problems
5. ✓ Self-consistency sampling for robust answers
6. ✓ Ensemble confidence analysis across strategies
7. ✓ Real-world security threat analysis

## Key Insights

**When to use each strategy:**
- **Majority Vote**: Simple questions with clear correct answers
- **Weighted Average**: When models have known strengths for task types
- **Chain-of-Thought**: Complex reasoning or mathematical problems
- **Self-Consistency**: Problems with multiple valid solution paths
- **Best-of-N**: When answer quality matters more than consensus
- **Unanimous**: Critical decisions requiring high certainty
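
The guidance above can be encoded as a small lookup so that calling code picks a strategy from a task label. This is a sketch only: the task labels and `choose_strategy` are hypothetical, while the strategy strings are the ones used in the requests in this notebook.

```python
from typing import Dict

# Hypothetical task labels mapped to the strategy names used in this notebook
STRATEGY_BY_TASK: Dict[str, str] = {
    "factual": "majority_vote",
    "specialized": "weighted_average",
    "reasoning": "chain_of_thought",
    "multi_path": "self_consistency",
    "quality": "best_of_n",
    "critical": "unanimous",
}


def choose_strategy(task_type: str, default: str = "majority_vote") -> str:
    """Return the strategy for a task label, falling back to a default."""
    return STRATEGY_BY_TASK.get(task_type, default)


print(choose_strategy("reasoning"))  # chain_of_thought
print(choose_strategy("unknown"))    # majority_vote
```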

## Next Steps

- Explore streaming responses with `/api/v1/llm/stream`
- See `04_pixel_processing.ipynb` for computer vision tasks
- Check [LLM Documentation](../docs/API.md#llm-endpoints) for all options
- Review cost optimization strategies in the [Architecture Guide](../docs/ARCHITECTURE.md)