# Session 6: Knowledge Scaling Pathology
## The Confidence-Competence Gap

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_06_failure_mode_scaling/notebook.ipynb)

---

**Learning Objectives:**
1. Identify tasks exhibiting scaling pathology
2. Calculate the confidence-competence gap metric
3. Recognize when larger models won't help
4. Recommend appropriate model sizes by task type

## Setup

In [None]:
!pip install -q anthropic numpy pandas matplotlib seaborn scipy

import anthropic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from typing import List, Dict
import time

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

try:
    from google.colab import userdata
    api_key = userdata.get('ANTHROPIC_API_KEY')
except Exception:
    # Not in Colab (or the secret is unavailable): fall back to the environment
    import os
    api_key = os.environ.get('ANTHROPIC_API_KEY')

client = anthropic.Anthropic(api_key=api_key)
print("Setup complete!")

## Part 1: The Scaling Paradox

Conventional wisdom: Bigger models = better performance.

Reality: On certain tasks, bigger models fail more confidently.

In [None]:
# Simulated scaling data based on real experiments
# Task: Temporal constraint reasoning

scaling_data = {
    "model_size_B": [1.0, 2.7, 3.8, 7.0, 8.0, 13.0, 70.0],
    "loss": [2.8, 2.4, 2.1, 1.8, 1.6, 1.4, 1.0],  # Lower = better (perplexity-like)
    "accuracy": [45, 50, 62.5, 50, 75, 62.5, 65],  # Temporal task accuracy
    "confidence": [60, 70, 75, 80, 85, 90, 95],  # Model's stated confidence
}

df_scaling = pd.DataFrame(scaling_data)

# Visualize the paradox
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Loss vs Size (looks good!)
axes[0].plot(df_scaling['model_size_B'], df_scaling['loss'], 'b-o', linewidth=2)
axes[0].set_xlabel('Model Size (B parameters)')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss vs Model Size\n(Looks great!)')
axes[0].set_xscale('log')

# Accuracy vs Size (not so good...)
axes[1].plot(df_scaling['model_size_B'], df_scaling['accuracy'], 'r-o', linewidth=2)
axes[1].axhline(50, color='gray', linestyle='--', label='Random chance')
axes[1].set_xlabel('Model Size (B parameters)')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Accuracy vs Model Size\n(No steady improvement!)')
axes[1].set_xscale('log')
axes[1].legend()

# Confidence vs Size (problematic!)
axes[2].plot(df_scaling['model_size_B'], df_scaling['confidence'], 'g-o', linewidth=2, label='Confidence')
axes[2].plot(df_scaling['model_size_B'], df_scaling['accuracy'], 'r--o', alpha=0.5, label='Accuracy')
axes[2].set_xlabel('Model Size (B parameters)')
axes[2].set_ylabel('Percentage')
axes[2].set_title('Confidence vs Accuracy\n(Growing gap!)')
axes[2].set_xscale('log')
axes[2].legend()  # Uses the label set on each line

plt.tight_layout()
plt.show()

print("\nKey Observation: Loss improves by 64% while accuracy is noisy and shows no steady gain!")
print("Confidence rises from 60% to 95% while accuracy fluctuates between 45% and 75%.")
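
The growing gap in the third panel can be made explicit by subtracting accuracy from stated confidence at each model size. The following is a minimal sketch that reuses the simulated values from the cell above; the `df_gap` frame and the `gap_pp` column are names introduced here for illustration only.

```python
import pandas as pd

# Same simulated values as the scaling_data cell above
df_gap = pd.DataFrame({
    "model_size_B": [1.0, 2.7, 3.8, 7.0, 8.0, 13.0, 70.0],
    "accuracy": [45, 50, 62.5, 50, 75, 62.5, 65],
    "confidence": [60, 70, 75, 80, 85, 90, 95],
})

# Overconfidence in percentage points: stated confidence minus measured accuracy
df_gap["gap_pp"] = df_gap["confidence"] - df_gap["accuracy"]

print(df_gap.to_string(index=False))
print(f"Gap at smallest model: {df_gap['gap_pp'].iloc[0]:.1f} pp")
print(f"Gap at largest model:  {df_gap['gap_pp'].iloc[-1]:.1f} pp")
```

On this simulated data the gap is noisy rather than monotonic, but it doubles between the smallest and largest model (15 to 30 points).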

## Part 2: Calculating the Confidence-Competence Gap

In [None]:
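def calculate_ccg(sizes: List[float], losses: List[float], accuracies: List[float]) -> Dict:
    """Calculate the Confidence-Competence Gap (CCG).
    
    CCG = (∂L/∂ln N) / (∂A/∂ln N)
    
    Both slopes come from linear fits against log model size. When loss
    falls and accuracy rises, CCG is negative, and the flatter the accuracy
    curve, the more negative it becomes. A highly negative CCG therefore
    indicates pathological scaling: loss improves while accuracy barely moves.
    Accuracy that declines while loss improves is flagged separately, because
    the ratio of two negative slopes is positive.
    
    Note: the ratio depends on the units of loss and accuracy, so the
    thresholds below should be calibrated against your own measurements.
    """
    
    # Estimate both derivatives with linear regression in log space
    log_sizes = np.log(sizes)
    
    # Loss vs log-size slope (negative = improving)
    loss_slope = stats.linregress(log_sizes, losses).slope
    
    # Accuracy vs log-size slope (positive = improving)
    acc_slope = stats.linregress(log_sizes, accuracies).slope
    
    # CCG calculation
    if abs(acc_slope) < 0.01:  # Near-zero accuracy change: avoid dividing by ~0
        ccg = float('-inf') if loss_slope < 0 else float('inf')
        if loss_slope < 0:
            interpretation = "Pathological: Loss improves but accuracy flat"
        else:
            interpretation = "Inconclusive: Neither loss nor accuracy improves"
    else:
        ccg = loss_slope / acc_slope
        if loss_slope < 0 and acc_slope < 0:
            # Ratio is positive here, yet this is the worst case
            interpretation = "Severe pathology: Loss improves while accuracy declines"
        elif ccg < -10:
            interpretation = "Severe pathology: Confident failures"
        elif ccg < 0:
            interpretation = "Mild pathology: Some scaling issues"
        else:
            interpretation = "Healthy: Scaling helps"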
    
    return {
        "ccg": ccg,
        "loss_slope": loss_slope,
        "accuracy_slope": acc_slope,
        "interpretation": interpretation,
        "loss_improvement_pct": (losses[0] - losses[-1]) / losses[0] * 100,
        "accuracy_change_pp": accuracies[-1] - accuracies[0]
    }


# Calculate CCG for temporal task
temporal_ccg = calculate_ccg(
    df_scaling['model_size_B'].tolist(),
    df_scaling['loss'].tolist(),
    df_scaling['accuracy'].tolist()
)

print("=" * 60)
print("CONFIDENCE-COMPETENCE GAP ANALYSIS")
print("=" * 60)
print(f"\nTask: Temporal Constraint Reasoning")
print(f"\nCCG Value: {temporal_ccg['ccg']:.2f}")
print(f"Loss slope: {temporal_ccg['loss_slope']:.4f} (negative = improving)")
print(f"Accuracy slope: {temporal_ccg['accuracy_slope']:.4f} (positive = improving)")
print(f"\nLoss improved: {temporal_ccg['loss_improvement_pct']:.1f}%")
print(f"Accuracy changed: {temporal_ccg['accuracy_change_pp']:.1f} percentage points")
print(f"\nInterpretation: {temporal_ccg['interpretation']}")
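
With only seven noisy data points, it is worth asking whether each fitted slope is distinguishable from zero before interpreting their ratio. The sketch below reuses the simulated temporal-reasoning values and reads the slope, standard error, and two-sided p-value from `scipy.stats.linregress`. The 0.05 cutoff (`ALPHA`) is a conventional choice assumed here, not something this session prescribes.

```python
import numpy as np
from scipy import stats

# Same simulated temporal-reasoning values as in Part 1
sizes = [1.0, 2.7, 3.8, 7.0, 8.0, 13.0, 70.0]
losses = [2.8, 2.4, 2.1, 1.8, 1.6, 1.4, 1.0]
accuracies = [45, 50, 62.5, 50, 75, 62.5, 65]

log_sizes = np.log(sizes)
loss_fit = stats.linregress(log_sizes, losses)
acc_fit = stats.linregress(log_sizes, accuracies)

ALPHA = 0.05  # assumed significance level

for name, fit in [("loss", loss_fit), ("accuracy", acc_fit)]:
    if fit.pvalue < ALPHA:
        verdict = "distinguishable from zero"
    else:
        verdict = "not distinguishable from zero"
    print(f"{name:>8}: slope={fit.slope:+.3f} (stderr {fit.stderr:.3f}), "
          f"p={fit.pvalue:.3f} -> {verdict}")
```

On the simulated data, the loss slope is clearly negative, while the accuracy slope, though positive, cannot be separated from zero at this sample size. That is the pattern the CCG is meant to flag.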

## Part 3: Comparing Task Types

In [None]:
# Compare scaling behavior across task types

task_scaling = {
    "Pattern Completion": {
        "sizes": [1, 3, 7, 13, 70],
        "losses": [3.0, 2.5, 2.0, 1.5, 1.0],
        "accuracies": [60, 70, 80, 88, 95]
    },
    "Knowledge Retrieval": {
        "sizes": [1, 3, 7, 13, 70],
        "losses": [2.8, 2.3, 1.9, 1.5, 1.1],
        "accuracies": [40, 55, 65, 72, 80]
    },
    "Temporal Reasoning": {
        "sizes": [1, 3, 7, 13, 70],
        "losses": [2.8, 2.4, 2.0, 1.5, 1.0],
        "accuracies": [45, 50, 55, 52, 55]
    },
    "Compositional": {
        "sizes": [1, 3, 7, 13, 70],
        "losses": [2.9, 2.5, 2.1, 1.6, 1.1],
        "accuracies": [30, 35, 40, 38, 42]
    }
}

# Calculate CCG for each task
results = []
for task_name, data in task_scaling.items():
    ccg_result = calculate_ccg(data["sizes"], data["losses"], data["accuracies"])
    results.append({
        "task": task_name,
        # Cap the value at ±100 so infinite CCGs can be tabulated and plotted
        "ccg": float(np.clip(ccg_result["ccg"], -100, 100)),
        "loss_improvement": ccg_result["loss_improvement_pct"],
        "accuracy_change": ccg_result["accuracy_change_pp"],
        "interpretation": ccg_result["interpretation"]
    })

df_results = pd.DataFrame(results)

print("=" * 70)
print("SCALING BEHAVIOR BY TASK TYPE")
print("=" * 70)
print(df_results.to_string(index=False))

In [None]:
# Visualize scaling by task type
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

colors = ['green', 'blue', 'red', 'orange']

for i, (task_name, data) in enumerate(task_scaling.items()):
    ax = axes[i // 2, i % 2]
    
    # Plot accuracy; pass the color by name, because a one-letter format
    # code fails for 'orange' ('o' is a marker code, not a color)
    ax.plot(data["sizes"], data["accuracies"], '-o', color=colors[i], linewidth=2, markersize=8)
    ax.axhline(50, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Model Size (B parameters)')
    ax.set_ylabel('Accuracy (%)')
    ax.set_title(f'{task_name}\nAccuracy vs Scale')
    ax.set_xscale('log')
    ax.set_ylim(0, 100)
    
    # Annotate with CCG
    ccg_val = df_results[df_results['task'] == task_name]['ccg'].values[0]
    ax.annotate(f'CCG: {ccg_val:.1f}', xy=(0.05, 0.95), xycoords='axes fraction',
                fontsize=10, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()
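
A slope in log space is easier to reason about when expressed as accuracy gained per doubling of model size, which is the fitted slope multiplied by ln 2. The following sketch applies that conversion to the simulated task accuracies above; the helper name `gain_per_doubling` is hypothetical and not part of the notebook.

```python
import numpy as np
from scipy import stats

# Simulated accuracies from the task_scaling cell (same sizes for every task)
sizes = [1, 3, 7, 13, 70]
task_accuracies = {
    "Pattern Completion": [60, 70, 80, 88, 95],
    "Knowledge Retrieval": [40, 55, 65, 72, 80],
    "Temporal Reasoning": [45, 50, 55, 52, 55],
    "Compositional": [30, 35, 40, 38, 42],
}


def gain_per_doubling(sizes, accuracies):
    """Percentage points of accuracy gained each time model size doubles."""
    slope = stats.linregress(np.log(sizes), accuracies).slope  # pp per unit of ln(N)
    return slope * np.log(2)


gains = {task: gain_per_doubling(sizes, acc) for task, acc in task_accuracies.items()}
for task, gain in gains.items():
    print(f"{task:<20} {gain:+.2f} pp per doubling")
```

On these simulated numbers, the two well-scaling tasks gain roughly 6 points per doubling, while the temporal and compositional tasks gain under 2.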

## Part 4: Model Size Recommendations

In [None]:
def recommend_model_size(task_type: str, ccg: float, accuracy_change: float) -> Dict:
    """Recommend model size strategy based on scaling analysis."""
    
    if accuracy_change > 20 and ccg > -5:
        return {
            "recommendation": "Scale up",
            "reasoning": "Scaling provides meaningful accuracy improvements",
            "suggested_size": "Largest available within budget",
            "hybrid_needed": False
        }
    elif accuracy_change > 10 and ccg > -20:
        return {
            "recommendation": "Moderate scaling",
            "reasoning": "Some benefit from scaling, diminishing returns",
            "suggested_size": "7-13B range",
            "hybrid_needed": False
        }
    elif accuracy_change < 10 or ccg < -20:
        return {
            "recommendation": "Don't scale - use hybrid",
            "reasoning": "Scaling doesn't help; architectural limitation",
            "suggested_size": "Smallest that handles NLU (3-7B)",
            "hybrid_needed": True
        }
    else:
        return {
            "recommendation": "Test before deciding",
            "reasoning": "Mixed scaling behavior",
            "suggested_size": "Run experiments at multiple sizes",
            "hybrid_needed": "Maybe"
        }


print("=" * 60)
print("MODEL SIZE RECOMMENDATIONS")
print("=" * 60)

for _, row in df_results.iterrows():
    rec = recommend_model_size(row['task'], row['ccg'], row['accuracy_change'])
    print(f"\n{row['task']}:")
    print(f"  Recommendation: {rec['recommendation']}")
    print(f"  Reasoning: {rec['reasoning']}")
    print(f"  Suggested size: {rec['suggested_size']}")
    print(f"  Hybrid needed: {rec['hybrid_needed']}")

## Part 5: Exercise - Analyze Your Task's Scaling

In [None]:
# YOUR EXERCISE: Analyze scaling for your task

# Replace with your actual measurements
my_scaling_data = {
    "sizes": [1, 7, 70],  # Model sizes in billions
    "losses": [2.5, 1.8, 1.2],  # Lower = better
    "accuracies": [50, 55, 60]  # Your accuracy measurements
}

my_ccg = calculate_ccg(
    my_scaling_data["sizes"],
    my_scaling_data["losses"],
    my_scaling_data["accuracies"]
)

my_recommendation = recommend_model_size(
    "Your Task",
    float(np.clip(my_ccg["ccg"], -100, 100)),  # Cap infinite CCG values at ±100
    my_ccg["accuracy_change_pp"]
)

print("YOUR SCALING ANALYSIS")
print("=" * 40)
print(f"CCG: {my_ccg['ccg']}")
print(f"Interpretation: {my_ccg['interpretation']}")
print(f"\nRecommendation: {my_recommendation['recommendation']}")
print(f"Hybrid needed: {my_recommendation['hybrid_needed']}")
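
The exercise needs one accuracy figure per model size. One way to obtain it, along with the mean stated confidence used in Part 1, is to aggregate graded responses. The sketch below assumes a hypothetical record format (`model_size_B`, `correct`, `stated_confidence`) and uses placeholder rows purely to show the shape of the computation. The values are not measurements, and no API is called.

```python
import pandas as pd

# Hypothetical graded responses: one row per (model, test item).
# These values are placeholders, not real measurements.
records = [
    {"model_size_B": 1, "correct": True, "stated_confidence": 60},
    {"model_size_B": 1, "correct": False, "stated_confidence": 55},
    {"model_size_B": 7, "correct": True, "stated_confidence": 80},
    {"model_size_B": 7, "correct": False, "stated_confidence": 85},
    {"model_size_B": 70, "correct": True, "stated_confidence": 95},
    {"model_size_B": 70, "correct": False, "stated_confidence": 90},
]

df = pd.DataFrame(records)

# Per-model accuracy (%), mean stated confidence (%), and item count
summary = (
    df.groupby("model_size_B")
      .agg(accuracy=("correct", lambda s: 100 * s.mean()),
           confidence=("stated_confidence", "mean"),
           n_items=("correct", "size"))
      .reset_index()
)
summary["gap_pp"] = summary["confidence"] - summary["accuracy"]
print(summary.to_string(index=False))

# Lists in the shape the exercise cell expects
sizes = summary["model_size_B"].tolist()
accuracies = summary["accuracy"].tolist()
```

With real data, `sizes` and `accuracies` (plus a per-model loss list) can replace the placeholder values in `my_scaling_data`.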

## Key Takeaways

1. **Loss improvement doesn't guarantee accuracy improvement.** Models can get better at being confidently wrong.

2. **CCG quantifies scaling pathology.** Highly negative CCG = scaling won't help.

3. **Task type determines scaling behavior:**
   - Pattern completion: Scales well
   - Temporal/compositional: Pathological scaling

4. **Hybrid over scaling.** When CCG is pathological, invest in hybrid architectures, not larger models.

5. **Test before you scale.** Always verify scaling helps your specific task.

---

**Homework:** Calculate CCG for your deployment task across at least 3 model sizes.

**Next Session:** Failure Mode III—Prompt Brittleness and Pattern Matching