# When Does Prompt Optimization Matter?

## An Empirical Study of DSPy Prompt Optimization Across Task Types

---

**Research Question:** Does automated prompt optimization provide meaningful improvements across different task types, or are modern LLMs already well-suited for certain tasks out of the box?

**Hypothesis:** The benefit of prompt optimization varies by task complexity:
- **Simple classification tasks** may see minimal improvement (LLMs already perform well)
- **Complex reasoning tasks** may see significant improvement (optimization discovers effective strategies)

**Framework:** [DSPy](https://github.com/stanfordnlp/dspy) - A framework for programming with foundation models that replaces manual prompt engineering with automatic optimization.

### DSPy Overview

DSPy (Declarative Self-improving Python) provides:

1. **Signatures** - Declarative specifications of input/output behavior
2. **Modules** - Composable building blocks (e.g., `ChainOfThought`)
3. **Optimizers** - Automatic prompt optimization algorithms

```python
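import dspy

# Example: define a signature (declares the task's inputs and outputs)
class IntentClassifier(dspy.Signature):
    """Classify customer intent from support query."""
    query: str = dspy.InputField()
    intent: str = dspy.OutputField()

# Create a module and optimize it.
# `optimizer` (a DSPy optimizer instance) and `data` (a list of training
# examples) are placeholders assumed to be defined elsewhere.
classifier = dspy.Predict(IntentClassifier)
optimized = optimizer.compile(classifier, trainset=data)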
```

**Key Insight:** In DSPy, the signature docstring and field descriptions are turned into the prompt sent to the model. Optimizers like MIPROv2 can automatically discover better instructions and few-shot examples.

## Setup and Imports

In [None]:
import json
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Image, display, Markdown

# Add project root to path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Configure plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 100
plt.rcParams["font.size"] = 11

# Define paths
RESULTS_DIR = project_root / "results"
VIZ_DIR = RESULTS_DIR / "visualizations"
DATASETS_DIR = project_root / "datasets"

print(f"Project root: {project_root}")
print(f"Results directory: {RESULTS_DIR}")
print(f"Visualizations: {VIZ_DIR}")

In [None]:
# Load all pre-computed results
def load_json(path):
    with open(path) as f:
        return json.load(f)

# Intent Classification
intent_results = load_json(RESULTS_DIR / "intent_classification" / "optimization.json")
intent_model = load_json(RESULTS_DIR / "intent_classification" / "optimized_model.json")

# Response Generation
response_results = load_json(RESULTS_DIR / "response_generation" / "optimization.json")
response_model = load_json(RESULTS_DIR / "response_generation" / "optimized_model.json")

# Math Solver
math_light_results = load_json(RESULTS_DIR / "math_solver" / "light_optimization.json")
math_medium_results = load_json(RESULTS_DIR / "math_solver" / "medium_optimization.json")
math_model = load_json(RESULTS_DIR / "math_solver" / "optimized_model_light.json")

print("All results loaded successfully!")
print(f"\nExperiments:")
print(f"  - Intent Classification: {intent_results['optimizer']}")
print(f"  - Response Generation: MIPROv2")
print(f"  - Math Solver: MIPROv2 (Light & Medium)")

---

## Section 2: Methodology

### Task Selection Rationale

We selected three tasks representing a spectrum of complexity:

| Task | Type | Complexity | Expected Optimization Benefit |
|------|------|------------|------------------------------|
| Intent Classification | Pattern Matching | Low | Minimal (LLMs already good) |
| Response Generation | Text Generation | Medium | Moderate (benefits from instructions) |
| Math Word Problems | Multi-step Reasoning | High | Significant (benefits from strategies) |

### Optimizers Used

1. **BootstrapFewShot**
   - Selects effective few-shot examples from training data
   - Fast and low-cost
   - Best for: Simple tasks where examples help

2. **MIPROv2 (Multi-prompt Instruction PRoposal Optimizer)**
   - Uses meta-prompting to generate instruction candidates
   - Evaluates and selects best instructions + demos
   - Modes: Light (fewer trials), Medium (more thorough)
   - Best for: Complex tasks where instructions matter
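
To make the BootstrapFewShot idea concrete, here is a minimal sketch in plain Python. It is not DSPy's implementation: `bootstrap_demos`, `toy_predict`, and the toy training set are hypothetical names introduced only for illustration. The sketch keeps training examples on which the current program's prediction passes the metric, up to a fixed number of demos.

```python
from typing import Callable

# Hypothetical example format: {"query": ..., "intent": ...}
Example = dict[str, str]


def bootstrap_demos(
    predict: Callable[[str], str],
    trainset: list[Example],
    metric: Callable[[str, str], bool],
    max_demos: int = 4,
) -> list[Example]:
    """Collect up to `max_demos` training examples the program already gets right."""
    demos: list[Example] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = predict(example["query"])
        if metric(example["intent"], prediction):
            demos.append(example)
    return demos


def toy_predict(query: str) -> str:
    """Stand-in for an LLM call: a keyword rule, purely illustrative."""
    return "cancel_order" if "cancel" in query.lower() else "track_order"


toy_trainset = [
    {"query": "Please cancel my order", "intent": "cancel_order"},
    {"query": "Where is my package?", "intent": "track_order"},
    {"query": "I want a refund", "intent": "get_refund"},
]

demos = bootstrap_demos(toy_predict, toy_trainset, lambda gold, pred: gold == pred)
print([d["intent"] for d in demos])
```

The selected demos would then be placed in the prompt as few-shot examples. MIPROv2 additionally proposes and scores candidate instructions, which this sketch does not attempt to model.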

### Evaluation Metrics

- **Intent Classification:** Exact match accuracy
- **Response Generation:** LLM-as-Judge quality score (0.0-1.0)
- **Math Solver:** Numerical answer accuracy
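
The two accuracy metrics can be sketched as follows. The notebook does not show the actual metric code, so the function names and the normalization choices (lowercasing labels, comparing the last number found in each answer) are assumptions for illustration.

```python
import re
from typing import Optional


def exact_match(gold: str, predicted: str) -> bool:
    """Exact-match criterion after trimming whitespace and lowercasing (assumed normalization)."""
    return gold.strip().lower() == predicted.strip().lower()


def extract_number(text: str) -> Optional[float]:
    """Return the last number in `text`, ignoring thousands separators; None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_match(gold: str, predicted: str, tol: float = 1e-6) -> bool:
    """Numerical-answer criterion: the final numbers must agree within `tol`."""
    gold_value = extract_number(gold)
    predicted_value = extract_number(predicted)
    if gold_value is None or predicted_value is None:
        return False
    return abs(gold_value - predicted_value) <= tol


def accuracy(flags: list[bool]) -> float:
    """Percentage of correct predictions (0.0 for an empty list)."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Taking the last number is a deliberate choice here: GSM8K reference answers end with `#### <number>`, and model outputs usually state the final answer after the reasoning.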

---

## Section 3: Dataset Exploration

### Dataset 1: Bitext Customer Support (Intent Classification & Response Generation)

The Bitext dataset contains 27,000 customer support interactions across 27 intent categories.

In [None]:
# Load Bitext dataset
bitext_path = DATASETS_DIR / "Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv"
bitext_df = pd.read_csv(bitext_path)

print(f"Dataset size: {len(bitext_df):,} examples")
print(f"Intent categories: {bitext_df['intent'].nunique()}")
print(f"High-level categories: {bitext_df['category'].nunique()}")
print(f"\nColumns: {list(bitext_df.columns)}")

In [None]:
# Show sample examples
print("Sample Customer Support Queries:\n")
samples = bitext_df.groupby('intent').first().reset_index().head(5)
for _, row in samples.iterrows():
    print(f"Intent: {row['intent']}")
    print(f"Query: {row['instruction'][:100]}...")
    print()

In [None]:
# Plot intent distribution
fig, ax = plt.subplots(figsize=(14, 6))

intent_counts = bitext_df['intent'].value_counts()
colors = plt.cm.viridis(np.linspace(0, 0.8, len(intent_counts)))

bars = ax.bar(range(len(intent_counts)), intent_counts.values, color=colors)
ax.set_xticks(range(len(intent_counts)))
ax.set_xticklabels(intent_counts.index, rotation=45, ha='right', fontsize=9)
ax.set_xlabel('Intent Category', fontweight='bold')
ax.set_ylabel('Number of Examples', fontweight='bold')
ax.set_title('Bitext Dataset: Distribution of Intent Categories', fontweight='bold', fontsize=12)

# Add count labels
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}', ha='center', va='bottom', fontsize=7)

plt.tight_layout()
plt.show()

print(f"\nExamples per intent: mean {intent_counts.mean():.0f}, min {intent_counts.min()}, max {intent_counts.max()}")

### Dataset 2: GSM8K (Math Word Problems)

GSM8K contains grade-school level math word problems requiring multi-step reasoning.

In [None]:
# Load GSM8K samples
gsm8k_train = []
with open(DATASETS_DIR / "train.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 5:  # Just load a few for display
            break
        gsm8k_train.append(json.loads(line))

# Count total
with open(DATASETS_DIR / "train.jsonl") as f:
    train_count = sum(1 for _ in f)
with open(DATASETS_DIR / "test.jsonl") as f:
    test_count = sum(1 for _ in f)

print(f"GSM8K Dataset:")
print(f"  Training examples: {train_count:,}")
print(f"  Test examples: {test_count:,}")
print(f"  Total: {train_count + test_count:,}")

In [None]:
# Show sample problems
print("Sample Math Word Problems:\n")
print("=" * 80)
for i, item in enumerate(gsm8k_train[:3], 1):
    print(f"\nProblem {i}:")
    print(f"{item['question']}")
    # Extract just the final answer
    answer = item['answer'].split('####')[-1].strip() if '####' in item['answer'] else item['answer']
    print(f"\nAnswer: {answer}")
    print("-" * 80)

---

## Section 4: Experiment 1 - Intent Classification

**Task:** Classify customer support queries into one of 27 intent categories.

**Hypothesis:** Modern LLMs should already be good at this pattern-matching task, yielding minimal improvement from optimization.

In [None]:
# Display intent classification results
print("Intent Classification Results")
print("=" * 50)
print(f"\nOptimizer: {intent_results['optimizer']}")
print(f"Dataset: {intent_results['dataset']}")
print(f"Train size: {intent_results['dataset_sizes']['train']}")
print(f"Test size: {intent_results['dataset_sizes']['test']}")
print()
print(f"Baseline Accuracy:  {intent_results['baseline']['accuracy']:.1f}%")
print(f"Optimized Accuracy: {intent_results['optimized']['accuracy']:.1f}%")
print(f"\nChange: {intent_results['improvement']['absolute']:+.1f} percentage points ({intent_results['improvement']['percentage']:+.1f}% relative)")

In [None]:
# Visualize intent classification results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart comparison
categories = ['Baseline', 'Optimized']
accuracies = [intent_results['baseline']['accuracy'], intent_results['optimized']['accuracy']]
colors = ['#3498db', '#2ecc71' if accuracies[1] >= accuracies[0] else '#e74c3c']  # Blue baseline; green if improved, red if worse

bars = ax1.bar(categories, accuracies, color=colors, alpha=0.8, edgecolor='black', linewidth=1.2)
ax1.set_ylabel('Accuracy (%)', fontweight='bold')
ax1.set_title('Intent Classification: Baseline vs Optimized', fontweight='bold')
ax1.set_ylim([80, 100])
ax1.axhline(y=90, color='gray', linestyle='--', alpha=0.5, label='90% threshold')

for bar in bars:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.5,
            f'{height:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Improvement delta
improvement = intent_results['improvement']['absolute']
color = '#2ecc71' if improvement > 0 else '#e74c3c'
ax2.bar(['BootstrapFewShot'], [improvement], color=color, alpha=0.8, edgecolor='black', linewidth=1.2)
ax2.axhline(y=0, color='black', linestyle='-', linewidth=1)
ax2.set_ylabel('Accuracy Change (%)', fontweight='bold')
ax2.set_title('Optimization Impact', fontweight='bold')
ax2.set_ylim([-5, 5])
ax2.text(0, improvement - 0.3, f'{improvement:+.1f}%', ha='center', va='top', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### Analysis: Intent Classification

**Key Finding:** The baseline model already achieves 92% accuracy, and optimization slightly lowered it to 90% (-2 percentage points).

**Interpretation:**
- Modern LLMs are already well-suited for intent classification
- The task involves pattern matching against a fixed set of categories
- Few-shot examples may have introduced noise rather than helping
- The high baseline leaves little room for improvement

**Conclusion:** For simple classification tasks where LLMs already perform well, prompt optimization provides minimal or no benefit.

---

## Section 5: Experiment 2 - Response Generation

**Task:** Generate helpful customer support responses given a query and detected intent.

**Evaluation:** LLM-as-Judge (GPT-4o-mini rates response quality 0.0-1.0)

**Hypothesis:** Response quality should improve with better instructions discovered by MIPROv2.
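
The judge prompt and parsing code are not shown in this notebook, so the following is only a sketch of how free-text judge replies could be turned into scores in the 0.0-1.0 range and aggregated. `parse_judge_score`, `summarize`, and the sample replies are hypothetical; the output field names mirror those read from the results file in the next cell.

```python
import re
from statistics import mean


def parse_judge_score(reply: str) -> float:
    """Extract the first number from a judge reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return 0.0  # Assumption: unparseable replies count as the lowest score
    return min(1.0, max(0.0, float(match.group())))


def summarize(scores: list[float]) -> dict[str, float]:
    """Aggregate per-response scores into average, minimum, and maximum quality."""
    return {
        "average_quality": mean(scores),
        "min_quality": min(scores),
        "max_quality": max(scores),
    }


# Made-up judge replies, for illustration only
replies = ["Score: 0.8", "0.6 - polite but vague", "I would rate this 1.5", "no score given"]
scores = [parse_judge_score(reply) for reply in replies]
summary = summarize(scores)
print(summary)
```

Treating unparseable replies as 0.0 penalizes the generator for judge failures; dropping such replies instead is an equally defensible choice.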

In [None]:
# Display response generation results
print("Response Generation Results")
print("=" * 50)
print(f"\nOptimizer: MIPROv2")
print(f"Train size: {response_results['dataset']['train_size']}")
print(f"Test size: {response_results['dataset']['test_size']}")
print()
print("Baseline:")
print(f"  Average Quality: {response_results['baseline']['average_quality']:.3f}")
print(f"  Min/Max: {response_results['baseline']['min_quality']:.2f} / {response_results['baseline']['max_quality']:.2f}")
print()
print("Optimized:")
print(f"  Average Quality: {response_results['optimized']['average_quality']:.3f}")
print(f"  Min/Max: {response_results['optimized']['min_quality']:.2f} / {response_results['optimized']['max_quality']:.2f}")
print()
print(f"Improvement: {response_results['improvement']['absolute']:+.3f} ({response_results['improvement']['percentage']:+.1f}% relative)")

In [None]:
# Compare instructions before and after optimization
print("Discovered Instructions (Before vs After Optimization)")
print("=" * 70)
print()
print("BASELINE INSTRUCTION:")
print("-" * 70)
print(response_results['baseline']['instructions'])
print()
print("OPTIMIZED INSTRUCTION (discovered by MIPROv2):")
print("-" * 70)
print(response_results['optimized']['instructions'])

In [None]:
# Visualize response generation results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Quality comparison
categories = ['Baseline', 'Optimized']
qualities = [response_results['baseline']['average_quality'], 
             response_results['optimized']['average_quality']]
colors = ['#3498db', '#2ecc71']

bars = ax1.bar(categories, qualities, color=colors, alpha=0.8, edgecolor='black', linewidth=1.2)
ax1.set_ylabel('Average Quality Score', fontweight='bold')
ax1.set_title('Response Generation: Quality Comparison', fontweight='bold')
ax1.set_ylim([0, 1])

for bar in bars:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.02,
            f'{height:.3f}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Quality distribution (min/avg/max)
x = np.arange(2)
width = 0.25

mins = [response_results['baseline']['min_quality'], response_results['optimized']['min_quality']]
avgs = [response_results['baseline']['average_quality'], response_results['optimized']['average_quality']]
maxs = [response_results['baseline']['max_quality'], response_results['optimized']['max_quality']]

ax2.bar(x - width, mins, width, label='Min', color='#e74c3c', alpha=0.8)
ax2.bar(x, avgs, width, label='Average', color='#3498db', alpha=0.8)
ax2.bar(x + width, maxs, width, label='Max', color='#2ecc71', alpha=0.8)

ax2.set_ylabel('Quality Score', fontweight='bold')
ax2.set_title('Quality Distribution: Baseline vs Optimized', fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(['Baseline', 'Optimized'])
ax2.legend()
ax2.set_ylim([0, 1])

plt.tight_layout()
plt.show()

### Analysis: Response Generation

**Key Finding:** MIPROv2 discovered instructions that improved average quality from 0.713 to 0.735 (+0.022 absolute, +3.1% relative).

**Notable Improvements:**
- Minimum quality increased from 0.40 to 0.60 (fewer bad responses)
- Maximum quality increased from 0.92 to 0.95
- The discovered instruction is more specific about empathy and structure

**The Optimized Instruction:**
> "Given a customer support query and the detected customer intent, use the Predict module to predict the reasoning behind the query and generate a tailored response that aligns with both the query and intent. Ensure that the response is professional, empathetic, and provides clear next steps for the customer."

**Conclusion:** For generation tasks, optimization provides moderate improvement by discovering more effective instructions.

---

## Section 6: Experiment 3 - Math Word Problems

**Task:** Solve grade-school math word problems requiring multi-step reasoning.

**Dataset:** GSM8K (Grade School Math 8K)

**Hypothesis:** Complex reasoning tasks should benefit significantly from optimization (discovered strategies + few-shot examples).

In [None]:
# Display math solver results
print("Math Word Problem Solver Results")
print("=" * 50)
print()
print(f"Baseline Accuracy: {math_light_results['baseline']['accuracy']:.1f}%")
print(f"  Correct: {math_light_results['baseline']['correct']}/50")
print()
print("MIPROv2 Light:")
print(f"  Accuracy: {math_light_results['optimized']['accuracy']:.1f}%")
print(f"  Correct: {math_light_results['optimized']['correct']}/50")
print(f"  Improvement: {math_light_results['improvement']['absolute']:+.1f}%")
print()
print("MIPROv2 Medium:")
opt_medium = math_medium_results['optimizations'][0]
print(f"  Accuracy: {opt_medium['accuracy']:.1f}%")
print(f"  Correct: {opt_medium['correct']}/50")
print(f"  Improvement: {opt_medium['accuracy'] - math_medium_results['baseline']['accuracy']:+.1f}%")
print(f"  Optimization time: {opt_medium['time_seconds']/60:.1f} minutes")

In [None]:
# Visualize math solver results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
categories = ['Baseline', 'MIPROv2 Light', 'MIPROv2 Medium']
accuracies = [
    math_light_results['baseline']['accuracy'],
    math_light_results['optimized']['accuracy'],
    math_medium_results['optimizations'][0]['accuracy']
]
colors = ['#3498db', '#f39c12', '#2ecc71']

bars = ax1.bar(categories, accuracies, color=colors, alpha=0.8, edgecolor='black', linewidth=1.2)
ax1.set_ylabel('Accuracy (%)', fontweight='bold')
ax1.set_title('Math Solver: Accuracy by Optimization Level', fontweight='bold')
ax1.set_ylim([0, 100])

for bar in bars:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 1,
            f'{height:.0f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Improvement delta
improvements = [
    math_light_results['improvement']['absolute'],
    math_medium_results['optimizations'][0]['accuracy'] - math_medium_results['baseline']['accuracy']
]
opt_names = ['MIPROv2 Light', 'MIPROv2 Medium']

bars2 = ax2.bar(opt_names, improvements, color=['#f39c12', '#2ecc71'], alpha=0.8, edgecolor='black', linewidth=1.2)
ax2.axhline(y=0, color='black', linestyle='-', linewidth=1)
ax2.set_ylabel('Accuracy Improvement (%)', fontweight='bold')
ax2.set_title('Optimization Impact: Light vs Medium', fontweight='bold')
ax2.set_ylim([0, 30])

for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.5,
            f'{height:+.0f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Show discovered few-shot examples
print("Few-Shot Examples Discovered by MIPROv2")
print("=" * 70)
print()
print("These examples were automatically selected to demonstrate effective reasoning:")
print()

demos = math_model['predict']['demos']
for i, demo in enumerate(demos[:2], 1):  # Show first 2
    print(f"EXAMPLE {i}:")
    print(f"Question: {demo['question']}")
    print(f"\nReasoning: {demo['reasoning']}")
    print(f"\nAnswer: {demo['answer']}")
    print("-" * 70)

### Analysis: Math Word Problems

**Key Finding:** MIPROv2 achieved large improvements:
- Light mode: 64% → 84% (+20 percentage points absolute, +31% relative)
- Medium mode: 64% → 88% (+24 percentage points absolute, +38% relative)

**Why Such Large Improvements?**

1. **Few-shot examples matter**: The discovered examples demonstrate clear step-by-step reasoning
2. **Task requires strategy**: Multi-step arithmetic benefits from explicit decomposition
3. **Lower baseline leaves headroom**: At 64%, the baseline leaves substantial room for improvement
4. **Chain-of-thought prompting**: The optimized model uses structured reasoning

**Conclusion:** Complex reasoning tasks benefit substantially from prompt optimization, with accuracy gains of 20-24 percentage points.
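
The absolute and relative figures quoted for the math results follow from simple arithmetic, sketched below. The helper name `change` is introduced here for illustration; the accuracies are the ones reported in this section.

```python
def change(baseline: float, optimized: float) -> tuple[float, float]:
    """Return (absolute change, relative change in %) between two scores."""
    absolute = optimized - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, baseline, optimized in [
    ("MIPROv2 Light", 64.0, 84.0),
    ("MIPROv2 Medium", 64.0, 88.0),
]:
    absolute, relative = change(baseline, optimized)
    print(f"{name}: {absolute:+.2f} points absolute, {relative:+.2f}% relative")
```

The relative change divides by the baseline, so the same absolute gain looks larger when the baseline is lower. This is why both figures are worth reporting side by side.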

---

## Section 7: Cross-Task Comparison

In [None]:
# Create summary DataFrame (response-generation quality is scaled from 0-1 to 0-100 for comparability)
summary_data = {
    'Task': ['Intent Classification', 'Response Generation', 'Math Solver (Light)', 'Math Solver (Medium)'],
    'Baseline': [92.0, 71.3, 64.0, 64.0],
    'Optimized': [90.0, 73.5, 84.0, 88.0],
    'Change (abs)': [-2.0, 2.2, 20.0, 24.0],
    'Change (%)': [-2.2, 3.1, 31.3, 37.5],
    'Optimizer': ['BootstrapFewShot', 'MIPROv2', 'MIPROv2 Light', 'MIPROv2 Medium']
}

summary_df = pd.DataFrame(summary_data)
print("Summary of All Experiments")
print("=" * 90)
print(summary_df.to_string(index=False))

In [None]:
# Comprehensive comparison visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Baseline vs Optimized
ax1 = axes[0]
x = np.arange(4)
width = 0.35

baseline = [92.0, 71.3, 64.0, 64.0]
optimized = [90.0, 73.5, 84.0, 88.0]
tasks = ['Intent\nClassification', 'Response\nGeneration', 'Math\n(Light)', 'Math\n(Medium)']

ax1.bar(x - width/2, baseline, width, label='Baseline', color='#3498db', alpha=0.8)
ax1.bar(x + width/2, optimized, width, label='Optimized', color='#2ecc71', alpha=0.8)
ax1.set_ylabel('Score (%)', fontweight='bold')
ax1.set_title('Performance: Baseline vs Optimized', fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(tasks)
ax1.legend()
ax1.set_ylim([0, 100])

# Plot 2: Improvement delta
ax2 = axes[1]
improvements = [-2.0, 2.2, 20.0, 24.0]
colors = ['#e74c3c', '#2ecc71', '#2ecc71', '#2ecc71']

bars = ax2.bar(tasks, improvements, color=colors, alpha=0.8, edgecolor='black', linewidth=1)
ax2.axhline(y=0, color='black', linestyle='-', linewidth=1)
ax2.set_ylabel('Improvement (%)', fontweight='bold')
ax2.set_title('Optimization Impact by Task', fontweight='bold')

for bar, imp in zip(bars, improvements):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + (1 if height >= 0 else -2),
            f'{imp:+.0f}%', ha='center', va='bottom' if height >= 0 else 'top', 
            fontsize=11, fontweight='bold')

# Plot 3: Task complexity vs benefit
ax3 = axes[2]
complexity = [1, 2, 3, 3]  # Arbitrary complexity scores
benefit = [-2.0, 2.2, 20.0, 24.0]
task_labels = ['Intent', 'Response', 'Math (L)', 'Math (M)']
colors = plt.cm.RdYlGn(np.interp(benefit, [-5, 25], [0, 1]))

scatter = ax3.scatter(complexity, benefit, s=300, c=colors, edgecolors='black', linewidth=1.5)
for i, label in enumerate(task_labels):
    ax3.annotate(label, (complexity[i], benefit[i]), textcoords="offset points", 
                 xytext=(0, 12), ha='center', fontsize=10, fontweight='bold')

ax3.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax3.set_xlabel('Task Complexity', fontweight='bold')
ax3.set_ylabel('Optimization Benefit (%)', fontweight='bold')
ax3.set_title('Complexity vs Optimization Benefit', fontweight='bold')
ax3.set_xticks([1, 2, 3])
ax3.set_xticklabels(['Low\n(Classification)', 'Medium\n(Generation)', 'High\n(Reasoning)'])

plt.tight_layout()
plt.show()

In [None]:
# Display existing visualization (from intent classification experiment)
improvement_path = VIZ_DIR / "improvement_delta.png"
if improvement_path.exists():
    print("Pre-generated Visualization: Intent Classification Improvement")
    display(Image(filename=str(improvement_path), width=600))
else:
    print(f"Note: Visualization not found at {improvement_path}")

---

## Section 8: Conclusions

### Key Finding: Task Complexity Determines Optimization Value

Across the three tasks we tested, the benefit of optimization increased with task complexity:

| Task Complexity | Example | Optimization Benefit (absolute) | Recommendation |
|----------------|---------|---------------------|----------------|
| **Low** | Intent Classification | Minimal/Negative (-2 points) | Skip optimization; use baseline |
| **Medium** | Response Generation | Moderate (+0.022 on a 0-1 scale; +3% relative) | Optional; cost-benefit analysis needed |
| **High** | Math Reasoning | Significant (+20 to +24 points) | Strongly recommended |

### Why Does This Pattern Emerge?

**Low-Complexity Tasks (Classification):**
- Modern LLMs have strong pattern-matching capabilities
- The mapping from input to output is relatively straightforward
- Baseline performance is already high (92%)
- Few-shot examples may introduce noise rather than help

**Medium-Complexity Tasks (Generation):**
- Quality depends on following implicit guidelines
- Discovered instructions can codify better practices
- Improvement comes from clearer task specification

**High-Complexity Tasks (Reasoning):**
- Success requires structured problem decomposition
- Few-shot examples demonstrate effective strategies
- Chain-of-thought reasoning significantly helps
- Low baseline indicates room for improvement

### Practical Recommendations

**When to Use Prompt Optimization:**

1. **Always optimize** when:
   - Task requires multi-step reasoning
   - Baseline performance is below 70%
   - Task has complex output structure

2. **Consider optimizing** when:
   - Quality consistency is important
   - You have training data and compute budget
   - Small improvements justify the cost

3. **Skip optimization** when:
   - Baseline is already >90%
   - Task is simple classification
   - Compute budget is limited
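
The guidelines above can be condensed into a small decision helper. This is one possible reading, not a rule derived from the data: the function name `recommend` is hypothetical, only three of the listed criteria are encoded, and the precedence between conflicting criteria (a high baseline or limited compute wins) is an assumption.

```python
def recommend(baseline_pct: float, multi_step_reasoning: bool, limited_compute: bool = False) -> str:
    """Map the guidelines to one of: 'optimize', 'consider', 'skip'."""
    # Assumption: a high baseline or a tight budget takes precedence over other signals.
    if baseline_pct > 90.0 or limited_compute:
        return "skip"
    # Multi-step reasoning or a baseline below 70% falls under "always optimize".
    if multi_step_reasoning or baseline_pct < 70.0:
        return "optimize"
    # Everything else is a cost-benefit judgment call.
    return "consider"


# Applied to the three tasks in this study (baseline scores on a 0-100 scale)
for task, baseline, reasoning in [
    ("Intent Classification", 92.0, False),
    ("Response Generation", 71.3, False),
    ("Math Solver", 64.0, True),
]:
    print(f"{task}: {recommend(baseline, reasoning)}")
```

Under these assumptions the helper returns the same three recommendations as the conclusions table for the tasks studied here.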

**Optimizer Selection:**
- **BootstrapFewShot**: Fast, cheap, good for simple improvements
- **MIPROv2 Light**: Moderate cost, discovers instructions
- **MIPROv2 Medium**: Higher cost, better for complex tasks

### Limitations and Future Work

**Limitations:**
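- Small test sets (50 examples): each example is worth 2 percentage points of accuracy, so small differences (such as the -2-point change in intent classification) may be noise
- Single model (GPT-3.5-turbo); results may differ with other models
- Limited optimizer configurations tested; intent classification used BootstrapFewShot while the other tasks used MIPROv2, so task and optimizer effects are not separated
- LLM-as-Judge evaluation has its own biases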
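
To give a rough sense of the variance on a 50-example test set, the sketch below computes a normal-approximation 95% interval for an accuracy. The helper `accuracy_interval` is hypothetical, it assumes every accuracy was measured on 50 examples, and the normal approximation is crude for accuracies near 100%.

```python
import math


def accuracy_interval(accuracy_pct: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval (in %) for an accuracy measured on n examples."""
    p = accuracy_pct / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * max(0.0, p - half_width), 100.0 * min(1.0, p + half_width)


for label, acc in [
    ("Intent baseline", 92.0),
    ("Intent optimized", 90.0),
    ("Math baseline", 64.0),
    ("Math MIPROv2 Medium", 88.0),
]:
    low, high = accuracy_interval(acc, n=50)
    print(f"{label}: {acc:.0f}% (approx. 95% interval {low:.1f}%-{high:.1f}%)")
```

With these assumptions the two intent-classification intervals overlap almost entirely, while the math baseline and MIPROv2 Medium intervals do not. Because both conditions were scored on the same test examples, a paired comparison would be more appropriate than comparing independent intervals.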

**Future Directions:**
- Test with larger evaluation sets
- Compare across multiple LLM backends
- Explore more task types (summarization, translation, coding)
- Investigate optimizer hyperparameter sensitivity
- Cost-benefit analysis including optimization compute costs

---

## Summary

This research demonstrates that **prompt optimization is not universally beneficial**. The value of optimization depends critically on task complexity:

- **Simple tasks** (classification): LLMs already perform well; optimization unnecessary
- **Complex tasks** (multi-step reasoning): Optimization provides substantial gains (20+ percentage points)

**The key insight:** Before investing in prompt optimization, evaluate your baseline performance. If it's already high (>90%), focus efforts elsewhere. If it's moderate or low, especially for reasoning-heavy tasks, optimization can yield significant improvements.

---

*Research conducted using [DSPy](https://github.com/stanfordnlp/dspy) framework with GPT-3.5-turbo as the base model.*