# Model Version Evaluation with LangSmith

This notebook shows how to compare two AI models (or providers) on correctness and latency using LangSmith.

## Prerequisites

1. Set up your virtual environment (see README.md)
2. Copy `.env.example` to `.env` and fill in your API keys
3. Install dependencies: `pip install -r requirements.txt`

## Overview

This evaluation will:
- Define a small test dataset inline (in practice, you would load it from LangSmith)
- Run two different models on the same inputs
- Measure correctness and latency
- Compare results and generate visualizations

## 1. Setup and Imports

In [None]:
import sys
import os

# Add the project root (the notebook's parent directory) to the path
# so that the `src` package can be imported as `src.config`, `src.utils`, etc.
sys.path.append(os.path.dirname(os.path.abspath('')))

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

# Import custom utilities
from src.config import validate_config
from src.utils import measure_latency, calculate_metrics
from src.evaluators import combined_evaluator

# Load environment variables
load_dotenv()

# Validate configuration
try:
    validate_config()
    print("✓ Configuration validated successfully")
except ValueError as e:
    print(f"✗ Configuration error: {e}")

In [None]:
# Import LangChain and LangSmith
from langsmith import Client
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage  # langchain.schema is a deprecated import path

# Initialize LangSmith client
client = Client()
print("✓ LangSmith client initialized")

## 2. Configure Models to Compare

We'll compare two OpenAI models here; the same approach works for models from other providers.

In [None]:
# Configure two models to compare
model_1 = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    tags=["model-1", "gpt-3.5"]
)

model_2 = ChatOpenAI(
    model="gpt-4",
    temperature=0.7,
    tags=["model-2", "gpt-4"]
)

print("Model 1: GPT-3.5 Turbo")
print("Model 2: GPT-4")

## 3. Create or Load Test Dataset

For this example, we'll create a simple test dataset. In practice, you would load this from LangSmith.
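
As a rough sketch of the "load from LangSmith" path: LangSmith examples expose `inputs` and `outputs` dictionaries, so they need to be mapped onto the `input` / `expected_output` shape used in this notebook. The helper name `examples_to_test_data` and the field names `question` and `answer` are assumptions for illustration; substitute the keys your dataset actually uses. The stand-in objects let the sketch run offline.

```python
from types import SimpleNamespace


def examples_to_test_data(examples, input_key="question", output_key="answer"):
    """Convert LangSmith-style examples into the list-of-dicts format used here.

    Each example is expected to expose `.inputs` and `.outputs` dicts.
    `input_key` and `output_key` are assumptions; use your dataset's field names.
    """
    converted = []
    for example in examples:
        outputs = example.outputs or {}  # outputs may be missing on an example
        converted.append({
            "input": example.inputs[input_key],
            "expected_output": outputs.get(output_key, ""),
        })
    return converted


# Offline stand-ins for LangSmith examples, so the sketch runs without network access.
sample_examples = [
    SimpleNamespace(inputs={"question": "What is 2 + 2?"}, outputs={"answer": "4"}),
    SimpleNamespace(inputs={"question": "Capital of France?"}, outputs=None),
]
loaded = examples_to_test_data(sample_examples)
print(loaded)
```

With a real dataset, the same helper could be fed from `client.list_examples(dataset_name="my-dataset")`, where `"my-dataset"` is a placeholder name.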

In [None]:
# Example test dataset
test_data = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "input": "What is 2 + 2?",
        "expected_output": "4"
    },
    {
        "input": "Who wrote Romeo and Juliet?",
        "expected_output": "William Shakespeare"
    },
    {
        "input": "What is the largest planet in our solar system?",
        "expected_output": "Jupiter"
    },
    {
        "input": "What is the chemical symbol for gold?",
        "expected_output": "Au"
    }
]

print(f"Test dataset loaded: {len(test_data)} examples")

# Display first example
print("\nExample test case:")
print(f"Input: {test_data[0]['input']}")
print(f"Expected: {test_data[0]['expected_output']}")

## 4. Run Evaluation on Both Models

We'll run each model on the test dataset and measure both correctness and latency.
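
The helpers `measure_latency` and `combined_evaluator` live in `src/` and are not shown in this notebook. The sketch below is one plausible shape for them, inferred only from how they are called here: it is not the project's actual implementation. In particular, the case-insensitive substring rule for correctness and the `"fast"` / `"slow"` labels are assumptions.

```python
import time


def measure_latency(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def combined_evaluator(output, expected, latency, latency_threshold=5.0):
    """Score one response on correctness and speed.

    Assumption: "correct" means the expected answer appears in the output,
    ignoring case. The real evaluator may use a stricter or looser rule.
    """
    correct = expected.strip().lower() in output.strip().lower()
    fast = latency <= latency_threshold
    return {
        "correct": correct,
        "performance": "fast" if fast else "slow",  # labels are illustrative
        "overall_pass": correct and fast,
    }


# Time a cheap local call instead of a model request, then score a canned answer.
result, latency = measure_latency(str.upper, "paris")
evaluation = combined_evaluator("The capital is Paris.", "Paris", latency=1.2)
print(result, evaluation)
```

Substring matching is lenient: an expected answer of `"4"` would also match an output of `"14"`, so short expected values may need a stricter comparison.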

In [None]:
def run_model_evaluation(model, test_data, model_name):
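    """
    Run evaluation for a single model.

    Invokes `model` once per test case, times each call, and scores the
    response with `combined_evaluator`. Returns a list of per-example
    result dicts labeled with `model_name`.
    """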
    results = []
    
    for i, test_case in enumerate(test_data):
        print(f"Running {model_name} - Test {i+1}/{len(test_data)}", end="\r")
        
        # Measure latency
        message = HumanMessage(content=test_case["input"])
        response, latency = measure_latency(model.invoke, [message])
        
        # Extract output
        output = response.content
        
        # Evaluate
        evaluation = combined_evaluator(
            output=output,
            expected=test_case["expected_output"],
            latency=latency,
            latency_threshold=5.0
        )
        
        results.append({
            "model": model_name,
            "input": test_case["input"],
            "expected": test_case["expected_output"],
            "output": output,
            "correct": evaluation["correct"],
            "latency": latency,
            "performance": evaluation["performance"],
            "overall_pass": evaluation["overall_pass"]
        })
    
    print(f"\n{model_name} evaluation complete!")
    return results

In [None]:
# Run evaluations
print("Starting Model 1 evaluation...")
results_model_1 = run_model_evaluation(model_1, test_data, "GPT-3.5")

print("\nStarting Model 2 evaluation...")
results_model_2 = run_model_evaluation(model_2, test_data, "GPT-4")

# Combine results
all_results = results_model_1 + results_model_2
df_results = pd.DataFrame(all_results)

print("\n✓ All evaluations complete!")

## 5. Analyze Results

In [None]:
# Display summary statistics
print("=" * 60)
print("EVALUATION SUMMARY")
print("=" * 60)

for model_name in ["GPT-3.5", "GPT-4"]:
    model_data = df_results[df_results["model"] == model_name]
    
    print(f"\n{model_name}:")
    print(f"  Correctness Rate: {model_data['correct'].mean():.1%}")
    print(f"  Average Latency: {model_data['latency'].mean():.3f}s")
    print(f"  Overall Pass Rate: {model_data['overall_pass'].mean():.1%}")
    print(f"  Latency Range: {model_data['latency'].min():.3f}s - {model_data['latency'].max():.3f}s")

In [None]:
# Display detailed results table
print("\nDetailed Results:")
display_columns = ["model", "input", "correct", "latency", "performance"]
df_results[display_columns]
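
The setup cell imports `calculate_metrics` from `src.utils`, but the notebook never calls it and its definition is not shown. As a hedged sketch, a helper of that name might aggregate the per-example result dicts produced by `run_model_evaluation` along these lines; the returned keys and the choice of statistics are assumptions, not the project's actual API.

```python
import statistics


def calculate_metrics(results):
    """Aggregate per-example result dicts into summary metrics for one model."""
    if not results:
        raise ValueError("results must contain at least one entry")
    latencies = sorted(r["latency"] for r in results)
    return {
        "correctness_rate": sum(bool(r["correct"]) for r in results) / len(results),
        "pass_rate": sum(bool(r["overall_pass"]) for r in results) / len(results),
        "mean_latency": statistics.mean(latencies),
        "median_latency": statistics.median(latencies),
        "max_latency": latencies[-1],
    }


# Made-up sample rows in the same shape as the notebook's result dicts.
sample_results = [
    {"correct": True, "latency": 0.8, "overall_pass": True},
    {"correct": True, "latency": 1.2, "overall_pass": True},
    {"correct": False, "latency": 6.0, "overall_pass": False},
    {"correct": True, "latency": 2.0, "overall_pass": True},
]
metrics = calculate_metrics(sample_results)
print(metrics)
```

Reporting the median alongside the mean helps here because a single slow request, like the 6.0 s row in the sample, can dominate the average on a dataset this small.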

## 6. Visualize Comparison

In [None]:
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Model Comparison: GPT-3.5 vs GPT-4', fontsize=16, fontweight='bold')

# 1. Correctness comparison
correctness_data = df_results.groupby('model')['correct'].mean()
axes[0, 0].bar(correctness_data.index, correctness_data.values, color=['#3498db', '#e74c3c'])
axes[0, 0].set_title('Correctness Rate', fontweight='bold')
axes[0, 0].set_ylabel('Correctness Rate')
axes[0, 0].set_ylim([0, 1.1])
for i, v in enumerate(correctness_data.values):
    axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')

# 2. Average latency comparison
latency_data = df_results.groupby('model')['latency'].mean()
axes[0, 1].bar(latency_data.index, latency_data.values, color=['#3498db', '#e74c3c'])
axes[0, 1].set_title('Average Latency', fontweight='bold')
axes[0, 1].set_ylabel('Latency (seconds)')
for i, v in enumerate(latency_data.values):
    axes[0, 1].text(i, v + 0.05, f'{v:.3f}s', ha='center', fontweight='bold')

# 3. Latency distribution
for model_name in ['GPT-3.5', 'GPT-4']:
    model_latencies = df_results[df_results['model'] == model_name]['latency']
    axes[1, 0].hist(model_latencies, alpha=0.6, label=model_name, bins=10)
axes[1, 0].set_title('Latency Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Latency (seconds)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# 4. Overall pass rate
pass_rate_data = df_results.groupby('model')['overall_pass'].mean()
axes[1, 1].bar(pass_rate_data.index, pass_rate_data.values, color=['#3498db', '#e74c3c'])
axes[1, 1].set_title('Overall Pass Rate (Correct + Fast)', fontweight='bold')
axes[1, 1].set_ylabel('Pass Rate')
axes[1, 1].set_ylim([0, 1.1])
for i, v in enumerate(pass_rate_data.values):
    axes[1, 1].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## 7. Export Results

Save the evaluation results to a CSV file for further analysis.

In [None]:
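# Export to CSV, creating the output directory first if it does not exist
output_file = "../data/evaluation_results.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)
df_results.to_csv(output_file, index=False)
print(f"✓ Results exported to {output_file}")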

## Conclusion

This notebook demonstrated:
- Setting up model evaluation with LangSmith
- Comparing two models on correctness and latency
- Visualizing comparison results
- Exporting results for further analysis

## Next Steps

1. **Customize the evaluation**: Modify the `combined_evaluator` function to match your specific needs
2. **Load real datasets**: Connect to your LangSmith datasets
3. **Add more models**: Compare additional models or providers
4. **Advanced metrics**: Implement custom evaluation metrics
5. **Automated reporting**: Set up scheduled evaluations and reports