# LSM-004: Building Robust Evaluation Pipelines

## 🎯 Learning Objectives

By the end of this notebook, you will:
- Master advanced dataset creation and management techniques
- Build custom evaluators for different types of LLM applications
- Implement LLM-as-Judge evaluation patterns
- Set up regression testing and CI/CD integration
- Use comparison views and A/B testing for optimization
- Create comprehensive evaluation workflows for production systems

## 🛠️ Setup and Dependencies

Let's start by setting up our comprehensive evaluation environment.

In [None]:
# Install required packages for evaluation
!pip install langsmith langchain langchain-openai langchain-community
!pip install datasets scikit-learn rouge-score bert-score
!pip install python-dotenv pandas numpy matplotlib seaborn
!pip install pytest  # asyncio is part of the standard library and needs no install

In [None]:
import os
import json
import time
import asyncio
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional, Callable
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

from dotenv import load_dotenv
from langsmith import Client, traceable
from langsmith.evaluation import evaluate, EvaluationResult
from langsmith.schemas import Example, Run

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Load environment variables
load_dotenv()

# Initialize LangSmith client
client = Client()

print("✅ Environment setup complete")
print(f"📊 Project: {os.getenv('LANGSMITH_PROJECT', 'Not set')}")
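
Before running anything that calls an API, it can help to confirm that the credentials loaded by `load_dotenv()` are actually present. The following is a minimal sketch: the helper `missing_env_vars` is hypothetical, and the variable names `LANGSMITH_API_KEY` and `OPENAI_API_KEY` are assumed to be the ones your environment uses.

```python
import os

# Assumed variable names; adjust the tuple if your deployment uses others.
REQUIRED_ENV_VARS = ("LANGSMITH_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Passing `environ` explicitly keeps the check easy to test without touching the real process environment.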

## 📊 Advanced Dataset Creation and Management

Next, we create datasets for three evaluation scenarios: question answering, classification, and summarization.

In [None]:
# Define different types of evaluation datasets

class DatasetBuilder:
    """Advanced dataset builder for LLM evaluation"""
    
    def __init__(self, client: Client):
        self.client = client
    
    def create_qa_dataset(self, name: str, examples: List[Dict]) -> str:
        """Create a Q&A evaluation dataset"""
        try:
            dataset = self.client.create_dataset(
                dataset_name=name,
                description="Question-Answering evaluation dataset with ground truth answers"
            )
            
            self.client.create_examples(
                inputs=[ex["inputs"] for ex in examples],
                outputs=[ex["outputs"] for ex in examples],
                dataset_id=dataset.id
            )
            
            return dataset.id
            
        except Exception as e:
            if "already exists" in str(e):
                datasets = list(self.client.list_datasets(dataset_name=name))
                return datasets[0].id if datasets else None
            raise e
    
    def create_classification_dataset(self, name: str, examples: List[Dict]) -> str:
        """Create a classification evaluation dataset"""
        try:
            dataset = self.client.create_dataset(
                dataset_name=name,
                description="Text classification dataset with labels and confidence scores"
            )
            
            self.client.create_examples(
                inputs=[ex["inputs"] for ex in examples],
                outputs=[ex["outputs"] for ex in examples],
                dataset_id=dataset.id
            )
            
            return dataset.id
            
        except Exception as e:
            if "already exists" in str(e):
                datasets = list(self.client.list_datasets(dataset_name=name))
                return datasets[0].id if datasets else None
            raise e
    
    def create_summarization_dataset(self, name: str, examples: List[Dict]) -> str:
        """Create a summarization evaluation dataset"""
        try:
            dataset = self.client.create_dataset(
                dataset_name=name,
                description="Text summarization dataset with reference summaries"
            )
            
            self.client.create_examples(
                inputs=[ex["inputs"] for ex in examples],
                outputs=[ex["outputs"] for ex in examples],
                dataset_id=dataset.id
            )
            
            return dataset.id
            
        except Exception as e:
            if "already exists" in str(e):
                datasets = list(self.client.list_datasets(dataset_name=name))
                return datasets[0].id if datasets else None
            raise e

# Initialize dataset builder
dataset_builder = DatasetBuilder(client)

# Create sample datasets
print("📊 Creating evaluation datasets...")

# Q&A Dataset
qa_examples = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"answer": "Paris", "confidence": 1.0, "category": "geography"}
    },
    {
        "inputs": {"question": "Explain the concept of machine learning in one sentence."},
        "outputs": {"answer": "Machine learning is a subset of AI that enables computers to learn and make decisions from data without being explicitly programmed.", "confidence": 0.9, "category": "technology"}
    },
    {
        "inputs": {"question": "What are the main causes of climate change?"},
        "outputs": {"answer": "The main causes of climate change are greenhouse gas emissions from burning fossil fuels, deforestation, and industrial processes.", "confidence": 0.95, "category": "science"}
    },
    {
        "inputs": {"question": "How do you calculate compound interest?"},
        "outputs": {"answer": "Compound interest is calculated using the formula A = P(1 + r/n)^(nt), where P is principal, r is annual interest rate, n is compounding frequency, and t is time in years.", "confidence": 1.0, "category": "mathematics"}
    },
    {
        "inputs": {"question": "What is the difference between HTML and CSS?"},
        "outputs": {"answer": "HTML structures web content while CSS styles and formats that content, controlling layout, colors, fonts, and visual presentation.", "confidence": 0.9, "category": "technology"}
    }
]

qa_dataset_id = dataset_builder.create_qa_dataset("qa-evaluation-advanced", qa_examples)
print(f"✅ Q&A Dataset created: {qa_dataset_id}")

# Classification Dataset
classification_examples = [
    {
        "inputs": {"text": "I absolutely love this product! It exceeded all my expectations and works perfectly."},
        "outputs": {"label": "positive", "confidence": 0.95, "reasoning": "Strong positive language with words like 'love', 'exceeded expectations', and 'perfectly'"}
    },
    {
        "inputs": {"text": "This item is completely useless and broke after one day. Waste of money!"},
        "outputs": {"label": "negative", "confidence": 0.98, "reasoning": "Clear negative sentiment with 'useless', 'broke', and 'waste of money'"}
    },
    {
        "inputs": {"text": "The product is okay, nothing special but it does what it's supposed to do."},
        "outputs": {"label": "neutral", "confidence": 0.85, "reasoning": "Balanced sentiment with neutral expressions like 'okay' and 'nothing special'"}
    },
    {
        "inputs": {"text": "Amazing quality and fast delivery! Highly recommend to everyone."},
        "outputs": {"label": "positive", "confidence": 0.92, "reasoning": "Positive words like 'amazing', 'fast', and 'highly recommend'"}
    },
    {
        "inputs": {"text": "Poor customer service and the product arrived damaged. Very disappointed."},
        "outputs": {"label": "negative", "confidence": 0.90, "reasoning": "Negative experience described with 'poor', 'damaged', and 'disappointed'"}
    }
]

classification_dataset_id = dataset_builder.create_classification_dataset("sentiment-classification-advanced", classification_examples)
print(f"✅ Classification Dataset created: {classification_dataset_id}")

# Summarization Dataset
summarization_examples = [
    {
        "inputs": {"text": "Artificial intelligence (AI) is transforming industries worldwide. From healthcare to finance, AI applications are becoming more sophisticated and widespread. Machine learning algorithms can now process vast amounts of data to identify patterns that would be impossible for humans to detect manually. However, this rapid advancement also raises important questions about job displacement, privacy, and ethical considerations that society must address."},
        "outputs": {"summary": "AI is transforming industries globally with sophisticated applications, but raises concerns about jobs, privacy, and ethics.", "key_points": ["Industry transformation", "Advanced applications", "Pattern recognition capabilities", "Societal concerns"], "length_category": "medium"}
    },
    {
        "inputs": {"text": "Climate change represents one of the most pressing challenges of our time. Rising global temperatures, melting ice caps, and extreme weather events are clear indicators of environmental change. Scientists agree that human activities, particularly the burning of fossil fuels, are the primary drivers of these changes. Immediate action is required to reduce greenhouse gas emissions and transition to renewable energy sources."},
        "outputs": {"summary": "Climate change, driven primarily by human fossil fuel use, requires immediate action to reduce emissions and adopt renewable energy.", "key_points": ["Pressing global challenge", "Clear environmental indicators", "Human-caused", "Need for immediate action"], "length_category": "medium"}
    }
]

summarization_dataset_id = dataset_builder.create_summarization_dataset("summarization-evaluation", summarization_examples)
print(f"✅ Summarization Dataset created: {summarization_dataset_id}")

print("\n🎯 All evaluation datasets created successfully!")
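
Because every example must follow the same `inputs`/`outputs` shape, a quick structural check before uploading can catch malformed entries early. This is only a sketch: `validate_examples` is a hypothetical helper, not part of LangSmith, and the required keys are whatever your evaluators read.

```python
from typing import Any, Dict, List


def validate_examples(examples: List[Dict[str, Any]],
                      required_output_keys: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for i, ex in enumerate(examples):
        inputs = ex.get("inputs")
        outputs = ex.get("outputs")
        if not isinstance(inputs, dict) or not inputs:
            problems.append(f"example {i}: 'inputs' must be a non-empty dict")
        if not isinstance(outputs, dict):
            problems.append(f"example {i}: 'outputs' must be a dict")
            continue
        for key in required_output_keys:
            if key not in outputs:
                problems.append(f"example {i}: missing output key '{key}'")
    return problems


sample = [
    {"inputs": {"question": "What is the capital of France?"},
     "outputs": {"answer": "Paris"}},
    {"inputs": {"question": "Incomplete example"}, "outputs": {}},
]
for problem in validate_examples(sample, required_output_keys=["answer"]):
    print(problem)
```

Running the check on each example list before the corresponding `create_*_dataset` call keeps bad rows out of the stored dataset.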

## 🧪 Custom Evaluator Development

Now let's build custom evaluators: rule-based checks (exact match, confidence-weighted classification, conciseness) and LLM-as-Judge scorers (semantic similarity, helpfulness).

In [None]:
# Define evaluation result models
class EvaluationScore(BaseModel):
    """Structured evaluation result"""
    score: float = Field(description="Numeric score between 0 and 1")
    reasoning: str = Field(description="Explanation of the score")
    confidence: float = Field(description="Confidence in the evaluation")
    category: Optional[str] = Field(default=None, description="Category or aspect being evaluated")

class CustomEvaluators:
    """Collection of custom evaluators for different tasks"""
    
    def __init__(self):
        self.llm = ChatOpenAI(temperature=0.1, model="gpt-4")
        self.parser = PydanticOutputParser(pydantic_object=EvaluationScore)
    
    def accuracy_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """Exact match accuracy evaluator"""
        prediction = run.outputs.get("answer", "").strip().lower()
        ground_truth = example.outputs.get("answer", "").strip().lower()
        
        is_correct = prediction == ground_truth
        
        return {
            "key": "accuracy",
            "score": 1.0 if is_correct else 0.0,
            "comment": f"Expected: '{ground_truth}', Got: '{prediction}'"
        }
    
    def semantic_similarity_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """LLM-as-Judge semantic similarity evaluator"""
        prediction = run.outputs.get("answer", "")
        ground_truth = example.outputs.get("answer", "")
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert evaluator. Compare the semantic similarity between a predicted answer and the ground truth answer.
            
Rate the similarity on a scale from 0.0 to 1.0 where:
- 1.0: Semantically identical or equivalent meaning
- 0.8-0.9: Very similar meaning with minor differences
- 0.6-0.7: Similar core meaning but some important differences
- 0.4-0.5: Some overlap but significant differences
- 0.0-0.3: Very different or contradictory meanings

Consider factual accuracy, completeness, and semantic meaning.

{format_instructions}"""),
            ("human", """Ground Truth Answer: {ground_truth}

Predicted Answer: {prediction}

Evaluate the semantic similarity:""")
        ])
        
        try:
            formatted_prompt = prompt.format_messages(
                ground_truth=ground_truth,
                prediction=prediction,
                format_instructions=self.parser.get_format_instructions()
            )
            
            response = self.llm.invoke(formatted_prompt)
            result = self.parser.parse(response.content)
            
            return {
                "key": "semantic_similarity",
                "score": result.score,
                "comment": result.reasoning
            }
            
        except Exception as e:
            return {
                "key": "semantic_similarity",
                "score": 0.0,
                "comment": f"Evaluation failed: {str(e)}"
            }
    
    def classification_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """Classification accuracy evaluator with confidence weighting"""
        predicted_label = run.outputs.get("label", "").strip().lower()
        true_label = example.outputs.get("label", "").strip().lower()
        predicted_confidence = run.outputs.get("confidence", 0.5)
        
        is_correct = predicted_label == true_label
        
        # Weight score by confidence (penalize overconfident wrong predictions)
        if is_correct:
            score = min(1.0, predicted_confidence * 1.2)  # Bonus for confident correct predictions
        else:
            score = max(0.0, 1.0 - predicted_confidence)  # Penalty for confident wrong predictions
        
        return {
            "key": "classification_accuracy",
            "score": score,
            "comment": f"Predicted: {predicted_label} (conf: {predicted_confidence:.2f}), True: {true_label}, Correct: {is_correct}"
        }
    
    def helpfulness_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """LLM-as-Judge helpfulness evaluator"""
        question = example.inputs.get("question", "")
        answer = run.outputs.get("answer", "")
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert evaluator assessing the helpfulness of answers to questions.
            
Rate the helpfulness on a scale from 0.0 to 1.0 based on:
- Relevance: How well does the answer address the question?
- Completeness: Is the answer comprehensive enough?
- Clarity: Is the answer clear and understandable?
- Accuracy: Is the information provided correct?
- Usefulness: Would this answer help someone who asked the question?

{format_instructions}"""),
            ("human", """Question: {question}

Answer: {answer}

Evaluate the helpfulness:""")
        ])
        
        try:
            formatted_prompt = prompt.format_messages(
                question=question,
                answer=answer,
                format_instructions=self.parser.get_format_instructions()
            )
            
            response = self.llm.invoke(formatted_prompt)
            result = self.parser.parse(response.content)
            
            return {
                "key": "helpfulness",
                "score": result.score,
                "comment": result.reasoning
            }
            
        except Exception as e:
            return {
                "key": "helpfulness",
                "score": 0.5,
                "comment": f"Evaluation failed: {str(e)}"
            }
    
    def conciseness_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """Evaluate response conciseness"""
        answer = run.outputs.get("answer", "")
        question = example.inputs.get("question", "")
        
        word_count = len(answer.split())
        char_count = len(answer)
        
        # Simple heuristic: penalize overly long answers for simple questions
        question_words = len(question.split())
        
        if question_words <= 10:  # Simple question
            ideal_length = 50  # words
        else:  # Complex question
            ideal_length = 100  # words
        
        # Calculate conciseness score
        if word_count <= ideal_length:
            score = 1.0
        else:
            # Gradual penalty for length
            excess_ratio = (word_count - ideal_length) / ideal_length
            score = max(0.0, 1.0 - (excess_ratio * 0.5))
        
        return {
            "key": "conciseness",
            "score": score,
            "comment": f"Answer length: {word_count} words, {char_count} chars. Ideal: ~{ideal_length} words."
        }

# Initialize evaluators
evaluators = CustomEvaluators()
print("✅ Custom evaluators initialized")
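
To see how the two rule-based heuristics behave, the sketch below restates the scoring arithmetic from `classification_evaluator` and `conciseness_evaluator` as standalone functions. The function names are introduced here for illustration only; the formulas are copied from the evaluators above.

```python
def confidence_weighted_score(is_correct: bool, confidence: float) -> float:
    """Same arithmetic as classification_evaluator."""
    if is_correct:
        return min(1.0, confidence * 1.2)  # bonus, capped at 1.0
    return max(0.0, 1.0 - confidence)      # penalty grows with confidence


def conciseness_score(word_count: int, ideal_length: int) -> float:
    """Same arithmetic as conciseness_evaluator."""
    if word_count <= ideal_length:
        return 1.0
    excess_ratio = (word_count - ideal_length) / ideal_length
    return max(0.0, 1.0 - excess_ratio * 0.5)


for is_correct, conf in [(True, 0.9), (True, 0.5), (False, 0.9), (False, 0.5)]:
    score = confidence_weighted_score(is_correct, conf)
    print(f"correct={is_correct} conf={conf} score={score:.2f}")

for words in (40, 75, 150):
    print(f"{words} words -> {conciseness_score(words, ideal_length=50):.2f}")
```

Two properties are worth noticing. A correct prediction reaches the full score once its confidence is about 0.83 or higher, and a wrong prediction at confidence 0.5 scores 0.5, only slightly below a correct one at the same confidence (0.6). The conciseness score reaches zero when an answer is three times the ideal length.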

## 🏗️ Application Under Test

Let's create the applications to evaluate: two Q&A systems (baseline and enhanced) and two sentiment classifiers (simple and advanced).

In [None]:
# Define applications to evaluate

class ApplicationsUnderTest:
    """Collection of applications to evaluate"""
    
    def __init__(self):
        self.llm = ChatOpenAI(temperature=0.3, model="gpt-3.5-turbo")
        self.creative_llm = ChatOpenAI(temperature=0.7, model="gpt-3.5-turbo")
        self.precise_llm = ChatOpenAI(temperature=0.0, model="gpt-4")
    
    @traceable(run_type="chain", tags=["qa-system", "baseline"])
    def baseline_qa_system(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Baseline Q&A system using standard LLM"""
        question = inputs["question"]
        
        messages = [
            SystemMessage(content="Answer the question concisely and accurately."),
            HumanMessage(content=question)
        ]
        
        response = self.llm.invoke(messages)
        
        return {
            "answer": response.content,
            "model": "gpt-3.5-turbo",
            "temperature": 0.3
        }
    
    @traceable(run_type="chain", tags=["qa-system", "enhanced"])
    def enhanced_qa_system(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Enhanced Q&A system with better prompting"""
        question = inputs["question"]
        
        messages = [
            SystemMessage(content="""You are a knowledgeable assistant. Provide accurate, helpful, and concise answers.
            
Guidelines:
- Be factually accurate
- Keep answers concise but complete
- If you're unsure, acknowledge uncertainty
- Provide context when helpful"""),
            HumanMessage(content=f"Question: {question}")
        ]
        
        response = self.precise_llm.invoke(messages)
        
        return {
            "answer": response.content,
            "model": "gpt-4",
            "temperature": 0.0
        }
    
    @traceable(run_type="chain", tags=["sentiment-classifier", "simple"])
    def simple_sentiment_classifier(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Simple sentiment classification system"""
        text = inputs["text"]
        
        messages = [
            SystemMessage(content="Classify the sentiment as positive, negative, or neutral. Respond with just the label."),
            HumanMessage(content=text)
        ]
        
        response = self.llm.invoke(messages)
        label = response.content.strip().lower()
        
        # Simple confidence estimation based on obvious sentiment words
        confidence = 0.7  # Default
        if any(word in text.lower() for word in ['love', 'amazing', 'perfect', 'excellent']):
            confidence = 0.9
        elif any(word in text.lower() for word in ['hate', 'terrible', 'awful', 'useless']):
            confidence = 0.9
        
        return {
            "label": label,
            "confidence": confidence,
            "model": "gpt-3.5-turbo"
        }
    
    @traceable(run_type="chain", tags=["sentiment-classifier", "advanced"])
    def advanced_sentiment_classifier(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Advanced sentiment classification with reasoning"""
        text = inputs["text"]
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert sentiment analyzer. Classify the sentiment and provide reasoning.
            
Respond in JSON format:
{{
    "label": "positive" | "negative" | "neutral",
    "confidence": 0.0-1.0,
    "reasoning": "explanation of classification"
}}"""),
            ("human", "Text to analyze: {text}")
        ])
        
        response = self.llm.invoke(prompt.format_messages(text=text))
        
        try:
            result = json.loads(response.content)
            return {
                "label": result.get("label", "neutral"),
                "confidence": result.get("confidence", 0.5),
                "reasoning": result.get("reasoning", "No reasoning provided"),
                "model": "gpt-3.5-turbo"
            }
        except (json.JSONDecodeError, AttributeError):
            # Fallback if the response is not valid JSON or not a JSON object
            return {
                "label": "neutral",
                "confidence": 0.5,
                "reasoning": "Failed to parse response",
                "model": "gpt-3.5-turbo"
            }

# Initialize applications
apps = ApplicationsUnderTest()
print("✅ Test applications initialized")
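
Chat models sometimes wrap a JSON reply in a Markdown code fence, which makes a plain `json.loads` call fail and pushes the advanced classifier into its fallback branch. The sketch below shows one tolerant way to parse such replies. `parse_json_reply` is a hypothetical helper, and it assumes the reply is either bare JSON or a single fenced block.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # a Markdown code fence, built here to keep this listing readable
_FENCED = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def parse_json_reply(text: str) -> Optional[Dict[str, Any]]:
    """Parse a model reply as a JSON object, tolerating one surrounding fence.

    Returns None when the reply is not valid JSON or not a JSON object.
    """
    cleaned = text.strip()
    match = _FENCED.match(cleaned)
    if match:
        cleaned = match.group(1)
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = FENCE + 'json\n{"label": "positive", "confidence": 0.9}\n' + FENCE
print(parse_json_reply(reply))
print(parse_json_reply("positive"))
```

Returning `None` instead of raising lets the caller keep a single fallback path, as the classifier above already does.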

## 🧪 Running Comprehensive Evaluations

Now let's run each application against its dataset and compare the variants.

In [None]:
# Run comprehensive Q&A evaluation

print("🧪 Running Q&A System Evaluations...\n")

# Evaluate baseline Q&A system
print("📊 Evaluating Baseline Q&A System")
try:
    baseline_results = evaluate(
        apps.baseline_qa_system,
        data="qa-evaluation-advanced",
        evaluators=[
            evaluators.accuracy_evaluator,
            evaluators.semantic_similarity_evaluator,
            evaluators.helpfulness_evaluator,
            evaluators.conciseness_evaluator
        ],
        experiment_prefix="qa-baseline",
        description="Baseline Q&A system evaluation",
        max_concurrency=2
    )
    print("✅ Baseline evaluation completed")
    
except Exception as e:
    print(f"❌ Baseline evaluation failed: {e}")

# Evaluate enhanced Q&A system
print("\n📊 Evaluating Enhanced Q&A System")
try:
    enhanced_results = evaluate(
        apps.enhanced_qa_system,
        data="qa-evaluation-advanced",
        evaluators=[
            evaluators.accuracy_evaluator,
            evaluators.semantic_similarity_evaluator,
            evaluators.helpfulness_evaluator,
            evaluators.conciseness_evaluator
        ],
        experiment_prefix="qa-enhanced",
        description="Enhanced Q&A system evaluation",
        max_concurrency=2
    )
    print("✅ Enhanced evaluation completed")
    
except Exception as e:
    print(f"❌ Enhanced evaluation failed: {e}")

print("\n🎯 Q&A Evaluations completed! Check your LangSmith dashboard for detailed results.")

In [None]:
# Run sentiment classification evaluation

print("🧪 Running Sentiment Classification Evaluations...\n")

# Evaluate simple sentiment classifier
print("📊 Evaluating Simple Sentiment Classifier")
try:
    simple_sentiment_results = evaluate(
        apps.simple_sentiment_classifier,
        data="sentiment-classification-advanced",
        evaluators=[evaluators.classification_evaluator],
        experiment_prefix="sentiment-simple",
        description="Simple sentiment classification evaluation",
        max_concurrency=2
    )
    print("✅ Simple classifier evaluation completed")
    
except Exception as e:
    print(f"❌ Simple classifier evaluation failed: {e}")

# Evaluate advanced sentiment classifier
print("\n📊 Evaluating Advanced Sentiment Classifier")
try:
    advanced_sentiment_results = evaluate(
        apps.advanced_sentiment_classifier,
        data="sentiment-classification-advanced",
        evaluators=[evaluators.classification_evaluator],
        experiment_prefix="sentiment-advanced",
        description="Advanced sentiment classification evaluation",
        max_concurrency=2
    )
    print("✅ Advanced classifier evaluation completed")
    
except Exception as e:
    print(f"❌ Advanced classifier evaluation failed: {e}")

print("\n🎯 Sentiment Classification Evaluations completed!")
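
Evaluators can also be exercised offline, without any API calls, by passing them stand-in objects that expose the same `inputs` and `outputs` attributes. The sketch below uses `types.SimpleNamespace` as a stub for `Run` and `Example`. The evaluator is a standalone copy of the exact-match logic, with an added guard for runs that produced no outputs; that guard is an assumption about how you want failed runs scored.

```python
from types import SimpleNamespace
from typing import Any, Dict


def exact_match_evaluator(run: Any, example: Any) -> Dict[str, Any]:
    """Exact-match check; a run with no outputs scores 0.0 instead of raising."""
    prediction = (run.outputs or {}).get("answer", "").strip().lower()
    ground_truth = (example.outputs or {}).get("answer", "").strip().lower()
    return {"key": "accuracy", "score": 1.0 if prediction == ground_truth else 0.0}


# Stub objects standing in for langsmith Example and Run instances.
fake_example = SimpleNamespace(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
)
cases = {
    "exact": SimpleNamespace(outputs={"answer": " paris "}),
    "verbose": SimpleNamespace(outputs={"answer": "The capital of France is Paris."}),
    "missing": SimpleNamespace(outputs=None),
}

for name, fake_run in cases.items():
    print(name, exact_match_evaluator(fake_run, fake_example)["score"])
```

The verbose case scores zero even though the answer is correct, which is the main reason to pair exact match with a semantic evaluator.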

## 📊 Evaluation Results Analysis

Let's create tools to analyze and visualize our evaluation results.

In [None]:
# Evaluation results analysis tools

class EvaluationAnalyzer:
    """Analyze and visualize evaluation results"""
    
    def __init__(self, client: Client):
        self.client = client
    
    def get_experiment_results(self, experiment_prefix: str) -> List[Dict]:
        """Get results from experiments with given prefix"""
        try:
            # LangSmith stores experiments as projects, so look them up with list_projects
            experiments = list(self.client.list_projects(
                name_contains=experiment_prefix
            ))
            
            matching_experiments = [
                exp for exp in experiments 
                if exp.name and exp.name.startswith(experiment_prefix)
            ]
            
            if not matching_experiments:
                print(f"No experiments found with prefix: {experiment_prefix}")
                return []
            
            # Get the most recent experiment (projects expose start_time, not created_at)
            latest_experiment = max(matching_experiments, key=lambda x: x.start_time)
            
            # Get experiment results from the root runs of that experiment
            results = []
            for run in self.client.list_runs(project_name=latest_experiment.name, is_root=True):
                if hasattr(run, 'feedback_stats') and run.feedback_stats:
                    results.append({
                        'run_id': run.id,
                        'feedback_stats': run.feedback_stats
                    })
            
            return results
            
        except Exception as e:
            print(f"Error retrieving experiment results: {e}")
            return []
    
    def calculate_summary_metrics(self, results: List[Dict]) -> Dict[str, float]:
        """Calculate summary metrics from evaluation results"""
        if not results:
            return {}
        
        metrics = {}
        metric_values = {}
        
        # Collect all metric values
        for result in results:
            feedback_stats = result.get('feedback_stats', {})
            for metric_name, metric_data in feedback_stats.items():
                if metric_name not in metric_values:
                    metric_values[metric_name] = []
                
                # Extract score based on metric data structure
                if isinstance(metric_data, dict):
                    score = metric_data.get('avg', metric_data.get('mean', 0))
                else:
                    score = metric_data
                
                metric_values[metric_name].append(score)
        
        # Calculate averages
        for metric_name, values in metric_values.items():
            if values:
                metrics[f"{metric_name}_avg"] = np.mean(values)
                metrics[f"{metric_name}_std"] = np.std(values)
                metrics[f"{metric_name}_min"] = np.min(values)
                metrics[f"{metric_name}_max"] = np.max(values)
        
        return metrics
    
    def create_comparison_report(self, experiment_prefixes: List[str]) -> pd.DataFrame:
        """Create a comparison report between different experiments"""
        comparison_data = []
        
        for prefix in experiment_prefixes:
            results = self.get_experiment_results(prefix)
            metrics = self.calculate_summary_metrics(results)
            
            if metrics:
                row = {'experiment': prefix, **metrics}
                comparison_data.append(row)
        
        if comparison_data:
            df = pd.DataFrame(comparison_data)
            return df
        else:
            return pd.DataFrame()
    
    def visualize_metrics(self, df: pd.DataFrame, metrics_to_plot: List[str]):
        """Create visualizations of evaluation metrics"""
        if df.empty:
            print("No data to visualize")
            return
        
        # Set up the plot style
        plt.style.use('default')
        fig, axes = plt.subplots(1, len(metrics_to_plot), figsize=(5 * len(metrics_to_plot), 6))
        
        if len(metrics_to_plot) == 1:
            axes = [axes]
        
        for i, metric in enumerate(metrics_to_plot):
            metric_col = f"{metric}_avg"
            error_col = f"{metric}_std"
            
            if metric_col in df.columns:
                x_pos = range(len(df))
                values = df[metric_col]
                errors = df.get(error_col, [0] * len(df))
                
                bars = axes[i].bar(x_pos, values, yerr=errors, capsize=5, alpha=0.8)
                axes[i].set_xlabel('Experiment')
                axes[i].set_ylabel(f'{metric.title()} Score')
                axes[i].set_title(f'{metric.title()} Comparison')
                axes[i].set_xticks(x_pos)
                axes[i].set_xticklabels(df['experiment'], rotation=45, ha='right')
                axes[i].set_ylim(0, 1.1)
                axes[i].grid(True, alpha=0.3)
                
                # Add value labels on bars
                for bar, value in zip(bars, values):
                    axes[i].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                               f'{value:.3f}', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()

# Initialize analyzer
analyzer = EvaluationAnalyzer(client)

# Create comparison report
print("📊 Analyzing Evaluation Results...\n")

# Attempt to create comparison reports
qa_experiments = ["qa-baseline", "qa-enhanced"]
sentiment_experiments = ["sentiment-simple", "sentiment-advanced"]

print("Q&A Systems Comparison:")
qa_comparison = analyzer.create_comparison_report(qa_experiments)
if not qa_comparison.empty:
    print(qa_comparison.to_string(index=False, float_format='%.3f'))
    
    # Visualize key metrics
    qa_metrics_to_plot = []
    for col in qa_comparison.columns:
        if col.endswith('_avg') and any(metric in col for metric in ['accuracy', 'semantic_similarity', 'helpfulness']):
            qa_metrics_to_plot.append(col.replace('_avg', ''))
    
    if qa_metrics_to_plot:
        print(f"\n📈 Visualizing metrics: {qa_metrics_to_plot}")
        analyzer.visualize_metrics(qa_comparison, qa_metrics_to_plot)
else:
    print("No Q&A comparison data available yet. Run the evaluations first.")

print("\n" + "="*50 + "\n")

print("Sentiment Classification Systems Comparison:")
sentiment_comparison = analyzer.create_comparison_report(sentiment_experiments)
if not sentiment_comparison.empty:
    print(sentiment_comparison.to_string(index=False, float_format='%.3f'))
    
    # Visualize key metrics
    sentiment_metrics_to_plot = []
    for col in sentiment_comparison.columns:
        if col.endswith('_avg') and 'classification' in col:
            sentiment_metrics_to_plot.append(col.replace('_avg', ''))
    
    if sentiment_metrics_to_plot:
        print(f"\n📈 Visualizing metrics: {sentiment_metrics_to_plot}")
        analyzer.visualize_metrics(sentiment_comparison, sentiment_metrics_to_plot)
else:
    print("No sentiment comparison data available yet. Run the evaluations first.")

print("\n✅ Analysis complete! Check your LangSmith dashboard for detailed evaluation results and comparisons.")
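
The aggregation step can be checked in isolation with hand-written input. The sketch below assumes each result carries a `feedback_stats` dictionary of the form `{metric: {"avg": ...}}`, which is what `calculate_summary_metrics` reads; the sample values are made up for illustration. Unlike the method above, this version skips entries without an `avg` value instead of counting them as zero.

```python
from statistics import mean, pstdev
from typing import Any, Dict, List


def summarize_feedback(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-run feedback scores into avg/std/n per metric."""
    values: Dict[str, List[float]] = {}
    for result in results:
        for name, data in (result.get("feedback_stats") or {}).items():
            score = data.get("avg") if isinstance(data, dict) else data
            if score is not None:
                values.setdefault(name, []).append(float(score))
    # pstdev is the population standard deviation, matching np.std's default
    return {
        name: {"avg": mean(v), "std": pstdev(v), "n": len(v)}
        for name, v in values.items()
    }


# Made-up sample data, for illustration only.
sample_results = [
    {"run_id": "a", "feedback_stats": {"accuracy": {"n": 1, "avg": 1.0}}},
    {"run_id": "b", "feedback_stats": {"accuracy": {"n": 1, "avg": 0.0}}},
]
print(summarize_feedback(sample_results))
```

Reporting `n` next to each average makes it visible when a metric was computed from only a handful of runs.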

## 🔄 Regression Testing Setup

Let's set up automated regression testing to ensure new changes don't break existing functionality.
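
At its core, a regression test compares aggregated metrics against fixed thresholds. The sketch below isolates that comparison as a pure function so it can be unit tested without running an evaluation. `check_thresholds` is a hypothetical helper that assumes metrics are keyed as `<name>_avg`. It also makes one deliberate choice you may not want: a metric that is missing from the results counts as a failure.

```python
from typing import Dict, Tuple


def check_thresholds(metrics: Dict[str, float],
                     thresholds: Dict[str, float]) -> Tuple[bool, Dict[str, dict]]:
    """Compare '<name>_avg' metrics against thresholds.

    A metric absent from `metrics` is treated as failed (an assumption of
    this sketch), so a silently skipped evaluator cannot pass the test.
    """
    checks = {}
    for name, threshold in thresholds.items():
        value = metrics.get(f"{name}_avg")
        checks[name] = {
            "value": value,
            "threshold": threshold,
            "passed": value is not None and value >= threshold,
        }
    return all(c["passed"] for c in checks.values()), checks


def test_helpfulness_regression_is_detected():
    # Made-up metric values, for illustration only.
    metrics = {"accuracy_avg": 0.8, "helpfulness_avg": 0.7}
    passed, checks = check_thresholds(metrics, {"accuracy": 0.7, "helpfulness": 0.75})
    assert not passed
    assert checks["accuracy"]["passed"]
    assert not checks["helpfulness"]["passed"]


test_helpfulness_regression_is_detected()
```

Written this way, the function can be collected by pytest or called directly from a CI script.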

In [None]:
# Regression testing framework

import pytest
from typing import Tuple
import subprocess
import sys

class RegressionTester:
    """Automated regression testing for LLM applications"""
    
    def __init__(self, client: Client):
        self.client = client
        self.baseline_thresholds = {
            "accuracy": 0.7,
            "semantic_similarity": 0.8,
            "helpfulness": 0.75,
            "classification_accuracy": 0.8
        }
    
    def run_regression_test(self, 
                          application: Callable,
                          dataset_name: str,
                          evaluators: List[Callable],
                          test_name: str,
                          baseline_thresholds: Optional[Dict[str, float]] = None) -> Tuple[bool, Dict]:
        """Run a regression test and check against thresholds"""
        
        thresholds = baseline_thresholds or self.baseline_thresholds
        
        try:
            # Run evaluation
            results = evaluate(
                application,
                data=dataset_name,
                evaluators=evaluators,
                experiment_prefix=f"regression-{test_name}",
                description=f"Regression test for {test_name}",
                max_concurrency=1
            )
            
            # Extract metrics
            experiment_results = analyzer.get_experiment_results(f"regression-{test_name}")
            metrics = analyzer.calculate_summary_metrics(experiment_results)
            
            # Check against thresholds
            passed_checks = {}
            overall_pass = True
            
            for metric_name, threshold in thresholds.items():
                metric_key = f"{metric_name}_avg"
                if metric_key in metrics:
                    value = metrics[metric_key]
                    passed = value >= threshold
                    passed_checks[metric_name] = {
                        "value": value,
                        "threshold": threshold,
                        "passed": passed
                    }
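                    if not passed:
                        overall_pass = False
            
            # If none of the threshold metrics were found, fail rather than pass vacuously
            if not passed_checks:
                overall_pass = False
            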
            return overall_pass, {
                "metrics": metrics,
                "checks": passed_checks,
                "test_name": test_name
            }
            
        except Exception as e:
            return False, {
                "error": str(e),
                "test_name": test_name
            }
    
    def create_regression_test_suite(self) -> str:
        """Create a pytest test suite for regression testing"""
        
        test_suite = '''
import pytest
import os
from langsmith import Client
from langsmith.evaluation import evaluate

# Import your applications and evaluators here
# from your_module import apps, evaluators

class TestRegressions:
    """Regression tests for LLM applications"""
    
    @pytest.fixture(autouse=True)
    def setup(self):
        self.client = Client()
        # Initialize your apps and evaluators here
    
    def test_qa_baseline_regression(self):
        """Test Q&A baseline system doesn't regress"""
        results = evaluate(
            # apps.baseline_qa_system,
            lambda x: {"answer": "test"},  # Placeholder
            data="qa-evaluation-advanced",
            evaluators=[],  # Your evaluators here
            experiment_prefix="regression-qa-baseline"
        )
        
        # Add assertions based on your thresholds
        # assert results["accuracy"] >= 0.7
        pass
    
    def test_sentiment_classification_regression(self):
        """Test sentiment classifier doesn't regress"""
        results = evaluate(
            # apps.simple_sentiment_classifier,
            lambda x: {"label": "neutral", "confidence": 0.5},  # Placeholder
            data="sentiment-classification-advanced",
            evaluators=[],  # Your evaluators here
            experiment_prefix="regression-sentiment-simple"
        )
        
        # Add assertions based on your thresholds
        # assert results["classification_accuracy"] >= 0.8
        pass
    
    def test_response_time_regression(self):
        """Test that response times don't regress significantly"""
        # Implement latency regression tests
        pass
    
    def test_cost_regression(self):
        """Test that costs don't increase unexpectedly"""
        # Implement cost regression tests
        pass

if __name__ == "__main__":
    pytest.main([__file__])
'''
        
        # Write test suite to file
        test_file_path = "test_regressions.py"
        with open(test_file_path, "w") as f:
            f.write(test_suite)
        
        return test_file_path
    
    def create_github_actions_workflow(self) -> str:
        """Create a GitHub Actions workflow for CI/CD integration"""
        
        workflow = '''
name: LLM Application Regression Tests

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  schedule:
    # Run regression tests daily at 2 AM UTC
    - cron: '0 2 * * *'

jobs:
  regression-tests:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest langsmith langchain langchain-openai
        # Add other dependencies as needed
    
    - name: Run regression tests
      env:
        LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
        LANGSMITH_PROJECT: ${{ secrets.LANGSMITH_PROJECT }}
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      run: |
        pytest test_regressions.py -v --tb=short --junitxml=test-results/junit.xml
    
    - name: Upload test results
      uses: actions/upload-artifact@v4
      if: always()
      with:
        name: regression-test-results
        path: test-results/
'''
        
        # Create .github/workflows directory if it doesn't exist
        os.makedirs(".github/workflows", exist_ok=True)
        
        workflow_file_path = ".github/workflows/regression-tests.yml"
        with open(workflow_file_path, "w") as f:
            f.write(workflow)
        
        return workflow_file_path

# Initialize regression tester
regression_tester = RegressionTester(client)

print("🔄 Setting up Regression Testing Framework...")

# Create test suite
test_file = regression_tester.create_regression_test_suite()
print(f"✅ Created regression test suite: {test_file}")

# Create GitHub Actions workflow
workflow_file = regression_tester.create_github_actions_workflow()
print(f"✅ Created GitHub Actions workflow: {workflow_file}")

# Run a sample regression test
print("\n🧪 Running Sample Regression Test...")

try:
    passed, results = regression_tester.run_regression_test(
        application=apps.baseline_qa_system,
        dataset_name="qa-evaluation-advanced",
        evaluators=[evaluators.accuracy_evaluator, evaluators.semantic_similarity_evaluator],
        test_name="qa-baseline-sample",
        baseline_thresholds={"accuracy": 0.5, "semantic_similarity": 0.6}  # Lower thresholds for demo
    )
    
    print(f"\n📊 Regression Test Results:")
    print(f"Overall Status: {'✅ PASSED' if passed else '❌ FAILED'}")
    
    # run_regression_test catches exceptions and reports them under "error"
    if "error" in results:
        print(f"Error: {results['error']}")
    
    if "checks" in results:
        print("\nDetailed Results:")
        for metric, check in results["checks"].items():
            status = "✅" if check["passed"] else "❌"
            print(f"  {status} {metric}: {check['value']:.3f} (threshold: {check['threshold']})")
    
except Exception as e:
    print(f"❌ Regression test failed: {e}")

print("\n🎯 Regression Testing Framework setup complete!")
print("\nNext steps:")
print("1. Customize the test_regressions.py file with your specific applications and thresholds")
print("2. Set up the required secrets in your GitHub repository")
print("3. Commit the workflow file to enable automatic regression testing")
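
The generated test file leaves `test_response_time_regression` as a stub. One possible way to fill it in is sketched below, assuming latency is measured as wall-clock time per call and compared against a stored p95 baseline with a tolerance. The helper names, the baseline value, and the 20% tolerance are all hypothetical choices for illustration.

```python
import time
from typing import Callable, Dict, List


def measure_latencies(app: Callable[[Dict], Dict], inputs: List[Dict]) -> List[float]:
    """Call the app once per input and record wall-clock seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        app(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def p95(latencies: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_regressed(
    latencies: List[float], baseline_p95_s: float, tolerance: float = 0.2
) -> bool:
    """True if the observed p95 exceeds the baseline by more than the tolerance."""
    return p95(latencies) > baseline_p95_s * (1 + tolerance)


def placeholder_app(inputs: Dict) -> Dict:
    # Stand-in for a real application, mirroring the placeholder in the test file
    return {"answer": "test"}


observed = measure_latencies(placeholder_app, [{"question": "a"}, {"question": "b"}])
# Hypothetical baseline of 0.5 seconds; replace with a value from a known-good run
print("regressed:", latency_regressed(observed, baseline_p95_s=0.5))
```

Inside a pytest test, the final check would become `assert not latency_regressed(...)`. A percentile is used rather than the mean so that a single slow call does not dominate the result.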

## 💡 Advanced Evaluation Patterns

Let's explore some advanced evaluation patterns and best practices.
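
One of these patterns, weighted multi-dimensional scoring, rests on two small steps: extracting the JSON object from the judge's reply and combining the per-dimension ratings. The sketch below shows both in isolation, without calling any model. `parse_judge_json`, `weighted_score`, and the sample reply are hypothetical; the sketch assumes the judge may wrap its JSON in extra prose and that out-of-range ratings should be clamped to [0, 1].

```python
import json
import re
from typing import Any, Dict

WEIGHTS = {
    "accuracy": 0.3,
    "completeness": 0.25,
    "clarity": 0.2,
    "relevance": 0.15,
    "conciseness": 0.1,
}


def parse_judge_json(text: str) -> Dict[str, Any]:
    """Extract the JSON object from a judge reply that may contain extra prose."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    return json.loads(match.group(0))


def weighted_score(scores: Dict[str, Any], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension ratings, each clamped to [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    total = 0.0
    for dim, weight in weights.items():
        value = max(0.0, min(1.0, float(scores.get(dim, 0.0))))
        total += value * weight
    return total / total_weight


# Hypothetical judge reply, for illustration only
reply = (
    "Here is my evaluation:\n"
    '{"accuracy": 0.9, "completeness": 0.8, "clarity": 1.0, '
    '"relevance": 1.0, "conciseness": 0.5, "reasoning": "Mostly correct."}'
)
parsed = parse_judge_json(reply)
score = weighted_score(parsed, WEIGHTS)
print(f"weighted score: {score:.3f}")
```

Dividing by the sum of the weights keeps the result in [0, 1] even if the weights are later edited and no longer sum to exactly 1.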

In [None]:
# Advanced evaluation patterns

class AdvancedEvaluationPatterns:
    """Advanced patterns for LLM evaluation"""
    
    def __init__(self, client: Client):
        self.client = client
        self.llm = ChatOpenAI(temperature=0.0, model="gpt-4")
    
    def multi_dimensional_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """Multi-dimensional evaluation with weighted scoring"""
        question = example.inputs.get("question", "")
        answer = (run.outputs or {}).get("answer", "")  # outputs is None if the run failed
        
        # Define evaluation dimensions with weights
        dimensions = {
            "accuracy": 0.3,
            "completeness": 0.25,
            "clarity": 0.2,
            "relevance": 0.15,
            "conciseness": 0.1
        }
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert evaluator. Assess the answer across multiple dimensions.
            
Rate each dimension on a scale from 0.0 to 1.0:
- Accuracy: How factually correct is the answer?
- Completeness: Does the answer fully address the question?
- Clarity: How clear and understandable is the answer?
- Relevance: How relevant is the answer to the question?
- Conciseness: Is the answer appropriately concise?

Respond in JSON format:
{{
    "accuracy": 0.0-1.0,
    "completeness": 0.0-1.0,
    "clarity": 0.0-1.0,
    "relevance": 0.0-1.0,
    "conciseness": 0.0-1.0,
    "reasoning": "detailed explanation"
}}"""),
            ("human", """Question: {question}

Answer: {answer}

Evaluate across all dimensions:""")
        ])
        
        try:
            response = self.llm.invoke(prompt.format_messages(question=question, answer=answer))
            result = json.loads(response.content)
            
            # Calculate weighted score
            weighted_score = sum(
                result.get(dim, 0) * weight 
                for dim, weight in dimensions.items()
            )
            
            return {
                "key": "multi_dimensional",
                "score": weighted_score,
                "comment": f"Weighted score: {weighted_score:.3f}. Reasoning: {result.get('reasoning', 'No reasoning provided')}"
            }
            
        except Exception as e:
            return {
                "key": "multi_dimensional",
                "score": 0.5,
                "comment": f"Evaluation failed: {str(e)}"
            }
    
    def consistency_evaluator(self, run: Run, example: Example) -> Dict[str, Any]:
        """Evaluate consistency by asking the same question multiple times"""
        question = example.inputs.get("question", "")
        original_answer = (run.outputs or {}).get("answer", "")  # outputs is None if the run failed
        
        # This would typically be run with multiple samples
        # For demo purposes, we'll simulate consistency checking
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """Assess how consistent this answer would likely be if the question were asked multiple times.
            
Consider:
- Is the answer based on factual information?
- Does it contain subjective elements that might vary?
- How deterministic is the reasoning?

Rate consistency from 0.0 to 1.0 where:
- 1.0: Highly consistent, factual answer
- 0.5: Moderately consistent, some variability expected
- 0.0: Highly inconsistent, subjective or random

Respond with just a number between 0.0 and 1.0"""),
            ("human", """Question: {question}
Answer: {answer}""")
        ])
        
        try:
            response = self.llm.invoke(prompt.format_messages(question=question, answer=original_answer))
            score = float(response.content.strip())
            score = max(0.0, min(1.0, score))  # Clamp to valid range
            
            return {
                "key": "consistency",
                "score": score,
                "comment": f"Estimated consistency score: {score:.3f}"
            }
            
        except Exception as e:
            return {
                "key": "consistency",
                "score": 0.5,
                "comment": f"Consistency evaluation failed: {str(e)}"
            }
    
    def robustness_test_generator(self, original_examples: List[Dict]) -> List[Dict]:
        """Generate adversarial examples for robustness testing"""
        adversarial_examples = []
        
        for example in original_examples:
            question = example["inputs"]["question"]
            
            # Generate variations
            variations = [
                # Paraphrasing
                f"Can you tell me: {question.lower()}",
                f"I'd like to know {question.lower()}",
                # Adding context
                f"Context: This is for educational purposes. Question: {question}",
                # Minor typos (simulate real user input)
                question.replace("the", "teh").replace("you", "yu"),
            ]
            
            for variation in variations:
                if variation != question:  # Avoid duplicates
                    adversarial_examples.append({
                        "inputs": {"question": variation},
                        "outputs": example["outputs"],
                        "metadata": {
                            "original_question": question,
                            "variation_type": "adversarial"
                        }
                    })
        
        return adversarial_examples
    
    def create_robustness_dataset(self, base_dataset_name: str, robustness_dataset_name: str):
        """Create a robustness testing dataset"""
        # Get original examples
        original_examples = []
        # This would typically fetch from the original dataset
        # For demo, we'll use our qa_examples
        original_examples = qa_examples
        
        # Generate adversarial examples
        adversarial_examples = self.robustness_test_generator(original_examples)
        
        # Create new dataset
        try:
            dataset = self.client.create_dataset(
                dataset_name=robustness_dataset_name,
                description="Robustness testing dataset with adversarial examples"
            )
            
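            self.client.create_examples(
                inputs=[ex["inputs"] for ex in adversarial_examples],
                outputs=[ex["outputs"] for ex in adversarial_examples],
                # Keep the variation metadata so each example can be traced to its source question
                metadata=[ex["metadata"] for ex in adversarial_examples],
                dataset_id=dataset.id
            )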
            
            return dataset.id
            
        except Exception as e:
            if "already exists" in str(e):
                datasets = list(self.client.list_datasets(dataset_name=robustness_dataset_name))
                return datasets[0].id if datasets else None
            raise

# Initialize advanced evaluation patterns
advanced_eval = AdvancedEvaluationPatterns(client)

print("🚀 Advanced Evaluation Patterns Demo\n")

# Create robustness dataset
print("📊 Creating Robustness Testing Dataset...")
robustness_dataset_id = advanced_eval.create_robustness_dataset(
    "qa-evaluation-advanced", 
    "qa-robustness-testing"
)
print(f"✅ Robustness dataset created: {robustness_dataset_id}")

# Run advanced evaluation
print("\n🧪 Running Advanced Multi-Dimensional Evaluation...")
try:
    advanced_results = evaluate(
        apps.enhanced_qa_system,
        data="qa-evaluation-advanced",
        evaluators=[
            advanced_eval.multi_dimensional_evaluator,
            advanced_eval.consistency_evaluator
        ],
        experiment_prefix="advanced-eval",
        description="Advanced multi-dimensional evaluation",
        max_concurrency=1
    )
    print("✅ Advanced evaluation completed")
    
except Exception as e:
    print(f"❌ Advanced evaluation failed: {e}")

print("\n🎯 Advanced Evaluation Patterns demonstration complete!")
print("\nKey takeaways:")
print("1. Multi-dimensional evaluation provides richer insights")
print("2. Consistency testing helps identify reliability issues")
print("3. Robustness testing with adversarial examples reveals edge cases")
print("4. Weighted scoring allows prioritizing different aspects")
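
The consistency evaluator in this section asks a judge model to estimate consistency from a single answer. A more direct alternative is to sample the application several times and measure how much the answers agree. The sketch below assumes plain string similarity from the standard library as a crude stand-in for semantic similarity; `sample_answers`, `pairwise_consistency`, and the stub application are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Dict, List


def sample_answers(app: Callable[[Dict], Dict], question: str, n: int = 3) -> List[str]:
    """Ask the same question n times and collect the answers."""
    return [app({"question": question})["answer"] for _ in range(n)]


def pairwise_consistency(answers: List[str]) -> float:
    """Mean pairwise string similarity in [0, 1]; 1.0 means identical answers."""
    if len(answers) < 2:
        raise ValueError("need at least two answers to measure consistency")
    ratios = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(answers, 2)
    ]
    return sum(ratios) / len(ratios)


def stub_app(inputs: Dict) -> Dict:
    # Deterministic stand-in; a real LLM application would vary between calls
    return {"answer": "Paris"}


answers = sample_answers(stub_app, "What is the capital of France?")
print(f"consistency: {pairwise_consistency(answers):.3f}")
```

Character-level similarity penalizes harmless rephrasing, so an embedding-based or judge-based comparison would likely be a better fit for free-form answers. The sampling structure stays the same either way.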

## 💡 Key Takeaways and Best Practices

### ✅ What You've Mastered

1. **Advanced Dataset Management**:
   - Multi-purpose dataset creation (Q&A, classification, summarization)
   - Structured data organization with metadata
   - Adversarial example generation for robustness testing

2. **Custom Evaluator Development**:
   - Exact match and semantic similarity evaluators
   - LLM-as-Judge patterns for complex assessments
   - Multi-dimensional weighted scoring systems
   - Confidence-aware evaluation metrics

3. **Comprehensive Evaluation Workflows**:
   - A/B testing and comparison frameworks
   - Regression testing and CI/CD integration
   - Automated threshold-based quality gates
   - Results visualization and analysis

4. **Production-Ready Patterns**:
   - GitHub Actions integration for continuous testing
   - Consistency and robustness evaluation
   - Automated reporting and alerting

### 🎯 Best Practices for Production

1. **Dataset Quality**:
   - Start small but representative (20-50 examples)
   - Continuously expand with production examples
   - Include edge cases and failure scenarios
   - Version control your datasets

2. **Evaluator Design**:
   - Use multiple complementary evaluators
   - Combine automated and human evaluation
   - Design domain-specific metrics
   - Validate evaluator quality with ground truth

3. **Testing Strategy**:
   - Run evaluations on every significant change
   - Set up automated regression testing
   - Monitor evaluation trends over time
   - Use staged rollouts with evaluation gates

4. **Continuous Improvement**:
   - Analyze failed cases to improve systems
   - Update datasets based on production feedback
   - Refine evaluators based on business needs
   - Share evaluation insights across teams
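
Validating an evaluator against ground truth, as recommended above, can start with a simple agreement check between the evaluator's verdicts and human labels on the same examples. The sketch below computes raw agreement and Cohen's kappa in plain Python. The function names and the label lists are hypothetical illustration data, not results from this notebook.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of examples on which the two raters give the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels (1 = acceptable answer, 0 = not acceptable)
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 0, 1]

print(f"agreement: {agreement_rate(human, judge):.3f}")
print(f"kappa:     {cohens_kappa(human, judge):.3f}")
```

Kappa is reported next to raw agreement because a judge that always answers "acceptable" can score high agreement on an imbalanced dataset while carrying no real signal.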

### 🔧 Advanced Tips

- **Sampling**: Use stratified sampling for balanced evaluation
- **Caching**: Cache expensive evaluations to speed up iterations
- **Parallelization**: Run evaluations in parallel for faster results
- **Versioning**: Track evaluator versions alongside model versions
- **Documentation**: Document evaluation metrics and their business meaning
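
As an illustration of the sampling tip, the sketch below draws a fixed number of examples from each group so that rare categories are not crowded out by common ones. It assumes each example carries a `metadata` dictionary with a grouping key; the `difficulty` key, the `stratified_sample` helper, and the example data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    examples: List[Dict], key: str, per_group: int, seed: int = 0
) -> List[Dict]:
    """Sample up to per_group examples from each metadata[key] group."""
    rng = random.Random(seed)  # fixed seed keeps the evaluation subset reproducible
    groups = defaultdict(list)
    for example in examples:
        groups[example["metadata"][key]].append(example)

    sample = []
    for label in sorted(groups):  # sorted for a stable group order
        members = groups[label]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical dataset: 10 easy questions and 3 hard ones
examples = [
    {"inputs": {"question": f"easy {i}"}, "metadata": {"difficulty": "easy"}}
    for i in range(10)
] + [
    {"inputs": {"question": f"hard {i}"}, "metadata": {"difficulty": "hard"}}
    for i in range(3)
]

sample = stratified_sample(examples, key="difficulty", per_group=5)
print(Counter := {d: sum(ex["metadata"]["difficulty"] == d for ex in sample) for d in ("easy", "hard")})
```

Groups smaller than `per_group` are included in full, so the sample here contains five easy and all three hard examples.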

## 🚀 What's Next?

You're now equipped to build robust evaluation pipelines! Continue to:

- **LSM-005: Prompt Engineering** - Master collaborative prompt development and version control
- **LSM-006: Production Monitoring** - Set up enterprise-grade monitoring and alerting
- **LSM-007: Advanced Patterns** - Explore complex use cases and integration patterns

---

**Ready to master collaborative prompt development?** Continue to **LSM-005: Prompt Engineering** to learn advanced prompt management and collaboration techniques! 🎨