# Lab 1: LLM Evaluation Fundamentals
**Duration**: 60 minutes

Welcome to the first lab of our LLM Evaluations Workshop! In this lab, you'll learn the fundamentals of evaluating Large Language Model outputs and understand why traditional software testing approaches don't work for LLMs.

## 🎯 Learning Objectives
By the end of this lab, you will:
- Understand why LLM evaluation is critical for production systems
- Learn core evaluation metrics (relevance, coherence, fluency, groundedness)
- Perform your first evaluation using the Azure AI Foundry SDK
- Recognize the non-deterministic nature of LLM outputs

## 📋 Prerequisites
- Completed setup verification notebook
- Azure OpenAI credentials configured
- Understanding of basic programming concepts
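
One quick way to confirm the credentials prerequisite is to check that the expected environment variables are set before running any cell. The sketch below uses variable names that appear in the cells later in this lab; `missing_env_vars` is a hypothetical helper, not part of the workshop utilities.

```python
import os

# Variable names taken from the cells later in this lab.
REQUIRED_VARS = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP_NAME",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {missing}")
else:
    print("All required environment variables are set")
```

Passing `env` explicitly makes the helper easy to try against a plain dictionary without touching the real environment.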

## Part 1: Introduction - The LLM Evaluation Problem (15 min)

### Why Traditional Testing Doesn't Work for LLMs

Traditional software testing relies on deterministic inputs and outputs:
```python
# Traditional function - always returns the same output
def add_numbers(a, b):
    return a + b

assert add_numbers(2, 3) == 5  # This will always pass
```

LLMs are different: they are probabilistic and can generate different outputs for the same input.
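
The variation comes from sampling: at each step the model draws the next token from a probability distribution, and a temperature setting controls how spread out that distribution is. The toy sketch below illustrates the idea with made-up tokens and scores; it is not how any particular model is implemented.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates for the next word, with made-up scores.
tokens = ["vibrant", "modern", "bustling"]
logits = [2.0, 1.5, 1.0]

rng = random.Random(0)
draws = [sample_token(tokens, logits, temperature=1.0, rng=rng) for _ in range(10)]
print(draws)  # the mix of tokens depends on the seed
```

With a very low temperature the highest-scoring token dominates and the draws become nearly deterministic.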

In [35]:
# Let's start by importing our dependencies
import sys
import os

# Add the project root to the path so we can import shared utilities
sys.path.append(os.path.dirname(os.getcwd()))

from shared_utils.azure_clients import azure_manager
from shared_utils.foundry_evaluation import foundry_runner
from lab1_evaluation_fundamentals.utils.lab1_helpers import (
    demonstrate_llm_variability, 
    print_evaluation_insights,
    DEMO_PROMPTS
)

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("✅ Dependencies imported successfully!")
print(f"Available demo prompts: {list(DEMO_PROMPTS.keys())}")

# Check AI Foundry integration status
foundry_status = foundry_runner.get_status_info()
if foundry_status['ai_foundry_available']:
    print("🏢 Azure AI Foundry integration: ENABLED")
    print("   ✅ Evaluation results will appear in the AI Foundry portal")
else:
    print("🔧 Azure AI Foundry integration: DISABLED")
    print("   ℹ️ To enable portal integration, add to .env:")
    print("      AZURE_AI_FOUNDRY_PROJECT_NAME, AZURE_AI_FOUNDRY_ENDPOINT, AZURE_AI_FOUNDRY_API_KEY")

✅ Dependencies imported successfully!
Available demo prompts: ['factual', 'explanatory', 'creative', 'analytical', 'instruction']
🏢 Azure AI Foundry integration: ENABLED
   ✅ Evaluation results will appear in the AI Foundry portal


### Demo: Same Prompt, Different Outputs

In [36]:
# Get Azure OpenAI client
client = azure_manager.get_openai_client()
deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')

# Demonstrate LLM variability with a factual question
test_prompt = DEMO_PROMPTS['factual']

print("🔬 EXPERIMENT: Demonstrating LLM Non-Determinism")
print("We'll ask the same question 3 times and observe the variations...\n")

responses = demonstrate_llm_variability(
    client=client,
    deployment_name=deployment_name,
    prompt=test_prompt,
    num_runs=3
)

print("\n🔍 OBSERVATION:")
print("Notice how each response is different, even though we asked the exact same question!")
print("This is why we can't use traditional assert statements for LLM testing.")

🔬 EXPERIMENT: Demonstrating LLM Non-Determinism
We'll ask the same question 3 times and observe the variations...

Running the same prompt 3 times to show variability...
Prompt: What is the capital of Japan and what is it known for?
--------------------------------------------------------------------------------
Run 1:
The capital of Japan is **Tokyo**.

**Tokyo** is known for several things, including:

- **Modernity and Technology:** It is renowned for its cutting-edge technology, vibrant cityscape, and innovative architecture.
- **Cultural Heritage:** Tokyo blends traditional culture with modern life, featuring historic temples and shrines, such as Senso-ji and Meiji Shrine.
- **Cuisine:** The city is famous for its diverse food scene, including sushi, ramen, and countless Michelin-starred restaurants.
- **Fashion and Pop Culture:** Tokyo is a global center for fashion, anime, manga, and youth culture, especially in areas like Harajuku and Akihabara.
- **Economic Power:** As one of 
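
Since exact-match assertions break when the wording varies, one alternative is to assert properties that every acceptable answer should share. The sketch below uses hypothetical response strings rather than real model output, and `contains_all` is an illustrative helper.

```python
def contains_all(text, required_terms):
    """True if every required term appears in the text, ignoring case."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in required_terms)


# Hypothetical responses to the same prompt, worded differently.
responses = [
    "The capital of Japan is Tokyo, known for technology and cuisine.",
    "Tokyo is Japan's capital. It is famous for its food and pop culture.",
    "Japan's capital city, Tokyo, blends tradition with modern life.",
]

# An exact-match check would fail because the strings all differ...
assert len(set(responses)) == 3

# ...but a property check passes for every variation.
for response in responses:
    assert contains_all(response, ["Tokyo", "Japan"])

print("All responses mention the required facts")
```

Keyword checks like this only cover facts that can be spotted by string matching, which is part of why model-graded metrics are introduced later in the lab.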

### Business Impact of Poor LLM Performance

Poor LLM performance can lead to:
- **Customer Dissatisfaction**: Irrelevant or unhelpful responses
- **Brand Risk**: Inconsistent or inappropriate content generation
- **Compliance Issues**: Incorrect information in regulated industries
- **Cost Overruns**: Using expensive models when cheaper ones would suffice
- **Security Risks**: Potential for harmful or biased outputs

## Part 2: Core Concepts - Understanding Evaluation Metrics (15 min)

### Quality Metrics

#### 1. Relevance
- **Definition**: How well does the response address the question?
- **Scale**: Typically 1-5 (1 = completely irrelevant, 5 = perfectly relevant)
- **Use Case**: Ensuring responses actually answer the user's question

#### 2. Coherence
- **Definition**: How clear and logically structured is the response?
- **Scale**: Typically 1-5 (1 = incoherent, 5 = perfectly coherent)
- **Use Case**: Ensuring responses are understandable and well-organized

#### 3. Fluency
- **Definition**: How natural and grammatically correct is the language?
- **Scale**: Typically 1-5 (1 = poor fluency, 5 = perfect fluency)
- **Use Case**: Ensuring professional, readable output

#### 4. Groundedness
- **Definition**: How well is the response supported by the provided context?
- **Scale**: Typically 1-5 (1 = not grounded, 5 = fully grounded)
- **Use Case**: Preventing hallucination in RAG applications
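
To make the 1-5 scale concrete, here is a minimal sketch of how a score could be represented and checked against a pass/fail cutoff. `MetricScore` is a hypothetical class, and the default threshold of 3 is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricScore:
    """One quality score on the 1-5 scale described above."""
    metric: str
    value: float

    def __post_init__(self):
        # Reject values outside the documented scale.
        if not 1 <= self.value <= 5:
            raise ValueError(f"{self.metric} score must be in [1, 5], got {self.value}")

    def passes(self, threshold=3.0):
        """Treat scores at or above the threshold as acceptable."""
        return self.value >= threshold


scores = [MetricScore("relevance", 4), MetricScore("groundedness", 2)]
for s in scores:
    status = "pass" if s.passes() else "fail"
    print(f"{s.metric}: {s.value} -> {status}")
```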

In [37]:
# Let's examine our sample data to understand the format
from shared_utils.evaluation_helpers import load_evaluation_data

# Load our sample Q&A pairs
sample_data = load_evaluation_data('data/sample_qa_pairs.jsonl')

print(f"📊 Loaded {len(sample_data)} sample Q&A pairs")
print("\n🔍 Sample data format:")

# Show the first example
example = sample_data[0]
print(f"Query: {example['query']}")
print(f"Response: {example['response']}")
print(f"Context: {example['context'][:100]}...")
print(f"Ground Truth: {example['ground_truth']}")

print("\n💡 Key Points:")
print("- Query: The question or prompt given to the LLM")
print("- Response: The LLM's answer")
print("- Context: Background information provided to the LLM")
print("- Ground Truth: The correct or expected answer (for reference)")

📊 Loaded 8 sample Q&A pairs

🔍 Sample data format:
Query: What is the capital of France?
Response: The capital of France is Paris.
Context: France is a country in Western Europe. Its capital and largest city is Paris, which is also the coun...
Ground Truth: Paris

💡 Key Points:
- Query: The question or prompt given to the LLM
- Response: The LLM's answer
- Context: Background information provided to the LLM
- Ground Truth: The correct or expected answer (for reference)
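
Before running an evaluation it can help to confirm that every record has the fields the evaluators will read. The sketch below parses JSONL text and reports incomplete records; treating `query` and `response` as required is an assumption, and both function names are hypothetical rather than part of `shared_utils`.

```python
import json

# Assumed minimum fields; context and ground_truth are treated as optional here.
REQUIRED_FIELDS = ("query", "response")


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_invalid_records(records):
    """Return (index, missing_fields) pairs for records lacking required fields."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


sample_text = """
{"query": "What is the capital of France?", "response": "The capital of France is Paris.", "context": "France is a country in Western Europe.", "ground_truth": "Paris"}
{"query": "What is machine learning?"}
"""

records = parse_jsonl(sample_text)
print(find_invalid_records(records))  # the second record has no response
```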


### Safety and Performance Metrics

Beyond quality, we also evaluate:

#### Safety Metrics
- **Harmful Content**: Detection of potentially harmful outputs
- **Bias Detection**: Identifying biased or discriminatory responses
- **Toxicity**: Measuring offensive or inappropriate content

#### Performance Metrics
- **Latency**: Response time
- **Token Usage**: Cost implications
- **Throughput**: Requests per second
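
Latency and throughput can be measured by timing the calls directly. The sketch below times a stand-in function instead of a real model call; `measure` and `fake_model` are hypothetical names, and token usage is left out because it depends on the metadata the service returns.

```python
import time


def measure(call, prompts):
    """Time each call and report mean latency and overall throughput."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        call(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "mean_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": len(latencies) / elapsed,
    }


def fake_model(prompt):
    """Stand-in for a real model call; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return prompt.upper()


stats = measure(fake_model, ["a", "b", "c"])
print(stats)
```

Because the calls run one after another, throughput here is roughly the inverse of mean latency; concurrent requests would change that relationship.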

## Part 3: Hands-On - Basic Evaluation (25 min)

Now let's perform our first LLM evaluation using the Azure AI Foundry SDK!

### Step 1: Set Up the Evaluation Environment (with optional AI Foundry integration)


In [38]:
# Check AI Foundry integration status
foundry_status = foundry_runner.get_status_info()
print("📊 EVALUATION ENVIRONMENT SETUP")
print("=" * 50)

if foundry_status['ai_foundry_available']:
    print("🏢 Azure AI Foundry Integration: ✅ ENABLED")
    print("   Your evaluation framework is configured for AI Foundry!")
    print("   📱 Portal: https://ai.azure.com")
    print("")
    print("   ℹ️ Current Status: azure-ai-projects v1.0.0 detected")
    print("   📊 Evaluations run locally with enhanced metadata")
    print("   🔮 Future versions will automatically sync to portal")
else:
    print("🔧 Azure AI Foundry Integration: ❌ DISABLED")
    print("   Evaluations will run locally (fully functional)")
    print("")
    print("💡 To enable AI Foundry integration, add to your .env file:")
    print("   AZURE_AI_FOUNDRY_PROJECT_NAME=your-project-name")
    print("   AZURE_AI_FOUNDRY_ENDPOINT=https://your-project.cognitiveservices.azure.com/")
    print("   AZURE_AI_FOUNDRY_API_KEY=your-api-key")
    print("")
    print("📚 How to get these values:")
    print("   1. Go to https://ai.azure.com")
    print("   2. Create or select an AI Foundry project")
    print("   3. Go to Project Settings → Keys and Endpoints")
    print("   4. Copy the endpoint URL and primary key")

print(f"\n🔧 Current execution mode: {foundry_status['portal_integration']}")
print("   Both modes provide identical evaluation capabilities!")
print("   🎯 The workshop works perfectly with or without AI Foundry integration")

📊 EVALUATION ENVIRONMENT SETUP
🏢 Azure AI Foundry Integration: ✅ ENABLED
   Your evaluation framework is configured for AI Foundry!
   📱 Portal: https://ai.azure.com

   ℹ️ Current Status: azure-ai-projects v1.0.0 detected
   📊 Evaluations run locally with enhanced metadata
   🔮 Future versions will automatically sync to portal

🔧 Current execution mode: enabled
   Both modes provide identical evaluation capabilities!
   🎯 The workshop works perfectly with or without AI Foundry integration


In [39]:
# Import evaluation components
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator

# Get model configuration for evaluators
model_config = azure_manager.get_model_config()

print("✅ Evaluation environment set up!")
print(f"Model config: {model_config['azure_deployment']}")

✅ Evaluation environment set up!
Model config: gpt-4.1


### Step 2: Create Evaluators

In [40]:
# Create individual evaluators
print("🔧 Creating evaluators...")

try:
    relevance_evaluator = RelevanceEvaluator(model_config=model_config)
    print("✅ Relevance evaluator created")
except Exception as e:
    print(f"❌ Failed to create relevance evaluator: {e}")
    relevance_evaluator = None

try:
    coherence_evaluator = CoherenceEvaluator(model_config=model_config)
    print("✅ Coherence evaluator created")
except Exception as e:
    print(f"❌ Failed to create coherence evaluator: {e}")
    coherence_evaluator = None

try:
    fluency_evaluator = FluencyEvaluator(model_config=model_config)
    print("✅ Fluency evaluator created")
except Exception as e:
    print(f"❌ Failed to create fluency evaluator: {e}")
    fluency_evaluator = None

try:
    groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
    print("✅ Groundedness evaluator created")
except Exception as e:
    print(f"❌ Failed to create groundedness evaluator: {e}")
    groundedness_evaluator = None

# Create evaluators dictionary (only include successful ones)
evaluators = {}
if relevance_evaluator:
    evaluators["relevance"] = relevance_evaluator
if coherence_evaluator:
    evaluators["coherence"] = coherence_evaluator
if fluency_evaluator:
    evaluators["fluency"] = fluency_evaluator
if groundedness_evaluator:
    evaluators["groundedness"] = groundedness_evaluator

print(f"\n📊 Successfully created {len(evaluators)} evaluators: {list(evaluators.keys())}")

🔧 Creating evaluators...
✅ Relevance evaluator created
✅ Coherence evaluator created
✅ Fluency evaluator created
✅ Groundedness evaluator created

📊 Successfully created 4 evaluators: ['relevance', 'coherence', 'fluency', 'groundedness']
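
The four try/except blocks above follow the same pattern, so they could be folded into a loop. This is a sketch of that idea using stand-in factories; `build_evaluators` is a hypothetical helper, and in the lab each factory would wrap a real evaluator class.

```python
def build_evaluators(factories):
    """Call each zero-argument factory and keep the ones that succeed."""
    built, errors = {}, {}
    for name, factory in factories.items():
        try:
            built[name] = factory()
        except Exception as exc:  # keep going so one failure does not block the rest
            errors[name] = str(exc)
    return built, errors


# Stand-in factories; in the lab these might look like
# lambda: RelevanceEvaluator(model_config=model_config).
def make_ok():
    return object()


def make_broken():
    raise RuntimeError("missing credentials")


built, errors = build_evaluators({"relevance": make_ok, "fluency": make_broken})
print(f"Created: {list(built)}")
print(f"Failed: {errors}")
```

Returning the errors alongside the successes keeps the failure reasons available for reporting instead of only printing them.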


### Step 3: Prepare Evaluation Data

In [41]:
# Take a small subset for our first evaluation
evaluation_data = sample_data[:3]  # Use first 3 examples

print(f"🔬 Running evaluation on {len(evaluation_data)} examples...")

# Show what we're evaluating
for i, item in enumerate(evaluation_data):
    print(f"\nExample {i+1}:")
    print(f"  Query: {item['query']}")
    print(f"  Response: {item['response'][:100]}...")
    print(f"  Has context: {'Yes' if item.get('context') else 'No'}")

🔬 Running evaluation on 3 examples...

Example 1:
  Query: What is the capital of France?
  Response: The capital of France is Paris....
  Has context: Yes

Example 2:
  Query: How do you calculate the area of a circle?
  Response: The area of a circle is calculated using the formula A = πr², where r is the radius of the circle....
  Has context: Yes

Example 3:
  Query: What is machine learning?
  Response: Machine learning is a subset of artificial intelligence that enables computers to learn and make dec...
  Has context: Yes


### Step 4: Run the Evaluation

In [None]:
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import evaluate, AzureAIProject

# Run the evaluation (with optional Azure AI Foundry integration)
print("🚀 Running evaluation...")
print("This may take a few minutes as we call the Azure OpenAI service for each metric...")

try:
    # Named project_client so it does not overwrite the Azure OpenAI `client` created earlier
    project_client = AIProjectClient(
        endpoint="https://aifoundry825233136833-resource.services.ai.azure.com/api/projects/aifoundry825233136833",
        credential=DefaultAzureCredential()
    )

    # Use the enhanced foundry evaluation runner
    # results = foundry_runner.run_evaluation(
    #     data=evaluation_data,
    #     evaluators=evaluators,
    #     run_name=f"Lab 1 Basic Evaluation - {len(evaluation_data)} items",
    #     description=f"Foundational evaluation with {len(evaluators)} quality metrics"
    # )
    proj = AzureAIProject(
        subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=os.environ["AZURE_RESOURCE_GROUP_NAME"],
        project_name=os.environ["AZURE_AI_FOUNDRY_PROJECT_NAME"],
    )
    results = evaluate(
        data='./data/sample_qa_pairs.jsonl',
        evaluators=evaluators,
        azure_ai_project=proj,  # evaluate() takes the project via azure_ai_project
    )
    print("✅ Evaluation completed successfully!")
    
    # # Show execution method and results
    # execution_method = results.get('_execution_method', 'unknown')
    
    # if execution_method == 'azure_ai_foundry_hybrid':
    #     print("🏢 Azure AI Foundry Integration: ACTIVE")
    #     print(f"   📁 Dataset uploaded to AI Foundry portal")
    #     print(f"   📊 Dataset ID: {results.get('_dataset_id', 'N/A')}")
    #     print(f"   📝 Dataset name: {results.get('_dataset_name', 'N/A')}")
    #     print(f"   👀 View at: https://ai.azure.com")
    #     print(f"   💡 {results.get('_note', 'N/A')}")
    #     if 'metrics' in results:
    #         print("📈 Evaluation metrics calculated successfully")
            
    # elif execution_method == 'azure_ai_foundry_ready':
    #     print("🏢 Evaluation completed with AI Foundry integration prepared")
    #     print(f"   📊 Run name: {results.get('_run_name', 'N/A')}")
    #     print(f"   📝 Description: {results.get('_description', 'N/A')}")
    #     print(f"   🔮 Note: {results.get('_note', 'N/A')}")
    #     if 'metrics' in results:
    #         print("📈 Metrics calculated successfully")
            
    # elif execution_method == 'azure_ai_foundry':
    #     print("🏢 Results submitted to Azure AI Foundry portal")
    #     print(f"   📊 Evaluation ID: {results.get('evaluation_id', 'N/A')}")
    #     print(f"   📁 Dataset ID: {results.get('dataset_id', 'N/A')}")
    #     print(f"   👀 View results at: {results.get('portal_url', 'https://ai.azure.com')}")
    #     print("   ⏳ Note: Results may take a few minutes to appear in the portal")
        
    # else:
    #     print("🔧 Results processed locally")
    #     print(f"📊 Evaluated {len(evaluation_data)} examples")
    #     if 'metrics' in results:
    #         print("📈 Metrics calculated successfully")

except Exception as e:
    print(f"❌ Evaluation failed: {e}")
    print("\n🔧 Troubleshooting tips:")
    print("1. Check your Azure OpenAI credentials in .env")
    print("2. Verify your deployment name is correct")
    print("3. Ensure you have quota available in your Azure subscription")
    print("4. Check if your endpoint is accessible")
    results = None

🚀 Running evaluation...
This may take a few minutes as we call the Azure OpenAI service for each metric...
❌ Evaluation failed: (UserError) Unable to load data from './data/lab1_basic_evaluation.json'. Supported formats are JSONL and CSV. Detailed error: Expected object or value.

🔧 Troubleshooting tips:
1. Check your Azure OpenAI credentials in .env
2. Verify your deployment name is correct
3. Ensure you have quota available in your Azure subscription
4. Check if your endpoint is accessible
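
The error above indicates that the data loader expects a JSONL or CSV file. When the data lives in a Python list, as `evaluation_data` does, one option is to write it to a JSONL file first and pass that path. A minimal sketch, with `write_jsonl` as a hypothetical helper and a temporary directory standing in for the lab's `data/` folder:

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, the layout JSONL readers expect."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


records = [
    {"query": "What is the capital of France?", "response": "The capital of France is Paris."},
    {"query": "What is machine learning?", "response": "A subset of artificial intelligence."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_jsonl(records, os.path.join(tmp_dir, "subset.jsonl"))
    # Read the file back to confirm each line is a standalone JSON object.
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(f"Wrote {len(lines)} lines")
```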


### Step 5: Interpret the Results

In [8]:
import json
import os
import ipykernel

# Print the kernel version (useful when debugging environment issues)
print(ipykernel.__version__)
results_file = "data/lab1_basic_evaluation.json"

if os.path.exists(results_file):
    print(f"📁 Loading results from {results_file}")
    with open(results_file, 'r') as f:
        results = json.load(f)
    print("✅ Results loaded successfully!")

    # Use our helper function to display insights
    print_evaluation_insights(results)

else:
    print(f"⚠️ Results file not found at {results_file}")
    print("Make sure you've run the evaluation in the previous step first.")
    print("If evaluation failed, that's okay - the key learning is understanding the process.")
    results = None

6.30.1
📁 Loading results from data/lab1_basic_evaluation.json
✅ Results loaded successfully!
🔍 EVALUATION INSIGHTS
📊 Overall Metrics:
  • relevance.relevance: 4.000
    → Excellent - Response directly addresses the question
  • relevance.gpt_relevance: 4.000
    → Excellent - Response directly addresses the question
  • relevance.relevance_threshold: 3.000
    → Good - Response is mostly relevant with minor issues
  • coherence.coherence: 4.000
    → Excellent - Response is very clear and well-structured
  • coherence.gpt_coherence: 4.000
    → Excellent - Response is very clear and well-structured
  • coherence.coherence_threshold: 3.000
    → Good - Response is mostly coherent with good flow
  • fluency.fluency: 3.667
    → Good - Response flows well with minor language issues
  • fluency.gpt_fluency: 3.667
    → Good - Response flows well with minor language issues
  • fluency.fluency_threshold: 3.000
    → Good - Response flows well with minor language issues
  • groundedness.groun

### Understanding Your Results

Let's break down what these numbers mean:

In [9]:
if results and 'metrics' in results:
    print("📚 UNDERSTANDING YOUR EVALUATION RESULTS")
    print("=" * 50)

    metrics = results['metrics']

    print("\n🎯 Overall Average Scores:")

    # Extract the main metric scores (not the thresholds or binary aggregates)
    main_metrics = {}
    for metric_name, score in metrics.items():
        if not any(x in metric_name for x in ['threshold', 'binary_aggregate', 'gpt_']):
            # Get the base metric name (e.g., 'relevance' from 'relevance.relevance')
            base_name = metric_name.split('.')[0]
            if base_name not in main_metrics:
                main_metrics[base_name] = score

    for metric_name, score in main_metrics.items():
        print(f"\n{metric_name.upper()}: {score:.3f}")

        if score >= 4.0:
            print("  📈 EXCELLENT - Your LLM responses are performing very well!")
        elif score >= 3.0:
            print("  📊 GOOD - Solid performance with room for improvement")
        elif score >= 2.0:
            print("  📉 FAIR - Some issues that should be addressed")
        else:
            print("  🚨 POOR - Significant improvement needed")

    # Show individual results
    if 'rows' in results:
        print(f"\n📋 Individual Results ({len(results['rows'])} examples):")
        print("=" * 40)

        for i, row in enumerate(results['rows']):
            print(f"\nExample {i+1}:")

            # Show the query
            query = row.get('inputs.query', 'Unknown query')
            print(f"  Query: {query}")

            # Show the scores for this example
            scores = {}
            for key, value in row.items():
                if key.startswith('outputs.') and key.endswith('.relevance'):
                    scores['Relevance'] = value
                elif key.startswith('outputs.') and key.endswith('.coherence'):
                    scores['Coherence'] = value
                elif key.startswith('outputs.') and key.endswith('.fluency'):
                    scores['Fluency'] = value
                elif key.startswith('outputs.') and key.endswith('.groundedness'):
                    scores['Groundedness'] = value

            print("  Scores:")
            for metric, score in scores.items():
                print(f"    • {metric}: {score:.1f}/5.0")

            # Show the reasoning if available
            reasons = {}
            for key, value in row.items():
                if '_reason' in key:
                    metric_name = key.split('.')[1].title()
                    reasons[metric_name] = value

            if reasons:
                print("  Key Insights:")
                for metric, reason in list(reasons.items())[:2]:  # Show first 2 to avoid clutter
                    print(f"    • {metric}: {reason[:80]}...")

    print("\n💡 Key Insights:")
    print("• Scores are on a 1-5 scale (5 being the best)")
    print("• Consistency across metrics indicates well-balanced responses")
    print("• Low groundedness might indicate hallucination issues")
    print("• Low relevance suggests the model isn't understanding the query well")

else:
    print("\n📚 CONCEPTUAL UNDERSTANDING")
    print("Even without live results, you now understand:")
    print("✅ How to set up Azure AI evaluation frameworks")
    print("✅ The key metrics for evaluating LLM responses")
    print("✅ How to interpret evaluation scores")
    print("✅ Why evaluation is critical for production LLM systems")

📚 UNDERSTANDING YOUR EVALUATION RESULTS

🎯 Overall Average Scores:

RELEVANCE: 4.000
  📈 EXCELLENT - Your LLM responses are performing very well!

COHERENCE: 4.000
  📈 EXCELLENT - Your LLM responses are performing very well!

FLUENCY: 3.667
  📊 GOOD - Solid performance with room for improvement

GROUNDEDNESS: 4.000
  📈 EXCELLENT - Your LLM responses are performing very well!

📋 Individual Results (3 examples):

Example 1:
  Query: What is the capital of France?
  Scores:
    • Relevance: 4.0/5.0
    • Coherence: 4.0/5.0
    • Fluency: 3.0/5.0
    • Groundedness: 4.0/5.0
  Key Insights:
    • Relevance: The response directly and accurately answers the query by stating that Paris is ...
    • Coherence: The response is fully coherent, directly answers the question, and is logically ...

Example 2:
  Query: How do you calculate the area of a circle?
  Scores:
    • Relevance: 4.0/5.0
    • Coherence: 4.0/5.0
    • Fluency: 4.0/5.0
    • Groundedness: 5.0/5.0
  Key Insights:
    • Relevanc
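
The overall score for each metric can be read as an average of the per-example scores. A minimal sketch of that aggregation, using hypothetical rows rather than the lab's actual results; the key follows the `outputs.<evaluator>.<metric>` pattern matched in the cell above, and `average_metric` is an illustrative helper.

```python
from statistics import mean


def average_metric(rows, key):
    """Average a per-row score, ignoring rows where it is missing."""
    values = [row[key] for row in rows if row.get(key) is not None]
    return mean(values) if values else None


# Hypothetical rows, not the lab's actual results.
rows = [
    {"outputs.fluency.fluency": 3},
    {"outputs.fluency.fluency": 4},
    {"outputs.fluency.fluency": 4},
]

print(f"{average_metric(rows, 'outputs.fluency.fluency'):.3f}")  # prints 3.667
```

A single low score moves the average noticeably on a small sample, so per-example results are worth reading alongside the aggregate.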

In [10]:
if results:
    # Use our helper function to display insights
    print_evaluation_insights(results)
    
    # Save results for later analysis
    from lab1_evaluation_fundamentals.utils.lab1_helpers import save_lab1_results
    save_lab1_results(results, "lab1_basic_evaluation.json")
    
else:
    print("⚠️ No results to display - evaluation may have failed.")
    print("Don't worry! This is common in workshop environments.")
    print("The key learning is understanding the evaluation process.")

🔍 EVALUATION INSIGHTS
📊 Overall Metrics:
  • relevance.relevance: 4.000
    → Excellent - Response directly addresses the question
  • relevance.gpt_relevance: 4.000
    → Excellent - Response directly addresses the question
  • relevance.relevance_threshold: 3.000
    → Good - Response is mostly relevant with minor issues
  • coherence.coherence: 4.000
    → Excellent - Response is very clear and well-structured
  • coherence.gpt_coherence: 4.000
    → Excellent - Response is very clear and well-structured
  • coherence.coherence_threshold: 3.000
    → Good - Response is mostly coherent with good flow
  • fluency.fluency: 3.667
    → Good - Response flows well with minor language issues
  • fluency.gpt_fluency: 3.667
    → Good - Response flows well with minor language issues
  • fluency.fluency_threshold: 3.000
    → Good - Response flows well with minor language issues
  • groundedness.groundedness: 4.000
    → Excellent - Response is fully supported by the provided context
  • grou

## Part 4: Exploring Different Types of Responses (Bonus)

In [11]:
# Let's create some intentionally different quality responses to see how evaluation works
test_cases = [
    {
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris. Paris is located in the north-central part of France and is the country's largest city.",
        "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
        "expected_quality": "High - Direct, accurate, relevant"
    },
    {
        "query": "What is the capital of France?",
        "response": "Well, France has many beautiful cities, and I think you might be interested in learning about French culture and cuisine. The Eiffel Tower is very famous.",
        "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
        "expected_quality": "Low - Doesn't answer the question"
    },
    {
        "query": "What is the capital of France?",
        "response": "The capital is Paris and also France has a population of 67 million people living in cities like Lyon and Marseille which are also important economic centers.",
        "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
        "expected_quality": "Medium - Correct but adds irrelevant info"
    }
]

print("🧪 EXPERIMENT: Comparing Response Quality")
print("Let's see how different response qualities are scored...\n")

for i, case in enumerate(test_cases):
    print(f"Test Case {i+1}: {case['expected_quality']}")
    print(f"Response: {case['response']}")
    print("-" * 60)

🧪 EXPERIMENT: Comparing Response Quality
Let's see how different response qualities are scored...

Test Case 1: High - Direct, accurate, relevant
Response: The capital of France is Paris. Paris is located in the north-central part of France and is the country's largest city.
------------------------------------------------------------
Test Case 2: Low - Doesn't answer the question
Response: Well, France has many beautiful cities, and I think you might be interested in learning about French culture and cuisine. The Eiffel Tower is very famous.
------------------------------------------------------------
Test Case 3: Medium - Correct but adds irrelevant info
Response: The capital is Paris and also France has a population of 67 million people living in cities like Lyon and Marseille which are also important economic centers.
------------------------------------------------------------


In [12]:
# If we have working evaluators, let's test these cases
if evaluators:
    print("🔬 Evaluating different response qualities...\n")
    
    try:
        # Use the enhanced foundry evaluation runner for comparison
        comparison_results = foundry_runner.run_evaluation(
            data=test_cases,
            evaluators=evaluators,
            run_name="Lab 1 Response Quality Comparison",
            description="Comparing high, medium, and low quality responses"
        )
        
        print("📊 COMPARISON RESULTS:")
        print("=" * 40)
        
        # Show execution method for comparison run
        execution_method = comparison_results.get('_execution_method', 'unknown')
        if execution_method == 'azure_ai_foundry_hybrid':
            print("🏢 Comparison dataset uploaded to AI Foundry")
            print(f"   📁 Dataset ID: {comparison_results.get('_dataset_id', 'N/A')}")
        
        if 'rows' in comparison_results:
            for i, (case, row) in enumerate(zip(test_cases, comparison_results['rows'])):
                print(f"\nCase {i+1}: {case['expected_quality']}")
                
                # Show the scores for this example
                scores = {}
                for key, value in row.items():
                    if key.startswith('outputs.') and key.endswith('.relevance'):
                        scores['Relevance'] = value
                    elif key.startswith('outputs.') and key.endswith('.coherence'):
                        scores['Coherence'] = value
                    elif key.startswith('outputs.') and key.endswith('.fluency'):
                        scores['Fluency'] = value
                    elif key.startswith('outputs.') and key.endswith('.groundedness'):
                        scores['Groundedness'] = value

                print("  Scores:")
                for metric, score in scores.items():
                    print(f"    • {metric}: {score:.1f}/5.0")
                print("-" * 30)
        
        print("\n💡 Notice how the scores reflect the expected quality differences!")
        
        if execution_method in ['azure_ai_foundry_hybrid', 'azure_ai_foundry']:
            print("🏢 Both evaluation datasets are now available in AI Foundry portal")
        
    except Exception as e:
        print(f"Comparison evaluation failed: {e}")
        print("This is common in workshop environments - the concept is what matters!")
else:
    print("⚠️ Evaluators not available for comparison test")
    print("But you can imagine how different quality responses would score differently!")

🔬 Evaluating different response qualities...

🏢 Running evaluation through Azure AI Foundry...
   ✅ Results will appear in the AI Foundry portal
🏢 Running evaluation with Azure AI Foundry integration...
📤 Uploading dataset to AI Foundry: evaluation_dataset_756063988765806918
✅ Dataset uploaded successfully: azureai://accounts/aifoundry825233136833-resource/projects/aifoundry825233136833/data/evaluation_dataset_756063988765806918/versions/1.0
📊 Running evaluation locally (AI Foundry evaluation API not yet available)
2025-08-24 14:48:42 -0500 6251540480 execution.bulk     INFO     Finished 1 / 3 lines.
2025-08-24 14:48:42 -0500 6251540480 execution.bulk     INFO     Average execution time for completed lines: 1.27 seconds. Estimated time for incomplete lines: 2.54 seconds.
2025-08-24 14:48:42 -0500 6251540480 execution.bulk     INFO     Finished 2 / 3 lines.
2025-08-24 14:48:42 -0500 6251540480 execution.bulk     INFO     Average execution time for completed lines: 0.69 seconds. Estimate
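
One way to turn the expectation that scores reflect quality into a check is to assert the ordering of the scores rather than their exact values. The scores below are hypothetical placeholders, not results from this run, and `is_ranked_as_expected` is an illustrative helper.

```python
def is_ranked_as_expected(scores, expected_order):
    """True if scores strictly decrease along the expected best-to-worst order."""
    ordered = [scores[name] for name in expected_order]
    return all(a > b for a, b in zip(ordered, ordered[1:]))


# Hypothetical relevance scores for the three test cases.
relevance = {"high": 5.0, "medium": 4.0, "low": 2.0}

print(is_ranked_as_expected(relevance, ["high", "medium", "low"]))
```

Comparing relative order tolerates small run-to-run changes in the judge model's scores, which an exact-value assertion would not.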

## Part 5: Wrap-up and Key Takeaways (5 min)

### 🎯 What You've Learned

1. **Why LLM Evaluation Matters**
   - LLMs are non-deterministic (same input ≠ same output)
   - Traditional exact-match testing approaches don't work
   - Quality directly impacts business outcomes

2. **Core Evaluation Metrics**
   - **Relevance**: Does it answer the question?
   - **Coherence**: Is it clear and logical?
   - **Fluency**: Is the language natural?
   - **Groundedness**: Is it supported by context?

3. **Practical Skills**
   - Set up Azure AI evaluation environment
   - Create and configure evaluators
   - Run evaluations on sample data
   - Interpret evaluation results

### 🚀 Next Steps

In **Lab 2**, you'll learn how to:
- Scale evaluations to larger datasets
- Generate synthetic evaluation data
- Compare multiple models systematically
- Analyze cost vs. quality trade-offs

### 📝 Practice Exercise (Optional)

Try evaluating some of your own prompts and responses using the techniques you've learned!

In [13]:
# Practice area - try your own evaluation!
your_test_data = [
    {
        "query": "YOUR QUESTION HERE",
        "response": "YOUR LLM RESPONSE HERE",
        "context": "ANY RELEVANT CONTEXT HERE (optional)"
    }
    # Add more test cases as needed
]

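# Uncomment and run to evaluate your own data.
# evaluate() expects a JSONL/CSV file path, so use the workshop runner for in-memory lists.
# if evaluators:
#     your_results = foundry_runner.run_evaluation(
#         data=your_test_data,
#         evaluators=evaluators,
#         run_name="Lab 1 Practice Evaluation",
#         description="Custom prompts and responses"
#     )
#     print_evaluation_insights(your_results)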

print("🎉 Congratulations! You've completed Lab 1: LLM Evaluation Fundamentals")
print("\n📚 Ready for Lab 2? Let's scale up your evaluation capabilities!")

🎉 Congratulations! You've completed Lab 1: LLM Evaluation Fundamentals

📚 Ready for Lab 2? Let's scale up your evaluation capabilities!


---

## 🆘 Troubleshooting

**Common Issues and Solutions:**

1. **Import Errors**: 
   - Run `pip install -r requirements.txt`
   - Check that you're in the correct virtual environment

2. **Azure Connection Issues**:
   - Verify your `.env` file has correct credentials
   - Check your Azure OpenAI deployment is active
   - Ensure you have sufficient quota

3. **Evaluation Failures**:
   - This is common in workshop environments
   - The learning objective is understanding the process
   - Try with smaller datasets or single evaluators

4. **Rate Limiting**:
   - Add delays between requests
   - Use smaller batch sizes
   - Check your Azure OpenAI tier limits
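
For the rate-limiting case, adding delays between requests is often done with exponential backoff. The sketch below retries a stand-in function and records the waits instead of sleeping; `call_with_backoff` is a hypothetical helper, and real code would likely catch only the client library's rate-limit error rather than every exception.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn on any exception, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))


# Stand-in for a rate-limited call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits = []
# Record the waits instead of actually sleeping so the example runs instantly.
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

Injecting the `sleep` function keeps the retry logic easy to exercise without real delays.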

**Need Help?**
- Check the troubleshooting guide: `docs/troubleshooting.md`
- Open an issue in the GitHub repository
- Review Azure AI documentation

In [14]:
# Let's examine the current values to understand the issue
print("Current evaluation_data structure:")
if 'evaluation_data' in locals():
    print(f"Type: {type(evaluation_data)}")
    print(f"Length: {len(evaluation_data)}")
    print(f"First item keys: {list(evaluation_data[0].keys()) if evaluation_data else 'No data'}")
else:
    print("evaluation_data not defined")

Current evaluation_data structure:
Type: <class 'list'>
Length: 3
First item keys: ['query', 'response', 'context', 'ground_truth']


In [15]:
# Check what's currently imported from lab1_helpers
from lab1_evaluation_fundamentals.utils.lab1_helpers import print_evaluation_insights
print("✅ print_evaluation_insights imported successfully")

✅ print_evaluation_insights imported successfully
