# Groundedness Evaluator

## Objective
This sample demonstrates how to use the Groundedness evaluator to assess whether AI-generated responses are grounded in the provided context. The evaluator supports multiple input formats including:
- Simple response and context evaluation
- Query, response, and context evaluation
- Agent responses with tool calls (file_search)
- Multi-turn conversations

This notebook uses examples consistent with the ToolCallAccuracyEvaluator sample (weather and email scenarios) to make comparison across evaluators easier.

## Time

You should expect to spend about 20 minutes running this notebook.

## Before you begin
For quality evaluation, you need to deploy a `gpt` model that supports JSON mode. We recommend `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.

### Prerequisite
```bash
pip install azure-ai-evaluation
```
Set these environment variables with your own values:
1) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator
2) **AZURE_OPENAI_ENDPOINT** - Azure OpenAI Endpoint to be used for evaluation
3) **AZURE_OPENAI_API_KEY** - Azure OpenAI Key to be used for evaluation
4) **AZURE_OPENAI_API_VERSION** - Azure OpenAI API version to be used for evaluation
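
Before running the cells below, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of `azure-ai-evaluation`.

```python
import os

# The four variables listed above.
REQUIRED_VARS = (
    "MODEL_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for this notebook; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Failing early here gives a clearer message than the `KeyError` that `os.environ[...]` would raise during model configuration.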

## What is Groundedness?

The Groundedness evaluator assesses the correspondence between claims in an AI-generated response and the source context. It checks that responses are substantiated by the provided context, which helps detect hallucinations and unsupported claims.

**Key Points:**
- Even factually correct responses are considered ungrounded if they can't be verified against the provided context
- Essential for RAG (Retrieval-Augmented Generation) applications
- Helps ensure AI responses are trustworthy and verifiable

**Scoring:** Groundedness scores range from 1 to 5, with:
- **1**: Completely ungrounded - no claims supported by context
- **2**: Mostly ungrounded - few claims supported
- **3**: Partially grounded - some claims supported
- **4**: Mostly grounded - most claims supported
- **5**: Fully grounded - all claims supported by context
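
When reporting results, the numeric score can be translated back into the labels from this scale. The sketch below is illustrative only; `SCORE_LABELS` and `describe_score` are hypothetical names, and the labels are copied from the list above.

```python
# Labels taken from the 1-5 scale described above.
SCORE_LABELS = {
    1: "Completely ungrounded",
    2: "Mostly ungrounded",
    3: "Partially grounded",
    4: "Mostly grounded",
    5: "Fully grounded",
}


def describe_score(score):
    """Map a groundedness score (1 to 5) to its label.

    Accepts ints or whole-number floats such as 4.0, since the evaluator
    outputs in this notebook report scores as floats.
    """
    level = int(score)
    if level != score or level not in SCORE_LABELS:
        raise ValueError(f"expected a whole-number score from 1 to 5, got {score!r}")
    return SCORE_LABELS[level]
```

For example, `describe_score(4.0)` returns `"Mostly grounded"`.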

## Groundedness Evaluator Input Requirements

The Groundedness evaluator supports multiple input formats:

1. **Basic Context Evaluation:**
   - `response`: The AI response to evaluate (str)
   - `context`: The source context/documents (str)
   - `query`: Optional query for enhanced evaluation (str)

2. **Agent Tool Evaluation:**
   - `query`: The user query (str)
   - `response`: Agent response with tool calls (List[dict])
   - `tool_definitions`: Definitions of the available tools; only `file_search` is supported (List[dict])

3. **Conversation Evaluation:**
   - `conversation`: Multi-turn conversation with context (Conversation object; sample 5 passes a dict with `context` and `messages` keys)
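
The three formats differ only in which keyword arguments are passed. As a rough illustration, the sketch below names the format that a given set of arguments corresponds to. `detect_input_format` is a hypothetical helper written for this notebook; it does not reproduce the evaluator's own validation logic, which may accept or reject combinations differently.

```python
def detect_input_format(**kwargs):
    """Name the documented input format matching the given keyword arguments.

    Hypothetical helper: it mirrors the three formats listed above and
    ignores arguments whose value is None.
    """
    provided = {key for key, value in kwargs.items() if value is not None}

    # Format 3: a conversation is passed on its own.
    if "conversation" in provided:
        if provided != {"conversation"}:
            raise ValueError("pass `conversation` without other inputs")
        return "conversation"

    # Format 2: agent response plus tool definitions.
    if {"query", "response", "tool_definitions"} <= provided:
        return "agent_tool"

    # Format 1: response and context, with an optional query.
    if {"response", "context"} <= provided:
        return "basic_context"

    raise ValueError(f"unrecognized combination of inputs: {sorted(provided)}")
```

Calling `detect_input_format(response="...", context="...")` returns `"basic_context"`, and adding `query` does not change that result.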

### Initialize Groundedness Evaluator

In [2]:
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration
from pprint import pprint

# Configure the model
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

# Initialize the evaluator
groundedness_evaluator = GroundednessEvaluator(model_config=model_config)

## Sample Evaluations

### 1. Well-Grounded Weather Response
This example uses weather information consistent with the ToolCallAccuracyEvaluator examples.

In [3]:
# Example of a well-grounded response using weather context
context = """Current weather data for Seattle shows rainy conditions with a temperature of 14°C. 
The forecast indicates overcast skies with light precipitation typical for Pacific Northwest weather. 
Humidity is at 85% with winds from the southwest at 12 mph. Visibility is reduced to 8 miles due to rain."""

response = "The current weather in Seattle is rainy with a temperature of 14°C. It's typical Pacific Northwest weather for this time of year with overcast skies and light precipitation."

result = groundedness_evaluator(response=response, context=context)
print("=== Well-Grounded Weather Response ===")
pprint(result)

=== Well-Grounded Weather Response ===
{'gpt_groundedness': 4.0,
 'groundedness': 4.0,
 'groundedness_completion_tokens': 156,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-2025-04-14',
 'groundedness_prompt_tokens': 1205,
 'groundedness_reason': 'The response is correct and covers most key points '
                        'but leaves out some specific details (humidity, wind, '
                        'visibility) present in the context.',
 'groundedness_result': 'pass',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"response\\": '
                              '\\"The current weather in Seattle is rainy with '
                              "a temperature of 14\\\\u00b0C. It's typical "
                              'Pacific Northwest weather for this time of year '
                              'with overcast skies and light '
                              'precipitation.\\", \\"context\\": \\"Current '
                              'weather

### 2. Partially Grounded Response (with hallucination)

In [4]:
# Example of partially grounded response with unsupported claims
context = """Current weather data for Seattle shows rainy conditions with a temperature of 14°C. 
The forecast indicates overcast skies with light precipitation typical for Pacific Northwest weather."""

response = "The current weather in Seattle is rainy with a temperature of 14°C. The city is experiencing its wettest month in 50 years, and the mayor has declared a weather emergency due to flooding concerns."

result = groundedness_evaluator(response=response, context=context)
print("=== Partially Grounded Response ===")
pprint(result)

=== Partially Grounded Response ===
{'gpt_groundedness': 2.0,
 'groundedness': 2.0,
 'groundedness_completion_tokens': 139,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-2025-04-14',
 'groundedness_prompt_tokens': 1185,
 'groundedness_reason': 'The response includes accurate information from the '
                        'context but also introduces incorrect and unsupported '
                        'details, making it unreliable.',
 'groundedness_result': 'fail',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"response\\": '
                              '\\"The current weather in Seattle is rainy with '
                              'a temperature of 14\\\\u00b0C. The city is '
                              'experiencing its wettest month in 50 years, and '
                              'the mayor has declared a weather emergency due '
                              'to flooding concerns.\\", \\"context\\": '
                              '\\

### 3. Enhanced Evaluation with Weather Query

In [5]:
# Example with query for enhanced evaluation - consistent with ToolCallAccuracyEvaluator
query = "How is the weather in Seattle?"

context = """Weather report for Seattle, Washington: Currently experiencing rainy weather with temperature at 14°C. 
Overcast conditions with light rain are expected to continue. The current conditions are typical for the Pacific Northwest region 
during this season. Wind speed is moderate at 12 mph from southwest direction."""

response = "The weather in Seattle is rainy with a temperature of 14°C. These are typical Pacific Northwest conditions with overcast skies."

result = groundedness_evaluator(response=response, context=context, query=query)
print("=== Enhanced Evaluation with Weather Query ===")
pprint(result)

=== Enhanced Evaluation with Weather Query ===
{'gpt_groundedness': 4.0,
 'groundedness': 4.0,
 'groundedness_completion_tokens': 157,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-2025-04-14',
 'groundedness_prompt_tokens': 1412,
 'groundedness_reason': 'The response is accurate and relevant, but it is '
                        'missing the wind speed and direction mentioned in the '
                        'context, making it slightly incomplete.',
 'groundedness_result': 'pass',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"query\\": '
                              '\\"How is the weather in Seattle?\\", '
                              '\\"response\\": \\"The weather in Seattle is '
                              'rainy with a temperature of 14\\\\u00b0C. These '
                              'are typical Pacific Northwest conditions with '
                              'overcast skies.\\", \\"context\\": \\"Weather '
                       

### 4. Agent Response with File Search Tool
This example evaluates an agent response that uses the `file_search` tool to retrieve weather information, in keeping with the weather and email theme.

In [6]:
# Example of agent evaluation with file_search tool for weather information
query = "Can you get me the current weather information for Seattle?"

# Simulated agent response with file_search tool call - similar to ToolCallAccuracyEvaluator pattern
agent_response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_filesearch_weather_123",
                "name": "file_search",
                "arguments": {"query": "current weather Seattle temperature conditions"}
            }
        ]
    },
    {
        "role": "tool",
        "tool_call_id": "call_filesearch_weather_123", 
        "content": [
            {
                "type": "tool_result",
                "tool_result": {
                    "content": "Seattle weather report: Currently rainy with temperature of 14°C. Overcast skies with light precipitation. Typical Pacific Northwest weather with 85% humidity and southwest winds at 12 mph."
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "Based on the weather data, Seattle is currently experiencing rainy weather with a temperature of 14°C. The conditions include overcast skies and light precipitation, which is typical for the Pacific Northwest."
    }
]

tool_definitions = [
    {
        "name": "file_search",
        "description": "Search through uploaded files to find relevant information",
        "parameters": {
            "type": "object", 
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            }
        }
    }
]

result = groundedness_evaluator(
    query=query, 
    response=agent_response, 
    tool_definitions=tool_definitions
)
print("=== Agent Response with File Search ===")
pprint(result)

=== Agent Response with File Search ===
{'gpt_groundedness': 5.0,
 'groundedness': 5.0,
 'groundedness_completion_tokens': 183,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-2025-04-14',
 'groundedness_prompt_tokens': 1435,
 'groundedness_reason': 'The response is fully relevant, complete, and '
                        'directly answers the query with specific weather '
                        'information for Seattle, as would be expected from a '
                        'tool call.',
 'groundedness_result': 'pass',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"query\\": '
                              '\\"Can you get me the current weather '
                              'information for Seattle?\\", \\"response\\": '
                              '[{\\"role\\": \\"assistant\\", \\"content\\": '
                              '[{\\"type\\": \\"tool_call\\", '
                              '\\"tool_call_id\\": '
                              

### 5. Conversation Evaluation
Evaluating groundedness in multi-turn conversations about weather and email requests.

In [7]:
# Example of conversation evaluation - using weather and email theme for consistency
conversation = {
    "context": "Weather data shows Seattle currently has rainy conditions at 14°C with overcast skies. London shows cloudy weather at 8°C with partly cloudy conditions. Both cities are experiencing typical seasonal weather patterns.",
    "messages": [
        {
            "role": "user",
            "content": "Can you check the weather in Seattle for me?"
        },
        {
            "role": "assistant",
            "content": "According to the current weather data, Seattle is experiencing rainy conditions with a temperature of 14°C and overcast skies."
        },
        {
            "role": "user",
            "content": "How does that compare to London?"
        },
        {
            "role": "assistant",
            "content": "London is currently cloudier but drier than Seattle, with a temperature of 8°C and partly cloudy conditions. Seattle is warmer but rainier at 14°C."
        },
        {
            "role": "user", 
            "content": "Can you email me a summary of both cities' weather?"
        },
        {
            "role": "assistant",
            "content": "I can provide you with a weather summary: Seattle has rainy weather at 14°C with overcast skies, while London has partly cloudy conditions at 8°C. However, I would need email access to send this information to you."
        }
    ]
}

result = groundedness_evaluator(conversation=conversation)
print("=== Conversation Evaluation ===")
pprint(result)

=== Conversation Evaluation ===
{'evaluation_per_turn': {'gpt_groundedness': [5.0, 5.0, 5.0],
                         'groundedness': [5.0, 5.0, 5.0],
                         'groundedness_completion_tokens': [107, 161, 130],
                         'groundedness_finish_reason': ['stop', 'stop', 'stop'],
                         'groundedness_model': ['gpt-4.1-2025-04-14',
                                                'gpt-4.1-2025-04-14',
                                                'gpt-4.1-2025-04-14'],
                         'groundedness_prompt_tokens': [1402, 1409, 1427],
                         'groundedness_reason': ['The response is fully '
                                                 'correct and complete, '
                                                 'directly reflecting all the '
                                                 'information provided in the '
                                                 "context about Seattle's "
                     

### 6. Email Context Evaluation
Evaluating groundedness when dealing with email-related responses, consistent with ToolCallAccuracyEvaluator.

In [8]:
# Example with email context - consistent with ToolCallAccuracyEvaluator email scenarios
query = "Can you send me an email with weather information for Seattle?"

context = """Email service is available. Current weather data for Seattle: rainy conditions, 14°C temperature, 
overcast skies with light precipitation. User email preferences show john@example.com as primary contact. 
Email system can send weather reports with current conditions and forecasts."""

response = "I can send you an email with the Seattle weather information. The current weather shows rainy conditions at 14°C with overcast skies. I'll prepare this information for your email."

result = groundedness_evaluator(response=response, context=context, query=query)
print("=== Email Context Evaluation ===")
pprint(result)

=== Email Context Evaluation ===
{'gpt_groundedness': 5.0,
 'groundedness': 5.0,
 'groundedness_completion_tokens': 142,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-2025-04-14',
 'groundedness_prompt_tokens': 1421,
 'groundedness_reason': 'The response is fully correct and complete, directly '
                        'addressing the query and including all relevant '
                        'details from the context.',
 'groundedness_result': 'pass',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"query\\": '
                              '\\"Can you send me an email with weather '
                              'information for Seattle?\\", \\"response\\": '
                              '\\"I can send you an email with the Seattle '
                              'weather information. The current weather shows '
                              'rainy conditions at 14\\\\u00b0C with overcast '
                              "skies. I'll prepare t

## Understanding the Results

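The Groundedness evaluator returns a dictionary with these key fields:

- **`groundedness`**: Main score (1-5) indicating how well-grounded the response is
- **`gpt_groundedness`**: Carries the same value as `groundedness` in the outputs above
- **`groundedness_result`**: Binary result ("pass" or "fail") based on the threshold
- **`groundedness_threshold`**: Threshold for pass/fail determination (default: 3)
- **`groundedness_reason`**: Detailed explanation of the grounding assessment
- **Token usage information** (`groundedness_prompt_tokens`, `groundedness_completion_tokens`): For monitoring costs and performance

For conversation inputs, per-turn values are returned as lists under `evaluation_per_turn`.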
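
The fields above can be condensed into a short summary for logging. The sketch below is one possible way to do this; `summarize_result` is a hypothetical helper, and the fallback rule (pass when the score is at or above the threshold) is an assumption inferred from the sample outputs in this notebook rather than a documented guarantee.

```python
def summarize_result(result, default_threshold=3):
    """Condense a single-turn groundedness result into a small summary dict.

    Hypothetical helper; the key names match the sample outputs above.
    """
    score = result["groundedness"]
    threshold = result.get("groundedness_threshold", default_threshold)

    # Prefer the evaluator's own verdict. If it is absent, fall back to
    # comparing score and threshold (assumed rule: pass when score >= threshold).
    verdict = result.get("groundedness_result")
    if verdict is None:
        verdict = "pass" if score >= threshold else "fail"

    return {
        "score": score,
        "threshold": threshold,
        "passed": verdict == "pass",
        "reason": result.get("groundedness_reason", ""),
        "total_tokens": result.get("groundedness_prompt_tokens", 0)
        + result.get("groundedness_completion_tokens", 0),
    }
```

Applied to the first sample output, this yields a score of 4.0, `passed` set to `True`, and 1361 total tokens (1205 prompt plus 156 completion).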

### Interpreting Scores:
- **Score 5**: Fully grounded - all claims supported by context
- **Score 4**: Mostly grounded - most claims supported, minor unsupported details
- **Score 3**: Partially grounded - some claims supported, some not
- **Score 2**: Mostly ungrounded - few claims supported by context
- **Score 1**: Completely ungrounded - no claims supported by context

### Best Practices:
1. **Provide Comprehensive Context**: Include all relevant source material
2. **Use Clear Queries**: When provided, queries help focus the evaluation
3. **Monitor Agent Tools**: For agent responses, check that the final answer is grounded in the tool results
4. **Regular Evaluation**: Test groundedness across different domains and use cases
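
For regular evaluation across many samples, a small loop over the evaluator is often enough. The sketch below assumes each sample is a dict of keyword arguments accepted by the evaluator and that each result contains the `groundedness` and `groundedness_result` keys shown earlier; `evaluate_batch` is a hypothetical helper, not an SDK function.

```python
def evaluate_batch(evaluator, samples):
    """Run an evaluator over a list of samples and aggregate the scores.

    `evaluator` is any callable that accepts the sample's keys as keyword
    arguments and returns a dict like the outputs shown in this notebook.
    """
    results = [evaluator(**sample) for sample in samples]
    scores = [result["groundedness"] for result in results]
    passed = sum(1 for result in results if result.get("groundedness_result") == "pass")

    return {
        "count": len(results),
        # None signals "no data" when the sample list is empty.
        "mean_score": sum(scores) / len(scores) if scores else None,
        "pass_rate": passed / len(results) if results else None,
        "results": results,
    }
```

With the evaluator from this notebook, the call would look like `evaluate_batch(groundedness_evaluator, [{"response": ..., "context": ...}, ...])`. Each sample triggers a model call, so token usage grows with the size of the batch.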

### Use Cases:
- **RAG Applications**: Ensure retrieved information supports responses
- **Customer Support**: Verify responses are based on knowledge base
- **Agent Systems**: Validate tool-based responses against retrieved data
- **Content Generation**: Ensure generated content is supported by its source material
- **Multi-turn Conversations**: Maintain context grounding across dialogue
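
For the multi-turn case, the conversation output shown earlier nests per-turn score lists under `evaluation_per_turn`. A minimal sketch for locating the least grounded turn, assuming that structure, might look like this; `weakest_turn` is a hypothetical helper.

```python
def weakest_turn(result):
    """Return (turn_index, score) for the lowest-scoring assistant turn.

    Assumes the conversation result layout shown in sample 5, where
    result["evaluation_per_turn"]["groundedness"] is a list of scores.
    Ties resolve to the earliest turn.
    """
    scores = result["evaluation_per_turn"]["groundedness"]
    if not scores:
        raise ValueError("no per-turn scores found")
    index = min(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]
```

The returned index is zero-based and counts evaluated turns, so it can be used to look up the matching entry in the other per-turn lists, such as `groundedness_reason`.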

### Comparison with Other Evaluators:
- **vs Relevance**: Groundedness checks whether claims are supported by the context; Relevance checks topic alignment
- **vs Tool Call Accuracy**: Groundedness validates response content; Tool Call Accuracy validates tool usage
- **Combined Use**: Use all three evaluators for comprehensive agent evaluation