# Khipus.ai
### Assignment 5: Qualitative Metrics for LLM Evaluation Using the Azure AI Evaluation SDK
### Azure OpenAI + Azure AI Evaluation SDK
### Name: (Your Name)
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

This notebook serves as a case study for evaluating Large Language Models (LLMs) using the Azure AI Evaluation SDK. We will cover the following metrics:
- **Coherence**: Measures the logical flow and clarity of the response.
- **Fluency**: Assesses the grammatical correctness and readability of the text.
- **Groundedness**: Evaluates how well the claims in the response are supported by the supplied context.
- **Relevance**: Determines how well the response addresses the query.
- **Retrieval**: Measures how relevant the retrieved context is to the query and how well the most relevant information is ranked.


### Explanation of the Assignment Section

The assignment section focuses on evaluating the groundedness of LLM responses using the Azure AI Evaluation SDK. Specifically, it challenges you to:

- Identify factual inaccuracies by modifying an example response.
- Simulate scenarios with subtle or multiple errors in a response.


This exercise emphasizes the importance of precise factual information in model responses and demonstrates how nuanced errors can impact evaluation metrics. 

In [None]:
#%pip install azure-ai-evaluation

### Import the libraries 

In [None]:

import json
import os
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    CoherenceEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    RetrievalEvaluator,
)


[INFO] Could not import AIAgentConverter. Please install the dependency with `pip install azure-ai-projects`.


### Configure the Azure OpenAI model

In [None]:

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="replace with your endpoint",  # https://azure-openai-<your-resource-name>.openai.azure.com/
    api_key="replace with your key",  # key from the Azure OpenAI resource
    azure_deployment="gpt-4.1-mini",
    api_version="2023-05-15",
)
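
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of reading the same four settings from environment variables instead. The variable names (`AZURE_OPENAI_ENDPOINT` and so on) and the helper `load_model_settings` are assumptions made for illustration, not names the SDK requires.

```python
import os

# Hypothetical variable names; use whatever names your environment defines.
ENV_KEYS = {
    "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
    "api_key": "AZURE_OPENAI_API_KEY",
    "azure_deployment": "AZURE_OPENAI_DEPLOYMENT",
    "api_version": "AZURE_OPENAI_API_VERSION",
}


def load_model_settings(env=None):
    """Collect the four model settings from environment variables.

    Raises KeyError early, naming every missing variable, instead of
    failing later inside an evaluator call.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {field: env[name] for field, name in ENV_KEYS.items()}
```

The returned dict uses the same keyword names as the cell above, so it could be unpacked as `AzureOpenAIModelConfiguration(**load_model_settings())`.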


## Coherence Evaluation
Coherence refers to the logical flow and clarity of the response. Let's evaluate a sample response for coherence.

In [4]:
coherence_eval = CoherenceEvaluator(model_config)

query_response = dict(
    query='What is the capital of France?',
    response='The capital of France is Paris.'
)

coherence_score = coherence_eval(**query_response)
print('Coherence Score:', coherence_score)

Coherence Score: {'coherence': 4.0, 'gpt_coherence': 4.0, 'coherence_reason': 'The response directly and clearly answers the question with a complete sentence, showing logical and orderly presentation of the idea. It is easy to follow and fully coherent.', 'coherence_result': 'pass', 'coherence_threshold': 3}
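
The result is a plain dict whose keys follow the pattern seen in the output above: `<metric>`, `<metric>_reason`, `<metric>_result`, and `<metric>_threshold`. Here is a small sketch of pulling those fields out by metric name. The helper `summarize_result` is hypothetical, and the sample dict is copied from the coherence output above so the snippet runs without calling the model.

```python
def summarize_result(metric, result):
    """Pull the score, pass/fail flag, threshold, and reason for one metric."""
    return {
        "metric": metric,
        "score": result.get(metric),
        "result": result.get(f"{metric}_result"),
        "threshold": result.get(f"{metric}_threshold"),
        "reason": result.get(f"{metric}_reason"),
    }


# Copied from the coherence output shown above.
sample = {
    "coherence": 4.0,
    "gpt_coherence": 4.0,
    "coherence_reason": (
        "The response directly and clearly answers the question with a complete "
        "sentence, showing logical and orderly presentation of the idea. "
        "It is easy to follow and fully coherent."
    ),
    "coherence_result": "pass",
    "coherence_threshold": 3,
}

summary = summarize_result("coherence", sample)
print(f"{summary['metric']}: {summary['score']} "
      f"({summary['result']}, threshold {summary['threshold']})")
# prints: coherence: 4.0 (pass, threshold 3)
```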


## Fluency Evaluation
Fluency assesses the grammatical correctness and readability of the text. Let's evaluate a sample response for fluency.

In [5]:
fluency_eval = FluencyEvaluator(model_config)

query_response = dict(
    query='How do you make a cake?',
    response='To make a cake, you need flour, sugar, and eggs.'
)

fluency_score = fluency_eval(**query_response)
print('Fluency Score:', fluency_score)

Fluency Score: {'fluency': 3.0, 'gpt_fluency': 3.0, 'fluency_reason': 'The response is clear and grammatically correct but simple and lacks complexity or varied vocabulary, fitting the definition of competent fluency.', 'fluency_result': 'pass', 'fluency_threshold': 3}


## Groundedness Evaluation
Groundedness evaluates how well the response is supported by the supplied context: claims that contradict the context, or that cannot be traced to it, lower the score. Let's evaluate a sample response for groundedness.

In [6]:
# Create a groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

# Example with response that contains both correct and incorrect information
query = "Tell me about the solar system"
response = "The solar system has 8 planets. Mercury is closest to the sun, followed by Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Pluto."
context = "The solar system has 8 planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Pluto was reclassified as a dwarf planet in 2006."

# Evaluate groundedness with context
result = groundedness_eval(query=query, response=response, context=context)
print("Groundedness Score:", result['groundedness'])
print("Explanation:", result['groundedness_reason'])

Groundedness Score: 3.0
Explanation: The response attempts to answer the query by listing planets but incorrectly includes Pluto as a planet and omits Neptune, which is inaccurate based on the context. Hence, it contains incorrect information and is not fully grounded.


## Relevance Evaluation
Relevance determines how well the response addresses the query. Let's evaluate a sample response for relevance.

In [7]:
relevance_eval = RelevanceEvaluator(model_config)

query_response = dict(
    query='What is the fastest land animal?',
    response='The cheetah is the fastest land animal.'
)

relevance_score = relevance_eval(**query_response)
print('Relevance Score:', relevance_score)

Relevance Score: {'relevance': 4.0, 'gpt_relevance': 4.0, 'relevance_reason': 'The response correctly and completely answers the question by identifying the cheetah as the fastest land animal, fulfilling the requirement for a complete and accurate answer.', 'relevance_result': 'pass', 'relevance_threshold': 3}


## Retrieval Evaluation
Retrieval measures how relevant the retrieved context is to the query and whether the most relevant information is ranked first. Let's evaluate a sample query and context for retrieval.

In [8]:
# Use the existing model_config and RetrievalEvaluator already imported

retrieval_eval = RetrievalEvaluator(model_config)

query = "What is the largest ocean on Earth?"
response = "The Pacific Ocean is the largest ocean on Earth."
context = "The Pacific Ocean covers about 165 million km², making it the largest."

score = retrieval_eval(query=query, response=response, context=context)
print("Retrieval score:", score)


Retrieval score: {'retrieval': 5.0, 'gpt_retrieval': 5.0, 'retrieval_reason': 'The context directly and succinctly answers the query with the most relevant information at the top, making it highly relevant and well ranked without any bias.', 'retrieval_result': 'pass', 'retrieval_threshold': 3}
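
Each result above carries its own threshold, so the results can be checked together. The sketch below assumes that a score at or above the threshold counts as a pass, which is what the fluency output suggests (3.0 against a threshold of 3 is reported as `pass`). The helper `metrics_below_threshold` is hypothetical, and the scores and thresholds are copied from the outputs in this notebook.

```python
def metrics_below_threshold(results):
    """Return the names of metrics whose score falls below their threshold."""
    below = []
    for result in results:
        for key, threshold in result.items():
            if key.endswith("_threshold"):
                metric = key[: -len("_threshold")]
                if result[metric] < threshold:
                    below.append(metric)
    return below


# Scores and thresholds copied from the outputs shown in this notebook.
results = [
    {"coherence": 4.0, "coherence_threshold": 3},
    {"fluency": 3.0, "fluency_threshold": 3},
    {"relevance": 4.0, "relevance_threshold": 3},
    {"retrieval": 5.0, "retrieval_threshold": 3},
]

print(metrics_below_threshold(results))  # prints: []
```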


## Assignment

### Groundedness Evaluation
In the groundedness example, the response incorrectly lists Pluto as a planet. Modify the example to test a different type of factual error, and explain how the groundedness score might change when the error is more subtle or when multiple errors are present in a longer response.

In [None]:
# Your code here
# Create a groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

# Example with a more subtle factual error
query = "When did the first moon landing occur?"
response = "The first moon landing was achieved by Neil Armstrong and Buzz Aldrin on July 21, 1969, when Apollo 11 landed on the lunar surface."
context = "NASA's Apollo 11 mission successfully landed the first humans on the Moon on July 20, 1969. Neil Armstrong and Buzz Aldrin stepped onto the lunar surface while Michael Collins orbited above."

# Evaluate groundedness with context
result = groundedness_eval(query=query, response=response, context=context)
print("Groundedness Score:", result['groundedness'])
print("Explanation:", result['groundedness_reason'])
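
To see how the score moves between a subtle error and several errors, one option is to score a few response variants against the same query and context. The sketch below is illustrative only. `compare_responses` is a hypothetical helper, the response variants are made up for the exercise, and `placeholder_evaluator` is an offline stand-in that does not score anything. Pass `groundedness_eval` in its place to get real scores.

```python
def compare_responses(evaluator, query, context, responses):
    """Score each labeled response variant against the same query and context."""
    rows = []
    for label, response in responses.items():
        result = evaluator(query=query, response=response, context=context)
        rows.append((label, result.get("groundedness"), result.get("groundedness_reason")))
    return rows


def placeholder_evaluator(*, query, response, context):
    """Offline stand-in with the same keyword arguments; it does not score anything."""
    return {"groundedness": None, "groundedness_reason": "placeholder: no model was called"}


query = "When did the first moon landing occur?"
context = (
    "NASA's Apollo 11 mission successfully landed the first humans on the Moon "
    "on July 20, 1969. Neil Armstrong and Buzz Aldrin stepped onto the lunar "
    "surface while Michael Collins orbited above."
)

# Illustrative variants: one accurate, one with a single wrong date,
# and one with several deliberate errors.
responses = {
    "accurate": "Apollo 11 landed the first humans on the Moon on July 20, 1969.",
    "subtle date error": "Apollo 11 landed the first humans on the Moon on July 21, 1969.",
    "multiple errors": (
        "Apollo 12 landed the first humans on the Moon on July 21, 1968, "
        "and Michael Collins stepped onto the lunar surface."
    ),
}

for label, score, reason in compare_responses(placeholder_evaluator, query, context, responses):
    print(f"{label}: score={score} | {reason}")
```

Because the evaluator is itself an LLM, scores can differ between runs, so repeating each variant a few times gives a steadier picture than a single call.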

## Conclusion
In this case study, we have evaluated LLM responses using the Azure AI Evaluation SDK across various metrics: Coherence, Fluency, Groundedness, Relevance, and Retrieval. You can use these metrics to assess and improve the performance of your LLM applications.