# Khipus.ai
### Assignment 5: Qualitative Metrics for LLM Evaluation Using the Azure AI Evaluation SDK
### Azure OpenAI + Azure AI Evaluation SDK
### Name: (Your Name)
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

This notebook serves as a case study for evaluating Large Language Models (LLMs) using the Azure AI Evaluation SDK. We will cover the following metrics:
- **Coherence**: Measures the logical flow and clarity of the response.
- **Fluency**: Assesses the grammatical correctness and readability of the text.
- **Groundedness**: Evaluates how well the claims in the response are supported by the supplied context.
- **Relevance**: Determines how well the response addresses the query.
- **Retrieval**: Measures how relevant the retrieved context is to the query and whether the most relevant information is ranked first.


### Explanation of the Assignment Section

The assignment section focuses on evaluating the groundedness of LLM responses using the Azure AI Evaluation SDK. Specifically, it challenges you to:

- Identify factual inaccuracies by modifying an example response.
- Simulate scenarios with subtle or multiple errors in a response.


This exercise emphasizes the importance of precise factual information in model responses and demonstrates how nuanced errors can impact evaluation metrics. 

In [6]:
#%pip install azure-ai-evaluation

### Import the libraries 

In [7]:

import json
import os
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    CoherenceEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    RetrievalEvaluator,
)


### Configure the Azure OpenAI model used as the evaluator

In [None]:

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="YOUR_AZURE_OPENAI_ENDPOINT",  # endpoint of the Azure OpenAI resource
    api_key="YOUR_AZURE_OPENAI_API_KEY",  # API key of the Azure OpenAI resource
    azure_deployment="gpt-4o",  # deployment name in the Azure OpenAI resource (use the deployment name you chose in the course)
    api_version="2023-05-15",
)
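
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. The following is a minimal sketch, assuming the credentials are stored in environment variables. The helper `load_model_settings` and the variable names `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, and `AZURE_OPENAI_API_VERSION` are illustrative choices, not names required by the SDK.

```python
import os


def load_model_settings(env=None):
    """Build keyword arguments for AzureOpenAIModelConfiguration.

    The environment variable names are hypothetical; use the names
    that your own setup defines.
    """
    env = os.environ if env is None else env
    required = {
        "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
        "api_key": "AZURE_OPENAI_API_KEY",
        "azure_deployment": "AZURE_OPENAI_DEPLOYMENT",
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    settings = {key: env[var] for key, var in required.items()}
    # Default to the API version used in the cell above.
    settings["api_version"] = env.get("AZURE_OPENAI_API_VERSION", "2023-05-15")
    return settings
```

The returned dictionary uses the same keyword names as the cell above, so it could be passed on as `AzureOpenAIModelConfiguration(**load_model_settings())`.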


## Coherence Evaluation
Coherence refers to the logical flow and clarity of the response. Let's evaluate a sample response for coherence.

In [9]:
coherence_eval = CoherenceEvaluator(model_config)

query_response = dict(
    query='What is the capital of France?',
    response='The capital of France is Paris.'
)

coherence_score = coherence_eval(**query_response)
print('Coherence Score:', coherence_score)

Coherence Score: {'coherence': 5.0, 'gpt_coherence': 5.0, 'coherence_reason': 'The RESPONSE is coherent, logically structured, and directly answers the QUERY without any issues in flow or organization.', 'coherence_result': 'pass', 'coherence_threshold': 3, 'coherence_prompt_tokens': 1264, 'coherence_completion_tokens': 117, 'coherence_total_tokens': 1381, 'coherence_finish_reason': 'stop', 'coherence_model': 'gpt-4o-2024-11-20', 'coherence_sample_input': '[{"role": "user", "content": "{\\"query\\": \\"What is the capital of France?\\", \\"response\\": \\"The capital of France is Paris.\\"}"}]', 'coherence_sample_output': '[{"role": "assistant", "content": "<S0>Let\'s think step by step: The QUERY asks for the capital of France, and the RESPONSE directly answers the question by stating \\"The capital of France is Paris.\\" The information is accurate, concise, and logically presented. There is no ambiguity or disjointedness in the response, and the sentence flows smoothly. The RESPONSE

## Fluency Evaluation
Fluency assesses the grammatical correctness and readability of the text. Let's evaluate a sample response for fluency.

In [10]:
fluency_eval = FluencyEvaluator(model_config)

query_response = dict(
    query='How do you make a cake?',
    response='To make a cake, you need flour, sugar, and eggs.'
)

fluency_score = fluency_eval(**query_response)
print('Fluency Score:', fluency_score)

Fluency Score: {'fluency': 3.0, 'gpt_fluency': 3.0, 'fluency_reason': 'The response is clear and error-free but uses basic vocabulary and simple sentence construction, which aligns with Competent Fluency.', 'fluency_result': 'pass', 'fluency_threshold': 3, 'fluency_prompt_tokens': 926, 'fluency_completion_tokens': 120, 'fluency_total_tokens': 1046, 'fluency_finish_reason': 'stop', 'fluency_model': 'gpt-4o-2024-11-20', 'fluency_sample_input': '[{"role": "user", "content": "{\\"response\\": \\"To make a cake, you need flour, sugar, and eggs.\\"}"}]', 'fluency_sample_output': '[{"role": "assistant", "content": "<S0>Let\'s think step by step: The response is grammatically correct and conveys a clear idea. The vocabulary is basic and limited, and the sentence structure is simple without any complexity or variety. There are no errors, and the message is easily understood. However, the response lacks sophistication, varied vocabulary, and complex sentence structures that would elevate it to h

## Groundedness Evaluation
Groundedness evaluates how well the response is supported by the supplied context: a claim that contradicts the context or does not appear in it lowers the score. Let's evaluate a sample response for groundedness.

In [11]:
# Create a groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

# Example with response that contains both correct and incorrect information
query = "Tell me about the solar system"
response = "The solar system has 8 planets. Mercury is closest to the sun, followed by Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Pluto."
context = "The solar system has 8 planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Pluto was reclassified as a dwarf planet in 2006."

# Evaluate groundedness with context
result = groundedness_eval(query=query, response=response, context=context)
print("Groundedness Score:", result['groundedness'])
print("Explanation:", result['groundedness_reason'])

Groundedness Score: 2.0
Explanation: The response attempts to answer the query but includes incorrect information by listing Pluto as a planet, which is not supported by the context. This makes the response unreliable.


## Relevance Evaluation
Relevance determines how well the response addresses the query. Let's evaluate a sample response for relevance.

In [12]:
relevance_eval = RelevanceEvaluator(model_config)

query_response = dict(
    query='What is the fastest land animal?',
    response='The cheetah is the fastest land animal.'
)

relevance_score = relevance_eval(**query_response)
print('Relevance Score:', relevance_score)

Relevance Score: {'relevance': 4.0, 'gpt_relevance': 4.0, 'relevance_result': 'pass', 'relevance_threshold': 3, 'relevance_reason': "The response directly and accurately answers the user's query by identifying the cheetah as the fastest land animal. It is fully relevant and sufficient for the question asked.", 'relevance_prompt_tokens': 1589, 'relevance_completion_tokens': 48, 'relevance_total_tokens': 1637, 'relevance_finish_reason': 'stop', 'relevance_model': 'gpt-4o-2024-11-20', 'relevance_sample_input': '[{"role": "user", "content": "{\\"query\\": \\"What is the fastest land animal?\\", \\"response\\": \\"The cheetah is the fastest land animal.\\"}"}]', 'relevance_sample_output': '[{"role": "assistant", "content": "{\\n  \\"explanation\\": \\"The response directly and accurately answers the user\'s query by identifying the cheetah as the fastest land animal. It is fully relevant and sufficient for the question asked.\\",\\n  \\"score\\": 4\\n}"}]'}
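
The raw result dictionaries are long, and most of their length is token counts and logged prompts. A minimal sketch of pulling out the headline fields follows, assuming the key pattern seen in the outputs above (`<metric>`, `<metric>_result`, `<metric>_threshold`, `<metric>_reason`). The helper `summarize_result` is hypothetical, and the key names may differ in other SDK versions.

```python
def summarize_result(result, metric):
    """Pick the headline fields for one metric out of an evaluator's result dict."""
    return {
        "metric": metric,
        "score": result.get(metric),
        "threshold": result.get(f"{metric}_threshold"),
        "result": result.get(f"{metric}_result"),
        "reason": result.get(f"{metric}_reason"),
    }


# Trimmed copy of the relevance output shown above.
relevance_output = {
    "relevance": 4.0,
    "relevance_result": "pass",
    "relevance_threshold": 3,
    "relevance_reason": (
        "The response directly and accurately answers the user's query by "
        "identifying the cheetah as the fastest land animal. It is fully "
        "relevant and sufficient for the question asked."
    ),
}

summary = summarize_result(relevance_output, "relevance")
print(f"{summary['metric']}: {summary['score']} "
      f"({summary['result']}, threshold {summary['threshold']})")
# prints: relevance: 4.0 (pass, threshold 3)
```

The same helper could be applied to the other evaluators by changing the metric name, for example `summarize_result(coherence_score, "coherence")`.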


## Retrieval Evaluation
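Retrieval measures how relevant the retrieved context is to the query and whether the most relevant information is ranked first. The logged sample input in the output below contains only the query and the context, so the score reflects the context rather than the response. Let's evaluate a sample context for retrieval.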

In [13]:
# Use the existing model_config and RetrievalEvaluator already imported

retrieval_eval = RetrievalEvaluator(model_config)

query = "What is the largest ocean on Earth?"
response = "The Pacific Ocean is the largest ocean on Earth."
context = "The Pacific Ocean covers about 165 million km², making it the largest."

score = retrieval_eval(query=query, response=response, context=context)
print("Retrieval score:", score)


Retrieval score: {'retrieval': 5.0, 'gpt_retrieval': 5.0, 'retrieval_reason': 'The context fully addresses the query with the most relevant information ranked at the top, meeting the criteria for a perfect retrieval score.', 'retrieval_result': 'pass', 'retrieval_threshold': 3, 'retrieval_prompt_tokens': 3479, 'retrieval_completion_tokens': 121, 'retrieval_total_tokens': 3600, 'retrieval_finish_reason': 'stop', 'retrieval_model': 'gpt-4o-2024-11-20', 'retrieval_sample_input': '[{"role": "user", "content": "{\\"query\\": \\"What is the largest ocean on Earth?\\", \\"context\\": \\"The Pacific Ocean covers about 165 million km\\\\u00b2, making it the largest.\\"}"}]', 'retrieval_sample_output': '[{"role": "assistant", "content": "<S0>Let\'s think step by step: The query asks for the largest ocean on Earth. The context directly provides the answer, stating that the Pacific Ocean is the largest and includes its size. The information is highly relevant to the query, and the most pertinent c

## Assignment

### Groundedness Evaluation
In the groundedness example, the response incorrectly lists Pluto as a planet. Modify the example to test a different type of factual error, and explain how the groundedness score might change when the error is more subtle or when multiple errors are present in a longer response.

In [14]:
# Your code here
# Create a groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

# Example with a more subtle factual error
query = "When did the first moon landing occur?"
response = "The first moon landing was achieved by Neil Armstrong and Buzz Aldrin on July 21, 1969, when Apollo 11 landed on the lunar surface."
context = "NASA's Apollo 11 mission successfully landed the first humans on the Moon on July 20, 1969. Neil Armstrong and Buzz Aldrin stepped onto the lunar surface while Michael Collins orbited above."

# Evaluate groundedness with context
result = groundedness_eval(query=query, response=response, context=context)
print("Groundedness Score:", result['groundedness'])
print("Explanation:", result['groundedness_reason'])

Groundedness Score: 2.0
Explanation: The RESPONSE contains incorrect information about the date of the moon landing, which makes it unreliable and not grounded in the CONTEXT.
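
To see how the score reacts to subtle versus multiple errors, it helps to score several variants of the response against the same query and context. The following is a minimal sketch under stated assumptions: `compare_groundedness` and `print_comparison` are hypothetical helpers, the variant responses are illustrative inputs, and `evaluator` is assumed to be any callable that takes `query`, `response`, and `context` keyword arguments and returns a dictionary with `groundedness` and `groundedness_reason`, as `groundedness_eval` does above. No scores are shown because they depend on the judge model.

```python
def compare_groundedness(evaluator, query, context, responses):
    """Score several candidate responses against the same query and context.

    `responses` maps a short label to a response string.
    Returns a list of (label, score, reason) tuples in insertion order.
    """
    rows = []
    for label, response in responses.items():
        result = evaluator(query=query, response=response, context=context)
        rows.append((label, result["groundedness"], result["groundedness_reason"]))
    return rows


def print_comparison(rows):
    """Print one line per variant so the scores can be compared at a glance."""
    for label, score, reason in rows:
        print(f"{label:<16} score={score}  {reason}")


# Illustrative variants: no error, one subtle error, several errors.
variants = {
    "no error": "Apollo 11 landed the first humans on the Moon on July 20, 1969.",
    "wrong date": "Apollo 11 landed the first humans on the Moon on July 21, 1969.",
    "multiple errors": "Apollo 12 landed the first humans on the Moon on July 21, 1968.",
}

# In the notebook, pass the real evaluator and the query/context defined above:
# rows = compare_groundedness(groundedness_eval, query, context, variants)
# print_comparison(rows)
```

Keeping the query and context fixed means that any change in score can be attributed to the response alone.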


## Conclusion
In this assignment, we evaluated LLM responses with the Azure AI Evaluation SDK across five metrics: Coherence, Fluency, Groundedness, Relevance, and Retrieval. You can use these metrics to assess and improve the quality of your LLM applications.