# Evaluation

## Evaluate a single Answer
* Context Relevance: the retrieved context should contain the information needed to answer the question
* Answer Faithfulness: the answer should be consistent with the retrieved context
* Answer Relevance: the answer should address the question
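
To make explicit which pair of texts each metric compares, here is a minimal sketch that uses plain token overlap as a stand-in. This is only a toy lexical proxy and an assumption for illustration: the evaluators used later in this notebook ask an LLM to judge, and the names `tokens`, `overlap` and `score_answer` are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(source: str, target: str) -> float:
    """Fraction of the source's tokens that also appear in the target."""
    src = tokens(source)
    return len(src & tokens(target)) / len(src) if src else 0.0


def score_answer(question: str, context: str, answer: str) -> dict[str, float]:
    """Toy proxies: each metric compares a different pair of texts."""
    return {
        # question vs. context: does the context cover what was asked?
        "context_relevance": overlap(question, context),
        # answer vs. context: is the answer grounded in the context?
        "answer_faithfulness": overlap(answer, context),
        # question vs. answer: does the answer talk about what was asked?
        "answer_relevance": overlap(question, answer),
    }


scores = score_answer(
    question="where is the monument",
    context="the monument is in madrid",
    answer="the monument is in madrid",
)
print(scores)
```

Token overlap cannot tell a paraphrase from a contradiction, which is one reason an LLM judge is used in practice.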

## Evaluate the RAG system
* Noise Robustness: handle documents that are related to the question but lack substantive information
* Negative Rejection: decline to answer questions that fall outside the available knowledge
* Information Integration: answer questions that require combining multiple documents
* Counterfactual Robustness: recognize likely misinformation in the retrieved documents
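
The four abilities can be probed with labelled test cases and a pass rate per ability. The sketch below is an assumption about how such a harness might look, not an established benchmark: the cases, the `passes` predicates and the deliberately naive `concatenate_docs` system are all hypothetical.

```python
from collections import defaultdict

# Hypothetical test cases: each names the ability it probes, the documents
# handed to the system, and a predicate deciding whether the answer passes.
CASES = [
    {
        "ability": "noise_robustness",
        "question": "Where is the monument?",
        "docs": ["Monuments are often built of stone.", "The monument is in Madrid."],
        "passes": lambda a: "madrid" in a.lower(),
    },
    {
        "ability": "negative_rejection",
        "question": "Who designed the monument?",
        "docs": ["The monument is in Madrid."],
        "passes": lambda a: "cannot answer" in a.lower(),
    },
    {
        "ability": "information_integration",
        "question": "Who does the bronze statue show?",
        "docs": ["The statue shows Don Quixote.", "The statue is cast in bronze."],
        "passes": lambda a: "quixote" in a.lower() and "bronze" in a.lower(),
    },
    {
        "ability": "counterfactual_robustness",
        "question": "Where is the monument?",
        # The second document plays the role of misinformation.
        "docs": ["The monument is in Madrid.", "The monument is in Paris."],
        "passes": lambda a: "paris" not in a.lower(),
    },
]


def evaluate_abilities(cases, answer_fn):
    """Return the pass rate per ability for a given question-answering function."""
    passed = defaultdict(list)
    for case in cases:
        answer = answer_fn(case["question"], case["docs"])
        passed[case["ability"]].append(case["passes"](answer))
    return {ability: sum(flags) / len(flags) for ability, flags in passed.items()}


def concatenate_docs(question, docs):
    """Naive stand-in for a RAG system: repeats every document it was given."""
    return " ".join(docs)


report = evaluate_abilities(CASES, concatenate_docs)
print(report)
```

The naive system passes the noise and integration cases but never refuses and repeats the misinformation, so the per-ability breakdown shows weaknesses that a single overall score would hide.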

## Key Takeaways
Evaluation depends on an LLM acting as the judge, but the LLM's judgement is not always reliable. There is still a long way to go.
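
One simple way to gauge how stable a judge is: ask it the same question several times and look at how often the verdicts agree. The sketch below assumes a hypothetical zero-argument `judge` callable that returns a verdict string; here it is replaced by a canned sequence so the example runs without an LLM.

```python
from collections import Counter
from typing import Callable


def judge_with_agreement(judge: Callable[[], str], runs: int = 5) -> tuple[str, float]:
    """Call the judge several times; return the majority verdict and its vote share."""
    votes = Counter(judge() for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / runs


# Canned verdicts standing in for repeated calls to an LLM judge (hypothetical data).
fake_verdicts = iter(["YES", "YES", "NO", "YES", "NO"])
verdict, agreement = judge_with_agreement(lambda: next(fake_verdicts), runs=5)
print(verdict, agreement)
```

A low agreement share is a hint that the verdict for that sample should not be trusted without a second look.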


In [2]:
from azureresource import (
    get_llm,
    get_embed_model,
    get_vector_store
)
from index import get_index

index_dict = {}
llm = get_llm("gpt-35-turbo", "gpt-35-turbo-1106")
embed_model = get_embed_model("text-embedding-ada-002", "text-embedding-ada-002")
vector_store = get_vector_store("chunk-512")
index = get_index(vector_store, llm, embed_model)

In [12]:
import nest_asyncio

nest_asyncio.apply()

In [23]:
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
    FaithfulnessEvaluator,
)

llama_debug = LlamaDebugHandler(print_trace_on_end=False)
callback_manager = CallbackManager([llama_debug])

# Query the index once and reuse the response for all three evaluators.
query_engine = index.as_query_engine()
# English: "At what place are you reminded of Don Quixote?"
question = "在什么地点可以勾起对堂吉柯德的联想"
response = query_engine.query(question)

# Answer faithfulness: is the answer supported by the retrieved context?
evaluator = FaithfulnessEvaluator(llm=llm)
faith_eval_result = evaluator.evaluate_response(query=question, response=response)

# Answer relevance: does the answer address the question?
relevance_evaluator = AnswerRelevancyEvaluator(llm=llm)
relevance_eval_result = relevance_evaluator.evaluate_response(query=question, response=response)

# Context relevance: does the retrieved context help answer the question?
context_relevance_evaluator = ContextRelevancyEvaluator(llm=llm)
context_relevance_result = context_relevance_evaluator.evaluate(
    query=question,
    response=response,
    contexts=[node.get_content() for node in response.source_nodes],
)


In [32]:
from IPython.display import Markdown, display

source_text = "\n".join(["### node_id: " + node.node_id + "\n\nscore: " + str(node.score) + "\n\n#### text:\n" + node.text for node in response.source_nodes])

display(Markdown(f"""\
## Question
{question}

## Response
{response.response}

## Retrieved Context
{source_text}

## Evaluation result
## Context Relevance
{str(context_relevance_result.feedback)}

## Faithfulness
{str(faith_eval_result.feedback)}

## Answer Relevance
{str(relevance_eval_result.feedback)}
"""))

## Question
在哪里可以勾起对堂吉柯德的联想

## Response
在纪念碑前面的场景可以勾起对堂吉柯德的联想。

## Retrieved Context
### node_id: 41162f37-97af-4d62-a686-92477e7c74ab

score: 0.84682554

#### text:
从纪念碑前面走去，你会感觉那栋稳稳当当、三台阶收分的大楼，就是纪念碑设计中的一个背景。它们作为建筑群，活像是一个整体。

![image](./Images/113.jpeg)

塞万提斯纪念碑

蓝天和碑前面的水池，打破了“纪念”的沉闷。纪念碑的主角高高在上，却和整个纪念碑的色调没有区分。塞万提斯在那里，可是他已经和西班牙的巨石融为一体了。那石砌的纪念碑，就如同西班牙那绵绵不尽的群山。而接近地面、无可阻挡地在走出来的，是那几近黑色的两个青铜塑像，那就是骑在瘦马上的堂·吉诃德和骑在驴子上的桑丘。

站在这两个一高瘦一矮胖、万世不坠的西班牙人面前，我终于感到有必要想想，假如堂·吉诃德是一个真正意义上的英雄或者骑士，假如他代表了那么多的精神和思想，他还有什么意思？他们从西班牙的黄金时代走出来，却踩着贫瘠的土地。
### node_id: e9b14226-e3ce-4182-bec2-9c59c40ec502

score: 0.84351003

#### text:
这个骑着毛驴的桑丘，是塞万提斯眼中真正的西班牙芸芸大众。桑丘并非没有英雄幻想，只是短缺堂·吉诃德式的英雄气概，且也不乏一点隐隐的私心，这才忠心耿耿、天涯海角地在瘦马后面紧紧跟随。

塞万提斯向我们指点了我们每个人的英雄情结，我们是桑丘，也是堂·吉诃德。我们有时候是桑丘，有时候是堂·吉诃德。他们形影不离，可以是同一个人，可以是同一个民族，可以就是我们眼前的这个世界。我们的冲动和幻想却可能是错乱的，我们在幻想和错乱之中摸索着理性。我们不了解这个世界，因为我们不了解自己或者根本不愿意了解自己，我们无法控制那支配着我们内心的欲望和冲动。在每一个宣言后面，都肩并肩地站着他们，堂·吉诃德和桑丘。而塞万提斯，怀着点忧郁，目送他们前行。

前面是又一个两百年，十八世纪和十九世纪。偏偏就在这新的两百年即将开始的时候，西班牙的王位被传给了法国路易王朝。

## Evaluation result
## Context Relevance
The retrieved context is not directly relevant to the user's query about where to evoke thoughts of Don Quixote. The context provided describes a monument dedicated to Cervantes and Don Quixote, as well as some philosophical reflections on the characters, but it does not offer specific locations or methods for evoking thoughts of Don Quixote.

The retrieved context cannot be used exclusively to provide a full answer to the user's query, as it does not offer specific locations or methods for evoking thoughts of Don Quixote. It only provides philosophical reflections and a description of a monument dedicated to the character.

[RESULT] 2.0

## Faithfulness
YES

## Answer Relevance
1. The provided response matches the subject matter of the user's query by mentioning a specific location where one can evoke thoughts of Don Quixote.
2. The provided response attempts to address the focus or perspective on the subject matter taken on by the user's query by suggesting a specific location where one can evoke thoughts of Don Quixote.
[RESULT] 2


In [34]:
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
    FaithfulnessEvaluator,
)

# gpt4 is available as an alternative judge. The evaluators below are still
# built with `llm`; pass `llm=gpt4` to switch the judge.
gpt4 = get_llm("gpt-4", "gpt-4")

# Query the index once and reuse the response for all three evaluators.
query_engine = index.as_query_engine()
# English: "At what place are you reminded of Don Quixote?"
question = "在什么地点可以勾起对堂吉柯德的联想"
response = query_engine.query(question)

# Answer faithfulness
evaluator = FaithfulnessEvaluator(llm=llm)
faith_eval_result = evaluator.evaluate_response(query=question, response=response)

# Answer relevance
relevance_evaluator = AnswerRelevancyEvaluator(llm=llm)
relevance_eval_result = relevance_evaluator.evaluate_response(query=question, response=response)

# Context relevance
context_relevance_evaluator = ContextRelevancyEvaluator(llm=llm)
context_relevance_result = context_relevance_evaluator.evaluate(
    query=question,
    response=response,
    contexts=[node.get_content() for node in response.source_nodes],
)

In [35]:
from IPython.display import Markdown, display

source_text = "\n".join(["### node_id: " + node.node_id + "\n\nscore: " + str(node.score) + "\n\n#### text:\n" + node.text for node in response.source_nodes])

display(Markdown(f"""\
## Question
{question}

## Response
{response.response}

## Retrieved Context
{source_text}

## Evaluation result
## Context Relevance
{str(context_relevance_result.feedback)}

## Faithfulness
{str(faith_eval_result.feedback)}

## Answer Relevance
{str(relevance_eval_result.feedback)}
"""))

## Question
在什么地点可以勾起对堂吉柯德的联想

## Response
At the location described in the context, one can be reminded of Don Quixote when standing in front of the bronze statues of Don Quixote and Sancho Panza, which are close to the ground and nearly black in color.

## Retrieved Context
### node_id: 41162f37-97af-4d62-a686-92477e7c74ab

score: 0.8584269

#### text:
从纪念碑前面走去，你会感觉那栋稳稳当当、三台阶收分的大楼，就是纪念碑设计中的一个背景。它们作为建筑群，活像是一个整体。

![image](./Images/113.jpeg)

塞万提斯纪念碑

蓝天和碑前面的水池，打破了“纪念”的沉闷。纪念碑的主角高高在上，却和整个纪念碑的色调没有区分。塞万提斯在那里，可是他已经和西班牙的巨石融为一体了。那石砌的纪念碑，就如同西班牙那绵绵不尽的群山。而接近地面、无可阻挡地在走出来的，是那几近黑色的两个青铜塑像，那就是骑在瘦马上的堂·吉诃德和骑在驴子上的桑丘。

站在这两个一高瘦一矮胖、万世不坠的西班牙人面前，我终于感到有必要想想，假如堂·吉诃德是一个真正意义上的英雄或者骑士，假如他代表了那么多的精神和思想，他还有什么意思？他们从西班牙的黄金时代走出来，却踩着贫瘠的土地。
### node_id: 7180f127-0f57-49ce-b846-1bbbaa48ae8e

score: 0.85400695

#### text:
已经不记得我们都点了什么饮料，却能记得坐在阴影里，看着夕阳下的马约尔广场：一张张小桌子边是懒懒散散的游客，远处是画家们的摊位，还有靠歌唱谋生的艺术家，不失时机地弹起吉他唱起来——那是一种微微有点奇特的感觉，即便在今天，我们回想起来，仍然能够体会到广场那一丝由规整而起的内在拘谨。围绕广场四周的建筑，不论是红墙还是白石，都由“时间”调入了一种只属于历史的黄色。红白之间就不仅只有设计师造就的色彩对比，还有岁月引出的色彩调和。感觉的奇特，源自于这里被历史做旧了的建筑形制和色彩氛围。置身其中，犹如不留神一脚踏进了历史。

马德里作为首都，当然也见证过宗教裁判。这里就曾经是一个公审公判和行刑的地方。现在，西班牙还保留了艺术家在1683年所画的这个广场在1680年6月30日审判新教异端的场景。审判的时候，连国王都来了，除了四周楼房的窗口，还在广场两侧搭起一层层的看台。那些不肯悔过的新教徒，会在当晚被处死。

## Evaluation result
## Context Relevance
The retrieved context does not match the subject matter of the user's query. The context talks about the Cervantes Monument in Madrid, Spain, and the surrounding area, but it does not specifically address the location where one can evoke thoughts of Don Quixote. Therefore, the relevance of the retrieved context to the user's query is low.

The retrieved context cannot be used exclusively to provide a full answer to the user's query. It does not provide specific information about the location where one can evoke thoughts of Don Quixote, which is what the user is asking for. Therefore, the context is not sufficient to fully answer the user's query.

[RESULT] 2.0

## Faithfulness
YES

## Answer Relevance
1. The provided response matches the subject matter of the user's query as it mentions a specific location that can evoke thoughts of Don Quixote.
2. The provided response attempts to address the focus or perspective on the subject matter taken on by the user's query by describing a specific location and the visual cues that can evoke thoughts of Don Quixote.

[RESULT] 2
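
The relevance feedback printed above ends with a `[RESULT]` tag followed by a number. When only the feedback text has been saved, that number can be recovered with a small regular expression. This is a sketch under that assumption; `parse_result` is a hypothetical helper, and the sample strings are shortened stand-ins for real feedback.

```python
import re
from typing import Optional

# Matches "[RESULT] 2" as well as "[RESULT] 2.0".
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*(\d+(?:\.\d+)?)")


def parse_result(feedback: str) -> Optional[float]:
    """Return the number after the last '[RESULT]' tag, or None if there is none."""
    matches = RESULT_PATTERN.findall(feedback)
    return float(matches[-1]) if matches else None


context_score = parse_result("The context is only partly relevant.\n\n[RESULT] 2.0")
answer_score = parse_result("1. The response matches the subject matter.\n[RESULT] 2")
faithfulness_score = parse_result("YES")  # faithfulness feedback carries no tag
print(context_score, answer_score, faithfulness_score)
```

The raw numbers are only comparable once the maximum of each scale is known, since the two relevance evaluators do not necessarily grade out of the same total.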


In [1]:
from llama_index.core import (
    VectorStoreIndex,
    load_index_from_storage,
    StorageContext,
)
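# ParamTuner ships in the separate `llama-index-experimental` package
# (pip install llama-index-experimental); without it this import raises
# ModuleNotFoundError.
from llama_index.experimental.param_tuner import ParamTuner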
from llama_index.core.param_tuner.base import TunedResult, RunResult
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    aget_responses,
)
from llama_index.core.evaluation import (
    SemanticSimilarityEvaluator,
    BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import IndexNode

import os
import numpy as np
from pathlib import Path

ModuleNotFoundError: No module named 'llama_index.experimental'

In [None]:
def _build_index(chunk_size, docs):
    index_out_path = f"./storage_{chunk_size}"
    if not os.path.exists(index_out_path):
        Path(index_out_path).mkdir(parents=True, exist_ok=True)
        # parse docs
        node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size)
        base_nodes = node_parser.get_nodes_from_documents(docs)

        # build index
        index = VectorStoreIndex(base_nodes)
        # save index to disk
        index.storage_context.persist(index_out_path)
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(
            persist_dir=index_out_path
        )
        # load index
        index = load_index_from_storage(
            storage_context,
        )
    return index


def _get_eval_batch_runner():
    evaluator_s = SemanticSimilarityEvaluator(embed_model=OpenAIEmbedding())
    eval_batch_runner = BatchEvalRunner(
        {"semantic_similarity": evaluator_s}, workers=2, show_progress=True
    )

    return eval_batch_runner

In [None]:
def objective_function(params_dict):
    chunk_size = params_dict["chunk_size"]
    docs = params_dict["docs"]
    top_k = params_dict["top_k"]
    eval_qs = params_dict["eval_qs"]
    # ref_response_strs = params_dict["ref_response_strs"]

    # build index
    index = _build_index(chunk_size, docs)

    # query engine
    query_engine = index.as_query_engine(similarity_top_k=top_k)

    # get predicted responses
    pred_response_objs = get_responses(
        eval_qs, query_engine, show_progress=True
    )

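    # run evaluator
    # NOTE: can uncomment other evaluators
    # NOTE: SemanticSimilarityEvaluator compares each response with a reference
    # answer, so reference strings (see `ref_response_strs` above) have to be
    # passed via `reference=` for this call to produce scores.
    eval_batch_runner = _get_eval_batch_runner()
    eval_results = eval_batch_runner.evaluate_responses(
        eval_qs, responses=pred_response_objs
    )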

    # get semantic similarity metric
    mean_score = np.array(
        [r.score for r in eval_results["semantic_similarity"]]
    ).mean()

    return RunResult(score=mean_score, params=params_dict)
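
The objective above averages a semantic similarity score over all evaluation questions. Assuming that score is the cosine similarity between the embedding of a predicted answer and the embedding of its reference answer, the computation looks roughly like this. The two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# (predicted embedding, reference embedding) pairs; hypothetical toy data.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([3.0, 4.0], [4.0, 3.0]),  # close
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
pair_scores = [cosine_similarity(pred, ref) for pred, ref in pairs]
mean_score = sum(pair_scores) / len(pair_scores)
print(pair_scores, mean_score)
```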

In [None]:
from llama_index.readers.file.epub import EpubReader
from testcase import question_list, bcolors

# Load the book; load_data returns a list of Document objects.
document = EpubReader().load_data("data/book.epub")

param_dict = {"chunk_size": [256, 512, 1024], "top_k": [2, 5]}
# param_dict = {
#     "chunk_size": [256],
#     "top_k": [1]
# }
fixed_param_dict = {
    "docs": document,
    "eval_qs": question_list,
    # "ref_response_strs": ,
}

param_tuner = ParamTuner(
    param_fn=objective_function,
    param_dict=param_dict,
    fixed_param_dict=fixed_param_dict,
    show_progress=True,
)

In [None]:
results = param_tuner.tune()

In [None]:
best_result = results.best_run_result
best_top_k = results.best_run_result.params["top_k"]
best_chunk_size = results.best_run_result.params["chunk_size"]
print(f"Score: {best_result.score}")
print(f"Top-k: {best_top_k}")
print(f"Chunk size: {best_chunk_size}")
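
Conceptually, the tuning run above is a grid search: every combination of `chunk_size` and `top_k` is scored by the objective function and the best one is kept. The sketch below shows that idea in plain Python. It is not the `ParamTuner` implementation, and `grid_search` and `toy_objective` are hypothetical names with a made-up scoring rule.

```python
from itertools import product


def grid_search(param_fn, param_dict, fixed_param_dict):
    """Score every combination in param_dict; return (best_score, best_params)."""
    names = list(param_dict)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_dict[name] for name in names)):
        # Merge the fixed parameters with the combination being tried.
        params = {**fixed_param_dict, **dict(zip(names, values))}
        score = param_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_objective(params):
    """Made-up score that peaks at chunk_size=512 and grows with top_k."""
    return -abs(params["chunk_size"] - 512) / 512 + params["top_k"] / 10


best_score, best_params = grid_search(
    toy_objective,
    param_dict={"chunk_size": [256, 512, 1024], "top_k": [2, 5]},
    fixed_param_dict={"docs": [], "eval_qs": []},
)
print(best_score, best_params["chunk_size"], best_params["top_k"])
```

With three chunk sizes and two values of `top_k`, the objective runs six times, and each run with a new chunk size rebuilds the index, so the grid should stay small.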