# Individual Scorers Example

In this notebook, we show how to use each of the scorers in `tval.scorers`.

First, we set up a RAG system using LlamaIndex over our six Paul Graham essays about founders.

In [1]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Define a simple sample question and reference answer.

In [2]:
question = "What makes Sam Altman a good founder?"
reference_answer = "He has a great force of will."

Get the response and retrieved context from the RAG system.

In [3]:
response = query_engine.query(question)

In [4]:
print(response.response)

Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world. He is known for his determination and force of will, which are crucial traits for overcoming obstacles and not getting demoralized easily. Altman is also known for his strategic thinking and ambition, which make him a valuable advisor for startups. Additionally, Altman has a strong sense of imagination, allowing him to come up with surprising and innovative ideas. Finally, Altman's ability to build strong relationships and work well with others, such as his close friendship with Emmett Shear, demonstrates his ability to collaborate effectively and create a positive working environment.


Load the answer and retrieved context into the variables `llm_answer` and `retrieved_context_list`.

In [5]:
llm_answer = response.response
retrieved_context_list = [x.node.text for x in response.source_nodes]
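
The extraction step above can be wrapped in a small helper. This is only a minimal sketch: `extract_answer_and_contexts` is a hypothetical name, and it assumes nothing beyond what the cell above already uses, namely that the response object exposes `.response` and `.source_nodes[i].node.text`. The stand-in objects built with `SimpleNamespace` are illustrative and are not LlamaIndex classes.

```python
from types import SimpleNamespace
from typing import List, Tuple


def extract_answer_and_contexts(response) -> Tuple[str, List[str]]:
    """Return (answer text, list of retrieved context strings)."""
    answer = response.response
    contexts = [source.node.text for source in response.source_nodes]
    return answer, contexts


# Stand-in response with hypothetical data, shaped like the object used above.
fake_response = SimpleNamespace(
    response="He has a great force of will.",
    source_nodes=[
        SimpleNamespace(node=SimpleNamespace(text="context one")),
        SimpleNamespace(node=SimpleNamespace(text="context two")),
    ],
)

llm_answer_demo, contexts_demo = extract_answer_and_contexts(fake_response)
```

With the real query engine, the same helper would be called as `extract_answer_and_contexts(query_engine.query(question))`.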

### Answer Similarity Score

In [7]:
from tval.scorers import AnswerSimilarityScorer

llm_evaluator = "gpt-4"
scorer = AnswerSimilarityScorer(llm_evaluator)

similarity_score = scorer.score(question, reference_answer, llm_answer) # score is a float

In [8]:
similarity_score

5.0

### Retrieval Precision

In [9]:
from tval.scorers import RetrievalPrecisionScorer

llm_evaluator = "gpt-4"
scorer = RetrievalPrecisionScorer(llm_evaluator)

retrieval_precision_score = scorer.score(question, retrieved_context_list) 

score = retrieval_precision_score.score
context_relevant_list = retrieval_precision_score.context_relevant_list

In [10]:
score

1.0

In [11]:
context_relevant_list

[True, True]
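
The relationship between `score` and `context_relevant_list` can be sketched as follows, assuming the score is simply the fraction of retrieved contexts judged relevant. That formula is an assumption for illustration, not something this notebook confirms, and `fraction_true` is a hypothetical helper, not part of `tval`. It is at least consistent with the output above, where two relevant contexts out of two give 1.0.

```python
from typing import List


def fraction_true(flags: List[bool]) -> float:
    """Fraction of True entries; returns 0.0 for an empty list (a choice made for this sketch)."""
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


print(fraction_true([True, True]))   # 1.0
print(fraction_true([True, False]))  # 0.5
```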

### Augmentation Precision

In [12]:
from tval.scorers import AugmentationPrecisionScorer

llm_evaluator = "gpt-4"
scorer = AugmentationPrecisionScorer(llm_evaluator)

# AugmentationPrecisionScore object
augmentation_precision_score = scorer.score(question, llm_answer, retrieved_context_list)

# float between 0 and 1
score = augmentation_precision_score.score

# list of bools of whether each context in retrieved_context_list is relevant
context_relevant_list = augmentation_precision_score.context_relevant_list

# list of bools of whether each context in context_relevant_list appears in llm_answer
answer_contains_context_list = augmentation_precision_score.answer_contains_context_list

In [13]:
score

1.0

In [14]:
context_relevant_list

[True, True]

In [15]:
answer_contains_context_list

[True, True]
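
One way to read the two lists together is sketched below. It assumes that both lists are index-aligned with `retrieved_context_list` and that the score is the fraction of relevant contexts that appear in the answer. Both points are assumptions for illustration, and `augmentation_precision_sketch` is a hypothetical function rather than the library's implementation.

```python
from typing import List, Optional


def augmentation_precision_sketch(
    context_relevant: List[bool],
    answer_contains_context: List[bool],
) -> Optional[float]:
    """Fraction of relevant contexts that appear in the answer.

    Returns None when no context is relevant, since the ratio is undefined
    (a choice made for this sketch).
    """
    if len(context_relevant) != len(answer_contains_context):
        raise ValueError("flag lists must have the same length")
    # Keep the "appears in answer" flag only for contexts marked relevant.
    used_flags = [
        used
        for relevant, used in zip(context_relevant, answer_contains_context)
        if relevant
    ]
    if not used_flags:
        return None
    return sum(used_flags) / len(used_flags)


print(augmentation_precision_sketch([True, True], [True, True]))  # 1.0
```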

### Augmentation Accuracy

In [16]:
from tval.scorers import AugmentationAccuracyScorer

llm_evaluator = "gpt-4"
scorer = AugmentationAccuracyScorer(llm_evaluator)

# AugmentationAccuracyScore object
augmentation_accuracy_score = scorer.score(llm_answer, retrieved_context_list)

# float between 0 and 1
score = augmentation_accuracy_score.score

# list of bools of whether content from each context in retrieved_context_list appears in the answer
answer_contains_context_list = augmentation_accuracy_score.answer_contains_context_list

In [17]:
answer_contains_context_list

[True, True]

In [18]:
score

1.0

### Answer Consistency

In [19]:
from tval.scorers import AnswerConsistencyScorer

llm_evaluator = "gpt-4"
scorer = AnswerConsistencyScorer(llm_evaluator)

# AnswerConsistencyScore object
answer_consistency_score = scorer.score(llm_answer, retrieved_context_list)

# float between 0 and 1
score = answer_consistency_score.score

# list of main points in the answer
main_point_list = answer_consistency_score.main_point_list

# list of bools of whether each main point is derived from the context
main_point_derived_from_context_list = answer_consistency_score.main_point_derived_from_context_list

In [20]:
score

1.0

In [21]:
main_point_list

['Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world.',
 'He is known for his determination and force of will, which are crucial traits for overcoming obstacles and not getting demoralized easily.',
 'Altman is also known for his strategic thinking and ambition, which make him a valuable advisor for startups.',
 'Additionally, Altman has a strong sense of imagination, allowing him to come up with surprising and innovative ideas.',
 "Finally, Altman's ability to build strong relationships and work well with others, such as his close friendship with Emmett Shear, demonstrates his ability to collaborate effectively and create a positive working environment."]

In [22]:
main_point_derived_from_context_list

[True, True, True, True, True]
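
When the score is below 1.0, it helps to see which main points were not derived from the context. The sketch below pairs the two lists to do that. `summarize_consistency` is a hypothetical helper, and treating the score as the fraction of supported main points is an assumption that happens to match the output above (five of five gives 1.0).

```python
from typing import List, Tuple


def summarize_consistency(
    main_points: List[str],
    derived_from_context: List[bool],
) -> Tuple[float, List[str]]:
    """Return (fraction of supported points, list of unsupported points)."""
    if len(main_points) != len(derived_from_context):
        raise ValueError("lists must have the same length")
    if not main_points:
        return 0.0, []  # empty answer: a choice made for this sketch
    unsupported = [
        point
        for point, supported in zip(main_points, derived_from_context)
        if not supported
    ]
    score = (len(main_points) - len(unsupported)) / len(main_points)
    return score, unsupported


# Hypothetical data: the second point is not backed by the context.
demo_score, demo_unsupported = summarize_consistency(
    ["point a", "point b", "point c", "point d"],
    [True, False, True, True],
)
print(demo_score)        # 0.75
print(demo_unsupported)  # ['point b']
```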

### Answer Consistency Binary

In [23]:
from tval.scorers import AnswerConsistencyBinaryScorer

llm_evaluator = "gpt-4"
scorer = AnswerConsistencyBinaryScorer(llm_evaluator)

# binary integer
score = scorer.score(llm_answer, retrieved_context_list)

In [24]:
score

0

### Retrieval k-Recall

For retrieval k-recall, we set up a new RAG system in which the retriever is explicit, so that we can also retrieve the top k contexts. The RAG system itself retrieves 2 contexts, and we use k=5 for the comparison.

In [25]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.query.schema import QueryBundle

# retrieval k-recall with k = 5
k = 5

documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2
)
top_k_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=k
)

query_engine = RetrieverQueryEngine(
    retriever=rag_retriever
)

Get the answer, retrieved context, and top k context.

In [26]:
response = query_engine.query(question)

# the answer text is in `response.response`; `response.metadata` holds source metadata
answer = response.response

retrieved_context_list = [x.node.text for x in response.source_nodes]

query_bundle = QueryBundle(question)
top_k_nodes = top_k_retriever.retrieve(query_bundle)
top_k_retrieved_context_list = [x.node.text for x in top_k_nodes]

In [28]:
from tval.scorers import RetrievalKRecallScorer

llm_evaluator = "gpt-4"
scorer = RetrievalKRecallScorer(llm_evaluator)

score = scorer.score(
    question, retrieved_context_list, top_k_retrieved_context_list
)

In [29]:
score.score

1.0

In [30]:
score.top_k_context_relevant_list

[True, True, False, False, False]
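
The result above can be reproduced by hand under two assumptions, neither of which this notebook confirms: that the score is the number of relevant retrieved contexts divided by the number of relevant contexts in the top k, and that the retrieved contexts are the first entries of the top k list (plausible here because both retrievers rank the same index). `k_recall_sketch` is a hypothetical function, not the library's implementation.

```python
from typing import List, Optional


def k_recall_sketch(
    top_k_context_relevant: List[bool],
    num_retrieved: int,
) -> Optional[float]:
    """Relevant retrieved contexts divided by relevant contexts in the top k.

    Assumes the retrieved contexts are the first `num_retrieved` entries of
    the top k list. Returns None when nothing in the top k is relevant,
    since the ratio is undefined (a choice made for this sketch).
    """
    if not 0 <= num_retrieved <= len(top_k_context_relevant):
        raise ValueError("num_retrieved must be between 0 and k")
    relevant_in_top_k = sum(top_k_context_relevant)
    if relevant_in_top_k == 0:
        return None
    relevant_retrieved = sum(top_k_context_relevant[:num_retrieved])
    return relevant_retrieved / relevant_in_top_k


# The flags from the output above, with the 2 contexts the RAG system retrieved.
print(k_recall_sketch([True, True, False, False, False], 2))  # 1.0
```

If the third context had been relevant instead of the second, the same call would give 0.5, which would suggest that raising `similarity_top_k` in the RAG retriever could recover a missed context.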