# Individual Scorers Example

In this notebook, we show how to use each of the scorers in `tvalmetrics.scorers`.

First, we set up a RAG system using LlamaIndex over six Paul Graham essays about founders.

In [1]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Define a simple sample question and reference answer.

In [2]:
question = "What makes Sam Altman a good founder?"
reference_answer = "He has a great force of will."

Get the response and retrieved context from the RAG system.

In [3]:
response = query_engine.query(question)

In [4]:
print(response.response)

Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world. He is known for his determination and force of will, which are crucial for overcoming obstacles and not getting demoralized easily. Altman also demonstrates flexibility, being able to modify his dreams and adapt to the unpredictable nature of startups. Additionally, he has a strong imagination, which allows him to come up with surprising and innovative ideas. Altman's naughtiness or willingness to break rules that don't matter is seen as a positive trait, as it shows his willingness to think outside the box and challenge the status quo. Lastly, Altman's ability to build strong relationships and work well with others, such as his close friendship with Emmett Shear, is also seen as a valuable quality in a founder.


Load the answer and retrieved context into the variables `llm_answer` and `retrieved_context_list`.

In [5]:
llm_answer = response.response
retrieved_context_list = [x.node.text for x in response.source_nodes]

### Answer Similarity Score

In [6]:
from tvalmetrics.scorers import AnswerSimilarityScorer

llm_evaluator = "gpt-4"
scorer = AnswerSimilarityScorer(llm_evaluator)

similarity_score = scorer.score(question, reference_answer, llm_answer) # score is a float

In [7]:
similarity_score

4.0

### Retrieval Precision

In [8]:
from tvalmetrics.scorers import RetrievalPrecisionScorer

llm_evaluator = "gpt-4"
scorer = RetrievalPrecisionScorer(llm_evaluator)

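retrieval_precision_score = scorer.score(question, retrieved_context_list)

# float between 0 and 1
score = retrieval_precision_score.score

# list of bools of whether each context in retrieved_context_list is relevant
context_relevant_list = retrieval_precision_score.context_relevant_list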

In [9]:
score

1.0

In [10]:
context_relevant_list

[True, True]
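
The score of 1.0 above is consistent with retrieval precision being the fraction of retrieved contexts that the evaluator judged relevant. The sketch below recomputes it from the list of bools under that assumption, which the notebook itself does not confirm. `fraction_true` is a hypothetical helper, not part of `tvalmetrics`.

```python
def fraction_true(flags):
    """Return the share of True entries in a list of bools (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0

# Values copied from the context_relevant_list output above.
relevant_flags = [True, True]
recomputed_precision = fraction_true(relevant_flags)
print(recomputed_precision)
```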

### Augmentation Precision

In [11]:
from tvalmetrics.scorers import AugmentationPrecisionScorer

llm_evaluator = "gpt-4"
scorer = AugmentationPrecisionScorer(llm_evaluator)

# AugmentationPrecisionScore object
augmentation_precision_score = scorer.score(question, llm_answer, retrieved_context_list)

# float between 0 and 1
score = augmentation_precision_score.score

# list of bools of whether each context in retrieved_context_list is relevant
context_relevant_list = augmentation_precision_score.context_relevant_list

# list of bools of whether each context in context_relevant_list appears in llm_answer
answer_contains_context_list = augmentation_precision_score.answer_contains_context_list

In [12]:
score

0.5

In [13]:
context_relevant_list

[True, True]

In [14]:
answer_contains_context_list

[False, True]
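
The three outputs above fit together if augmentation precision is read as: of the contexts judged relevant, the fraction whose content appears in the answer. The sketch below works through that arithmetic. It assumes the two lists are aligned by index, and `augmentation_precision` is a hypothetical helper written for illustration, not the library's implementation.

```python
def augmentation_precision(context_relevant, answer_contains_context):
    """Fraction of relevant contexts whose content shows up in the answer."""
    # Keep the "appears in answer" flag only for contexts judged relevant.
    used = [
        contains
        for relevant, contains in zip(context_relevant, answer_contains_context)
        if relevant
    ]
    # Assumption: return 0.0 when no context is relevant.
    return sum(used) / len(used) if used else 0.0

# Values copied from the outputs above: both contexts are relevant, one is used.
recomputed_score = augmentation_precision([True, True], [False, True])
print(recomputed_score)  # 0.5
```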

### Augmentation Accuracy

In [15]:
from tvalmetrics.scorers import AugmentationAccuracyScorer

llm_evaluator = "gpt-4"
scorer = AugmentationAccuracyScorer(llm_evaluator)

# AugmentationAccuracyScore object
augmentation_accuracy_score = scorer.score(llm_answer, retrieved_context_list)

# float between 0 and 1
score = augmentation_accuracy_score.score

# list of bools of whether content from each context in retrieved_context_list appears in the answer
answer_contains_context_list = augmentation_accuracy_score.answer_contains_context_list

In [16]:
answer_contains_context_list

[True, True]

In [17]:
score

1.0

### Answer Consistency

In [18]:
from tvalmetrics.scorers import AnswerConsistencyScorer

llm_evaluator = "gpt-4"
scorer = AnswerConsistencyScorer(llm_evaluator)

# AnswerConsistencyScore object
answer_consistency_score = scorer.score(llm_answer, retrieved_context_list)

# float between 0 and 1
score = answer_consistency_score.score

# list of main points in the answer
main_point_list = answer_consistency_score.main_point_list

# list of bools of whether each main point is derived from the context
main_point_derived_from_context_list = answer_consistency_score.main_point_derived_from_context_list

In [19]:
score

1.0

In [20]:
main_point_list

['Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world.',
 'He is known for his determination and force of will, which are crucial for overcoming obstacles and not getting demoralized easily.',
 'Altman also demonstrates flexibility, being able to modify his dreams and adapt to the unpredictable nature of startups.',
 'Additionally, he has a strong imagination, which allows him to come up with surprising and innovative ideas.',
 "Altman's naughtiness or willingness to break rules that don't matter is seen as a positive trait, as it shows his willingness to think outside the box and challenge the status quo.",
 "Lastly, Altman's ability to build strong relationships and work well with others, such as his close friendship with Emmett Shear, is also seen as a valuable quality in a founder."]

In [21]:
main_point_derived_from_context_list

[True, True, True, True, True, True]
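
In this run every main point is derived from the context, so there is nothing to inspect. When some flags are `False`, pairing the two lists shows which claims are unsupported. The sketch below uses made-up sample data (the second main point is hypothetical) and assumes the score is the fraction of main points derived from the context, which matches 6 of 6 giving 1.0 above.

```python
# Sample data for illustration only; the second point is a hypothetical unsupported claim.
main_points = [
    "He is known for his determination and force of will.",
    "He once ran a restaurant.",
]
derived_flags = [True, False]

# Main points the evaluator could not trace back to the retrieved context.
unsupported_points = [
    point for point, derived in zip(main_points, derived_flags) if not derived
]

# Assumed scoring rule: share of main points derived from the context.
consistency = sum(derived_flags) / len(derived_flags)

print(unsupported_points)
print(consistency)
```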

### Answer Consistency Binary

In [22]:
from tvalmetrics.scorers import AnswerConsistencyBinaryScorer

llm_evaluator = "gpt-4"
scorer = AnswerConsistencyBinaryScorer(llm_evaluator)

# binary integer (0 or 1)
score = scorer.score(llm_answer, retrieved_context_list)

In [23]:
score

0

### Retrieval k-Recall

For retrieval k-recall, we set up a new RAG system with an explicit retriever so that we can also retrieve the top k contexts. We use k = 5.

In [24]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.query.schema import QueryBundle

# retrieval k-recall with k = 5
k = 5

documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2
)
top_k_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=k
)

query_engine = RetrieverQueryEngine(
    retriever=rag_retriever
)

Get the answer, retrieved context, and top k context.

In [25]:
response = query_engine.query(question)

answer = response.response

retrieved_context_list = [x.node.text for x in response.source_nodes]

query_bundle = QueryBundle(question)
top_k_nodes = top_k_retriever.retrieve(query_bundle)
top_k_retrieved_context_list = [x.node.text for x in top_k_nodes]

In [26]:
from tvalmetrics.scorers import RetrievalKRecallScorer

llm_evaluator = "gpt-4"
scorer = RetrievalKRecallScorer(llm_evaluator)

score = scorer.score(
    question, retrieved_context_list, top_k_retrieved_context_list
)

In [27]:
score.score

1.0

In [28]:
score.top_k_context_relevant_list

[True, True, False, False, False]
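
One reading of these outputs: two of the top 5 contexts are relevant, and the RAG retriever (with `similarity_top_k=2`) returned both, so recall is 2/2 = 1.0. The sketch below implements that reading, assuming k-recall is the number of relevant top-k contexts that were retrieved divided by the number of relevant top-k contexts. The function name and the placeholder context strings are hypothetical, and returning 0.0 when nothing is relevant is an arbitrary choice.

```python
def retrieval_k_recall(retrieved, top_k, top_k_relevant):
    """Share of relevant top-k contexts that also appear in the retrieved list."""
    # Contexts among the top k that the evaluator judged relevant.
    relevant = {ctx for ctx, ok in zip(top_k, top_k_relevant) if ok}
    if not relevant:
        return 0.0  # assumption: no relevant context means a score of 0.0
    return len(relevant & set(retrieved)) / len(relevant)

# Placeholder strings stand in for the actual context texts.
top_k_contexts = ["c1", "c2", "c3", "c4", "c5"]
retrieved_contexts = ["c1", "c2"]
relevant_flags = [True, True, False, False, False]  # copied from the output above

recall = retrieval_k_recall(retrieved_contexts, top_k_contexts, relevant_flags)
print(recall)  # 1.0
```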