# Llama Index Quick Start Example

In the spirit of [Llama Index's starter tutorial](https://gpt-index.readthedocs.io/en/stable/getting_started/starter_example.html) and Andrej Karpathy's [Unreasonable Effectiveness of RNNs blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), we start with an example of a RAG system where the document set consists of Paul Graham essays. (Footnote: The Paul Graham essay text files used were derived from the dataset of Paul Graham essays found in the [paul-graham-gpt](https://github.com/mckaywrigley/paul-graham-gpt) GitHub project.)

In this notebook, we set up a simple RAG system using LlamaIndex and evaluate it using 10 predetermined questions and answers about Paul Graham essays. The document set consists of the 6 Paul Graham essays that have the word "founder" in the title.

In [1]:
import json

# llama index imports
from llama_index import VectorStoreIndex, SimpleDirectoryReader
# tval imports
from tval import RagScoresCalculator

Set up a simple LlamaIndex RAG system that uses the default LlamaIndex parameters. By default, LlamaIndex uses OpenAI's ada-002 embedding model as the embedder and gpt-3.5-turbo as the LLM.

In [2]:
documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Load 10 questions and answers about the Paul Graham essays as a benchmark for how the RAG system should answer questions.

In [3]:
with open("question_and_answer_list.json", "r") as f:
    question_and_answer_list = json.load(f)
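
The loading cell above expects a JSON file that holds a list of objects, each with a `"question"` and an `"answer"` key (the two keys this notebook reads). As a minimal sketch of that layout, the snippet below writes one entry to a file and reads it back. The file name `example_q_and_a.json` and the temporary directory are hypothetical, and the entry is the Sam Altman example inspected in this notebook; the real benchmark file may contain additional fields.

```python
import json
import os
import tempfile

# One benchmark entry; "question" and "answer" are the keys this notebook reads.
example_entries = [
    {
        "question": "What makes Sam Altman a good founder?",
        "answer": "He has a great force of will.",
    }
]

# Hypothetical file name, written to a temporary directory for illustration only.
path = os.path.join(tempfile.mkdtemp(), "example_q_and_a.json")
with open(path, "w") as f:
    json.dump(example_entries, f, indent=2)

# Read the file back the same way the notebook loads its benchmark.
with open(path, "r") as f:
    loaded = json.load(f)

# Sanity check: every entry carries both required keys.
for entry in loaded:
    assert {"question", "answer"} <= entry.keys()
```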

Let's inspect an example: the question, the answer from the RAG system, and the reference answer.

In [4]:
ex_q_and_a = question_and_answer_list[0]

In [5]:
ex_q_and_a["question"]

'What makes Sam Altman a good founder?'

In [6]:
response = query_engine.query(ex_q_and_a["question"])

In [7]:
print(response.response)

Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world. He is known for his determination and force of will, which are crucial for overcoming obstacles and staying motivated in the face of challenges. Altman is also known for his strategic thinking and ambition, making him a valuable resource for startups seeking advice on strategy and ambition. Additionally, Altman's ability to think outside the box and come up with innovative ideas sets him apart as a founder. Overall, Altman's combination of determination, strategic thinking, and creativity make him a highly regarded founder in the startup community.


In [8]:
ex_q_and_a["answer"]

'He has a great force of will.'

Set up the score calculator from Tonic Validate. `RagScoresCalculator` calculates the **answer similarity score**, **retrieval precision**, **augmentation precision**, **augmentation accuracy**, **answer consistency**, and the **overall score**.

Get the scores for the example question and answer about Sam Altman.

In [9]:
# can use gpt-4 or gpt-3.5-turbo
llm_evaluator = "gpt-4"
# llm_evaluator = "gpt-3.5-turbo"
score_calculator = RagScoresCalculator(llm_evaluator)

In [10]:
question = ex_q_and_a["question"]
reference_answer = ex_q_and_a["answer"]
llm_answer = response.response
context_list = [source_node.node.text for source_node in response.source_nodes]

In [11]:
scores_object = score_calculator.score(question, reference_answer, llm_answer, context_list)

`scores_object` is a dataclass object that holds the scores and the inputs to `score_calculator.score`. Scores that are not calculated are recorded as `None`. To see just the numeric scores, use `scores_to_dict()`.

In [12]:
scores_object.scores_to_dict()

{'answer_similarity_score': [5.0],
 'retrieval_precision': [1.0],
 'augmentation_precision': [1.0],
 'augmentation_accuracy': [1.0],
 'answer_consistency_binary': [None],
 'answer_consistency': [1.0],
 'retrieval_k_recall': [None],
 'overall_score': [1.0]}
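
In the output above, each value is a one-element list, and scores that were not calculated appear as `[None]`. A minimal sketch of flattening that dictionary and dropping the uncalculated scores follows; the dictionary literal is copied from the output above, and the name `calculated` is only illustrative.

```python
# Output of scores_to_dict() as displayed above: each value is a one-element list.
scores = {
    "answer_similarity_score": [5.0],
    "retrieval_precision": [1.0],
    "augmentation_precision": [1.0],
    "augmentation_accuracy": [1.0],
    "answer_consistency_binary": [None],
    "answer_consistency": [1.0],
    "retrieval_k_recall": [None],
    "overall_score": [1.0],
}

# Keep only the scores that were actually calculated, unwrapping the lists.
calculated = {
    name: values[0] for name, values in scores.items() if values[0] is not None
}

for name, value in calculated.items():
    print(f"{name}: {value}")
```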

We can see the inputs to `score_calculator.score` and the output scores by using `to_dataframe()`.

In [13]:
scores_object.to_dataframe()

| | question | reference_answer | llm_answer | retrieved_context | answer_similarity_score | retrieval_precision | augmentation_precision | augmentation_accuracy | answer_consistency | overall_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | He has a great force of will. | Sam Altman is considered a good founder becaus... | [Five Founders\n\nApril 2009\n\nInc recently a... | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


We can specify which scores we would like to calculate.

In [14]:
score_calculator = RagScoresCalculator(
    model=llm_evaluator,
    retrieval_precision=True,
    augmentation_precision=True,
    augmentation_accuracy=True,
    answer_consistency=True
)

Retrieval precision, augmentation precision, augmentation accuracy, and answer consistency do not use the reference answer, so we do not need to pass it to `score_calculator.score`.

In [15]:
scores_object = score_calculator.score(
    question=question,
    llm_answer=llm_answer,
    retrieved_context_list=context_list
)

In [16]:
scores_object.scores_to_dict()

{'answer_similarity_score': [None],
 'retrieval_precision': [1.0],
 'augmentation_precision': [1.0],
 'augmentation_accuracy': [1.0],
 'answer_consistency_binary': [None],
 'answer_consistency': [1.0],
 'retrieval_k_recall': [None],
 'overall_score': [1.0]}

In [17]:
scores_object.to_dataframe()

| | question | llm_answer | retrieved_context | retrieval_precision | augmentation_precision | augmentation_accuracy | answer_consistency | overall_score |
|---|---|---|---|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | Sam Altman is considered a good founder becaus... | [Five Founders\n\nApril 2009\n\nInc recently a... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Answer all 10 questions and then get the batch score.

In [18]:
score_calculator = RagScoresCalculator(llm_evaluator)

In [19]:
for q_and_a in question_and_answer_list:
    response = query_engine.query(q_and_a["question"])
    q_and_a["llm_answer"] = response.response
    q_and_a["context_list"] = [source_node.node.text for source_node in response.source_nodes]

In [20]:
question_list = [q_and_a["question"] for q_and_a in question_and_answer_list]
reference_answer_list = [q_and_a["answer"] for q_and_a in question_and_answer_list]
llm_answer_list = [q_and_a["llm_answer"] for q_and_a in question_and_answer_list]
context_list_list = [q_and_a["context_list"] for q_and_a in question_and_answer_list]

In [21]:
batch_scores = score_calculator.score_batch(
    question_list,
    reference_answer_list,
    llm_answer_list,
    context_list_list
)

The `mean_scores()` method calculates the mean of each score over the batch.

In [22]:
batch_scores.mean_scores()

{'answer_similarity_score': 4.4,
 'retrieval_precision': 0.6,
 'augmentation_precision': 0.8,
 'augmentation_accuracy': 0.55,
 'answer_consistency_binary': None,
 'answer_consistency': 1.0,
 'retrieval_k_recall': None,
 'overall_score': 0.766}
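
To make the averaging concrete, here is a minimal sketch of a per-score mean over a batch. It is not tval's implementation: the helper `mean_of_scores` is hypothetical, the per-question values are illustrative rather than taken from this notebook's results, and the choice to skip `None` entries (and return `None` when a score was never calculated) is an assumption modeled on the output above.

```python
def mean_of_scores(rows):
    """Average each score over a batch, ignoring scores recorded as None.

    Hypothetical helper for illustration; returns None for a score that
    was not calculated for any row.
    """
    means = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        means[name] = sum(values) / len(values) if values else None
    return means


# Illustrative per-question scores (not this notebook's results).
example_rows = [
    {"retrieval_precision": 1.0, "augmentation_precision": 1.0, "retrieval_k_recall": None},
    {"retrieval_precision": 0.5, "augmentation_precision": 1.0, "retrieval_k_recall": None},
]

example_means = mean_of_scores(example_rows)
print(example_means)
```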

To inspect the batch scores individually, we can put them into a dataframe that includes, in addition to the scores, the question, reference answer, LLM answer, and context.

In [23]:
scores_df = batch_scores.to_dataframe()

In [24]:
scores_df.head()

| | question | reference_answer | llm_answer | retrieved_context | answer_similarity_score | retrieval_precision | augmentation_precision | augmentation_accuracy | answer_consistency | overall_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | He has a great force of will. | Sam Altman is considered a good founder becaus... | [Five Founders\n\nApril 2009\n\nInc recently a... | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.96 |
| 1 | When was the essay "Five Founders" written? | April 2009 | The essay "Five Founders" was written in April... | [Five Founders\n\nApril 2009\n\nInc recently a... | 5.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.8 |
| 2 | When does the most dramatic growth happen for ... | When the startup only has three or four people. | The most dramatic growth for a startup typical... | [Learning from Founders\n\nJanuary 2007\n\n(Fo... | 5.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.8 |
| 3 | What is the problem with business culture vers... | In business culture, energy is expended on out... | The problem with business culture versus start... | [Learning from Founders\n\nJanuary 2007\n\n(Fo... | 5.0 | 1.0 | 0.5 | 0.5 | 1.0 | 0.8 |
| 4 | What's the single biggest thing the government... | Establish a new class of visa for startup foun... | The single biggest thing the government could ... | [The Founder Visa\n\nApril 2009\n\nI usually a... | 5.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.8 |
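
The overall scores shown in this notebook are consistent with a simple rule: divide the answer similarity score by 5 to put it on a 0 to 1 scale, then average it with the other calculated scores. This is an inference from the displayed numbers, not a documented formula, and the helper `overall_from_scores` below is hypothetical. The sketch checks the rule against the first two rows above and against the batch means reported earlier.

```python
import math


def overall_from_scores(answer_similarity, other_scores):
    """Hypothetical reconstruction of the overall score.

    Scales answer similarity from 0-5 to 0-1, then averages it with
    the remaining scores. Inferred from the displayed values only.
    """
    normalized = [answer_similarity / 5.0] + list(other_scores)
    return sum(normalized) / len(normalized)


# Rows 0 and 1 of the dataframe above: similarity, then the four 0-1 scores.
row0 = overall_from_scores(4.0, [1.0, 1.0, 1.0, 1.0])
row1 = overall_from_scores(5.0, [0.5, 1.0, 0.5, 1.0])

# Batch means reported by mean_scores() earlier in the notebook.
batch = overall_from_scores(4.4, [0.6, 0.8, 0.55, 1.0])

print(round(row0, 3), round(row1, 3), round(batch, 3))
```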
