# Llama Index Quick Start Example

In the spirit of [Llama Index's starter tutorial](https://gpt-index.readthedocs.io/en/stable/getting_started/starter_example.html) and Andrej Karpathy's [Unreasonable Effectiveness of RNNs blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), we start with an example of a RAG system where the document set consists of Paul Graham essays. (Footnote: The Paul Graham essay text files used were derived from the dataset of Paul Graham essays found in [paul-graham-gpt](https://github.com/mckaywrigley/paul-graham-gpt) github project.)

In this notebook, we set up a simple RAG system using llama index, and evaluate the RAG system using 10 predetermined questions and answers about Paul Graham essays. The Paul Graham essays that make up the document set are the 6 essays that have the word founder in the title.

In [1]:
import json

import pandas as pd
# llama index imports
from llama_index import VectorStoreIndex, SimpleDirectoryReader
# tvalmetrics imports
# scorer
from tvalmetrics.validate_scorer import ValidateScorer 
# metrics
from tvalmetrics.metrics.answer_consistency_metric import AnswerConsistencyMetric
from tvalmetrics.metrics.answer_similarity_metric import AnswerSimilarityMetric
from tvalmetrics.metrics.augmentation_accuracy_metric import AugmentationAccuracyMetric
from tvalmetrics.metrics.augmentation_precision_metric import AugmentationPrecisionMetric
from tvalmetrics.metrics.retrieval_precision_metric import RetrievalPrecisionMetric
# llm utils
from tvalmetrics.classes.llm_response import LLMResponse
from tvalmetrics.classes.benchmark_item import BenchmarkItem

Set up a simple llama index RAG system that uses the default LlamaIndex parameters. The default LlamaIndex parameters use Open AIs ada-002 embedding model as the embedder and gpt-3.5-turbo as the LLM.

In [2]:
documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Load 10 questions and answers about the Paul Graham essays as a benchmark for how the RAG system should answer questions.

In [3]:
with open("question_and_answer_list.json", "r") as f:
    question_and_answer_list = json.load(f)

Let's inspect an example, question, answer from the RAG system and reference answer.

In [4]:
ex_q_and_a = question_and_answer_list[0]

In [5]:
ex_q_and_a["question"]

'What makes Sam Altman a good founder?'

In [6]:
response = query_engine.query(ex_q_and_a["question"])

In [7]:
print(response.response)

Sam Altman is considered a good founder because he possesses qualities that are highly valued in the startup world. These qualities include determination, flexibility, imagination, naughtiness, and the ability to build strong relationships. Sam Altman's force of will and determination make him someone who is likely to achieve his goals and overcome obstacles. He also demonstrates flexibility by being willing to modify his ideas and adapt to changing circumstances. Altman's imagination allows him to come up with innovative and surprising ideas, which is crucial in the startup world where good ideas may initially seem bad. Additionally, Altman has a piratical gleam in his eye, meaning he is not afraid to break rules that don't matter and think outside the box. Finally, Altman understands the importance of friendship and strong relationships between founders, as evidenced by his close friendship and successful working relationship with others.


In [8]:
ex_q_and_a["answer"]

'He has a great force of will.'

Set up the scorer from Tonic Validate. ScoreCalculator will calculate **answer similarity score**, **retrieval precision**, **augmentation precision**, **augmentation accuracy**, and **answer consistency**.

In [9]:
metrics = [
    AnswerSimilarityMetric(),
    RetrievalPrecisionMetric(),
    AugmentationAccuracyMetric(),
    AugmentationPrecisionMetric(),
    AnswerConsistencyMetric()
]
# can use an OpenAI chat completion model
# llm_evaluator = "gpt-3.5-turbo"
llm_evaluator = "gpt-4-1106-preview"
validate_scorer = ValidateScorer(
    metrics, llm_evaluator
)

To test out the scorer, first we'll score the example question and answer about Sam Altman.

In [10]:
# example BenchmarkItem
question = ex_q_and_a["question"]
reference_answer = ex_q_and_a["answer"]
benchmark_item = BenchmarkItem(
    question=question,
    reference_answer=reference_answer
)

# example LLMResponse
llm_answer = response.response
context_list = [source_node.node.text for source_node in response.source_nodes]
llm_response = LLMResponse(
    llm_answer=llm_answer,
    llm_context_list=context_list,
    benchmark_item=benchmark_item
)

responses = [llm_response]

response_scores = validate_scorer.score_run(responses)

Package the scores into a simple dictionary.

In [11]:
def make_score_dictionary(score_list):
    score_dict = {}
    for score in score_list:
        score_dict[score.metric_name] = score.score
    return score_dict

In [12]:
score_dict = make_score_dictionary(response_scores[0])
score_dict

{'answer_similarity': 4.0,
 'retrieval_precision': 1.0,
 'augmentation_accuracy': 0.5,
 'augmentation_precision': 0.5,
 'answer_consistency': 1.0}

Answer all 10 questions and get scores. 

In [13]:
responses = []

for q_and_a in question_and_answer_list:
    query_response = query_engine.query(q_and_a["question"])

    benchmark_item = BenchmarkItem(
        question=q_and_a["question"],
        reference_answer=q_and_a["answer"]
    )

    llm_response = LLMResponse(
        llm_answer=query_response.response,
        llm_context_list=[source_node.node.text for source_node in response.source_nodes],
        benchmark_item=benchmark_item
    )

    responses.append(llm_response)

In [14]:
response_scores = validate_scorer.score_run(responses)

Put the scores into a dataframe for easy viewing.

In [15]:
def make_scores_df(response_scores):
    scores_df = {
        "question": [],
        "reference_answer": [],
        "llm_answer": [],
        "retrieved_context": []
    }
    for score in response_scores[0]:
        scores_df[score.metric_name] = []
    for score_list in response_scores:
        scores_df["question"].append(score_list[0].llm_response.benchmark_item.question)
        scores_df["reference_answer"].append(score_list[0].llm_response.benchmark_item.reference_answer)
        scores_df["llm_answer"].append(score_list[0].llm_response.llm_answer)
        scores_df["retrieved_context"].append(score_list[0].llm_response.llm_context_list)
        for score in score_list:
            scores_df[score.metric_name].append(score.score)
    return pd.DataFrame(scores_df)
            

In [16]:
scores_df = make_scores_df(response_scores)

In [17]:
scores_df.head()

Unnamed: 0,question,reference_answer,llm_answer,retrieved_context,answer_similarity,retrieval_precision,augmentation_accuracy,augmentation_precision,answer_consistency
0,What makes Sam Altman a good founder?,He has a great force of will.,Sam Altman is considered a good founder becaus...,[Five Founders\n\nApril 2009\n\nInc recently a...,4.0,1.0,1.0,1.0,0.8
1,"When was the essay ""Five Founders"" written?",April 2009,April 2009,[Five Founders\n\nApril 2009\n\nInc recently a...,5.0,0.5,0.5,1.0,1.0
2,When does the most dramatic growth happen for ...,When the startup only has three or four people.,The most dramatic growth for a startup typical...,[Five Founders\n\nApril 2009\n\nInc recently a...,5.0,0.0,0.0,0.0,0.0
3,What is the problem with business culture vers...,"In business culture, energy is expended on out...",The problem with business culture versus start...,[Five Founders\n\nApril 2009\n\nInc recently a...,5.0,0.0,0.0,0.0,0.0
4,What's the single biggest thing the government...,Establish a new class of visa for startup foun...,Establish a new class of visa for startup foun...,[Five Founders\n\nApril 2009\n\nInc recently a...,5.0,0.0,0.0,0.0,0.0
