# Llama Index x Tonic Validate Webinar

In the spirit of [Llama Index's starter tutorial](https://gpt-index.readthedocs.io/en/stable/getting_started/starter_example.html) and Andrej Karpathy's [Unreasonable Effectiveness of RNNs blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), we start with an example of a RAG system whose document set consists of Paul Graham essays. (Footnote: The Paul Graham essay text files were derived from the dataset of Paul Graham essays in the [paul-graham-gpt](https://github.com/mckaywrigley/paul-graham-gpt) GitHub project.)

In this notebook, we set up a simple RAG system using LlamaIndex, Llama 2 (`llama2:70b-chat`, served through Ollama), and the `BAAI/bge-small-en-v1.5` embedding model. We then evaluate the RAG system using 10 predetermined questions and answers about Paul Graham essays. The document set is the six Paul Graham essays that have the word "founder" in the title.

Set up a simple LlamaIndex RAG system. By default, LlamaIndex uses OpenAI's ada-002 embedding model as the embedder and gpt-3.5-turbo as the LLM. Here we override both defaults through `Settings`: `BAAI/bge-small-en-v1.5` is the embedder and `llama2:70b-chat`, served by Ollama, is the LLM.

In [1]:
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tonic_validate import CallbackLLMResponse

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

Settings.llm = Ollama(model="llama2:70b-chat", base_url="http://54.235.13.59:8080", request_timeout=180.0)

documents = SimpleDirectoryReader("../paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Gets the response from llama index
def get_llama_response(prompt) -> CallbackLLMResponse:
    response = query_engine.query(prompt)
    # Check response is of type Response
    if not isinstance(response, Response):
        raise ValueError(f"Expected Response, got {type(response)}")
    
    context = [x.text for x in response.source_nodes]
    answer = response.response
    if answer is None:
        raise ValueError("No response from Llama")
    
    return {
        "llm_answer": answer,
        "llm_context_list": context
    }

Load 10 questions and answers about the Paul Graham essays as a benchmark for how the RAG system should answer questions.

In [2]:
import json
qa_pairs = []
with open("../question_and_answer_list.json", "r") as qa_file:
    qa_pairs = json.load(qa_file)[:10]

Let's inspect one example: its question, its reference answer, and the answer the RAG system gives.

In [3]:
example_qa = qa_pairs[0]
example_qa

{'question': 'What makes Sam Altman a good founder?',
 'answer': 'He has a great force of will.',
 'reference_article': 'Five Founders',
 'reference_text': '5. Sam Altman\n\nI was told I shouldn\'t mention founders of YC-funded companies in this list. But Sam Altman can\'t be stopped by such flimsy rules. If he wants to be on this list, he\'s going to be.\n\nHonestly, Sam is, along with Steve Jobs, the founder I refer to most when I\'m advising startups. On questions of design, I ask "What would Steve do?" but on questions of strategy or ambition I ask "What would Sama do?"\n\nWhat I learned from meeting Sama is that the doctrine of the elect applies to startups. It applies way less than most people think: startup investing does not consist of trying to pick winners the way you might in a horse race. But there are a few people with such force of will that they\'re going to get whatever they want.'}
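
The record above has four keys. As a minimal sketch, a check like the one below could confirm that every loaded pair carries them before the benchmark is built. The helper name `validate_qa_pairs` and the in-memory sample are hypothetical and not part of the original notebook; the sample's `reference_text` is abbreviated.

```python
import json

# Keys seen in the example record printed above.
REQUIRED_KEYS = {"question", "answer", "reference_article", "reference_text"}


def validate_qa_pairs(qa_pairs):
    """Return (index, missing_keys) for every record that lacks a required key."""
    problems = []
    for i, pair in enumerate(qa_pairs):
        missing = REQUIRED_KEYS - set(pair)
        if missing:
            problems.append((i, sorted(missing)))
    return problems


# Hypothetical sample standing in for the contents of the JSON file:
# one complete record and one deliberately incomplete record.
sample = json.loads(
    """
    [
      {"question": "What makes Sam Altman a good founder?",
       "answer": "He has a great force of will.",
       "reference_article": "Five Founders",
       "reference_text": "5. Sam Altman"},
      {"question": "A question without an answer"}
    ]
    """
)

problems = validate_qa_pairs(sample)
print(problems)
```

In the notebook itself, the same call would be `validate_qa_pairs(qa_pairs)`, and an empty list would mean every pair is complete.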

In [4]:
get_llama_response(example_qa["question"])

{'llm_answer': "Sam Altman's ability to make deals and his determination are qualities that make him a good founder. According to Paul Graham's essay, Sam Altman has the quality of being able to take care of himself and his company, which investors like to see in startups. Additionally, Graham notes that successful startup founders are often intellectually contrarian, meaning they can spot novel ideas that others may not see as good or valuable. Altman's ability to make deals and bring people together suggests he possesses this quality. Finally, Graham states that the most reliable way to become a billionaire is by starting a company that grows fast and focusing on making what users want, which Altman has achieved with his company, OpenAI.",
 'llm_context_list': ['You can\'t plan when you start a startup how long it will take to become profitable. But if you find yourself in a position where a little more effort expended on sales would carry you over the threshold of ramen profitable, 

Now let's set up the benchmark for Tonic Validate using the QA pairs.

In [5]:
from tonic_validate import Benchmark
question_list = [qa_pair['question'] for qa_pair in qa_pairs]
answer_list = [qa_pair['answer'] for qa_pair in qa_pairs]

benchmark = Benchmark(questions=question_list, answers=answer_list)

Run each benchmark question through the RAG system and collect the responses so that Tonic Validate can score the run.

In [6]:
# Save the responses into an array for scoring
from tonic_validate import LLMResponse


responses = []
for item in benchmark:
    rag_response = get_llama_response(item.question)
    llm_response = LLMResponse(
        llm_answer=rag_response["llm_answer"],
        llm_context_list=rag_response["llm_context_list"],
        benchmark_item=item
    )
    responses.append(llm_response)

In [7]:
# Save the responses to a pickle file
import os
import pickle
# Check if pickle file exists
file_name = "llm_responses.pkl"
if not os.path.exists(file_name):
    with open(file_name, "wb") as f:
        pickle.dump(responses, f)
        

In [4]:
# load the responses from the pickle file
import pickle
with open("llm_responses.pkl", "rb") as f:
    responses = pickle.load(f)
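
The two cells above cache the responses so the slow generation loop does not have to be rerun. The same load-or-compute pattern can be sketched as a single helper. The name `load_or_compute` and the stand-in `fake_compute` function are hypothetical; in the notebook, the compute step would be the loop that builds `responses`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result from `path`; otherwise call `compute()` and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for the slow response loop; records how often it actually runs.
calls = []


def fake_compute():
    calls.append(1)
    return ["response-1", "response-2"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "llm_responses.pkl")
    first = load_or_compute(cache_path, fake_compute)   # computes and writes the cache
    second = load_or_compute(cache_path, fake_compute)  # reads the cache instead

print(first == second, len(calls))
```

As with any pickle file, only load caches you created yourself, since unpickling can execute arbitrary code.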

In [5]:
from tonic_validate import ValidateScorer
import os

os.environ["OPENAI_BASE_URL"] = "http://54.235.13.59:8080/v1"
scorer = ValidateScorer(model_evaluator="llama2:70b-chat")
response_scores = scorer.score_responses(responses)

Scoring responses: 100%|██████████| 10/10 [12:05<00:00, 72.54s/it]


Put the scores into a dataframe for easy viewing.

In [10]:
import pandas as pd

def make_scores_df(response_scores):
    scores_df = {
        "question": [],
        "reference_answer": [],
        "llm_answer": [],
        "retrieved_context": []
    }
    for score_name in response_scores.overall_scores:
        scores_df[score_name] = []
    for data in response_scores.run_data:
        scores_df["question"].append(data.reference_question)
        scores_df["reference_answer"].append(data.reference_answer)
        scores_df["llm_answer"].append(data.llm_answer)
        scores_df["retrieved_context"].append(data.llm_context)
        for score_name, score in data.scores.items():
            scores_df[score_name].append(score)
    return pd.DataFrame(scores_df)
            

In [7]:
scores_df = make_scores_df(response_scores)

In [8]:
scores_df.head()

| | question | reference_answer | llm_answer | retrieved_context | answer_similarity | augmentation_precision | answer_consistency |
|---|---|---|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | He has a great force of will. | According to Paul Graham's essay, Sam Altman i... | [You can't plan when you start a startup how l... | 4.0 | 1.0 | 0.666667 |
| 1 | When was the essay "Five Founders" written? | April 2009 | The essay "Five Founders" was written in April... | [Written by Paul Graham\r\n\r\nFive Founders\r... | 5.0 | 1.0 | 1.0 |
| 2 | When does the most dramatic growth happen for ... | When the startup only has three or four people. | According to the provided text, the most drama... | [Written by Paul Graham\r\n\r\nStartup = Growt... | 2.0 | 1.0 | 0.5 |
| 3 | What is the problem with business culture vers... | In business culture, energy is expended on out... | According to the provided text, the issue with... | [Written by Paul Graham\r\n\r\nLearning from F... | 5.0 | 1.0 | 0.8 |
| 4 | What's the single biggest thing the government... | Establish a new class of visa for startup foun... | According to the provided text, the single big... | [Written by Paul Graham\r\n\r\nThe Founder Vis... | 5.0 | 1.0 | 1.0 |
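
To summarize rows like the ones above, each metric can be averaged on its own. The sketch below uses only the five rows displayed here (not the full run of 10), copied from the output, and a hypothetical helper named `mean_scores`.

```python
from statistics import mean

# Metric columns of the five rows shown above.
rows = [
    {"answer_similarity": 4.0, "augmentation_precision": 1.0, "answer_consistency": 0.666667},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 1.0},
    {"answer_similarity": 2.0, "augmentation_precision": 1.0, "answer_consistency": 0.5},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 0.8},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 1.0},
]


def mean_scores(rows):
    """Average each metric across rows; assumes every row has the same keys."""
    metrics = list(rows[0].keys())
    return {m: mean(r[m] for r in rows) for m in metrics}


summary = mean_scores(rows)
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

The metrics appear to sit on different scales (`answer_similarity` reaches 5.0 here while the other two top out at 1.0), so they are averaged per column rather than combined. With the notebook's dataframe, `scores_df.mean(numeric_only=True)` should give the equivalent per-metric averages over all rows.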


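Optionally, upload the scored run to Tonic Validate. Replace `"project-id"` with the ID of your own project.

In [None]: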
from tonic_validate import ValidateApi

validate_api = ValidateApi()
validate_api.upload_run("project-id", response_scores)