# Llama Index x Tonic Validate Webinar

## Setting Up Llama Index

### Setting up local embedding

`BAAI/bge-small-en-v1.5` is an embedding model that runs locally and replaces the default OpenAI embedding model. It is well suited to retrieval for RAG and performs well with Llama Index. Because the model runs locally, we avoid sending our private data to a remote server.

In [None]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Set the default embedding model to BAAI/bge-small-en-v1.5
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
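
To make the role of the embedding model concrete, here is a minimal sketch of how retrieval ranks text by vector similarity. The three-dimensional vectors and chunk names are made up for illustration; real `bge-small-en-v1.5` embeddings have 384 dimensions, and Llama Index handles this step internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not real model output
chunks = {
    "essay about startups": [0.9, 0.1, 0.0],
    "essay about painting": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the direction closest to the query
best_chunk = max(chunks, key=lambda name: cosine_similarity(query_vector, chunks[name]))
print(best_chunk)
```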

### Setting up Ollama

Ollama is a tool for running models locally. To use Ollama with LlamaIndex, we set the default LLM to a model served by Ollama. Here we run Llama2 70b, which we chose because of its ability to follow instructions better than smaller models. Because of the model's size, we run it on a separate server with 4 A10G GPUs. We also raise the request timeout to 180 seconds because the 70b version of Llama2 takes a long time to respond.

In [None]:
from llama_index.llms.ollama import Ollama
import os

Settings.llm = Ollama(model="llama2:70b-chat", base_url=os.getenv("OLLAMA_URL"), request_timeout=180.0)
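
Note that `os.getenv("OLLAMA_URL")` returns `None` when the variable is not set. A small helper like the one sketched here can supply a fallback. The helper name `resolve_ollama_url` is hypothetical, and the fallback assumes a local Ollama instance on its default port, 11434.

```python
import os

# Assumption: fall back to a local Ollama instance on its default port
DEFAULT_OLLAMA_URL = "http://localhost:11434"

def resolve_ollama_url(env=os.environ):
    """Return the URL in OLLAMA_URL, or the local default if it is unset or empty."""
    url = env.get("OLLAMA_URL", "").strip()
    # Drop a trailing slash so the URL can be joined with API paths consistently
    return url.rstrip("/") if url else DEFAULT_OLLAMA_URL

print(resolve_ollama_url({"OLLAMA_URL": "http://gpu-server:11434/"}))
print(resolve_ollama_url({}))
```

The result could then be passed as `base_url` when constructing the `Ollama` LLM.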

### Loading the data into Llama Index

First, we load our data for RAG into Llama Index. Our data is a collection of Paul Graham's essays, and we will ask questions about them.

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("../paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)

Now we can set up our query engine and write a simple function that returns the query engine's results in a format Tonic Validate can understand.

In [1]:
from llama_index.core import Response
from tonic_validate import CallbackLLMResponse

query_engine = index.as_query_engine()

# Gets the response from llama index in a format Tonic Validate can understand
def get_llama_response(prompt) -> CallbackLLMResponse:
    response = query_engine.query(prompt)
    # Check response is of type Response
    if not isinstance(response, Response):
        raise ValueError(f"Expected Response, got {type(response)}")
    
    # Get the response and context from the Llama index
    context = [x.text for x in response.source_nodes]
    answer = response.response
    if answer is None:
        raise ValueError("No response from Llama")
    
    return {
        "llm_answer": answer,
        "llm_context_list": context
    }
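
The extraction logic can be checked without a running LLM. The sketch below uses stand-in classes (`FakeNode` and `FakeResponse`, both hypothetical) that expose only the two attributes the function reads, `response` and `source_nodes`. It is not the real Llama Index `Response` type.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    text: str  # stands in for a retrieved source node

@dataclass
class FakeResponse:
    response: str
    source_nodes: list = field(default_factory=list)

def to_validate_payload(response):
    """Extract the answer and retrieved context, mirroring get_llama_response."""
    if response.response is None:
        raise ValueError("No response from Llama")
    return {
        "llm_answer": response.response,
        "llm_context_list": [node.text for node in response.source_nodes],
    }

fake = FakeResponse("Paris", [FakeNode("France's capital is Paris.")])
payload = to_validate_payload(fake)
print(payload)
```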


### Asking questions to Llama Index

Now that we have Llama Index set up, we can load the questions we want to ask about the Paul Graham essays. The following code loads the first 10 questions from a JSON file. Each question also has a reference answer, which represents the ideal answer to that question. For instance, for the question "What is the capital of France?", the reference answer would be "Paris".

In [5]:
import json
qa_pairs = []
with open("../question_and_answer_list.json", "r") as qa_file:
    qa_pairs = json.load(qa_file)[:10]
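
Before sending the questions to the model, it can help to confirm that the file has the expected shape. This is a minimal sketch, assuming each entry is an object with string `question` and `answer` fields. The helper `validate_qa_pairs` is hypothetical, and the sample data is inline rather than read from the real file.

```python
import json

REQUIRED_KEYS = ("question", "answer")

def validate_qa_pairs(pairs):
    """Raise ValueError unless every entry has non-empty string question and answer fields."""
    for position, pair in enumerate(pairs):
        if not isinstance(pair, dict):
            raise ValueError(f"entry {position} is not an object")
        for key in REQUIRED_KEYS:
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {position} has no usable '{key}'")
    return pairs

# Inline sample standing in for the contents of the JSON file
sample = json.loads('[{"question": "What is the capital of France?", "answer": "Paris"}]')
validated = validate_qa_pairs(sample)
print(len(validated))
```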

Let's view the questions and answers we just loaded from the JSON file.

In [7]:
def print_qa_pair(qa_pair):
    print(f"Question: {qa_pair['question']}")
    print(f"Answer: {qa_pair['answer']}")
    print()

In [8]:
for qa_pair in qa_pairs:
    print_qa_pair(qa_pair)

Question: What makes Sam Altman a good founder?
Answer: He has a great force of will.

Question: When was the essay "Five Founders" written?
Answer: April 2009

Question: When does the most dramatic growth happen for a startup?
Answer: When the startup only has three or four people.

Question: What is the problem with business culture versus start up culture with respect to productivity?
Answer: In business culture, energy is expended on outward appearance to the detriment of productivity, while in startup culture there is no value of appearance it's all about productivity.

Question: What's the single biggest thing the government could do to increase the number of startups in this country?
Answer: Establish a new class of visa for startup founders.

Question: How could one create a rigorous government definition of what a startup is to categorize whether a business is a startup?
Answer: One could define a startup as a company that has received investment by recognized startup investor

Let's ask Llama Index one of the questions we loaded to see the quality of its response.

In [9]:
example_qa = qa_pairs[0]
print_qa_pair(example_qa)

Question: What makes Sam Altman a good founder?
Answer: He has a great force of will.



In [10]:
get_llama_response(example_qa["question"])


## Using Tonic Validate

Now let's set up Tonic Validate to score the responses. First, we set up a benchmark in Tonic Validate. A benchmark is a list of questions and reference answers that we use to score response quality. We will build it from the QA pairs we loaded earlier.

In [5]:
from tonic_validate import Benchmark
question_list = [qa_pair['question'] for qa_pair in qa_pairs]
answer_list = [qa_pair['answer'] for qa_pair in qa_pairs]

benchmark = Benchmark(questions=question_list, answers=answer_list)

Now we can run through the questions, collect the response to each one, and then score the response quality with Tonic Validate.

In [6]:
# Save the responses into an array for scoring
from tonic_validate import LLMResponse


responses = []
for item in benchmark:
    rag_response = get_llama_response(item.question)
    llm_response = LLMResponse(
        llm_answer=rag_response["llm_answer"],
        llm_context_list=rag_response["llm_context_list"],
        benchmark_item=item
    )
    responses.append(llm_response)

In [7]:
# Save the responses to a pickle file
import os
import pickle
# Check if pickle file exists
file_name = "llm_responses.pkl"
if not os.path.exists(file_name):
    with open(file_name, "wb") as f:
        pickle.dump(responses, f)
        

In [4]:
# load the responses from the pickle file
import pickle
with open("llm_responses.pkl", "rb") as f:
    responses = pickle.load(f)
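
The save and load cells above amount to a simple cache for a slow step. One way to express that pattern as a single function is sketched here. The helper `load_or_compute` and the stand-in `slow_step` are hypothetical, and a temporary directory is used so the example does not touch the real `llm_responses.pkl`.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at path if it exists; otherwise call compute() and cache the result."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_step():
    # Stand-in for the loop that queries the LLM for every benchmark item
    calls.append(1)
    return ["response-1", "response-2"]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "llm_responses.pkl")
    first = load_or_compute(cache_path, slow_step)   # computes and writes the cache
    second = load_or_compute(cache_path, slow_step)  # reads the cache instead

print(first == second, len(calls))
```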

In [5]:
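from tonic_validate import ValidateScorer
import os

# Point the OpenAI client used for scoring at our own server, so that
# Llama2 70b acts as the evaluator instead of an OpenAI-hosted model
os.environ["OPENAI_BASE_URL"] = "http://54.235.13.59:8080/v1"
scorer = ValidateScorer(model_evaluator="llama2:70b-chat")
response_scores = scorer.score_responses(responses)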

Scoring responses: 100%|██████████| 10/10 [12:05<00:00, 72.54s/it]


Let's view the results in a dataframe to see the scores.

In [None]:
response_scores.to_df()

In [10]:
import pandas as pd

def make_scores_df(response_scores):
    scores_df = {
        "question": [],
        "reference_answer": [],
        "llm_answer": [],
        "retrieved_context": []
    }
    for score_name in response_scores.overall_scores:
        scores_df[score_name] = []
    for data in response_scores.run_data:
        scores_df["question"].append(data.reference_question)
        scores_df["reference_answer"].append(data.reference_answer)
        scores_df["llm_answer"].append(data.llm_answer)
        scores_df["retrieved_context"].append(data.llm_context)
        for score_name, score in data.scores.items():
            scores_df[score_name].append(score)
    return pd.DataFrame(scores_df)
            

In [7]:
scores_df = make_scores_df(response_scores)

In [8]:
scores_df.head()

| | question | reference_answer | llm_answer | retrieved_context | answer_similarity | augmentation_precision | answer_consistency |
|---|---|---|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | He has a great force of will. | According to Paul Graham's essay, Sam Altman i... | [You can't plan when you start a startup how l... | 4.0 | 1.0 | 0.666667 |
| 1 | When was the essay "Five Founders" written? | April 2009 | The essay "Five Founders" was written in April... | [Written by Paul Graham\r\n\r\nFive Founders\r... | 5.0 | 1.0 | 1.0 |
| 2 | When does the most dramatic growth happen for ... | When the startup only has three or four people. | According to the provided text, the most drama... | [Written by Paul Graham\r\n\r\nStartup = Growt... | 2.0 | 1.0 | 0.5 |
| 3 | What is the problem with business culture vers... | In business culture, energy is expended on out... | According to the provided text, the issue with... | [Written by Paul Graham\r\n\r\nLearning from F... | 5.0 | 1.0 | 0.8 |
| 4 | What's the single biggest thing the government... | Establish a new class of visa for startup foun... | According to the provided text, the single big... | [Written by Paul Graham\r\n\r\nThe Founder Vis... | 5.0 | 1.0 | 1.0 |
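
To summarize the table, the per-question scores can be averaged for each metric. This sketch copies the three score columns from the five rows shown above and averages them with the standard library. The helper `average_scores` is hypothetical and is not part of Tonic Validate.

```python
from statistics import mean

# Scores copied from the five rows of the dataframe output
rows = [
    {"answer_similarity": 4.0, "augmentation_precision": 1.0, "answer_consistency": 0.666667},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 1.0},
    {"answer_similarity": 2.0, "augmentation_precision": 1.0, "answer_consistency": 0.5},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 0.8},
    {"answer_similarity": 5.0, "augmentation_precision": 1.0, "answer_consistency": 1.0},
]

def average_scores(rows):
    """Return the mean of each metric across all rows."""
    metrics = list(rows[0].keys())
    return {metric: mean(row[metric] for row in rows) for metric in metrics}

averages = average_scores(rows)
for metric, value in averages.items():
    print(f"{metric}: {value:.3f}")
```

In this sample, question 2 has the lowest `answer_similarity` and `answer_consistency`, which makes it a natural place to start inspecting the retrieved context and the generated answer.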


In [None]:
from tonic_validate import ValidateApi

# Upload the scored run to Tonic Validate; replace "project-id" with your own project's ID
validate_api = ValidateApi()
validate_api.upload_run("project-id", response_scores)