# Llama Index x Tonic Validate Webinar

## Setting Up Llama Index

### Setting up local embedding

`BAAI/bge-small-en-v1.5` is an embedding model that runs locally and replaces the default OpenAI embedding model. It is well suited to retrieval for RAG and performs well with LlamaIndex. Because the model runs locally, we avoid sending our private data to a remote server.

In [1]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Set the default embedding model to BAAI/bge-small-en-v1.5
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

### Setting up Ollama

Ollama is a tool for running local models easily on your computer. To use Ollama with LlamaIndex, we must set the default LLM to use Ollama. The model we run on Ollama is Llama2 70b, which we chose because of its ability to follow instructions better than smaller models. However, due to the model's size, we run it on a separate server with 4 A10G GPUs. We also raised the request timeout, because the 70b version of Llama2 takes a long time to respond.

In [2]:
import os
ollama_url = os.getenv("OLLAMA_URL")

In [3]:
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="llama2:70b-chat", base_url=ollama_url, request_timeout=180.0)
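Note that `os.getenv("OLLAMA_URL")` returns `None` when the variable is not set, which only surfaces later as a connection error. A minimal sketch of a more defensive lookup follows. The helper name `resolve_ollama_url` is hypothetical, and the fallback assumes Ollama is listening on its default local address.

```python
import os

# Assumption: Ollama's default local address. Change this for a remote server.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_ollama_url(env_var: str = "OLLAMA_URL") -> str:
    """Return the Ollama base URL from the environment without a trailing slash.

    Falls back to DEFAULT_OLLAMA_URL when the variable is unset or empty.
    """
    url = os.getenv(env_var) or DEFAULT_OLLAMA_URL
    return url.rstrip("/")


# Hypothetical usage: this value could be passed as base_url to the Ollama LLM.
ollama_url = resolve_ollama_url()
```

Stripping the trailing slash keeps URLs built from this value, such as `f"{ollama_url}/v1"`, free of double slashes.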

### Setting up Llama Index

First, we will load our data for RAG into Llama Index. Our data is a collection of Paul Graham's essays, and we will be asking questions about them.

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("../paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)

Now we can set up our query engine and write a simple function that returns the query engine's results.

In [5]:
from llama_index.core import Response
from tonic_validate import CallbackLLMResponse

query_engine = index.as_query_engine()

# Gets the response from llama index in a format Tonic Validate can understand
def get_llama_response(prompt) -> CallbackLLMResponse:
    response = query_engine.query(prompt)
    # Check response is of type Response
    if not isinstance(response, Response):
        raise ValueError(f"Expected Response, got {type(response)}")
    
    # Get the response and context from the Llama index
    context = [x.text for x in response.source_nodes]
    answer = response.response
    if answer is None:
        raise ValueError("No response from Llama")
    
    return {
        "llm_answer": answer,
        "llm_context_list": context
    }
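Before running a full benchmark, it can help to confirm that the callback returns the shape shown above: a string under `llm_answer` and a list of strings under `llm_context_list`. The following is a minimal sketch that uses only the standard library. The helper `check_callback_response` and the stub callback are hypothetical and are not part of Tonic Validate.

```python
from typing import Any, Dict


def check_callback_response(result: Dict[str, Any]) -> None:
    """Raise TypeError if the dict does not match the expected callback shape."""
    answer = result.get("llm_answer")
    context = result.get("llm_context_list")
    if not isinstance(answer, str):
        raise TypeError(f"llm_answer must be a str, got {type(answer).__name__}")
    if not isinstance(context, list) or not all(isinstance(c, str) for c in context):
        raise TypeError("llm_context_list must be a list of str")


# Stub standing in for get_llama_response so the check can run without an LLM.
def stub_response(prompt: str) -> Dict[str, Any]:
    return {"llm_answer": f"Echo: {prompt}", "llm_context_list": ["some retrieved text"]}


check_callback_response(stub_response("What makes Sam Altman a good founder?"))
```

The same check could be applied to a single real call to `get_llama_response` before starting the long scoring run.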

### Asking questions to Llama Index

Now that we have Llama Index set up, we can load the questions to ask it about the Paul Graham essays. The following code loads the first 10 questions from a JSON file. Each question also has a reference answer, which represents the ideal answer to that question. For instance, for the question "What is the capital of France?" the reference answer would be "Paris".

In [6]:
import json
qa_pairs = []
with open("../question_and_answer_list.json", "r") as qa_file:
    qa_pairs = json.load(qa_file)[:10]
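The rest of the notebook reads the `question` and `answer` keys from each entry, so a malformed file would only fail later. As a minimal sketch, the structure could be validated right after loading. The helper `validate_qa_pairs` is hypothetical, and the inline JSON stands in for the real file.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"question", "answer"}


def validate_qa_pairs(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the pairs unchanged, or raise ValueError naming the first bad entry."""
    for i, pair in enumerate(pairs):
        missing = REQUIRED_KEYS - set(pair)
        if missing:
            raise ValueError(f"Entry {i} is missing keys: {sorted(missing)}")
    return pairs


# Inline sample in the same shape as the question-and-answer file.
sample = json.loads(
    '[{"question": "What is the capital of France?", "answer": "Paris"}]'
)
validated = validate_qa_pairs(sample)
```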

Let's view the questions and answers in the JSON file we just loaded.

In [7]:
def print_qa_pair(qa_pair):
    print(f"Question: {qa_pair['question']}")
    print(f"Answer: {qa_pair['answer']}")
    print()

In [8]:
for qa_pair in qa_pairs:
    print_qa_pair(qa_pair)

Question: What makes Sam Altman a good founder?
Answer: He has a great force of will.

Question: When was the essay "Five Founders" written?
Answer: April 2009

Question: When does the most dramatic growth happen for a startup?
Answer: When the startup only has three or four people.

Question: What is the problem with business culture versus start up culture with respect to productivity?
Answer: In business culture, energy is expended on outward appearance to the detriment of productivity, while in startup culture there is no value of appearance it's all about productivity.

Question: What's the single biggest thing the government could do to increase the number of startups in this country?
Answer: Establish a new class of visa for startup founders.

Question: How could one create a rigorous government definition of what a startup is to categorize whether a business is a startup?
Answer: One could define a startup as a company that has received investment by recognized startup investor

Let's take one of the questions we loaded and ask it to Llama Index to see the response quality.

In [9]:
example_qa = qa_pairs[0]
print_qa_pair(example_qa)

Question: What makes Sam Altman a good founder?
Answer: He has a great force of will.



In [10]:
get_llama_response(example_qa["question"])

{'llm_answer': 'Sam Altman is considered a good founder because he possesses qualities such as toughness, adaptability, and determination. These traits enable him to take risks, overcome obstacles, and persevere in the face of challenges, increasing his chances of success in business.',
 'llm_context_list': ['You can\'t plan when you start a startup how long it will take to become profitable. But if you find yourself in a position where a little more effort expended on sales would carry you over the threshold of ramen profitable, do it. Investors like it when you\'re ramen profitable. It shows you\'ve thought about making money, instead of just working on amusing technical problems; it shows you have the discipline to keep your expenses low; but above all, it means you don\'t need them. There is nothing investors like more than a startup that seems like it\'s going to succeed even without them. Investors like it when they can help a startup, but they don\'t like startups that would die

## Using Tonic Validate

Now let's set up Tonic Validate to score the responses. First, we will set up a benchmark in Tonic Validate. A benchmark is just a list of questions and reference answers that we will use to score the response quality. We will use the QA pairs we loaded earlier for this.

In [11]:
from tonic_validate import Benchmark
question_list = [qa_pair['question'] for qa_pair in qa_pairs]
answer_list = [qa_pair['answer'] for qa_pair in qa_pairs]

benchmark = Benchmark(questions=question_list, answers=answer_list)

Now we can run through the questions and score the response quality with Tonic Validate.

In [12]:
from tonic_validate import ValidateScorer
import os

os.environ["OPENAI_BASE_URL"] = f"{ollama_url}/v1"
scorer = ValidateScorer(model_evaluator="llama2:70b-chat", max_parsing_retries=10)
response_scores = scorer.score(benchmark, get_llama_response)

Retrieving responses: 100%|██████████| 10/10 [04:37<00:00, 27.76s/it]
Scoring responses:  20%|██        | 2/10 [01:30<05:48, 43.55s/it]Error calculating answer_consistency: Could not determine true or false from response 
statement: "the distinctive feature of successful startups is that they're not."

can this statement be derived from the context?

true or false. Retrying...
Error calculating answer_consistency: Could not determine true or false from response 
statement: "the distinctive feature of successful startups is that they're not."

can this statement be derived from the context?

true or false. Retrying...
Error calculating answer_consistency: Could not determine true or false from response 
statement: "the distinctive feature of successful startups is that they're not."

can this statement be derived from the context?

true or false. Retrying...
Error calculating answer_consistency: Could not determine true or false from response 
statement: "the distinctive feature of succ
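The errors above appear when the evaluator model echoes the prompt back instead of answering, so the reply contains both "true" and "false" and no verdict can be extracted. The sketch below illustrates the general idea of parsing with retries. It is not Tonic Validate's implementation: the function names are hypothetical, and whether `max_parsing_retries` counts retries or total attempts is an assumption.

```python
from typing import Callable, Optional


def parse_bool(text: str) -> Optional[bool]:
    """Return True/False if exactly one of the words appears, else None."""
    lowered = text.lower()
    has_true = "true" in lowered
    has_false = "false" in lowered
    if has_true == has_false:  # neither word or both words: ambiguous
        return None
    return has_true


def ask_with_retries(ask: Callable[[], str], max_retries: int) -> Optional[bool]:
    """Call ask() up to max_retries + 1 times until the reply parses."""
    for _ in range(max_retries + 1):
        verdict = parse_bool(ask())
        if verdict is not None:
            return verdict
    return None  # every attempt was ambiguous, so no score is available


# Simulated evaluator: echoes the prompt once, then gives a usable answer.
replies = iter(["can this be derived? true or false.", "True"])
result = ask_with_retries(lambda: next(replies), max_retries=2)
```

When every attempt fails to parse, a scorer has no value to report for that metric, which is consistent with some rows lacking a score.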

Let's view the results in a dataframe to see the scores.

In [13]:
response_scores.to_df()

| | question | answer_similarity | augmentation_precision | answer_consistency |
|---|---|---|---|---|
| 0 | What makes Sam Altman a good founder? | 4.0 | 1.0 | 1.0 |
| 1 | When was the essay "Five Founders" written? | 5.0 | 1.0 | 1.0 |
| 2 | When does the most dramatic growth happen for ... | 2.0 | 1.0 | |
| 3 | What is the problem with business culture vers... | 4.0 | 1.0 | |
| 4 | What's the single biggest thing the government... | 5.0 | 1.0 | 1.0 |
| 5 | How could one create a rigorous government def... | 4.0 | 1.0 | 1.0 |
| 6 | Why is frienship a good quality of founders? | 4.0 | 1.0 | 1.0 |
| 7 | Why is determination the most important qualit... | 5.0 | 1.0 | 1.0 |
| 8 | For startups, what does board control mean in ... | 4.0 | 1.0 | 1.0 |
| 9 | What's in the way of founders keeping board co... | 4.0 | 1.0 | 1.0 |
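Rows 2 and 3 have no `answer_consistency` score, so any summary has to decide how to treat missing values. As a minimal sketch, the per-metric averages can be computed while skipping missing entries. The values below are copied by hand from the output above, and the helper `mean_ignoring_missing` is hypothetical.

```python
from statistics import mean
from typing import List, Optional

# Scores copied from the results above; None marks a missing score.
answer_similarity: List[Optional[float]] = [4.0, 5.0, 2.0, 4.0, 5.0, 4.0, 4.0, 5.0, 4.0, 4.0]
answer_consistency: List[Optional[float]] = [1.0, 1.0, None, None, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def mean_ignoring_missing(values: List[Optional[float]]) -> Optional[float]:
    """Average the values that are present; return None if all are missing."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


similarity_avg = mean_ignoring_missing(answer_similarity)
consistency_avg = mean_ignoring_missing(answer_consistency)
missing_consistency = sum(v is None for v in answer_consistency)
```

Reporting the number of missing scores next to the average avoids presenting a metric as perfect when some rows were never scored.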


Finally, we can upload the scored run to a Tonic Validate project.

In [None]:
from tonic_validate import ValidateApi

validate_api = ValidateApi()
validate_api.upload_run("project-id", response_scores)