# Evaluation Baseline LLM: LlamaIndex Intro Tutorial
Alejandro Ricciardi (Omegapy)  
created date: 12/26/2023 
GitHub: https://github.com/Omegapy

Project Description:

Using LlamaIndex to evaluate LLM query responses with the following two main evaluations:
- ResponseSourceEvaluator:
    - uses an LLM to decide whether the response matches the retrieved sources closely enough 
        - a good measure for hallucination detection! (the code in this notebook uses `FaithfulnessEvaluator` for this check)
- QueryResponseEvaluator:
    - uses an LLM to decide whether the response matches the original query closely enough 
        - a good measure for checking if the query was answered!

Project Map:

- API Key
- Loading Docs
    - load_markdown_docs() Function
    - Load documents from each folder.
- Create the indices
   - Create a vector store index for each folder 
- Create Query Engine Tools
    - Create a Unified Query Engine
- Test the Query Engine
    - Evaluate the Baseline
        - Generate the Dataset
        - Evaluate with the Dataset
        - Investigating Hallucinations
        - Generating a response for a known hallucinated question
    
Credit: LlamaIndex https://www.youtube.com/watch?v=2c64G-iDJKQ

### API KEY

In [92]:
# Load environment variables API Keys

from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv()) 

True
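
The `True` above only reports the return value of `load_dotenv`; it does not confirm that the specific key the notebook needs is present. A minimal sketch of an explicit check, assuming the OpenAI key is stored under the conventional `OPENAI_API_KEY` variable (the helper name `require_env` is hypothetical and not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Hypothetical usage: fail fast before any LLM call is made.
# api_key = require_env("OPENAI_API_KEY")
```

Failing early here gives a clearer message than an authentication error raised later from inside a query.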

### Loading Docs

In [93]:
import llama_index as li

In [94]:
from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader # See llama_docs_bot/markdown_docs_reader.py
from llama_index import SimpleDirectoryReader

#### load_markdown_docs() Function
Load markdown docs from a directory, excluding all other file types.

In [95]:
def load_markdown_docs(filepath):
    
    loader = SimpleDirectoryReader(
        input_dir=filepath, 
        required_exts=[".md"],
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )

    documents = loader.load_data()

    # exclude some metadata from the LLM
    for doc in documents:
        doc.excluded_llm_metadata_keys = ["file_name", "content_type", "header_path"]

    return documents
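
To preview which files a folder would contribute before loading it, the same selection (recursive, `.md` only) can be approximated with the standard library. This is a sketch for inspection only and says nothing about how `SimpleDirectoryReader` is implemented; `list_markdown_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def list_markdown_files(root: str) -> List[Path]:
    """Recursively collect files ending in .md under root, sorted for stable output."""
    return sorted(p for p in Path(root).rglob("*.md") if p.is_file())


# Hypothetical usage:
# for path in list_markdown_files("docs/getting_started"):
#     print(path)
```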

#### Load documents from each folder.
The docs are kept separate for now so that a separate index can be created for each folder later on.

In [96]:
getting_started_docs = load_markdown_docs("docs/getting_started")
community_docs = load_markdown_docs("docs/community")
data_docs = load_markdown_docs("docs/core_modules/data_modules")
agent_docs = load_markdown_docs("docs/core_modules/agent_modules")
model_docs = load_markdown_docs("docs/core_modules/model_modules")
query_docs = load_markdown_docs("docs/core_modules/query_modules")
supporting_docs = load_markdown_docs("docs/core_modules/supporting_modules")
tutorials_docs = load_markdown_docs("docs/end_to_end_tutorials")
contributing_docs = load_markdown_docs("docs/development")

### Create the indices
The ServiceContext is a bundle of commonly used resources used during the indexing and querying stages of a LlamaIndex pipeline/application. You can use it to set the global configuration, as well as local configurations for specific parts of the pipeline.

https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context.html#setting-local-configuration

In [97]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

In [98]:
# create a global service context
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))
set_global_service_context(service_context)

#### Create a vector store index for each folder

In [99]:
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage

In [100]:
try:
    getting_started_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./getting_started_index"))
    community_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./community_index"))
    data_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./data_index"))
    agent_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./agent_index"))
    model_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./model_index"))
    query_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./query_index"))
    supporting_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./supporting_index"))
    tutorials_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./tutorials_index"))
    contributing_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./contributing_index"))
except Exception:  # no usable persisted index found: build each index and persist it
    getting_started_index = VectorStoreIndex.from_documents(getting_started_docs)
    getting_started_index.storage_context.persist(persist_dir="./getting_started_index")

    community_index = VectorStoreIndex.from_documents(community_docs)
    community_index.storage_context.persist(persist_dir="./community_index")

    data_index = VectorStoreIndex.from_documents(data_docs)
    data_index.storage_context.persist(persist_dir="./data_index")

    agent_index = VectorStoreIndex.from_documents(agent_docs)
    agent_index.storage_context.persist(persist_dir="./agent_index")

    model_index = VectorStoreIndex.from_documents(model_docs)
    model_index.storage_context.persist(persist_dir="./model_index")

    query_index = VectorStoreIndex.from_documents(query_docs)
    query_index.storage_context.persist(persist_dir="./query_index")    

    supporting_index = VectorStoreIndex.from_documents(supporting_docs)
    supporting_index.storage_context.persist(persist_dir="./supporting_index")

    tutorials_index = VectorStoreIndex.from_documents(tutorials_docs)
    tutorials_index.storage_context.persist(persist_dir="./tutorials_index")

    contributing_index = VectorStoreIndex.from_documents(contributing_docs)
    contributing_index.storage_context.persist(persist_dir="./contributing_index")
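
The cell above repeats one pattern nine times: load a persisted index if it exists, otherwise build it and persist it. A minimal sketch of that pattern as a generic helper, assuming the caller passes the loading and building steps as callables (`load_or_build` and its parameters are hypothetical names, not LlamaIndex API). It checks for the directory explicitly instead of relying on an exception:

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    load: Callable[[str], T],
    build: Callable[[str], T],
) -> T:
    """Load a cached artifact if its directory exists, otherwise build (and persist) it."""
    if os.path.isdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


# Hypothetical usage with the names from the cell above:
# def build_data_index(persist_dir):
#     index = VectorStoreIndex.from_documents(data_docs)
#     index.storage_context.persist(persist_dir=persist_dir)
#     return index
#
# data_index = load_or_build(
#     "./data_index",
#     load=lambda d: load_index_from_storage(StorageContext.from_defaults(persist_dir=d)),
#     build=build_data_index,
# )
```

One consequence of handling each folder separately is that a single missing index no longer forces all nine to be rebuilt.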

## Create Query Engine Tools
Since we have so many indices, we can create a query engine tool for each and then use them in a single query engine!

In [101]:
from llama_index.tools import QueryEngineTool

In [102]:
# create a query engine tool for each folder
getting_started_tool = QueryEngineTool.from_defaults(
    query_engine=getting_started_index.as_query_engine(), 
    name="Getting Started", 
    description="Useful for answering questions about installing and running llama index, as well as basic explanations of how llama index works."
)

community_tool = QueryEngineTool.from_defaults(
    query_engine=community_index.as_query_engine(),
    name="Community",
    description="Useful for answering questions about integrations and other apps built by the community."
)

data_tool = QueryEngineTool.from_defaults(
    query_engine=data_index.as_query_engine(),
    name="Data Modules",
    description="Useful for answering questions about data loaders, documents, nodes, and index structures."
)

agent_tool = QueryEngineTool.from_defaults(
    query_engine=agent_index.as_query_engine(),
    name="Agent Modules",
    description="Useful for answering questions about data agents, agent configurations, and tools."
)

model_tool = QueryEngineTool.from_defaults(
    query_engine=model_index.as_query_engine(),
    name="Model Modules",
    description="Useful for answering questions about using and configuring LLMs, embedding models, and prompts."
)

query_tool = QueryEngineTool.from_defaults(
    query_engine=query_index.as_query_engine(),
    name="Query Modules",
    description="Useful for answering questions about query engines, query configurations, and using various parts of the query engine pipeline."
)

supporting_tool = QueryEngineTool.from_defaults(
    query_engine=supporting_index.as_query_engine(),
    name="Supporting Modules",
    description="Useful for answering questions about supporting modules, such as callbacks, service context, and evaluation."
)

tutorials_tool = QueryEngineTool.from_defaults(
    query_engine=tutorials_index.as_query_engine(),
    name="Tutorials",
    description="Useful for answering questions about end-to-end tutorials and giving examples of specific use-cases."
)

contributing_tool = QueryEngineTool.from_defaults(
    query_engine=contributing_index.as_query_engine(),
    name="Contributing",
    description="Useful for answering questions about contributing to llama index, including how to contribute to the codebase and how to build documentation."
)

### Create Unified Query Engine

In [103]:
# needed for notebooks
import nest_asyncio
nest_asyncio.apply()

from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        getting_started_tool,
        community_tool,
        data_tool,
        agent_tool,
        model_tool,
        query_tool,
        supporting_tool,
        tutorials_tool,
        contributing_tool
    ],
    # enable this for streaming
    #response_synthesizer=get_response_synthesizer(streaming=True),
    verbose=False
)

## Test the Query Engine!

In [104]:
response = query_engine.query("How do I install llama index?")
print(str(response))

To install Llama Index, you can use either of the following methods:

1. Installation from Pip:
   Run the following command in your terminal or command prompt:
   ```
   pip install llama-index
   ```

2. Installation from Source:
   Follow these steps:
   - Clone the repository by running the command:
     ```
     git clone https://github.com/jerryjliu/llama_index.git
     ```
   - After cloning, navigate to the cloned directory.
   - If you want to do an editable install (where you can modify source files), run:
     ```
     pip install -e .
     ```
   - If you want to install optional dependencies and dependencies used for development (e.g., unit testing), run:
     ```
     pip install -r requirements.txt
     ```

Choose the method that suits your needs and follow the corresponding steps to install Llama Index.


### Evaluate the Baseline!

Now that we have our baseline query engine created, we can create a basic evaluation pipeline!

Our pipeline will:

- Generate a small dataset of questions
- Save/cache these questions (so we can properly compare performance later!)
- Evaluate both response quality and hallucination

To do this reliably, we need an LLM more capable than `gpt-3.5-turbo`, so we will set up `gpt-4` for the evaluation process!

### Generate the Dataset

To make the question generation more efficient, we combine all documents into a single giant document.

I also modify the question generation prompt to generate a single question for each chunk, along with extra context about what the model is reading.

In [105]:
from llama_index import Document

In [106]:
documents = SimpleDirectoryReader("docs", recursive=True, required_exts=[".md"]).load_data()

all_text = ""

for doc in documents:
    all_text += doc.text

giant_document = Document(text=all_text)

In [107]:
import os
import random
random.seed(42)

from llama_index import ServiceContext
from llama_index.prompts import Prompt
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

In [108]:
# the model name is passed with the `model` keyword (as in the global service context above)
gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))

In [109]:
question_dataset = []
if os.path.exists("./data/question_dataset.txt"):
    with open("data/question_dataset.txt", "r") as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(
        [giant_document],
        text_question_template=Prompt(
            "A sample from the LlamaIndex documentation is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Using the documentation sample, carefully follow the instructions below:\n"
            "{query_str}"
        ),
        question_gen_query=(
            "You are an evaluator for a search pipeline. Your task is to write a single question "
            "using the provided documentation sample above to test the search pipeline. The question should "
            "reference specific names, functions, and terms. Restrict the question to the "
            "context information provided.\n"
            "Question: "
        ),
        # use the GPT-4 service context to generate the questions
        service_context=gpt4_service_context
    )
    generated_questions = data_generator.generate_questions_from_nodes()

    # randomly pick 40 of the generated questions
    generated_questions = random.sample(generated_questions, 40)
    question_dataset.extend(generated_questions)

    print(f"Generated {len(question_dataset)} questions.")

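    # save the questions to the same path that is checked above, so the cache is found next time
    os.makedirs("data", exist_ok=True)
    with open("data/question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")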
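
The caching logic in the cell above can be isolated into one function so that the read path and the write path cannot drift apart. A minimal sketch, assuming the question generator is passed in as a callable that returns a list of strings (`load_or_create_questions` and its parameters are hypothetical names). It also guards against `random.sample` raising `ValueError` when fewer than `k` questions were generated:

```python
import os
import random
from typing import Callable, List


def load_or_create_questions(
    path: str,
    generate: Callable[[], List[str]],
    k: int = 40,
    seed: int = 42,
) -> List[str]:
    """Read one question per line from path, or generate, sample, and save them there."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    candidates = generate()
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    questions = rng.sample(candidates, min(k, len(candidates)))

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for q in questions:
            f.write(q.strip() + "\n")
    return questions


# Hypothetical usage with the generator from the cell above:
# question_dataset = load_or_create_questions(
#     "data/question_dataset.txt",
#     generate=data_generator.generate_questions_from_nodes,
# )
```

The one-question-per-line format assumes that no question contains an embedded newline.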

In [110]:
print(random.sample(question_dataset, 5))
print(f"Generated {len(question_dataset)} questions.")

['What is the function used to specify the metadata visible to the embedding model and how can it be customized?', 'How can I convert tools to LangChain tools using the provided documentation sample?', 'What is the purpose of the "router query engine" in the LlamaIndex framework?', 'What are the different vector stores supported by LlamaIndex for use as the storage backend for `VectorStoreIndex`?', 'What is the default number of LLM calls required for the ListIndex?']
Generated 40 questions.


### Evaluate with the Dataset

Now that we have our dataset, let's measure performance!

#### Evaluating Response for Hallucination

In [111]:
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()

from llama_index import Response

def evaluate_query_engine(evaluator, query_engine, questions):
    async def run_query(query_engine, q):
        try:
            return await query_engine.aquery(q)
        except Exception:  # a failed query is recorded as an error response instead of stopping the run
            return Response(response="Error, query failed.")

    total_correct = 0
    all_results = []
    num_batches = (len(questions) + 4) // 5  # round up so a partial last batch is counted
    for start in range(0, len(questions), 5):
        batch_qs = questions[start:start+5]
        
        tasks = [run_query(query_engine, q) for q in batch_qs]
        responses = asyncio.run(asyncio.gather(*tasks))
        print(f"finished batch {(start // 5) + 1} out of {num_batches}")
        
        for response in responses:
            if evaluator.evaluate_response(response=response).passing: 
                eval_result = 1
            else:
                eval_result = 0
            total_correct += eval_result
            all_results.append(eval_result)
        
        # helps avoid rate limits
        time.sleep(1)

    return total_correct, all_results
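
The batching idea in `evaluate_query_engine` (run the queries of one batch concurrently, then move on to the next batch) can be tried in isolation without any API calls. A minimal sketch, assuming a stand-in coroutine `fake_query` in place of `query_engine.aquery`; both `run_in_batches` and `fake_query` are hypothetical names:

```python
import asyncio
import math
from typing import Awaitable, Callable, List


async def run_in_batches(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    batch_size: int = 5,
) -> List[str]:
    """Run worker over items: concurrently within a batch, batches one after another."""
    results: List[str] = []
    n_batches = math.ceil(len(items) / batch_size)
    for i, start in enumerate(range(0, len(items), batch_size), 1):
        batch = items[start:start + batch_size]
        # gather returns results in the same order as the awaitables it was given
        results.extend(await asyncio.gather(*(worker(q) for q in batch)))
        print(f"finished batch {i} out of {n_batches}")
    return results


async def fake_query(q: str) -> str:
    """Stand-in for an async query call; returns a canned answer."""
    await asyncio.sleep(0)
    return f"answer to {q}"
```

In a plain script this can be driven with `asyncio.run(run_in_batches(questions, fake_query))`. Because `gather` preserves order, the results line up with the input questions, which is what the later filtering by `all_results` relies on.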

In [112]:
from llama_index.evaluation import FaithfulnessEvaluator

# gpt-4 evaluator!
evaluator = FaithfulnessEvaluator(service_context=gpt4_service_context)

total_correct, all_results = evaluate_query_engine(evaluator, query_engine, question_dataset)

print(f"Hallucination? Scored {total_correct} out of {len(question_dataset)} questions correctly.")

finished batch 1 out of 8
finished batch 2 out of 8
finished batch 3 out of 8
finished batch 4 out of 8
finished batch 5 out of 8
finished batch 6 out of 8
finished batch 7 out of 8
finished batch 8 out of 8
Hallucination? Scored 33 out of 40 questions correctly.
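
As a quick check of the numbers reported above, the score can be converted into a pass rate and a count of failed questions:

```python
total_correct = 33    # taken from the output above
total_questions = 40  # size of the question dataset

pass_rate = total_correct / total_questions
failed = total_questions - total_correct
print(f"pass rate: {pass_rate:.1%}, failed: {failed}")
# prints "pass rate: 82.5%, failed: 7"
```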


#### Investigating Hallucinations

In [113]:
import numpy as np

# Questions whose response failed the evaluation (result == 0);
# their count equals len(question_dataset) - total_correct
hallucinated_questions = np.array(question_dataset)[np.array(all_results) == 0]
print(hallucinated_questions)

['How can I convert tools to LangChain tools using the provided documentation sample?'
 'What is the purpose of the `GuidancePydanticProgram` class in the LlamaIndex documentation?'
 'What are the available options for the storage backend of the index store in LlamaIndex?'
 "What is the purpose of the `CollectionQueryConsumer` class in the Delphic application's WebSocket handling?"
 "What is the function used to retrieve the collections for the logged-in user in the Delphic project's frontend?"
 'What is the purpose of the Algovera tool built on top of LlamaIndex?'
 'What are the three primary sections within the layout of the ChatView component?']
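
The NumPy boolean mask above works because `all_results` is in the same order as `question_dataset`. The same selection can be sketched in plain Python, with an explicit length check; `failed_items` is a hypothetical helper name:

```python
from typing import List, Sequence


def failed_items(questions: Sequence[str], results: Sequence[int]) -> List[str]:
    """Return the questions whose evaluation result is 0 (did not pass)."""
    if len(questions) != len(results):
        raise ValueError("questions and results must have the same length")
    return [q for q, r in zip(questions, results) if r == 0]


# Hypothetical usage with the names from the cells above:
# hallucinated_questions = failed_items(question_dataset, all_results)
```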


#### Generating a response for a known hallucinated question

In [114]:
inv_hallucinated_question = query_engine.query(hallucinated_questions[0]) # "How can I convert tools to LangChain tools using the provided documentation sample?"

In [115]:
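print(str(inv_hallucinated_question))
print("-----------------")
# Print the sources of the response under investigation. The output recorded below was
# produced with `response` (the earlier installation query), which is why its sources
# are about installing LlamaIndex.
print(inv_hallucinated_question.get_formatted_sources(length=256))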

The context information does not provide any information about converting tools to LangChain tools using the provided documentation sample.
-----------------
> Source (Doc id: e0069bdd-195d-40ed-a0ab-c92d6d1e60b3): Sub question: What are the installation requirements for llama index?
Response: The installation requirements for llama index include the following:
- Git clone the repository: `git clone https://github.com/jerryjliu/llama_index.git`
- Install the packa...

> Source (Doc id: b4c280af-af4f-4b5e-9281-cba8ecc139e7): Sub question: How do I download and install llama index?
Response: To download and install Llama Index, you can use either of the following methods:

1. Installation from Pip:
   Run the following command in your terminal or command prompt:
   ```
   pi...

> Source (Doc id: b9cd686a-5e90-4ffc-ab17-d58d9f5aefac): Sub question: What are the steps to set up llama index?
Response: To set up LlamaIndex, you need to follow these steps:
1. Install LlamaIndex by running th

# Conclusion

In this notebook, we covered several key topics!

- setting up a sub-question query engine
- generating a dataset of evaluation questions
- evaluating responses for hallucination
- evaluating responses for answer quality