
In [None]:
!pip -q install python-dotenv pinecone-client llama-index pymupdf llmsherpa openai llama_hub


#SET ENVIRONMENT

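We set the API keys for Pinecone and OpenAI. The values are left blank here; fill in your own, or keep them in a .env file and load them with python-dotenv (installed above) instead of hard-coding them.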

In [None]:
import openai
import pinecone
pinecone.init(api_key="", environment="")
openai.api_key = ""
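Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch, the keys could instead be read from environment variables. The helper `require_env` and the variable names used in the commented-out wiring are assumptions for illustration, not names this notebook defines.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if it is unset.

    Hypothetical helper for this notebook; not part of any library.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Example wiring (commented out so the sketch runs without real credentials;
# the variable names are assumptions):
# pinecone.init(
#     api_key=require_env("PINECONE_API_KEY"),
#     environment=require_env("PINECONE_ENVIRONMENT"),
# )
# openai.api_key = require_env("OPENAI_API_KEY")
```

Failing at startup with a clear message tends to be easier to debug than an authentication error raised deep inside a later cell.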


#SETUP

We build an empty Pinecone Index, and define the necessary LlamaIndex wrappers/abstractions so that we can start loading data into Pinecone.

In [None]:
api_key = "PINECONE_API_KEY"
environment = "PINECONE_ENVIRONMENT"
pinecone.init(api_key=api_key, environment=environment)

#Build an Ingestion Pipeline from Scratch

We show how to build an ingestion pipeline as mentioned in the introduction.

Note that steps (2) and (3) can be handled via our NodeParser abstractions, which handle splitting and node creation.

For the purposes of this tutorial, we show you how to create these objects manually.

#1. Load Data

In [None]:
import fitz

file_path = ""
doc = fitz.open(file_path)

#2. Use a Text Splitter to Split Documents

Here we import our SentenceSplitter to split document texts into smaller chunks, while preserving paragraphs/sentences as much as possible.

In [None]:
from llama_index.text_splitter import SentenceSplitter


In [None]:
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

In [None]:
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_splitter.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
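To make the idea of sentence-preserving splitting concrete, here is a minimal sketch that does not depend on LlamaIndex. The helper `split_into_chunks` is hypothetical, and it measures chunk length in characters, whereas the library's chunk_size is, to our understanding, counted in tokens. It is an illustration of the principle, not the SentenceSplitter algorithm.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: a single sentence longer than max_chars is kept
    whole as its own (oversized) chunk rather than being cut mid-sentence.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First sentence here. Second one follows! Is this the third? Yes."
for chunk in split_into_chunks(sample, max_chars=45):
    print(repr(chunk))
```

With a 45-character budget the four sentences end up in two chunks, and no sentence is split across a chunk boundary.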

#3. Manually Construct Nodes from Text Chunks

We convert each chunk into a TextNode object, a low-level data abstraction in LlamaIndex that stores content but also allows defining metadata + relationships with other Nodes.

We inject metadata from the document into each node.

This essentially replicates logic in our SimpleNodeParser.

In [None]:
from llama_index.schema import TextNode

In [None]:
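nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    # inject metadata from the source page into the node (1-based page number)
    node.metadata = {"page_number": src_page.number + 1}
    nodes.append(node)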

In [None]:
print(nodes[0].metadata)

In [None]:
print(nodes[2].get_content(metadata_mode="all"))

#4. Extract Metadata from each Node

We extract metadata from each Node using our Metadata extractors.

This will add more metadata to each Node.

In [None]:
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

In [None]:
nodes = metadata_extractor.process_nodes(nodes)

In [None]:
print(nodes[12].metadata)

#5. Generate Embeddings for each Node

Generate document embeddings for each Node using our OpenAI embedding model (text-embedding-ada-002).

Store these on the embedding property on each Node.

In [None]:
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

In [None]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

In [None]:
import pinecone

pinecone.init(api_key="", environment="us-west1-gcp-free")
pinecone.list_indexes()

In [None]:
from llama_index.vector_stores import PineconeVectorStore

index_name = "..."

pinecone_index = pinecone.Index(index_name)

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

In [None]:
vector_store.add(nodes)

#Retrieve and Query from the Vector Store

Now that our ingestion is complete, we can retrieve/query this vector store.

NOTE: We can use our high-level VectorStoreIndex abstraction here. See the next section for how to define retrieval at a lower level!

In [None]:
from llama_index import VectorStoreIndex
from llama_index.storage import StorageContext

In [None]:
index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
query_engine = index.as_query_engine()

In [None]:
# English: "Write about the plaintiff's requests regarding moral damages"
query_str = "escreva sobre os requerimentos do autor sobre os danos morais"


In [None]:
response = query_engine.query(query_str)


In [None]:
print(str(response))

#Building Retrieval from Scratch

We use Pinecone as the vector database. We load in nodes using our high-level ingestion abstractions (to see how to build this from scratch, see our previous tutorial!).

We will show how to do the following:

How to generate a query embedding

How to query the vector database using different search modes (dense, sparse, hybrid)

How to parse results into a set of Nodes

How to put this in a custom retriever

In [None]:
!pip install llama_hub

##Setup Pinecone

In [None]:
import pinecone

pinecone.init(api_key="", environment="us-west1-gcp-free")

In [None]:
index_name = ""

pinecone_index = pinecone.Index(index_name)

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

##Load Document


In [None]:
file_path = "..."

In [None]:
from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

In [None]:
loader = PyMuPDFReader()
documents = loader.load(file_path=file_path)

##Load into Vector Store

Load in documents into the PineconeVectorStore.

NOTE: We use high-level ingestion abstractions here, with VectorStoreIndex.from_documents. We’ll refrain from using VectorStoreIndex for the rest of this tutorial.

In [None]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.storage import StorageContext

In [None]:
service_context = ServiceContext.from_defaults(chunk_size=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, storage_context=storage_context
)

#Define Vector Retriever

Now we’re ready to define our retriever against this vector store to retrieve a set of nodes.

We’ll show the processes step by step and then wrap it into a function.

In [None]:
# English: "List the arguments the plaintiff uses regarding moral damages"
query_str = "Liste os argumentos utilizados pelo autor sobre o dano moral"


##1. Generate a Query Embedding

In [None]:
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

In [None]:
query_embedding = embed_model.get_query_embedding(query_str)


##2. Query the Vector Database

We show how to query the vector database with different modes: default, sparse, and hybrid.

We first construct a VectorStoreQuery and then query the vector db.

In [None]:
# construct vector store query
from llama_index.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)

In [None]:
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
query_result
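Conceptually, the "default" (dense) mode ranks stored embeddings by their similarity to the query embedding and returns the top k. The sketch below shows that idea with exact cosine similarity over a toy in-memory store. The names `cosine_similarity`, `top_k_dense`, and `toy_store` are hypothetical, and a real vector database uses the metric configured on the index and approximate search rather than this brute-force scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_dense(
    query_embedding: List[float],
    stored: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query, best first."""
    scored = [
        (node_id, cosine_similarity(query_embedding, emb))
        for node_id, emb in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (made up for illustration).
toy_store = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.7, 0.7, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
print(top_k_dense([1.0, 0.1, 0.0], toy_store, k=2))
```

The returned ids and scores play the same role as the nodes and similarities carried by the VectorStoreQueryResult.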

##3. Parse Result into a set of Nodes

The VectorStoreQueryResult returns the set of nodes and similarities. We construct a NodeWithScore object with this.

In [None]:
from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
# use `idx` as the loop variable so we don't shadow the `index` object defined earlier
for idx, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[idx]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

In [None]:
from llama_index.response.notebook_utils import display_source_node

for node in nodes_with_scores:
    display_source_node(node, source_length=1000)

##4. Put this into a Retriever

Let’s put this into a Retriever subclass that can plug into the rest of LlamaIndex workflows!

In [None]:
from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List


class PineconeRetriever(BaseRetriever):
    """Retriever over a pinecone vector store."""

    def __init__(
        self,
        vector_store: PineconeVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        # use the instance's own embed model / vector store and the incoming
        # query, rather than notebook-level globals
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for idx, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[idx]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

In [None]:
retriever = PineconeRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

In [None]:
retrieved_nodes = retriever.retrieve(query_str)
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

#Plug this into our RetrieverQueryEngine to synthesize a response

NOTE: We’ll cover more on how to build response synthesis from scratch in future tutorials!

In [None]:
from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever)

In [None]:
response = query_engine.query(query_str)

In [None]:
print(str(response))

#Building Response Synthesis from Scratch

We’ll walk through some synthesis strategies:

Create and Refine

Tree Summarization

We’re essentially unpacking our “Response Synthesis” module and exposing that for the user.

We use OpenAI as a default LLM but you’re free to plug in any LLM you wish.

##Setup Pinecone and Load data

We build an empty Pinecone Index, and define the necessary LlamaIndex wrappers/abstractions so that we can load/index data and get back a vector retriever.



In [None]:
import pinecone

pinecone.init(api_key="", environment="us-west1-gcp-free")

index_name = ""

pinecone_index = pinecone.Index(index_name)

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

file_path = "..."
loader = PyMuPDFReader()
documents = loader.load(file_path=file_path)

##Get vector retriever

In [None]:
from llama_index.vector_stores import PineconeVectorStore
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.storage import StorageContext

In [None]:
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
# NOTE: set chunk size of 1024
service_context = ServiceContext.from_defaults(chunk_size=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, storage_context=storage_context
)

In [None]:
retriever = index.as_retriever()

##Given an example question, get a retrieved set of nodes.

We use the retriever to get a set of relevant nodes given a user query. These nodes will then be passed to the response synthesis modules below.

In [None]:
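# English: "Could you tell me which arguments in the plaintiff's filing are the
# most relevant to rebut?"
query_str = (
    "Poderia me falar quais argumentos são os mais relevantes a serem rebatidos"
    " na peça do autor?"
)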

In [None]:
retrieved_nodes = retriever.retrieve(query_str)

#Building Response Synthesis with LLMs

In this section we’ll show how to use LLMs + Prompts to build a response synthesis module.

We’ll start from simple strategies (simply stuffing context into a prompt), to more advanced strategies that can handle context overflows.

##1. Try a Simple Prompt

We first try to synthesize the response using a single input prompt + LLM call.



In [None]:
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate

llm = OpenAI(model="text-davinci-003")

In [None]:
qa_prompt = PromptTemplate(
    """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: \
"""
)

Given an example question, retrieve the set of relevant nodes and try to put it all in the prompt, separated by newlines.



In [None]:
query_str = (
    "Poderia me falar quais argumentos são os mais relevantes a serem rebatidos"
    " na peça do autor?"
)

In [None]:
retrieved_nodes = retriever.retrieve(query_str)

In [None]:
def generate_response(retrieved_nodes, query_str, qa_prompt, llm):
    context_str = "\n\n".join([r.get_content() for r in retrieved_nodes])
    fmt_qa_prompt = qa_prompt.format(
        context_str=context_str, query_str=query_str
    )
    response = llm.complete(fmt_qa_prompt)
    return str(response), fmt_qa_prompt

In [None]:
response, fmt_qa_prompt = generate_response(
    retrieved_nodes, query_str, qa_prompt, llm
)

In [None]:
print(f"*****Response******:\n{response}\n\n")

In [None]:
print(f"*****Formatted Prompt*****:\n{fmt_qa_prompt}\n\n")


Problem: What if we set the top-k retriever to a higher value? The context would overflow!

In [None]:
retriever = index.as_retriever(similarity_top_k=6)
retrieved_nodes = retriever.retrieve(query_str)

In [None]:
response, fmt_qa_prompt = generate_response(
    retrieved_nodes, query_str, qa_prompt, llm
)
print(f"Response (k=6): {response}")
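Before stuffing every retrieved node into one prompt, it can help to estimate whether the result will fit. The sketch below is a rough budget check under stated assumptions: the helpers `estimate_tokens` and `fits_in_context` are hypothetical, the four-characters-per-token ratio is only a rule of thumb, and the window size of 4097 is illustrative. A real tokenizer would be needed for accurate counts.

```python
from typing import List


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(
    chunks: List[str],
    query: str,
    context_window: int,
    reserved_for_answer: int = 256,
) -> bool:
    """Check whether all chunks plus the query leave room for the answer."""
    prompt_tokens = estimate_tokens("\n\n".join(chunks)) + estimate_tokens(query)
    return prompt_tokens + reserved_for_answer <= context_window


# Six chunks of roughly 1024 tokens each (about 4096 characters apiece).
fake_chunks = ["x" * 4096 for _ in range(6)]
print(fits_in_context(fake_chunks[:2], "query", context_window=4097))
print(fits_in_context(fake_chunks, "query", context_window=4097))
```

Under these assumptions two chunks fit and six do not, which is the overflow problem the strategies in the following sections address.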

##2. Try a “Create and Refine” strategy

To deal with context overflows, we can try a strategy where we synthesize a response sequentially through all nodes. Start with the first node and generate an initial response. Then for subsequent nodes, refine the answer using additional context.

This requires us to define a “refine” prompt as well.

In [None]:
refine_prompt = PromptTemplate(
    """\
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer \
(only if needed) with some more context below.
------------
{context_str}
------------
Given the new context, refine the original answer to better answer the query. \
If the context isn't useful, return the original answer.
Refined Answer: \
"""
)

In [None]:
from llama_index.response.notebook_utils import display_source_node


def generate_response_cr(
    retrieved_nodes, query_str, qa_prompt, refine_prompt, llm
):
    """Generate a response using create and refine strategy.

    The first node uses the 'QA' prompt.
    All subsequent nodes use the 'refine' prompt.

    """
    cur_response = None
    fmt_prompts = []
    for idx, node in enumerate(retrieved_nodes):
        print(f"[Node {idx}]")
        display_source_node(node, source_length=2000)
        context_str = node.get_content()
        if idx == 0:
            fmt_prompt = qa_prompt.format(
                context_str=context_str, query_str=query_str
            )
        else:
            fmt_prompt = refine_prompt.format(
                context_str=context_str,
                query_str=query_str,
                existing_answer=str(cur_response),
            )

        cur_response = llm.complete(fmt_prompt)
        fmt_prompts.append(fmt_prompt)

    return str(cur_response), fmt_prompts

In [None]:
response, fmt_prompts = generate_response_cr(
    retrieved_nodes, query_str, qa_prompt, refine_prompt, llm
)

In [None]:
print(str(response))


In [None]:
# view a sample qa prompt
print(fmt_prompts[0])

In [None]:
# view a sample refine prompt
print(fmt_prompts[1])

Observation: This is an initial step, but there are clear inefficiencies. First, it is quite slow because the LLM calls are made sequentially. Second, each call is wasteful: we insert only a single node instead of “stuffing” the prompt with as much context as fits.
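One way to address the second inefficiency is to pack several node texts into each prompt before refining, so fewer LLM calls are needed. The sketch below greedily merges consecutive texts up to a size limit. The helper `pack_texts` is hypothetical and counts characters rather than tokens; it is similar in spirit to a "compact" style of synthesis, not a copy of any library's implementation.

```python
from typing import List


def pack_texts(texts: List[str], max_chars: int, sep: str = "\n\n") -> List[str]:
    """Greedily merge consecutive texts so each packed block stays within max_chars.

    Hypothetical helper; a text longer than max_chars is kept as its own block.
    """
    packed: List[str] = []
    current = ""
    for text in texts:
        candidate = text if not current else current + sep + text
        if current and len(candidate) > max_chars:
            # The next text would overflow the block: start a new one.
            packed.append(current)
            current = text
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


node_texts = ["a" * 40, "b" * 40, "c" * 40, "d" * 40, "e" * 40]
blocks = pack_texts(node_texts, max_chars=100)
print(len(node_texts), "nodes ->", len(blocks), "LLM calls")
```

Each packed block would take the place of a single node's text in the refine loop, so five nodes here need three sequential calls instead of five.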

##3. Try a Hierarchical Summarization Strategy

Another approach is to try a hierarchical summarization strategy. We generate an answer for each node independently, and then hierarchically combine the answers. This “combine” step could happen once, or for maximum generality can happen recursively until there is one “root” node. That “root” node is then returned as the answer.

We implement this approach below. The number of children combined per step is set by num_children (10 by default), so we hierarchically combine up to 10 answers at a time.

NOTE: In LlamaIndex this is referred to as “tree_summarize”; in LangChain it is referred to as map-reduce.

In [None]:
def combine_results(
    texts,
    query_str,
    qa_prompt,
    llm,
    cur_prompt_list,
    num_children=10,
):
    new_texts = []
    for idx in range(0, len(texts), num_children):
        text_batch = texts[idx : idx + num_children]
        context_str = "\n\n".join([t for t in text_batch])
        fmt_qa_prompt = qa_prompt.format(
            context_str=context_str, query_str=query_str
        )
        combined_response = llm.complete(fmt_qa_prompt)
        new_texts.append(str(combined_response))
        cur_prompt_list.append(fmt_qa_prompt)

    if len(new_texts) == 1:
        return new_texts[0]
    else:
        # recurse on the combined answers, passing the prompt list along
        return combine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )


def generate_response_hs(
    retrieved_nodes, query_str, qa_prompt, llm, num_children=10
):
    """Generate a response using hierarchical summarization strategy.

    Combine num_children nodes hierarchically until we get one root node.

    """
    fmt_prompts = []
    node_responses = []
    for node in retrieved_nodes:
        context_str = node.get_content()
        fmt_qa_prompt = qa_prompt.format(
            context_str=context_str, query_str=query_str
        )
        node_response = llm.complete(fmt_qa_prompt)
        node_responses.append(node_response)
        fmt_prompts.append(fmt_qa_prompt)

    response_txt = combine_results(
        [str(r) for r in node_responses],
        query_str,
        qa_prompt,
        llm,
        fmt_prompts,
        num_children=num_children,
    )

    return response_txt, fmt_prompts
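To get a feel for the cost of this strategy, the sketch below counts how many LLM calls the two functions above make for a given number of retrieved nodes. The helper `count_llm_calls` is hypothetical and only mirrors the control flow: one call per node, then one call per batch at every combine level until a single answer remains.

```python
import math


def count_llm_calls(num_nodes: int, num_children: int = 10) -> int:
    """Count LLM calls made by per-node answering plus recursive combining.

    Requires num_children >= 2, otherwise the combine step never shrinks
    the list of answers.
    """
    if num_nodes < 1 or num_children < 2:
        raise ValueError("need num_nodes >= 1 and num_children >= 2")
    calls = num_nodes  # one answer per retrieved node
    remaining = num_nodes
    while True:
        remaining = math.ceil(remaining / num_children)  # one call per batch
        calls += remaining
        if remaining == 1:
            return calls


for n in (6, 25, 150):
    print(n, "nodes ->", count_llm_calls(n), "calls")
```

For example, 25 nodes with num_children=10 need 25 per-node calls, 3 calls at the first combine level, and 1 at the root, for 29 in total.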

In [None]:
response, fmt_prompts = generate_response_hs(
    retrieved_nodes, query_str, qa_prompt, llm
)

In [None]:
print(str(response))

Observation: Note that the answer is much more concise than the create-and-refine approach. This is a well-known phenomenon: hierarchical summarization tends to compress information at each stage, whereas create and refine encourages adding more information with each node.

Observation: Similar to the above section, there are inefficiencies. We are still generating an answer for each node independently that we can try to optimize away.

Our ResponseSynthesizer module handles this!

##4. [Optional] Let’s create an async version of hierarchical summarization!

A pro of the hierarchical summarization approach is that the LLM calls can be parallelized, leading to big speedups in response synthesis.

We implement an async version below. We use asyncio.gather to execute coroutines (LLM calls) for each Node concurrently.

In [None]:
import nest_asyncio
import asyncio

nest_asyncio.apply()
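Before the real implementation, here is a small sketch of the concurrency pattern on its own, with a stand-in for the LLM. The coroutine `fake_acomplete` is hypothetical and simply sleeps instead of calling an API. Inside a notebook, `asyncio.run` relies on nest_asyncio having been applied as in the cell above.

```python
import asyncio
import time
from typing import List


async def fake_acomplete(prompt: str, delay: float = 0.1) -> str:
    """Stand-in for an async LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def answer_all(prompts: List[str]) -> List[str]:
    """Launch every call at once and wait for all of them."""
    tasks = [fake_acomplete(p) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


start = time.perf_counter()
answers = asyncio.run(answer_all(["p1", "p2", "p3", "p4", "p5"]))
elapsed = time.perf_counter() - start
print(answers[0])
print(f"5 calls finished in about {elapsed:.2f}s")
```

Because the five simulated calls overlap, the total wall time should be close to a single delay rather than the sum of all five, and the answers come back in prompt order.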

In [None]:
async def acombine_results(
    texts,
    query_str,
    qa_prompt,
    llm,
    cur_prompt_list,
    num_children=10,
):
    fmt_prompts = []
    for idx in range(0, len(texts), num_children):
        text_batch = texts[idx : idx + num_children]
        context_str = "\n\n".join([t for t in text_batch])
        fmt_qa_prompt = qa_prompt.format(
            context_str=context_str, query_str=query_str
        )
        fmt_prompts.append(fmt_qa_prompt)
        cur_prompt_list.append(fmt_qa_prompt)

    tasks = [llm.acomplete(p) for p in fmt_prompts]
    combined_responses = await asyncio.gather(*tasks)
    new_texts = [str(r) for r in combined_responses]

    if len(new_texts) == 1:
        return new_texts[0]
    else:
        # recurse on the combined answers, passing the prompt list along
        return await acombine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )


async def agenerate_response_hs(
    retrieved_nodes, query_str, qa_prompt, llm, num_children=10
):
    """Generate a response using hierarchical summarization strategy.

    Combine num_children nodes hierarchically until we get one root node.

    """
    fmt_prompts = []
    node_responses = []
    for node in retrieved_nodes:
        context_str = node.get_content()
        fmt_qa_prompt = qa_prompt.format(
            context_str=context_str, query_str=query_str
        )
        fmt_prompts.append(fmt_qa_prompt)

    tasks = [llm.acomplete(p) for p in fmt_prompts]
    node_responses = await asyncio.gather(*tasks)

    # use the async combine step so the combine-level LLM calls also run concurrently
    response_txt = await acombine_results(
        [str(r) for r in node_responses],
        query_str,
        qa_prompt,
        llm,
        fmt_prompts,
        num_children=num_children,
    )

    return response_txt, fmt_prompts

In [None]:
response, fmt_prompts = await agenerate_response_hs(
    retrieved_nodes, query_str, qa_prompt, llm
)

In [None]:
print(str(response))

#Let’s put it all together!

Let’s define a simple query engine that can be initialized with a retriever, prompt, LLM, etc., and have it implement a simple query function. We also implement an async version, which can be used if you completed part 4 above!

NOTE: We skip subclassing our own QueryEngine abstractions. This is a big TODO to make it more easily sub-classable!

In [None]:
from llama_index.retrievers import BaseRetriever
from llama_index.llms.base import LLM
from dataclasses import dataclass
from typing import Optional, List


@dataclass
class Response:
    response: str
    source_nodes: Optional[List] = None

    def __str__(self):
        return self.response


class MyQueryEngine:
    """My query engine.

    Uses the tree summarize response synthesis module by default.

    """

    def __init__(
        self,
        retriever: BaseRetriever,
        qa_prompt: PromptTemplate,
        llm: LLM,
        num_children=10,
    ) -> None:
        self._retriever = retriever
        self._qa_prompt = qa_prompt
        self._llm = llm
        self._num_children = num_children

    def query(self, query_str: str):
        retrieved_nodes = self._retriever.retrieve(query_str)
        response_txt, _ = generate_response_hs(
            retrieved_nodes,
            query_str,
            self._qa_prompt,
            self._llm,
            num_children=self._num_children,
        )
        response = Response(response_txt, source_nodes=retrieved_nodes)
        return response

    async def aquery(self, query_str: str):
        retrieved_nodes = await self._retriever.aretrieve(query_str)
        response_txt, _ = await agenerate_response_hs(
            retrieved_nodes,
            query_str,
            self._qa_prompt,
            self._llm,
            num_children=self._num_children,
        )
        response = Response(response_txt, source_nodes=retrieved_nodes)
        return response

In [None]:
query_engine = MyQueryEngine(retriever, qa_prompt, llm, num_children=10)

In [None]:
response = query_engine.query(query_str)

In [None]:
print(str(response))

In [None]:
response = await query_engine.aquery(query_str)

In [None]:
print(str(response))

#Building Evaluation from Scratch

We show how you can build evaluation modules from scratch. This includes both evaluation of the final generated response (where the output is plain text), as well as the evaluation of retrievers (where the output is a ranked list of items).

We have in-house modules in our Evaluation section.

##Setup Pinecone and Load data
We load some data and define a very simple RAG query engine that we’ll evaluate (uses top-k retrieval).

In [None]:
import pinecone
from llama_index.vector_stores import PineconeVectorStore

pinecone.init(api_key="", environment="us-west1-gcp-free")

index_name = ""

pinecone_index = pinecone.Index(index_name)

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

file_path = ""
loader = PyMuPDFReader()
documents = loader.load(file_path=file_path)

In [None]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI

In [None]:
import openai
openai.api_key = ""

llm = OpenAI(model="gpt-4")
node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
service_context = ServiceContext.from_defaults(llm=llm)

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
index = VectorStoreIndex(nodes, service_context=service_context)

In [None]:
query_engine = index.as_query_engine()

#Dataset Generation
We first go through an exercise of generating a synthetic evaluation dataset. We do this by synthetically generating a set of questions from existing context. We then run each question with existing context through a powerful LLM (e.g. GPT-4) to generate a “ground-truth” response.

##Define Functions
We define the functions that we will use for dataset generation:

In [None]:
from llama_index.schema import BaseNode
from llama_index.llms import OpenAI
from llama_index.prompts import (
    ChatMessage,
    ChatPromptTemplate,
    MessageRole,
    PromptTemplate,
)
from typing import Tuple, List
import re

llm = OpenAI(model="gpt-4")

We define generate_answers_for_questions to generate answers from questions given context.

In [None]:
QA_PROMPT = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def generate_answers_for_questions(
    questions: List[str], context: str, llm: OpenAI
) -> List[str]:
    """Generate answers for questions given context."""
    answers = []
    for question in questions:
        fmt_qa_prompt = QA_PROMPT.format(
            context_str=context, query_str=question
        )
        response_obj = llm.complete(fmt_qa_prompt)
        answers.append(str(response_obj))
    return answers

We define generate_qa_pairs to generate qa pairs over an entire list of Nodes.



In [None]:
QUESTION_GEN_USER_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "generate the relevant questions. "
)

QUESTION_GEN_SYS_TMPL = """\
You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.\
"""

question_gen_template = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role=MessageRole.SYSTEM, content=QUESTION_GEN_SYS_TMPL),
        ChatMessage(role=MessageRole.USER, content=QUESTION_GEN_USER_TMPL),
    ]
)


def generate_qa_pairs(
    nodes: List[BaseNode], llm: OpenAI, num_questions_per_chunk: int = 10
) -> List[Tuple[str, str]]:
    """Generate questions."""
    qa_pairs = []
    for idx, node in enumerate(nodes):
        print(f"Node {idx}/{len(nodes)}")
        context_str = node.get_content(metadata_mode="all")
        fmt_messages = question_gen_template.format_messages(
            num_questions_per_chunk=num_questions_per_chunk,
            context_str=context_str,
        )
        chat_response = llm.chat(fmt_messages)
        raw_output = chat_response.message.content
        result_list = str(raw_output).strip().split("\n")
        cleaned_questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip()
            for question in result_list
        ]
        answers = generate_answers_for_questions(
            cleaned_questions, context_str, llm
        )
        cur_qa_pairs = list(zip(cleaned_questions, answers))
        qa_pairs.extend(cur_qa_pairs)
    return qa_pairs
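The question-cleaning step is easy to check in isolation. The sketch below applies the same regular expression to a made-up LLM output. The helper `clean_question_lines` is hypothetical; unlike the function above, it also strips each line before matching and drops blank lines, which is an addition in this sketch.

```python
import re
from typing import List


def clean_question_lines(raw_output: str) -> List[str]:
    """Strip leading list numbering from each line and drop blank lines."""
    cleaned = []
    for line in raw_output.strip().split("\n"):
        # Same pattern as generate_qa_pairs: digits followed by ')', '.' or whitespace.
        question = re.sub(r"^\d+[\).\s]", "", line.strip()).strip()
        if question:  # extra step in this sketch: skip empty lines
            cleaned.append(question)
    return cleaned


raw = "1. What is claimed?\n2) Who is the defendant?\n\n10 Which court hears the case?"
print(clean_question_lines(raw))
```

Dropping blank lines matters because every remaining line is later sent to the LLM as a question, so an empty string would cost a call and produce a meaningless pair.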

#Getting Pairs over Dataset

NOTE: This can take a long time. For the sake of speed try inputting a subset of the nodes.

In [None]:
qa_pairs = generate_qa_pairs(
    nodes[:1],
    llm,
    num_questions_per_chunk=10,
)

In [None]:
qa_pairs

##Define save/load

In [None]:
import pickle

# use a context manager so the file handle is closed after writing
with open("eval_dataset.pkl", "wb") as f:
    pickle.dump(qa_pairs, f)

In [None]:
import pickle

# use a context manager so the file handle is closed after reading
with open("eval_dataset.pkl", "rb") as f:
    qa_pairs = pickle.load(f)

#Evaluating Generation
In this section we walk through a few methods for evaluating the generated results. At a high level, we use an “evaluation LLM” to measure the quality of the generated results. We do this both with labels and without labels.

We go through the following evaluation algorithms:

Correctness: Compares the generated answer against the ground-truth answer.

Faithfulness: Evaluates whether a response is faithful to the contexts (label-free).

##Building a Correctness Evaluator
The correctness evaluator compares the generated answer to the reference ground-truth answer, given the query. We output a score between 1 and 5, where 1 is the worst and 5 is the best.

We do this through a system and user prompt with a chat interface.

In [None]:
from llama_index.prompts import (
    ChatMessage,
    ChatPromptTemplate,
    MessageRole,
    PromptTemplate,
)
from typing import Dict

In [None]:
CORRECTNESS_SYS_TMPL = """
You are an expert evaluation system for a question answering chatbot.

You are given the following information:
- a user query,
- a reference answer, and
- a generated answer.

Your job is to judge the relevance and correctness of the generated answer.
Output a single score that represents a holistic evaluation.
You must return your response in a line with only the score.
Do not return answers in any other format.
On a separate line provide your reasoning for the score as well.

Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query, \
you should give a score of 1.
- If the generated answer is relevant but contains mistakes, \
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct, \
you should give a score between 4 and 5.
"""

CORRECTNESS_USER_TMPL = """
## User Query
{query}

## Reference Answer
{reference_answer}

## Generated Answer
{generated_answer}
"""

In [None]:
eval_chat_template = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role=MessageRole.SYSTEM, content=CORRECTNESS_SYS_TMPL),
        ChatMessage(role=MessageRole.USER, content=CORRECTNESS_USER_TMPL),
    ]
)

Now that we’ve defined the prompt template, let’s define an evaluation function that feeds the prompt to the LLM and parses the output into a dict of results.



In [None]:
from llama_index.llms import OpenAI


def run_correctness_eval(
    query_str: str,
    reference_answer: str,
    generated_answer: str,
    llm: OpenAI,
    threshold: float = 4.0,
) -> Dict:
    """Run correctness eval."""
    fmt_messages = eval_chat_template.format_messages(
        llm=llm,
        query=query_str,
        reference_answer=reference_answer,
        generated_answer=generated_answer,
    )
    chat_response = llm.chat(fmt_messages)
    raw_output = chat_response.message.content

    # Extract from response
    score_str, reasoning_str = raw_output.split("\n", 1)
    score = float(score_str)
    reasoning = reasoning_str.lstrip("\n")

    return {"passing": score >= threshold, "score": score, "reason": reasoning}
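
The parsing step above assumes the model puts a bare number on the first line and the reasoning after it. If the model answers with something like "Score: 4", or omits the reasoning line, `split` and `float` will raise. Below is a minimal sketch of a more forgiving parser; `parse_eval_output` is a hypothetical helper introduced for illustration, not part of LlamaIndex.

```python
import re
from typing import Dict, Optional


def parse_eval_output(raw_output: str, threshold: float = 4.0) -> Dict:
    """Parse a 'score line + reasoning' reply defensively (hypothetical helper)."""
    # Split off the first line; the reasoning part may be missing entirely.
    first_line, _, rest = raw_output.strip().partition("\n")
    # Accept "4", "4.5" or "Score: 4" by taking the first number on the line.
    match = re.search(r"\d+(?:\.\d+)?", first_line)
    score: Optional[float] = float(match.group()) if match else None
    return {
        # A reply without a parsable score is treated as not passing.
        "passing": score is not None and score >= threshold,
        "score": score,
        "reason": rest.strip(),
    }
```

Returning `None` for an unparsable score, instead of raising, keeps a long evaluation run from stopping on a single malformed reply; whether that is the right trade-off depends on how strictly you want to audit the evaluator.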

Now let’s try running this on some sample inputs with a chat model (GPT-4).



In [None]:
llm = OpenAI(model="gpt-4")


In [None]:
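# The source document is in Portuguese, so the query and reference stay in Portuguese.
# English: "What is the subject matter of the court filing?"
query_str = (
    "Qual é a matéria da peça processual?"
)
# English: "Claim for damages for an improper negative listing with credit protection agencies"
reference_answer = (
    "Pedido de indenização por negativação indevida nos órgãos de proteção ao crédito"
)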

In [None]:
generated_answer = str(query_engine.query(query_str))


In [None]:
print(str(generated_answer))


In [None]:
eval_results = run_correctness_eval(
    query_str, reference_answer, generated_answer, llm=llm, threshold=4.0
)
display(eval_results)

#Building a Faithfulness Evaluator
The faithfulness evaluator checks whether the response is supported by at least one of the retrieved contexts.

This is a step up in complexity from the correctness evaluator. Since the set of contexts can be quite long, they might overflow the context window, so we need a response synthesis strategy that iterates over the contexts in sequence.

We have a corresponding tutorial showing you how to build response synthesis from scratch. We also have out-of-the-box response synthesis modules. In this guide we’ll use the out-of-the-box modules.
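
For intuition, iterating over contexts in sequence can be sketched without any library. The sketch below is a simplified illustration, not the actual LlamaIndex implementation: `create_and_refine` and `toy_complete` are hypothetical names, and `toy_complete` is a keyword-matching stand-in for a real LLM call.

```python
from typing import Callable, List


def create_and_refine(
    information: str,
    contexts: List[str],
    complete: Callable[[str], str],
) -> str:
    """Answer from the first context, then revisit the answer for each later one."""
    if not contexts:
        raise ValueError("at least one context is required")
    # "Create": answer using only the first context chunk.
    answer = complete(
        f"Information: {information}\nContext: {contexts[0]}\nAnswer: "
    )
    # "Refine": carry the running answer forward one chunk at a time,
    # so no single prompt has to hold every context.
    for context in contexts[1:]:
        answer = complete(
            f"Information: {information}\n"
            f"Existing answer: {answer}\n"
            f"Context: {context}\n"
            "Answer: "
        )
    return answer


def toy_complete(prompt: str) -> str:
    """Stand-in for an LLM: keeps an earlier YES, else does a keyword check."""
    if "Existing answer: YES" in prompt:
        return "YES"
    context_part = prompt.split("Context: ", 1)[1]
    return "YES" if "double-crusted" in context_part else "NO"


verdict = create_and_refine(
    "Apple pie is generally double-crusted.",
    ["An apple pie is a fruit pie.", "It is generally double-crusted."],
    toy_complete,
)
```

Each call sees only one context plus the running answer, which is what keeps the prompt size bounded regardless of how many contexts were retrieved.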

In [None]:
EVAL_TEMPLATE = PromptTemplate(
    "Please tell if a given piece of information "
    "is supported by the context.\n"
    "You need to answer with either YES or NO.\n"
    "Answer YES if any of the context supports the information, even "
    "if most of the context is unrelated. "
    "Some examples are provided below. \n\n"
    "Information: Apple pie is generally double-crusted.\n"
    "Context: An apple pie is a fruit pie in which the principal filling "
    "ingredient is apples. \n"
    "Apple pie is often served with whipped cream, ice cream "
    "('apple pie à la mode'), custard or cheddar cheese.\n"
    "It is generally double-crusted, with pastry both above "
    "and below the filling; the upper crust may be solid or "
    "latticed (woven of crosswise strips).\n"
    "Answer: YES\n"
    "Information: Apple pies tastes bad.\n"
    "Context: An apple pie is a fruit pie in which the principal filling "
    "ingredient is apples. \n"
    "Apple pie is often served with whipped cream, ice cream "
    "('apple pie à la mode'), custard or cheddar cheese.\n"
    "It is generally double-crusted, with pastry both above "
    "and below the filling; the upper crust may be solid or "
    "latticed (woven of crosswise strips).\n"
    "Answer: NO\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: "
)

EVAL_REFINE_TEMPLATE = PromptTemplate(
    "We want to understand if the following information is present "
    "in the context information: {query_str}\n"
    "We have provided an existing YES/NO answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "If the existing answer was already YES, still answer YES. "
    "If the information is present in the new context, answer YES. "
    "Otherwise answer NO.\n"
)

NOTE: In the current response synthesizer setup we don’t separate out a system and user message for chat endpoints, so we just use our standard llm.complete for text completion.

We now define our function below. Since we defined both a standard eval template for the first piece of context and a refine template for subsequent contexts, we use the “create-and-refine” response synthesis strategy to obtain the answer.

In [None]:
from llama_index.response_synthesizers import Refine
from llama_index import ServiceContext
from typing import List, Dict


def run_faithfulness_eval(
    generated_answer: str,
    contexts: List[str],
    llm: OpenAI,
) -> Dict:
    """Run faithfulness eval."""

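    service_context = ServiceContext.from_defaults(llm=llm)
    # Pass the service context so that Refine uses the llm given to this function.
    refine = Refine(
        service_context=service_context,
        text_qa_template=EVAL_TEMPLATE,
        refine_template=EVAL_REFINE_TEMPLATE,
    )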

    response_obj = refine.get_response(generated_answer, contexts)
    response_txt = str(response_obj)

    # The answer passes if the final response contains YES (case-insensitive).
    passing = "yes" in response_txt.lower()

    return {"passing": passing, "reason": response_txt}
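
Note that the substring check above also matches words that merely contain "yes", such as "yesterday". A stricter variant, shown here as a sketch, looks only at the first standalone YES/NO token; `parse_yes_no` is a hypothetical helper, not a LlamaIndex function.

```python
import re


def parse_yes_no(response_txt: str) -> bool:
    """Return True only when the first standalone YES/NO token is YES."""
    # \b keeps words such as "yesterday" or "nothing" from counting as a verdict.
    match = re.search(r"\b(yes|no)\b", response_txt, flags=re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "yes"
```

With this helper, a reply such as "NO, although yes appears later" is read as a NO, and a reply with no verdict at all is treated as failing.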

Let’s try it out on the same query as before.



In [None]:
# Reuse query_str from the correctness example above; the faithfulness
# evaluator does not need the reference answer.

response = query_engine.query(query_str)
generated_answer = str(response)

In [None]:
context_list = [n.get_content() for n in response.source_nodes]
eval_results = run_faithfulness_eval(
    generated_answer,
    contexts=context_list,
    llm=llm,
)
display(eval_results)

#Running Evaluation over our Eval Dataset
Now let’s tie the two above sections together and run our eval modules over our eval dataset!

NOTE: For the sake of speed/cost we extract a very limited sample.

In [None]:
import random

sample_size = 5
qa_pairs_sample = random.sample(qa_pairs, sample_size)
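
`random.sample` raises a `ValueError` when the requested sample is larger than the dataset, and without a seed every run evaluates different questions. A small sketch of a repeatable, size-capped variant follows; `sample_qa_pairs` and its default seed are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sample_qa_pairs(
    qa_pairs: List[Tuple[str, str]], sample_size: int, seed: int = 42
) -> List[Tuple[str, str]]:
    """Draw a repeatable sample, capped at the number of available pairs."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    return rng.sample(qa_pairs, min(sample_size, len(qa_pairs)))
```

Fixing the seed makes scores comparable between runs, for example when you change the chunk size and want to re-evaluate on the same questions.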

In [None]:
import pandas as pd
from typing import List, Tuple


def run_evals(qa_pairs: List[Tuple[str, str]], llm: OpenAI, query_engine):
    """Run both evaluators over each (question, reference answer) pair."""
    results_list = []
    for question, reference_answer in qa_pairs:
        response = query_engine.query(question)
        generated_answer = str(response)
        # Evaluate against the question from this pair, not the global query_str.
        correctness_results = run_correctness_eval(
            question,
            reference_answer,
            generated_answer,
            llm=llm,
            threshold=4.0,
        )
        # Use the contexts retrieved for this response, not a stale global list.
        faithfulness_results = run_faithfulness_eval(
            generated_answer,
            contexts=[n.get_content() for n in response.source_nodes],
            llm=llm,
        )
        cur_result_dict = {
            "correctness": correctness_results["passing"],
            "faithfulness": faithfulness_results["passing"],
        }
        results_list.append(cur_result_dict)
    return pd.DataFrame(results_list)

In [None]:
evals_df = run_evals(qa_pairs_sample, llm, query_engine)


In [None]:
evals_df["correctness"].mean()


In [None]:
evals_df["faithfulness"].mean()
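
Because both columns hold booleans, each mean above is the fraction of sampled questions that passed. The same arithmetic is shown below as a plain-Python sketch on made-up results; `pass_rates` and `toy_results` are illustrative and are not outputs of this notebook.

```python
from typing import Dict, List


def pass_rates(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of True values per metric, as a boolean column's mean would give."""
    if not results:
        return {}
    metrics = results[0].keys()
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return {m: sum(r[m] for r in results) / len(results) for m in metrics}


# Made-up results for four questions, for illustration only.
toy_results = [
    {"correctness": True, "faithfulness": True},
    {"correctness": False, "faithfulness": True},
    {"correctness": True, "faithfulness": True},
    {"correctness": False, "faithfulness": False},
]
rates = pass_rates(toy_results)
```

With a sample of only five questions, each question moves a pass rate by 0.2, so treat these numbers as a smoke test and not as a stable estimate.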
