<!-- docusaurus_head_meta::start
---
title: Introduction Notebook
---
docusaurus_head_meta::end -->

<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{intro-colab} -->

# Evaluating Trustworthiness in RAG Pipelines with Weave Integration

This notebook demonstrates how to add a faithfulness score that measures how trustworthy the answers from a Retrieval-Augmented Generation (RAG) pipeline are. We integrate the pipeline with Weave to track function inputs and outputs, publish the evaluation dataset as a versioned object, and run the same evaluation against different models.

## Objectives:

* Implement a RAG pipeline that includes a faithfulness scoring mechanism.
* Integrate Weave to track all function calls, inputs, and outputs.
* Publish the evaluation dataset and wrap the pipeline in a Weave Model to facilitate reuse and analysis.
* Run the evaluation with two OpenAI models and showcase the evaluation steps.

## Stack Used:

* LlamaIndex for RAG workflows.
* OpenAI API for language models and embeddings.
* Weave by Weights & Biases for tracking and evaluation.

Note: Ensure you have the necessary API keys set up in your environment.



## 🪄 Install Dependencies and Log In

Start by installing the required libraries and logging in to your Weights & Biases account.

This example uses OpenAI, so you should [add an OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-setup-your-api-key).



In [None]:
%%capture
!pip install weave \
openai set-env-colab-kaggle-dotenv \
requests \
python-dotenv==1.0.1 \
PyPDF2 \
unstructured \
pdfminer.six \
llama-index


In [None]:
# Set your OpenAI API key
import os

from set_env import set_env

# Put your OPENAI_API_KEY in the secrets panel to the left 🗝️
_ = set_env("OPENAI_API_KEY")
# os.environ["OPENAI_API_KEY"] = "sk-..." # alternatively, put your key here

PROJECT = "Trustworthiness_Check"


In [None]:
import weave                    # import the weave library

weave.init(PROJECT)      # initialize tracking for a specific W&B project


Please login to Weights & Biases (https://wandb.ai/) to continue:


[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Logged in as Weights & Biases user: mg01.
View Weave data at https://wandb.ai/wandb-smle/trustworthiness_check/weave


<weave.trace.weave_client.WeaveClient at 0x7b2bf2b44b50>


## 📚 Import Necessary Libraries

We'll import all the required libraries for our project, including OpenAI, LlamaIndex, and Weave.



In [None]:

import os
from typing import Any, Dict, List

import requests
import weave
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from openai import OpenAI

# Optionally load environment variables from a local .env file
# load_dotenv()


## 🔑 Initialize OpenAI Client and Embedding Model

Create an OpenAI client instance for API calls and set up the embedding model.



In [None]:
# Initialize OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Set up embedding model
embedding_model = OpenAIEmbedding(model="text-embedding-ada-002")


## 📥 Download and Load Documents

We'll download a PDF document from a URL and create an index using LlamaIndex. Note that you can replace this with your own vector database that already indexes the data for your RAG chatbot.




In [None]:
# Download the PDF from a URL
pdf_url = "https://arxiv.org/pdf/2408.13296v1.pdf"  # Replace with your PDF URL
pdf_filename = "document.pdf"

response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # stop early if the download failed
with open(pdf_filename, 'wb') as f:
    f.write(response.content)

# Load the documents from the PDF
documents = SimpleDirectoryReader(input_dir='.', required_exts=['.pdf']).load_data()

# Create the index from the documents
index = VectorStoreIndex.from_documents(documents, embed_model=embedding_model)


## 🔎 Create Query Engine

Set up the query engine with a limit on the number of retrieved documents.



In [None]:
# Create the query engine
query_engine = index.as_query_engine(similarity_top_k=3)
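
`similarity_top_k=3` asks the retriever to keep only the three chunks whose embeddings are most similar to the query embedding. The toy sketch below illustrates that ranking step with plain Python lists. It assumes cosine similarity as the metric and uses made-up two-dimensional vectors; `cosine_similarity` and `top_k_chunks` are hypothetical helper names, not LlamaIndex functions, and this is not how LlamaIndex implements retrieval internally.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: List[float], chunk_vecs: List[List[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (illustrative values only, not real model output)
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = top_k_chunks([1.0, 0.0], toy_chunks, k=3)
print([index for index, _ in ranked])
```

A larger `similarity_top_k` gives the model more context to draw on, at the cost of longer prompts and more text for the faithfulness check to cover.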


## 🛠️ Define Weave-Tracked Functions

We'll define our functions for the pipeline and use `@weave.op()` to decorate them, enabling Weave to track their inputs and outputs.

### 1. Retrieve Context

This function retrieves relevant context for the question using the LlamaIndex query engine.



In [None]:
@weave.op()
def retrieve_context(question: str) -> str:
    '''
    Retrieves relevant context for the question using LlamaIndex query engine.
    '''
    response = query_engine.query(question)
    context = str(response)
    return context
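
One thing to keep in mind: `str(response)` is the text that the query engine synthesized from the retrieved chunks, not the chunks themselves. If you would rather check faithfulness against the raw retrieved passages, a possible variant is sketched below. It assumes the response object exposes a `source_nodes` list whose items have a `.node.get_content()` method, as LlamaIndex responses do at the time of writing. `context_from_source_nodes` is a hypothetical helper, and the objects built with `SimpleNamespace` are stand-ins so the sketch runs without an index.

```python
from types import SimpleNamespace
from typing import Any


def context_from_source_nodes(response: Any, separator: str = "\n\n") -> str:
    """Join the text of the retrieved chunks attached to a query response."""
    chunks = [item.node.get_content() for item in response.source_nodes]
    return separator.join(chunks)


def _fake_node(text: str) -> SimpleNamespace:
    # Stand-in for a retrieved node; only the attributes used above are provided.
    return SimpleNamespace(node=SimpleNamespace(get_content=lambda: text))


fake_response = SimpleNamespace(
    source_nodes=[_fake_node("Chunk A."), _fake_node("Chunk B.")]
)
raw_context = context_from_source_nodes(fake_response)
print(raw_context)
```

In the notebook, this would mean calling `context_from_source_nodes(response)` inside `retrieve_context` instead of `str(response)`. Which choice is appropriate depends on whether you want to measure faithfulness to the source document or to the query engine's summary of it.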


### 2. Generate Answer

This function generates an answer to the question based on the provided context using OpenAI's GPT model.



In [None]:
@weave.op()
def generate_answer(question: str, context: str, model_name: str) -> str:
    '''
    Generates an answer to the question based on the provided context using OpenAI's GPT model.
    '''

    messages = [
        {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"}
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        n=1,
    )
    answer = response.choices[0].message.content.strip()
    return answer


### 3. Break Down Answer into Statements

This function breaks down the answer into simpler statements without pronouns.



In [None]:

@weave.op()
def break_down_answer_into_statements(answer: str, model_name: str) -> List[str]:
    '''
    Breaks down the answer into simpler statements without pronouns.
    '''

    messages = [
        {"role": "system", "content": "You simplify answers into fully understandable statements without pronouns."},
        {"role": "user", "content": f"Break down the following answer into a list of simpler statements, ensuring each statement is fully understandable and contains no pronouns.\n\nAnswer:\n{answer}\n\nStatements:"}
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=300,
        temperature=0.5,
        n=1,
    )
    statements_text = response.choices[0].message.content.strip()
    # Parse statements as a list
    statements = [s.strip().strip('.').strip() for s in statements_text.split('\n') if s.strip()]
    # Remove any numbering or bullets
    statements = [s.lstrip('0123456789.- ') for s in statements]
    return statements
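
The parsing step above relies on `lstrip('0123456789.- ')`, which removes every leading digit, dot, dash, and space. That also eats a number that legitimately starts a statement: "- 2 GPUs were used" becomes "GPUs were used". If that matters for your data, a stricter parser could remove only a single list marker per line. The sketch below is one way to do it; `parse_statements` is a hypothetical replacement for the two list comprehensions, and the sample text is invented for illustration.

```python
import re
from typing import List

# One list marker at the start of a line: "1.", "2)", "-", "*" or "•", followed by whitespace
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_statements(statements_text: str) -> List[str]:
    """Split model output into statements, dropping list markers and blank lines."""
    statements = []
    for line in statements_text.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned:
            statements.append(cleaned)
    return statements


sample = (
    "1. LoRA freezes the base weights.\n"
    "- 2 GPUs were used.\n"
    "\n"
    "3) LoRA trains low-rank adapters."
)
parsed = parse_statements(sample)
print(parsed)
```

Unlike the original cell, this version keeps the trailing period of each statement, which is harmless for the faithfulness check.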


### 4. Check Statement Faithfulness

This function checks if each statement can be directly inferred from the context.

In [None]:


@weave.op()
def check_statement_faithfulness(context: str, statement: str, model_name: str) -> Dict[str, Any]:
    '''
    Checks if the statement can be directly inferred from the context.
    Returns a verdict (1 for Yes, 0 for No) and the explanation.
    '''

    messages = [
        {"role": "system", "content": "You check if statements can be inferred from a given context."},
        {"role": "user", "content": f"Given the following context, determine if the statement below can be directly inferred from the context. Answer with 'Yes' or 'No' and provide a brief reason.\n\nContext:\n{context}\n\nStatement:\n{statement}\n\nCan the statement be inferred from the context?"}
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=150,
        temperature=0,
        n=1,
    )
    result = response.choices[0].message.content.strip()
    # Parse the result to extract 'Yes' or 'No'
    if result.lower().startswith('yes'):
        verdict = 1
    else:
        verdict = 0
    return {'verdict': verdict, 'explanation': result}
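
The `startswith('yes')` check treats any reply that does not begin with "yes" as unfaithful, so a reply such as "**Yes**, the context states this" or "Answer: Yes" is counted as a "No". A slightly more tolerant parser, sketched below under the assumption that the first standalone "yes" or "no" in the reply is the verdict, may reduce such false negatives. `parse_verdict` is a hypothetical helper, and it can still be fooled by replies like "There is no doubt: yes".

```python
import re


def parse_verdict(result: str) -> int:
    """Return 1 if the first standalone yes/no word in the reply is 'yes', else 0."""
    match = re.search(r"\b(yes|no)\b", result, flags=re.IGNORECASE)
    if match is None:
        # No recognizable verdict: treat the statement as not supported.
        return 0
    return 1 if match.group(1).lower() == "yes" else 0


print(parse_verdict("**Yes**, the context states this."))
print(parse_verdict("Answer: No. The context does not mention it."))
```

Asking the model for a structured reply (for example, a JSON object with a `verdict` field) would be more robust still, at the cost of a longer prompt.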


## 📊 Register Evaluation Dataset

We'll create and register a single evaluation dataset in Weave. This dataset will be used to evaluate the faithfulness of the generated answers.



In [None]:
# Define the dataset
dataset = weave.Dataset(
    name="Faithfulness_Evaluation_Dataset",
    rows=[
        {"question": "What are the limitations of the Transformers library and Trainer API?"},
        {"question": "Explain the LoRA technique for fine-tuning."},
        {"question": "Why is fine-tuning GPT-4 more challenging than fine-tuning GPT-3.5?"},
        {"question": "Explain why fine-tuning is cheaper than few-shot learning."},
        {"question": "What are the key features of NVIDIA NeMo?"},
    ],
)

# Publish dataset to Weave
weave.publish(dataset)


📦 Published to https://wandb.ai/wandb-smle/trustworthiness_check/weave/objects/Faithfulness_Evaluation_Dataset/versions/4eANf654BpPRx926Hw1asWWdnpXYVatdUcsOcmIZRTg


ObjectRef(entity='wandb-smle', project='trustworthiness_check', name='Faithfulness_Evaluation_Dataset', digest='4eANf654BpPRx926Hw1asWWdnpXYVatdUcsOcmIZRTg', extra=())

## 🧪 Define End-to-End Pipeline as a Weave Model

We'll define an end-to-end pipeline as a Weave Model. This allows us to use it for evaluation later and makes the entire process reproducible and traceable.



In [None]:


class FaithfulnessEvaluator(weave.Model):
    model_name: str = "gpt-3.5-turbo"

    @weave.op()
    def predict(self, question: str) -> Dict[str, Any]:
        '''
        Generates an answer to the question based on retrieved context.
        Returns a dict with 'answer', 'context', and 'model_name'.
        '''
        # Retrieve context
        context = retrieve_context(question)
        # Generate answer
        answer = generate_answer(question, context, self.model_name)
        return {'answer': answer, 'context': context, 'model_name': self.model_name}
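
When the evaluation runs, Weave matches the keys of each dataset row to the argument names of `predict`, so every row needs a `question` key. A quick pre-flight check like the sketch below can catch a misnamed column before any API calls are made. `missing_columns` is a hypothetical helper and the `predict` function here is a plain stand-in; whether `inspect.signature` behaves the same way on a `@weave.op()`-decorated method is an assumption you should verify in your environment.

```python
import inspect
from typing import Any, Callable, Dict, List


def missing_columns(
    fn: Callable[..., Any], rows: List[Dict[str, Any]]
) -> Dict[int, List[str]]:
    """Map row index -> required arguments of `fn` that the row does not provide."""
    required = [
        name
        for name, param in inspect.signature(fn).parameters.items()
        if name != "self"
        and param.default is inspect.Parameter.empty
        and param.kind in (param.POSITIONAL_OR_KEYWORD, param.KEYWORD_ONLY)
    ]
    problems: Dict[int, List[str]] = {}
    for index, row in enumerate(rows):
        missing = [name for name in required if name not in row]
        if missing:
            problems[index] = missing
    return problems


def predict(question: str) -> Dict[str, Any]:
    # Stand-in with the same argument name as FaithfulnessEvaluator.predict
    return {"answer": "", "context": "", "model_name": ""}


rows = [{"question": "What is LoRA?"}, {"query": "misnamed column"}]
print(missing_columns(predict, rows))  # {1: ['question']}
```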


## 📝 Define Scorer Function

We'll define a scorer function that computes the faithfulness score of the model's answer. This function will be used by Weave's `Evaluation` class.



In [None]:
@weave.op()
def faithfulness_scorer(model_output: Dict[str, Any]) -> Dict[str, Any]:
    '''
    Scorer function that computes the faithfulness score of the model's answer.
    '''
    answer = model_output['answer']
    context = model_output['context']
    model_name = model_output['model_name']
    # Break down the answer into statements
    statements = break_down_answer_into_statements(answer, model_name)
    # Check each statement for faithfulness
    total_statements = len(statements)
    faithful_statements = 0
    statement_results = []
    for statement in statements:
        result = check_statement_faithfulness(context, statement, model_name)
        faithful_statements += result['verdict']
        statement_results.append({
            'statement': statement,
            'verdict': result['verdict'],
            'explanation': result['explanation']
        })
    # Calculate faithfulness score
    if total_statements > 0:
        faithfulness_score = faithful_statements / total_statements
    else:
        faithfulness_score = 0
    # Return results
    return {
        'faithfulness_score': faithfulness_score,
        'statements': statements,
        'statement_results': statement_results,
    }
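
The score itself is simply the fraction of statements judged faithful. As a worked example of the arithmetic in the scorer, the sketch below reproduces it on a list of verdicts; `faithfulness_from_verdicts` is a hypothetical helper and the verdicts are illustrative.

```python
from typing import List


def faithfulness_from_verdicts(verdicts: List[int]) -> float:
    """Fraction of statements with verdict 1; 0.0 when there are no statements."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Four statements, three of them supported by the context: 3 / 4 = 0.75
print(faithfulness_from_verdicts([1, 1, 0, 1]))  # 0.75
```

Note that an answer that yields no statements scores 0, the same as an answer whose statements are all unsupported. If you need to tell those cases apart, check the length of `statements` in the scorer output.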


## 🚀 Run Evaluation Using Weave's `Evaluation` Class

We'll use Weave's `Evaluation` class to run the evaluation, ensuring that the results are stored in the **'eval'** section of Weave.



In [None]:
# Import Weave's Evaluation class
from weave import Evaluation
import asyncio
import nest_asyncio

# Initialize Weave
weave.init(PROJECT)

# Apply nest_asyncio to allow nested event loops in Colab
nest_asyncio.apply()

# Run the evaluation for both models

# Define the models to evaluate
model_names = ["gpt-3.5-turbo", "gpt-4o"]

for model_name in model_names:
    print(f"Running evaluation with model: {model_name}")
    # Instantiate the evaluator model with the specified model name
    evaluator_model = FaithfulnessEvaluator(model_name=model_name)

    # Define the evaluation
    evaluation = Evaluation(
        dataset=dataset,  # the dataset we have defined earlier
        scorers=[faithfulness_scorer],  # the scorer function
    )

    # Run the evaluation
    summary = asyncio.run(evaluation.evaluate(evaluator_model))

    print(f"Completed evaluation with model: {model_name}\n")



Running evaluation with model: gpt-3.5-turbo


🍩 https://wandb.ai/wandb-smle/trustworthiness_check/r/call/01927dff-7926-7161-954e-70b809b88a91
Completed evaluation with model: gpt-3.5-turbo

Running evaluation with model: gpt-4o


🍩 https://wandb.ai/wandb-smle/trustworthiness_check/r/call/01927dff-e680-7a23-932f-47c3f0690c1e
Completed evaluation with model: gpt-4o
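
The loop above overwrites `summary` on every iteration, so only the last model's summary remains in the notebook. To compare models in code rather than in the Weave UI, you could store each summary in a dictionary keyed by model name and read out the aggregated score. The sketch below assumes the summary is a nested dict of the form `{"faithfulness_scorer": {"faithfulness_score": {"mean": ...}}}`; that shape is an assumption, so print your own `summary` to confirm it. `mean_faithfulness` is a hypothetical helper, and the numbers are placeholders, not results from this notebook.

```python
from typing import Any, Dict, Optional


def mean_faithfulness(summary: Dict[str, Any]) -> Optional[float]:
    """Read the mean faithfulness score from an evaluation summary, if present."""
    scorer_summary = summary.get("faithfulness_scorer", {})
    score_summary = scorer_summary.get("faithfulness_score", {})
    if isinstance(score_summary, dict):
        return score_summary.get("mean")
    return None


# Placeholder summaries with made-up numbers, only to exercise the helper
summaries = {
    "model-a": {"faithfulness_scorer": {"faithfulness_score": {"mean": 0.5}}},
    "model-b": {"faithfulness_scorer": {"faithfulness_score": {"mean": 0.75}}},
}
best = max(summaries, key=lambda name: mean_faithfulness(summaries[name]) or 0.0)
print(best)
```

Inside the evaluation loop, this would amount to writing `summaries[model_name] = summary` after each call to `evaluation.evaluate`.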



## 📌 Conclusion

**Evaluation of Faithfulness**:

In this notebook, we focused on evaluating the **faithfulness** of answers generated by our
Retrieval-Augmented Generation (RAG) system. By breaking down the answers into simpler
statements and checking each one against the retrieved context, we quantified how much we can
**trust** the responses provided by the system.

**How Weave Helps**:

Weave played a crucial role in this process by:

- **Tracking**: Weave's `@weave.op()` decorators allowed us to track the inputs and outputs of our
  functions. This provided transparency into each step of our pipeline.
- **Evaluation**: Using Weave's `Evaluation` class, we conducted structured evaluations and stored
  the results in the **'eval'** section. This made it easy to analyze and compare results across models.
- **Reproducibility**: By publishing our dataset and defining our pipeline as a Weave Model, we made
  the pipeline reproducible and easily shareable.

**Benefits of Weave Integration**:

- **Enhanced Trust**: The faithfulness evaluation adds an extra layer of **trust** to
  our system. Users get a measurable signal of how well each answer is supported by the retrieved context.
- **Debugging and Improvement**: Weave's tracking capabilities make it easier to identify areas
  where the model may not be performing as expected, facilitating targeted improvements.
- **Comprehensive Insights**: The ability to store and analyze evaluation results within Weave
  provides comprehensive insights into model performance over time.

---

## 🔚 Final Thoughts

By integrating **Weave** into our code, we've enhanced the transparency, reliability, and
**trustworthiness** of our RAG system. We can:

- Track function inputs and outputs.
- Publish datasets as versioned Weave objects.
- Perform comprehensive evaluations focused on faithfulness.
- Define an end-to-end pipeline as a Weave Model for easier evaluation.
- Store evaluation results in the **'eval'** section of Weave for better analysis.

This approach not only provides valuable insights into the trustworthiness of the generated
answers but also contributes to building systems that users can rely on with confidence.

