This is an example notebook for setting up and running the chatbot evaluation using RAGAS.

# Setting up the test dataset & preparing for evaluation

Loading the dependencies for dataset preparation.

In [1]:
from dotenv import load_dotenv
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import nest_asyncio
import os
import pandas as pd
import pickle
from ragas.testset import TestsetGenerator


load_dotenv()
nest_asyncio.apply()



Loading the documents for dataset creation.

In [2]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ.get("OPENAI_API_KEY"))

# LOAD documents
document_path = "./docs"
parser = LlamaParse(result_type="text", api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
file_extractor = {".docx": parser}
document_reader = SimpleDirectoryReader(input_dir=document_path, file_extractor=file_extractor)
documents = document_reader.load_data()
print(f"Loaded {len(documents)} documents")

Started parsing the file under job_id f5022391-61d8-41ee-b8a8-8f840c7f1c9a
Started parsing the file under job_id 61f12e8a-2c4d-427d-a019-4d2319a0ad1b
Started parsing the file under job_id 96ec6271-7e39-4f6a-9f96-852150f31920
Started parsing the file under job_id d906d150-1278-48d7-9392-fc439bfe3bb6
Started parsing the file under job_id d2b61be3-fc35-4095-a635-089c3ad4aa7e
Started parsing the file under job_id 10a121c9-d444-428b-86df-084c7b22f394
Started parsing the file under job_id 5d708684-3e09-40bb-8690-07583358238e
Started parsing the file under job_id a8abac99-f6af-44e7-a515-892503fad4e8
Started parsing the file under job_id 572addf2-f477-4d59-b3a5-eddf80709b01
Started parsing the file under job_id 7de03fd8-4277-41ee-bad9-7fa244e18f8d
Started parsing the file under job_id a1913699-354b-4aa9-b522-20152a1b6c79
Started parsing the file under job_id 5f53c70a-c49a-42a2-b76b-3e3c79f4606c
Started parsing the file under job_id a71cdf11-49d4-45ab-bf44-a6e94b8d7749
Started parsing the file 

Pre-processing the document text (removing newlines, tabs and repeated spaces) so that it is informative and useful.

In [3]:
documentList = []
for doc in documents:
    cleaned_text = str(doc.text).replace("\n", " ").replace("\\n", " ").replace("\t", " ").replace("\\t", " ")
    formatted_text = " ".join(part for part in cleaned_text.split(" ") if part != "")
    doc.metadata["filename"] = doc.metadata["file_name"]
    new_doc = Document(
        page_content=formatted_text,
        id=doc.id_,
        metadata=doc.metadata,
    )
    documentList.append(new_doc)
print(f"Cleaned {len(documentList)} documents")

Cleaned 47 documents
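
The cleaning step above can also be written as a small reusable helper. This is a minimal sketch, not part of the original notebook: `normalize_whitespace` is a hypothetical name, and it uses a regular expression, so it also collapses other whitespace (such as `\r`) that the chained `replace` calls leave untouched.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # Literal backslash-n / backslash-t sequences (escaped in the raw text)
    # are treated like real newlines and tabs, as in the cell above.
    text = text.replace("\\n", " ").replace("\\t", " ")
    # Collapse every run of real whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "Great   listening\n\nis crucial\tfor improving\\n relationships. "
print(normalize_whitespace(sample))
```

Inside the loop, `formatted_text = normalize_whitespace(str(doc.text))` would take the place of the two cleaning lines.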


Saving the document objects for future use.

In [4]:
with open("documents.pkl", "wb") as f:
    pickle.dump(documentList, f)

Creating the dataset for evaluation using the preprocessed document objects.

In [5]:
generator_llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.getenv("OPENAI_API_KEY"))

generator = TestsetGenerator.from_langchain( 
    llm=generator_llm, 
    embedding_model=embeddings
)

testset = generator.generate_with_langchain_docs(
    documentList,
    testset_size=10,
)
print("Testset of length", len(testset), "generated")

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]
Generating Samples: 100%|██████████| 20/20 [00:06<00:00,  2.91it/s]


Testset of length 20 generated


In [6]:
# Sanity check
csv_set = testset.to_pandas()
csv_set

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What happens if no content to analyze?,"[, , , , , , , , , , , , , , , , , , , , , , ,...","If there is no content to analyze, there would...",AbstractQuerySynthesizer
1,What implications can be drawn from the minima...,"[h, h]",The minimal context provided does not offer su...,AbstractQuerySynthesizer
2,What implicashuns can be drawn from the absens...,"[t, t, t, t]",The absence of meaningful content can imply a ...,AbstractQuerySynthesizer
3,What implications can be drawn from the minima...,"[u, u]",The minimal context provided does not offer su...,AbstractQuerySynthesizer
4,What implications can be drawn from the contex...,"[s, s]",The context provided lacks substantive content...,AbstractQuerySynthesizer
5,"How do the various persuasion techniques, such...",[The video discusses 15 psychological tricks f...,"The various persuasion techniques, such as the...",ComparativeAbstractQuerySynthesizer
6,How do psych principles like bandwagon and cha...,[The video discusses 15 psychological tricks f...,Psych principles like the bandwagon effect and...,ComparativeAbstractQuerySynthesizer
7,How do various persuasive techniques compare i...,[The video discusses 15 psychological tricks f...,The video compares various persuasive techniqu...,ComparativeAbstractQuerySynthesizer
8,How do the various approaches to listening ski...,[Great listening is crucial for improving rela...,"Various approaches to listening skills, such a...",ComparativeAbstractQuerySynthesizer
9,"How do various communication techniques, such ...",[Great listening is crucial for improving rela...,"Various communication techniques, such as grea...",ComparativeAbstractQuerySynthesizer


Based on the output, we drop the first 5 queries: their reference contexts are empty or single characters, so the questions are not derived from the document database.

In [8]:
csv_set.iloc[5:, :].to_csv("ragas_testset.csv", index_label=False)
print("Testset generated and saved to ragas_testset.csv")

Testset generated and saved to ragas_testset.csv
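
Slicing by position works for this run, but it depends on the degenerate rows always coming first. A content-based filter is one possible alternative. The sketch below is an assumption rather than the notebook's method: `drop_degenerate_rows` is a hypothetical helper, the `min_chars=20` threshold is arbitrary, and it expects `reference_contexts` to still hold Python lists (that is, before the CSV round trip).

```python
import pandas as pd


def drop_degenerate_rows(df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
    """Keep rows that have at least one reference context of real length."""

    def has_real_context(contexts) -> bool:
        # A context counts as real if it has min_chars or more characters
        # after stripping surrounding whitespace.
        return any(len(str(c).strip()) >= min_chars for c in contexts)

    mask = df["reference_contexts"].apply(has_real_context)
    return df[mask]


# Toy rows shaped like the sanity-check output above.
toy = pd.DataFrame(
    {
        "user_input": ["What happens if no content to analyze?", "How do persuasion techniques compare?"],
        "reference_contexts": [["", "", ""], ["The video discusses 15 psychological tricks"]],
    }
)
print(drop_degenerate_rows(toy)["user_input"].tolist())
```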


# Creating chatbot response and evaluating performance

## Testing with base chatbot.

In [2]:
from datasets import Dataset
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
import os
import pandas as pd
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

from chatgpt import ChatGPT
csv_set = pd.read_csv("ragas_testset.csv")
chat_model = ChatGPT(max_tokens=2000)

In [3]:
csv_set

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
5,"How do the various persuasion techniques, such...","[""The video discusses 15 psychological tricks ...","The various persuasion techniques, such as the...",ComparativeAbstractQuerySynthesizer
6,How do psych principles like bandwagon and cha...,"[""The video discusses 15 psychological tricks ...",Psych principles like the bandwagon effect and...,ComparativeAbstractQuerySynthesizer
7,How do various persuasive techniques compare i...,"[""The video discusses 15 psychological tricks ...",The video compares various persuasive techniqu...,ComparativeAbstractQuerySynthesizer
8,How do the various approaches to listening ski...,['Great listening is crucial for improving rel...,"Various approaches to listening skills, such a...",ComparativeAbstractQuerySynthesizer
9,"How do various communication techniques, such ...",['Great listening is crucial for improving rel...,"Various communication techniques, such as grea...",ComparativeAbstractQuerySynthesizer
10,How do the various methods of data visualizati...,['Data visualization is essential for understa...,"Various methods of data visualization, such as...",ComparativeAbstractQuerySynthesizer
11,How do data visualization techniques like scat...,['Data visualization is essential for understa...,Data visualization techniques like scatter plo...,ComparativeAbstractQuerySynthesizer
12,How data visualizashun techniques and storytel...,['Data visualization is essential for understa...,"Data visualization techniques, such as various...",ComparativeAbstractQuerySynthesizer
13,How do various data storytelling techniques vi...,['Data visualization is essential for understa...,Various data storytelling techniques and visua...,ComparativeAbstractQuerySynthesizer
14,How do the strategies for audience engagement ...,"[""To effectively analyze an audience for publi...",Strategies for audience engagement and message...,ComparativeAbstractQuerySynthesizer


Generating a ChatGPT response for each query in the test dataset.

In [4]:
chatgpt_results, chatgpt_contexts = [], []
for prompt in csv_set["user_input"]:
    # The base chatbot answers from the model's own knowledge; there is no retrieval step.
    response = chat_model.generate("user", prompt)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    chatgpt_results.append(response)
    # No context is retrieved for the base chatbot, so store a placeholder.
    chatgpt_contexts.append(["Not available"])

Prompt: How do the various persuasion techniques, such as the Reciprocity Norm, Chameleon Effect, and foot in the door technique, compare in their effectiveness and application as discussed in different reports?
Response: The effectiveness and application of persuasion techniques like the Reciprocity Norm, Chameleon Effect, and Foot-in-the-Door Technique vary significantly based on contexts and specific circumstances. Here’s a comparative overview of these techniques:

### 1. **Reciprocity Norm**
- **Definition**: This principle states that individuals feel compelled to return favors or kindnesses. When someone does something for you, you are likely to want to reciprocate.
- **Effectiveness**: The Reciprocity Norm is highly effective in various settings, particularly in marketing and business. Research has shown that when a person receives something for free (e.g., samples), they are more likely to make a purchase.
- **Application**: This technique is commonly seen in fundraising (e.g.

KeyboardInterrupt: 

In [26]:
# context provided should be a list of lists; run in case you forgot to update this
# chatgpt_contexts = [[context] for context in chatgpt_contexts]
# chatgpt_contexts

[['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available'],
 ['Not available']]
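
The commented-out fix above wraps every entry, so running it twice would nest the lists one level too deep. A minimal sketch of an idempotent version follows; `ensure_list_of_lists` is a hypothetical helper, not part of the original notebook.

```python
def ensure_list_of_lists(contexts: list) -> list:
    """Wrap bare context entries in a list; leave entries that are already lists."""
    return [c if isinstance(c, list) else [c] for c in contexts]


mixed = ["Not available", ["Not available"]]
print(ensure_list_of_lists(mixed))
```

Because entries that are already lists pass through unchanged, the helper is safe to re-run on its own output.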

Evaluating Model Performance

In [29]:
# creating RAGAS dataset
critic_llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
d = {
    "user_input": csv_set["user_input"].to_list(),
    "response": chatgpt_results,
    "retrieved_contexts": chatgpt_contexts,
    "reference": csv_set["reference"].to_list()
}
dataset = Dataset.from_dict(d)
dataset

Dataset({
    features: ['user_input', 'response', 'retrieved_contexts', 'reference'],
    num_rows: 15
})

In [30]:
# performing evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        answer_relevancy, 
        faithfulness, 
        context_recall, 
        context_precision],
    llm=critic_llm,
)

Evaluating: 100%|██████████| 60/60 [01:16<00:00,  1.27s/it]


In [34]:
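# evaluate() returns an EvaluationResult; convert it to a DataFrame for inspection
result = result.to_pandas()
result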

Unnamed: 0,user_input,retrieved_contexts,response,reference,answer_relevancy,faithfulness,context_recall,context_precision
0,"How do the various persuasion techniques, such...",[Not available],Persuasion techniques have been extensively st...,"The various persuasion techniques, such as the...",0.880493,0.0,0.0,0.0
1,How do psych principles like bandwagon and cha...,[Not available],The concepts of the bandwagon effect and the c...,Psych principles like the bandwagon effect and...,0.903002,0.0,0.0,0.0
2,How do various persuasive techniques compare i...,[Not available],When comparing various persuasive techniques i...,The video compares various persuasive techniqu...,0.978862,0.0,0.0,0.0
3,How do the various approaches to listening ski...,[Not available],The various approaches to listening skills—gre...,"Various approaches to listening skills, such a...",0.934014,0.0,0.0,0.0
4,"How do various communication techniques, such ...",[Not available],In crucial conversations—those high-stakes dis...,"Various communication techniques, such as grea...",0.926174,0.0,0.0,0.0
5,How do the various methods of data visualizati...,[Not available],"When it comes to data visualization, each meth...","Various methods of data visualization, such as...",0.929246,0.0,0.0,0.0
6,How do data visualization techniques like scat...,[Not available],Data visualization techniques serve different ...,Data visualization techniques like scatter plo...,0.0,0.0,0.0,0.0
7,How data visualizashun techniques and storytel...,[Not available],Data visualization techniques and storytelling...,"Data visualization techniques, such as various...",0.950204,0.0,0.0,0.0
8,How do various data storytelling techniques vi...,[Not available],Data storytelling is a powerful method for con...,Various data storytelling techniques and visua...,0.933192,0.0,0.0,0.0
9,How do the strategies for audience engagement ...,[Not available],Engaging audiences effectively and tailoring m...,Strategies for audience engagement and message...,0.900281,0.0,0.0,0.0


In [33]:
mean_results = result[["faithfulness", "answer_relevancy", "context_recall", "context_precision"]].mean(axis=0)
print(f"Results: {mean_results}")

Results: faithfulness         0.000000
answer_relevancy     0.879276
context_recall       0.000000
context_precision    0.000000
dtype: float64


Faithfulness, context_recall & context_precision are zero here because we rely on GPT-4o-mini's inherent knowledge to answer the questions; no RAG framework augments the model's response, so every retrieved context is the placeholder "Not available".
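
To see why a placeholder context forces these scores to zero, recall that faithfulness is the share of claims in the response that the retrieved context supports. The sketch below is a toy illustration of that ratio only: RAGAS extracts and verifies claims with an LLM, whereas here the per-claim verdicts are supplied by hand, and `faithfulness_ratio` is a hypothetical name.

```python
def faithfulness_ratio(claim_supported: list[bool]) -> float:
    """Fraction of response claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# With "Not available" as the only context, no claim can be supported.
print(faithfulness_ratio([False, False, False]))
# With real retrieved context, some claims are typically supported.
print(faithfulness_ratio([True, True, False, True]))
```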

In [35]:
# save results
result.to_csv("ragas_results_openai.csv", index_label=False)
print("Results saved to ragas_results_openai.csv")

Results saved to ragas_results_openai.csv


## Testing with RAG chatbot.

We repeat the steps above, now with the RAG agent.

In [7]:
# from langchain_setup import full_chain_w_history
from chroma_setup import agent_executor

rag_results, rag_contexts = [], []
for prompt in csv_set["user_input"]:
    response = agent_executor.invoke({"input": prompt})
    # Retry up to 5 times while the last tool call returned no context.
    tries = 0
    while not response["intermediate_steps"][-1][1] and tries < 5:
        response = agent_executor.invoke({"input": prompt})
        tries += 1

    print(f"Prompt: {prompt}")
    print(f"Response: {response['output']}")
    print(f"Context: {response['intermediate_steps']}")
    rag_results.append(response["output"])
    # Each intermediate step is an (action, observation) pair;
    # the last observation is used as the retrieved context.
    last_observation = response["intermediate_steps"][-1][1]
    if not last_observation:
        rag_contexts.append(["Not available"])
    else:
        rag_contexts.append([last_observation])



> Entering new AgentExecutor chain...
To answer this question, I should look for any course content that discusses various persuasion techniques, their effectiveness, and their application as mentioned in the IQMA documents.

Action: iqma course content query engine

Action Input: "persuasion techniques Reciprocity Norm Chameleon Effect foot in the door technique effectiveness comparison"
The effectiveness comparison of the Reciprocity Norm, Chameleon Effect, and foot-in-the-door technique in persuasion techniques varies based on the specific context and individuals involved. Each technique operates differently and may yield varying results depending on the situation and the people targeted.
I need more specific insights or detailed comparisons regarding the effectiveness and application of these techniques from the IQMA course content.

Action: iqma course content query engine

Action Input: "comparison of effectiveness and appli
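
The retry-and-fallback logic in the loop above can be factored into a helper and exercised without calling the real agent. This is a minimal sketch under stated assumptions: `invoke_with_retry` and `last_observation` are hypothetical names, `invoke` is any callable with the same input and output shape as `agent_executor.invoke`, and the helper additionally tolerates an empty `intermediate_steps` list, which the loop above does not.

```python
from typing import Callable, Optional


def last_observation(response: dict) -> Optional[str]:
    """Return the observation of the final (action, observation) step, if any."""
    steps = response.get("intermediate_steps") or []
    return steps[-1][1] if steps else None


def invoke_with_retry(
    invoke: Callable[[dict], dict], prompt: str, max_retries: int = 5
) -> tuple[str, list[str]]:
    """Call the agent, retrying while the last tool call returned no context."""
    response = invoke({"input": prompt})
    retries = 0
    while not last_observation(response) and retries < max_retries:
        response = invoke({"input": prompt})
        retries += 1
    observation = last_observation(response)
    contexts = [observation] if observation else ["Not available"]
    return response["output"], contexts


# Stub agent: returns an empty observation on the first two calls.
calls = {"n": 0}


def fake_invoke(payload: dict) -> dict:
    calls["n"] += 1
    observation = "" if calls["n"] < 3 else "retrieved course content"
    return {"output": "answer", "intermediate_steps": [("tool call", observation)]}


print(invoke_with_retry(fake_invoke, "How do persuasion techniques compare?"))
```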

In [8]:
# sanity checks
rag_contexts

[['The effectiveness and application of persuasion techniques like the Reciprocity Norm, Chameleon Effect, and Foot-in-the-Door technique can vary based on the specific situation and individuals involved. Each technique operates differently to influence behavior and attitudes. The Reciprocity Norm leverages the principle of giving back when something is received, the Chameleon Effect focuses on mirroring behaviors to establish rapport, and the Foot-in-the-Door technique involves starting with a small request before making a larger one. Understanding when and how to apply these techniques can impact their effectiveness in influencing others.'],
 ['The bandwagon effect and the chameleon effect are both phenomena studied in psychological research. The bandwagon effect refers to the tendency for individuals to align their beliefs and behaviors with those of a larger group, often leading to conformity. On the other hand, the chameleon effect involves individuals unconsciously mimicking the 

In [10]:
# creating RAGAS dataset
critic_llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
rag_d = {
    "user_input": csv_set["user_input"].to_list(),
    "response": rag_results,
    "retrieved_contexts": rag_contexts,
    "reference": csv_set["reference"].to_list()
}
rag_dataset = Dataset.from_dict(rag_d)
rag_dataset

Dataset({
    features: ['user_input', 'response', 'retrieved_contexts', 'reference'],
    num_rows: 15
})

In [18]:
rag_dataset[0]

{'user_input': 'How do the various persuasion techniques, such as the Reciprocity Norm, Chameleon Effect, and foot in the door technique, compare in their effectiveness and application as discussed in different reports?',
 'response': 'The effectiveness and application of persuasion techniques such as the Reciprocity Norm, Chameleon Effect, and Foot-in-the-Door technique depend greatly on the context and the individuals involved. The Reciprocity Norm relies on the principle of returning a favor, which can be highly effective in establishing goodwill. The Chameleon Effect hinges on the psychological principle of mirroring, which helps in building rapport and trust. The Foot-in-the-Door technique requires starting with a small request to gain compliance, prior to making a larger request, which can enhance commitment. Each technique is distinct, with varying levels of effectiveness based on situational factors and target audiences.',
 'retrieved_contexts': ['The effectiveness and applicat

In [17]:
# performing evaluation
result = evaluate(
    dataset=rag_dataset,
    metrics=[
        answer_relevancy, 
        faithfulness, 
        context_recall, 
        context_precision],
    llm=critic_llm,
)
result

Evaluating: 100%|██████████| 60/60 [00:38<00:00,  1.58it/s]


{'answer_relevancy': 0.8168, 'faithfulness': 0.8282, 'context_recall': 0.6289, 'context_precision': 0.8000}
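
For a quick side-by-side view, the aggregate scores reported in this notebook for the base chatbot and the RAG agent can be placed in one table. This is a minimal sketch; the variable names are illustrative and the numbers are copied from the two evaluation outputs in this notebook.

```python
import pandas as pd

# Mean scores reported for the base chatbot (no retrieval).
base_scores = {
    "answer_relevancy": 0.879276,
    "faithfulness": 0.0,
    "context_recall": 0.0,
    "context_precision": 0.0,
}
# Aggregate scores reported for the RAG agent.
rag_scores = {
    "answer_relevancy": 0.8168,
    "faithfulness": 0.8282,
    "context_recall": 0.6289,
    "context_precision": 0.8000,
}

comparison = pd.DataFrame({"base": base_scores, "rag": rag_scores})
# Positive delta means the RAG agent scored higher on that metric.
comparison["delta"] = comparison["rag"] - comparison["base"]
print(comparison.round(4))
```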

In [19]:
# evaluate() is returning odd results, but we compile the question-response pairs anyway
import pandas as pd
rag_overall_results = pd.DataFrame({ 
    "user_input": csv_set["user_input"].to_list(), 
    "retrieved_context": rag_contexts,
    "response": rag_results,
    "reference": csv_set["reference"].to_list()
    })
rag_overall_results

Unnamed: 0,user_input,retrieved_context,response,reference
0,"How do the various persuasion techniques, such...",[The effectiveness and application of persuasi...,The effectiveness and application of persuasio...,"The various persuasion techniques, such as the..."
1,How do psych principles like bandwagon and cha...,[The bandwagon effect and the chameleon effect...,The bandwagon effect and chameleon effect are ...,Psych principles like the bandwagon effect and...
2,How do various persuasive techniques compare i...,[The cost effectiveness of quantitative analys...,The cost and effectiveness of persuasive techn...,The video compares various persuasive techniqu...
3,How do the various approaches to listening ski...,[Great listening involves truly focusing on th...,"Great listening, active attending, and effecti...","Various approaches to listening skills, such a..."
4,"How do various communication techniques, such ...",[Active listening skills and the use of open-e...,"Active listening, active attending, and the us...","Various communication techniques, such as grea..."
5,How do the various methods of data visualizati...,[Bar charts are effective in data storytelling...,Bar charts are effective for comparing differe...,"Various methods of data visualization, such as..."
6,How do data visualization techniques like scat...,[Scatter plots and bar charts are commonly use...,Scatter plots are effective for showing the re...,Data visualization techniques like scatter plo...
7,How data visualizashun techniques and storytel...,[Data visualization and storytelling are both ...,Data visualization techniques and storytelling...,"Data visualization techniques, such as various..."
8,How do various data storytelling techniques vi...,[Bar charts are often more effective than scat...,Bar charts are generally more effective than s...,Various data storytelling techniques and visua...
9,How do the strategies for audience engagement ...,[Audience engagement strategies may vary based...,Strategies for audience engagement and message...,Strategies for audience engagement and message...


In [20]:
# save results
rag_overall_results.to_csv("ragas_results_rag.csv", index_label=False)
print("Results saved to ragas_results_rag.csv")

Results saved to ragas_results_rag.csv
