# Evaluating a RAG solution with Giskard

In the final part of this notebook, you will find the necessary code to evaluate the suitability of the responses provided by the Agent using Giskard.

# Installing libraries & Loading Dataset

We will download the dataset from the Hugging Face Hub using the `datasets` library. It contains questions and answers about diseases.

In [None]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [None]:
from datasets import load_dataset

data = load_dataset("keivalya/MedQuad-MedicalQnADataset", split='train')


In [None]:
data = data.to_pandas()
data.head(10)

In [None]:
data = data[0:100]

As you can see, the medical information in the dataset is well-organized, and to someone like me, who is not an expert in the field, it appears to be quite valuable. This information could be a useful addition to any general medicine book to support primary care doctors.

Import the LangChain classes needed to load the documents and store them in a vector database.

In [None]:
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma

The document text is in the `Answer` column; the other columns are stored as metadata.
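
Conceptually, the loader turns each row into a document whose text comes from one column and whose metadata holds the remaining columns. The sketch below mimics that idea with plain dictionaries. The `row_to_document` helper and the sample row (including the `qtype` and `Question` column names) are illustrative assumptions, not part of LangChain.

```python
def row_to_document(row: dict, page_content_column: str) -> dict:
    """Split one row into text content and metadata (illustrative helper)."""
    metadata = {k: v for k, v in row.items() if k != page_content_column}
    return {"page_content": row[page_content_column], "metadata": metadata}


# Made-up sample row; the column names other than "Answer" are assumptions.
sample_row = {
    "qtype": "symptoms",
    "Question": "What are the symptoms of X?",
    "Answer": "Typical symptoms of X include fever and headache.",
}

doc = row_to_document(sample_row, page_content_column="Answer")
print(doc["page_content"])
print(sorted(doc["metadata"]))
```

Keeping the other columns as metadata means they travel with each document and can be inspected later, without being part of the text that gets embedded.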

In [None]:
df_loader = DataFrameLoader(data, page_content_column="Answer")


In [None]:
df_document = df_loader.load()
display(df_document[:2])

We can chunk the documents. The chunk size is a design decision: the larger the chunks, the larger the prompt and the slower the model's response.

We also need to consider the maximum prompt size and ensure that the document does not exceed it.
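
A rough way to reason about this trade-off is to estimate how many tokens the retrieved chunks add to the prompt. The sketch below assumes about four characters per token (a common rule of thumb for English text, not an exact figure), four retrieved chunks per query, and a hypothetical 4,096-token context window. All three numbers are assumptions to adjust for your own model and retriever.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text (assumption)


def estimate_context_tokens(chunk_size: int, chunks_retrieved: int) -> int:
    """Approximate the tokens that retrieved chunks add to the prompt."""
    return (chunk_size * chunks_retrieved) // CHARS_PER_TOKEN


def fits_in_window(chunk_size, chunks_retrieved, context_window, reserved_tokens):
    """Check whether the retrieved context leaves room for the reserved tokens."""
    used = estimate_context_tokens(chunk_size, chunks_retrieved)
    return used + reserved_tokens <= context_window


tokens = estimate_context_tokens(chunk_size=1250, chunks_retrieved=4)
print(tokens)  # 1250
print(fits_in_window(1250, 4, context_window=4096, reserved_tokens=1000))  # True
```

Here `reserved_tokens` stands for the space kept free for the instructions, the conversation history, and the model's answer.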

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1250, chunk_overlap=100)
texts = text_splitter.split_documents(df_document)


The warnings appear because the splitter cannot always produce chunks of the requested size. It only splits at its separator (by default, a blank line between paragraphs), so a paragraph longer than `chunk_size` is kept as a single oversized chunk.
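
This behaviour can be illustrated with a simplified, pure-Python splitter. It is not LangChain's implementation: it only splits on blank lines and greedily merges pieces up to `chunk_size` (overlap is ignored), which is enough to show why a single long paragraph ends up as an oversized chunk.

```python
def split_on_paragraphs(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A piece that is longer than chunk_size on its own cannot be divided
    further, so it is emitted as an oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\nshort two\n\n" + "x" * 50
for chunk in split_on_paragraphs(text, chunk_size=25):
    flag = "OVERSIZED" if len(chunk) > 25 else "ok"
    print(len(chunk), flag)
```

The two short paragraphs are merged into one chunk, while the 50-character paragraph exceeds the limit and is kept whole.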

In [None]:
first_doc = texts[1]
print(first_doc.page_content)

### Initialize the Embedding Model and Vector DB

We load the text-embedding-ada-002 model from OpenAI.

In [None]:
from langchain_openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'
#model_name = 'text-embedding-3-small'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

The execution of this cell may take 3 to 5 minutes. If you want it to be faster, you can reduce the number of records in the dataset.

In [None]:
directory_cdb = 'chromadb/'
chroma_db = Chroma.from_documents(
    df_document, embed, persist_directory=directory_cdb
)

We are going to create three objects:

* The language model, which can be any of those offered by OpenAI; gpt-3.5 is a common choice.
* The memory, responsible for keeping the recent conversation history that is added to the prompt.
* The retrieval chain, used to obtain information stored in ChromaDB.
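
The idea behind a windowed memory can be sketched in plain Python. This is not LangChain's implementation, only an illustration of the behaviour: a hypothetical `WindowMemory` class keeps the last `k` exchanges and discards older ones.

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (question, answer) exchanges."""

    def __init__(self, k: int):
        self.exchanges = deque(maxlen=k)  # older exchanges are dropped automatically

    def save(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def as_prompt_history(self) -> str:
        """Render the stored exchanges as text to prepend to the next prompt."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.exchanges)


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")  # pushes the first exchange out of the window
print(memory.as_prompt_history())
```

A bounded window keeps the prompt size predictable, at the cost of forgetting anything older than the last `k` exchanges.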

In [None]:
from langchain_openai import OpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

llm=OpenAI(openai_api_key=OPENAI_API_KEY,
           temperature=0.0)

conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=4,  # Number of most recent exchanges kept in memory.
    return_messages=True  # Return the history as message objects rather than a single string.
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=chroma_db.as_retriever()
)

We can try the retrieval chain on its own to check that the information it returns is relevant.




In [None]:
qa.invoke("What is the main symptom of LCM?")

Perfect! The information returned is exactly what we desired.
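
As a conceptual illustration of what happens during retrieval, the query is embedded and the stored documents are ranked by how similar their vectors are to it. The sketch below uses cosine similarity over made-up three-dimensional vectors. The metric actually used depends on how the vector store is configured, and the document ids and numbers here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; real ones have many more dimensions.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, doc_vecs, k=2))  # ['doc_a', 'doc_c']
```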

## Creating the Agent

In [None]:
from langchain.agents import Tool, AgentExecutor

# Define the list of tools available to the agent.
tools = [
    Tool(
        name='Medical KB',
        func=qa.run,
        description=(
            'use this tool when answering medical knowledge queries to get '
            'more information about the topic'
        )
    )
]

In [None]:
from langchain.agents import create_react_agent
from langchain import hub

# Pull a ReAct-style chat prompt from the LangChain Hub.
prompt = hub.pull("hwchase17/react-chat")
agent = create_react_agent(
    tools=tools,
    llm=llm,
    prompt=prompt,
)

In [None]:
# Create an agent executor by passing in the agent and tools
agent_executor = AgentExecutor(agent=agent,
                               tools=tools,
                               verbose=True,
                               memory=conversational_memory,
                               max_iterations=30,
                               max_execution_time=600,
                               #early_stopping_method='generate',
                               handle_parsing_errors=True
                               )

### Using the Conversational Agent

To make queries, we simply call `agent_executor.invoke`.

First, I will try a request that is not related to the medical field.

In [None]:
agent_executor.invoke({"input": "Give me the area of square of 2x2"})

Perfect, the model has responded without accessing the configured knowledge database.

Now I will try another question that is also unrelated to health.

In [None]:
agent_executor.invoke({"input": "Do you know who is Clark Kent?"})

It has not accessed the knowledge base this time either, as the model has been able to identify that the question is not related to the database made available through LangChain.

Now it's time to try with a question related to Medicine. Let's see if the model can understand that it should first look for information in the vector database at its disposal.

In [None]:
agent_executor.memory.clear()

In [None]:
agent_executor.invoke({"input": """I have a patient who may have botulism,
how can I confirm the diagnosis?"""})

Perfect. The most important thing for us is that the agent identified that it should query the medical database for information about the disease.

In [None]:
agent_executor.invoke({"input": "Is this an important illness?"})

And the memory works perfectly. We can maintain a conversation, taking into account that the model knows the previous questions and answers.
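
The behaviour seen in this section, where the agent consults the tool only when it needs to, can be pictured as a loop in which the model either requests a tool call or gives a final answer. The sketch below is a heavily simplified stand-in, not LangChain's `AgentExecutor`: `scripted_llm` is a hypothetical function that fakes the model's decision with a keyword check, and the tool returns placeholder text.

```python
def scripted_llm(question, observations):
    """Stand-in for the model: decides the next step from what it has seen so far."""
    if "botulism" in question.lower() and not observations:
        return ("action", "Medical KB", question)
    if observations:
        return ("final", f"Based on the knowledge base: {observations[-1]}")
    return ("final", "Answered from the model's own knowledge.")


def run_agent(question, llm, tools, max_iterations=3):
    """Loop until the model gives a final answer or the iteration limit is hit."""
    observations, tools_called = [], []
    for _ in range(max_iterations):
        step = llm(question, observations)
        if step[0] == "final":
            return step[1], tools_called
        _, tool_name, tool_input = step
        observations.append(tools[tool_name](tool_input))  # fed back to the model
        tools_called.append(tool_name)
    return "Agent stopped: iteration limit reached.", tools_called


# Fake tool that returns placeholder text instead of a real retrieved passage.
tools = {"Medical KB": lambda query: f"[retrieved passage for: {query}]"}

print(run_agent("Give me the area of a 2x2 square", scripted_llm, tools))
print(run_agent("How is botulism diagnosed?", scripted_llm, tools))
```

In the real agent the decision is made by the LLM from the tool descriptions in the prompt, which is why a clear `description` on each tool matters.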

## Evaluating the solution with Giskard

Install and load the libraries.

In [None]:
!pip install --default-timeout=100 "giskard[llm]"
from giskard.rag import KnowledgeBase, generate_testset, evaluate

It is necessary to create a DataFrame with just the column containing the information used to build the RAG system.

In [None]:
import pandas as pd
df_giskard = pd.DataFrame([d.page_content for d in df_document], columns=["text"])
df_giskard.head(10)

Using the information from the dataset, we ask Giskard to create a knowledge base, from which it then generates a test set: a collection of questions along with their reference answers. Both the questions and the answers are generated by an LLM (an OpenAI model by default), which is why our OpenAI key must be available.

In [None]:
kb_giskard = KnowledgeBase(df_giskard)

In [None]:
# The more questions you generate, the more you will be charged.
test_questions = generate_testset(
    kb_giskard,
    num_questions=30,
    agent_description="Medical assistant for diagnosis and treatment support.",
)

In [None]:
df_test_questions = test_questions.to_pandas()

In [None]:
df_test_questions.head()

A function is created that will be called from Giskard's evaluate function. This function receives the question and returns the agent's response.

In [None]:
# df_test_questions['reference_context'][0]

In [None]:
def use_agent(question, history=None):
    # Giskard expects the answer text, so return only the "output" field.
    return agent_executor.invoke({"input": question})["output"]
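
An agent executor's `invoke` returns a dictionary, while an evaluation function that compares answers needs only the answer text. The sketch below shows one way to unwrap it. It uses a hypothetical `FakeExecutor` so that it runs without an API key; the `output` key mirrors the one LangChain's `AgentExecutor` uses for the final answer.

```python
class FakeExecutor:
    """Hypothetical stand-in that mimics the dict returned by an agent executor."""

    def invoke(self, inputs: dict) -> dict:
        return {"input": inputs["input"], "output": f"Answer to: {inputs['input']}"}


def make_answer_fn(executor):
    """Build a question -> answer-text function around an executor-like object."""

    def answer_fn(question, history=None):
        result = executor.invoke({"input": question})
        return result["output"]  # keep only the answer text

    return answer_fn


answer_fn = make_answer_fn(FakeExecutor())
print(answer_fn("What causes botulism?"))
```

Because the real agent keeps conversational memory, it may also be worth clearing that memory between test questions so that earlier answers do not leak into later ones.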

In [None]:
report = evaluate(use_agent, testset=test_questions, knowledge_base=kb_giskard)

In [None]:
# Summary with the results.
report.correctness_by_question_type()
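
Conceptually, this summary is the fraction of answers judged correct within each type of question. The sketch below reproduces that aggregation over made-up outcomes. The question-type names are illustrative and do not necessarily match the ones Giskard reports.

```python
from collections import defaultdict


def correctness_by_type(results):
    """Fraction of answers judged correct for each question type."""
    totals, correct = defaultdict(int), defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Made-up evaluation outcomes: (question type, judged correct?)
toy_results = [
    ("simple", True),
    ("simple", True),
    ("simple", False),
    ("complex", True),
    ("complex", False),
]
print(correctness_by_type(toy_results))
```

With only 30 generated questions, each type has just a handful of samples, so the per-type scores should be read as rough indications rather than precise measurements.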

In [None]:
# Obtaining the incorrect answers
failures = report.get_failures()[:2]
failures

In [None]:
# Giskard explains the reasons why it considers the answers to be incorrect.
failures['correctness_reason'].iloc[1]


# Conclusions
The experiment has been a small success. The vector database has been configured and filled with information from the dataset. A LangChain agent has been created, and it has been able to retrieve information from the database only when necessary. Don't forget that our chatbot also has memory.

All of this in just a few lines of code!


---