# How to evaluate PoopClimateQA?

This notebook explores ways to build tests that automatically evaluate the performance of the RAG agent.

In [4]:
import json
import os
from os.path import exists
import sqlite3
# LLM
from langchain_ollama import ChatOllama
from langchain.schema import Document, AIMessage
# to chunk the text
from langchain.text_splitter import RecursiveCharacterTextSplitter
# to make/store embeddings 
from langchain_community.vectorstores import SKLearnVectorStore
#from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage

In [8]:
import giskard
import pandas as pd
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from openai import OpenAI
from giskard.llm.client.openai import OpenAIClient
from giskard.rag import KnowledgeBase, generate_testset, QATestset


### set LLM 

In [3]:
local_llm = "llama3.1:70b"

## 1. Create a test set automatically from the documents at our disposal with RAGET Testset Generation

See: [link](https://docs.giskard.ai/en/stable/open_source/testset_generation/testset_generation/index.html)

This is what we're doing below:

1. We load the SQL database.

2. We split the content into chunks of a reasonable size (the splitter below counts tokens, not characters). The right size depends on how many tokens your machine can feed to the LLM, both for computing embeddings and later for test set generation, so my suggestion is to start small and allow an overlap of 20% of the chunk size. Overlap gives the LLM better context: if a chunk contains, let's say, an acronym, there is a higher chance that it also contains the full spelling from a few lines before or after.

3. We remove leftover markup and other noise from the text (maybe this could have been done earlier, but here we are).

4. We transform the cleaned documents into a KnowledgeBase, which stores our embeddings and topics, see: [link](https://docs.giskard.ai/en/stable/reference/rag-toolset/knowledge_base.html#giskard.rag.knowledge_base.KnowledgeBase)

5. We run generate_testset().
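To make the overlap idea in step 2 concrete, here is a minimal sketch of a sliding window with 20% overlap. It is a simplification for illustration: it works on characters and fixed windows, whereas the splitter used in this notebook counts tokens and splits recursively on separators. The function name `chunk_with_overlap` and its parameters are made up for this example.

```python
def chunk_with_overlap(text, chunk_size=100, overlap_ratio=0.2):
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: 250 digits, so the overlap is easy to see in the printout
sample = "".join(str(i % 10) for i in range(250))
chunks = chunk_with_overlap(sample, chunk_size=100, overlap_ratio=0.2)
for c in chunks:
    print(len(c), c[:10], "...", c[-10:])
```

With these settings the last 20 characters of each chunk are repeated at the start of the next one, which is what gives a chunk a chance to include the sentence that introduces an acronym.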

In [9]:
database_path = 'literature_relevant.db'

def extract_full_text_content(database_path):
    """Return the full text of every paper stored in the SQLite database."""
    conn = sqlite3.connect(database_path)
    try:
        cursor = conn.cursor()

        # Retrieve all rows/papers from the table
        cursor.execute("SELECT fulltext FROM literature_fulltext;")
        rows = cursor.fetchall()
    finally:
        # Close the connection even if the query fails
        conn.close()

    # Each row is one paper; keep only rows whose text is a string (this also skips NULLs)
    return [row[0] for row in rows if isinstance(row[0], str)]

db_path = database_path
docs = extract_full_text_content(db_path)
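To see what the `isinstance(..., str)` filter does with missing or non-text rows, the sketch below rebuilds the same table in a throwaway in-memory database. The table and column names match the query above; the `id` column and the sample rows are invented for the example.

```python
import sqlite3

# Build a throwaway in-memory database with the same table/column names as above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literature_fulltext (id INTEGER PRIMARY KEY, fulltext TEXT)")
conn.executemany(
    "INSERT INTO literature_fulltext (fulltext) VALUES (?)",
    [("Paper one text",), (None,), ("Paper three text",), (b"raw bytes",)],
)

rows = conn.execute("SELECT fulltext FROM literature_fulltext;").fetchall()
conn.close()

# Same filter as in extract_full_text_content: keep only str values
texts = [row[0] for row in rows if isinstance(row[0], str)]
skipped = len(rows) - len(texts)
print(f"kept {len(texts)} papers, skipped {skipped} rows")
```

Comparing the number of skipped rows against the table size is a cheap way to notice papers whose full text was never extracted.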

In [11]:
# chunk_size and chunk_overlap are counted in tokens here (tiktoken encoder), not characters
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=200
)

documents = text_splitter.create_documents(docs)

In [12]:
# make sure that this is a list -- we need this format later on
type(documents)

list

In [41]:
# love the result!
documents[0].page_content[:100]

'# Rotavirus Seasonality and Age Effects in a Birth Cohort Study of Southern India\n\n#\n# Rotavirus Sea'

In [36]:
# ok, once the docs are cleaned we need to put them into a pandas DataFrame. That is because the knowledge base expects a pd.DataFrame, and who am I to say no to it?
# for reasons that I don't fully understand this conversion adds back STUFF that I cleaned above; as far as I can tell only \n and --- are added, but I really didn't inspect further
# this is a bit annoying because I went all the way to clean my text and LOOK AT THIS DISASTER. I think this needs to be reconsidered.
# Define the regex pattern for the strings we want to remove
import re
string_to_replace = r'\{[^}]*\}|\.[a-zA-Z0-9_-]+\s*\{[^}]*\}|<[^>]+>|\n{2,}|\s{2,}|' \
                    r'<script.*?>.*?</script>|<!--.*?-->|---|\n---\n|h[1-6]\s*\{[^}]*\}|' \
                    r'p\s*\{[^}]*\}|n\n\n|#\n#|#'

# Clean and replace the content in each document and store it in a DataFrame
knowledge_base_docs = pd.DataFrame(
    [re.sub(string_to_replace, '', doc.page_content).replace('---', ' ') for doc in documents],
    columns=["text"]
)

In [44]:
# Same pattern as above, except that '#' is kept
# NOTE: the alternative 'n\n\n' also swallows a literal 'n' in front of a blank line
# ('Introduction\n\n' becomes 'Introductio'); it was probably meant to be '\n\n\n'.
string_to_replace = r'\{[^}]*\}|\.[a-zA-Z0-9_-]+\s*\{[^}]*\}|<[^>]+>|\n{2,}|\s{2,}|' \
                    r'<script.*?>.*?</script>|<!--.*?-->|---|\n---\n|h[1-6]\s*\{[^}]*\}|' \
                    r'p\s*\{[^}]*\}|n\n\n'

def clean_text(text, pattern):
    # Ensure the text is handled as UTF-8
    if isinstance(text, bytes):  # If it's a byte string
        text = text.decode('utf-8', errors='replace')
    
    cleaned_text = re.sub(pattern, '', text)  # Delete every match (words separated only by blank lines get glued together)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)  # Collapse runs of whitespace into a single space
    return cleaned_text.strip()  # Remove leading/trailing spaces

# Apply this to your documents and ensure UTF-8 handling
knowledge_base_docs = pd.DataFrame(
    [clean_text(doc.page_content, string_to_replace) for doc in documents],
    columns=["text"]
)
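One detail of the pattern is easy to miss: the alternative `n\n\n` matches a literal letter `n` followed by two newlines, so it can eat the last letter of a heading. The sketch below reproduces that on a short string and shows one possible alternative; replacing blank lines with a space is an assumption about the intent, not something the notebook does.

```python
import re

text = "# Introduction\n\nUnderstanding the temporal patterns"

# Alternative as written in the pattern above: a literal 'n' followed by two newlines
buggy = re.sub(r"n\n\n", "", text)

# Possible alternative (an assumption about the intent): replace blank lines with a space
fixed = re.sub(r"\n{2,}", " ", text)

print(repr(buggy))
print(repr(fixed))
```

The first result has lost the final `n` of "Introduction" and glued the heading to the next sentence, while the second keeps both words intact.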

In [45]:
# but look how tidy this is!
knowledge_base_docs['text'][0][:5000]

'# Rotavirus Seasonality and Age Effects in a Birth Cohort Study of Southern India# # Rotavirus Seasonality and Age Effects in a Birth Cohort Study of Southern IndiaRajiv Sarkar1, Gagandeep Kang1, Elena N. Naumova1,2*1Department of Gastrointestinal Sciences, Christian Medical College, Vellore, TN, India2Department of Civil and Environmental Engineering, Tufts University School of Engineering, Boston, Massachusetts, United States of America# Abstract# IntroductioUnderstanding the temporal patterns in disease occurrence is valuable for formulating effective disease preventive programs. Cohort studies present a unique opportunity to explore complex interactions associated with emergence of seasonal patterns of infectious diseases.# MethodsWe used data from 452 children participating in a birth cohort study to assess the seasonal patterns of rotavirus diarrhea by creating a weekly time series of rotavirus incidence and fitting a Poisson harmonic regression with biannual peaks. Age and coho

In [46]:
# this is how many chunks we got:
len(knowledge_base_docs)

1037

## Set the client. Giskard does this through its OpenAI client, which can also talk to a local Ollama server via Ollama's OpenAI-compatible API

Will this work on Snellius? No idea yet.

In [47]:
from openai import OpenAI
from giskard.llm.client.openai import OpenAIClient
from giskard.llm.embeddings.openai import OpenAIEmbedding
from giskard.llm.embeddings import set_default_embedding

_client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
oc = OpenAIClient(model="llama3.1", client=_client)
emb_client = OpenAIEmbedding(model="nomic-embed-text", client=_client)

giskard.llm.set_default_client(oc)
set_default_embedding(emb_client)
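Both models named above have to be pulled in Ollama before Giskard can use them. A rough pre-flight check, using only the standard library and Ollama's native `/api/tags` listing endpoint, could look like the sketch below. The helper names `missing_models` and `fetch_tags` are made up for this example, and the sample payload is invented to keep the block runnable without a server.

```python
import json
import urllib.error
import urllib.request


def missing_models(tags_payload, required):
    """Return the required model names that are absent from an /api/tags payload."""
    available = {m["name"] for m in tags_payload.get("models", [])}
    missing = []
    for name in required:
        # An untagged name such as "llama3.1" refers to the ":latest" tag
        full_name = name if ":" in name else f"{name}:latest"
        if full_name not in available:
            missing.append(name)
    return missing


def fetch_tags(host="http://localhost:11434", timeout=5):
    """Query a running Ollama server; return None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        return None


# Illustrative payload (made up for this example), shaped like the /api/tags response
example_payload = {"models": [{"name": "llama3.1:latest"}]}
print(missing_models(example_payload, ["llama3.1", "nomic-embed-text"]))
```

Against a live server, `missing_models(fetch_tags(), ["llama3.1", "nomic-embed-text"])` would do the same check once `fetch_tags()` is confirmed not to be `None`.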

In [50]:
# Store the pandas dataframe with our cleaned docs into a knowledge base
knowledge_base = KnowledgeBase.from_pandas(knowledge_base_docs)

In [51]:
# look at how many topics there are! cute!
knowledge_base.plot_topics()

2024-10-25 15:15:02,742 pid:85760 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
2024-10-25 15:18:11,810 pid:85760 MainThread giskard.rag  INFO     Found 13 topics in the knowledge base.


## Generate test set (drum rolls)

In [53]:
testset = generate_testset(knowledge_base,
                           num_questions=10,
                           language='en',
                           agent_description="A chatbot answering questions about the seasonality, prevalence, and environmental factors influencing diarrheal pathogens based on insights from scholarly literature.")

Generating questions: 100%|█████████████████████| 10/10 [02:14<00:00, 13.43s/it]


In [35]:
testset.save("paper_testset2.jsonl")

In [36]:
testset_loaded = QATestset.load("paper_testset2.jsonl")
df = testset_loaded.to_pandas()
df

| id | question | reference_answer | reference_context | conversation_history | metadata |
| --- | --- | --- | --- | --- | --- |
| 3891d665-e578-4274-b359-36b26e6128bd | What percentage of diarrhoea-related deaths am... | 29.3% | Document 575: Corresponding author: Dinesh Bha... | [] | {'question_type': 'simple', 'seed_document_id'... |
| 32c4a15a-f4ad-4784-b890-b2490abadea3 | What percentage of children in The Gambia test... | 3.5% | Document 94: # Results Using the aforementione... | [] | {'question_type': 'complex', 'seed_document_id... |
| 13556c84-d3e7-4987-bace-a422a6e2c5bd | What is the relationship between temperature a... | The temperature dependence of reported Campylo... | Document 810: 19. Tam CC, Rodrigues LC, OBrien... | [] | {'question_type': 'distracting element', 'seed... |
| 1ce6ed70-563f-4da8-8da9-cacefaf539aa | Hi, I'm working on a project to investigate ho... | 203 | Document 784: HS15 19 1 20 PT16 1 1 CC 403 19 ... | [] | {'question_type': 'situational', 'seed_documen... |
| 623b1503-0805-427e-9f2b-af026ec45863 | What are the two main assumptions of the Topmo... | The two main assumptions which Topmodel uses t... | Document 592: Figure 1. Flowchart to summarize... | [] | {'question_type': 'double', 'original_question... |
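The `metadata` column carries a `question_type` for each generated question, so it is worth checking how the types are distributed before evaluating. The sketch below counts them from JSON lines with the standard library only. It assumes that the saved test set holds one JSON object per line with a `metadata` field shaped like the column above; the sample records are hypothetical.

```python
import io
import json
from collections import Counter

# Hypothetical records for illustration only; a real file comes from testset.save(...)
sample_jsonl = io.StringIO(
    "\n".join(
        json.dumps(record)
        for record in [
            {"id": "a", "question": "Q1?", "metadata": {"question_type": "simple"}},
            {"id": "b", "question": "Q2?", "metadata": {"question_type": "complex"}},
            {"id": "c", "question": "Q3?", "metadata": {"question_type": "simple"}},
        ]
    )
)


def count_question_types(lines):
    """Count records per metadata['question_type'] in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get("metadata", {}).get("question_type", "unknown")] += 1
    return counts


counts = count_question_types(sample_jsonl)
print(counts.most_common())
```

If the assumption about the file layout holds, passing an open file handle for `paper_testset2.jsonl` to `count_question_types` gives the same breakdown for the real test set.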


In [37]:
for i in df.question:
    print(i)

What percentage of diarrhoea-related deaths among children below 5 years of age was caused by rotavirus infection globally in 2015?
What percentage of children in The Gambia tested positive for Campylobacter, considering only the cases that were moderate-to-severe diarrhea?
What is the relationship between temperature and reported Campylobacter infection in England, considering the context of environmental factors influencing diarrheal pathogens?
Hi, I'm working on a project to investigate how rainfall affects waterborne pathogens in low-income areas. Can you tell me what the value of HS11 is in relation to this topic?
What are the two main assumptions of the Topmodel rainfall-runoff model and what is used to derive stream networks in each subcatchment?


In [38]:
for i in df.reference_answer:
    print(i)

29.3%
3.5%
The temperature dependence of reported Campylobacter infection in England was studied from 1989-1999, with a report published in Epidemiol Infect 2006;134:119e25.
203
The two main assumptions which Topmodel uses to relate downslope flow from a point to discharge at the catchment outlet are that: 1. The dynamics of the saturated zone are approximated by successive steady-state representations; 2. The hydraulic gradient of the saturated zone is approximated by the local surface topographic slope, and stream networks in each subcatchment are derived from a DEM [digital elevation map].


In [28]:
for i in df.reference_context:
    print(i)
    print('\n\n\n')

Document 931: # Identifying the Environmental Drivers of Campylobacter Infection Risk in Southern Ontario, Canada Using a One Health Approach # # Identifying the Environmental Drivers of Campylobacter Infection Risk in Southern Ontario, Canada Using a One Health Approach Melanie Cousins1,2,3 | Jan M. Sargeant1,2 | David N. Fisman4 | Amy L. Greer1,2 1Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada 2Centre for Public Health and Zoonoses, University of Guelph, Guelph, ON, Canada 3School of Public Health and Health Systems, University of Waterloo, Waterloo, ON, Canada 4Department of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada Correspondence: Amy L. Greer, Population Science, University of Guelph, 50 Stone Road E., Guelph, ON N1G 2W1, Canada. Email: agreer@uoguelph.ca Funding information: Canadian Institute of Health Research Received: 21 November 2019 | Revised: 15 January 2020 | Accept

# Evaluation of the test set (100 questions)

In [1]:
from giskard.rag import evaluate, RAGReport
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision

In [7]:
from rag import run_agentic_rag

RuntimeError: The 'gpt4all' package is required for local inference. Suggestion: `pip install "nomic[local]"`

In [None]:
# The RAG agent
def answer_fn(question):
    print('get answer')
    answer = run_agentic_rag(question)
    return str(answer)
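A long evaluation run calls the answer function once per question, so a single exception from the agent can throw away the whole run. One option, sketched below, is to wrap the agent so that failures turn into a fallback string. This is not part of Giskard or of the `rag` module: `make_safe_answer_fn` and `fake_agent` are hypothetical names, and `fake_agent` only stands in for `run_agentic_rag` to keep the example runnable.

```python
import time


def make_safe_answer_fn(agent, fallback="ERROR: no answer produced"):
    """Wrap an agent callable so that exceptions become a fallback string."""
    def answer_fn(question):
        start = time.perf_counter()
        try:
            answer = str(agent(question))
        except Exception as exc:  # keep the evaluation loop alive
            answer = f"{fallback} ({type(exc).__name__}: {exc})"
        elapsed = time.perf_counter() - start
        print(f"answered in {elapsed:.2f}s: {question[:60]!r}")
        return answer
    return answer_fn


# Stand-in for run_agentic_rag, used only to make this sketch runnable
def fake_agent(question):
    if "crash" in question:
        raise RuntimeError("backend unavailable")
    return "29.3%"


safe_answer = make_safe_answer_fn(fake_agent)
ok = safe_answer("What percentage of deaths was caused by rotavirus?")
failed = safe_answer("please crash")
```

The trade-off is that a fallback string is scored as a wrong answer, so failed questions should be counted separately when reading the report.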

Prompting: maybe for later on

In [16]:
# Prepare QA chain
PROMPT_TEMPLATE = """You are a Researcher in Medicine with a specialisation in infectious diseases and a helpful AI assistant.
Your task is to answer specific questions on scholarly text examining the associations between diarrhea-specific pathogens and climate variables.
You will be given a question and relevant papers from PubMed.
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

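# NOTE: llm and vectorstore are not defined in this notebook; they are assumed to exist already
# (e.g. a ChatOllama model and an SKLearnVectorStore, matching the imports at the top)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=vectorstore.as_retriever(), prompt=prompt)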

In [17]:
def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    # RetrievalQA returns a dict; the generated answer is stored under the "result" key
    return [climate_qa_chain.invoke({"query": question})["result"] for question in df["question"]]



In [18]:
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change pathogens Question Answering",
    description="This model answers questions about pathogen concentration and climate change based on scholarly texts",
    feature_names=["question"]
)

2024-10-17 13:57:39,978 pid:84743 MainThread giskard.models.automodel INFO     Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.


In [19]:
examples = ["List all diarrhea-specific pathogens mentioned in the scholarly texts",
            "Do environmental factors affect pathogen concentration?"]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": examples}), target=None)


2024-10-17 13:57:41,520 pid:84743 MainThread giskard.datasets.base INFO     Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.


In [22]:
print(giskard_model.predict(giskard_dataset).prediction)

2024-10-17 13:24:02,794 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}


KeyboardInterrupt: 

In [27]:
report = giskard.scan(giskard_model, giskard_dataset, only=["robustness", "performance"])

🔎 Running scan…
Estimated calls to your model: ~200
Estimated LLM calls for evaluation: 0

2024-10-17 13:30:28,871 pid:68095 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMCharsInjectionDetector']
Running detector LLMCharsInjectionDetector…
2024-10-17 13:30:28,890 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:30:28,893 pid:68095 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (2, 1) executed in 0:00:00.012920
2024-10-17 13:30:28,897 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:30:28,900 pid:68095 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (1, 1) executed in 0:00:00.004624
2024-10-17 13:30:28,905 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'qu

In [28]:
display(report)

In [23]:
report = giskard.scan(giskard_model, giskard_dataset, only="hallucination", raise_exceptions=False)

🔎 Running scan…
Estimated calls to your model: ~30
Estimated LLM calls for evaluation: 22

2024-10-17 13:24:14,594 pid:68095 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMImplausibleOutputDetector', 'LLMBasicSycophancyDetector']
Running detector LLMImplausibleOutputDetector…
2024-10-17 13:24:33,010 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-10-17 13:24:33,019 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:24:34,853 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2024-10-17 13:24:39,937 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2024-10-17 13:24:43,246 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 

