# How to evaluate PoopClimateQA?

This notebook explores ways to build tests that automatically evaluate the performance of the RAG agent.

In [None]:
import json
import os
from os.path import exists
import sqlite3
# LLM
from langchain_ollama import ChatOllama
from langchain.schema import Document, AIMessage
# to chunk the text
from langchain.text_splitter import RecursiveCharacterTextSplitter
# to make/store embeddings 
from langchain_community.vectorstores import SKLearnVectorStore
#from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
# to build/display/run a langgraph
from langgraph.graph import StateGraph, MessagesState
from IPython.display import Image, display
import operator
from typing_extensions import TypedDict
from typing import List, Annotated
from langgraph.graph import END, START

In [2]:
import giskard
import pandas as pd
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from openai import OpenAI
from giskard.llm.client.openai import OpenAIClient
from giskard.rag import KnowledgeBase, generate_testset, QATestset


### set LLM 

In [134]:
local_llm = "llama3.1"
llm = ChatOllama(model=local_llm, temperature=0)

## 1. Create a test set automatically from the documents at our disposal with RAGET Testset Generation

See: [link](https://docs.giskard.ai/en/stable/open_source/testset_generation/testset_generation/index.html)

This is what we're doing below:

1. We load the SQL database.

2. We split the content into chunks of a reasonable size (the splitter below counts tokens, not characters). The right size really depends on your machine's capacity to feed tokens to the LLM, both for the embeddings and later for test set generation, so my suggestion is to start small and allow an overlap of about 20% of the chunk size. Overlap gives the LLM better context: if a chunk contains, let's say, an acronym, there is a higher chance that it also contains the full spelling from a few lines before or after.

3. We remove weird stuff from the text (maybe this could have been done earlier, but here we are).

4. We transform the cleaned documents into a KnowledgeBase, which stores our embeddings and topics, see: [link](https://docs.giskard.ai/en/stable/reference/rag-toolset/knowledge_base.html#giskard.rag.knowledge_base.KnowledgeBase)

5. We run `generate_testset()`.
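
To make the chunk size and overlap from step 2 concrete, here is a minimal pure-Python sketch of a sliding window with 20% overlap. It works on a plain list of words rather than tiktoken tokens, and `chunk_with_overlap` is a hypothetical helper written for illustration, not the LangChain splitter used in this notebook (which also tries to split on natural separators).

```python
def chunk_with_overlap(tokens, chunk_size, overlap_ratio=0.2):
    """Split a token list into windows that share part of their content."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = int(round(chunk_size * overlap_ratio))
    step = chunk_size - overlap  # how far the window moves each time
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 12 "tokens", windows of 5 with 1 token (20%) of overlap
words = "the World Health Organization ( WHO ) reports that WHO guidance changed".split()
chunks = chunk_with_overlap(words, chunk_size=5)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last token of the previous one, which is the property that keeps an acronym close to its full spelling.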

In [164]:
database_path = 'literature_relevant.db'

def extract_full_text_content(database_path):
    """Return the full text of every paper stored in the database."""
    conn = sqlite3.connect(database_path)
    try:
        cursor = conn.cursor()
        # Retrieve all rows/papers from the table
        cursor.execute("SELECT fulltext FROM literature_fulltext;")
        rows = cursor.fetchall()
    finally:
        # Close the connection even if the query fails
        conn.close()

    # Each row is a paper; keep only rows whose text is a string (this also skips NULLs)
    return [row[0] for row in rows if isinstance(row[0], str)]

db_path = database_path
docs = extract_full_text_content(db_path)

In [136]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=200
)

# one Document per chunk; chunk_overlap=200 is 20% of chunk_size
documents = text_splitter.create_documents(docs)

In [137]:
# make sure that this is a list -- we need this format later on
type(documents)

list

In [165]:
# We use DocumentCleaner from the haystack library to clean up most of the text.
# I like this component a lot because it is specifically created to preprocess text for LLMs,
# although it requires a bit of tweaking -- maybe we should clean this stuff right after LlamaParse? But again, here we are.
from haystack.components.preprocessors import DocumentCleaner

cleaner = DocumentCleaner(
    unicode_normalization="NFKC",  
    ascii_only=True,               
    remove_empty_lines=True,       
    remove_extra_whitespaces=True, 
    remove_repeated_substrings=True,  
    remove_substrings=["font-family: Arial, sans-serif;", "line-height: 1.6;"],  
    remove_regex=r'\{[^}]*\}|\.[a-zA-Z0-9_-]+\s*\{[^}]*\}|<[^>]+>|\n{2,}|\s{2,}|<script.*?>.*?</script>|<!--.*?-->|---|\n---\n|h[1-6]\s*\{[^}]*\}|p\s*\{[^}]*\}'
)
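
To see what the `remove_regex` pattern above does on its own, the sketch below applies the same pattern with Python's `re.sub` to a made-up string. It only approximates that single cleaning step: the sample text is invented, and none of the other `DocumentCleaner` options are reproduced.

```python
import re

# Same pattern as passed to remove_regex above
PATTERN = r'\{[^}]*\}|\.[a-zA-Z0-9_-]+\s*\{[^}]*\}|<[^>]+>|\n{2,}|\s{2,}|<script.*?>.*?</script>|<!--.*?-->|---|\n---\n|h[1-6]\s*\{[^}]*\}|p\s*\{[^}]*\}'

# Invented sample mixing HTML tags, a CSS rule, and an HTML comment
sample = "<h1>Seasonality</h1> .title { color: red; } Cases peak in summer. <!-- draft -->"

# Every match is replaced by an empty string
cleaned = re.sub(PATTERN, "", sample).strip()
print(repr(cleaned))
```

In this sketch matches are replaced by an empty string, so text on either side of a removed tag ends up directly adjacent. It is worth checking a few cleaned chunks for glued words or leftover spacing.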

In [166]:
from haystack import Document  # Import the Document class from Haystack -- this is necessary to run the cleaner (a bit annoying this bit)

In [140]:
haystack_documents = [
    Document(content=doc.page_content, meta=doc.metadata if hasattr(doc, 'metadata') else {})
    for doc in documents
]

In [141]:
cleaned_documents = cleaner.run(haystack_documents)

In [168]:
# love the result!
cleaned_documents['documents'][1].content[:1000]

'# Introduction World-wide expansions of public health surveillance, long-term maintenance of patient electronic records and digital disease detection have invigorated attention to seasonal fluctuations in infectious diseases. A deep understanding of temporal patterns in disease occurrence and its governing principles is valuable for designing preventive programs for disease control, tracking effectiveness of public health programs, and allocating scarce resources. Many infectious diseases exhibit seasonal patterns, when systematic periodic fluctuations are observed during an annual cycle. Seasonality can be characterized by the magnitude, timing, and duration of a seasonal increase. It may differ by pathogen and its strain virulence, and may change from year to year due to shift/drift in antigenic strain and change in immunity of a naive and exposed population. Seasonal characteristics may also vary by population, geographical area, or climate zone. Introduction of a vaccine and/or su

In [173]:
# OK, once the docs are cleaned, we need to put them into a pandas DataFrame. That is because KnowledgeBase expects a pd.DataFrame, and who am I to say no to it?
# For reasons that I don't fully understand, this conversion adds back STUFF that I cleaned above. As far as I can tell only \n and --- are added, but I didn't inspect further.
# This is a bit annoying because I went all the way to clean my text and LOOK AT THIS DISASTER. I think this needs to be reconsidered.
knowledge_base_docs = pd.DataFrame(
    [doc.content.replace('\n', ' ').replace('---', ' ') for doc in cleaned_documents['documents']],
    columns=["text"]
)

In [174]:
# but look how tidy this is!
knowledge_base_docs['text'][1][:1000]

'# Introduction World-wide expansions of public health surveillance, long-term maintenance of patient electronic records and digital disease detection have invigorated attention to seasonal fluctuations in infectious diseases. A deep understanding of temporal patterns in disease occurrence and its governing principles is valuable for designing preventive programs for disease control, tracking effectiveness of public health programs, and allocating scarce resources. Many infectious diseases exhibit seasonal patterns, when systematic periodic fluctuations are observed during an annual cycle. Seasonality can be characterized by the magnitude, timing, and duration of a seasonal increase. It may differ by pathogen and its strain virulence, and may change from year to year due to shift/drift in antigenic strain and change in immunity of a naive and exposed population. Seasonal characteristics may also vary by population, geographical area, or climate zone. Introduction of a vaccine and/or su

In [145]:
# this is how many chunks we got:
len(knowledge_base_docs)

1037

## Set the client. Giskard does this through its OpenAI client, which also works with a local Ollama server because Ollama exposes an OpenAI-compatible API

Will this work on Snellius? No idea.

In [148]:
from openai import OpenAI
from giskard.llm.client.openai import OpenAIClient
from giskard.llm.embeddings.openai import OpenAIEmbedding
from giskard.llm.embeddings import set_default_embedding

_client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
oc = OpenAIClient(model="llama3.1", client=_client)
emb_client = OpenAIEmbedding(model="nomic-embed-text", client=_client)

giskard.llm.set_default_client(oc)
set_default_embedding(emb_client)
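
Whether this setup works on another machine (such as Snellius) mostly depends on whether an Ollama server is reachable there. Below is a minimal sketch of such a check, using only the standard library and Ollama's native `/api/tags` endpoint. `list_ollama_models` is a hypothetical helper, and the host and port are assumed to match the ones used above.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available Ollama models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # server not running, wrong host/port, or non-JSON response
    if not isinstance(payload, dict):
        return None  # something answered, but not in the expected shape
    return [model.get("name") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable on localhost:11434")
else:
    print("Available models:", models)
```

Running this before creating the clients makes it easier to tell a missing server apart from a Giskard configuration problem. Both `llama3.1` and `nomic-embed-text` need to appear in the list for the cell above to work.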

In [149]:
# Store the pandas dataframe with our cleaned docs into a knowledge base
knowledge_base = KnowledgeBase(knowledge_base_docs)

In [101]:
# look at how many topics there are! cute!
knowledge_base.plot_topics()

2024-10-18 14:01:56,539 pid:90534 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
2024-10-18 14:05:30,378 pid:90534 MainThread giskard.rag  INFO     Found 32 topics in the knowledge base.


## Generate test set (drum rolls)

In [150]:
testset = generate_testset(knowledge_base,
                           num_questions=3,
                           language='en',
                           agent_description="A chatbot answering questions about the environmental factors influencing diarrheal pathogens")

2024-10-18 14:18:14,181 pid:90534 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
2024-10-18 14:23:04,335 pid:90534 MainThread giskard.rag  INFO     Found 27 topics in the knowledge base.


Generating questions:   0%|          | 0/3 [00:00<?, ?it/s]

In [151]:
testset.save("paper_testset.jsonl")
testset_loaded = QATestset.load("paper_testset.jsonl")
df = testset_loaded.to_pandas()
df

| id | question | reference_answer | reference_context | conversation_history | metadata |
|---|---|---|---|---|---|
| 26edf22d-a581-48f9-82af-8323163242cf | What factors were used as predictors in the li... | soil type(s), sheep/km2, cattle/km2 | Document 595: After identification of an appro... | [] | {'question_type': 'simple', 'seed_document_id'... |
| 9e247a78-e171-4f8a-b023-9b69a697b0f7 | What was the mean monthly risk of campylobacte... | 0.593 per 100,000 | Document 30: # Temporal Patterns of Campylobac... | [] | {'question_type': 'complex', 'seed_document_id... |
| 12bf7d38-cbcd-4561-9584-ac2b7a7fd006 | What is the impact of El Niño on hospital admi... | Effect of El Nino and ambient temperature on h... | Document 247: 21. Hurst CJ, Gerba CP. Stabilit... | [] | {'question_type': 'distracting element', 'seed... |
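
With a larger test set it helps to check how the generated questions are spread over question types. The sketch below counts them from JSONL text. It assumes that each line of the saved file is one JSON object with a `metadata` dict holding `question_type`, as the table suggests. `count_question_types` is a hypothetical helper and the sample records are placeholders that carry only the fields used here.

```python
import json
from collections import Counter


def count_question_types(jsonl_text):
    """Count how many questions of each type a JSONL test set contains."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        question_type = record.get("metadata", {}).get("question_type", "unknown")
        counts[question_type] += 1
    return counts


# Placeholder records using the three question types seen in the table
sample = "\n".join(
    json.dumps({"question": f"question {i}", "metadata": {"question_type": qtype}})
    for i, qtype in enumerate(["simple", "complex", "distracting element"], start=1)
)

print(count_question_types(sample))
```

If the assumption about the file layout holds, the same function could be fed the contents of `paper_testset.jsonl`.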


In [159]:
for i in df.question:
    print(i)

What factors were used as predictors in the linear models to identify patterns of variation in Campylobacter case rates between subcatchments?
What was the mean monthly risk of campylobacteriosis in Georgia from 1999 to 2008, considering only the months when there was no drought?
What is the impact of El Niño on hospital admissions for diarrheal diseases in Peruvian children, considering the ambient temperature as a potential confounding factor?


In [160]:
for i in df.reference_answer:
    print(i)

soil type(s), sheep/km2, cattle/km2
0.593 per 100,000
Effect of El Nino and ambient temperature on hospital admissions for diarrhoeal diseases in Peruvian children.
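
Once the agent produces answers, they need to be compared with these reference answers. The sketch below is a crude lexical stand-in for that comparison: a token-overlap F1 score in plain Python. It is not Giskard's evaluation, which relies on an LLM to judge correctness. `token_f1` is a hypothetical helper and the two candidate answers are invented for illustration.

```python
import re
from collections import Counter


def token_f1(answer, reference):
    """Token-overlap F1 between an answer and a reference answer."""
    answer_tokens = re.findall(r"[a-z0-9]+", answer.lower())
    reference_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not answer_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection: each shared token counts as often as it appears in both
    common = sum((Counter(answer_tokens) & Counter(reference_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(answer_tokens)
    recall = common / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "soil type(s), sheep/km2, cattle/km2"
close_answer = "The predictors were soil type, sheep/km2 and cattle/km2."  # invented
off_answer = "Rainfall and temperature."  # invented

print(token_f1(close_answer, reference))
print(token_f1(off_answer, reference))
```

A score like this rewards shared wording only, so a correct paraphrase can score low. It is useful as a cheap sanity check, not as a replacement for a semantic judge.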


In [175]:
for i in df.reference_context:
    print(i)
    print('\n\n\n')

Document 595: After identification of an appropriate AR, MA or ARMA correlation function (and confirmation of improvement in model fit via BIC and the ACF plots), the random effects (i.e. for each subcatchment) from this model were used as the response variable in the subcatchment analyses as described below. # Spatial Model of Soil Type, Sheep and Cattle Stocking Rates on Campylobacter Cases The random effects from the best temporal model quantify differences between the subcatchments in population-adjusted Campylobacter case rates which are not explained by the hydrology, temperature, evapotranspiration or rainfall. These spatial differences between the subcatchments might be due to other environmental factors, in particular soil type and livestock grazing. Soil data from the Soil Survey of England and Wales (SSEW) maps for northern England at 100-m grid resolution were analysed at the level of the soil group in the SSEW classification. Different soil groups show strong collinearity,

Prompting (this part may be for later)

In [16]:
# Prepare QA chain
PROMPT_TEMPLATE = """You are a Researcher in Medicine with a specialisation in infectious diseases and a helpful AI assistant.
Your task is to answer specific questions on scholarly text examining the associations between diarrhea-specific pathogens and climate variables.
You will be given a question and relevant papers from PubMed.
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

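prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
# NOTE: `vectorstore` is not defined in the cells shown here; it has to be built
# (e.g. with SKLearnVectorStore and NomicEmbeddings, imported above) before this cell runs.
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=vectorstore.as_retriever(), prompt=prompt)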

In [17]:
def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    # RetrievalQA returns a dict; the generated answer is stored under "result"
    return [climate_qa_chain.invoke({"query": question})["result"] for question in df["question"]]
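
The wrapper's contract is easiest to check without a running LLM. A `RetrievalQA` chain returns a dict with the answer under `"result"`, so a wrapper usually extracts that field to get one plain string per question. The sketch below uses a stub in place of the real chain. `FakeChain` and `predict_with` are hypothetical names, and a dict of lists stands in for the DataFrame.

```python
class FakeChain:
    """Stand-in for the RetrievalQA chain: answers are returned under "result"."""

    def invoke(self, inputs):
        query = inputs["query"]
        return {"query": query, "result": f"(answer to: {query})"}


def predict_with(chain, table):
    """Return one plain string per question in table["question"]."""
    return [chain.invoke({"query": question})["result"] for question in table["question"]]


# A dict of lists is indexed by column name just like a DataFrame
table = {"question": ["Which pathogens are covered?", "Does rainfall matter?"]}
answers = predict_with(FakeChain(), table)
print(answers)
```

Swapping the stub for the real chain keeps the same shape: a list with as many strings as there are rows.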



In [18]:
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change pathogens Question Answering",
    description="This model answers any question about pathogens concentration and climate change based on scholarly text",
    feature_names=["question"]
)

2024-10-17 13:57:39,978 pid:84743 MainThread giskard.models.automodel INFO     Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.


In [19]:
examples = ["List all diarrhea-specific pathogens present in all scholarly texts",
            "Do environmental factors affect pathogen concentrations?"]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": examples}), target=None)


2024-10-17 13:57:41,520 pid:84743 MainThread giskard.datasets.base INFO     Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.


In [22]:
print(giskard_model.predict(giskard_dataset).prediction)

2024-10-17 13:24:02,794 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}


KeyboardInterrupt: 

In [27]:
report = giskard.scan(giskard_model, giskard_dataset, only=["robustness", "performance"])

🔎 Running scan…
Estimated calls to your model: ~200
Estimated LLM calls for evaluation: 0

2024-10-17 13:30:28,871 pid:68095 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMCharsInjectionDetector']
Running detector LLMCharsInjectionDetector…
2024-10-17 13:30:28,890 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:30:28,893 pid:68095 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (2, 1) executed in 0:00:00.012920
2024-10-17 13:30:28,897 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:30:28,900 pid:68095 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (1, 1) executed in 0:00:00.004624
2024-10-17 13:30:28,905 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'qu

In [28]:
display(report)

In [23]:
report = giskard.scan(giskard_model, giskard_dataset, only="hallucination", raise_exceptions=False)

🔎 Running scan…
Estimated calls to your model: ~30
Estimated LLM calls for evaluation: 22

2024-10-17 13:24:14,594 pid:68095 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMImplausibleOutputDetector', 'LLMBasicSycophancyDetector']
Running detector LLMImplausibleOutputDetector…
2024-10-17 13:24:33,010 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-10-17 13:24:33,019 pid:68095 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-10-17 13:24:34,853 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2024-10-17 13:24:39,937 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2024-10-17 13:24:43,246 pid:68095 MainThread httpx        INFO     HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 

