# RAG EVALUATION

We are going to evaluate our RAG system for tourism in Cameroon.

1. First, build the RAG system.
2. Build a synthetic dataset for evaluation.
3. Evaluate with an LLM-as-a-judge.
4. Centralize our evaluations in MLflow.

In [69]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 1. Useful libraries and global variables

In [None]:
#!pip install ragatouille

In [17]:
from src.documentPreparation import prepare_rag_data
from src.agenthandler import (
    instanciate_llm_with_huggingface, 
    initialize_embeddings_model,
    answer_with_rag,
    run_rag_tests,
    evaluate_answers
    )
import pandas as pd
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage
from ragatouille import RAGPretrainedModel
from langchain.chat_models import ChatOpenAI
from tqdm import tqdm

In [18]:
# list of documents in file folder
import os

FILES = [f"files/{f}" for f in os.listdir('files') ]
MODEL_NAME = 'all-MiniLM-L6-v2'  #'sentence-transformers/all-mpnet-base-v2' # 'sentence-transformers/all-MiniLM-L6-v2'
CHUNK_SIZE = 300
CHUNK_OVERLAP = 50
MAX_NEW_TOKENS = 500
DO_SAMPLE = True
TEMPERATURE = 0.8
TOP_P = 0.9
REPETITION_PENALTY = 1.1

In [19]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [20]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""
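
The evaluation prompt asks the judge to reply as `Feedback: ... [RESULT] n`. Below is a minimal sketch of how such a reply could be split into feedback and score. `parse_judge_output` is a hypothetical helper introduced here for illustration; `evaluate_answers` in `src.agenthandler` may parse the reply differently.

```python
import re


def parse_judge_output(text):
    """Split a judge reply of the form 'Feedback: ... [RESULT] n' into (feedback, score).

    Returns score=None when the [RESULT] marker or a 1-5 integer is missing.
    """
    feedback, marker, tail = text.rpartition("[RESULT]")
    if not marker:
        # The judge ignored the output format: keep the text, but no score.
        return text.strip(), None
    match = re.search(r"\b([1-5])\b", tail)
    score = int(match.group(1)) if match else None
    feedback = feedback.strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, score


print(parse_judge_output("Feedback: The answer matches the reference. [RESULT] 5"))
# -> ('The answer matches the reference.', 5)
```

Returning `None` for a missing score keeps format failures visible, so they can be counted separately from genuinely low scores.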





# 2. Evaluation dataset

In [21]:
questions = [
    "what are cameroonian traditionnal meals ?",
    "How many museum west region of Cameroon has ?",
    "what are the main waterfall in west region of Cameroon ?",
    "Does cameroon has cliff ?",
    "Can you give me some ecotourism site in Cameroon ?",
    "Where is located Cameroon national museum ?",
    
    "What is Cameroon population ?",
    "What is Cameroon land area ?",
    "when is Cameroon national celebration ?",
    
]

answers = [
    "Cameroon traditional meals : ndole, eru, okok, kwem, Achu Soup (Yellow Soup) & Achu, poulet DG.",
    "12",
    "Ekom nkam waterfall",
    "yes",
    "Korup National Park, Ebogo tourist site, Dja Faunal Reserve National Park",
    "Yaoundé",
    "26.545.863 ",
    "183,569 square miles",
    "20th of May"
]

eval_df = pd.DataFrame(
    {
        "question" : questions,
        "answer": answers
    }
)


# 3. Evaluating the RAG system

The first step in building a RAG system is data preparation. It includes:

1. Reading the documents, using LangChain's `PyPDFLoader` class.

2. Splitting the text into small pieces called chunks. We will use LangChain's `RecursiveCharacterTextSplitter`.

3. Computing embeddings with a pretrained model available on Hugging Face.

4. Storing these embeddings in a vector database.
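
To make steps 3 and 4 concrete, here is a toy sketch of embedding chunks, storing them, and retrieving the closest one for a question. It is only an illustration: `embed` is a word-count stand-in for a pretrained embedding model, the "vector database" is a plain Python list, and none of these names come from `src.documentPreparation`.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a pretrained model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 3: embed every chunk. Step 4: keep (chunk, vector) pairs as the index.
chunks = [
    "Ndole and eru are traditional meals",
    "Korup National Park is an ecotourism site",
    "The national museum is in Yaoundé",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, index, k=1):
    """Return the k chunks whose vectors are closest to the question vector."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("where is the national museum", index))
```

A real setup swaps `embed` for the sentence-transformers model named in `MODEL_NAME` and the list for a vector store, but the retrieval logic (embed the question, rank chunks by similarity) stays the same.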

LangChain's `RecursiveCharacterTextSplitter` splits documents by recursively applying a series of separators to cut the text into smaller pieces. Here is a step-by-step explanation of how it works:

1. **Define the separators**:
   You provide a list of separators (for example, `["\n\n", "\n", " ", ""]`). The splitter tries to split the text on the first separator in the list. If that separator cannot produce pieces of the desired size, it moves on to the next one, and so on.

2. **Recursive splitting**:
   The splitter starts with the coarsest separator (for example, `"\n\n"` for paragraphs) and tries to cut the text into pieces smaller than the specified size (`chunk_size`). If the resulting pieces are still too large, it uses the next separator in the list (for example, `"\n"` for lines) to split them further.

3. **Chunk size and overlap**:
   You specify a `chunk_size` (the maximum size of each chunk) and a `chunk_overlap` (the number of characters shared between consecutive chunks). The splitter ensures that each chunk stays within the specified size and includes the overlap, which preserves context across chunk boundaries.

4. **Final chunks**:
   The process continues until the text is divided into chunks that all fit within the desired size. The final chunks are then returned as a list.

Chunking this way preserves as much coherence as possible after splitting.
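
The recursion described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, sizes are counted in characters, and `chunk_overlap` is left out for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer ones
    for the parts that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Flush what we have, then split the oversized part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        # Merge small neighbouring parts back together while they still fit.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "para one is short.\n\npara two is a bit longer than the limit\nand has two lines."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sample the first paragraph survives intact, while the second one is first cut at the line break and then, for the line that is still too long, at spaces.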

In [22]:
embedding_model = initialize_embeddings_model(MODEL_NAME)

In [23]:
model = instanciate_llm_with_huggingface(
        model_name = "mistralai/Mistral-7B-Instruct-v0.3",  
        max_new_tokens = MAX_NEW_TOKENS,
        do_sample = DO_SAMPLE,
        temperature  = TEMPERATURE,
        top_p = TOP_P,
        repetition_penalty = REPETITION_PENALTY
        )

In [24]:
evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

In [25]:
evaluator_name = "databricks/dolly-v2-12b" #"meta-llama/Llama-2-7b"

In [26]:
OPENAI_API_KEY = ""

eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0, openai_api_key=OPENAI_API_KEY)
evaluator_name = "GPT4"



In [28]:
os.makedirs("./output", exist_ok=True)

for chunk_size in [100, 200, 300]:  # Add other chunk sizes (in tokens) as needed
    for temperature in [0.4, 0.8, 1.0]:
        for embeddings in [MODEL_NAME]:  # Add other embeddings as needed
            for rerank in [True, False]:
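                # NOTE: "reader-model" is filled with MODEL_NAME, the embedding model; the reader LLM is Mistral-7B-Instruct-v0.3
                settings_name = f"chunk-{chunk_size}_temperature-{temperature}_embeddings-{embeddings.replace('/', '~')}_rerank-{rerank}_reader-model-{MODEL_NAME}"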
                output_file_name = f"./output/rag_{settings_name}.json"

                print(f"Running evaluation for {settings_name}:")

                print("Loading knowledge base embeddings...")

                knowledge_index = prepare_rag_data(
                    list_file_path = FILES, 
                    chunk_size = chunk_size,
                    chunk_overlap = CHUNK_OVERLAP,
                    model_name = embeddings,
                    embedding_model=embedding_model,
                )

                print("Running RAG...")
                model = instanciate_llm_with_huggingface(
                    model_name = "mistralai/Mistral-7B-Instruct-v0.3",  
                    max_new_tokens = MAX_NEW_TOKENS,
                    do_sample = DO_SAMPLE,
                    temperature  = temperature,
                    top_p = TOP_P,
                    repetition_penalty = REPETITION_PENALTY
                    )
                run_rag_tests(
                    eval_dataset=eval_df,
                    llm=model,
                    knowledge_index=knowledge_index,
                    output_file=output_file_name,
                    reranker=False,  # NOTE: hard-coded, so the loop's `rerank` flag only changes the output file name
                    verbose=False,
                    test_settings=settings_name,
                    rag_prompt_template=RAG_PROMPT_TEMPLATE
                )

                print("Running evaluation...")
                evaluate_answers(
                    output_file_name,
                    eval_chat_model,
                    evaluator_name,
                    evaluation_prompt_template,
                )

Running evaluation for chunk-100_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Index not found, generating it...
Running RAG...


9it [00:17,  1.96s/it]


Running evaluation...


100%|██████████| 9/9 [00:34<00:00,  3.81s/it]


Running evaluation for chunk-100_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.20it/s]


Running evaluation...


100%|██████████| 9/9 [00:30<00:00,  3.35s/it]


Running evaluation for chunk-100_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:17,  1.96s/it]


Running evaluation...


100%|██████████| 9/9 [00:30<00:00,  3.35s/it]


Running evaluation for chunk-100_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.03it/s]


Running evaluation...


100%|██████████| 9/9 [00:31<00:00,  3.49s/it]


Running evaluation for chunk-100_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:16,  1.86s/it]


Running evaluation...


100%|██████████| 9/9 [00:40<00:00,  4.47s/it]


Running evaluation for chunk-100_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.73it/s]


Running evaluation...


100%|██████████| 9/9 [00:30<00:00,  3.37s/it]


Running evaluation for chunk-200_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.47it/s]


Running evaluation...


100%|██████████| 9/9 [00:33<00:00,  3.75s/it]


Running evaluation for chunk-200_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.08it/s]


Running evaluation...


100%|██████████| 9/9 [00:35<00:00,  3.96s/it]


Running evaluation for chunk-200_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.10it/s]


Running evaluation...


100%|██████████| 9/9 [00:31<00:00,  3.52s/it]


Running evaluation for chunk-200_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:01,  8.91it/s]


Running evaluation...


100%|██████████| 9/9 [00:33<00:00,  3.73s/it]


Running evaluation for chunk-200_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.61it/s]


Running evaluation...


100%|██████████| 9/9 [00:38<00:00,  4.25s/it]


Running evaluation for chunk-200_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.25it/s]


Running evaluation...


100%|██████████| 9/9 [00:34<00:00,  3.84s/it]


Running evaluation for chunk-300_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.44it/s]


Running evaluation...


100%|██████████| 9/9 [00:30<00:00,  3.41s/it]


Running evaluation for chunk-300_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.53it/s]


Running evaluation...


100%|██████████| 9/9 [00:28<00:00,  3.21s/it]


Running evaluation for chunk-300_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:01,  8.90it/s]


Running evaluation...


100%|██████████| 9/9 [00:28<00:00,  3.13s/it]


Running evaluation for chunk-300_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.44it/s]


Running evaluation...


100%|██████████| 9/9 [00:36<00:00,  4.00s/it]


Running evaluation for chunk-300_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:00,  9.32it/s]


Running evaluation...


100%|██████████| 9/9 [00:37<00:00,  4.21s/it]


Running evaluation for chunk-300_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2:
Loading knowledge base embeddings...
Reading pdf files...
Chunking the documents...
Initializing embeddings model...
Creating vectorial db...
Running RAG...


9it [00:01,  8.90it/s]


Running evaluation...


100%|██████████| 9/9 [00:31<00:00,  3.49s/it]


# 4. Load results

In [29]:
import glob
import json

outputs = []
for file in glob.glob("./output/*.json"):
    # Load one result file and tag its rows with the path it came from
    with open(file, "r") as f:
        output = pd.DataFrame(json.load(f))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)

In [30]:
print(result.shape)
result.head(2)

(162, 8)


| | question | true_answer | generated_answer | retrieved_docs | test_settings | eval_score_GPT4 | eval_feedback_GPT4 | settings |
|---|---|---|---|---|---|---|---|---|
| 0 | what are cameroonian traditionnal meals ? | Cameroon traditional meals : ndole, eru, okok,... | Cameroonian traditional meals include dishes s... | [unique culture in \nCameroon through its \nlo... | chunk-100_temperature-0.4_embeddings-all-MiniL... | 4 | The response correctly identifies several trad... | ./output\rag_chunk-100_temperature-0.4_embeddi... |
| 1 | How many museum west region of Cameroon has ? | 12 | The West Region of Cameroon has 15 museums. [... | [The West Region of Cameroon has the highest n... | chunk-100_temperature-0.4_embeddings-all-MiniL... | 2 | Feedback: The response provided states that th... | ./output\rag_chunk-100_temperature-0.4_embeddi... |


In [31]:
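# Scores are expected as strings; non-string values (e.g. missing scores) fall back to the lowest score, 1
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(
    lambda x: int(x) if isinstance(x, str) else 1
)
# Rescale from the 1-5 rubric to the [0, 1] range
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4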
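
The rescaling above maps the 1-5 rubric onto [0, 1] with `(score - 1) / 4`, so 1 becomes 0.0, 3 becomes 0.5 and 5 becomes 1.0. A small standalone sketch of the same mapping follows. `normalize_score` is a hypothetical helper, and unlike the cell above it also accepts scores that are already integers; that is a design variation, not the notebook's behavior.

```python
def normalize_score(raw, low=1, high=5):
    """Map a judge score in [low, high] to [0, 1].

    Values that cannot be read as an integer (None, NaN, free text)
    are treated as the lowest score.
    """
    try:
        score = int(raw)
    except (TypeError, ValueError):
        score = low
    return (score - low) / (high - low)


for raw in ["1", "3", "5", None]:
    print(repr(raw), "->", normalize_score(raw))
```

With this scale, an average of 0.5 for a configuration means the judge rated its answers around 3 out of 5 on average.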

In [48]:
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()

settings
./output\rag_chunk-100_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2.json     0.194444
./output\rag_chunk-100_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2.json    0.222222
./output\rag_chunk-100_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2.json    0.333333
./output\rag_chunk-100_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2.json     0.361111
./output\rag_chunk-100_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2.json    0.388889
./output\rag_chunk-100_temperature-0.8_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM-L6-v2.json     0.388889
./output\rag_chunk-300_temperature-1.0_embeddings-all-MiniLM-L6-v2_rerank-False_reader-model-all-MiniLM-L6-v2.json    0.500000
./output\rag_chunk-300_temperature-0.4_embeddings-all-MiniLM-L6-v2_rerank-True_reader-model-all-MiniLM

# 5. Format and plot results

In [49]:
def extract_chunk_size(text):
    res = text.split("rag_chunk-")[1].split("_")[0]
    return res

def extract_temperature(text):
    res = text.split('temperature-')[1].split("_")[0]
    return res

def extract_rerank(text):
    res = text.split("rerank-")[1].split("_")[0]
    return res

In [55]:
formated_results = pd.DataFrame(average_scores).reset_index()
formated_results['chunk_size'] = formated_results['settings'].apply(extract_chunk_size)
formated_results['temperature'] = formated_results['settings'].apply(extract_temperature)
formated_results['rerank'] = formated_results['settings'].apply(extract_rerank)
formated_results.head()

| | settings | eval_score_GPT4 | chunk_size | temperature | rerank |
|---|---|---|---|---|---|
| 0 | ./output\rag_chunk-100_temperature-0.4_embeddi... | 0.333333 | 100 | 0.4 | False |
| 1 | ./output\rag_chunk-100_temperature-0.4_embeddi... | 0.361111 | 100 | 0.4 | True |
| 2 | ./output\rag_chunk-100_temperature-0.8_embeddi... | 0.388889 | 100 | 0.8 | False |
| 3 | ./output\rag_chunk-100_temperature-0.8_embeddi... | 0.388889 | 100 | 0.8 | True |
| 4 | ./output\rag_chunk-100_temperature-1.0_embeddi... | 0.222222 | 100 | 1.0 | False |


In [56]:
formated_results[['temperature', 'chunk_size']] = formated_results[['temperature', 'chunk_size']].astype(float)

In [57]:
# Keep the runs without reranking (the rows displayed in this cell's output)
formated_results = formated_results[formated_results["rerank"] == "False"]
formated_results.head()

| | settings | eval_score_GPT4 | chunk_size | temperature | rerank |
|---|---|---|---|---|---|
| 0 | ./output\rag_chunk-100_temperature-0.4_embeddi... | 0.333333 | 100.0 | 0.4 | False |
| 2 | ./output\rag_chunk-100_temperature-0.8_embeddi... | 0.388889 | 100.0 | 0.8 | False |
| 4 | ./output\rag_chunk-100_temperature-1.0_embeddi... | 0.222222 | 100.0 | 1.0 | False |
| 6 | ./output\rag_chunk-200_temperature-0.4_embeddi... | 0.527778 | 200.0 | 0.4 | False |
| 8 | ./output\rag_chunk-200_temperature-0.8_embeddi... | 0.555556 | 200.0 | 0.8 | False |


In [53]:
import plotly.express as px

fig = px.scatter(formated_results, x='chunk_size', y='temperature', 
                 color='eval_score_GPT4', size='eval_score_GPT4', 
                 color_continuous_scale='Viridis', 
                 title='Eval score by temperature vs. chunk_size')

fig.show()