# 3. This Jupyter Notebook covers the third phase of the project "Test"

## 3.1 Evaluate retrieval quality using Precision@k and Top-k Accuracy.

The metrics used to evaluate retrieval quality are as follows:

- Precision@k: This metric measures what fraction of the k retrieved chunks are actually relevant to answering the question.
- Top-k Accuracy: This metric measures the share of questions for which at least one relevant chunk appears within the top k results.

Applying Precision@k requires manual evaluation, since a human must judge whether each retrieved chunk truly contains information relevant to the question. Top-k Accuracy, in contrast, can be computed automatically: we define a test set with two questions, each paired with the chapter file where the relevant chunks are expected to be found, and implement a function that checks whether at least one of the top k retrieved chunks comes from that expected chapter.
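As a minimal sketch of the Top-k check described above, the logic can be separated from the vector database so that it can be tried on plain lists. The function names `hit_at_k` and `top_k_accuracy` and the file names in `toy_results` are hypothetical and used only for illustration; they are not part of the notebook.

```python
def hit_at_k(retrieved_sources, expected_source, k=3):
    """Return True if any of the first k retrieved sources is the expected one."""
    return expected_source in retrieved_sources[:k]


def top_k_accuracy(results, k=3):
    """Percentage of questions with at least one hit in the top k.

    `results` is a list of (retrieved_sources, expected_source) pairs.
    """
    if not results:
        return 0.0
    hits = sum(hit_at_k(sources, expected, k) for sources, expected in results)
    return hits / len(results) * 100


# Hypothetical file names, for illustration only
toy_results = [
    (["doc-a.txt", "doc-b.txt", "doc-c.txt"], "doc-b.txt"),  # hit at rank 2
    (["doc-a.txt", "doc-a.txt", "doc-d.txt"], "doc-z.txt"),  # miss
]

print(top_k_accuracy(toy_results, k=3))
print(top_k_accuracy(toy_results, k=1))
```

Keeping the check independent of the retrieval call makes it easy to see how the score changes with k: a question that is a hit at k=3 can become a miss at k=1.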

In [1]:
# Define the database where the search is performed

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Define the path of the database

VECTOR_DATABASE_PATH = r"C:\Users\lonel\OneDrive\Escritorio\Re Zero NLP Project\vector_database"

# Initialize the Embedding Model

embedding_model = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")
print("Embedding Model correctly created!")

# Connect the embedding model to the vector database

vector_database = Chroma(persist_directory=VECTOR_DATABASE_PATH,
                        embedding_function=embedding_model)
print("Embedding Model correctly connected to the database!")

Embedding Model correctly created!
Embedding Model correctly connected to the database!


In [2]:
# Define the questions and answer set to evaluate the system quality

test_set = [{'question':'Who killed Rom?', 'expected_source': 'arc-1-chapter-11.txt'}, 
            {'question': 'What was the object Felt stole to Satella/The silver-haired girl?', 'expected_source': 'arc-1-chapter-8.txt'}]

# Define the function to perform the Top-K Accuracy metric.

def topk_accuracy(questions, k=3):
    print("Starting Top-K accuracy test!")

    correct_retrievals = 0
    total_questions = len(questions)

    # Iterate over the function argument, not the global test_set
    for item in questions:
        query = item['question']
        expected_answer = item['expected_source']

        # Perform the search
        search = vector_database.similarity_search(query, k=k)

        # Check if any of the top k chunks is retrieved from the expected source
        found = False
        retrieved_sources = []
        retrieved_chunks = []

        for doc in search:
            source = doc.metadata.get('source', 'unknown')
            chunk = doc.page_content
            retrieved_sources.append(source)
            retrieved_chunks.append(chunk)

            if source == expected_answer:
                found = True

        if found:
            correct_retrievals += 1
            print(f"The question '{query}' was correctly found in '{expected_answer}'")
        else:
            print(f"The question '{query}' was not found in '{expected_answer}'.")

        # Print the sources and chunks once, for however many results were returned
        # (indexing chunks 0-2 directly would fail when fewer than 3 come back)
        print(f"The retrieved sources were: {retrieved_sources}")
        print("And the retrieved chunks were:")
        for position, text in enumerate(retrieved_chunks, start=1):
            print(f"Chunk {position}: {text}")

    # Calculate the Top-k Accuracy

    accuracy = (correct_retrievals / total_questions) * 100
    print(f"The Top-k accuracy of the current test is: {accuracy:.1f}%")


# Run the test

topk_accuracy(test_set, k=3)

Starting Top-K accuracy test!
The question 'Who killed Rom?' was not found in 'arc-1-chapter-11.txt'.
The retrieved sources were: ['arc-1-chapter-9.txt', 'arc-1-chapter-21.txt', 'arc-1-chapter-21.txt']
And the retrieved chunks were:
Chunk 1: Rom’s face was stern as he answered Subaru’s tactless question.

He then brought the bottle he had been pouring out of to his mouth, and as he drank,

“Because of this, most of us were wiped out. Even in the capital, I haven’t seen any other giants.”

“Yer strong even without eatin’, sho kewl. … Gunna throw up.”

“I’m saying something sad here and you respond like that?”

He wasn’t about to let someone’s sob story kill his mood.

As Subaru blocked his ears and interrupted the story, Rom gave up on telling it and started eating his beans.

The two of them passed their time silently eating those terrible beans as a side to their alcohol.

Eventually there was a coded knock on the door, by which time the sun had already set for the most part.

Subaru 

### About the Precision@k for the question 'Who killed Rom?'

The search retrieved chunks from chapters 9 and 21 of Arc 1, but none of them states that Rom was killed by Elsa. Following the formula Precision@k = (number of relevant chunks in the top k / k) * 100, with 0 relevant chunks out of k = 3, the Precision@k for this question is 0% ((0/3)*100).

### About the Precision@k for the question 'What was the object Felt stole to Satella/The silver-haired girl?'

The search retrieved chunks from chapters 8, 14, and 15, and all of them contained relevant information confirming that the object stolen by Felt from Satella (as Emilia was referred to at this point in the novel) was an insignia.

- In chapter 8, it is explicitly stated that Felt stole an insignia from Emilia (called Satella).
- In chapter 14, the same fact is repeated.
- In chapter 15, the text again mentions that the silver‑haired girl had an insignia stolen from her.

Based on this evidence, the Precision@k for this question is 100% ((3/3)*100).
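The two manual assessments above can be recorded as relevance labels and scored with the same formula. This is a sketch under stated assumptions: the helper `precision_at_k` and the dictionary `manual_labels` are hypothetical names, the labels are transcribed from the judgments above, and the final average over questions is an added convenience rather than a step taken in the notebook.

```python
def precision_at_k(relevance_labels):
    """Percentage of retrieved chunks judged relevant.

    `relevance_labels` holds one manual judgment per retrieved chunk
    (True = relevant), so k is simply the length of the list.
    """
    k = len(relevance_labels)
    if k == 0:
        return 0.0
    return sum(relevance_labels) / k * 100


# Manual judgments transcribed from the two analyses above (k = 3)
manual_labels = {
    "Who killed Rom?": [False, False, False],
    "What was the object Felt stole to Satella/The silver-haired girl?": [True, True, True],
}

scores = {question: precision_at_k(labels) for question, labels in manual_labels.items()}
for question, score in scores.items():
    print(f"{question}: {score:.0f}%")

# Averaging over questions is an assumption added here, not part of the notebook
mean_precision = sum(scores.values()) / len(scores)
print(f"Mean Precision@3: {mean_precision:.0f}%")
```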

## 3.2 Measure generative quality and coherence with ROUGE and cosine similarity between retrieved and generated text.

ROUGE stands for “Recall-Oriented Understudy for Gisting Evaluation”. It measures the word (n-gram) overlap between a generated text and a reference text, and it is used here to assess how closely the final answers provided by the AI agent match what a human would say.
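To make the idea of overlap concrete, here is a simplified sketch of ROUGE-1 (unigram overlap) written by hand. It assumes a basic tokenizer (lowercasing and splitting on non-alphanumeric characters), so its numbers may differ slightly from those of a full ROUGE implementation. The names `tokenize` and `rouge1` are hypothetical, and the prediction string is an invented agent answer used only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per word (clipped matches)
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "Elsa was the one who killed Rom."
prediction = "Rom was killed by Elsa."  # hypothetical agent answer

result = rouge1(prediction, reference)
print(result)
```

Here four words are shared ("rom", "was", "killed", "elsa"), out of five words in the prediction and seven in the reference, which gives a precision of 4/5, a recall of 4/7 and an F1 of 2/3. Word order is ignored by ROUGE-1.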

To apply this metric, we will use the 'evaluate' library along with the test_set variable. In each dictionary of the test set, we will add a new key called human_expected_answer, which will contain the answer to the question as a human would provide it. Additionally, a new function will be created to compute the ROUGE metric.

In [5]:
# Add the new key to both dictionaries

test_set[0].update({'human_expected_answer': 'Elsa was the one who killed Rom.'})
test_set[1].update({'human_expected_answer': 'The object stolen by Felt from Emilia was an insignia.'})

# Import the 'evaluate' library and the function that queries the AI agent

import evaluate
from RAG_Module import query_question

# Load the ROUGE metric

rouge = evaluate.load("rouge")

# Define the function to use the ROUGE metric

def calculate_rouge(test_set):
    print("Starting ROUGE metric")

    # The ROUGE metric needs two lists:
    predictions = [] # One with the strings generated by the AI agent
    references = [] # One with the "human" answers

    # Extract the needed information

    for item in test_set:
        query = item['question']
        human_answer = item['human_expected_answer']

        references.append(human_answer)

        # Ask the AI agent the question and add the answer to the predictions list

        predictions.append(query_question(query))

    # Compute the score

    rouge_score = rouge.compute(predictions=predictions, references=references)
    print("The ROUGE score is:")
    print(rouge_score)

calculate_rouge(test_set)

Starting ROUGE metric
Searching in the vector database for the question Who killed Rom?
The LLM has been asked the question and is now generating the answer!


ResponseError: llama runner process has terminated: CUDA error (status code: 500)

### ROUGE interpretation

The current ROUGE results show that our AI agent's answers do not closely match the human-written reference answers: the scores are below 0.4 and 0.3, the values considered good for prototypes.

However, the content of the agent's answers is essentially correct. The scores are low because the current prompt gives only a general instruction and does not constrain the agent's answer, so the extra wording reduces the overlap with the short reference answers. Redefining the prompt is therefore a possible way to improve the ROUGE scores.
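The effect of an unconstrained answer on the score can be illustrated with a hand-written, simplified ROUGE-L (based on the longest common subsequence of words). This is only a sketch: the tokenizer is a simplification, the function names are hypothetical, and both candidate answers are invented for illustration, since the agent's real answers are not shown in this notebook.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """F1 score built from LCS-based precision and recall."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The object stolen by Felt from Emilia was an insignia."
# Both answers below are hypothetical and state the same fact
concise = "The object Felt stole from Emilia was an insignia."
verbose = ("Based on the retrieved passages, the object Felt stole from Emilia "
           "was an insignia. Let me know if you would like more details about this scene.")

concise_score = rouge_l_f1(concise, reference)
verbose_score = rouge_l_f1(verbose, reference)
print(f"concise: {concise_score:.2f}")
print(f"verbose: {verbose_score:.2f}")
```

Both answers share the same eight-word subsequence with the reference, so recall is identical. The verbose answer scores lower only because its extra words reduce precision, which is consistent with the idea that a prompt constraining the answer could raise the scores.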