<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/3_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 3 Evaluation – Assessing answer quality through both automated and human evaluation

**Contribution:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan

**Goal of this step:** Students will assess the quality of answers produced by their top-performing RAG pipeline. This involves applying the pipeline to benchmark questions, comparing its responses to ground truth answers, and evaluating performance using both automated metrics and human judgment.

### **Objective**
Evaluate the quality of answers generated by the best RAG pipeline using both automated metrics and human judgment.

---

### **Workflow**

1. **Run the Best RAG Pipeline**
	- Apply the developed RAG pipeline to the benchmark question set to generate answers.

2. **Automated Metrics Calculation**
	- **Semantic Exact Match:** Assess if generated answers semantically match ground truth using embeddings or semantic models.
	- **Semantic F1 Score:** Tokenize answers and compute precision, recall, and F1 based on semantic similarity.
	- **BLEU/ROUGE:** Measure N-gram or sequence overlap between generated and ground truth answers.
	- **Record Results:** Store all metric scores in a structured format for comparison.

3. **Human Evaluation**
	- **Criteria:** Evaluate each answer for:
	  - *Relevance* (alignment with the query)
	  - *Correctness* (accuracy vs. ground truth)
	  - *Clarity* (understandability)
	- **Rating Scale:** Use a 1–5 scale for each criterion.
	- **Analysis:** Provide a brief written summary of human evaluation findings.

4. **Report and Presentation**
	- Present both automated and human evaluation results using tables, charts, or other concise visualizations within the notebook.
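
The lexical side of the automated metrics listed above can be sketched without external libraries. The block below is a minimal illustration, not this notebook's implementation: `tokenize`, `token_f1`, `lcs_length`, and `rouge_l_f1` are hypothetical helper names, and the tokenizer is a simple regex split. The F1 here counts exact token overlap, so a semantic variant would need to replace exact matching with an embedding-based similarity.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L style F1 based on the longest common token subsequence."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# A verbose but correct answer is penalised on precision, not on recall
print(round(token_f1("The president was Olaf Kübler", "Olaf Kübler"), 3))  # 0.571
```

Both scores reward short answers that repeat the reference wording, which is one reason the workflow pairs them with semantic metrics and human ratings.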

#### Visual Representation of the Evaluation Pipeline
![Pipeline Overview](Evaluation_Workflow.svg)

---

**Summary Table Example:**

| Question | Automated Metrics (F1, BLEU, ROUGE) | Human Relevance | Human Correctness | Human Clarity | Comments |
|----------|-------------------------------------|-----------------|-------------------|--------------|----------|
| Q1       | ...                                 | ...             | ...               | ...          | ...      |
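
One way to assemble a row of the summary table is sketched below. It assumes each human rater supplies a dictionary with the three criteria on the 1–5 scale. The names `CRITERIA`, `validate_rating`, `summarize_human_ratings`, and `build_summary_row` are hypothetical, and the numbers in the usage lines are invented placeholders, not evaluation results.

```python
from statistics import mean

# The three human evaluation criteria named in the workflow
CRITERIA = ("relevance", "correctness", "clarity")


def validate_rating(value: int) -> int:
    """Reject ratings outside the 1-5 scale."""
    if not 1 <= value <= 5:
        raise ValueError(f"Rating must be between 1 and 5, got {value}")
    return value


def summarize_human_ratings(ratings: list[dict]) -> dict:
    """Average each criterion over all raters for one question."""
    return {c: mean(validate_rating(r[c]) for r in ratings) for c in CRITERIA}


def build_summary_row(question: str, automated: dict, ratings: list[dict],
                      comments: str = "") -> dict:
    """Combine automated scores and averaged human ratings into one table row."""
    row = {"Question": question}
    row.update(automated)
    human = summarize_human_ratings(ratings)
    row.update({f"Human {c.capitalize()}": human[c] for c in CRITERIA})
    row["Comments"] = comments
    return row


# Placeholder values for illustration only
example_ratings = [
    {"relevance": 4, "correctness": 5, "clarity": 3},
    {"relevance": 5, "correctness": 5, "clarity": 4},
]
print(build_summary_row("Q1", {"F1": 0.5}, example_ratings, "placeholder comment"))
```

A list of such rows can be passed to `pd.DataFrame(rows)` to obtain the table layout shown above.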



# 1.0 Loading data and functions from the previous stage

## 1.1 Setup of the environment

In [None]:
import os
import re
import json
import asyncio
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import pickle
import faiss
from typing import Optional, List
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification
from FlagEmbedding import FlagReranker
from google.colab import userdata

In [4]:
# Set the seed for consistent results
seed_value = 2138247234
random.seed(seed_value)
np.random.seed(seed_value)
os.environ['PYTHONHASHSEED'] = str(seed_value)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

In [45]:
# Run this cell if working locally
df = pd.read_csv('data/Stage2-02-chunked-dataset.csv')
filename = 'data/Stage3-02-precalc-reranked-chunks.pkl'
with open(filename, 'rb') as f:
	precalc_retrieved_chunks = pickle.load(f)
print(f"Dictionary loaded from {filename}:")

with open('data/Stage2-03-questions-answers.pkl', 'rb') as f_qa:
	question_answers = pickle.load(f_qa)

# Load the OpenAI API key from data/.env when running locally
from dotenv import dotenv_values

env_vars = dotenv_values('data/.env')
OPENAI_API_KEY = env_vars.get('OPENAI_API_KEY', None)
print(f"Loaded OPENAI_API_KEY: {'***' if OPENAI_API_KEY else 'Not found'}")

Dictionary loaded from data/Stage3-02-precalc-reranked-chunks.pkl:
Loaded OPENAI_API_KEY: ***


In [74]:
df.head()

|   | unique_chunk_id | chunk_text | chunk_length | total_chunks | folder_path | file_name | year | month | language | type | title | text_id | chunk_id |
|---|-----------------|------------|--------------|--------------|-------------|-----------|------|-------|----------|------|-------|---------|----------|
| 0 | 0000_00 | Als 1950 die Meteorologen Jule Charney, Ragnar... | 563 | 8 | /content/drive/MyDrive/AdvGenAI/data/de_news_e... | blog-knutti-klimamodelle.html | 2019 | 8 | de | news events | Blog knutti klimamodelle | 0 | 0 |
| 1 | 0000_01 | ## Erstaunliche Entwicklung der Klimamodelle\n... | 804 | 8 | /content/drive/MyDrive/AdvGenAI/data/de_news_e... | blog-knutti-klimamodelle.html | 2019 | 8 | de | news events | Blog knutti klimamodelle | 0 | 1 |
| 2 | 0000_02 | «Alle Modelle sind falsch, aber einige sind nü... | 881 | 8 | /content/drive/MyDrive/AdvGenAI/data/de_news_e... | blog-knutti-klimamodelle.html | 2019 | 8 | de | news events | Blog knutti klimamodelle | 0 | 2 |
| 3 | 0000_03 | Doch um die Gitterweite verkleinern zu können,... | 536 | 8 | /content/drive/MyDrive/AdvGenAI/data/de_news_e... | blog-knutti-klimamodelle.html | 2019 | 8 | de | news events | Blog knutti klimamodelle | 0 | 3 |
| 4 | 0000_04 | Bis ein hochaufgelöstes Modell auf einer neuen... | 466 | 8 | /content/drive/MyDrive/AdvGenAI/data/de_news_e... | blog-knutti-klimamodelle.html | 2019 | 8 | de | news events | Blog knutti klimamodelle | 0 | 4 |


In [46]:
# Report how many questions have precalculated (reranked) chunks in the .pkl file
print(f"Precalculated retrieved chunks: {len(precalc_retrieved_chunks)} entries")
# Print the first few entries to understand the structure
print("Sample entries from precalculated retrieved chunks:")
for key, value in list(precalc_retrieved_chunks.items())[:2]:
	print(f"Key: {key}, Value: {value[:2]}...")  # Print first two items for brevity

Precalculated retrieved chunks: 25 entries
Sample entries from precalculated retrieved chunks:
Key: Who was president of ETH in 2003?, Value:   unique_chunk_id     score  \
0         3429_06  0.048395   
1         3795_08  0.047403   

                                          chunk_text  \
0  The President of the ETH Board, Fritz Schiesse...   
1  ETH President Ralph Eichler, who is handing ov...   

                               title  
0            New president appointed  
1  Encouraging more critical opinion  ...
Key: Who were the rectors of ETH between 2017 and 2022?, Value:   unique_chunk_id     score  \
0         3871_02  0.048916   
1         2371_01  0.048139   

                                          chunk_text            title  
0  ## Big changes on all study levels\nOver the p...     Eth day 2019  
1  "I am excited about the opportunity to contrib...  New head of let  ...


In [73]:
# Inspect the raw question/answer dictionary (pprint was imported in the setup cell)
#pprint.pprint(question_answers)
print(question_answers)
# Convert question_answers to a DataFrame
qa_df = pd.DataFrame(question_answers)
print("Question-Answers DataFrame:")
# Transpose so that each row is one question
qa_df = qa_df.T
# Drop the retrieval-related columns, which are not needed for answer evaluation
qa_df = qa_df.drop(['possible_relevant_chunks', 'ground_truth_relevance'], axis=1)
# Display the first few rows of the DataFrame
qa_df.head()

Question-Answers DataFrame:


|   | question | answer | evaluation_comments |
|---|----------|--------|---------------------|
| 1 | Who was president of ETH in 2003? | Olaf Kübler | |
| 2 | Who were the rectors of ETH between 2017 and 2... | Sarah Springman, Günther Dissertori | |
| 3 | Who at ETH received ERC grants? | European Research Council grants: Tobias Donne... | The criterion here: does it come up with a lis... |
| 4 | When did the InSight get to Mars? | 26 November 2018 | |
| 5 | What did Prof. Schubert say about ﬂying? | Flying is too cheap. If we want to reduce ﬂyin... | |


# 2.0 Connection to LLM (OpenAI)

In [80]:
#%pip install --quiet --upgrade openai

import openai

# Ensure the column exists
qa_df['chatgpt_no_context'] = ""

client = openai.OpenAI(api_key=OPENAI_API_KEY)

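def ask_chatgpt(question, model="gpt-4o"):
	"""Ask the model a question without any retrieved context (baseline).

	Returns the answer text, or a string starting with "Error:" if the API call fails.
	"""
	try:
		response = client.chat.completions.create(
			model=model,
			messages=[{"role": "user", "content": question}],
			temperature=0  # minimise sampling randomness for the baseline
		)
		return response.choices[0].message.content.strip()
	except Exception as e:
		# The error text is stored in place of an answer so the loop can continue
		return f"Error: {e}"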

# Query ChatGPT for each question and store the answer
for idx, row in qa_df.iterrows():
	answer = ask_chatgpt(row['question'])
	qa_df.at[idx, 'chatgpt_no_context'] = answer

# Optionally display the updated DataFrame
qa_df.head()

|   | question | answer | evaluation_comments | chatgpt_rag_response | chatgpt_no_context |
|---|----------|--------|---------------------|----------------------|--------------------|
| 1 | Who was president of ETH in 2003? | Olaf Kübler | | The context provided does not specify who was ... | In 2003, the president of ETH Zurich (Swiss Fe... |
| 2 | Who were the rectors of ETH between 2017 and 2... | Sarah Springman, Günther Dissertori | | The rectors of ETH between 2017 and 2022 were ... | Between 2017 and 2022, the rector of ETH Zuric... |
| 3 | Who at ETH received ERC grants? | European Research Council grants: Tobias Donne... | The criterion here: does it come up with a lis... | More than 80 researchers at ETH Zurich have re... | ETH Zurich has been successful in securing num... |
| 4 | When did the InSight get to Mars? | 26 November 2018 | | The InSight lander successfully landed on Mars... | NASA's InSight lander arrived on Mars on Novem... |
| 5 | What did Prof. Schubert say about ﬂying? | Flying is too cheap. If we want to reduce ﬂyin... | | Prof. Schubert expressed that while surcharges... | I'm sorry, but I need more context to provide ... |


In [92]:
import openai

# Output column
qa_df["chatgpt_rag_response"] = ""

client = openai.OpenAI(api_key=OPENAI_API_KEY)

def ask_chatgpt_with_rag_context(question: str, top_k: int = 5, model: str = "gpt-4.1") -> str:
    """
    RAG-enhanced ChatGPT call using top_k context chunks from precalc_retrieved_chunks.
    Pulls chunk text from df by unique_chunk_id.
    """
    if question not in precalc_retrieved_chunks:
        return f"[Error] No retrieved chunks for: {question}"

    try:
        top_chunks = precalc_retrieved_chunks[question].sort_values("score", ascending=False).head(top_k)
        chunk_ids = top_chunks["unique_chunk_id"].tolist()
        scores = top_chunks["score"].tolist()

        # Fetch actual text for each chunk
        context_blocks = []
        for cid, score in zip(chunk_ids, scores):
            match = df[df["unique_chunk_id"] == cid]
            if not match.empty:
                text = match.iloc[0]["chunk_text"]
                context_blocks.append(f"Score: {round(score, 4)}\n{text}")

        # Build prompt
        context_text = "\n\n---\n\n".join(context_blocks)
        prompt = f"""You are a helpful assistant. Take your time understanding the questions and noting all the information provided on the context.
        Based on the following context and your previous knowledge, answer the user's question accurately and concisely. 
        If the question is too open-ended or requires more details, ask for clarification.

Context:
{context_text}

Question: {question}
Answer:"""

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=512
        )
        return response.choices[0].message.content.strip()

    except Exception as e:
        return f"[Error] {e}"
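
The context-building step inside `ask_chatgpt_with_rag_context` can be checked without calling the API if it is separated into a pure function. The sketch below assumes the retrieved chunks are available as `(text, score)` pairs. `build_context` and `build_prompt` are hypothetical names, and the instruction paragraph of the actual prompt is left out for brevity.

```python
def build_context(chunks: list[tuple[str, float]], top_k: int = 5) -> str:
    """Join the top_k highest-scoring chunks, each prefixed with its score."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)[:top_k]
    blocks = [f"Score: {round(score, 4)}\n{text}" for text, score in ranked]
    return "\n\n---\n\n".join(blocks)


def build_prompt(question: str, chunks: list[tuple[str, float]], top_k: int = 5) -> str:
    """Assemble the context and question in the same order as the RAG prompt."""
    context = build_context(chunks, top_k)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy chunks for illustration only; real chunks come from precalc_retrieved_chunks
toy_chunks = [("chunk A", 0.1), ("chunk B", 0.9), ("chunk C", 0.5)]
print(build_prompt("Who was president of ETH in 2003?", toy_chunks, top_k=2))
```

Printing the prompt for a few questions makes it easy to see whether the chunk that contains the ground truth answer actually reaches the model.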


In [90]:
for idx, row in qa_df.iterrows():
    q = row["question"]
    print(f"⏳ Answering: {q[:80]}...")
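    answer = ask_chatgpt_with_rag_context(q, top_k=5)
    # Check only for the "[Error]" prefix returned by ask_chatgpt_with_rag_context,
    # so that valid answers containing the word "Error" are not flagged
    if answer.startswith("[Error]"):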
        print(f"❌ Error for question {q}: {answer}")
    else:
        print(f"✅ Answered: {answer[:80]}...")
    qa_df.at[idx, "chatgpt_rag_response"] = answer


⏳ Answering: Who was president of ETH in 2003?...
✅ Answered: Based on the context provided, the presidents of ETH Zurich mentioned are:

- Ra...
⏳ Answering: Who were the rectors of ETH between 2017 and 2022?...
✅ Answered: Between 2017 and 2022, the Rector of ETH Zurich was Sarah Springman. She served ...
⏳ Answering: Who at ETH received ERC grants?...
✅ Answered: Several researchers at ETH Zurich have received ERC (European Research Council) ...
⏳ Answering: When did the InSight get to Mars?...
✅ Answered: The InSight lander successfully arrived on Mars and landed on the Martian surfac...
⏳ Answering: What did Prof. Schubert say about ﬂying?...
✅ Answered: Prof. Renate Schubert expressed several important views about flying:

- She bel...
⏳ Answering: What is e-Sling?...
✅ Answered: The term "e-Sling" does not appear directly in the provided context information....
⏳ Answering: Who are famous ETH alumni?...
✅ Answered: ETH Zurich has an impressive list of famous alumni who have made

KeyboardInterrupt: 
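
An interrupted run like the one above keeps the answers already written, but starting the loop again would query every question a second time. A resumable variant might look like the sketch below, which assumes the rows are plain dictionaries and that the output column is not reset to empty strings before the loop. `fill_missing_answers` and `stub_answer` are hypothetical names, and the stub stands in for the real API call.

```python
from typing import Callable


def fill_missing_answers(rows: list[dict], answer_fn: Callable[[str], str],
                         column: str) -> int:
    """Call answer_fn only for rows whose `column` is empty; return the number of calls."""
    calls = 0
    for row in rows:
        if row.get(column):
            # Already answered in an earlier run, so skip the API call
            continue
        row[column] = answer_fn(row["question"])
        calls += 1
    return calls


def stub_answer(question: str) -> str:
    """Placeholder for the real model call (assumption for this example)."""
    return f"stub answer to: {question}"


rows = [
    {"question": "Who was president of ETH in 2003?", "chatgpt_rag_response": "kept"},
    {"question": "What is e-Sling?", "chatgpt_rag_response": ""},
]
print(fill_missing_answers(rows, stub_answer, "chatgpt_rag_response"))  # 1
```

The same idea applies to `qa_df` by testing each cell for an empty string before calling the model.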

In [79]:
qa_df.head()

|   | question | answer | evaluation_comments | chatgpt_rag_response |
|---|----------|--------|---------------------|----------------------|
| 1 | Who was president of ETH in 2003? | Olaf Kübler | | The context provided does not specify who was ... |
| 2 | Who were the rectors of ETH between 2017 and 2... | Sarah Springman, Günther Dissertori | | The rectors of ETH between 2017 and 2022 were ... |
| 3 | Who at ETH received ERC grants? | European Research Council grants: Tobias Donne... | The criterion here: does it come up with a lis... | More than 80 researchers at ETH Zurich have re... |
| 4 | When did the InSight get to Mars? | 26 November 2018 | | The InSight lander successfully landed on Mars... |
| 5 | What did Prof. Schubert say about ﬂying? | Flying is too cheap. If we want to reduce ﬂyin... | | Prof. Schubert expressed that while surcharges... |
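
With ground truth answers and generated answers side by side, the Semantic Exact Match metric from the workflow could be computed roughly as follows. This is a sketch under stated assumptions: the embedding function is injected as a parameter, the 0.8 threshold is an arbitrary choice that would need tuning, and `toy_embed` is a bag-of-words stand-in used only to keep the example self-contained. In the notebook, `embed` would more plausibly wrap the `encode` method of a `SentenceTransformer` model.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_exact_match(prediction: str, reference: str,
                         embed: Callable[[str], Sequence[float]],
                         threshold: float = 0.8) -> bool:
    """Count a match when the embeddings are at least `threshold` similar."""
    return cosine_similarity(embed(prediction), embed(reference)) >= threshold


# Toy vocabulary and embedding for illustration only (assumption, not a real model)
VOCAB = ("olaf", "kübler", "president", "eth", "2003")


def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


print(semantic_exact_match("Olaf Kübler", "olaf kübler", toy_embed))
```

Injecting `embed` keeps the metric logic independent of the embedding model, so different models can be compared on the same answers.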
