<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/3_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 3 Evaluation – Assessing answer quality through both automated and human evaluation

**Contribution:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan

**Goal of this step:** Students will assess the quality of answers produced by their top-performing RAG pipeline. This involves applying the pipeline to benchmark questions, comparing its responses to ground truth answers, and evaluating performance using both automated metrics and human judgment.

### **Objective**
Evaluate the quality of answers generated by the best RAG pipeline using both automated metrics and human judgment.

---

### **Workflow**

1. **Run the Best RAG Pipeline**
	- Apply the developed RAG pipeline to the benchmark question set to generate answers.

2. **Automated Metrics Calculation**
	- **Semantic Exact Match:** Assess if generated answers semantically match ground truth using embeddings or semantic models.
	- **Semantic F1 Score:** Tokenize answers and compute precision, recall, and F1 based on semantic similarity.
	- **BLEU/ROUGE:** Measure N-gram or sequence overlap between generated and ground truth answers.
	- **Record Results:** Store all metric scores in a structured format for comparison.

3. **Human Evaluation**
	- **Criteria:** Evaluate each answer for:
	  - *Relevance* (alignment with the query)
	  - *Correctness* (accuracy vs. ground truth)
	  - *Clarity* (understandability)
	- **Rating Scale:** Use a 1–5 scale for each criterion.
	- **Analysis:** Provide a brief written summary of human evaluation findings.

4. **Report and Presentation**
	- Present both automated and human evaluation results using tables, charts, or other concise visualizations within the notebook.
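
As a rough illustration of the automated metrics in step 2, the sketch below computes precision, recall and F1 from plain token overlap. It is a lexical simplification, not the embedding-based semantic F1 described above, and the helper names `normalize` and `token_f1` are assumptions made for this example rather than functions from the project.

```python
import re
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> Dict[str, float]:
    """Precision, recall and F1 over the multiset of tokens shared by both answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; a single empty answer counts as a miss.
        score = float(pred_tokens == ref_tokens)
        return {"precision": score, "recall": score, "f1": score}
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative strings only, not pipeline output
scores = token_f1("InSight landed on Mars on 26 November 2018", "26 November 2018")
print(scores)
```

A long but correct answer scores high recall and low precision under this measure, which is one reason to pair it with a semantic metric and with human judgment.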

#### Visual Representation of the Evaluation Pipeline
![Pipeline Overview](https://github.com/JackGraymer/Advanced-GenAI/blob/main/Evaluation_Workflow.svg?raw=1)

---

**Summary Table Example:**

| Question | Automated Metrics (F1, BLEU, ROUGE) | Human Relevance | Human Correctness | Human Clarity | Comments |
|----------|-------------------------------------|-----------------|-------------------|--------------|----------|
| Q1       | ...                                 | ...             | ...               | ...          | ...      |
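
The human-evaluation columns of such a table could be filled by averaging the 1–5 ratings per criterion. The following is a minimal sketch under that assumption; `CRITERIA`, `summarize_ratings` and the example ratings are hypothetical and only show the aggregation step.

```python
from statistics import mean

CRITERIA = ("relevance", "correctness", "clarity")


def summarize_ratings(ratings):
    """Average each 1-5 criterion over all raters for a single answer."""
    for rating in ratings:
        for criterion in CRITERIA:
            if not 1 <= rating[criterion] <= 5:
                raise ValueError(f"{criterion} must be between 1 and 5")
    return {criterion: mean(r[criterion] for r in ratings) for criterion in CRITERIA}


# Hypothetical ratings from two raters for one answer
example_ratings = [
    {"relevance": 5, "correctness": 4, "clarity": 5},
    {"relevance": 4, "correctness": 4, "clarity": 3},
]
summary = summarize_ratings(example_ratings)
print(summary)
```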



# 1.0 Loading Data and functions from previous stage

## 1.1 Setup of the environment

In [None]:
!pip install torch torchvision torchaudio
!pip install --quiet --upgrade openai

In [2]:
import torch
if torch.cuda.is_available():
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    # The model will automatically be placed on the GPU
else:
    print("GPU is not available. Running on CPU.")

GPU is available: Tesla T4


In [3]:
import os
import re
import json
import asyncio
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import pickle
from typing import Optional, List
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification
from google.colab import userdata

In [4]:
# Set the seed for consistent results
seed_value = 2138247234
random.seed(seed_value)
np.random.seed(seed_value)
os.environ['PYTHONHASHSEED'] = str(seed_value)

## 1.2a Setup for working on Colab

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

In [7]:
# Load dictionary of precalculated retrieved chunks
filename = os.path.join(base_folder, "Stage3/Working-dir/Stage3-02-precalc-reranked-chunks.pkl")
with open(filename, 'rb') as f:
    precalc_reranked_chunks = pickle.load(f)
print(f"Dictionary loaded from {filename}:")

Dictionary loaded from /content/drive/MyDrive/AdvGenAI/Stage3/Working-dir/Stage3-02-precalc-reranked-chunks.pkl:


In [8]:
df = pd.read_csv(os.path.join(base_folder, 'Stage2/Working-dir/Stage2-02-chunked-dataset.csv'))

In [9]:
# Load Q_A_data file
with open(os.path.join(base_folder, 'Stage2/Working-dir/Stage2-08-q-a-file-with-relevancy.pkl'), 'rb') as f:
    question_answers = pickle.load(f)

In [26]:
# Load the OpenAI API key from the Colab user secrets
import openai
from google.colab import userdata

OPENAI_API_KEY = userdata.get('openai_advAI')

## 1.2b Setup for working locally

In [None]:
# Run this cell if working locally
df = pd.read_csv('data/Stage2-02-chunked-dataset.csv')
filename = 'data/Stage3-02-precalc-reranked-chunks.pkl'
with open(filename, 'rb') as f:
	precalc_reranked_chunks = pickle.load(f)
print(f"Dictionary loaded from {filename}:")

with open('data/Stage2/Working-dir/Stage2-08-q-a-file-with-relevancy.pkl', 'rb') as f_qa:
	question_answers = pickle.load(f_qa)

# Load the OpenAI API key from the local data/.env file
from dotenv import dotenv_values

env_vars = dotenv_values('data/.env')
OPENAI_API_KEY = env_vars.get('OPENAI_API_KEY', None)
print(f"Loaded OPENAI_API_KEY: {'***' if OPENAI_API_KEY else 'Not found'}")

Dictionary loaded from data/Stage3-02-precalc-reranked-chunks.pkl:
Loaded OPENAI_API_KEY: ***


## 1.3 Inspect loaded data

In [11]:
df.head()

Unnamed: 0,unique_chunk_id,chunk_text,chunk_length,total_chunks,folder_path,file_name,year,month,language,type,title,text_id,chunk_id
0,0000_00,"Als 1950 die Meteorologen Jule Charney, Ragnar...",563,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,0
1,0000_01,## Erstaunliche Entwicklung der Klimamodelle\n...,804,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,1
2,0000_02,"«Alle Modelle sind falsch, aber einige sind nü...",881,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,2
3,0000_03,"Doch um die Gitterweite verkleinern zu können,...",536,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,3
4,0000_04,Bis ein hochaufgelöstes Modell auf einer neuen...,466,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,4


In [21]:
# Report how many questions have precalculated reranked chunks
print(f"Precalculated retrieved chunks: {len(precalc_reranked_chunks)} entries")
# Print the first few entries to understand the structure
print("Sample entries from precalculated retrieved chunks:")
for key, value in list(precalc_reranked_chunks.items())[:2]:
	print(f"Key: {key}")
	display(value.head(3))
	print("------------------------------------------------------------\n")

Precalculated retrieved chunks: 25 entries
Sample entries from precalculated retrieved chunks:
Key: Who was president of ETH in 2003?


Unnamed: 0,unique_chunk_id,score,chunk_text,title
0,3429_06,0.048395,"The President of the ETH Board, Fritz Schiesse...",New president appointed
1,3795_08,0.047403,"ETH President Ralph Eichler, who is handing ov...",Encouraging more critical opinion
2,3947_09,0.047139,While ETH Zurich has always had an open and in...,A grounded globetrotter


------------------------------------------------------------

Key: Who were the rectors of ETH between 2017 and 2022?


Unnamed: 0,unique_chunk_id,score,chunk_text,title
0,3871_02,0.048916,## Big changes on all study levels\nOver the p...,Eth day 2019
1,2371_01,0.048139,"""I am excited about the opportunity to contrib...",New head of let
2,2814_00,0.047883,"Last Tuesday, the Professors’ Conference of ET...",Guenther dissertori as new eth zurich rector


------------------------------------------------------------



Below we convert the dictionary of questions and answers to a DataFrame and keep only the `question`, `answer` and `evaluation_comments` columns.

In [55]:
# Convert question_answers to a DataFrame with one row per question
qa_df = pd.DataFrame(question_answers).T
# Drop the retrieval-relevance columns, which are not needed for answer evaluation
qa_df = qa_df.drop(['possible_relevant_chunks', 'ground_truth_relevance'], axis=1)
# Display the first few rows of the DataFrame
print("Question-Answers DataFrame:")
qa_df.head()

Question-Answers DataFrame:


Unnamed: 0,question,answer,evaluation_comments
1,Who was president of ETH in 2003?,Olaf Kübler,
2,Who were the rectors of ETH between 2017 and 2...,"Sarah Springman, Günther Dissertori",
3,Who at ETH received ERC grants?,European Research Council grants: Tobias Donne...,The criterion here: does it come up with a lis...
4,When did the InSight get to Mars?,26 November 2018,
5,What did Prof. Schubert say about ﬂying?,Flying is too cheap. If we want to reduce ﬂyin...,


# 2.0 Answer the questions

## 2.1 Let an LLM (OpenAI) answer questions without context

In [None]:
import openai

qa_df['chatgpt_no_context'] = ""

client = openai.OpenAI(api_key=OPENAI_API_KEY)

def ask_chatgpt(question, model="gpt-4o"):
	try:
		response = client.chat.completions.create(
			model=model,
			messages=[{"role": "user", "content": question}],
			temperature=0
		)
		return response.choices[0].message.content.strip()
	except Exception as e:
		return f"Error: {e}"

In [27]:
# Query ChatGPT for each question and store the answer
for idx, row in qa_df.iterrows():
	answer = ask_chatgpt(row['question'])
	# Write to the column initialised above so the answers land in 'chatgpt_no_context'
	qa_df.at[idx, 'chatgpt_no_context'] = answer

In [33]:
qa_df.head()

Unnamed: 0,question,answer,evaluation_comments,chatgpt_no_context
0,Who was president of ETH in 2003?,Olaf Kübler,,"In 2003, the president of ETH Zurich (Swiss Fe..."
1,Who were the rectors of ETH between 2017 and 2...,"Sarah Springman, Günther Dissertori",,"Between 2017 and 2022, the rector of ETH Zuric..."
2,Who at ETH received ERC grants?,European Research Council grants: Tobias Donne...,The criterion here: does it come up with a lis...,ETH Zurich has been successful in securing num...
3,When did the InSight get to Mars?,26 November 2018,,NASA's InSight lander arrived on Mars on Novem...
4,What did Prof. Schubert say about ﬂying?,Flying is too cheap. If we want to reduce ﬂyin...,,"I'm sorry, but I need more context to provide ..."


In [30]:
# save Questions and Answers
qa_df.to_csv(os.path.join(base_folder, 'Stage4/Working-dir/Stage4-01-q-a.csv'), index=False)

## 2.2 Setup for question answering using the retrieved contexts

In [32]:
# Load Questions and Answers
qa_df = pd.read_csv(os.path.join(base_folder, 'Stage4/Working-dir/Stage4-01-q-a.csv'))

For some questions, the month and year of publication may be relevant. We therefore send them to the LLM together with the chunk text and the `unique_chunk_id`.

In [34]:
# Function to return the month as text from the number
def get_month_name(month_number):
    months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
    return months[month_number - 1]

In [36]:
# As an example
chunk_id = "2101_01"
month = df[df['unique_chunk_id']==chunk_id]['month'].item()
year = df[df['unique_chunk_id']==chunk_id]['year'].item()
print(f"Published in: {get_month_name(month)} {year}")

Published in: September 2022
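
One caveat with indexing a list by `month_number - 1` is that a value of 0 silently wraps around to "December" through Python's negative indexing. The sketch below shows a defensive variant; the name `get_month_name_checked` is an assumption for illustration and is not used elsewhere in this notebook.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def get_month_name_checked(month_number) -> str:
    """Return the English month name, rejecting values outside 1-12."""
    month_number = int(month_number)  # also accepts NumPy integers read from a DataFrame
    if not 1 <= month_number <= 12:
        raise ValueError(f"month_number must be in 1..12, got {month_number}")
    return MONTHS[month_number - 1]


print(get_month_name_checked(9))
```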


We provide the relevant chunks in JSON format and also request the answer as valid JSON for easier handling.

In the system prompt we give the LLM guidelines for answering the question. We ask it to answer solely on the basis of the information in the provided contexts. If we relaxed this restriction, the LLM might be able to answer some questions that it currently declines to answer because the contexts lack the information (for example, "Who was president of ETH in 2003?").

In [44]:
def ask_chatgpt_with_rag_context(question: str, top_k: int = 10, model: str = "gpt-4o") -> dict:
    """
    RAG-enhanced call that provides context as JSON and requests a JSON output.
    The response includes the answer and the specific chunks used.
    """
    if question not in precalc_reranked_chunks:
        return {"error": f"No retrieved chunks for: {question}"}

    try:
        # --- 1. Retrieve and Format Context Chunks ---
        top_chunks_df = precalc_reranked_chunks[question].sort_values("score", ascending=False).head(top_k)

        context_chunks_list = []
        for _, row in top_chunks_df.iterrows():
            cid = row["unique_chunk_id"]
            match = df[df["unique_chunk_id"] == cid]

            if not match.empty:
                match_row = match.iloc[0]
                context_chunks_list.append({
                    "unique_chunk_id": cid,
                    "publication_date": f"{get_month_name(match_row['month'])} {match_row['year']}",
                    "text": match_row["chunk_text"]
                })

        # Convert the list of dictionaries to a JSON string
        context_json_str = json.dumps(context_chunks_list, indent=2)

        # --- 2. Construct the System Prompt ---
        system_prompt = f"""You are a professional AI assistant for the ETH Zurich news website.

**Your Task:**
1.  You will be given a user's question and a list of context chunks from news articles in JSON format.
2.  Formulate a short, precise, and professional answer based **exclusively** on the provided context.
3.  Your tone must be neutral and objective, representing ETH Zurich as a whole. Do not favor any disciplines or departments.
4.  Identify the `unique_chunk_id` of every chunk you use to formulate the answer.
5.  Omit boilerplate intros like “Based on the context provided.”
6.  If the context does not contain the information needed to answer, your answer should state that the information is not available in the provided articles.

**Output Format:**
You MUST respond with a single, valid JSON object and nothing else. The schema is:
{{
  "relevant_chunks": ["id_of_chunk_1", "id_of_chunk_2"],
  "answer": "Your concise and professional answer here."
}}"""

        # --- 3. Construct the User Prompt ---
        user_prompt = f"""Context Chunks:
{context_json_str}

Question:
{question}
"""
        # --- 4. Call the OpenAI API with JSON Mode ---
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"} # Enforce JSON output
        )

        # Parse the JSON response from the model
        response_content = response.choices[0].message.content
        return json.loads(response_content)

    except Exception as e:
        return {"error": f"An exception occurred: {e}"}
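
JSON mode guarantees syntactically valid JSON but not that the object follows the schema requested in the system prompt. A minimal sketch of an additional check is shown below; `parse_rag_response` is a hypothetical helper and the sample string is a made-up placeholder, not model output.

```python
import json


def parse_rag_response(raw: str) -> dict:
    """Parse the model output and check that it matches the requested schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    answer = data.get("answer")
    chunks = data.get("relevant_chunks")
    if not isinstance(answer, str):
        raise ValueError("'answer' must be a string")
    if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
        raise ValueError("'relevant_chunks' must be a list of strings")
    return {"answer": answer.strip(), "relevant_chunks": chunks}


# Placeholder response used only to exercise the helper
sample = '{"relevant_chunks": ["2814_00", "4047_00"], "answer": " Example answer. "}'
parsed = parse_rag_response(sample)
print(parsed)
```

Raising a `ValueError` keeps schema violations distinguishable from the API errors that the function above reports through its `error` key.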

Below we ask one of the questions to test the function.

In [45]:
question_to_ask = "Who at ETH received ERC grants?"
response_dict = ask_chatgpt_with_rag_context(question_to_ask)

if 'error' not in response_dict:
    print("Answer from Chatbot:")
    print(response_dict.get("answer"))

    print("\nSources Used (Chunk IDs):")
    print(response_dict.get("relevant_chunks"))

else:
    print(f"An error occurred: {response_dict['error']}")

Answer from Chatbot:
Several researchers at ETH Zurich have received ERC grants. More than 80 researchers have been awarded ERC Advanced Grants. Notable recipients include Professor Ruedi Aebersold and Professor Atac Imamoglu, who have each received an ERC Advanced Grant twice. Martin Vechev received an ERC Starting Grant. Christoph Müller was awarded an ERC Consolidator Grant in 2018. Recently, Barbara Treutlein and Nicolas Noiray received ERC Synergy Grants for their projects 'AxoBrain' and 'HYROPE', respectively.

Sources Used (Chunk IDs):
['4007_07', '3509_00', '3087_01', '3163_00', '4139_00', '3888_06']


## 2.3 Process all questions with an LLM using the retrieved contexts

Below we inspect the chunks cited by the LLM to check whether they actually contain the information on which its answer is based.

For this question the retrieved (and reranked) chunks provide a very good basis for answering the question.

In [48]:
for chunk_id in response_dict.get("relevant_chunks"):
    row = df[df["unique_chunk_id"] == chunk_id]
    print(f"Chunk ID: {chunk_id}")
    pprint.pprint(row["chunk_text"].item())

Chunk ID: 4007_07
('## Benchmark for top researchers: ERC Grants\n'
 'ETH researchers have been successfully applying for EU funding – ERC '
 'Research Grants – since 2007. More than 80 researchers at ETH Zurich have '
 'received an ERC Advanced Grant. In addition to the Advanced Grants, the '
 'European Research Council also annually awards Starting Grants to young '
 'researchers at the beginning of their careers and Consolidator Grants to '
 'more established researchers to further develop their own group. '
 'Furthermore, the numerous ERC Proof of Concept Grants (funding for the '
 'preparation of feasibility studies and business plans) awarded to ETH Zurich '
 'show that basic research is often used in market innovations with '
 'corresponding economic benefits.\n'
 '## Downloads\n'
 'DownloadPress release (PDF, 122 KB)vertical\\_align\\_bottom\n'
 '## Contact\n'
 'ETH Zürich\n'
 ' Corporate Communications\n'
 ' Media Relations\n'
 ' Tel.: +41 44 632 41 41')
Chunk ID: 3509_00
('ER
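
Besides reading the chunk texts, it may be worth checking that every cited ID was actually part of the context sent to the model. The sketch below assumes the supplied and cited IDs are available as plain lists; `split_cited_chunks` is a hypothetical helper and `"9999_99"` is an invented ID standing in for one the model made up.

```python
def split_cited_chunks(cited_ids, supplied_ids):
    """Split cited IDs into those that were supplied as context and those that were not."""
    supplied = set(supplied_ids)
    known = [cid for cid in cited_ids if cid in supplied]
    unknown = [cid for cid in cited_ids if cid not in supplied]
    return known, unknown


# Illustrative IDs only
supplied_ids = ["4007_07", "3509_00", "3087_01"]
cited_ids = ["4007_07", "9999_99"]
known, unknown = split_cited_chunks(cited_ids, supplied_ids)
print("Supplied and cited:", known)
print("Cited but never supplied:", unknown)
```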

Below we iterate over the list of questions and print the answers and the IDs of the relevant chunks. We also save the answers in `qa_df`.

In [50]:
for idx, row in qa_df.iterrows():
    q = row["question"]
    print(f"⏳ Answering: {q[:80]}...")
    answer_dict = ask_chatgpt_with_rag_context(q, top_k=10)
    # The function signals failure through an "error" key in the returned dict
    if "error" in answer_dict:
        print(f"❌ Error for question {q}")
        pprint.pprint(answer_dict)
        qa_df.at[idx, 'GPT_4o_RAG_response'] = ""

    else:
        print("✓ Success. Response received:")
        pprint.pprint(answer_dict)
        qa_df.at[idx, 'GPT_4o_RAG_response'] = answer_dict.get("answer")

⏳ Answering: Who was president of ETH in 2003?...
✓ Success. Response received:
{'answer': 'The information about who was president of ETH Zurich in 2003 is '
           'not available in the provided articles.',
 'relevant_chunks': []}
⏳ Answering: Who were the rectors of ETH between 2017 and 2022?...
✓ Success. Response received:
{'answer': 'Sarah Springman served as the Rector of ETH Zurich from 2015 until '
           '2022. Günther Dissertori was nominated to succeed her as Rector in '
           '2021.',
 'relevant_chunks': ['2814_00', '4047_00']}
⏳ Answering: Who at ETH received ERC grants?...
✓ Success. Response received:
{'answer': 'Several researchers at ETH Zurich have received ERC grants. More '
           'than 80 researchers have been awarded ERC Advanced Grants. Notable '
           'recipients include Professor Ruedi Aebersold and Professor Atac '
           'Imamoglu, who have each received an ERC Advanced Grant twice. '
           'Martin Vechev received an ERC Starti

In [54]:
qa_df.head()

Unnamed: 0,question,answer,evaluation_comments,chatgpt_no_context,GPT_4o_RAG_response
0,Who was president of ETH in 2003?,Olaf Kübler,,"In 2003, the president of ETH Zurich (Swiss Fe...",The information about who was president of ETH...
1,Who were the rectors of ETH between 2017 and 2...,"Sarah Springman, Günther Dissertori",,"Between 2017 and 2022, the rector of ETH Zuric...",Sarah Springman served as the Rector of ETH Zu...
2,Who at ETH received ERC grants?,European Research Council grants: Tobias Donne...,The criterion here: does it come up with a lis...,ETH Zurich has been successful in securing num...,Several researchers at ETH Zurich have receive...
3,When did the InSight get to Mars?,26 November 2018,,NASA's InSight lander arrived on Mars on Novem...,The InSight lander successfully landed on Mars...
4,What did Prof. Schubert say about ﬂying?,Flying is too cheap. If we want to reduce ﬂyin...,,"I'm sorry, but I need more context to provide ...",Prof. Renate Schubert expressed that while sur...


In [52]:
# save Questions and Answers
qa_df.to_csv(os.path.join(base_folder, 'Stage4/Working-dir/Stage4-02-q-a.csv'), index=False)

## Chapter xyz

In [53]:
# Load Questions and Answers
qa_df = pd.read_csv(os.path.join(base_folder, 'Stage4/Working-dir/Stage4-02-q-a.csv'))