<a href="https://colab.research.google.com/github/JackGraymer/Advanced-GenAI/blob/main/2.2_response_synthesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Generative Artificial Intelligence
**Project - Designing a RAG-Based Q&A System for News Retrieval**

**Authors:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan (Group 5)


# Step 2.2 Aggregation and Response Synthesis (Post-Retrieval) -<br> Merging retrieved results and applying advanced post-retrieval techniques

**Contribution:** Vsevolod Mironov, Pascal Küng, Alvaro Cervan

**Goal of this step:** This step covers post-retrieval processing. After an initial set of candidate chunks has been retrieved, the results are refined before they are passed to the LLM. The primary focus is re-ranking: a dedicated model reorders the retrieved chunks by their relevance to the question. Other techniques, such as summarizing chunks or fusing information from several chunks, are also explored.
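
To make the idea of re-ranking concrete, here is a minimal sketch under stated assumptions. It is not the re-ranker used later in this project: `overlap_score` is a hypothetical stand-in for a real re-ranking model, `rerank` is an illustrative helper, and the chunk IDs and texts are toy values invented for the example.

```python
import re


def overlap_score(query: str, chunk_text: str) -> float:
    """Toy relevance score: fraction of query tokens that also occur in the chunk.

    Hypothetical stand-in for a real re-ranking model.
    """
    query_tokens = set(re.findall(r"\w+", query.lower()))
    chunk_tokens = set(re.findall(r"\w+", chunk_text.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


def rerank(query, candidates, score_fn=overlap_score, top_k=None):
    """Score every candidate chunk and return (chunk_id, score) pairs, best first.

    candidates: dict mapping chunk_id -> chunk_text (assumed layout).
    """
    scored = [(chunk_id, score_fn(query, text)) for chunk_id, text in candidates.items()]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy candidates (invented IDs and texts, not taken from the dataset)
toy_candidates = {
    "a_00": "The lander touched down on Mars in November 2018.",
    "a_01": "Climate models have improved considerably since 1950.",
    "a_02": "InSight is a Mars lander that studies the planet's interior.",
}

reranked = rerank("When did the InSight get to Mars?", toy_candidates, top_k=2)
print(reranked)
```

Because the scoring function is passed in as `score_fn`, a model-based scorer could be swapped in without changing the sorting and truncation logic.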

## 1.1 Setup of the environment

### Setting seeds and mounting Google Drive storage folder

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
base_folder = '/content/drive/MyDrive/AdvGenAI'

In [30]:
# Standard library
import os
import pickle
import random
import re
import tempfile

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

# Register progress_apply() on pandas objects
tqdm.pandas()

In [2]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [3]:
!pip list

Package                               Version
------------------------------------- -------------------
absl-py                               1.4.0
accelerate                            1.7.0
aiohappyeyeballs                      2.6.1
aiohttp                               3.11.15
aiosignal                             1.3.2
alabaster                             1.0.0
albucore                              0.0.24
albumentations                        2.0.7
ale-py                                0.11.0
altair                                5.5.0
annotated-types                       0.7.0
antlr4-python3-runtime                4.9.3
anyio                                 4.9.0
argon2-cffi                           23.1.0
argon2-cffi-bindings                  21.2.0
array_record                          0.7.2
arviz                                 0.21.0
astropy                               7.1.0
astropy-iers-data                     0.2025.5.19.0.38.36
astunparse                            1

In [6]:
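# Seed Python's and NumPy's random number generators for reproducible results
seed_value = 2138247234
random.seed(seed_value)
np.random.seed(seed_value)
# Note: PYTHONHASHSEED is read at interpreter start-up, so setting it here
# only affects subprocesses launched from this notebook.
os.environ['PYTHONHASHSEED'] = str(seed_value)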
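
As a quick sanity check on what seeding buys, the sketch below re-seeds Python's `random` module and confirms that the same draws come back. The helper name `set_seed` is hypothetical and not part of the notebook; NumPy is left out so that the example depends only on the standard library.

```python
import os
import random


def set_seed(seed_value: int) -> None:
    """Hypothetical helper: seed Python's RNG and export PYTHONHASHSEED."""
    random.seed(seed_value)
    # PYTHONHASHSEED is read when an interpreter starts, so this line only
    # influences child processes, not the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed_value)


set_seed(2138247234)
first_draws = [random.random() for _ in range(3)]

set_seed(2138247234)
second_draws = [random.random() for _ in range(3)]

# Re-seeding with the same value reproduces the same sequence.
print(first_draws == second_draws)
```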

In [9]:
# Load the Q&A data (questions, answers, candidate chunks, ground-truth relevance)
with open(os.path.join(base_folder, 'Stage2/Working-dir/Stage2-08-q-a-file-with-relevancy.pkl'), 'rb') as f:
    Q_A_ground_thruth_relevancy_dict = pickle.load(f)

In [14]:
QQ = pd.DataFrame.from_dict(Q_A_ground_thruth_relevancy_dict, orient='index')
QQ.head()

Unnamed: 0,question,answer,possible_relevant_chunks,ground_truth_relevance,evaluation_comments
1,Who was president of ETH in 2003?,Olaf Kübler,"{3016_02, 2929_06, 2993_04, 3034_03, 4166_23, ...","{'3318_03': 0.0, '2550_02': 0.0, '3408_02': 0....",
2,Who were the rectors of ETH between 2017 and 2...,"Sarah Springman, Günther Dissertori","{3293_02, 2308_09, 3138_01, 2806_02, 3253_03, ...","{'3234_12': 0.0, '3318_03': 0.0, '0013_13': 0....",
3,Who at ETH received ERC grants?,European Research Council grants: Tobias Donne...,"{3021_07, 1186_00, 0909_01, 3985_06, 3415_08, ...","{'2478_00': 0.0, '3788_00': 0.0, '3899_03': 0....",The criterion here: does it come up with a lis...
4,When did the InSight get to Mars?,26 November 2018,"{4180_00, 0932_00, 3838_07, 1087_02, 2855_01, ...","{'1098_04': 0.0, '0188_03': 0.0, '3592_05': 0....",
5,What did Prof. Schubert say about ﬂying?,Flying is too cheap. If we want to reduce ﬂyin...,"{2343_00, 4193_04, 3986_11, 3395_06, 0180_02, ...","{'1376_09': 0.0, '2888_02': 0.0, '3999_05': 0....",


In [15]:
print(QQ.columns)

Index(['question', 'answer', 'possible_relevant_chunks',
       'ground_truth_relevance', 'evaluation_comments'],
      dtype='object')


In [16]:
df_chun = pd.read_csv(os.path.join(base_folder, 'Stage2/Working-dir/Stage2-02-chunked-dataset.csv'))

In [17]:
df_chun.head()

Unnamed: 0,unique_chunk_id,chunk_text,chunk_length,total_chunks,folder_path,file_name,year,month,language,type,title,text_id,chunk_id
0,0000_00,"Als 1950 die Meteorologen Jule Charney, Ragnar...",563,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,0
1,0000_01,## Erstaunliche Entwicklung der Klimamodelle\n...,804,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,1
2,0000_02,"«Alle Modelle sind falsch, aber einige sind nü...",881,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,2
3,0000_03,"Doch um die Gitterweite verkleinern zu können,...",536,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,3
4,0000_04,Bis ein hochaufgelöstes Modell auf einer neuen...,466,8,/content/drive/MyDrive/AdvGenAI/data/de_news_e...,blog-knutti-klimamodelle.html,2019,8,de,news events,Blog knutti klimamodelle,0,4


In [18]:
print(df_chun.columns)

Index(['unique_chunk_id', 'chunk_text', 'chunk_length', 'total_chunks',
       'folder_path', 'file_name', 'year', 'month', 'language', 'type',
       'title', 'text_id', 'chunk_id'],
      dtype='object')


In [23]:
# Check the container type of the candidate chunk IDs for question 1
type(QQ.loc[1, 'possible_relevant_chunks'])

set

In [29]:
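# Peek at one candidate chunk ID for question 2.
# Sets are unordered, so which ID comes first is arbitrary.
list(QQ.loc[2, 'possible_relevant_chunks'])[0]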

'3293_02'
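
Since the candidate IDs are stored in a `set`, iteration order should not be relied on. The sketch below shows one way to get a deterministic order and to combine the IDs with a relevance dictionary shaped like `ground_truth_relevance`. The IDs and scores are toy values invented for illustration, and treating any score above zero as "relevant" is an assumption rather than something the dataset states.

```python
# Toy stand-ins for one row of QQ (invented IDs and scores)
possible_relevant_chunks = {"t_02", "t_00", "t_01"}
ground_truth_relevance = {"t_00": 0.0, "t_01": 1.0, "t_02": 0.5}

# sorted() turns the unordered set into a reproducible list
ordered_ids = sorted(possible_relevant_chunks)
first_id = ordered_ids[0]

# Assumption: a score above 0.0 marks a chunk as relevant;
# IDs missing from the dictionary are treated as not relevant.
relevant_ids = [
    chunk_id
    for chunk_id in ordered_ids
    if ground_truth_relevance.get(chunk_id, 0.0) > 0.0
]

print(first_id, relevant_ids)
```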