# Exam (morning): Retrieval Augmented Generation

### Personal Details (please complete)
Double Click on Cell to edit.

<table>
  <tr>
    <td>First Name: </td>
    <td>Jaden</td>
  </tr>
  <tr>
    <td>Last Name:</td>
    <td>Donati</td>
  </tr>
  <tr>
    <td>Student ID:</td>
    <td>22582407</td>
  </tr>
  <tr>
    <td>Modul:</td>
    <td>Machine Learning 2</td>
  </tr>
  <tr>
    <td>Exam Date / Raum / Zeit:</td>
    <td>20.05.2025 / Raum: SM O2.01  / 10:15 – 11:30</td>
  </tr>
  <tr>
    <td>Erlaubte Hilfsmittel:</td>
    <td>w.3ML2-WIN (Machine Leaning 2)<br>Open Book, Personal Computer, Internet Access</td>
  </tr>
  <tr>
  <td>Not allowed:</td>
  <td>The use of any form of generative AI (e.g., Copilot, ChatGPT) to assist in solving the exercise is not permitted. <br> However, using such tools as part of the exercise itself (e.g., making API calls to them if required by the task) is allowed. <br> Any form of communication or collaboration with other people is not permitted.</td>
</tr>
</table>

## Evaluation Criteria

### <b style="color: gray;">(maximum achievable points: 48)</b>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Description</th>
      <th>Points Distribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Code not executable or results not meaningful</td>
      <td>The code contains errors that prevent it from running (e.g., syntax errors) or produces results that do not fit the question.</td>
      <td>0 points</td>
    </tr>
    <tr>
      <td>Code executable, but with serious deficiencies</td>
      <td>The code runs, but the results are incomplete due to major errors (e.g., fundamental errors when reading the data). Only minimal progress is evident.</td>
      <td>25% of the maximum achievable points</td>
    </tr>
    <tr>
      <td>Code executable, but with moderate deficiencies</td>
      <td>The code runs and delivers partially correct results, but there are significant errors (e.g., the data types of the imported data do not meet the requirements of the question). The results are comprehensible but incomplete or inaccurate.</td>
      <td>50% of the maximum achievable points</td>
    </tr>
    <tr>
      <td>Code executable, but with minor deficiencies</td>
      <td>The code runs and delivers a largely correct result, but minor errors (e.g., column name misspelled, timestamp not correctly formatted) affect the completeness of the result.</td>
      <td>75% of the maximum achievable points</td>
    </tr>
    <tr>
      <td>Code executable and correct</td>
      <td>The code runs flawlessly and delivers the correct result without deficiencies.</td>
      <td>100% of the maximum achievable points</td>
    </tr>
  </tbody>
</table>



## Python Libraries und Settings

## <b>Set Up (This part will <u>not</u> be evaluated!)</b>

#### <b>1.) Start a GitHub Codespaces instance based on your fork of this GitHub repository or open the notebook in Colab</b>
#### <b>2.) Add API keys to either .env files for Codespaces or to the secrets for Colab</b>
#### <b>3.) Please execute the two code cells below as soon as the Codespace/Colab has started and install the libraries</b>

In [1]:
!python3 -m pip install --upgrade pip
!pip install PyPDF2
!pip install langchain-community
!pip install faiss-cpu
!pip install groq
!pip install openai
!pip install tqdm
!pip install sentence-transformers
!pip install huggingface_hub[hf_xet]
!pip install faiss-cpu
!pip install google-generativeai

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m53.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.0.1
    Uninstalling pip-25.0.1:
      Successfully uninstalled pip-25.0.1
Successfully installed pip-25.1.1
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-core<1.0.0,>=0.3.59 (from langchain-community)
  Downloading langchain_core-0.3.60-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain<1.0.0,>=0.3.25 (from langchain-community)
  Downloading langchain-0.3

In [2]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import openai
import tqdm
import glob
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
import pickle
import google.generativeai as genai
from groq import Groq




  from .autonotebook import tqdm as notebook_tqdm


In [3]:
load_dotenv()
groq_key = os.getenv("GROQ_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")


## <b>Tasks (This part will be evaluated!)</b>
### Notes on the following tasks:

In this part of the exam, you will build a Retrieval-Augmented Generation (RAG) pipeline that efficiently retrieves medical information from the package inserts of common medications. Imagine you are developing a system for pharmacists or medical professionals to quickly and accurately answer questions about medications. The following five package inserts are provided as your data source:

- [data/Amoxicillin.pdf](data/Amoxicillin.pdf)
- [data/bisoprolol.pdf](data/bisoprolol.pdf)
- [data/citalopram.pdf](data/citalopram.pdf)
- [data/metformin.pdf](data/metformin.pdf)
- [data/paracetamol.pdf](data/paracetamol.pdf)

Your task is to implement a RAG pipeline that retrieves relevant information from these package inserts and integrates it into the answer generation process. Use the provided instructions and your knowledge from the exercises.

### Expected Results:

1. Read in the provided package inserts and extract all text.
2. Split the extracted text into manageable chunks using a text splitter (e.g., `RecursiveCharacterTextSplitter`).
3. Create embeddings for the text chunks using a suitable model.
4. Index the embeddings in a vector store (e.g., FAISS).
5. Develop an appropriate prompt template.
6. Build the RAG chain.
7. Automatically generate a list of 10 test questions using a language model.
8. Let your RAG pipeline answer the 10 generated questions.

### Submission documents:

Your submission should include:
- The completed notebook (this file).
- the vector store

<b style="color:blue;">Notes on the following tasks:</b>
<ul style="color:blue;">
  <li>Pay attention to the specific details provided for each task.</li>
  <li>Solve each task using Python code. Integrate your code into the code cells for each task.</li>
  <li>Present your solution(s) as requested in each task.</li>
</ul>

#### <b>Task (1): Read all 5 PDFs from the 'data' folder and store their content for further use</b>
<b>Task details:</b>
- The files are located in the 'data' folder..
- Display the length of the resulting string (number of characters).
- Show the first 100 characters in the notebook output.
<b style="color: gray;">(max. points: 2)</b>

In [4]:
# Definiert ein Dateipfad-Muster, um alle PDF-Dateien im Ordner "data" zu finden.
# 💡 Hinweis für die Prüfung: Du kannst diesen Pfad anpassen, wenn deine PDFs in einem anderen Verzeichnis liegen.
glob_path = "data/*.pdf"

# Initialisiert eine leere Zeichenkette, in der später der extrahierte Text gespeichert wird.
text = ""

# Iteriert über alle PDF-Dateien, die dem Pfad-Muster entsprechen.
# tqdm zeigt einen Fortschrittsbalken an – hilfreich bei vielen Dateien.
for pdf_path in tqdm.tqdm(glob.glob(glob_path)):

    # Öffnet die aktuelle PDF-Datei im Lesemodus ("rb" = read binary).
    with open(pdf_path, "rb") as file:

        # Initialisiert den PDF-Reader für die geöffnete Datei.
        reader = PdfReader(file)

        # Extrahiert den Text aus jeder Seite der PDF, falls Text vorhanden ist,
        # und hängt ihn an die `text`-Variable an (Seiten werden mit Leerzeichen verbunden).
        text += " ".join(page.extract_text() for page in reader.pages if page.extract_text())



100%|██████████| 5/5 [00:02<00:00,  2.07it/s]


In [5]:
# Show the number of characters in the text
print(f"Number of characters in the entire text: ")

# Show the first 100 characters of the text
print(f"The first 100 characters of the text:")

Number of characters in the entire text: 
The first 100 characters of the text:


#### <b>Task (2): Split the text into chunks appropriate for the task. Specify an overlap as well. Give a reason for your choice</b>
<b>Task details:</b>
- Use the data from the previous task.
- Show the total number of chunks in the notebook.
- Show the length of the first chunk in the notebook.
- Explain you reasoning
<b style="color: gray;">(max. points: 4)</b>

In [None]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:


# 🔧 Erstelle einen Text-Splitter mit folgenden Parametern:
# - chunk_size: max. 2000 Zeichen pro Chunk
# - chunk_overlap: jeweils 200 Zeichen Überlappung zwischen zwei Chunks
#   → sorgt dafür, dass der Kontext beim Übergang zwischen Chunks nicht verloren geht
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,         # Maximale Länge eines Chunks in Zeichen
    chunk_overlap=200        # Überlappung zwischen benachbarten Chunks
)

# ✂️ Teile den extrahierten PDF-Text mithilfe des Splitters in kleinere, überlappende Textabschnitte
chunks = splitter.split_text(text)

# 🧾 Jetzt enthält die Variable 'chunks' eine Liste von Textabschnitten, die jeweils max. 2000 Zeichen

In [9]:
# 📊 Zeige die Gesamtanzahl der erzeugten Text-Chunks an
# Das ist hilfreich zur Kontrolle, wie viele Abschnitte aus dem PDF-Text entstanden sind
print(f"Total chunks: {len(chunks)}")

# 👁️‍🗨️ Zeige eine Vorschau auf den ersten Chunk (die ersten 200 Zeichen)
# So kannst du kontrollieren, ob die Aufteilung sinnvoll funktioniert hat
print("Preview of the first chunk:", chunks[0][:200])

Total chunks: 98
Preview of the first chunk: Inhaltsverzeichnis
Zusammensetzung
Darreichungsform und Wirkstoffmenge pro Einheit
Indikationen/Anwendungsmöglichkeiten
Dosierung/Anwendung
Kontraindikationen
Warnhinweise und Vorsichtsmassnahmen
Inte


In [None]:
print("Preview of the first chunk:", chunks[0].__sizeof__)
#NOTIZ FÜR MICH Code eventuell noch löschen. Und Grund angeben für Chunk überlappung.

Preview of the first chunk: <built-in method __sizeof__ of str object at 0x5f8239593190>


In [None]:
# Show the total number of chunks
print(f"Number of chunks: ")

# Show the length of the first chunk
print(f"Length of the first chunk: ")

#NOTIZ FÜR MICH LÄNGE NOCH ANGEBEN

##### Explanation (double click and add text):

#### <b>Task (3): Initialize an embedding model</b>
<b>Task details:</b>
- Choose a suitable embedding model from Huggingface.
- [Huggingface models](https://huggingface.co/spaces/mteb/leaderboard).
- Consider the size of the model. It should be runnable in your Codespace.
- Choose a model appropriate for the data.

<b style="color: gray;">(max. points: 2)</b>

In [12]:
# Definiere den Namen des Embedding-Modells, das verwendet werden soll.
# Dieses Modell wurde trainiert, um semantisch ähnliche Sätze in ähnliche Vektoren zu übersetzen.
# "multilingual" bedeutet, dass es mit mehreren Sprachen (z. B. Englisch, Deutsch) umgehen kann.
model_name = "paraphrase-multilingual-MiniLM-L12-v2"

# Lade das ausgewählte Modell mit der SentenceTransformer-Bibliothek.
# Es wird intern von HuggingFace geladen und kann sofort zur Vektorisierung verwendet werden.
model = SentenceTransformer(model_name)

# Erzeuge Embeddings (Vektoren) für alle Text-Chunks.
# convert_to_numpy=True sorgt dafür, dass du ein NumPy-Array zurückbekommst (praktisch für spätere Verarbeitung).
chunk_embeddings = model.encode(chunks, convert_to_numpy=True)


In [13]:
# 📐 Ermittle die Anzahl der Dimensionen eines einzelnen Embedding-Vektors
# chunk_embeddings ist ein 2D-Array mit der Form (Anzahl der Chunks, Anzahl der Dimensionen)
# z. B. (120, 384) → 120 Chunks, jeder als Vektor mit 384 Werten

d = chunk_embeddings.shape[1]  # Index [1] gibt die Spaltenanzahl = Vektor-Dimension

# 🖨️ Gib die Dimension des Embeddings aus (wichtig für FAISS oder Ähnlichkeitsvergleiche)
print(d)

384


#### <b>Task (4): Create a vector store</b>
<b>Task details:</b>
- Create a vector store
- store the vector store (this is also helpful in case the codespace or colab needs a restart)
<b style="color: gray;">(max. achievable points: 6)</b>

In [14]:
# 📦 Erstelle einen FAISS-Index zur schnellen Ähnlichkeitssuche
# Wir verwenden hier "IndexFlatL2", der auf der euklidischen Distanz (L2) basiert.
# Der Parameter `d` gibt die Anzahl der Dimensionen pro Vektor an (z. B. 384 bei deinem Modell).
index = faiss.IndexFlatL2(d)

# ➕ Füge alle zuvor erzeugten Embedding-Vektoren in den Index ein
# Dadurch kann FAISS später Anfragen (Queries) mit diesen vergleichen
index.add(chunk_embeddings)

# 🔢 Gib aus, wie viele Vektoren im Index gespeichert sind
# Sollte gleich der Anzahl deiner Chunks sein
print("Number of embeddings in FAISS index:", index.ntotal)

Number of embeddings in FAISS index: 98


In [16]:
# 💾 Speichere den FAISS-Index auf der Festplatte
# → Damit musst du den Index beim nächsten Mal nicht neu berechnen
#    (spart Zeit beim späteren Wiederverwenden)
faiss.write_index(index, "faiss/faiss_index.index")

# 🗂️ Speichere zusätzlich die Text-Chunks als Mapping (Index → Originaltext)
# → So kannst du später zu jedem Treffer die zugehörige Textpassage finden
with open("faiss/chunks_mapping.pkl", "wb") as f:
    pickle.dump(chunks, f)  # Serialisiere und speichere die Liste der Chunks

#### <b>Task (5): Create a retriever function.</b>
<b>Task details:</b>
- Create a retriever function
- Define the number of documents the retriever should return.
- Test the retriever with the following query: `"Welche Dosierung von Amoxicillin Axapharm wird für die Behandlung einer Endokarditis-Prophylaxe bei Erwachsenen empfohlen?"`
- If the retrieved chunks are not relevant, increase the number of chunks to be retrieved and repeat the query. 
- It does not have to be perfect; if nothing improves, continue with the current result.
<b style="color: gray;">(max. achievable points: 6)</b>

In [None]:
def retrieve_texts(query, k, index, chunks, model):

In [None]:
query = "Welche Dosierung +von Amoxicillin Axapharm wird für die Behandlung einer Endokarditis-Prophylaxe bei Erwachsenen empfohlen?"

In [None]:

# Testen des retrievers
retrieved_texts = 

print(retrieved_texts)
print(len(retrieved_texts))

#### <b>Task (6): Implement a reusable RAG function and prompt template</b>
<b>Task details:</b>
- Write a function `get_answer_and_documents` that answers a question using your RAG pipeline.
- The function should:
  - Take as parameters: the question (`question`), the number of documents to retrieve (`k`), the FAISS index (`index`), and the list of text chunks (`chunks`).
  - The prompt template should be tailored to the medical context, address medical professionals, and instruct the model to answer concisely and in German, using only the provided context. This is part of the task.
  - Return both the answer and the retrieved documents.
- Test the function with the question: `Ab welcher Kreatinin-Clearance ist die Einnahme von Metformin kontraindiziert?`

<b style="color: gray;">(max. achievable points: 8)</b>

In [None]:
# set language model and output parser
def answer_query(query, k, index,texts):

    return answer


In [None]:
# Test query
query = "Ab welcher Kreatinin-Clearance ist die Einnahme von Metformin kontraindiziert?"

In [None]:
# print result of test query with your chain (hint: input is a dictionary)
print(answer_query(query, 4, index, chunks))

#### <b>Task (7): Implement a HyDE Query Transformation for RAG</b>
<b>Task details:</b>
- Implement a function that applies the HyDE strategy in your RAG pipeline.
- add your HyDe transformation to your pipeline
- Display the intermediate transformation (print statement within function is enough) and the final answer in the notebook.
<b style="color: gray;">(max. achievable points: 6)</b>

In [None]:
def rewrite_query_hyde(query):
    
    return new_query

In [None]:
def answer_query_with_rewriting(query, k, index, texts):
    
    return answer

In [None]:
query = "Was ist der wichtigste Faktor bei der Diagnostizierung von Asthma?"
answer = answer_query_with_rewriting()
print("LLM Answer:", answer)

#### <b>Task (7): Generate a list of test questions</b>
<b>Task details:</b>
- Create a Python list with 10 questions about the provided medications.
- The questions should be automatically generated using a language model.
- You may use chunks from the package inserts as inspiration, but this is not required.
- At the end, print out your list of questions.

<b style="color: gray;">(max. achievable points: 6)</b>

In [None]:
for i, question in enumerate(questions):
    i +=1
    print("Frage " + str(1) + ": " + question)

#### <b>Task (8): Let your retriever answer the 10 generated questions.</b>
<b>Task details:</b>
- Use the 10 generated questions and have them answered by your RAG chain.
- For each question, output both the retrieved documents and the answer.
- Provide your own assessment of whether your chain works well or not.
- Give an example of what worked well and what did not.

<b style="color: gray;">(max. achievable points: 6)</b>

In [None]:
# Beantwortung der 10 generierten Fragen

for question in questions:  # Questions list from Aufgabe (7)
    
    # Use the RAG chain to get an answer for the question
    answer =  answer_query_with_rewriting(question, 4, index, chunks)
    print(answer)

#### <b> TASK (9) Your assessment of the quality (double-click to edit the cell below):</b>

- Briefly describe what seems to work well in your RAG pipeline based on the answers to the 10 generated questions above.
- Give at least one example of a question/answer pair that worked particularly well.
- Point out at least one aspect or example where the pipeline could be improved or did not work as expected.

<b style="color: gray;">(max. achievable points: 2)</b>

### Jupyter notebook --footer info-- (please always provide this at the end of each notebook)

In [None]:
import os
import platform
import socket
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('IP Address:', socket.gethostbyname(socket.gethostname()))
print('-----------------------------------')