# Evaluated Exercise - Part 1: Databases

Please upload an html-version of your final version into the drop-zone in Moodle. If you have any issues with it, send the final version to guido.moeser@gmail.com.   
Release Date: 2025-10-07

## Final Submission Instructions

1. Complete all sections in the notebook.
2. Add explanations of all parts. **Explanations are the most important part for the grading.**
3. Comment on which configuration you found best and why. **Comments are the second most important part for the grading.**
4. Export your Jupyter Notebook to HTML and send the HTML-version.


# Topic: Building a simple Retrieval-Augmented Retrieval System with SQLite (in-memory)

## 1 – Import the required packages

**Required packages etc:**
- sqlite3
- pandas
- numpy
- `cosine_similarity` from `sklearn.metrics.pairwise`
- the `SentenceTransformer` class from the `sentence_transformers` package


In [17]:
# Load the minimal set of packages we will need

'''
Dependencies for this project are managed with pixi, chosen here for its versatility.
https://pixi.sh/dev/
'''
# Uncomment the line below to install the required libraries if you use the pixi package manager
# %pixi add pandas numpy sentence-transformers scikit-learn
import numpy as np
import pandas as pd
import sqlite3
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

**Explanation:**
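- We use `sqlite3` from the Python standard library and the data-science packages pandas, numpy, sentence-transformers, and scikit-learn (for `cosine_similarity`).
- `torch` is installed as a dependency of sentence-transformers and is imported only to select the compute device.
- No further dependencies are needed.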

## 2 – Provide your own texts

Please insert 10 texts of length > 200 words (news articles, wikipedia article (parts of it), product descriptions, product reviews etc.).

In [18]:
# Each text sample is about 250 words long and discusses the critical reception of Harry Potter.
texts = [
    "The Harry Potter series is one of the most influential works of modern literature, and its impact extends far beyond fantasy fiction. J.K. Rowling’s ability to construct a detailed magical world that coexists with our own is nothing short of remarkable. From the moment readers board the Hogwarts Express, they are drawn into a setting filled with mystery, friendship, and wonder. The series’ appeal lies in its balance between escapism and emotional realism. Beneath the spells and potions, Rowling crafts a story about belonging, courage, and moral choices. Harry’s journey from an unwanted orphan to a hero who chooses compassion over revenge resonates deeply with readers of all ages. The books also celebrate education and curiosity — Hogwarts becomes a symbol of learning and self-discovery. However, what truly cements Harry Potter’s place in literary history is how it rekindled a love for reading among young audiences in a digital age. Millions of children who once disliked books found themselves captivated by this magical universe. For many, it was their first major reading adventure. While some critics question its literary sophistication, its cultural and emotional influence is undeniable. Harry Potter reminds us that imagination is a form of power — one capable of inspiring hope, unity, and resilience in a world that often feels divided.",
    "At its core, Harry Potter is not merely a fantasy series; it’s a profound coming-of-age narrative. The magic, creatures, and enchanted castles are captivating, but the emotional depth behind them is what makes the series timeless. Readers grow alongside Harry — from a lonely, mistreated child living in a cupboard to a young adult facing moral dilemmas and loss. Rowling’s genius lies in her ability to mirror the process of growing up through magic: each spell learned, each friendship tested, and each battle fought symbolizes lessons about identity and courage. The early books are playful, full of wonder and innocence, but as the series progresses, the tone darkens — reflecting both the maturation of the characters and the readers themselves. The deaths of beloved figures like Sirius and Dumbledore are not just plot points; they teach the painful but essential truth that growing up often means facing grief and uncertainty. The beauty of the story lies in its emotional honesty — magic doesn’t erase pain, but it gives people strength to face it. The final book, The Deathly Hallows, brings everything full circle, showing that real heroism comes from love, sacrifice, and self-understanding. For anyone who grew up with these books, Harry Potter is not just fiction; it’s a mirror of adolescence, courage, and transformation.",
    "The Harry Potter franchise is a cultural milestone unlike any other in recent literary history. When The Philosopher’s Stone was released in 1997, few could have predicted that it would spark a global phenomenon spanning books, films, theme parks, and countless fan communities. The series not only redefined the fantasy genre but also reshaped publishing, proving that children’s literature could captivate readers of all ages. Midnight book releases became global events; fans dressed in robes and waved wands, demonstrating the unparalleled reach of Rowling’s world-building. The story’s universal themes — love, friendship, identity, and resistance against tyranny — make it relatable across generations and cultures. Harry Potter has become more than a story; it’s a shared cultural experience. However, with such influence comes controversy. Some critics argue that the franchise’s commercialization diluted its original message, while others challenge Rowling’s personal views and their impact on the fandom. Yet, the community’s resilience — seen in fan art, fanfiction, and activism — shows how readers have made the world their own. The Harry Potter phenomenon is proof that storytelling has the power to unite people, to inspire movements, and to build lasting cultural legacies. Whether you read it as a child or discovered it later in life, the series holds a special kind of magic that transcends pages and screens.",
    "While Harry Potter is celebrated worldwide, it’s not immune to critical examination. From a literary standpoint, Rowling’s prose is often straightforward, prioritizing accessibility over stylistic complexity. Some argue that this simplicity makes the books less sophisticated compared to other fantasy epics like The Lord of the Rings or His Dark Materials. Moreover, critics point to certain limitations in world-building — particularly the underdeveloped treatment of non-European cultures and magical systems outside Britain. The series’ moral framework is also sometimes questioned. The dichotomy between “good” and “evil” can feel overly simplistic, with Slytherin House frequently vilified despite its potential for nuanced moral exploration. Additionally, Rowling’s representation of diversity has been criticized. The lack of significant non-white or LGBTQ+ characters has prompted important discussions about inclusivity in modern fantasy. Still, these criticisms do not erase the series’ achievements. Rowling succeeded in creating an emotionally compelling narrative that inspired millions to read. Her depiction of love, loss, and resilience resonates deeply with audiences. Perhaps Harry Potter’s imperfections are part of its legacy — they invite debate, reinterpretation, and reinvention by newer generations. In that sense, the series is both a triumph and a challenge, encouraging readers to imagine not only magical worlds but also more inclusive and complex ones.",
    "The Harry Potter film adaptations stand as one of the most successful cinematic projects in history. Spanning a decade and eight movies, they managed to capture the essence of Rowling’s world while offering their own artistic interpretations. The casting was near-perfect — Daniel Radcliffe, Emma Watson, and Rupert Grint embodied their roles so naturally that they became inseparable from their characters. The production design of Hogwarts, Diagon Alley, and the Ministry of Magic remains visually iconic. Yet, adapting seven dense novels into eight films was no small feat. Some fans lament missing details and subplots, like the full backstory of the Marauders or the deeper moral struggles in The Order of the Phoenix. However, the films compensated with emotional weight and cinematic beauty. The direction evolved with the story’s tone: from the whimsical innocence of The Philosopher’s Stone under Chris Columbus to the haunting realism of The Deathly Hallows directed by David Yates. The musical score by John Williams added another layer of magic, immortalizing the franchise through its unforgettable motifs. The movies not only expanded the series’ reach but also created a visual language that continues to influence fantasy filmmaking. Despite inevitable compromises, the Harry Potter films transformed literature into an enduring cinematic legend.",
    "Beyond wands and wizardry, Harry Potter is a profound meditation on morality. The series asks timeless ethical questions: What defines good and evil? Can love triumph over hate? How does power corrupt? Rowling’s narrative emphasizes that choices, not abilities, reveal a person’s true character — a philosophy echoed in Dumbledore’s wisdom. Through figures like Snape and Draco, the books explore moral ambiguity, showing that redemption and forgiveness are possible even for the deeply flawed. The theme of prejudice, especially through the treatment of “Muggle-borns” and “house-elves,” mirrors real-world discrimination. Rowling uses fantasy as a lens to discuss empathy and justice, encouraging readers to challenge bias in their own lives. Even Voldemort’s story, though steeped in evil, serves as a cautionary tale about the emptiness of power without love. The final message — that love is the most powerful magic — resonates universally. For many readers, Harry Potter functions almost like modern mythology, using the supernatural to explore what makes us human. Its moral lessons are not presented as lectures but as lived experiences, making them all the more impactful. Long after the final page, readers carry these values forward, proving that the truest magic lies not in spells, but in compassion and integrity.",
    "For millions, Harry Potter represents more than just books — it’s a cherished part of childhood. The anticipation of each new release, the smell of freshly printed pages, and the long nights reading under blankets with a flashlight are memories that shaped an entire generation. The story became a shared experience — kids and adults alike waiting at midnight for the next chapter in Harry’s life. The sense of belonging was powerful: for those who felt out of place in the real world, Hogwarts became a second home. Growing up alongside Harry, Hermione, and Ron was a journey through both fiction and adolescence. When the series ended, many readers felt they were saying goodbye to friends. Revisiting the books now evokes deep nostalgia — a reminder of simpler times when magic felt real. Even as adults, readers still find comfort in Rowling’s universe, discovering new meaning in familiar passages. The series bridges generations; parents who once read it as children now share it with their own kids. That cyclical connection makes Harry Potter timeless. More than anything, the series proves that stories can become part of who we are — shaping our imagination, our empathy, and our sense of wonder long after we leave Hogwarts behind.",
    "While Harry Potter brought joy to millions, its author’s fame has also sparked controversy. In recent years, public debate around J.K. Rowling’s personal statements has created division within the fandom. Some longtime fans feel conflicted — torn between love for the books and disappointment with their creator. This controversy raises complex questions about art and artist: can we separate the two? For many, the answer lies in the community itself. Fans have reclaimed the story, emphasizing the inclusive and accepting values they found within the pages, even when the author’s views seem to contradict them. The tension also highlights how deeply personal Harry Potter has become; it’s not just entertainment but part of people’s identities. The situation invites broader reflection on authorship and interpretation — once a story is released, does it still belong to its writer, or to the readers who give it life? Despite these challenges, the fandom remains resilient, using creativity and dialogue to sustain its magic. It’s a testament to the power of collective imagination: even when controversy clouds the legacy, the story’s message of love, courage, and unity continues to shine.",
    "Few literary fandoms rival the passion of Harry Potter fans. From early internet forums to massive conventions, the community has become a cultural force of its own. Fans don’t just consume the story — they expand it. Through fanfiction, art, podcasts, and activism, they’ve kept Hogwarts alive long after the final book. Websites like MuggleNet and The Leaky Cauldron fostered discussions, while fan-made works like A Very Potter Musical and The Wizarding World theme parks demonstrate the creativity born from this shared love. The fandom also became a space for social activism. Organizations like the Harry Potter Alliance use the story’s themes to promote equality and education, proving that fiction can inspire real-world change. This participatory culture gives Harry Potter a life independent of its author. It belongs to the readers who found empowerment and belonging within its pages. The community’s inclusivity and innovation are reminders that stories thrive when shared. The Harry Potter fandom isn’t just a byproduct of success — it’s a living example of how narrative worlds can evolve through collaboration and imagination, bridging the gap between fiction and reality.",
    "More than two decades since its debut, Harry Potter continues to enchant new generations. Its legacy extends far beyond bookshelves — influencing literature, cinema, and even education. The series taught readers to value empathy, knowledge, and courage. It reminded them that family isn’t always defined by blood and that standing up for what’s right matters, even when it’s difficult. Teachers use the books to spark discussions about ethics, prejudice, and identity. Psychologists cite them as examples of literature that nurtures emotional intelligence. Beyond academia, Harry Potter’s impact is emotional and universal. It provided solace to those who felt invisible and hope to those facing adversity. Despite controversies and criticisms, its central message remains untarnished: love conquers fear. As readers revisit the series, they discover that the true magic lies not in spells, but in the human heart. The story endures because it evolves — each generation finds new meaning in its words. Ultimately, Harry Potter is more than a cultural phenomenon; it’s a global symbol of imagination, resilience, and the timeless belief that even the smallest person can change the world."
    #"Text 10 – replace me with your own text."
]

## 3 – Load the SentenceTransformer model

**Tasks:**
- Load the `all-MiniLM-L6-v2` model using `SentenceTransformer()` and apply it to a sentence of your choice to show the embeddings.
- Using a short sentence, explain what the embeddings are.

In [None]:
'''
    Load a lightweight embedding model from Huggingface
'''
# Use the GPU (Apple MPS) if available to speed up encoding; otherwise fall back to the CPU (the CPU run took about 20 minutes).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device)

import certifi
print(certifi.where())

# Sentence Transformers (a.k.a. SBERT) is a Python library that provides pretrained sentence-embedding models.

#model = SentenceTransformer("all-MiniLM-L6-v2", device=device,token=False )
model = SentenceTransformer(model_name_or_path="../model", device=device)

# Compute the embeddings of all texts with the model loaded above
embeddings = model.encode(texts, show_progress_bar=True)
print(embeddings.shape)

mps
/Volumes/hack/projects/data-mining-htw/.pixi/envs/default/lib/python3.13/site-packages/certifi/cacert.pem


**Explanation:** ...


**Fidel's answer:** <br>
Embeddings are fixed-length numerical vectors that represent words, sentences, or paragraphs. To produce one, the text is first split into **tokens**, and each token is mapped to a vector of numbers. The model then processes all tokens that fit into its **context window** (the maximum number of tokens it can read at once; longer input is truncated), so that each token's vector reflects the surrounding words. Finally, the token vectors are combined (pooled) into a single vector that represents the whole word, sentence, or paragraph. For `all-MiniLM-L6-v2` this vector has 384 dimensions. Texts with similar meaning end up with vectors that point in similar directions, which is what the similarity metrics in the following sections measure.
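
As a rough illustration of that last step, the sketch below combines a few made-up token vectors into one sentence vector by averaging them and scaling the result to unit length. The token vectors and their 4-dimensional size are invented for illustration; a real model computes them with a transformer network, and the pooling strategy actually used depends on the model configuration.

```python
import numpy as np

# Hypothetical 4-dimensional vectors for the three tokens of "magic is real".
# Real models use learned vectors with hundreds of dimensions.
token_vectors = np.array([
    [0.2, 0.8, 0.0, 0.4],   # "magic"
    [0.0, 0.2, 0.6, 0.0],   # "is"
    [0.4, 0.2, 0.0, 0.8],   # "real"
])

# Mean pooling: average the token vectors component-wise.
sentence_vector = token_vectors.mean(axis=0)

# L2-normalise so the sentence vector has length 1.
sentence_embedding = sentence_vector / np.linalg.norm(sentence_vector)

print(sentence_embedding.shape)            # a single vector with 4 components
print(np.linalg.norm(sentence_embedding))  # length after normalisation
```

Whatever the length of the input, the result is always one vector of the same size, which is what makes texts directly comparable.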

## 4 – Create and fill an in-memory SQLite database

- Create an in-memory database with SQLite
- Create a table with the fields
  - id,
  - title,
  - text,
  - embedding,
  - import_time
- Metadata: Just add the time you loaded the data into the database
- Load the data into the database: `INSERT INTO documents (title, text, embedding, import_time) VALUES (?, ?, ?, ?)`
- Use pandas to run a SQL-request against the table to show that everything works fine

**Task:** Add the necessary SQL query. All other parts are already there.

In [15]:
# Create an in-memory database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create a simple table (embedding stored as TEXT)
cursor.execute("""
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    text TEXT NOT NULL,
    embedding TEXT,  -- comma-separated string, parsed back with np.fromstring in the retriever
    import_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")

# Insert documents with metadata
import datetime
#model2 = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
for i, t in enumerate(texts):
    title = f"Doc_{i+1}"
    emb = model.encode([t])[0]
    emb_str = ",".join(map(str, emb))   # convert vector to comma-separated string
    cursor.execute(
        "INSERT INTO documents (title, text, embedding, import_time) VALUES (?, ?, ?, ?)",
        (title, t, emb_str, str(datetime.datetime.now()))
    )

conn.commit()

# Test query using pandas
pd.read_sql("SELECT id, title, import_time FROM documents", conn)
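
The comma-separated string used above is easy to inspect but verbose. As an alternative sketch (not part of the exercise), an embedding can be stored as raw bytes in a BLOB column and restored with `np.frombuffer`. The table name `blob_demo`, the connection name `conn_demo`, and the random 384-element vector are placeholders; the `float32` dtype has to be known when reading the bytes back.

```python
import sqlite3
import numpy as np

# Separate in-memory database so the exercise tables are not touched
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE blob_demo (id INTEGER PRIMARY KEY, embedding BLOB)")

# Random stand-in for a model embedding
rng = np.random.default_rng(0)
vec = rng.random(384).astype(np.float32)

# Store the raw bytes of the array
conn_demo.execute("INSERT INTO blob_demo (embedding) VALUES (?)", (vec.tobytes(),))
conn_demo.commit()

# Read the bytes back and reinterpret them with the same dtype
(raw,) = conn_demo.execute("SELECT embedding FROM blob_demo").fetchone()
restored = np.frombuffer(raw, dtype=np.float32)

print(np.array_equal(vec, restored))
```

The round trip is lossless because no decimal formatting is involved, but the stored value is no longer human-readable in a SQL client.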


## 5 – Define and test three similarity metrics

- **Task:** Please build two more functions. Each function should return a similarity score between two vectors.
- They will be used later to compare which metric retrieves the most relevant texts.

Here is one function:

```
# Metric 1: cosine similarity (from scikit-learn)
def cosine_sim(a, b):
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]
```



In [None]:
# Metric 1: cosine similarity (from scikit-learn)
def cosine_sim(a, b):
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]

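# Metric 2: dot product (not normalised, so it also depends on vector length)
def dot_product_sim(a, b):
    return float(np.dot(a, b))

# Metric 3: inverse Euclidean distance, mapped to (0, 1] via 1 / (1 + d)
# (one common convention, assumed here; identical vectors score 1.0)
def inv_euclidean_sim(a, b):
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))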
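
To see how the three scores relate, the following sketch evaluates cosine similarity, dot product, and an inverse Euclidean score on small hand-picked vectors. The names `dot_product_sim` and `inv_euclidean_sim` follow Section 7; the `1 / (1 + distance)` form is one possible definition and an assumption here, and cosine similarity is written out in NumPy so the block runs on its own.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product_sim(a, b):
    return float(np.dot(a, b))

def inv_euclidean_sim(a, b):
    # Assumed definition: distance 0 maps to 1.0, larger distances approach 0
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))

q = np.array([1.0, 0.0])
docs = {
    "same direction, longer": np.array([3.0, 0.0]),
    "45 degrees, unit length": np.array([1.0, 1.0]) / np.sqrt(2.0),
    "orthogonal": np.array([0.0, 1.0]),
}

for name, d in docs.items():
    print(name,
          round(cosine_sim(q, d), 3),
          round(dot_product_sim(q, d), 3),
          round(inv_euclidean_sim(q, d), 3))
```

In this toy example the longer vector gets the highest cosine and dot-product scores but the lowest inverse Euclidean score, because only cosine similarity ignores vector length. For unit-length vectors the dot product equals the cosine similarity and the squared Euclidean distance equals `2 - 2 * cosine`, so all three metrics rank such vectors identically.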


## 6 – Build a simple Retriever

A function that encodes a query, computes the similarity between the query and each document embedding, and returns the top 3 most similar texts by default.  
  
**Task: Code is complete, please explain what happens here**

In [None]:
def retrieve(query, metric_function, top_k=3):
    # Encode the query
    q_vec = model.encode([query])[0]

    # Load all document embeddings
    cursor.execute("SELECT id, title, text, embedding FROM documents")
    docs = cursor.fetchall()

    results = []
    for doc_id, title, text, emb_str in docs:
        emb = np.fromstring(emb_str, sep=",")     # convert text back to numeric vector
        score = metric_function(q_vec, emb)
        results.append((title, text, score))

    # Sort by similarity and return top_k results
    results = sorted(results, key=lambda x: x[2], reverse=True)[:top_k]
    return pd.DataFrame(results, columns=["Title", "Text", "Score"])

**Explanation**
...
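
The ranking step of `retrieve` can be sketched in isolation: score every stored vector against the query vector, sort by score in descending order, and keep the first `top_k` entries. The helper name `rank_documents`, the titles, and the 2-dimensional vectors are hypothetical stand-ins for the database rows and model embeddings, and a plain dot product serves as the metric.

```python
import numpy as np
import pandas as pd

def rank_documents(q_vec, doc_vectors, titles, metric_function, top_k=3):
    # Score each document vector against the query vector
    results = [(title, metric_function(q_vec, vec))
               for title, vec in zip(titles, doc_vectors)]
    # Highest score first, then cut off after top_k rows
    results = sorted(results, key=lambda x: x[1], reverse=True)[:top_k]
    return pd.DataFrame(results, columns=["Title", "Score"])

titles = ["Doc_A", "Doc_B", "Doc_C", "Doc_D"]
doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
q_vec = np.array([0.8, 0.6])

ranking = rank_documents(q_vec, doc_vectors, titles,
                         lambda a, b: float(np.dot(a, b)), top_k=2)
print(ranking)
```

Because the metric is passed in as a function, the same ranking code works unchanged for any similarity score where a larger value means "more similar".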

## 7 – Run a query and inspect the results

- Run the retriever with one of the metrics (cosine_sim, dot_product_sim, or inv_euclidean_sim).
- Check how the ranking of results changes across metrics.

In [None]:
query = "Enter your own test question here"
retrieve(query, cosine_sim)

# Fine-Tuning Phase

## 8 – Reload database with different chunk sizes and overlaps

- Reload your in-memory database with various chunk_size and overlap settings (e.g. 30/10, 60/20).
- For each configuration, insert each chunk as a new row in the documents table and repeat the insertion logic from Section 4 (using comma-separated embeddings).

**Task:** Explain what happens here and test the functions.


In [None]:
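def chunk_text(text, chunk_size=50, overlap=10):
    # The window advances by (chunk_size - overlap) words, so this step must be positive
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        # Each chunk holds up to chunk_size words; chunks near the end may be shorter
        chunk = " ".join(words[i:i+chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks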

**Explanations:** ...

In [None]:
# Test the function
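
One way to test the function is to run it on a short list of numbered words, where the overlap is easy to see. The sketch repeats `chunk_text` so that it runs on its own; the 12-word input and the `chunk_size=5`, `overlap=2` setting are arbitrary choices.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i+chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Twelve numbered words: "w0 w1 ... w11"
sample = " ".join(f"w{n}" for n in range(12))

demo_chunks = chunk_text(sample, chunk_size=5, overlap=2)
for c in demo_chunks:
    print(c)
```

Each chunk starts three words after the previous one, so consecutive chunks share two words, and the final chunk is shorter because the text runs out.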

## 9 – Query the new database with different similarity metrics

Compare retrieval quality under the different metrics developed above, and record which combination of metric, chunk size, and overlap yields the most meaningful matches.

**Task:** Replace the similarity-functions here with the functions you developed above:

```
for metric in [cosine_sim, <your sim function>, <your sim function>]:
    print(f"Results using {metric.__name__}:")
    display(retrieve(query, metric))
```

In [None]:
for metric in [cosine_sim, dot_product_sim, inv_euclidean_sim]:
    print(f"Results using {metric.__name__}:")
    display(retrieve(query, metric))

## 10 – Use the systematic evaluation module

- This module systematically tests different configurations (chunk size, overlap, metric) and records which text was ranked highest.
- Visualize or summarize the outcomes to decide which configuration works best.

**Task:** Explain what happens here and run the experiment with different settings for
- chunksize
- overlap
- similarity functions

(You may modify the experiment if you want, but this is not required.)

*Please note: Replace the similarity functions with your similarity functions, otherwise it will throw an error.*

In [None]:
def evaluate_configs(query, chunk_sizes, overlaps, metrics):
    results = []

    for cs in chunk_sizes:
        for ov in overlaps:
            # Rebuild an in-memory DB for each configuration
            conn = sqlite3.connect(":memory:")
            cursor = conn.cursor()
            cursor.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

            # Insert chunks with textual embeddings
            for t in texts:
                for ch in chunk_text(t, cs, ov):
                    emb = model.encode([ch])[0]
                    emb_str = ",".join(map(str, emb))
                    cursor.execute("INSERT INTO docs (text, embedding) VALUES (?, ?)", (ch, emb_str))
            conn.commit()

            # Encode the query (done once per configuration, shared by all metrics)
            q_vec = model.encode([query])[0]

            # Evaluate all metrics
            for metric in metrics:
                cursor.execute("SELECT id, text, embedding FROM docs")
                docs = cursor.fetchall()
                scores = []

                for _, text, emb_str in docs:
                    emb = np.fromstring(emb_str, sep=",")
                    score = metric(q_vec, emb)
                    scores.append((text, score))

                # Sort by similarity and take the best one
                top_text, top_score = sorted(scores, key=lambda x: x[1], reverse=True)[0]

                results.append((cs, ov, metric.__name__, top_text, top_score))

    # Return results as DataFrame
    return pd.DataFrame(results, columns=["ChunkSize", "Overlap", "Metric", "TopResult", "SimilarityScore"])

**Explanations:** ...

**Example usage**

In [None]:
# Define your test query
query = "Summarize the main idea of my texts"

# Define parameter grid
chunk_sizes = [30, 50, 70]
overlaps = [5, 10, 15]
metrics = [cosine_sim, dot_product_sim, inv_euclidean_sim]

# Run evaluation
results_df = evaluate_configs(query, chunk_sizes, overlaps, metrics)

# Display full table
display(results_df)

# Optional: Find the highest similarity overall
best_config = results_df.sort_values("SimilarityScore", ascending=False).head(1)
display(best_config)
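
Scores from different metrics can live on different scales, so sorting the whole table by `SimilarityScore` may simply favour the metric that produces the largest numbers. A possible alternative is to pick the best configuration within each metric. The table below uses made-up placeholder scores purely to demonstrate the pandas calls; it is not output from the experiment, and the names `toy_results` and `best_per_metric` are illustrative.

```python
import pandas as pd

# Placeholder values for illustration only (not real experiment results)
toy_results = pd.DataFrame({
    "ChunkSize": [30, 50, 30, 50],
    "Overlap": [5, 5, 5, 5],
    "Metric": ["cosine_sim", "cosine_sim", "inv_euclidean_sim", "inv_euclidean_sim"],
    "SimilarityScore": [0.40, 0.50, 0.48, 0.45],
})

# Row label of the highest score within each metric
best_rows = toy_results.groupby("Metric")["SimilarityScore"].idxmax()

# One row per metric, holding that metric's best configuration
best_per_metric = toy_results.loc[best_rows].reset_index(drop=True)
print(best_per_metric)
```

The same two lines can be applied to `results_df` to compare configurations metric by metric.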


**Interpretation of Results** ...