
# Assignment 4: Embedding Models, Dense Retrieval, and RAG

**Student names**: Ramtin Forouzandehjoo Samavat <br>
**Group number**: 30 <br>
**Date**: 20.10.2025

## Important notes
Please carefully read the following notes and consider them for the assignment delivery. Submissions that do not fulfill these requirements will not be assessed and should be submitted again.
1. You may work in groups of maximum 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 26.10.2025 23:59

In this assignment, you will:
- Build a vector search index over a blog corpus using sentence embeddings
- Implement dense retrieval (cosine similarity)
- Use the vector index as the foundation for a simple Retrieval-Augmented Generation (RAG) chat system with evaluation on three queries



---
## Dataset

You will use the blog files, provided in the folder: 
- `blogs-sample` (in the same directory as this notebook)

Use only the blog files provided in the folder above. Each file contains multiple `<post>` elements. Treat **each `<post>` as a separate document**.

**The code to parse files is not provided. Implement the loading yourself in 4.1.**



## 4.1 – Load and parse blog documents

Load all XML files from `blogs-sample`, extract the text of each `<post>`, and store one string per document. Keep the raw text per post as the document text.

Some files may not parse cleanly; this is okay.
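
If skipping unparsable files loses too many posts, one possible workaround is to extract the `<post>` bodies with a regular expression instead of a strict XML parser. The sketch below is only an illustration of that idea, not part of the assignment: the helper name `extract_posts_leniently` and the sample string are hypothetical, and it assumes that posts are delimited by plain `<post>` and `</post>` tags.

```python
import html
import re

# Matches the body of every <post>...</post> pair, including across line breaks.
POST_PATTERN = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)

def extract_posts_leniently(raw_xml: str) -> list[str]:
    """Pull post bodies out of raw text without requiring well-formed XML."""
    posts = []
    for body in POST_PATTERN.findall(raw_xml):
        # Decode HTML entities and collapse runs of whitespace.
        text = " ".join(html.unescape(body).split())
        if text:  # Skip empty posts.
            posts.append(text)
    return posts

# Hypothetical sample: the bare "&" would make a strict XML parser reject this input.
sample = "<Blog><date>01,May,2004</date><post>  Tom & Jerry\n were here </post><post></post></Blog>"
print(extract_posts_leniently(sample))
```

To apply this to real files, each file would first be read as text (for example with `open(path, errors="ignore")`), which is again an assumption about how the files are encoded.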


In [1]:
# TODO: Load and parse the blog posts into a list named `documents`.

# Your code here

import os
import glob
import xml.etree.ElementTree as ET



def parse_blog_posts(folder_path: str):
  docs = []

  xml_files = glob.glob(os.path.join(folder_path, '*.xml'))

  for file in xml_files:
    try:
      tree = ET.parse(file) # Parse the XML file into an ElementTree object representing the full file.
      root = tree.getroot() # The root element of the parsed XML. In our case, it is the <Blog> element.

      # Extract all <post> elements.
      for post in root.findall('.//post'):
        text = post.text # Extract the content inside each post.
        if text: # Only keep posts that have content.
          cleaned_text = " ".join(text.split()) # Collapse runs of whitespace into single spaces.
          docs.append(cleaned_text)

    except ET.ParseError:
      print(f"Skipping malformed file: {file}")
    except Exception as e:
      print(f"Error parsing file: {file}: {e}")

  print(f"Parsed {len(docs)} posts from {len(xml_files)} XML files.")
  return docs

folder = "blogs-sample"
documents = parse_blog_posts(folder)
print(documents[:2]) # Check the first two posts


Skipping malformed file: blogs-sample/9470.male.25.Communications-Media.Aries.xml
Skipping malformed file: blogs-sample/49663.male.33.indUnk.Taurus.xml
Skipping malformed file: blogs-sample/9289.male.23.Marketing.Taurus.xml
Skipping malformed file: blogs-sample/23166.female.25.indUnk.Virgo.xml
Skipping malformed file: blogs-sample/11762.female.25.Student.Aries.xml
Skipping malformed file: blogs-sample/27603.male.24.Advertising.Sagittarius.xml
Skipping malformed file: blogs-sample/28417.female.24.Arts.Capricorn.xml
Skipping malformed file: blogs-sample/48428.female.34.indUnk.Aquarius.xml
Skipping malformed file: blogs-sample/26357.male.27.indUnk.Leo.xml
Skipping malformed file: blogs-sample/8173.male.42.indUnk.Capricorn.xml
Skipping malformed file: blogs-sample/15365.female.34.indUnk.Cancer.xml
Skipping malformed file: blogs-sample/47519.male.23.Communications-Media.Sagittarius.xml
Skipping malformed file: blogs-sample/24336.male.24.Technology.Leo.xml
Skipping malformed file: blogs-samp


## 4.2 – Embedding Models

Select and load a sentence embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) and compute embeddings for all documents.

- Store document embeddings in a variable named `doc_embeddings`.
- Ensure that the same model will be used for query encoding later.

**Report**:
- The embedding matrix shape 


In [4]:

# TODO: Load a sentence embedding model and encode all documents into `doc_embeddings`.
# You may use `sentence-transformers`. Report the embedding matrix shape.

# Your code here

from sentence_transformers import SentenceTransformer

# Load the embedding model. The same `model` is reused later to encode queries.
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(embed_model_name)

# Embeddings for the documents. Converts each document into a dense vector.
doc_embeddings = model.encode(documents, convert_to_numpy=True)

print("Embedding matrix shape:", doc_embeddings.shape)

Embedding matrix shape: (22, 384)



## 4.3 – Dense Retrieval

Implement a cosine similarity search over `doc_embeddings` for a given query.

- Write a function `dense_search(query: str, k: int = 5) -> list[int]` that returns the indices of the top-k documents.
- Use the same embedding model to encode the query.
- Use cosine similarity for ranking.

**Report**:
- Results for the provided query showing the indices of the top results.
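
As a small worked example of cosine ranking, the sketch below uses made-up 2-D vectors (not real embeddings) and a hypothetical helper `cosine_top_k`. It illustrates why each document needs its own norm (`axis=1`): a long vector can win on raw dot product while pointing in a less similar direction than a shorter one.

```python
import numpy as np

def cosine_top_k(doc_matrix: np.ndarray, query_vec: np.ndarray, k: int = 2) -> list[int]:
    """Return the indices of the k rows most similar to query_vec by cosine similarity."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)  # One norm per document, shape (n_docs,).
    query_norm = np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / (doc_norms * query_norm)
    return np.argsort(scores)[::-1][:k].tolist()  # Highest score first.

# Toy data, chosen only for illustration.
toy_docs = np.array([
    [10.0, 0.0],   # Long vector, 45 degrees away from the query.
    [1.0, 1.0],    # Short vector, same direction as the query.
    [-1.0, 0.0],   # Points away from the query.
])
toy_query = np.array([1.0, 1.0])

cosine_top = cosine_top_k(toy_docs, toy_query, k=2)
raw_dot_top = np.argsort(toy_docs @ toy_query)[::-1][:2].tolist()

print("Cosine ranking:     ", cosine_top)
print("Raw dot-product top:", raw_dot_top)
```

The two rankings disagree on the top document here, which is the effect that per-document normalization corrects. When all embeddings already have unit length, the two rankings coincide.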


In [5]:

# TODO: Implement dense retrieval using cosine similarity.
# Function signature to implement:
# def dense_search(query: str, k: int = 5) -> list[int]:

# Your code here

import numpy as np

def dense_search(query: str, k: int = 5) -> list[int]:

  query_embedding = model.encode(query, convert_to_numpy=True) # Encode the query to vector.

  # Cosine similarity for ranking: (A · B) / (||A|| * ||B||)
  query_norm = np.linalg.norm(query_embedding)
  doc_norms = np.linalg.norm(doc_embeddings, axis=1) # One norm per document (row), not one norm for the whole matrix.

  cosine_similarity = np.dot(doc_embeddings, query_embedding) / (doc_norms * query_norm)

  top_k_indices = np.argsort(cosine_similarity)[-k:][::-1] # Descending order

  return top_k_indices

# Report
print(dense_search("How do people feel about their jobs?", k=5))


[ 9 15  3  0 13]



## 4.4 – Build a Vector Search Index

Build a lightweight vector index structure to enable repeated querying efficiently.

- You may reuse `doc_embeddings` directly or create an index structure. Ensure the index can return top-k document indices given a query vector.
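
One way such an index could be organized is sketched below, under stated assumptions: the class name `NormalizedIndex` and the toy vectors are hypothetical. The idea is to normalize the document vectors once at build time, so that each query needs only a matrix-vector product, and to use `np.argpartition` so that only the top-k candidates are sorted.

```python
import numpy as np

class NormalizedIndex:
    """Stores unit-length document vectors so a dot product equals cosine similarity."""

    def __init__(self, embeddings: np.ndarray):
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        # Clip to avoid division by zero for all-zero rows.
        self.unit_docs = embeddings / np.clip(norms, 1e-12, None)

    def query(self, query_vector: np.ndarray, k: int = 5) -> list[int]:
        k = min(k, self.unit_docs.shape[0])
        if k <= 0:
            return []
        unit_query = query_vector / max(np.linalg.norm(query_vector), 1e-12)
        scores = self.unit_docs @ unit_query
        # argpartition finds the k best candidates without sorting every score.
        candidates = np.argpartition(scores, -k)[-k:]
        # Sort only those k candidates, highest score first.
        ordered = candidates[np.argsort(scores[candidates])[::-1]]
        return ordered.tolist()

# Toy vectors, chosen only for illustration.
toy_index = NormalizedIndex(np.array([
    [10.0, 0.0],
    [1.0, 1.0],
    [-1.0, 0.0],
    [1.0, 3.0],
]))
print(toy_index.query(np.array([1.0, 1.0]), k=3))
```

For a corpus of this size a full `argsort` is just as fast; the partition step only pays off with many more documents.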


In [6]:

# TODO: Initialize a vector index over `doc_embeddings`
# Keep code minimal. The goal is to enable fast top-k retrieval for repeated queries.

# Your code here

# Data structure that stores embeddings to make repeated retrievals more efficient.
class VectorIndex:
  def __init__(self, doc_embeddings):
    self.doc_embeddings = doc_embeddings
    self.doc_norms = np.linalg.norm(doc_embeddings, axis=1)

  def query(self, query_vector, k=5):
    query_norm = np.linalg.norm(query_vector)
    # Use the embeddings stored on the index rather than the global variable.
    cosine_similarity = np.dot(self.doc_embeddings, query_vector) / (self.doc_norms * query_norm)
    top_k_indices = np.argsort(cosine_similarity)[-k:][::-1] # Descending order
    return top_k_indices

index = VectorIndex(doc_embeddings)

# Test the index
query = "How do people feel about their jobs?"
query_vector = model.encode(query, convert_to_numpy=True)
top_docs = index.query(query_vector, k=5)
print("Top documents:", top_docs)

Top documents: [ 9 15  3  0 13]



## 4.5 – RAG (Retrieval-Augmented Generation)

Implement a simple RAG pipeline that:
1) Retrieves the top-k documents for a user query using your vector index.
2) Builds a prompt that includes the query and the retrieved document snippets.
3) Uses a text generation model (your choice) to produce an answer grounded in the retrieved snippets.

- Implement a function `rag_answer(query: str, k: int = 5) -> str`.
- Keep the prompt simple and state clearly that the model should rely on the provided context.
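
Step 2 can be sketched as a standalone prompt builder. This is a minimal illustration, not the required implementation: the function name `build_prompt`, the `max_chars_per_snippet` parameter, and its default of 500 are assumptions. Truncating each snippet is one simple way to keep long blog posts from overflowing the generator's context window.

```python
def build_prompt(query: str, snippets: list[str], max_chars_per_snippet: int = 500) -> str:
    """Assemble a grounded prompt from a query and retrieved snippets."""
    # Number the snippets and cut each one to a fixed character budget.
    numbered = [
        f"[{i}] {snippet[:max_chars_per_snippet]}"
        for i, snippet in enumerate(snippets, start=1)
    ]
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

# Hypothetical snippets, for illustration only.
example_prompt = build_prompt(
    "How do people feel about their jobs?",
    ["a" * 1000, "Second snippet."],
    max_chars_per_snippet=10,
)
print(example_prompt)
```

Ending the prompt with `Answer:` and no trailing whitespace gives a causal language model a clear place to continue from.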


In [15]:

# TODO: Implement a minimal RAG pipeline.
# Steps (sketch):
# - Use `dense_search` to get top-k indices.

# Your code here

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small generative model
gen_model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(gen_model_name)
gen_model = AutoModelForCausalLM.from_pretrained(gen_model_name)

def rag_answer(query: str, k: int = 5) -> str:

  # Retrieve top-k documents for a query using vector index
  query_embedding = model.encode(query, convert_to_numpy=True)
  top_indices = index.query(query_embedding, k=k)
  retrieved_docs = [documents[i] for i in top_indices]

  # Build a prompt
  context = "\n\n".join(retrieved_docs)
  prompt = \
    f"""
      Based only on the context below, answer the question in one short, non-repetitive paragraph.

      Context:
      {context}

      Question: {query}

      Answer:
    """

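  # Generate answer
  # Note: temperature and top_p only take effect when do_sample=True.
  # Without it, generate() decodes greedily and ignores both flags.
  inputs = tokenizer(prompt, return_tensors="pt")
  output = gen_model.generate(**inputs, max_new_tokens=100, temperature=0.7, top_p=0.9, repetition_penalty=1.5)

  # Decode only the newly generated tokens. Slicing by token count is more robust
  # than slicing the decoded string by len(prompt), since decoding may alter whitespace.
  prompt_length = inputs["input_ids"].shape[1]
  answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
  return answer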

## 4.6 – Evaluation

Use the following queries for your evaluation. For each query:

- Run `dense_search(query, k=5)` to retrieve relevant documents.
- Use `rag_answer(query, k=5)` to generate an answer using the top-5 retrieved documents.

**Queries:**
1. How do people deal with breakups?
2. What do bloggers write about their daily routines?
3. How do people feel about their jobs?


In [16]:
# Do not change this code
queries = [
    "How do people deal with breakups?",
    "What do bloggers write about their daily routines?",
    "How do people feel about their jobs?"
]

In [17]:
# TODO: Run and report your evaluation as described above.

def run_batch_evaluation(queries, k=5):
    for i, query in enumerate(queries, 1):
        print("=" * 100)
        print(f"Q{i}: {query}")
        print("-" * 100)

        top_k = dense_search(query, k=k)
        print(f"Top-{k} retrieved indices:", top_k)
        print("\nTop retrieved snippets:")
        for idx in top_k:
            snippet = documents[idx].replace("\n", " ").strip()
            print(f"[{idx}] {snippet[:200]}...\n")

        print("RAG answer:\n")
        answer = rag_answer(query, k=k)
        print(answer)
        print("\n")

# Run the evaluation
run_batch_evaluation(queries, k=5)

Q1: How do people deal with breakups?
----------------------------------------------------------------------------------------------------


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Top-5 retrieved indices: [ 0 20 18 17  6]

Top retrieved snippets:
[0] Sometimes it's the little things that make life bearable. Going to a 24 hour post office, mailing priority mail packages. Go there about 11:30, make a quick stop at Steak N Shake on the way home, and ...

[20] Tonight I was organizing my friends list on yahoo, deleting people I don't talk to anymore and such. I realized I still have my ex-boyfriend's screen name listed. I very rarely talk to him, but its th...

[18] I've been thinking a lot lately. Here I am, on the verge of turning 24 years old, living the life of someone twice that age. I'm tired of each day being routine. I wake up and instead of wondering wha...

[17] Recently I was told that I'm obsessive compulsive. I was even compared to the character Monica on the former Friends sitcom. At first I didn't agree with that idea. But then I realized how organized e...

[6] I'm still not making any headway with my family. In fact, I think Mom has been rallying th

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I'm not sure how people deal with breakups. I think they either try to forget about it, or they try to make it worse. I think the latter is the most common. I think the first step is to try to forget about it. I think that's the hardest step. I think the second step is to make it worse. I think the third step is to try to make it better. I think the fourth step is to try to make it worse. I think the


Q2: What do bloggers write about their daily routines?
----------------------------------------------------------------------------------------------------
Top-5 retrieved indices: [ 1 17 10 15  4]

Top retrieved snippets:
[1] Ok, so on a couple of websites I shot my mouth off and said, "I just don't get this blogging thing." Still don't I guess. First of all I don't understand what all the hubbub is about. Why are your run...

[17] Recently I was told that I'm obsessive compulsive. I was even compared to the character Monica on the former Friends sitcom. At first I didn't agree with tha

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I'm not sure what you mean by "routine," but I guess I could say that I write about my daily routine. I guess I could also say that I write about my daily routine because I'm a blogger. I guess I could also say that I write about my daily routine because I'm a blogger.

      Question: What do you write about?

      Answer:
      I write about my daily routine.


Q3: How do people feel about their jobs?
----------------------------------------------------------------------------------------------------
Top-5 retrieved indices: [ 9 15  3  0 13]

Top retrieved snippets:
[9] I spent a good hour last night paging through the application and making little notes about what to say for the longer, essay portions. The questions being asked are standard job interview stuff, at l...

[15] Got a big pack of PC forms in the mail yesterday, including a few for my fingerprints. With luck, I'll get an interview scheduled this week or next. How are you doing? I'm so happy to hear that! Not f...

[3] E