
# Assignment 4: Embedding Models, Dense Retrieval, and RAG

**Student names**: Matiss Podins <br>
**Group number**: 39 <br>
**Date**: 25.10.2025

## Important notes
Please carefully read the following notes and consider them for the assignment delivery. Submissions that do not fulfill these requirements will not be assessed and should be submitted again.
1. You may work in groups of at most 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 26.10.2025 23:59

In this assignment, you will:
- Build a vector search index over a blog corpus using sentence embeddings
- Implement dense retrieval (cosine similarity)
- Use the vector index as the foundation for a simple Retrieval-Augmented Generation (RAG) chat system with evaluation on three queries



---
## Dataset

You will use the blog files provided in the folder:
- `blogs-sample` (in the same directory as this notebook)

Use only the blog files provided in the folder above. Each file contains multiple `<post>` elements. Treat **each `<post>` as a separate document**.

**The code to parse files is not provided. Implement the loading yourself in 4.1.**



## 4.1 – Load and parse blog documents

Load all XML files from `blogs-sample`, extract the text of each `<post>`, and store one string per document. Keep the raw text per post as the document text.

You may experience some trouble parsing all lines in the files, but this is okay.
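
A strict XML parser rejects a whole file as soon as it hits one undefined entity or invalid token, so every post in that file is lost. The sketch below shows one possible fallback, assuming posts are always wrapped in plain `<post>...</post>` tags: it pulls post bodies out with a regular expression instead of parsing the XML. The name `extract_posts` and the sample string are hypothetical and not part of the assignment template.

```python
import html
import re

# Matches the body of every <post>...</post> element, across line breaks.
POST_PATTERN = re.compile(r"<post>(.*?)</post>", flags=re.DOTALL | re.IGNORECASE)


def extract_posts(raw_xml: str) -> list[str]:
    """Return the stripped text of each <post>, tolerating malformed XML."""
    posts = []
    for body in POST_PATTERN.findall(raw_xml):
        # Decode entities such as &nbsp; that a strict XML parser rejects.
        text = html.unescape(body).strip()
        posts.append(text)
    return posts


# Hypothetical file content with an undefined entity and a bare ampersand.
sample = (
    "<Blog>\n"
    "<date>01,May,2004</date>\n"
    "<post>\n  Hello&nbsp;world \n</post>\n"
    "<date>02,May,2004</date>\n"
    "<post> Tom & Jerry </post>\n"
    "</Blog>"
)
posts = extract_posts(sample)
print(len(posts), "posts recovered")
```

To apply this to real files, the raw text could be read with something like `open(filepath, encoding="utf-8", errors="replace").read()`; the encoding is an assumption and may need adjusting for this corpus.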



In [8]:

# TODO: Load and parse the blog posts into a list named `documents`.

# Your code here
import os
import xml.etree.ElementTree as ET

BLOGS_PATH = r"blogs-sample"  

def parse_blogs(path):
    docs = {}
    doc_id = 1
    
    # Iterate through all XML files in sorted order so that document
    # indices are reproducible across runs and platforms
    for filename in sorted(os.listdir(path)):
        if filename.endswith('.xml'):
            filepath = os.path.join(path, filename)
            
            try:
                tree = ET.parse(filepath)
                root = tree.getroot()
                
                # Extract all <post> elements
                for post in root.findall('.//post'):
                    post_text = post.text if post.text else ""
                    
                    docs[doc_id] = {
                        "id": doc_id,
                        "text": post_text.strip(),
                        "source_file": filename
                    }
                    doc_id += 1
                    
            except Exception as e:
                print(f"Error parsing {filename}: {e}")
    
    print(f"Parsed {len(docs)} documents from {path}.")
    return docs

docs = parse_blogs(BLOGS_PATH)

Error parsing 11253.male.26.Technology.Aquarius.xml: undefined entity: line 7, column 475
Error parsing 11762.female.25.Student.Aries.xml: undefined entity: line 17, column 67
Error parsing 15365.female.34.indUnk.Cancer.xml: not well-formed (invalid token): line 35, column 8515
Error parsing 17944.female.39.indUnk.Sagittarius.xml: undefined entity: line 276, column 82
Error parsing 21828.male.40.Internet.Cancer.xml: not well-formed (invalid token): line 9, column 463
Error parsing 23166.female.25.indUnk.Virgo.xml: undefined entity: line 7, column 149
Error parsing 23191.female.23.Advertising.Taurus.xml: not well-formed (invalid token): line 16, column 431
Error parsing 23676.male.33.Technology.Scorpio.xml: undefined entity: line 36, column 71
Error parsing 24336.male.24.Technology.Leo.xml: not well-formed (invalid token): line 279, column 12
Error parsing 26357.male.27.indUnk.Leo.xml: not well-formed (invalid token): line 88, column 747
Error parsing 27603.male.24.Advertising.Sagittari


## 4.2 – Embedding Models

Select and load a sentence embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) and compute embeddings for all documents.

- Store document embeddings in a variable named `doc_embeddings`.
- Ensure that the same model will be used for query encoding later.

**Report**:
- The embedding matrix shape 
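
One detail worth keeping in mind when encoding: row `i` of the embedding matrix corresponds to the `i`-th text passed to the encoder (0-based), whereas the parser in 4.1 numbers documents from 1. The sketch below keeps an explicit row-to-id lookup so retrieved row indices can be mapped back to documents without guessing. `build_row_lookup` and `toy_docs` are hypothetical names used only for illustration.

```python
def build_row_lookup(docs: dict[int, dict]) -> tuple[list[str], list[int]]:
    """Return document texts and the doc id stored at each embedding row."""
    texts: list[str] = []
    row_to_doc_id: list[int] = []
    for doc_id, doc in docs.items():  # dicts preserve insertion order
        texts.append(doc["text"])
        row_to_doc_id.append(doc_id)
    return texts, row_to_doc_id


# Toy stand-in for the parsed corpus (ids start at 1, as in the parser).
toy_docs = {
    1: {"id": 1, "text": "first post", "source_file": "a.xml"},
    2: {"id": 2, "text": "second post", "source_file": "a.xml"},
    3: {"id": 3, "text": "third post", "source_file": "b.xml"},
}
texts, row_to_doc_id = build_row_lookup(toy_docs)

# Row 0 of the embedding matrix would belong to document id 1.
print(row_to_doc_id[0])
```

The `texts` list is what would be passed to the encoder; `row_to_doc_id[i]` then recovers the document id for any retrieved row `i`.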


In [9]:

# TODO: Load a sentence embedding model and encode all documents into `doc_embeddings`.
# You may use `sentence-transformers`. Report the embedding matrix shape.

# Your code here
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the sentence embedding model
print("Loading sentence embedding model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(f"Model loaded: {model.get_sentence_embedding_dimension()} dimensions\n")

# Extract document texts
doc_texts = [doc["text"] for doc in docs.values()]

# Compute embeddings for all documents
print("Computing document embeddings...")
doc_embeddings = model.encode(doc_texts, show_progress_bar=True, convert_to_numpy=True)

# Report the embedding matrix shape
print(f"\nEmbedding matrix shape: {doc_embeddings.shape}")
print(f"  - Number of documents: {doc_embeddings.shape[0]}")
print(f"  - Embedding dimension: {doc_embeddings.shape[1]}")

Loading sentence embedding model...
Model loaded: 384 dimensions

Computing document embeddings...




Embedding matrix shape: (22, 384)
  - Number of documents: 22
  - Embedding dimension: 384



## 4.3 – Dense Retrieval

Implement a cosine similarity search over `doc_embeddings` for a given query.

- Write a function `dense_search(query: str, k: int = 5) -> list[int]` that returns the indices of the top-k documents.
- Use the same embedding model to encode the query.
- Use cosine similarity for ranking.

**Report**:
- Results for the provided query showing the indices of the top results.
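
To make the ranking step concrete, here is a minimal NumPy sketch of cosine top-k retrieval on toy vectors. It assumes the query and documents are already embedded; `top_k_cosine` and the toy matrix are hypothetical and stand in for the real model output.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k rows of doc_matrix most similar to query_vec."""
    # L2-normalize both sides so that the dot product equals cosine similarity.
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    unit_docs = doc_matrix / np.maximum(doc_norms, 1e-12)
    unit_query = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)

    sims = unit_docs @ unit_query  # one similarity score per document
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims, kind="stable")[:k]
    return order.tolist()


# Toy 2-dimensional "embeddings" (assumption: real ones have 384 dimensions).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])
top2 = top_k_cosine(toy_query, toy_docs, k=2)
print(top2)
```

Cosine similarity ignores vector length, so a long and a short document pointing in the same direction score identically.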


In [10]:
# TODO: Implement dense retrieval using cosine similarity.
# Function signature to implement:
# def dense_search(query: str, k: int = 5) -> list[int]:

# Your code here
from sklearn.metrics.pairwise import cosine_similarity

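def dense_search(query: str, k: int = 5) -> list[int]:
    # Encode the query with the same model used for the documents
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Cosine similarity between the query and every document embedding
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Sort by descending similarity and keep the top-k indices
    top_k_indices = np.argsort(similarities)[::-1][:k]

    return top_k_indices.tolist()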


# Report
print("Query: 'How do people feel about their jobs?'")

results = dense_search("How do people feel about their jobs?", k=5)
print(f"\nTop 5 document indices: {results}")

print("\nTop 5 results with similarity scores:")
query_embedding = model.encode(["How do people feel about their jobs?"], convert_to_numpy=True)
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

for rank, idx in enumerate(results, 1):
    print(f"\n{rank}. Document {idx} (similarity: {similarities[idx]:.4f})")

Query: 'How do people feel about their jobs?'

Top 5 document indices: [14, 20, 8, 5, 18]

Top 5 results with similarity scores:

1. Document 14 (similarity: 0.3324)

2. Document 20 (similarity: 0.2358)

3. Document 8 (similarity: 0.2096)

4. Document 5 (similarity: 0.2043)

5. Document 18 (similarity: 0.2034)



## 4.4 – Build a Vector Search Index

Build a lightweight vector index structure to enable repeated querying efficiently.

- You may reuse `doc_embeddings` directly or create an index structure. Ensure the index can return top-k document indices given a query vector.
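
One way to read "index structure" here is: do the per-document work once, then answer each query with a single matrix-vector product. The sketch below is a hypothetical `CosineIndex` class (not a library API) that pre-normalizes the embeddings at construction time and uses `np.argpartition` to avoid a full sort when only the top k are needed.

```python
import numpy as np


class CosineIndex:
    """Tiny exact cosine-similarity index over a fixed embedding matrix."""

    def __init__(self, embeddings):
        emb = np.asarray(embeddings, dtype=np.float32)
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        # Normalize once so each query only needs a dot product.
        self._unit = emb / np.maximum(norms, 1e-12)

    def search(self, query_vec, k: int = 5) -> list[int]:
        q = np.asarray(query_vec, dtype=np.float32).ravel()
        q = q / max(float(np.linalg.norm(q)), 1e-12)
        sims = self._unit @ q

        k = min(k, sims.shape[0])  # never ask for more documents than exist
        if k <= 0:
            return []
        # Select the k best candidates, then sort only those k.
        candidates = np.argpartition(-sims, k - 1)[:k]
        ordered = candidates[np.argsort(-sims[candidates])]
        return ordered.tolist()


# Toy 2-dimensional vectors stand in for real document embeddings.
index = CosineIndex([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(index.search([1.0, 0.1], k=2))
```

For a corpus of a few dozen documents the difference from a plain sort is negligible; the partition step only starts to matter when the matrix has many rows.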


In [11]:
# TODO: Initialize a vector index over `doc_embeddings`
# Keep code minimal. The goal is to enable fast top-k retrieval for repeated queries.

# Your code here
from sklearn.neighbors import NearestNeighbors

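# Initialize a brute-force nearest-neighbors index using cosine distance
# (cosine distance = 1 - cosine similarity, so the nearest neighbors are the most similar)
vector_index = NearestNeighbors(
    n_neighbors=min(20, len(doc_embeddings)),  # Default number of neighbors returned per query
    metric='cosine',
    algorithm='brute'  # Exact search; fine for small to moderate dataset sizes
)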

vector_index.fit(doc_embeddings)

# Helper function for querying the index
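def search_index(query: str, k: int = 5) -> list[int]:
    # Encode the query with the same model used for the documents
    query_embedding = model.encode([query], convert_to_numpy=True)

    # kneighbors raises an error if k exceeds the number of indexed documents
    k = min(k, len(doc_embeddings))

    # Find the k nearest neighbors (smallest cosine distance first)
    _, indices = vector_index.kneighbors(query_embedding, n_neighbors=k)

    return indices[0].tolist()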

# Test the index
print("\nTest query:")
results = search_index("How do people feel about their jobs?", k=5)
print(f"Top 5 results: {results}")


Test query:
Top 5 results: [14, 20, 8, 5, 18]



## 4.5 – RAG (Retrieval-Augmented Generation)

Implement a simple RAG pipeline that:
1) Retrieves the top-k documents for a user query using your vector index.
2) Builds a prompt that includes the query and the retrieved document snippets.
3) Uses a text generation model (your choice) to produce an answer grounded in the retrieved snippets.

- Implement a function `rag_answer(query: str, k: int = 5) -> str`.
- Keep the prompt simple and state clearly that the model should rely on the provided context.
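
Base models such as GPT-2 are not instruction-tuned, so they tend to continue the prompt pattern with invented `Question:`/`Answer:` pairs after the first answer. The sketch below separates prompt construction from a small post-processing step that cuts the generation at the first such marker. `build_prompt`, `trim_answer`, and the stop markers are hypothetical choices for illustration, and no generation model is called here.

```python
def build_prompt(query: str, snippets: list[str], max_chars: int = 300) -> str:
    """Assemble a context-grounded prompt from retrieved snippets."""
    context = "\n\n".join(
        f"[Document {i}]: {text[:max_chars]}" for i, text in enumerate(snippets, 1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


def trim_answer(generated: str, stop_markers=("\nQuestion:", "\nContext:")) -> str:
    """Keep only the text before the first stop marker."""
    cut = len(generated)
    for marker in stop_markers:
        pos = generated.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return generated[:cut].strip()


prompt = build_prompt("How do people feel about their jobs?", ["aaa", "bbb"])
# Hypothetical raw continuation from a text generation model.
raw = " They like it.\n\nQuestion: Is this good?\n\nAnswer: Yes!"
print(trim_answer(raw))
```

Trimming does not make the answer more grounded; it only removes the extra turns the model appends after its first answer.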


In [12]:

# TODO: Implement a minimal RAG pipeline.
# Steps (sketch):
# - Use `dense_search` to get top-k indices.

# Your code here
from transformers import pipeline

# Initialize a text generation model (using a lightweight model for speed)
print("Loading text generation model...")
generator = pipeline(
    "text-generation",
    model="gpt2",  # Simple, fast model - you can use larger models if needed
    max_length=512,
    truncation=True
)



def rag_answer(query: str, k: int = 5) -> str:
    # Step 1: Retrieve top-k documents using vector search
    top_indices = search_index(query, k=k)
    
    # Step 2: Build context from retrieved documents
    context_parts = []
    for i, idx in enumerate(top_indices, 1):
        doc_text = doc_texts[idx]  # same order as the rows of doc_embeddings
        # Truncate long documents to keep the prompt manageable
        snippet = doc_text[:300]
        context_parts.append(f"[Document {i}]: {snippet}")
    
    context = "\n\n".join(context_parts)
    
    # Step 3: Build prompt with clear instructions
    prompt = f"""Answer the following question based ONLY on the provided context. If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    
    # Step 4: Generate answer
    response = generator(
        prompt,
        max_new_tokens=150,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id
    )
    
    # Extract generated text (remove the prompt)
    full_text = response[0]['generated_text']
    answer = full_text[len(prompt):].strip()
    
    return answer


# Test the RAG pipeline
print("Testing RAG Pipeline")

test_query = "How do people feel about their jobs?"
print(f"\nQuery: {test_query}\n")

answer = rag_answer(test_query, k=3)
print(f"Answer: {answer}")

Loading text generation model...


Device set to use cpu


Testing RAG Pipeline

Query: How do people feel about their jobs?

Answer: They're really glad for their jobs.  They're proud of what they're doing.  They're a little bit sad that they aren't doing some of that kind of stuff.

Question: Do people really think that the world will be better if men and women like themselves?

Answer: Yes.  The world is a better place if men and women like each other.  They're great at what they do.

Question: Is this a good idea? 

Answer: Yes!

Question: How do you feel about the idea of a world where women have a say in how men and women are treated? 

Answer: Well, I don't like it.  I don't like


## 4.6 – Evaluation

Use the following queries for your evaluation. For each query:

- Run `dense_search(query, k=5)` to retrieve relevant documents.
- Use `rag_answer(query, k=5)` to generate an answer using the top-5 retrieved documents.

**Queries:**
1. How do people deal with breakups?
2. What do bloggers write about their daily routines?
3. How do people feel about their jobs?
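
If you want a number alongside the qualitative inspection, one simple option is precision@k: the fraction of the top-k retrieved documents that you judge relevant after reading them. The sketch below assumes such manual judgments exist; the relevance set shown is a made-up placeholder, not an actual judgment of the corpus.

```python
def precision_at_k(retrieved: list[int], relevant: set[int], k: int = 5) -> float:
    """Fraction of the top-k retrieved indices that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for idx in retrieved[:k] if idx in relevant)
    return hits / k


# Hypothetical example: 5 retrieved indices, 2 of them judged relevant by hand.
retrieved_example = [5, 3, 1, 0, 11]
relevant_example = {3, 1}  # placeholder judgments, for illustration only
score = precision_at_k(retrieved_example, relevant_example, k=5)
print(f"precision@5 = {score:.2f}")
```

With only three queries the score is anecdotal, but it makes comparisons between retrieval settings (for example different values of `k`) easier to state.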


In [13]:
# Do not change this code
queries = [
    "How do people deal with breakups?",
    "What do bloggers write about their daily routines?",
    "How do people feel about their jobs?"
]

In [14]:
# TODO: Run and report your evaluation as described above.

def run_batch_evaluation(queries, k=5):
    for i, query in enumerate(queries, 1):
        print("=" * 100)
        print(f"Q{i}: {query}")
        print("-" * 100)

        top_k = dense_search(query, k=k)
        print(f"Top-{k} retrieved indices:", top_k)
        print("\nTop retrieved snippets:")
        for idx in top_k:
            doc_text = doc_texts[idx]  # same order as the rows of doc_embeddings
            snippet = doc_text.replace("\n", " ").strip()
            print(f"[{idx}] {snippet[:200]}...\n")

        print("RAG answer:\n")
        answer = rag_answer(query, k=k)
        print(answer)
        print("\n")

# Run the evaluation
run_batch_evaluation(queries, k=5)

Q1: How do people deal with breakups?
----------------------------------------------------------------------------------------------------
Top-5 retrieved indices: [5, 3, 1, 0, 11]

Top retrieved snippets:
[5] Sometimes it's the little things that make life bearable.  Going to a 24 hour post office, mailing priority mail packages.  Go there about 11:30, make a quick stop at Steak N Shake on the way home, an...

[3] Tonight I was organizing my friends list on yahoo, deleting people I don't talk to anymore and such. I realized I still have my ex-boyfriend's screen name listed. I very rarely talk to him, but its th...

[1] I've been thinking a lot lately. Here I am, on the verge of turning 24 years old, living the life of someone twice that age. I'm tired of each day being routine. I wake up and instead of wondering wha...

[0] Recently I was told that I'm obsessive compulsive. I was even compared to the character Monica on the former  Friends  sitcom. At first I didn't agree with that id