# Semestral Home Assignment 
In the semestral home assignment, you are tasked with designing and implementing a production-ready information retrieval (IR) system using Qdrant. <br>
First, you will need to implement a scalable Qdrant cluster following the principles of NoSQL (sharding, replication quorum). <br>
Then, you will implement vector search with Qdrant using the advanced features of the vector database. <br>

Author: Sepideh Sanedoust Karseidani, učo 554733      
Year: 2026
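
The sharding and replication quorum principles mentioned in the introduction can be illustrated with a little arithmetic. The following is a minimal sketch in plain Python, not Qdrant code: the function names are hypothetical, and the modulo-based routing is an assumption chosen for simplicity (a real cluster may route points differently).

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("replicas must be positive")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of replicas that may be down while a majority is still reachable."""
    return replicas - majority_quorum(replicas)


def shard_for(point_id: int, shard_count: int) -> int:
    """Hypothetical routing rule: assign a point to a shard by modulo."""
    return point_id % shard_count


# Print quorum size and tolerated failures for a few replica counts.
for n in (1, 2, 3, 5):
    print(n, majority_quorum(n), tolerated_failures(n))
```

With three replicas, two acknowledgements form a majority, so one node can fail without blocking quorum writes.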

In [1]:
%cd ../

d:\Documents\MUNI\4-Semester\PA195 NoSQL Databases\VectorDB\pa195_semestral_assignment_2025


In [2]:
%load_ext autoreload
%autoreload 2

## Setup

In [3]:
import json
import os
from typing import Any, cast, Callable

from datasets import load_dataset
from datasets.dataset_dict import DatasetDict
from datasets.dataset_dict import Dataset
from qdrant_client import QdrantClient
from qdrant_client.models import models
from qdrant_client.http.models.models import QueryResponse
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
from fastembed.sparse.sparse_embedding_base import SparseEmbedding
from dotenv import load_dotenv

from notebooks.utils import evaluate_retrieval

Load environment variables. **Do not forget to create a .env file in the root directory based on the .env.example file**.

In [4]:
load_dotenv("./.env")

True

Start up a local instance of Qdrant through Docker.

In [5]:
#!docker run -p 6333:6333 -p 6334:6334 -d --name qdrant-server qdrant/qdrant:v1.16

!docker-compose up -d

 Network pa195_semestral_assignment_2025_qdrant-cluster  Creating
 Network pa195_semestral_assignment_2025_qdrant-cluster  Created
 Container pa195_semestral_assignment_2025-qdrant_node_3-1  Creating
 Container pa195_semestral_assignment_2025-qdrant_node_1-1  Creating
 Container pa195_semestral_assignment_2025-qdrant_node_2-1  Creating
 Container pa195_semestral_assignment_2025-qdrant_node_3-1  Created
 Container pa195_semestral_assignment_2025-qdrant_node_2-1  Created
 Container pa195_semestral_assignment_2025-qdrant_node_1-1  Created
 Container pa195_semestral_assignment_2025-qdrant_node_1-1  Starting
 Container pa195_semestral_assignment_2025-qdrant_node_3-1  Starting
 Container pa195_semestral_assignment_2025-qdrant_node_2-1  Starting
 Container pa195_semestral_assignment_2025-qdrant_node_2-1  Started
 Container pa195_semestral_assignment_2025-qdrant_node_3-1  Started
 Container pa195_semestral_assignment_2025-qdrant_node_1-1  Started


Initialize the Qdrant client by connecting to the server running as a Docker container.

In [6]:
client = QdrantClient(host=os.environ["QDRANT_HOST"], port=int(os.environ["QDRANT_PORT"]))

## Dataset

### Task 1 - Data Loading
Load the data from the Hugging Face dataset [Zovi3/pa195_semestral_assignment](https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/upload/main), explore it and extract/preprocess it if necessary.

In [7]:
# TODO: Import query dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
query_dataset: Dataset = load_dataset(
    "json",
    data_files="hf://datasets/Zovi3/pa195_semestral_assignment/query-all-MiniLM-L6-v2-100-filters-embedded-results/train.jsonl",
    split="train",
)

In [8]:
# TODO: Import documents dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
documents: Dataset = load_dataset(
    "json",
    data_files="hf://datasets/Zovi3/pa195_semestral_assignment/corpus-all-MiniLM-L6-v2-50K-groups-multi-vector/train.jsonl",
    split="train",
)

## Models Setup

### Embedding Model

In this homework you will work with `sentence-transformers/all-MiniLM-L6-v2` from the fastembed library. <br>
The document embeddings are precomputed for you in the assignment dataset, but you will need to use the model when running the queries.

In [9]:
# Document embeddings are precomputed; the model is loaded here only to embed the queries
embedding_model = TextEmbedding('sentence-transformers/all-MiniLM-L6-v2')
embedding_model_size = 384

### Sparse Retrieval Model
Some queries require prioritizing certain keywords. <br>
Therefore, you will need to use the BM25 algorithm to boost documents containing these keywords during retrieval. <br>
Note that BM25 is not taken into account in the dataset, so you will need to apply it yourself when uploading and indexing the data.

In [10]:
bm25_model = SparseTextEmbedding("Qdrant/bm25")
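
To make the keyword boosting idea concrete, here is a minimal sketch of a common BM25 formulation on a toy tokenized corpus. It is an illustration only: the corpus, the function names and the parameters `k1` and `b` are assumptions, and the tokenization and weighting used internally by `Qdrant/bm25` may differ.

```python
import math
from collections import Counter


def idf(term: str, docs: list[list[str]]) -> float:
    """Inverse document frequency: rarer terms receive a larger weight."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n_t + 0.5) / (n_t + 0.5))


def bm25_score(query_terms: list[str], doc: list[str], docs: list[list[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Sum of IDF-weighted, length-normalized term frequencies."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # Terms absent from the document contribute nothing
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf(term, docs) * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus (hypothetical data, already tokenized).
corpus = [
    ["qdrant", "vector", "database"],
    ["vector", "search", "engine"],
    ["keyword", "search", "bm25"],
]
print(bm25_score(["qdrant", "vector"], corpus[0], corpus))
```

A term that appears in fewer documents (here `qdrant`) gets a higher IDF than a more common one (`vector`), which is why rare query keywords dominate the sparse score.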

### Multi-Vector Model
It is generally good practice to include a reranking model in an IR system. <br>
Reranking uses a stronger model to select the most relevant documents from the initial retrieval. <br>
You will implement reranking with the multi-vector late-interaction embedding model ColBERT.

In [11]:
# Document embeddings are precomputed; the model is loaded here only to embed the queries
multi_vector_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
multi_vector_model_size = 128
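
Late interaction compares every query token vector with every document token vector and keeps, for each query token, only its best match. The sketch below shows this MaxSim scoring in plain Python on tiny two-dimensional vectors; the vectors and function names are hypothetical, and real ColBERT embeddings have 128 dimensions per token.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_sim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # Covers both query tokens
doc_b = [[1.0, 0.0]]              # Covers only the first query token
print(max_sim(query, doc_a), max_sim(query, doc_b))
```

A document that matches every query token scores higher than one matching only some of them, which is the behavior reranking relies on.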

## Database Configuration

### Task 2 - Data Modelling
In this task you will create a proper data model for your data, including vector representations, index configuration, distance functions and more.

#### Task 2.1 - HNSW Index Configuration
Configure the HNSW index for the retrieval. <br>
**Change the ef_construct parameter to 64 to speed up the build time at the cost of recall.** <br>
We do this for practical reasons, so that you can iterate over the notebook faster.

In [12]:
# Change ef_construct parameter to 64 to speed up the build time at the cost of recall
ef_construct = 64
# TODO Configure HNSW index
hnsw_config = models.HnswConfigDiff(ef_construct=ef_construct)

#### Task 2.2 - Collection Creation
Create a model for your data. You should create three vector representations for your data. <br>
There should be one representation for each model defined above. <br>
For the multi-vector model, make sure to disable the vector index, since it will be used only for reranking. <br>
Also, do not forget that multi-vector similarity is not computed through cosine similarity alone (check the lecture for more info). <br>
Configure the proper modifier for the sparse vector.

In [13]:
COLLECTION_NAME = "ms_macro"

In [14]:
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Deleted existing collection: {COLLECTION_NAME}")
except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
    print(f"Collection {COLLECTION_NAME} does not exist")


# TODO: Configure collection creation
# Dense and sparse enable hybrid search; multi-vector is for reranking to improve precision.  
collection_created = client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=embedding_model_size,  # 384-dim
            distance=models.Distance.COSINE,
            hnsw_config=hnsw_config,
        ),
        "multi_vector": models.VectorParams(
            size=multi_vector_model_size,   # 128-dim
            distance=models.Distance.DOT,  # Per-token similarity; MaxSim keeps the best dot product per query token
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM  # Late-interaction (ColBERT-style) scoring
            ),
            hnsw_config=models.HnswConfigDiff(m=0),  # m=0 disables HNSW graph building; None would fall back to the collection default
        ),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            modifier=models.Modifier.IDF,  # Applying inverse document frequency weighting for better BM25 scoring
        ),
    },
    on_disk_payload=True,
)

if collection_created:
    print(f"Created collection '{COLLECTION_NAME}'.")
else:
    print("Collection creation failed")


Deleted existing collection: ms_macro


  collection_created = client.recreate_collection(


Created collection 'ms_macro'.


#### Task 2.3 - Create Payload Index & Disable Quantization
Configure a keyword payload index for the `groups` field. Make sure that the payload index is on-disk.

In [15]:
# TODO: Create payload index
payload_index_created = client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="groups",
    field_schema=models.KeywordIndexParams(
        type="keyword",
        on_disk=True,
    ),
)

# Disable quantization
client.update_collection(
    collection_name=COLLECTION_NAME,
    quantization_config=models.Disabled.DISABLED,
)

if payload_index_created:
    print("Payload index created for field 'groups'")

Payload index created for field 'groups'
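
Conceptually, a keyword payload index maps each keyword value to the points that carry it, so a filter can be answered without scanning every payload. The following is a minimal sketch of that idea in plain Python; the toy payloads and function names are hypothetical, and Qdrant's on-disk index is implemented differently.

```python
from collections import defaultdict


def build_keyword_index(payloads: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each keyword value to the set of point ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for point_id, groups in payloads.items():
        for group in groups:
            index[group].add(point_id)
    return dict(index)


def match_any(index: dict[str, set[int]], values: list[str]) -> set[int]:
    """Return the ids of points that carry at least one of the given values."""
    result: set[int] = set()
    for value in values:
        result |= index.get(value, set())
    return result


# Toy payloads (hypothetical data): point id -> values of the `groups` field.
toy_payloads = {1: ["group_1"], 2: ["group_2"], 3: ["group_1", "group_3"]}
toy_index = build_keyword_index(toy_payloads)
print(sorted(match_any(toy_index, ["group_1"])))
```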


### Task 3 - Data Upload
Upload the vector embeddings and metadata to the created collection; make sure to upload the vectors' metadata as well.

In [16]:
points: list[models.PointStruct] = []

bm25_iter = bm25_model.embed(documents["text"])

print("Generating sparse embeddings...")
doc: dict[str, Any]
for i, doc in enumerate(documents):  # type: ignore
    if i % 10000 == 0:  # Print every 10000 documents
        print(f"Processing document {i}/{len(documents)}")

    # TODO: Implement data upload
    # Generate Sparse Vector (BM25) on the fly
    sparse_result = next(bm25_iter)
    
    # Convert to Qdrant SparseVector
    sparse_vector = models.SparseVector(
        indices=sparse_result.indices.tolist(),
        values=sparse_result.values.tolist()
    )

    point = models.PointStruct(
        id=doc["id"],
        vector={
            "dense": doc["embedding"],  # Precomputed dense embedding
            "sparse": sparse_vector,    # BM25 sparse embedding
            "multi_vector": doc["multi_vector_embedding"],  # Precomputed ColBERT multi-vector
        },
        payload={
            "text": doc["text"],
            "groups": doc["groups"],
        },
    )
    points.append(point)

print("Upserting documents...")
client.upload_points(collection_name=COLLECTION_NAME, points=points, batch_size=128)

print(f"Collection info: {client.get_collection(COLLECTION_NAME).points_count} points in collection")
assert client.get_collection(COLLECTION_NAME).points_count == len(documents), f"Expected {len(documents)} points in collection, got {client.get_collection(COLLECTION_NAME).points_count}"

Generating sparse embeddings...
Processing document 0/50000
Processing document 10000/50000
Processing document 20000/50000
Processing document 30000/50000
Processing document 40000/50000
Upserting documents...
Collection info: 50000 points in collection


## Querying

### Task 4 - Design Complex Query
Your task is to design a complex query that will include hybrid search, filtering, reranking and metadata boosting. <br>
**The result of this task should be one Qdrant query (do not add any postprocessing logic outside of the Qdrant query)!**
 
**Subtasks:**
1. Define a query filter on the `groups` field; do not forget that the query may contain filter values.
    - Think about which prefetch the filter should be applied in.
2. Define the sparse and dense search prefetches; the limit for the retrieval should be 100 objects.
3. Define fusion of the two rankings with Reciprocal Rank Fusion (RRF).
4. Rerank the results with the ColBERT multi-vector model; use 50 documents for reranking.
5. Boost the results with metadata weighting; use `group_1` with weight 0.05 and `group_2` with weight 0.1.
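
The fusion step from the subtasks above can be sketched in plain Python. Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over all rankings that contain it. This is a minimal illustration with hypothetical document ids; ranks start at 1 here, which is an assumption and may not match the exact convention used inside Qdrant.

```python
def rrf(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse several rankings; documents ranked high in many lists score best."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rankings (hypothetical ids), best match first.
dense_ranking = [1, 2, 3]
sparse_ranking = [3, 1, 4]
fused = rrf([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both rankings (ids 1 and 3) end up ahead of documents found by only one retriever, and a larger `k` flattens the advantage of the very top ranks.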


In [None]:
def build_sparse_query_text(query_text: str, filter_values: list[str]) -> str:
    if filter_values:
        return f"{query_text} {' '.join(filter_values)}"
    return query_text

def rag_context_retrieval(query: dict[str, Any]) -> QueryResponse:
    # TODO: Implement correct embeddings usage
    # Generate query embeddings using pre-trained models
    query_dense_embedding = list(embedding_model.embed([query['text']]))[0]
    
    sparse_input_text = build_sparse_query_text(query['text'], query.get('filters', []))
    query_sparse_result = list(bm25_model.embed([sparse_input_text]))[0]
    query_sparse_embedding = models.SparseVector(
        indices=query_sparse_result.indices.tolist(),
        values=query_sparse_result.values.tolist()
    )
    
    query_multi_vector_embedding = list(multi_vector_model.embed([query['text']]))[0]

    # Task 4.1 - Define query filter
    filter_condition = None  # No filter unless the query carries filter values
    if query.get('filters'):
        filter_condition = models.Filter(
            must=[models.FieldCondition(key="groups", match=models.MatchAny(any=query['filters']))]
        )

    # Candidates for fusion/reranking
    sparse_limit = 100
    dense_limit = 100
    # Task 4.2 - Define sparse and dense search. Set their limit to 100.
    prefetch_sparse_and_dense_search: list[models.Prefetch] = [
        # TODO: Implement sparse and dense prefetches
        models.Prefetch(
            query=query_dense_embedding,
            using="dense",
            filter=filter_condition,
            limit=dense_limit
        ),
        models.Prefetch(
            query=query_sparse_embedding,
            using="sparse",
            filter=filter_condition,
            limit=sparse_limit
        )
    ]

    # Task 4.3 - Define fusion of the two rankings (set the k parameter of the query to 60 to mitigate effect of high rankings)
    rrf_k = 60

    prefetch_fused_rankings: list[models.Prefetch] = [
        # TODO: Implement rank fusion
        models.Prefetch(
            query=models.RrfQuery(rrf=models.Rrf(k=rrf_k)),  # Tune RRF k for fusion; k=60 is a good default
            prefetch=prefetch_sparse_and_dense_search,
        )
    ]

    # Task 4.4 - Rerank the results with ColBERT multi-vector model taking 50 documents.
    reranking_limit = 50  # Tunable for better precision (more documents, improving selection)
    prefetch_multi_vector_reranking: list[models.Prefetch] = [
        # TODO: Implement multi-vector reranking
        models.Prefetch(
            query=query_multi_vector_embedding,
            using="multi_vector",
            prefetch=prefetch_fused_rankings,
            limit=reranking_limit
        )
    ]
    
    group_1_boost_weight = 0.05  # Tunable for better precision
    group_2_boost_weight = 0.1  # Tunable for better precision
    final_query_limit = 10
    # Task 4.5 - Boost following "groups" in the search: "group_1" with weight 0.05 and "group_2" with weight 0.1
    final_result: QueryResponse = client.query_points(
        collection_name=COLLECTION_NAME,
        # TODO: Implement final query with metadata boosting
        # TODO: This query should be built from all the prefetches
        prefetch=prefetch_multi_vector_reranking,
        query=models.FormulaQuery(
            formula=models.SumExpression(sum=[
                "$score",
                models.MultExpression(mult=[group_1_boost_weight, models.FieldCondition(key="groups", match=models.MatchAny(any=["group_1"]))]),
                models.MultExpression(mult=[group_2_boost_weight, models.FieldCondition(key="groups", match=models.MatchAny(any=["group_2"]))])
            ]
        )),
        using="multi_vector",
        limit=final_query_limit
    )

    return final_result
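
The effect of the boosting formula in the final query can be reproduced by hand. In a formula, a condition contributes 1.0 when it matches and 0.0 otherwise, so the final score is the reranking score plus the weights of the matching groups. The sketch below is plain Python with hypothetical candidates and scores, not output of the query above.

```python
def boosted_score(score: float, groups: list[str], weights: dict[str, float]) -> float:
    """Add the weight of every boosted group that the document belongs to."""
    return score + sum(weight for group, weight in weights.items() if group in groups)


boost_weights = {"group_1": 0.05, "group_2": 0.1}

# Hypothetical reranked candidates: (id, reranking score, groups).
candidates = [("a", 20.0, []), ("b", 19.95, ["group_2"]), ("c", 19.9, ["group_1"])]
reordered = sorted(
    candidates,
    key=lambda c: boosted_score(c[1], c[2], boost_weights),
    reverse=True,
)
print([doc_id for doc_id, _, _ in reordered])
```

Because the weights are small compared with typical MaxSim scores, the boost only reorders candidates whose reranking scores are already close.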

In [18]:
avg_retrieval_precision = evaluate_retrieval(rag_context_retrieval, query_dataset)

You achieved 0.879 enough to pass ✅!
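
The implementation of `evaluate_retrieval` is not shown in this notebook. As a rough illustration of what a retrieval precision metric measures, here is a minimal sketch in plain Python; the function names and sample ids are hypothetical, and this is not claimed to be the metric that `evaluate_retrieval` computes.

```python
def precision(retrieved: list[int], relevant: list[int]) -> float:
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)


def average_precision_over_queries(results: list[tuple[list[int], list[int]]]) -> float:
    """Mean of the per-query precision values."""
    if not results:
        return 0.0
    return sum(precision(ret, rel) for ret, rel in results) / len(results)


# Hypothetical (retrieved, relevant) id lists for two queries.
sample = [([1, 2, 3, 4], [2, 4, 9]), ([5, 6], [5, 6])]
print(average_precision_over_queries(sample))
```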


In [None]:
# from typing import Any, Callable
# from qdrant_client.http.models.models import QueryResponse
# from datasets import Dataset
# from typing import cast

# def calculate_recall(retrieved_docs: list[int], relevant_docs: list[int]) -> float:
#     if not relevant_docs:
#         return 1.0 if not retrieved_docs else 0.0  # Edge case: no relevant docs
    
#     relevant_retrieved = len(set(retrieved_docs) & set(relevant_docs))
#     return relevant_retrieved / len(relevant_docs)

# def evaluate_recall(rag_context_retrieval: Callable[[dict[str, Any]], QueryResponse], query_dataset: Dataset) -> float:
#     total_recall: float = 0.0
    
#     for i in range(len(query_dataset)):
#         query: dict[str, Any] = cast(dict[str, Any], query_dataset[i])
        
#         query_response: QueryResponse = rag_context_retrieval(query)
#         retrieved_docs: list[int] = [cast(int, point.id) for point in query_response.points]
        
#         relevant_docs: list[int] = query["result"]["point_ids"]
        
#         recall: float = calculate_recall(retrieved_docs, relevant_docs)
#         total_recall += recall
    
#     avg_recall = total_recall / len(query_dataset) if len(query_dataset) > 0 else 0.0
#     print(f"Average recall: {avg_recall}")
#     return avg_recall

# # Usage: Replace 'rag_context_retrieval' and 'query_dataset' with your variables
# avg_recall = evaluate_recall(rag_context_retrieval, query_dataset)
# if avg_recall >= 0.8:
#     print("Recall requirement met ✅!")
# else:
#     print("Recall below 80% ❌")

Average recall: 0.877
Recall requirement met ✅!
