# Semestral Home Assignment 
In the semestral home assignment you are tasked with designing and implementing a production-ready information retrieval (IR) system using Qdrant. <br>
First, you will need to implement a scalable Qdrant cluster based on NoSQL principles (sharding, replication quorum). <br>
Then, you will implement vector search with Qdrant using the advanced features of the vector database. <br>

In [None]:
%cd ../

In [None]:
%load_ext autoreload
%autoreload 2

## Setup

In [None]:
import json
import os
from typing import Any, cast, Callable

from datasets import load_dataset
from datasets.dataset_dict import DatasetDict
from datasets.dataset_dict import Dataset
from qdrant_client import QdrantClient
from qdrant_client.models import models
from qdrant_client.http.models.models import QueryResponse
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
from fastembed.sparse.sparse_embedding_base import SparseEmbedding
from dotenv import load_dotenv

from notebooks.utils import evaluate_retrieval

Load environment variables. **Do not forget to create a .env file in the root directory based on the .env.example file**.

In [None]:
load_dotenv("./.env")

Start a local instance of Qdrant with Docker.

In [None]:
!docker run -p 6335:6333 -p 6336:6334 -d --name qdrant-server qdrant/qdrant:v1.16

Initialize the Qdrant client by connecting to the server running as a Docker container.

In [None]:
client = QdrantClient(host=os.environ["QDRANT_HOST"], port=int(os.environ["QDRANT_PORT"]))

## Dataset

### Task 1 - Data Loading
Load the data from the Hugging Face dataset [Zovi3/pa195_semestral_assignment](https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main), explore it, and extract/preprocess it if necessary.

In [None]:
# TODO: Import query dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
query_dataset: Dataset = None

In [None]:
# TODO: Import documents dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
documents: Dataset = None

## Models Setup

### Embedding Model

In this homework you will work with `sentence-transformers/all-MiniLM-L6-v2` from the fastembed library. <br>
These embeddings are precomputed for you in the assignment dataset, but you will need to use the model when running the queries.

In [None]:
## Embeddings are precomputed so you can save some memory by not loading the model
# embedding_model = TextEmbedding('sentence-transformers/all-MiniLM-L6-v2')
embedding_model_size = 384

### Sparse Retrieval Model
Some queries require prioritizing certain keywords. <br>
Therefore, you will need to use the BM25 algorithm to boost the documents containing these keywords during retrieval. <br>
Note that BM25 is not taken into account in the dataset, so you will need to apply it when uploading and indexing the data.
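
To make the keyword boosting concrete, here is a minimal pure-Python sketch of textbook BM25 scoring. It is an illustration only, not the `Qdrant/bm25` implementation: the whitespace tokenization, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.2, b: float = 0.75
) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms receive a larger inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "qdrant is a vector database".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "vector search finds similar documents".split(),
]
scores = bm25_scores("keyword documents".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The second document matches both query terms, including the rare term `keyword`, so it receives the highest score, while the first document matches nothing and scores zero.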

In [None]:
bm25_model = SparseTextEmbedding("Qdrant/bm25")

### Multi-Vector Model
It is generally good practice to include a reranking model in an IR system. <br>
Reranking uses a stronger model to select the most relevant documents from the initial retrieval. <br>
You will implement reranking with the multi-vector late-interaction embedding model ColBERT.
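
As a rough illustration of late interaction, the sketch below computes a MaxSim score: each query token vector is matched with its most similar document token vector, and the per-token maxima are summed. The function name `max_sim` and the 2-dimensional toy vectors are assumptions for the example; the actual ColBERT token embeddings in this assignment have 128 dimensions.

```python
def max_sim(query_vectors: list[list[float]], doc_vectors: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the best dot product with any document token."""
    total = 0.0
    for q in query_vectors:
        # Best-matching document token for this query token.
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vectors)
    return total


# Hypothetical token embeddings (one inner list per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[1.0, 0.0], [0.7, 0.3]]

score_a = max_sim(query_tokens, doc_a)  # both query tokens find an exact match
score_b = max_sim(query_tokens, doc_b)  # the second query token only matches weakly
```

Because every document token must be compared with every query token, this scoring is more expensive than a single-vector comparison, which is why it is typically applied only to a small candidate set.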

In [None]:
## Embeddings are precomputed so you can save some memory by not loading the model
# multi_vector_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
multi_vector_model_size = 128

## Database Configuration

### Task 2 - Data Modelling
In this task you will create a proper data model for your data, including vector representations, index configuration, distance functions and more.

#### Task 2.1 - HNSW Index Configuration
Configure the HNSW index for the retrieval. <br>
**Change the `ef_construct` parameter to 64 to speed up the index build at the cost of recall.** <br>
We do this for practical reasons, so that you can iterate over the notebook faster.

In [None]:
# Change the ef_construct parameter to 64 to speed up the index build at the cost of recall
ef_construct = 64
# TODO: Configure HNSW index
hnsw_config = None

#### Task 2.2 - Collection Creation
Create a model for your data. You should create three vector representations for your data. <br>
There should be one representation for each model defined above. <br>
For the multi-vector model, make sure to disable the vector index, since it will be used only for reranking. <br>
Also, remember that multi-vector similarity is not computed with cosine similarity alone (check the lecture for more info). <br>
Configure the proper modifier for the sparse vector.

In [None]:
COLLECTION_NAME = "ms_macro"

In [None]:
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Deleted existing collection: {COLLECTION_NAME}")
except Exception:
    print(f"Collection {COLLECTION_NAME} does not exist")


# TODO: Configure collection creation  
collection_created = False #client.create_collection(
#    collection_name=COLLECTION_NAME,
# )

if collection_created:
    print(f"Created collection '{COLLECTION_NAME}'.")
else:
    print("Collection creation failed")


#### Task 2.3 - Create Payload Index & Disable Quantization
Configure a keyword payload index for the `groups` field. Make sure that the payload index is stored on disk.

In [None]:
# TODO: Create payload index
payload_index_created = False # client.create_payload_index(
#    collection_name=COLLECTION_NAME,
# )

if payload_index_created:
    print("Payload index created for field 'groups'")

### Task 3 - Data Upload
Upload the vector embeddings and metadata to the created collection. Make sure to upload the metadata together with the vectors.

In [None]:
points: list[models.PointStruct] = []

doc: dict[str, Any]
for doc in documents: # type: ignore
    # TODO: Implement data upload
    pass

print("Upserting documents...")
client.upload_points(collection_name=COLLECTION_NAME, points=points, batch_size=128)

print(f"Collection info: {client.get_collection(COLLECTION_NAME).points_count} points in collection")
assert client.get_collection(COLLECTION_NAME).points_count == len(documents), f"Expected {len(documents)} points in collection, got {client.get_collection(COLLECTION_NAME).points_count}"

## Querying

### Task 4 - Design Complex Query
Your task is to design a complex query that will include hybrid search, filtering, reranking and metadata boosting. <br>
**The result of this task should be one Qdrant query (do not add any postprocessing logic outside of the Qdrant query)!**
 
**Subtasks:**
1. Define a query filter on the `groups` field; do not forget that the query itself can contain filter values.
    - Think about which prefetch the filter should be applied in.
2. Define the sparse and dense search prefetches; the retrieval limit for each should be 100 objects.
3. Define fusion of the two rankings with Reciprocal Rank Fusion (RRF).
4. Rerank the results with the ColBERT multi-vector model, using 50 documents for reranking.
5. Boost the results with metadata weighting, use `group_1` with weight 0.05 and `group_2` with weight 0.1.
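
To illustrate subtasks 3 and 5 outside of Qdrant, the following pure-Python sketch fuses two rankings with the textbook RRF formula, `1 / (k + rank)`, and then applies an additive metadata boost. Everything here is an assumption made for the example: the document IDs, the group assignments, the 1-based rank convention, and the additive form of the boost. Qdrant's internal formulas may differ, the ColBERT reranking step is skipped for brevity, and the assignment itself still requires all of this to happen inside a single Qdrant query.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: each ranking adds 1 / (k + rank) to a document's score."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def boost(score: float, groups: list[str], weights: dict[str, float]) -> float:
    """Add the weight of every boosted group the document belongs to."""
    return score + sum(w for g, w in weights.items() if g in groups)


# Hypothetical results of the sparse and dense prefetches (best match first).
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = rrf_fuse([sparse_ranking, dense_ranking], k=60)
fused_order = sorted(fused, key=fused.get, reverse=True)

# Hypothetical payloads and the boost weights from subtask 5.
doc_groups = {"d1": [], "d2": ["group_1"], "d3": [], "d4": ["group_2"]}
weights = {"group_1": 0.05, "group_2": 0.1}

boosted = {doc_id: boost(s, doc_groups[doc_id], weights) for doc_id, s in fused.items()}
boosted_order = sorted(boosted, key=boosted.get, reverse=True)
```

Documents that appear in both rankings (`d1`, `d3`) lead after fusion. Raw RRF scores are very small, so in this toy setup the boosts reorder the list completely; in the real pipeline the boost is added to the reranking scores, which live on a different scale.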


In [None]:
def rag_context_retrieval(query: dict[str, Any]) -> QueryResponse:
    # TODO: Implement correct embeddings usage
    query_dense_embedding: list[float] = []
    query_sparse_embedding: SparseEmbedding = None
    query_multi_vector_embedding: list[list[float]] = []

    # Task 4.1 - Define query filter
    filter_condition: models.Filter = None  # TODO: Implement filters

    
    sparse_limit = 100
    dense_limit = 100
    # Task 4.2 - Define sparse and dense search. Set their limit to 100.
    prefetch_sparse_and_dense_search: list[models.Prefetch] = [
        # TODO: Implement sparse and dense prefetches
    ]

    # Task 4.3 - Define fusion of the two rankings (set the k parameter of the query to 60 to mitigate the effect of high rankings)
    rrf_k = 60
    prefetch_fused_rankings: list[models.Prefetch] = [
        # TODO: Implement rank fusion
    ]

    # Task 4.4 - Rerank the results with ColBERT multi-vector model taking 50 documents.
    reranking_limit = 50
    prefetch_multi_vector_reranking: list[models.Prefetch] = [
        # TODO: Implement multi-vector reranking
    ]
    
    group_1_boost_weight = 0.05
    group_2_boost_weight = 0.1
    final_query_limit = 10
    # Task 4.5 - Boost following "groups" in the search: "group_1" with weight 0.05 and "group_2" with weight 0.1
    final_result: QueryResponse = client.query_points(
        collection_name=COLLECTION_NAME,
        # TODO: Implement final query with metadata boosting
        # TODO: This query should be built from all the prefetches
    )

    return final_result

In [None]:
avg_retrieval_precision = evaluate_retrieval(rag_context_retrieval, query_dataset)