# Hybrid Search Experiment (Vector + BM25)

This notebook demonstrates the **Hybrid Search** capabilities of the `tandon_ai_doc_intel` library.
We will index a few sample documents and compare retrieval results using:
1. Pure Vector Search (Semantic)
2. Hybrid Search (Semantic + BM25 keyword search, fused with Reciprocal Rank Fusion, RRF)
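
For readers new to RRF (Reciprocal Rank Fusion), the sketch below shows the general idea: each document earns `1 / (k + rank)` from every ranked list it appears in, and the sums decide the fused order. This is a generic illustration, not necessarily how `tandon_ai_doc_intel` implements `hybrid_search`. The function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used default), and the two example rankings are assumptions made up for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings for illustration only (not real output from the store)
vector_ranking = ["doc_0", "doc_4", "doc_2"]
keyword_ranking = ["doc_0", "doc_5", "doc_4"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

A document that ranks well in both lists (here `doc_0`) ends up ahead of one that appears in only a single list, which is why RRF tends to reward agreement between the semantic and keyword retrievers.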

In [None]:
import os
import sys

# Ensure we can import the local library
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

from tandon_ai_doc_intel.embeddings import VectorStore, OpenAIEmbeddings

# Initialize Components
# OPENAI_API_KEY should be set in your environment. The "sk-..." fallback is only a
# placeholder; replace it with a real key, or authentication will fail.
api_key = os.getenv("OPENAI_API_KEY", "sk-...")
embedder = OpenAIEmbeddings(api_key=api_key)
store = VectorStore(collection_name="demo_hybrid_experiment")

## 1. Index Sample Documents
We will index six synthetic documents. Some overlap semantically (the two Q3 finance documents), while others share a distinctive keyword (the two PostgreSQL documents).

In [None]:
docs = [
    "The financial report for Q3 shows a profit of $5M due to unexpected sales.",
    "The engineering team migrated the database to PostgreSQL for better performance.",
    "Compliance risks were identified in the new vendor agreement regarding data privacy.",
    "The office party is scheduled for Friday at 5 PM in the main hall.",
    "Q3 revenue was down, but profit margins increased significantly.",
    "PostgreSQL is a powerful, open source object-relational database system."
]
ids = [f"doc_{i}" for i in range(len(docs))]

print("Generating embeddings...")
embeddings = embedder.embed(docs)

print("Indexing documents...")
store.add_documents(ids, docs, embeddings)
print("Done.")

## 2. Compare Search Modes
### Case A: Semantic Query
Query: "financial success"
Expected: Both modes should rank the Q3 profit and revenue documents highest, even though neither contains the word "success".
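
The vector results below print a `Dist` value for each hit. The metric used by `VectorStore` is not shown in this notebook; assuming it is a distance such as cosine distance, smaller values mean a closer match. The sketch below illustrates that assumption with made-up three-dimensional vectors (real embeddings have far more dimensions); `cosine_distance` is a hypothetical helper, not part of the library.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for the same direction, larger when less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
toy_query = [1.0, 0.5, 0.0]
close_doc = [0.9, 0.6, 0.1]  # points in nearly the same direction as the query
far_doc = [0.0, 0.1, 1.0]    # points in a very different direction

print(f"close: {cosine_distance(toy_query, close_doc):.4f}")
print(f"far:   {cosine_distance(toy_query, far_doc):.4f}")
```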

In [None]:
query = "financial success"
query_vec = embedder.embed([query])[0]

print(f"\n--- Query: '{query}' ---")

print("\n[Vector Search Results]")
v_res = store.query(query_vec, n_results=3)
for i, doc in enumerate(v_res['documents'][0]):
    print(f"{i+1}. {doc} (Dist: {v_res['distances'][0][i]:.4f})")

print("\n[Hybrid Search Results (RRF)]")
h_res = store.hybrid_search(query, query_vec, n_results=3)
for i, item in enumerate(h_res):
    print(f"{i+1}. {item['text']} (Score: {item['score']:.4f})")

### Case B: Specific Keyword
Query: "PostgreSQL features"
Expected: Hybrid search should strongly favor the two documents containing the exact term 'PostgreSQL'.
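
To see why an exact term matters to the keyword side, here is a minimal sketch of Okapi BM25 scoring. It is a textbook formulation, not the library's own code: the tokenizer is a naive regex split, and `k1=1.5` and `b=0.75` are commonly used defaults chosen here as assumptions. The three documents are copied from the sample set above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


sample_docs = [
    "The engineering team migrated the database to PostgreSQL for better performance.",
    "The office party is scheduled for Friday at 5 PM in the main hall.",
    "PostgreSQL is a powerful, open source object-relational database system.",
]

for doc, score in zip(sample_docs, bm25_scores("PostgreSQL features", sample_docs)):
    print(f"{score:.4f}  {doc}")
```

Only the documents that literally contain "PostgreSQL" receive a non-zero score; the office party document shares no query term and scores zero. A vector search has no such hard cutoff, which is the contrast this case is meant to expose.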

In [None]:
query = "PostgreSQL features"
query_vec = embedder.embed([query])[0]

print(f"\n--- Query: '{query}' ---")

print("\n[Vector Search Results]")
v_res = store.query(query_vec, n_results=3)
for i, doc in enumerate(v_res['documents'][0]):
    print(f"{i+1}. {doc} (Dist: {v_res['distances'][0][i]:.4f})")

print("\n[Hybrid Search Results (RRF)]")
h_res = store.hybrid_search(query, query_vec, n_results=3)
for i, item in enumerate(h_res):
    print(f"{i+1}. {item['text']} (Score: {item['score']:.4f})")

### Case C: Mixed Query
Query: "Friday party time"
Expected: Should match the office party document.

In [None]:
query = "Friday party time"
query_vec = embedder.embed([query])[0]

print(f"\n--- Query: '{query}' ---")

print("\n[Hybrid Search Results (RRF)]")
h_res = store.hybrid_search(query, query_vec, n_results=3)
for i, item in enumerate(h_res):
    print(f"{i+1}. {item['text']} (Score: {item['score']:.4f})")