# Advanced Hybrid Search with OgbujiPT

Interactive exploration of **hybrid search** combining:
- **Dense vector search** (semantic similarity via embeddings)
- **Sparse BM25 search** (keyword-based retrieval)
- **Reciprocal Rank Fusion** (RRF) to merge results

## Prerequisites & Running

1. PostgreSQL with pgvector running (see README.md for Docker setup)
2. Required packages:
```bash
uv pip install "ogbujipt>=0.10.0"
uv pip install jupyter
```

Run the notebook:

```bash
jupyter notebook hybrid_search.ipynb
```

## Why Hybrid Search?

Different search methods excel at different tasks:

| Method | Strengths | Weaknesses |
|--------|-----------|------------|
| **Dense vectors** | Semantic meaning, synonyms, concepts | Misses exact terminology, names |
| **Sparse BM25** | Exact keywords, names, terminology | Misses semantic similarity |
| **Hybrid (RRF)** | Combines semantic and keyword matching | Runs both searches and adds a fusion step |

**Example**: Searching for "ML algorithms"
- Dense finds: "machine learning techniques", "neural networks"
- Sparse finds: "ML", "algorithms", "random forest algorithm"
- Hybrid: Ranks results that match both semantic and keyword criteria highest

## Setup and Imports

In [None]:
import asyncio
import os
from typing import AsyncIterator
from sentence_transformers import SentenceTransformer

from ogbujipt.store.postgres import DataDB
from ogbujipt.retrieval import BM25Search, HybridSearch, SimpleDenseSearch
from ogbujipt.memory.base import SearchResult

# Database connection parameters (adjust if needed)
PG_DB_NAME = os.environ.get('PG_DB_NAME', 'hybrid_demo')
PG_DB_HOST = os.environ.get('PG_DB_HOST', 'localhost')
PG_DB_PORT = int(os.environ.get('PG_DB_PORT', '5432'))
PG_DB_USER = os.environ.get('PG_DB_USER', 'demo_user')
PG_DB_PASSWORD = os.environ.get('PG_DB_PASSWORD', 'demo_pass_2025')

# Load embedding model (this may take a minute on first run)
print('Loading embedding model...')
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print('✓ Model loaded!')

## Sample Knowledge Base

Let's create a small knowledge base about machine learning and programming topics.

In [None]:
# Sample documents covering various ML/programming topics
knowledge_base = [
    {'content': 'Machine learning (ML) is a subset of artificial intelligence that enables systems to learn from data without explicit programming.', 'metadata': {'topic': 'ML basics', 'difficulty': 'beginner'}},
    {'content': 'Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes.', 'metadata': {'topic': 'neural networks', 'difficulty': 'intermediate'}},
    {'content': 'Random forest is an ensemble learning algorithm that constructs multiple decision trees during training.', 'metadata': {'topic': 'algorithms', 'difficulty': 'intermediate'}},
    {'content': 'Python is a high-level programming language widely used for ML development due to libraries like scikit-learn, TensorFlow, and PyTorch.', 'metadata': {'topic': 'programming', 'difficulty': 'beginner'}},
    {'content': 'Gradient descent is an optimization algorithm used to minimize loss functions in ML by iteratively moving toward the minimum.', 'metadata': {'topic': 'optimization', 'difficulty': 'intermediate'}},
    {'content': 'Supervised learning uses labeled training data to learn mappings from inputs to outputs. Examples include classification and regression.', 'metadata': {'topic': 'ML basics', 'difficulty': 'beginner'}},
    {'content': 'Convolutional Neural Networks (CNNs) are specialized for processing grid-like data such as images. They use convolutional layers.', 'metadata': {'topic': 'deep learning', 'difficulty': 'advanced'}},
    {'content': 'K-means clustering is an unsupervised learning algorithm that partitions data into K clusters based on feature similarity.', 'metadata': {'topic': 'clustering', 'difficulty': 'beginner'}},
    {'content': 'Transfer learning leverages pre-trained models on new tasks, reducing training time and data requirements significantly.', 'metadata': {'topic': 'deep learning', 'difficulty': 'advanced'}},
    {'content': 'The backpropagation algorithm computes gradients of the loss function with respect to network weights using the chain rule.', 'metadata': {'topic': 'neural networks', 'difficulty': 'advanced'}},
    {'content': 'Support Vector Machines (SVMs) find optimal hyperplanes that maximize the margin between different classes in the feature space.', 'metadata': {'topic': 'algorithms', 'difficulty': 'intermediate'}},
    {'content': 'Overfitting occurs when a model learns training data too well, including noise, resulting in poor generalization to new data.', 'metadata': {'topic': 'ML basics', 'difficulty': 'beginner'}},
]

print(f'Knowledge base contains {len(knowledge_base)} documents')

## Initialize Database Connection

In [None]:
# Connect to PostgreSQL and create table
kb_db = await DataDB.from_conn_params(
    embedding_model=embedding_model,
    table_name='ml_knowledge',
    db_name=PG_DB_NAME,
    host=PG_DB_HOST,
    port=PG_DB_PORT,
    user=PG_DB_USER,
    password=PG_DB_PASSWORD,
    itypes=['vector'],  # Create HNSW index for fast vector search
    ifuncs=['cosine']
)

print('✓ Connected to PostgreSQL')

## Create Table and Insert Documents

In [None]:
# Drop existing table if present (for clean demo)
if await kb_db.table_exists():
    await kb_db.drop_table()
    print('✓ Dropped existing table')

# Create fresh table
await kb_db.create_table()
print('✓ Created table: ml_knowledge')

# Insert all documents
await kb_db.insert_many([
    (doc['content'], doc['metadata']) 
    for doc in knowledge_base
])

doc_count = await kb_db.count_items()
print(f'✓ Inserted {doc_count} documents')

## Test 1: Dense Vector Search Only

First, let's try traditional dense vector search (semantic similarity).

In [None]:
query = 'What are ML algorithms?'

print(f'Query: "{query}"\n')
print('=' * 60)
print('DENSE VECTOR SEARCH (semantic similarity)')
print('=' * 60)

dense_results = []
async for result in kb_db.search(query=query, limit=5):
    dense_results.append(result)

for i, result in enumerate(dense_results, 1):
    print(f'\n{i}. Score: {result.score:.3f}')
    print(f'   {result.content}')
    print(f'   [Topic: {result.metadata.get("topic", "unknown")}]')
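
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity on toy vectors. The `ifuncs=['cosine']` setting used when connecting suggests cosine is the comparison function, but how OgbujiPT derives the reported `score` is not shown in this notebook, so treat this as an illustration of the general idea only. The function name and the three-dimensional vectors are made up for the example; real embeddings from `all-MiniLM-L6-v2` have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical numbers, not model output)
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query, different length
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

Vectors that point the same way score close to 1 regardless of their length, while orthogonal vectors score 0. This is why dense search can match text that shares meaning but no keywords.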

## Test 2: Sparse BM25 Search Only

Now let's try BM25 sparse retrieval (keyword-based).

In [None]:
print(f'Query: "{query}"\n')
print('=' * 60)
print('SPARSE BM25 SEARCH (keyword-based)')
print('=' * 60)

# Initialize BM25 search
bm25 = BM25Search(
    k1=1.5,      # Term frequency saturation
    b=0.75,      # Document length normalization
    epsilon=0.25 # IDF floor
)

# Execute search
sparse_results = []
async for result in bm25.execute(query=query, backends=[kb_db], limit=5):
    sparse_results.append(result)

for i, result in enumerate(sparse_results, 1):
    print(f'\n{i}. Score: {result.score:.3f}')
    print(f'   {result.content}')
    print(f'   [Topic: {result.metadata.get("topic", "unknown")}]')

## Test 3: Hybrid Search (Dense + Sparse with RRF)

Now let's combine both approaches using Reciprocal Rank Fusion!

In [None]:
print(f'Query: "{query}"\n')
print('=' * 60)
print('HYBRID SEARCH (Dense + Sparse with RRF)')
print('=' * 60)

# Initialize hybrid search with both strategies
hybrid = HybridSearch(
    strategies=[
        SimpleDenseSearch(),  # Dense vector search
        BM25Search()          # Sparse BM25 search
    ],
    k=60  # RRF constant
)

# Execute hybrid search
hybrid_results = []
async for result in hybrid.execute(query=query, backends=[kb_db], limit=5):
    hybrid_results.append(result)

for i, result in enumerate(hybrid_results, 1):
    print(f'\n{i}. Score: {result.score:.3f}')
    print(f'   {result.content}')
    print(f'   [Topic: {result.metadata.get("topic", "unknown")}]')
    print(f'   [Sources: {result.source}]')

## Comparison: Query with Exact Terminology

Let's try a query where exact keywords matter ("CNN" for Convolutional Neural Network).

In [None]:
query_terminology = 'Tell me about CNNs and image processing'

print(f'Query: "{query_terminology}"\n')
print('This query tests how well each method handles abbreviations (CNN)\n')

# Dense search
print('\n' + '='*60)
print('DENSE (might miss "CNN" abbreviation)')
print('='*60)
dense_results = []
async for r in kb_db.search(query=query_terminology, limit=3):
    dense_results.append(r)
for i, r in enumerate(dense_results, 1):
    print(f'{i}. [{r.score:.3f}] {r.content[:80]}...')

# Sparse search
print('\n' + '='*60)
print('SPARSE (catches "CNN" keyword)')
print('='*60)
sparse_results = []
async for r in bm25.execute(query=query_terminology, backends=[kb_db], limit=3):
    sparse_results.append(r)
for i, r in enumerate(sparse_results, 1):
    print(f'{i}. [{r.score:.3f}] {r.content[:80]}...')

# Hybrid search
print('\n' + '='*60)
print('HYBRID (best of both!)')
print('='*60)
hybrid_results = []
async for r in hybrid.execute(query=query_terminology, backends=[kb_db], limit=3):
    hybrid_results.append(r)
for i, r in enumerate(hybrid_results, 1):
    print(f'{i}. [{r.score:.3f}] {r.content[:80]}...')

## Experiment: Tuning BM25 Parameters

BM25 has parameters you can tune. Let's experiment with different values.

In [None]:
query_test = 'optimization algorithm gradient'

print(f'Query: "{query_test}"\n')
print('Testing different k1 values (term frequency saturation):\n')

# Test different k1 values
for k1_val in [1.2, 1.5, 2.0]:
    print(f'\n{"="*60}')
    print(f'BM25 with k1={k1_val}, b=0.75')
    print('='*60)
    
    bm25_tuned = BM25Search(k1=k1_val, b=0.75)
    
    results = []
    async for r in bm25_tuned.execute(query=query_test, backends=[kb_db], limit=3):
        results.append(r)
    
    for i, r in enumerate(results, 1):
        print(f'{i}. [{r.score:.3f}] {r.content[:70]}...')

print('\nüí° Tip: Higher k1 = more weight on term frequency')

## Understanding RRF Scores

Hybrid search includes metadata showing how results were ranked by each strategy.

In [None]:
query_rrf = 'neural network training'

print(f'Query: "{query_rrf}"\n')
print('='*60)
print('RRF Ranking Details')
print('='*60)

hybrid_detailed = []
async for r in hybrid.execute(query=query_rrf, backends=[kb_db], limit=3):
    hybrid_detailed.append(r)

for i, r in enumerate(hybrid_detailed, 1):
    print(f'\n{i}. Final RRF Score: {r.score:.3f}')
    print(f'   {r.content[:100]}...')
    
    # Show individual strategy ranks
    if 'rrf_ranks' in r.metadata:
        print('   Individual strategy rankings:')
        for strategy, rank, score in r.metadata['rrf_ranks']:
            print(f'     • {strategy}: rank #{rank}, score {score:.3f}')

print('\nüí° RRF formula: score = sum(1 / (k + rank)) for each strategy')

## Try Your Own Query!

Modify the query below and see how different methods perform.

In [None]:
# ✏️ Edit this query to test different searches!
my_query = 'supervised classification'

print(f'Your query: "{my_query}"\n')

# Run all three methods
print('DENSE:')
count = 0
async for r in kb_db.search(query=my_query, limit=2):
    count += 1
    print(f'  {count}. [{r.score:.3f}] {r.content[:60]}...')

print('\nSPARSE:')
count = 0
async for r in bm25.execute(query=my_query, backends=[kb_db], limit=2):
    count += 1
    print(f'  {count}. [{r.score:.3f}] {r.content[:60]}...')

print('\nHYBRID:')
count = 0
async for r in hybrid.execute(query=my_query, backends=[kb_db], limit=2):
    count += 1
    print(f'  {count}. [{r.score:.3f}] {r.content[:60]}...')

## Cleanup

In [None]:
# Drop the demo table
await kb_db.drop_table()
print('Dropped demo table')
print('\nDemo complete!')

## Key Takeaways

1. **Dense search** excels at semantic similarity but can miss exact terminology
2. **Sparse BM25** excels at keyword matching but misses semantic relationships
3. **Hybrid RRF** combines both, typically outperforming either alone
4. **Tune BM25** parameters (k1, b) based on your corpus and query patterns
5. **Inspect RRF metadata** to understand how results are ranked

## Next Steps

- Try the `chat_with_hybrid_kb.py` demo for a full conversational AI application
- Experiment with your own documents and queries
- Explore `SparseDB` for storing sparse vectors directly
- Combine with reranking models for even better results

## Resources

- [BM25 Algorithm](https://en.wikipedia.org/wiki/Okapi_BM25)
- [Reciprocal Rank Fusion Paper](http://www.cs.uwaterloo.ca/~jimmylin/publications/Cormack_etal_SIGIR2009.pdf)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
