# Hybrid Search with Elasticsearch

## Overview
This notebook demonstrates three different search approaches using Elasticsearch:
1. **Keyword Search (BM25)** - Traditional full-text search that ranks documents by term frequency, inverse document frequency, and document length
2. **Semantic Search (Vector)** - Searches based on meaning using embeddings
3. **Hybrid Search** - Combines both approaches for more robust results across query types

We'll use the same travel company FAQ dataset as the vector search notebook and show how each approach performs on different kinds of queries.

In [None]:
import pandas as pd
import openai
import os
from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import numpy as np
from typing import List, Dict, Any
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

In [9]:
# Navigate to project root
os.chdir("..")
load_dotenv()

# Setup OpenAI
openai.api_key = OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = openai.Client()

## 1. Connect to Elasticsearch

Make sure you have Elasticsearch running:
```bash
docker ps | grep elasticsearch
```

If not running, start it:
```bash
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.0
```

In [10]:
# Connect to Elasticsearch
# For ES 8.x with security disabled, we need to explicitly configure the client
es = Elasticsearch(
    ['http://localhost:9200'],
    verify_certs=False,
    ssl_show_warn=False,
    request_timeout=30
)

# Verify connection
try:
    if es.ping():
        print("✓ Connected to Elasticsearch")
        info = es.info()
        print(f"Cluster: {info['cluster_name']}")
        print(f"Version: {info['version']['number']}")
    else:
        print("✗ Failed to connect to Elasticsearch")
except Exception as e:
    print(f"✗ Connection error: {e}")
    print("\nTroubleshooting:")
    print("1. Check if Elasticsearch is running: docker ps | grep elasticsearch")
    print("2. Test with curl: curl http://localhost:9200")
    print("3. Make sure you have elasticsearch python package installed: pip install elasticsearch")

✗ Failed to connect to Elasticsearch
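
A freshly started container can take a little while before it accepts connections, so a single `ping()` right after `docker run` may fail as in the output above. The sketch below retries a ping a few times before giving up. `wait_until_ready` is a hypothetical helper (not part of the Elasticsearch client), and the attempt count and delay are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_ready(ping: Callable[[], bool], attempts: int = 10, delay_s: float = 3.0) -> bool:
    """Call `ping` until it returns True or the attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            if ping():
                return True
        except Exception:
            # Treat any exception as "not ready yet" and try again
            pass
        if attempt < attempts:
            time.sleep(delay_s)
    return False
```

In this notebook the call would be `wait_until_ready(es.ping)`, run before the connection check.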


## 2. Load Travel FAQ Data

In [None]:
# Load the same dataset used in the vector search notebook
df = pd.read_json("data/travel_company_faq.json")

print(f"Loaded {len(df)} FAQ items")
print(f"Categories: {df['category'].unique().tolist()}")
print(f"\nSample question: {df.iloc[0]['question']}")
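
The later cells read the `question`, `answer`, and `category` columns, so it can help to confirm they are present and non-empty before spending API calls on embeddings. This is a minimal sketch that works on plain records; `check_faq_records` is a hypothetical helper, and the message format is an assumption.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("question", "answer", "category")


def _is_blank(value: Any) -> bool:
    """True for None, NaN, or whitespace-only values."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip() == ""


def check_faq_records(records: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    problems = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"row {i}: missing '{field}'")
            elif _is_blank(record[field]):
                problems.append(f"row {i}: empty '{field}'")
    return problems
```

With the DataFrame loaded above, this could be called as `check_faq_records(df.to_dict("records"))`.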

## 3. Create Elasticsearch Index with Hybrid Search Support

We'll create an index that supports both:
- **Text fields** for keyword search (BM25)
- **Dense vector fields** for semantic search

The mapping defines how each field should be indexed and searched.

In [None]:
INDEX_NAME = "travel_faq_hybrid"

# Define the index mapping
mapping = {
    "mappings": {
        "properties": {
            "question": {
                "type": "text",
                "analyzer": "english"
            },
            "answer": {
                "type": "text",
                "analyzer": "english"
            },
            "combined_text": {
                "type": "text",
                "analyzer": "english"
            },
            "category": {
                "type": "keyword"
            },
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,  # OpenAI ada-002 dimension
                "index": True,
                "similarity": "cosine"
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

# Delete index if it exists
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
    print(f"Deleted existing index: {INDEX_NAME}")

# Create the index
es.indices.create(index=INDEX_NAME, body=mapping)
print(f"✓ Created index: {INDEX_NAME}")
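
To see what the `english` analyzer in the mapping does to a piece of text (lowercasing, stop-word removal, stemming), the client's `indices.analyze` call can be wrapped in a small helper. `english_tokens` is a hypothetical name, and the sketch assumes the index above already exists.

```python
from typing import Any, List


def english_tokens(client: Any, text: str, index: str = "travel_faq_hybrid") -> List[str]:
    """Return the tokens the `english` analyzer produces for `text` (hypothetical helper)."""
    response = client.indices.analyze(index=index, analyzer="english", text=text)
    return [token["token"] for token in response["tokens"]]
```

For example, `english_tokens(es, "lost luggage compensation")` shows the stemmed terms that BM25 will actually match against.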

## 4. Generate Embeddings and Ingest Data

We'll use OpenAI's `text-embedding-ada-002` model to generate embeddings for each FAQ item.

In [None]:
def get_embedding(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Get embedding from OpenAI API for a single text"""
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

def get_embeddings_batch(texts: List[str], model: str = "text-embedding-ada-002", batch_size: int = 100) -> List[List[float]]:
    """
    Get embeddings for multiple texts in batches.
    OpenAI allows up to 2048 texts per request, but we use smaller batches for stability.
    """
    all_embeddings = []
    
    # Process in batches with progress bar
    for i in tqdm(range(0, len(texts), batch_size), desc="Getting embeddings", unit="batch"):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    
    return all_embeddings

def prepare_documents_for_indexing(df: pd.DataFrame) -> List[Dict[str, Any]]:
    """
    Prepare documents for bulk indexing.
    Each document includes text fields and embeddings.
    Uses batch processing for faster embedding generation.
    """
    print(f"Preparing {len(df)} documents for indexing...")
    
    # Combine all texts first
    combined_texts = []
    for idx, row in df.iterrows():
        combined_text = f"Question: {row['question']}\nAnswer: {row['answer']}"
        combined_texts.append(combined_text)
    
    # Get all embeddings in batches (much faster!)
    embeddings = get_embeddings_batch(combined_texts)
    
    # Build documents with progress bar
    documents = []
    for idx, (row, combined_text, embedding) in enumerate(tqdm(
        zip(df.iterrows(), combined_texts, embeddings),
        total=len(df),
        desc="Building documents",
        unit="doc"
    )):
        _, row = row  # iterrows returns (index, row)
        
        doc = {
            "_index": INDEX_NAME,
            "_id": f"faq_{idx}",
            "_source": {
                "question": row["question"],
                "answer": row["answer"],
                "combined_text": combined_text,
                "category": row["category"],
                "embedding": embedding
            }
        }
        documents.append(doc)
    
    print(f"✓ Prepared {len(documents)} documents")
    return documents
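
Because the mapping fixes `dims` at 1536, a vector of any other length will be rejected at index time. A quick check before bulk indexing can surface that early. This is a sketch only; `check_embeddings` is a hypothetical helper and is not called by the functions above.

```python
from typing import List, Sequence


def check_embeddings(
    embeddings: Sequence[Sequence[float]],
    expected_count: int,
    dims: int = 1536,  # must match "dims" in the index mapping
) -> List[str]:
    """Return a list of problems; an empty list means counts and dimensions line up."""
    problems = []
    if len(embeddings) != expected_count:
        problems.append(f"expected {expected_count} embeddings, got {len(embeddings)}")
    for i, vector in enumerate(embeddings):
        if len(vector) != dims:
            problems.append(f"embedding {i} has {len(vector)} dims, expected {dims}")
    return problems
```

Inside `prepare_documents_for_indexing`, it could be called as `check_embeddings(embeddings, len(combined_texts))` right after the batch call.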

In [None]:
# Prepare and index documents
print("Generating embeddings and preparing documents...")
documents = prepare_documents_for_indexing(df)

# Bulk index with progress
print("\nIndexing documents to Elasticsearch...")
success, errors = bulk(es, documents, raise_on_error=False)
print(f"✓ Successfully indexed {success} documents")
if errors:
    # With raise_on_error=False, bulk() returns the failed items as a list
    print(f"✗ Failed to index {len(errors)} documents")

# Refresh index to make documents searchable
es.indices.refresh(index=INDEX_NAME)
print("✓ Index refreshed and ready for search")
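
When `bulk` is called with `raise_on_error=False`, its second return value is a list of failed items rather than a count. The sketch below groups those items by error type. `summarize_bulk_errors` is a hypothetical helper, and the item layout it expects (action name mapping to a dict with an `error` entry) is an assumption based on the bulk API's response format.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_bulk_errors(errors: List[Dict[str, Any]]) -> Counter:
    """Count failed bulk items by error type (hypothetical helper)."""
    counts: Counter = Counter()
    for item in errors:
        # Each item is keyed by its action, e.g. {"index": {...details...}}
        for _action, details in item.items():
            error = details.get("error", {})
            if isinstance(error, dict):
                counts[error.get("type", "unknown")] += 1
            else:
                counts[str(error)] += 1
    return counts
```

Passing the second value returned by `bulk(...)` to this function gives a quick view of whether failures share one cause, such as a mapping mismatch.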

In [None]:
# Verify indexing
count = es.count(index=INDEX_NAME)['count']
print(f"Total documents in index: {count}")

## 5. Implement Three Search Approaches

### 5.1 Keyword Search (BM25)
Traditional full-text search using the BM25 algorithm. Good for:
- Exact keyword matches
- Technical terms
- Specific phrases
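
To make the scoring concrete, here is a small pure-Python sketch of the textbook BM25 formula. It is an illustration only: it splits on whitespace instead of using the `english` analyzer, so its numbers will not match Elasticsearch's `_score`. The defaults `k1=1.2` and `b=0.75` mirror Elasticsearch's defaults, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with the textbook BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = set(query.lower().split())

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0 or term_freq[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            # Term frequency saturates, and longer documents are penalized via b
            freq = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / (freq + length_norm)
        scores.append(score)
    return scores
```

A document containing both query terms outscores one containing only the more common term, and a document with neither term scores zero.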

In [None]:
def keyword_search(query: str, top_k: int = 5) -> List[Dict[str, Any]]:
    """
    Keyword search using BM25.
    Searches across question, answer, and combined_text fields.
    """
    search_query = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["question^2", "answer", "combined_text"],  # Boost question field
                "type": "best_fields"
            }
        },
        "size": top_k
    }
    
    response = es.search(index=INDEX_NAME, body=search_query)
    
    results = []
    for hit in response['hits']['hits']:
        results.append({
            "question": hit["_source"]["question"],
            "answer": hit["_source"]["answer"],
            "category": hit["_source"]["category"],
            "score": hit["_score"],
            "search_type": "keyword"
        })
    
    return results

### 5.2 Semantic Search (Vector Similarity)
Searches based on meaning using embeddings. Good for:
- Conceptual similarity
- Paraphrased queries
- Finding related content even with different words
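
The index mapping uses `"similarity": "cosine"`, so the ranking rests on the cosine of the angle between the query vector and each document vector. The sketch below computes that value and the `_score` Elasticsearch derives from it for cosine similarity, `(1 + cosine) / 2`. Both function names are hypothetical, and the vectors in any example are toy values rather than real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def es_cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine from [-1, 1] to [0, 1], as Elasticsearch does for kNN `_score`."""
    return (1 + cosine_similarity(a, b)) / 2
```

This explains why semantic scores in the comparisons stay between 0 and 1 while BM25 scores can be much larger.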

In [None]:
def semantic_search(query: str, top_k: int = 5) -> List[Dict[str, Any]]:
    """
    Semantic search using vector similarity.
    Converts query to embedding and finds most similar documents.
    """
    # Generate query embedding
    query_embedding = get_embedding(query)
    
    search_query = {
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": 100
        },
        "_source": ["question", "answer", "category"],
        "size": top_k
    }
    
    response = es.search(index=INDEX_NAME, **search_query)
    
    results = []
    for hit in response['hits']['hits']:
        results.append({
            "question": hit["_source"]["question"],
            "answer": hit["_source"]["answer"],
            "category": hit["_source"]["category"],
            "score": hit["_score"],
            "search_type": "semantic"
        })
    
    return results

### 5.3 Hybrid Search (Combined)
Combines keyword and semantic search with adjustable weights, aiming for the best of both worlds:
- Balances exact matches with conceptual similarity
- More robust across different query types
- Can be tuned for specific use cases

In [None]:
def hybrid_search(query: str, top_k: int = 5, keyword_weight: float = 0.5, semantic_weight: float = 0.5) -> List[Dict[str, Any]]:
    """
    Hybrid search combining keyword (BM25) and semantic (vector) search.
    
    Args:
        query: Search query
        top_k: Number of results to return
        keyword_weight: Weight for keyword search (0-1)
        semantic_weight: Weight for semantic search (0-1)
    """
    # Generate query embedding for semantic search
    query_embedding = get_embedding(query)
    
    # Elasticsearch hybrid query using bool should with knn
    search_query = {
        "query": {
            "bool": {
                "should": [
                    # Keyword search component
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["question^2", "answer", "combined_text"],
                            "type": "best_fields",
                            "boost": keyword_weight
                        }
                    }
                ]
            }
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": 100,
            "boost": semantic_weight
        },
        "size": top_k,
        "_source": ["question", "answer", "category"]
    }
    
    response = es.search(index=INDEX_NAME, body=search_query)
    
    results = []
    for hit in response['hits']['hits']:
        results.append({
            "question": hit["_source"]["question"],
            "answer": hit["_source"]["answer"],
            "category": hit["_source"]["category"],
            "score": hit["_score"],
            "search_type": "hybrid"
        })
    
    return results
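
`hybrid_search` relies on Elasticsearch to add the boosted BM25 and kNN scores. A common alternative that ignores raw score scales is Reciprocal Rank Fusion (RRF), which merges ranked lists by position only. The sketch below is a client-side version and is not what `hybrid_search` does. `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is a conventional default, and using the question text as the document key is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_key in enumerate(ranking, start=1):
            scores[doc_key] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by key for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With the functions above, the inputs could be built as `[r["question"] for r in keyword_search(q)]` and `[r["question"] for r in semantic_search(q)]`.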

## 6. Comparison Framework

Let's create functions to compare all three search approaches side-by-side.

In [None]:
def compare_search_approaches(query: str, top_k: int = 3):
    """
    Compare all three search approaches for a given query.
    """
    print("=" * 100)
    print(f"Query: '{query}'")
    print("=" * 100)
    
    # Run all three searches
    try:
        keyword_results = keyword_search(query, top_k)
        semantic_results = semantic_search(query, top_k)
        hybrid_results = hybrid_search(query, top_k)
    except Exception as e:
        print(f"\n✗ Search error: {e}")
        print(f"Error type: {type(e).__name__}")
        return
    
    # Check if we got results
    if not keyword_results or not semantic_results or not hybrid_results:
        print("\n✗ One or more search methods returned no results")
        print(f"Keyword: {len(keyword_results)} results")
        print(f"Semantic: {len(semantic_results)} results")
        print(f"Hybrid: {len(hybrid_results)} results")
        return
    
    # Get the minimum number of results across all methods
    actual_k = min(len(keyword_results), len(semantic_results), len(hybrid_results))
    
    if actual_k < top_k:
        print(f"\n⚠️  Requested {top_k} results, but only {actual_k} available across all methods")
    
    # Display results side by side
    for i in range(actual_k):
        print(f"\n{'─' * 100}")
        print(f"RANK #{i+1}")
        print(f"{'─' * 100}")
        
        # Keyword results
        print(f"\n🔍 KEYWORD SEARCH (BM25) - Score: {keyword_results[i]['score']:.4f}")
        print(f"   Category: {keyword_results[i]['category']}")
        print(f"   Q: {keyword_results[i]['question']}")
        print(f"   A: {keyword_results[i]['answer'][:150]}...")
        
        # Semantic results
        print(f"\n🧠 SEMANTIC SEARCH (Vector) - Score: {semantic_results[i]['score']:.4f}")
        print(f"   Category: {semantic_results[i]['category']}")
        print(f"   Q: {semantic_results[i]['question']}")
        print(f"   A: {semantic_results[i]['answer'][:150]}...")
        
        # Hybrid results
        print(f"\n⚡ HYBRID SEARCH (Combined) - Score: {hybrid_results[i]['score']:.4f}")
        print(f"   Category: {hybrid_results[i]['category']}")
        print(f"   Q: {hybrid_results[i]['question']}")
        print(f"   A: {hybrid_results[i]['answer'][:150]}...")
    
    print(f"\n{'=' * 100}\n")

In [None]:
def create_comparison_table(query: str, top_k: int = 5) -> pd.DataFrame:
    """
    Create a comparison table showing which documents each search method retrieves.
    """
    try:
        keyword_results = keyword_search(query, top_k)
        semantic_results = semantic_search(query, top_k)
        hybrid_results = hybrid_search(query, top_k)
    except Exception as e:
        print(f"Search error: {e}")
        return pd.DataFrame()
    
    # Get the minimum number of results
    actual_k = min(len(keyword_results), len(semantic_results), len(hybrid_results), top_k)
    
    if actual_k == 0:
        print("No results returned from searches")
        return pd.DataFrame()
    
    # Create comparison dataframe
    comparison_data = []
    
    for i in range(actual_k):
        comparison_data.append({
            "Rank": i + 1,
            "Keyword_Question": keyword_results[i]['question'][:60] + "...",
            "Keyword_Score": f"{keyword_results[i]['score']:.3f}",
            "Semantic_Question": semantic_results[i]['question'][:60] + "...",
            "Semantic_Score": f"{semantic_results[i]['score']:.3f}",
            "Hybrid_Question": hybrid_results[i]['question'][:60] + "...",
            "Hybrid_Score": f"{hybrid_results[i]['score']:.3f}"
        })
    
    df_comparison = pd.DataFrame(comparison_data)
    return df_comparison
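
The table shows results side by side, but a single number for how much two methods agree can also be useful. The sketch below computes the Jaccard overlap between two result lists. `result_overlap` is a hypothetical helper, and identifying documents by their question text is an assumption.

```python
from typing import List


def result_overlap(first: List[str], second: List[str]) -> float:
    """Jaccard overlap of two result lists: shared items divided by all distinct items."""
    first_set, second_set = set(first), set(second)
    union = first_set | second_set
    if not union:
        return 0.0
    return len(first_set & second_set) / len(union)
```

A value near 1.0 means the two methods retrieve almost the same documents for that query; a value near 0.0 means they disagree, which is where hybrid search has the most room to help.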

## 7. Test Queries - Showcasing Differences

Let's test with queries that highlight the strengths of each approach.

### Test 1: Exact Keyword Match
Query with specific keywords that appear in documents.
**Expected:** Keyword search should perform best.

In [None]:
compare_search_approaches("baggage allowance", top_k=3)

In [None]:
# Test each search method individually to diagnose issues
test_query = "lost luggage compensation"

print("Testing individual search methods:\n")

try:
    print("1. Keyword search...")
    kw_results = keyword_search(test_query, top_k=3)
    print(f"   ✓ Returned {len(kw_results)} results")
except Exception as e:
    print(f"   ✗ Error: {e}")

try:
    print("\n2. Semantic search...")
    sem_results = semantic_search(test_query, top_k=3)
    print(f"   ✓ Returned {len(sem_results)} results")
except Exception as e:
    print(f"   ✗ Error: {e}")

try:
    print("\n3. Hybrid search...")
    hyb_results = hybrid_search(test_query, top_k=3)
    print(f"   ✓ Returned {len(hyb_results)} results")
except Exception as e:
    print(f"   ✗ Error: {e}")

### Test 2: Semantic Query (Different Words, Same Meaning)
Query using different terminology than what's in the documents.
**Expected:** Semantic search should perform best.

In [None]:
compare_search_approaches("What happens if I need medical help while traveling?", top_k=3)

### Test 3: Mixed Query
Query that benefits from both keyword matching and semantic understanding.
**Expected:** Hybrid search should perform best.

In [None]:
compare_search_approaches("lost luggage compensation", top_k=3)

### Test 4: Specific Term
Technical or specific term that should match exactly.
**Expected:** Keyword search advantage.

In [None]:
compare_search_approaches("vegetarian food options", top_k=3)

### Test 5: Conceptual Query
Asking about a concept without using exact terminology.
**Expected:** Semantic search advantage.

In [None]:
compare_search_approaches("getting money back for cancelled trip", top_k=3)

### Test 6: Natural Language Question
Full natural language question.
**Expected:** Hybrid search should balance both approaches.

In [None]:
compare_search_approaches("Can I change my booking dates after I've already confirmed?", top_k=3)

## 8. Comparison Tables

Let's create comparison tables to see which documents each method retrieves.

In [None]:
# Example 1: Keyword-friendly query
print("Query: 'baggage allowance'\n")
df_comp1 = create_comparison_table("baggage allowance", top_k=5)
display(df_comp1)

In [None]:
# Example 2: Semantic-friendly query
print("Query: 'What happens if I need medical help while traveling?'\n")
df_comp2 = create_comparison_table("What happens if I need medical help while traveling?", top_k=5)
display(df_comp2)

## 9. Tuning Hybrid Search Weights

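The hybrid search can be tuned by adjusting the weights between the keyword and semantic components. Each weight is applied as a `boost`, and Elasticsearch adds the boosted BM25 score to the boosted kNN score. BM25 scores are unbounded while cosine kNN scores fall between 0 and 1, so equal weights do not guarantee equal influence.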

In [None]:
def compare_hybrid_weights(query: str, top_k: int = 3):
    """
    Compare hybrid search with different weight configurations.
    """
    print(f"Query: '{query}'\n")
    
    weight_configs = [
        (0.8, 0.2, "Keyword-heavy"),
        (0.5, 0.5, "Balanced"),
        (0.2, 0.8, "Semantic-heavy")
    ]
    
    for kw_weight, sem_weight, label in weight_configs:
        print(f"\n{'─' * 80}")
        print(f"{label} (keyword={kw_weight}, semantic={sem_weight})")
        print(f"{'─' * 80}")
        
        results = hybrid_search(query, top_k, keyword_weight=kw_weight, semantic_weight=sem_weight)
        
        for i, result in enumerate(results):
            print(f"\n{i+1}. [Score: {result['score']:.4f}] {result['question']}")
            print(f"   Category: {result['category']}")
    
    print(f"\n{'=' * 80}\n")

In [None]:
compare_hybrid_weights("lost luggage compensation", top_k=3)
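
BM25 scores and cosine kNN scores live on different scales, so a weight of 0.5 on each does not necessarily mean an equal contribution. One way to explore weights on a common scale is to min-max normalize each score set client-side before combining. This is a sketch of that idea, not what `hybrid_search` does. Both helper names are hypothetical, and keying documents by an arbitrary string ID is an assumption.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def weighted_fusion(
    keyword_scores: Dict[str, float],
    semantic_scores: Dict[str, float],
    keyword_weight: float = 0.5,
    semantic_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Combine normalized scores; a document missing from one list contributes 0 there."""
    keyword_norm = min_max(keyword_scores)
    semantic_norm = min_max(semantic_scores)
    fused = {
        key: keyword_weight * keyword_norm.get(key, 0.0)
        + semantic_weight * semantic_norm.get(key, 0.0)
        for key in set(keyword_norm) | set(semantic_norm)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Running the same query with `(0.8, 0.2)` and `(0.2, 0.8)` weights shows directly how the top result shifts between the keyword favorite and the semantic favorite.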

## 10. Key Takeaways

### When to use each approach:

#### 🔍 Keyword Search (BM25)
**Best for:**
- Exact keyword matches
- Technical terms and jargon
- Known-item searches
- Fast performance needs

**Limitations:**
- Doesn't understand synonyms or paraphrases
- Misses semantically similar content with different words
- Sensitive to spelling and exact phrasing

#### 🧠 Semantic Search (Vector)
**Best for:**
- Conceptual similarity
- Paraphrased queries
- Questions in natural language
- Finding related content across different terminology

**Limitations:**
- Can miss exact matches if semantically distant
- More computationally expensive
- May return conceptually similar but contextually wrong results

#### ⚡ Hybrid Search (Combined)
**Best for:**
- Production systems with diverse queries
- When you want robustness across query types
- Balancing precision and recall
- Most real-world use cases

**Considerations:**
- Requires tuning weights for your specific use case
- More complex to implement and debug
- Higher computational cost than keyword-only

### General Recommendations:
1. Start with hybrid search (50/50 weights) as a baseline
2. Analyze your query patterns and tune weights accordingly
3. Use keyword-heavy weights (0.7/0.3) for technical domains
4. Use semantic-heavy weights (0.3/0.7) for conversational queries
5. Always evaluate on your specific dataset and use case

## 11. Cleanup

Optional: Clean up the index after experiments.

In [None]:
# Uncomment to delete the index
# es.indices.delete(index=INDEX_NAME)
# print(f"Deleted index: {INDEX_NAME}")

## Exercises

1. **Custom Queries**: Try your own queries and observe which search method works best
2. **Weight Tuning**: Experiment with different hybrid search weights for specific query types
3. **Field Boosting**: Modify the keyword search to boost different fields (question vs answer)
4. **Custom Scoring**: Implement a custom scoring function that considers category matches
5. **Evaluation Metrics**: Create relevance judgments and calculate precision/recall for each method
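
As a starting point for the evaluation exercise, here is a minimal sketch of precision@k and recall@k. The function names are hypothetical, and the relevance judgments (which documents count as relevant for a query) are something you would need to create yourself.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant or k <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)
```

Computing both metrics per query for `keyword_search`, `semantic_search`, and `hybrid_search`, then averaging across queries, gives a simple basis for comparing the three methods on your own data.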