Advanced search service combining BM25 text search, HNSW vector search, synonym expansion, and ML-based reranking.
- Synonym Expansion: Query enhancement using n-gram matching (1-4 words)
- Dual Search: BM25 text search + HNSW vector search
- RRF Merging: Reciprocal Rank Fusion for combining search results
- ML Reranking: Two-stage filtering and ranking using LogisticRegression and CatBoost
- Snippet Extraction: BM25-based relevant snippet extraction
- Comprehensive Error Handling: HTTP status codes with detailed error messages
- Health Monitoring: Service health check endpoint
- Async Architecture: Fully asynchronous for high performance
- Tokenization: Query is tokenized using NLTK word_tokenize
- Synonym Expansion: Tokens are expanded using n-gram synonym matching
- Embedding Generation: Original query is embedded using vLLM service (BAAI/bge-m3)
- Dual Search: For each index:
- BM25 search with expanded query
- HNSW vector search with query embedding
- RRF Merging: Results from BM25 and HNSW are merged per index
- Deduplication: Documents with the same text_hash are deduplicated
- Two-Stage Reranking:
- Stage 1: LogisticRegression filters out irrelevant documents
- Stage 2: CatBoost ranks remaining documents
- Threshold Filtering: Documents below CatBoost threshold are removed
- Snippet Extraction: Best matching snippet extracted for each document
- Response: Top K documents returned with snippets and scores
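As an illustration of the RRF merging step in the pipeline above, here is a minimal sketch. It assumes each search returns an ordered list of document IDs with no duplicates. The function name `rrf_merge` and its parameters are hypothetical and are not taken from `app/utils/rrf.py`; `k` and `limit` mirror the `RRF_K` and `DOCS_PER_INDEX` settings.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, limit=100):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the best hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    merged = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return merged[:limit]


# Illustrative hit lists for one index (made-up IDs).
bm25_hits = ["doc_a", "doc_b", "doc_c"]
hnsw_hits = ["doc_b", "doc_d", "doc_a"]

merged = rrf_merge([bm25_hits, hnsw_hits])
print([doc_id for doc_id, _ in merged])
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both searches (`doc_a`, `doc_b`) outrank those found by only one, which is the behaviour RRF is chosen for.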
- OpenSearch: Document storage and search (BM25 + KNN)
- vLLM Embedding Service: Text embedding generation (BAAI/bge-m3 model)
Services must be on the same Docker network (test_network) for communication.
- Ensure external services are running:
# OpenSearch should be running on test_network:9200
# Embedding service should be running on test_network:8000
- Build the Docker image:
docker build -t search-service .
- Run the service:
docker run -d \
  --name search_service \
  --network test_network \
  -p 8008:8008 \
  search-service
- Install dependencies:
pip install -r requirements.txt
- Update the .env file with local service URLs:
EMBEDDING_SERVICE_URL=http://localhost:8004
OPENSEARCH_URL=https://localhost:9200
- Run the service:
python -m uvicorn app.main:app --host 0.0.0.0 --port 8008
Configuration is managed via environment variables in the .env file:
# Service Configuration
SERVICE_PORT=8008
SERVICE_HOST=0.0.0.0
# External Services (Docker network names)
EMBEDDING_SERVICE_URL=http://embedding_server:8000
OPENSEARCH_URL=https://opensearch:9200
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=our_password
# Search Parameters
RRF_K=60 # RRF constant
DOCS_PER_INDEX=100 # Documents to select per index after RRF
TOP_K_FINAL=15 # Final number of documents to return
CATBOOST_THRESHOLD=0.5 # Minimum CatBoost score threshold
SNIPPET_WINDOW_SIZE=30 # Words per snippet window
# File Paths
SYNONYMS_PATH=data/synonyms.txt
LOGREG_MODEL_PATH=models/logreg.pkl
CATBOOST_MODEL_PATH=models/catboost.cbm
# Timeouts (seconds)
OPENSEARCH_TIMEOUT=30
EMBEDDING_TIMEOUT=60
Perform advanced search with synonym expansion and ML reranking.
Request Body:
{
"query": "machine learning applications",
"user_id": "12345",
"indices": ["user123_topic1", "system_topic1", "system_topic2"],
"filters": {
"date_range": {
"from": "2024-01-01",
"to": "2024-12-31"
}
}
}
Response:
{
"results": [
{
"doc_id": "doc123",
"snippet": "Machine learning is a subset of artificial intelligence...",
"score": 0.85
}
],
"query_expanded": "machine learning ml applications",
"processing_time_ms": 245.6,
"request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
Check service health status.
Response:
{
"status": "healthy",
"services": {
"opensearch": "up",
"embedding": "up",
"models": "loaded"
}
}
Service information.
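The search request and response bodies shown above could be exercised from a small client such as the sketch below. The endpoint path `/search` is an assumption (the route is not stated here, so check `app/main.py`), and the helper names `build_search_request` and `search` are hypothetical. The port comes from the run instructions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8008"  # port taken from the run instructions
SEARCH_PATH = "/search"             # ASSUMPTION: hypothetical route, verify in app/main.py


def build_search_request(query, user_id, indices, date_from=None, date_to=None):
    """Build a POST request carrying the documented search body."""
    payload = {"query": query, "user_id": user_id, "indices": indices}
    # The date filter is only attached when both bounds are given.
    if date_from and date_to:
        payload["filters"] = {"date_range": {"from": date_from, "to": date_to}}
    return urllib.request.Request(
        BASE_URL + SEARCH_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_search_request(
    "machine learning applications",
    "12345",
    ["user123_topic1", "system_topic1"],
    date_from="2024-01-01",
    date_to="2024-12-31",
)
# With the service running, the call would look like:
# for hit in search(request)["results"]:
#     print(hit["doc_id"], hit["score"], hit["snippet"])
```

Non-2xx responses raise `urllib.error.HTTPError`, whose body carries the service's error payload.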
The service returns appropriate HTTP status codes with detailed error messages:
- 400 Bad Request: Invalid query format, invalid date range
- 404 Not Found: Index not found, required files missing
- 422 Unprocessable Entity: Validation errors (missing fields, invalid formats)
- 500 Internal Server Error: Unexpected processing errors
- 503 Service Unavailable: External service unavailable (OpenSearch, embedding service)
- 504 Gateway Timeout: External service timeout
Error Response Format:
{
"error": {
"code": 404,
"message": "Index not found: user123_topic1",
"detail": "Specified index does not exist",
"request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
}
SearchServiceBaseLine/
├── app/
│   ├── __init__.py
│   ├── main.py                    # FastAPI application
│   ├── config.py                  # Configuration management
│   ├── models.py                  # Pydantic models
│   ├── services/
│   │   ├── __init__.py
│   │   ├── synonym_service.py     # Synonym expansion
│   │   ├── embedding_service.py   # Embedding client
│   │   ├── opensearch_service.py  # OpenSearch client
│   │   ├── reranking_service.py   # ML reranking
│   │   └── snippet_service.py     # Snippet extraction
│   └── utils/
│       ├── __init__.py
│       ├── text_processing.py     # Text utilities
│       └── rrf.py                 # RRF algorithm
├── data/
│   └── synonyms.txt               # Synonym dictionary
├── models/
│   ├── logreg.pkl                 # LogReg model
│   └── catboost.cbm               # CatBoost model
├── .env                           # Environment configuration
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Docker configuration
└── README.md                      # Documentation
Each line contains comma-separated synonyms:
machine learning, ml
deep learning, dl
artificial intelligence, ai
neural network, nn
The service creates a bidirectional mapping where each term maps to all other terms in the same line.
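A minimal sketch of how such a file could be turned into a bidirectional mapping and applied to a tokenized query. The function names are hypothetical, and the greedy longest-match strategy (trying 4-word phrases first, then shorter ones) is an assumption about how the 1-4 word n-gram matching might work, not a description of `synonym_service.py`.

```python
def load_synonyms(lines):
    """Map every term to the other terms that share its line."""
    mapping = {}
    for line in lines:
        terms = [t.strip().lower() for t in line.split(",") if t.strip()]
        for term in terms:
            others = {t for t in terms if t != term}
            mapping.setdefault(term, set()).update(others)
    return mapping


def expand_query(tokens, mapping, max_n=4):
    """Insert synonyms right after each matched n-gram (longest match first)."""
    tokens = [t.lower() for t in tokens]
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in mapping:
                out.append(phrase)
                out.extend(sorted(mapping[phrase]))  # sorted for stable output
                i += n
                break
        else:
            # No n-gram starting at this token has a synonym; keep it as-is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


synonyms = load_synonyms([
    "machine learning, ml",
    "deep learning, dl",
])
print(expand_query(["machine", "learning", "applications"], synonyms))
# machine learning ml applications
```

In the service the lines would come from the file at `SYNONYMS_PATH`, and the tokens from the tokenization step.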
Pickle file containing:
{
'model': sklearn.linear_model.LogisticRegression,
'scaler': sklearn.preprocessing.StandardScaler,
'threshold': float # Classification threshold
}
Features: [bm25_score, hnsw_score]
Standard CatBoost model file trained on features: [bm25_score, hnsw_score]
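The two model files could be combined roughly as in the sketch below. It makes several assumptions that are not confirmed here: the CatBoost model is a classifier exposing `predict_proba`, it receives the raw (unscaled) features, and the positive-class probability is used as the final score. The function name and the document dict keys other than `bm25_score` and `hnsw_score` are hypothetical.

```python
def two_stage_rerank(docs, logreg_bundle, catboost_model,
                     catboost_threshold=0.5, top_k=15):
    """Filter with LogisticRegression, then rank with CatBoost.

    docs: list of dicts holding at least 'bm25_score' and 'hnsw_score'.
    logreg_bundle: dict with 'model', 'scaler' and 'threshold' keys,
                   matching the pickle layout described above.
    """
    if not docs:
        return []
    features = [[d["bm25_score"], d["hnsw_score"]] for d in docs]

    # Stage 1: drop documents whose relevance probability is below the threshold.
    scaled = logreg_bundle["scaler"].transform(features)
    relevance = logreg_bundle["model"].predict_proba(scaled)
    kept = [
        (doc, row)
        for doc, row, proba in zip(docs, features, relevance)
        if proba[1] >= logreg_bundle["threshold"]
    ]
    if not kept:
        return []

    # Stage 2: score the survivors (ASSUMPTION: raw features, classifier model).
    cb_proba = catboost_model.predict_proba([row for _, row in kept])
    ranked = [
        {**doc, "score": float(proba[1])}
        for (doc, _), proba in zip(kept, cb_proba)
    ]
    # Threshold filtering, then best score first.
    ranked = [d for d in ranked if d["score"] >= catboost_threshold]
    ranked.sort(key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]
```

The bundle would typically be read with `pickle.load` from `LOGREG_MODEL_PATH`, and the CatBoost model with something like `CatBoostClassifier().load_model(...)` from `CATBOOST_MODEL_PATH`; `catboost_threshold` and `top_k` correspond to `CATBOOST_THRESHOLD` and `TOP_K_FINAL`.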
Expected index structure:
{
"mappings": {
"properties": {
"text": {"type": "text"},
"embedding": {
"type": "knn_vector",
"dimension": 1024,
"method": {
"name": "hnsw",
"space_type": "cosinesimil"
}
},
"doc_id": {"type": "keyword"},
"text_hash": {"type": "keyword"},
"user_upload_time": {"type": "date"}
}
}
}
The service uses Python's standard logging with different levels:
- INFO: Request received, processing stages completed, results count
- DEBUG: Token expansion, synonym matches, search results, scores, filtering decisions
- ERROR: Service failures, connection errors
- WARNING: Missing data, unexpected conditions
Each log entry includes a request_id for request tracing.
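One way to attach a request_id to every log entry with the standard library is a `logging.LoggerAdapter`, sketched below. The logger name `search_service`, the log format, and the helper `get_request_logger` are assumptions for illustration; the service's actual logging setup may differ.

```python
import logging
import uuid

# Dedicated logger with its own handler, so the %(request_id)s field
# is only required on records that go through the adapter below.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
)
base_logger = logging.getLogger("search_service")  # hypothetical logger name
base_logger.addHandler(handler)
base_logger.setLevel(logging.DEBUG)
base_logger.propagate = False


def get_request_logger(request_id=None):
    """Return a logger that stamps every record with the given request_id."""
    request_id = request_id or str(uuid.uuid4())
    return logging.LoggerAdapter(base_logger, {"request_id": request_id})


log = get_request_logger("a1b2c3d4-e5f6-7890-abcd-ef1234567890")
log.info("Request received")
log.debug("Synonym matches: %s", ["ml"])
log.warning("Document has no text_hash")
```

Creating one adapter per incoming request keeps the ID out of every individual log call.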
- All I/O operations are asynchronous
- Concurrent dual searches (BM25 + HNSW) per index
- Efficient RRF merging with O(n log n) complexity
- Batch snippet extraction
- Model inference optimized with NumPy vectorization
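The concurrent dual search per index can be pictured with `asyncio.gather`, as in this sketch. The coroutine names and the fake search functions are placeholders; the real searches would call OpenSearch through `opensearch_service.py`.

```python
import asyncio


async def search_index(index, bm25_search, hnsw_search):
    """Run the BM25 and HNSW searches for one index concurrently."""
    bm25_hits, hnsw_hits = await asyncio.gather(
        bm25_search(index), hnsw_search(index)
    )
    return index, bm25_hits, hnsw_hits


async def search_all(indices, bm25_search, hnsw_search):
    """Fan out over all indices; every index runs its two searches in parallel."""
    results = await asyncio.gather(
        *(search_index(i, bm25_search, hnsw_search) for i in indices)
    )
    return {index: (bm25, hnsw) for index, bm25, hnsw in results}


# Placeholder searches that only simulate I/O latency.
async def fake_bm25(index):
    await asyncio.sleep(0.01)
    return [f"{index}:bm25_doc"]


async def fake_hnsw(index):
    await asyncio.sleep(0.01)
    return [f"{index}:hnsw_doc"]


hits = asyncio.run(search_all(["system_topic1", "system_topic2"], fake_bm25, fake_hnsw))
print(hits["system_topic1"])
```

Because all searches wait on I/O at the same time, total latency is close to that of the slowest single search rather than the sum of all of them.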
# Install dev dependencies
pip install pytest pytest-asyncio httpx
# Run tests
pytest tests/
- Edit data/synonyms.txt
- Add new synonym lines
- Restart the service
- Replace model files in the models/ directory
- Ensure feature dimensions match (2 features: BM25 and HNSW scores)
- Restart the service
- Check that model files exist in the models/ directory
- Verify synonyms.txt exists in the data/ directory
- Check logs for initialization errors
- Verify OpenSearch is running and accessible
- Check credentials in .env file
- Ensure services are on the same Docker network
- Check vLLM service is running
- Increase EMBEDDING_TIMEOUT in .env
- Verify network connectivity
- Check that indices exist in OpenSearch
- Verify documents match date range filters
- Review LogReg/CatBoost thresholds (may be too strict)
This project is proprietary software.
For issues and questions, contact the development team.