Conversation
Pull Request Overview
This PR implements a two-tier semantic caching system that reduces AWS Bedrock costs by 30-50% through intelligent response caching using AWS Bedrock Titan Embeddings. The system attempts exact hash-based matching first (O(1) lookup), then falls back to embedding-based semantic similarity matching (cosine similarity with 0.95 threshold) to identify and serve cached responses for similar queries.
Key Changes:
- Added embedding service integration with AWS Bedrock Titan Embeddings for generating 1536-dimensional vectors
- Implemented vector similarity utilities (cosine similarity, Euclidean distance, batch operations) using NumPy
- Enhanced cache service to support both exact and semantic matching with separate TTLs and comprehensive statistics tracking
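The dual-tier lookup described above could be sketched roughly as follows. This is an illustrative outline only, not the PR's actual API: the `lookup` signature, `embed` callback, and cache data structures are hypothetical stand-ins, while the SHA-256 keying, O(1) exact tier, and 0.95 cosine-similarity threshold come from the PR description.

```python
import hashlib

import numpy as np

SIMILARITY_THRESHOLD = 0.95  # cosine-similarity cutoff from the PR description


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def lookup(prompt: str, exact_cache: dict, semantic_cache: list, embed):
    """Tier 1: O(1) exact hash match; tier 2: embedding similarity scan."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:  # tier 1: exact match, no embedding call needed
        return exact_cache[key]

    query_vec = embed(prompt)  # tier 2: embed and compare against cached vectors
    for cached_vec, cached_response in semantic_cache:
        if cosine_similarity(query_vec, cached_vec) >= SIMILARITY_THRESHOLD:
            return cached_response
    return None  # miss on both tiers
```

Note that tier 1 avoids the embedding call entirely, which is where much of the claimed cost saving would come from on repeated identical prompts.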
Reviewed Changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| app/utils/vector.py | New vector similarity utilities for cosine similarity, normalization, Euclidean distance, and batch operations |
| app/utils/__init__.py | New utils package initialization |
| app/services/embeddings.py | New embedding service using AWS Bedrock Titan Embeddings with retry logic and cost estimation |
| app/services/cache.py | Enhanced cache service with semantic matching, dual-tier lookup strategy, and detailed statistics |
| app/models/schemas.py | Updated cache statistics response schema with semantic-specific fields |
| app/core/config.py | Added semantic cache and embedding configuration settings |
| app/api/v1/endpoints/cache.py | Updated cache stats endpoint to return semantic cache metrics |
| SEMANTIC_CACHE.md | Comprehensive documentation on semantic cache implementation, architecture, and usage |
| README.md | Updated with semantic cache features and documentation links |
| .env.example | Added semantic cache and embedding configuration variables |
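The batch operations listed for `app/utils/vector.py` might look something like the sketch below. The function names `batch_cosine_similarity` and `best_match` are assumptions for illustration, not the file's actual contents; only the use of NumPy and the 0.95 default threshold are taken from the PR.

```python
import numpy as np


def batch_cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query vector against each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q  # one vectorized pass instead of a Python loop


def best_match(query: np.ndarray, matrix: np.ndarray, threshold: float = 0.95):
    """Index of the most similar row, or None if nothing clears the threshold."""
    sims = batch_cosine_similarity(query, matrix)
    idx = int(np.argmax(sims))
    return idx if sims[idx] >= threshold else None
```

Normalizing both sides up front reduces each comparison to a dot product, which matters once the semantic cache holds many 1536-dimensional vectors.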
```python
if self.semantic_enabled:
    # Generate unique cache ID
    cache_id = hashlib.sha256(
        f"{model_id}:{prompt}:{response.get('timestamp', '')}".encode()
```
Using response.get('timestamp', '') for cache_id generation is unreliable because the response may not have a timestamp field, resulting in duplicate cache IDs for the same model+prompt combination. This could cause cache collisions. Consider using a more reliable unique identifier such as uuid4() or the current timestamp from time.time().
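One way to apply the suggested fix is to mix a `uuid4()` into the hashed string, which guarantees uniqueness even when two responses share a model, prompt, and (missing) timestamp. The helper name `make_cache_id` is hypothetical; only the SHA-256-over-`model_id:prompt` pattern comes from the quoted code.

```python
import hashlib
import uuid


def make_cache_id(model_id: str, prompt: str) -> str:
    """Collision-free cache ID: a fresh UUID replaces the unreliable timestamp."""
    return hashlib.sha256(
        f"{model_id}:{prompt}:{uuid.uuid4()}".encode()
    ).hexdigest()
```

With this change, generating two IDs for the same model+prompt pair always yields distinct keys, so a missing `timestamp` field can no longer cause cache collisions.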
```python
# $0.0001 per 1K tokens, average prompt ~200 tokens
embedding_cost = (semantic_hits * 200 / 1000) * 0.0001
```
The magic number 200 for average prompt tokens is hardcoded. Consider extracting this as a configuration constant or calculating it from actual prompt lengths to improve accuracy of cost estimates.
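Extracting the constant into settings might look like this sketch. The `EmbeddingSettings` class and field names are hypothetical (the PR's actual config lives in `app/core/config.py`, whose structure isn't shown here); the $0.0001-per-1K-tokens rate and the 200-token average are taken from the quoted comment.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingSettings:
    # Assumed defaults, mirroring the hardcoded values in the quoted snippet.
    cost_per_1k_tokens: float = 0.0001  # Titan Embeddings rate from the comment
    avg_prompt_tokens: int = 200        # now configurable instead of a magic number


def estimate_embedding_cost(semantic_hits: int, settings: EmbeddingSettings) -> float:
    """Same arithmetic as the original line, with both constants injected."""
    return (semantic_hits * settings.avg_prompt_tokens / 1000) * settings.cost_per_1k_tokens
```

Tracking the true running average of prompt lengths (instead of a fixed 200) would tighten the estimate further, at the cost of a little bookkeeping per request.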
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.