Add DummySearchProvider Fallback and Settings Improvements #27
Conversation
- Added web search configuration variables to .env.example
- Updated .gitignore to include a new line for clarity
- Refactored import statement in indexing_router.py for consistency
- Included web answering router in main.py for improved routing structure
- Updated requirements.txt to include trafilatura and ensure httpx uses HTTP/2
…ngs for search configuration
Caution: Review failed. The pull request is closed.

Walkthrough

Adds a provider-agnostic asynchronous web search service (Tavily and Dummy providers) with optional in-memory caching, exposes web-answering endpoints under /internal, and extends settings.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Client
participant API as FastAPI (/internal/websearch)
participant Svc as WebSearchService
participant Prov as SearchProvider (Tavily / Dummy)
participant Ext as External Search API
Client->>API: GET /internal/search?query=...
API->>Svc: search(query, k, language, region, timeout, use_cache)
alt Cache hit
Svc-->>API: cached List<SearchResult>
else Cache miss
Svc->>Prov: search(query, k, region, language, timeout)
alt Tavily selected
Prov->>Ext: HTTP request (with API key)
Ext-->>Prov: JSON response
Prov-->>Svc: parsed List<SearchResult>
else Dummy selected / fallback
Prov-->>Svc: synthesized List<SearchResult>
end
Svc-->>API: results (caches response)
end
API-->>Client: JSON response
note over Prov: Tavily provider applies retries with exponential backoff on transient errors (e.g., 429)
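
For orientation, the diagrammed flow reduces to roughly the following Python sketch. The class and method names mirror the diagram, but the bodies, parameters, and defaults here are assumptions for illustration, not the PR's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str

class WebSearchService:
    """Illustrative cache-then-provider flow matching the sequence diagram."""

    def __init__(self, provider, cache: Optional[dict] = None):
        self.provider = provider                      # Tavily or Dummy provider instance
        self._cache = cache if cache is not None else {}

    async def search(self, query: str, k: int = 8, *, use_cache: bool = True) -> List[SearchResult]:
        key = (query, k)
        if use_cache and key in self._cache:            # cache hit: return stored results
            return self._cache[key]
        results = await self.provider.search(query, k)  # cache miss: delegate to the provider
        self._cache[key] = results                      # cache the response before returning
        return results
```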
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Actionable comments posted: 3
🧹 Nitpick comments (12)
.env.example (1)
9-13: Default to dummy provider in example and add trailing newline

To match settings' default/fallback behavior and avoid noisy fallback logs when no key is set, prefer WEB_SEARCH_ENGINE=dummy in the example and keep the API-specific line commented until a key is available. Also add a trailing newline. Apply this diff:

-# Web search configuration
-WEB_SEARCH_ENGINE=tavily
-TAVILY_API_KEY=your_tavily_api_key_here
-MAX_FETCH_CONCURRENCY=4
-DEFAULT_TOP_K_RESULTS=8
+# Web search configuration
+# Set to "dummy" for local/dev without keys; switch to "tavily" when you add TAVILY_API_KEY.
+WEB_SEARCH_ENGINE=dummy
+# TAVILY_API_KEY=your_tavily_api_key_here
+MAX_FETCH_CONCURRENCY=4
+DEFAULT_TOP_K_RESULTS=8
+

Also applies to: 7-7
api/indexing_router.py (2)
37-39: Size check uses characters, not bytes

len(src["text"]) counts chars. Enforce MB against bytes to match max_upload_mb. Apply this diff:

-    if "text" in src and len(src["text"]) > settings.max_upload_mb * 1024 * 1024:
+    if "text" in src and len(src["text"].encode("utf-8")) > settings.max_upload_mb * 1024 * 1024:

50-53: 415 here may mislead

A chunking failure for plain text isn't a media-type problem. Consider 422 (unprocessable) or 500 with a more specific detail.

-    except Exception:
-        raise HTTPException(status_code=415, detail="UNSUPPORTED_FILE_TYPE")
+    except Exception:
+        raise HTTPException(status_code=422, detail="CHUNKING_FAILED")

config/settings.py (2)
8-9: Remove unused imports

Field isn't used.

-from pydantic import SecretStr, ValidationError, Field
+from pydantic import SecretStr, ValidationError

30-34: Let BaseSettings load env; avoid direct os.environ in defaults

Declare defaults and let Pydantic map env (e.g., WEB_SEARCH_ENGINE). The current os.environ.get(...) duplicates config loading and complicates precedence.

-    web_search_engine: str = os.environ.get("WEB_SEARCH_ENGINE", "dummy")  # Default to dummy provider if not specified
+    web_search_engine: str = "dummy"  # overridden by WEB_SEARCH_ENGINE if set

Optional: validate concurrency bounds:

    # e.g., max_fetch_concurrency: int = Field(default=4, ge=1, le=64)
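
As a hedged illustration of the suggested approach, a minimal pydantic-settings sketch follows; the field names match those discussed in this review, while the env-file wiring and the bounds on the integer fields are assumptions:

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # BaseSettings maps WEB_SEARCH_ENGINE / TAVILY_API_KEY / etc. from the environment
    # (or a .env file) automatically, so no os.environ.get(...) is needed in defaults.
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    web_search_engine: str = "dummy"              # overridden by WEB_SEARCH_ENGINE if set
    tavily_api_key: str | None = None             # overridden by TAVILY_API_KEY if set
    max_fetch_concurrency: int = Field(default=4, ge=1, le=64)
    default_top_k_results: int = Field(default=8, ge=1, le=20)

settings = Settings()
```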
services/web_search_service.py (7)
125-127: Align k clamp with Tavily's documented range

Docs show max_results range 0..20 (default 5). Clamp to 1..20 (or 0..20 if you want to allow zero).

-    # Clamp k between 3 and 15
-    k = max(3, min(15, k))
+    # Clamp k to Tavily's documented range
+    k = max(1, min(20, k))

Reference: Tavily max_results parameter. (docs.tavily.com)
170-176: Use logger.exception for failures

Keep stack traces on failures; helps ops and debugging.

-    logger.error(f"Tavily search failed after {max_retries} attempts: rate limited")
+    logger.exception("Tavily search failed after %s attempts: rate limited", max_retries)
@@
-    logger.error(f"Tavily search failed with status {e.response.status_code}: {e.response.text}")
+    logger.exception("Tavily search failed with status %s: %s", e.response.status_code, e.response.text)
@@
-    logger.error(f"Tavily request failed: {str(e)}")
+    logger.exception("Tavily request failed: %s", e)
@@
-    logger.error(f"Failed to initialize {self.provider_name} search provider: {str(e)}")
+    logger.exception("Failed to initialize %s search provider: %s", self.provider_name, e)
@@
-    logger.error(f"Search failed: {str(e)}")
+    logger.exception("Search failed: %s", e)

Also applies to: 164-168, 237-240, 297-297
52-65: Mark unused parameters in Dummy provider to satisfy linters

Prefix with underscores or use them to mirror real provider behavior.

-    async def search(self, query: str, k: int, region: str, language: str, timeout_seconds: int) -> List[SearchResult]:
+    async def search(self, query: str, k: int, region: str, language: str, timeout_seconds: int) -> List[SearchResult]:
@@
-        for i in range(min(k, 5)):  # Cap at 5 results
+        for i in range(min(k, 20)):  # Cap to a sane upper bound

Or:

-    async def search(self, query: str, k: int, region: str, language: str, timeout_seconds: int) -> List[SearchResult]:
+    async def search(self, query: str, k: int, _region: str, _language: str, _timeout_seconds: int) -> List[SearchResult]:
214-233: Fallback to dummy for unknown provider strings (match PR intent)

Currently an unknown provider raises ValueError; better to warn and fall back to dummy, same as the init-failure path.

-        if self.provider_name not in self._providers:
-            raise ValueError(f"Unsupported search provider: {self.provider_name}")
+        if self.provider_name not in self._providers:
+            logger.warning("Unsupported search provider '%s'. Falling back to DummySearchProvider.", self.provider_name)
+            self.provider_name = "dummy"
+            self.provider = DummySearchProvider()
+            return
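
Putting both fallback paths together, a self-contained sketch might look like this; the stub provider classes and the constructor signature are invented for illustration and are not the PR's implementation:

```python
import logging

logger = logging.getLogger(__name__)

class DummySearchProvider:
    """Stand-in provider; always constructible, returns synthesized results elsewhere."""

class TavilySearchProvider:
    """Stand-in for the real provider; assume it raises when no API key is configured."""
    def __init__(self, api_key: str | None = None):
        if not api_key:
            raise ValueError("TAVILY_API_KEY is not set")

class WebSearchServiceSketch:
    """Illustrative provider selection with a dummy fallback on bad config or init errors."""

    def __init__(self, provider_name: str, api_key: str | None = None):
        self._providers = {"tavily": lambda: TavilySearchProvider(api_key), "dummy": DummySearchProvider}
        self.provider_name = provider_name.lower()
        if self.provider_name not in self._providers:
            # Unknown provider string: warn and degrade rather than raising
            logger.warning("Unsupported search provider '%s'; falling back to dummy.", self.provider_name)
            self.provider_name = "dummy"
        try:
            self.provider = self._providers[self.provider_name]()
        except Exception:
            # e.g. missing TAVILY_API_KEY: stay responsive with synthesized results
            logger.exception("Failed to initialize %s provider; using dummy.", self.provider_name)
            self.provider_name = "dummy"
            self.provider = DummySearchProvider()
```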
242-245: Type Optional for k and keep signature tidy

Satisfy Ruff RUF013 and clarify the optional arg.

-    async def search(self, query: str, k: int = None, region: str = "auto",
+    async def search(self, query: str, k: Optional[int] = None, region: str = "auto",
@@
-        k = k or settings.default_top_k_results
+        k = k or settings.default_top_k_results

(Import Optional already present.)

Also applies to: 263-264
14-16: Prune unused imports

Protocol isn't used.

-from typing import Dict, List, Optional, Protocol, Type, ClassVar, Mapping
+from typing import Dict, List, Optional, Type, ClassVar, Mapping
209-213: Cache is process-local and not concurrency-safe across workers

Fine for dev; in prod behind multiple workers/nodes it won't share or lock. Document this, or consider a TTL cache with locks or an external cache if this becomes hot-path.
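
For reference, a minimal sketch of the TTL-with-lock alternative the comment mentions; the class and its parameters are hypothetical and not part of this PR:

```python
import asyncio
import time
from typing import Any, Hashable, Optional

class TTLCache:
    """Tiny async-safe TTL cache: single process only, guarded by an asyncio.Lock."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._data: dict[Hashable, tuple[float, Any]] = {}
        self._lock = asyncio.Lock()

    async def get(self, key: Hashable) -> Optional[Any]:
        async with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if time.monotonic() > expires_at:   # expired: drop the entry and report a miss
                del self._data[key]
                return None
            return value

    async def set(self, key: Hashable, value: Any) -> None:
        async with self._lock:
            self._data[key] = (time.monotonic() + self._ttl, value)
```

Note that an asyncio.Lock only serializes coroutines within one event loop; sharing results across Uvicorn workers or nodes would still require an external cache such as Redis.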
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.env.example (1 hunks)
.gitignore (1 hunks)
api/indexing_router.py (1 hunks)
api/main.py (1 hunks)
config/settings.py (2 hunks)
requirements.txt (2 hunks)
services/web_search_service.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/indexing_router.py (1)
services/vector_db_service.py (1)
  VectorDBClient (10-56)
🪛 dotenv-linter (3.3.0)
.env.example
[warning] 7-7: [UnorderedKey] The CORS_ALLOW_ORIGINS key should go before the POSTGRES_URI key
[warning] 11-11: [UnorderedKey] The TAVILY_API_KEY key should go before the WEB_SEARCH_ENGINE key
[warning] 12-12: [UnorderedKey] The MAX_FETCH_CONCURRENCY key should go before the TAVILY_API_KEY key
[warning] 13-13: [EndingBlankLine] No blank line at the end of the file
[warning] 13-13: [UnorderedKey] The DEFAULT_TOP_K_RESULTS key should go before the MAX_FETCH_CONCURRENCY key
🪛 Ruff (0.13.1)
services/web_search_service.py
55-55: Unused method argument: region (ARG002)
55-55: Unused method argument: language (ARG002)
55-55: Unused method argument: timeout_seconds (ARG002)
103-103: Avoid specifying long messages outside the exception class (TRY003)
108-108: Unused method argument: region (ARG002)
123-123: Avoid specifying long messages outside the exception class (TRY003)
160-160: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
164-164: Use logging.exception instead of logging.error (TRY400)
167-167: Use logging.exception instead of logging.error (TRY400)
170-170: Use logging.exception instead of logging.error (TRY400)
170-170: Use explicit conversion flag (RUF010)
172-172: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
231-231: Avoid specifying long messages outside the exception class (TRY003)
236-236: Do not catch blind exception: Exception (BLE001)
237-237: Use logging.exception instead of logging.error (TRY400)
237-237: Use explicit conversion flag (RUF010)
242-242: PEP 484 prohibits implicit Optional; convert to Optional[T] (RUF013)
295-295: Consider moving this statement to an else block (TRY300)
297-297: Use logging.exception instead of logging.error (TRY400)
297-297: Use explicit conversion flag (RUF010)
🔇 Additional comments (4)
.gitignore (1)
4-4: Whitespace tweak is fine

The blank line after .env improves readability between sections.

requirements.txt (2)

7-7: Good switch to httpx HTTP/2 extra

Enabling httpx[http2] is appropriate given external API calls.

16-18: Remove duplicate pydantic and verify heavy deps (numpy/trafilatura)

Verification script returned no matches; confirm whether requirements.txt (lines 16–18) contains a duplicate pydantic entry and whether numpy/trafilatura are required on the request path (cold-start/size impact). If a duplicate pydantic is present, apply this diff to dedupe:

-pydantic>=2.6.0,<3.0.0
 trafilatura>=1.6.0,<2.0.0
 numpy>=1.26.0,<2.0.0
api/main.py (1)
49-49: Resolve missing web_answering router and duplicate "/internal" mounts
- api/endpoints/web_answering.py is missing; main.py imports it (line ~49) and mounts it with prefix="/internal" (line ~52). Add/restore that module or remove/fix the import and include_router call.
- api/indexing_router.py and api/endpoints/internal.py both declare APIRouter(prefix="/internal") and are both included in main.py — this results in two routers mounted at /internal; consolidate or change prefixes as appropriate.
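
One possible consolidation, sketched with stand-in routers; the prefixes and the placeholder endpoint are assumptions, and the real modules named above would supply their own routers:

```python
from fastapi import APIRouter, FastAPI

# Hypothetical stand-ins for api/indexing_router.py, api/endpoints/internal.py,
# and the (currently missing) api/endpoints/web_answering.py.
indexing_router = APIRouter(prefix="/internal/index", tags=["indexing"])
internal_router = APIRouter(prefix="/internal", tags=["internal"])
web_answering_router = APIRouter(prefix="/internal/websearch", tags=["web-answering"])

@web_answering_router.get("/search")
async def search(query: str) -> dict:
    # placeholder handler; the real endpoint would call WebSearchService
    return {"query": query, "results": []}

app = FastAPI()

# Distinct prefixes avoid mounting two routers at the bare "/internal" path;
# alternatively, keep one "/internal" router and include the others under it.
app.include_router(indexing_router)
app.include_router(internal_router)
app.include_router(web_answering_router)
```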
…ion and metadata consistency
…e response handling and data extraction
@paytoniga fixed the reviews
Actionable comments posted: 2
🧹 Nitpick comments (4)
services/vector_db_service.py (4)
51-54: Type the public API (and add batch_size kw-only)

Tighten the contract and make batching configurable. Apply this diff within the shown range:

-    def upsert(self, namespace, ids, vectors, metadata=None):
+    def upsert(
+        self,
+        namespace: str,
+        ids: list[str],
+        vectors: list[list[float]],
+        metadata: "list[dict] | None" = None,
+        *,
+        batch_size: int = 100,
+    ):

Additionally (supporting change outside this range), update typing imports:

-from typing import List
+from typing import List
+# Python 3.10+: using built-in generics in signature above; no extra imports needed.
70-70: Batch upserts to avoid payload limits and improve resilience

Large single calls can exceed request limits; batching is safer.

-        self.index.upsert(vectors=items, namespace=namespace)
+        for start in range(0, len(items), batch_size):
+            self.index.upsert(vectors=items[start:start + batch_size], namespace=namespace)
37-49: Unify upsert paths; avoid tuple-style vectors with the v3 Pinecone client

Use the new upsert to keep one codepath and consistent payload shape.

 def upsert_vectors(self, vectors: List[List[float]], ids: List[str]):
-    # Input validation
-    if vectors is None or ids is None:
-        raise ValueError("vectors and ids must not be None.")
-    if not isinstance(vectors, list) or not isinstance(ids, list):
-        raise TypeError("vectors and ids must be lists.")
-    if len(vectors) != len(ids):
-        raise ValueError("vectors and ids must have the same length.")
-    if self.dimension is not None and any(len(v) != self.dimension for v in vectors):
-        raise ValueError("vector dimensionality mismatch with index.")
-    if not self.index:
-        raise RuntimeError("Index is not initialized. Call create_index first.")
-    self.index.upsert(vectors=[(id, vec) for id, vec in zip(ids, vectors)])
+    # Delegate to unified implementation; default namespace ("") is Pinecone's default.
+    self.upsert(namespace="", ids=ids, vectors=vectors)
51-70: Optional: return the Pinecone response for observability/testing

Returning the last UpsertResponse (or a list when batched) helps with metrics and tests.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/vector_db_service.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
services/vector_db_service.py (2)
tests/test_vector_db_service.py (1)
  upsert (10-12)
api/indexing_router.py (1)
  index (27-83)
🪛 Ruff (0.13.1)
services/vector_db_service.py
58-58: Avoid specifying long messages outside the exception class (TRY003)
60-60: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (1)
services/vector_db_service.py (1)
51-61: Verify module path and upsert call-site compatibility.
- Import path OK — repo contains config/settings.py (/home/jailuser/git/config/settings.py) and multiple files import from config.settings (services/vector_db_service.py, services/web_search_service.py, services/embedding_service.py, api/indexing_router.py, tests/test_embedding_batch.py). No references to core.config.settings found.
- Call-site check incomplete — the ripgrep scan returned "No files were searched", so I cannot confirm callers use the new upsert(namespace, ids, vectors, metadata=...) signature. Re-run a full repo search (example: rg -nP '\bupsert\s*\(' -S || true) or allow an AST-based scan to enumerate upsert call sites and flag mismatches.
        if self.index is None:
            self.create_index()
        if not (len(ids) == len(vectors)):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against missing embedding dimension before auto-creating the index.
Without a dimension, create_index will fail at runtime.
- if self.index is None:
- self.create_index()
+ if self.index is None:
+ if self.dimension is None:
+ raise RuntimeError(
+ "embedding dimension is not configured; set settings.embedding_dimension or pass dimension."
+ )
+ self.create_index()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        if self.index is None:
            if self.dimension is None:
                raise RuntimeError(
                    "embedding dimension is not configured; set settings.embedding_dimension or pass dimension."
                )
            self.create_index()
        if not (len(ids) == len(vectors)):
🤖 Prompt for AI Agents
In services/vector_db_service.py around lines 55-57, the code auto-calls
create_index() without ensuring an embedding dimension exists which will cause a
runtime failure; before calling create_index(), validate that self.dimension is
set or infer it from the provided vectors (e.g., if vectors is non-empty set
self.dimension = len(vectors[0]) after validating all vectors share that
length), and if neither self.dimension nor inferable from vectors, raise a clear
ValueError asking the caller to provide an embedding dimension; adjust
create_index to accept/use the dimension if it currently does not.
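
To make the inference step concrete, here is a small sketch of the kind of helper the prompt describes; the function name and its placement are hypothetical, not taken from the PR:

```python
from typing import Optional, Sequence

def resolve_embedding_dimension(
    configured: Optional[int],
    vectors: Sequence[Sequence[float]],
) -> int:
    """Return the embedding dimension, preferring the configured value and
    otherwise inferring it from the first vector, then checking consistency."""
    if configured is not None:
        dimension = configured
    elif vectors:
        dimension = len(vectors[0])
    else:
        raise ValueError(
            "embedding dimension is not configured and cannot be inferred from an empty batch; "
            "provide a dimension before creating the index"
        )
    mismatched = [i for i, v in enumerate(vectors) if len(v) != dimension]
    if mismatched:
        raise ValueError(f"vectors at positions {mismatched} do not match dimension {dimension}")
    return dimension
```

A caller could then assign self.dimension = resolve_embedding_dimension(self.dimension, vectors) before invoking create_index().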
        if not (len(ids) == len(vectors)):
            raise ValueError("ids and vectors must have the same length")
        if metadata is not None and len(metadata) != len(ids):
            raise ValueError("metadata length must match ids/vectors length")
        items = []
        for i, (id_, vector) in enumerate(zip(ids, vectors)):
            item = {
                "id": id_,
                "values": vector
            }
            if metadata is not None:
                item["metadata"] = metadata[i]
            items.append(item)
        self.index.upsert(vectors=items, namespace=namespace)
Add dimension checks, vector normalization, and metadata type validation.
Parity with upsert_vectors and fewer Pinecone-side errors.
- if not (len(ids) == len(vectors)):
+ if len(ids) != len(vectors):
raise ValueError("ids and vectors must have the same length")
- if metadata is not None and len(metadata) != len(ids):
- raise ValueError("metadata length must match ids/vectors length")
- items = []
- for i, (id_, vector) in enumerate(zip(ids, vectors)):
- item = {
- "id": id_,
- "values": vector
- }
- if metadata is not None:
- item["metadata"] = metadata[i]
- items.append(item)
- self.index.upsert(vectors=items, namespace=namespace)
+ if metadata is not None and len(metadata) != len(ids):
+ raise ValueError("metadata length must match ids/vectors length")
+ items: list[dict] = []
+ for i, (id_, vector) in enumerate(zip(ids, vectors)):
+ # Dimension validation + normalization (supports numpy arrays)
+ if self.dimension is not None and len(vector) != self.dimension:
+ raise ValueError(
+ f"vector[{i}] dimensionality {len(vector)} != expected {self.dimension}"
+ )
+ values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
+ if not all(isinstance(x, (int, float)) for x in values):
+ raise TypeError(f"vector[{i}] must be a sequence of numbers")
+ item = {"id": id_, "values": values}
+ if metadata is not None:
+ md = metadata[i]
+ if not isinstance(md, dict):
+ raise TypeError("each metadata entry must be a dict")
+ item["metadata"] = md
+ items.append(item)
+ self.index.upsert(vectors=items, namespace=namespace)
🧰 Tools
🪛 Ruff (0.13.1)
58-58: Avoid specifying long messages outside the exception class (TRY003)
60-60: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
In services/vector_db_service.py around lines 57 to 70, add parity with
upsert_vectors by validating each vector's dimension, normalizing vectors, and
validating metadata types before calling upsert: check that each vector is a
sequence of numeric values and its length equals the configured embedding
dimension (e.g., self.dimension) and raise ValueError if not; if normalization
is required, convert each vector to a unit vector (divide by its L2 norm,
guarding against zero norm); ensure metadata (when provided) is the expected
type (e.g., dict or list) and raise TypeError for invalid entries; perform these
checks/normalization inside the loop that builds items so you send only
validated, normalized vectors to self.index.upsert(vectors=items,
namespace=namespace).
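
Since the prompt calls for L2 normalization with a zero-norm guard, a brief illustrative helper follows; it is not part of the PR, and whether normalization is appropriate at all depends on the index's similarity metric:

```python
import math
from typing import Sequence

def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Return the unit-length version of `vector`, guarding against a zero norm."""
    values = [float(x) for x in vector]
    norm = math.sqrt(sum(x * x for x in values))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in values]

# Example: l2_normalize([3.0, 4.0]) -> [0.6, 0.8]
```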
This PR introduces a safe, no-key fallback for web search and loosens config requirements to improve local/dev workflows.
What Changed
- core/config/settings.py: the web search engine setting now defaults to dummy when unspecified.
- core/services/web_search_service.py: adds a DummySearchProvider that returns mock results when no real provider is configured/available; WebSearchService now auto-falls back to the dummy provider to keep the system responsive.

Why It Matters
Notes / Usage
Verification
Summary by CodeRabbit
New Features
API
Configuration
Performance/Reliability
Dependencies