# Refactor: Improve Indexing, Environment, and Dependency Management #26
## Conversation
- Added web search configuration variables to `.env.example`
- Updated `.gitignore` to include a new line for clarity
- Refactored import statement in `indexing_router.py` for consistency
- Included web answering router in `main.py` for improved routing structure
- Updated `requirements.txt` to include trafilatura and ensure httpx uses HTTP/2
**Walkthrough**

Introduces internal web search integration by adding a new router to the FastAPI app, updating the vector DB import to use `VectorDBClient` via an alias, adding environment variables for web search configuration, adjusting `.gitignore` spacing, and updating requirements to include httpx with HTTP/2, trafilatura, and numpy.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant API as FastAPI App
    participant Router as web_answering_router
    participant Search as WebSearch Engine (httpx[http2])
    participant Parser as trafilatura
    Note over API,Router: New internal endpoint ("/internal/…")
    Client->>API: HTTP request (web search answer)
    API->>Router: Route to handler
    Router->>Search: Query (engine=WEB_SEARCH_ENGINE, top_k=DEFAULT_TOP_K_RESULTS)
    Search-->>Router: Results (URLs, snippets)
    Router->>Parser: Fetch & extract content (max concurrency=MAX_FETCH_CONCURRENCY)
    Parser-->>Router: Cleaned text
    Router-->>Client: Aggregated answer (JSON)
    Note over Router,Search: Uses TAVILY_API_KEY when engine="tavily"
```
**Estimated code review effort:** 🎯 2 (Simple) | ⏱️ ~10 minutes
**Pre-merge checks:** ✅ Passed checks (3 passed)
Actionable comments posted: 1
> **Caution**: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/indexing_router.py (2)
**Lines 37-38:** Size check counts characters, not bytes. Use UTF-8 byte length to enforce `max_upload_mb`. Apply within this file (shown in the larger diff above):

```diff
-len(src["text"]) > settings.max_upload_mb * 1024 * 1024
+len(src["text"].encode("utf-8")) > settings.max_upload_mb * 1024 * 1024
```
**Lines 26-83:** Stale VectorDB upsert callsites found — update callers or restore compatibility. Live callers will break if `upsert()` was removed; update them to the new API or reintroduce a compatible upsert wrapper (a sketch of such a wrapper follows the list below).

- `services/vector_db_service.py:49` — `self.index.upsert(...)`.
- `api/indexing_router.py:28,71` — `vdb = VectorDBService(); vdb.upsert(namespace=..., ids=..., vectors=..., metadata=...)`.
- `api/endpoints/internal.py:109-110` — commented references to `vector_db.upsert_vectors` (reconcile comments with the actual API).
- `tests/test_vector_db_service.py:18-29` — `VectorDBClient`/`create_index` usage; update tests if the interface changed.
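If compatibility is preferred over updating every caller, here is a minimal sketch of such a wrapper, assuming the new client exposes `upsert_vectors(vectors, ids)` as referenced elsewhere in this review; the old keyword names mirror the stale callsites above:

```python
# A hypothetical backward-compatible upsert() shim for VectorDBClient.
# The old signature is accepted and delegated to the new API.
from typing import List, Optional


class VectorDBClient:
    def upsert_vectors(self, vectors: List[List[float]], ids: List[str]):
        """New API (stubbed here for illustration)."""
        print(f"upserting {len(ids)} vectors")

    def upsert(
        self,
        vectors: List[List[float]],
        ids: List[str],
        metadata: Optional[List[dict]] = None,
        namespace: Optional[str] = None,
    ):
        """Deprecated signature kept so stale callers keep working.

        metadata/namespace are accepted but ignored until upsert_vectors()
        supports them (see the nitpick below about plumbing them through).
        """
        return self.upsert_vectors(vectors=vectors, ids=ids)
```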
🧹 Nitpick comments (7)
api/indexing_router.py (4)
**Lines 51-52:** 415 status code likely incorrect for chunking errors. When the input is already plain text, a chunking failure is a 400/422, not "Unsupported Media Type". Consider:

```diff
-    except Exception:
-        raise HTTPException(status_code=415, detail="UNSUPPORTED_FILE_TYPE")
+    except Exception:
+        raise HTTPException(status_code=422, detail="CHUNKING_FAILED")
```
**Lines 54-61:** Metadata offsets look wrong/non-actionable. `char_span` always starts at 0; if it is intended to be [start, end) within the original text, compute real spans. If it is not used yet, omit it to avoid misleading consumers.

```diff
-    "char_span": [0, len(chunks[i])]
+    # Option A: compute real offsets if chunk_text can return them
+    # "char_span": [start_positions[i], end_positions[i]]
+    # Option B: omit until supported downstream
+    # (remove char_span key)
```
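If Option A is chosen, here is a minimal sketch of offset-aware chunking; `chunk_text_with_spans` is a hypothetical helper, not the project's actual chunker:

```python
# Hypothetical chunker that records [start, end) character offsets into the
# original text, so char_span metadata is accurate instead of always [0, len).
from typing import List, Tuple


def chunk_text_with_spans(text: str, size: int = 500) -> List[Tuple[str, int, int]]:
    spans = []
    for start in range(0, len(text), size):
        end = min(start + size, len(text))
        spans.append((text[start:end], start, end))
    return spans


# Usage when building chunk metadata:
for i, (chunk, start, end) in enumerate(chunk_text_with_spans("some long document text")):
    metadata = {"chunk_index": i, "char_span": [start, end]}
```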
**Lines 10-12:** Remove unused imports. `extract_text_from_pdf`, `extract_text_from_docx`, `logging`, and `os` are unused.

```diff
-from services.parsing_service import extract_text_from_pdf, extract_text_from_docx
-import logging
-import os
+# remove unused imports; keep lean
```
**Lines 26-83 (optional):** Plumb metadata/namespace into the client. If you need metadata filters later, extend the client now.

Outside this file (`services/vector_db_service.py`):

```diff
 @@ class VectorDBClient:
-    def upsert_vectors(self, vectors: List[List[float]], ids: List[str]):
+    def upsert_vectors(
+        self,
+        vectors: List[List[float]],
+        ids: List[str],
+        metadata: list[dict] | None = None,
+        namespace: str | None = None,
+    ):
 @@
-        self.index.upsert(vectors=[(id, vec) for id, vec in zip(ids, vectors)])
+        if metadata and len(metadata) != len(ids):
+            raise ValueError("metadata length must match ids/vectors.")
+        if metadata:
+            payload = [
+                {"id": _id, "values": vec, "metadata": md}
+                for _id, vec, md in zip(ids, vectors, metadata)
+            ]
+        else:
+            payload = [{"id": _id, "values": vec} for _id, vec in zip(ids, vectors)]
+        self.index.upsert(vectors=payload, namespace=namespace)
```

Then in this router:

```diff
-    vdb.upsert_vectors(vectors=vectors, ids=ids)
+    vdb.upsert_vectors(vectors=vectors, ids=ids, metadata=metadata, namespace=request.project_id)
```

requirements.txt (1)
**Line 16:** Duplicate `pydantic` entry. Line 16 repeats line 2; keep one to avoid drift.

```diff
-pydantic>=2.6.0,<3.0.0
```
.env.example (2)
**Line 7:** Provide a safe default and a hint for local CORS. Consider a local-friendly default and comment to reduce setup friction.

```diff
-CORS_ALLOW_ORIGINS=
+# e.g. http://localhost:3000,https://yourdomain.com
+CORS_ALLOW_ORIGINS=http://localhost:3000
```
**Lines 9-13:** Minor dotenv hygiene: ordering + EOF newline. Reorder keys alphabetically (if you run dotenv-linter) and add a trailing newline.

```diff
-# Web search configuration
-WEB_SEARCH_ENGINE=tavily
-TAVILY_API_KEY=your_tavily_api_key_here
-MAX_FETCH_CONCURRENCY=4
-DEFAULT_TOP_K_RESULTS=8
+# Web search configuration
+DEFAULT_TOP_K_RESULTS=8
+MAX_FETCH_CONCURRENCY=4
+TAVILY_API_KEY=your_tavily_api_key_here
+WEB_SEARCH_ENGINE=tavily
+
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `.env.example` (1 hunks)
- `.gitignore` (1 hunks)
- `api/indexing_router.py` (1 hunks)
- `api/main.py` (1 hunks)
- `requirements.txt` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/indexing_router.py (1)
services/vector_db_service.py (1)
- `VectorDBClient` (10-56)
🪛 dotenv-linter (3.3.0)
.env.example
- [warning] 7-7: [UnorderedKey] The CORS_ALLOW_ORIGINS key should go before the POSTGRES_URI key
- [warning] 11-11: [UnorderedKey] The TAVILY_API_KEY key should go before the WEB_SEARCH_ENGINE key
- [warning] 12-12: [UnorderedKey] The MAX_FETCH_CONCURRENCY key should go before the TAVILY_API_KEY key
- [warning] 13-13: [EndingBlankLine] No blank line at the end of the file
- [warning] 13-13: [UnorderedKey] The DEFAULT_TOP_K_RESULTS key should go before the MAX_FETCH_CONCURRENCY key
🔇 Additional comments (5)
.gitignore (1)
**Line 4:** LGTM on spacing. The added blank line after `.env` is harmless and improves readability.

requirements.txt (2)
**Line 7:** Good: enable HTTP/2 for httpx. `httpx[http2]` is a sensible upgrade for concurrency.
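For reference, a minimal sketch of using it; note that installing the `http2` extra only pulls in the `h2` dependency, and HTTP/2 still has to be enabled on the client (the URL is illustrative):

```python
# httpx negotiates HTTP/2 only when http2=True is passed to the client.
import asyncio

import httpx


async def main():
    async with httpx.AsyncClient(http2=True) as client:
        resp = await client.get("https://example.com")
        # "HTTP/2" when the server negotiates it, otherwise "HTTP/1.1"
        print(resp.http_version)


asyncio.run(main())
```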
**Lines 17-18:** Make trafilatura & numpy optional — move them to extras or a separate requirements file. Repository search produced no matches for trafilatura or numpy; confirm whether these are required at runtime. If they are only used for web answering, move them to a web extras file (e.g., requirements-web.txt or extras) and gate the imports with try/except or a feature flag.

File: requirements.txt, lines 17-18:

```text
trafilatura>=1.6.0,<2.0.0
numpy>=1.26.0,<2.0.0
```
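A minimal sketch of the suggested import gate; the flag name and error message are illustrative:

```python
# Gate the optional dependency so the core app imports cleanly even when
# the web-answering extras are not installed.
try:
    import trafilatura
    HAS_TRAFILATURA = True
except ImportError:
    trafilatura = None
    HAS_TRAFILATURA = False


def extract_main_text(html: str) -> str:
    if not HAS_TRAFILATURA:
        raise RuntimeError(
            "Web answering requires optional extras: "
            "pip install -r requirements-web.txt"
        )
    # trafilatura.extract() returns the cleaned main text, or None on failure
    return trafilatura.extract(html) or ""
```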
api/main.py (2)
**Line 52:** Route order/namespacing looks fine; the tag is clear. No blocking issues with including the websearch router under `/internal`.
**Line 49:** Confirm the router exists and is internal-only. rg reported "api/endpoints/web_answering.py: No such file or directory". Confirm the correct import path or whether the module was renamed/removed; verify the module exports a `router` object and that its endpoints require authentication (not publicly reachable). Suggested local checks:

```sh
rg -n 'web_answering' -S && rg -nP '(?m)^router\s*=' -S && rg -nP '@(get|post|router\.(get|post))\(' -S
```
We need to search GitHub for an open-source solution. Let's discuss more in tomorrow's chat.
This refactor streamlines how we build indexes, wire up APIs, and manage configuration/dependencies. It aligns the indexing router with the current `VectorDBService` API, tightens CORS and router registration, clarifies environment setup, and prunes our Python requirements for reliability across dev and prod.

## Key Changes

### Indexing Router

- Migrates to the latest `VectorDBService` contracts: `create_index()` and `upsert_vectors()` (a sketch of the new flow follows this list).
- Normalizes metadata preparation to support future filtering/search facets.
- Adds clearer error handling and logs around index creation and batch upserts.
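A minimal sketch of the resulting router-side flow, assuming the `VectorDBClient` interface referenced in the review above; `embed()`, the chunking, and the no-argument `create_index()` call are illustrative stand-ins, not the project's exact helpers:

```python
# Illustrative indexing flow against the new contract: bootstrap the index,
# prepare chunk ids and vectors, then batch-upsert with logging on failure.
import logging
from typing import List

logger = logging.getLogger(__name__)


def embed(chunk: str) -> List[float]:
    """Stand-in embedder for illustration only."""
    return [float(len(chunk))]


def index_document(vdb, text: str, doc_id: str) -> None:
    vdb.create_index()  # bootstrap per the new contract (signature assumed)

    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    ids = [f"{doc_id}-{i}" for i in range(len(chunks))]
    vectors = [embed(c) for c in chunks]

    try:
        vdb.upsert_vectors(vectors=vectors, ids=ids)
        logger.info("Upserted %d chunks for %s", len(ids), doc_id)
    except Exception:
        logger.exception("Batch upsert failed for %s", doc_id)
        raise
```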
### Main API Entrypoint

### Environment & Settings

- Updates `.env.example` with required/optional vars for search and vector DB providers (incl. API keys).
- `settings.py` now treats external API keys as optional, with safe fallbacks and explicit validation messages (a sketch follows below).
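A minimal sketch of that pattern, assuming pydantic-settings; the field names mirror `.env.example`, and the validation message is illustrative:

```python
# Illustrative settings: external API keys load as optional, and a feature
# that needs one raises an explicit, actionable message instead of failing
# at import time.
from typing import Optional

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    web_search_engine: str = "tavily"
    tavily_api_key: Optional[str] = None
    max_fetch_concurrency: int = 4
    default_top_k_results: int = 8

    def require_tavily_key(self) -> str:
        if self.web_search_engine == "tavily" and not self.tavily_api_key:
            raise ValueError(
                "TAVILY_API_KEY must be set when WEB_SEARCH_ENGINE=tavily"
            )
        return self.tavily_api_key or ""


settings = Settings()
```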
### Requirements

- Pins/updates the dependencies used by `VectorDBService`.

## Breaking Changes

- `create_index()` / `upsert_vectors()` replace the previous vector DB methods; any ad-hoc calls to previous methods should be updated (see the before/after sketch below).

## Migration / Setup
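A before/after sketch of the call-site migration; the old keywords mirror the stale callsites flagged in the review, and the no-argument `create_index()` call is an assumption:

```python
from services.vector_db_service import VectorDBClient

# Placeholder data for illustration.
vectors = [[0.1, 0.2], [0.3, 0.4]]
ids = ["doc-0", "doc-1"]

# Before (old ad-hoc API, now removed):
# vdb = VectorDBService()
# vdb.upsert(namespace=project_id, ids=ids, vectors=vectors, metadata=None)

# After (current contract):
vdb = VectorDBClient()
vdb.create_index()  # ensure the index exists before upserting
vdb.upsert_vectors(vectors=vectors, ids=ids)
```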
### Dependencies

### Environment

- Copy `.env.example` → `.env` and fill in your values.

### Index Bootstrapping

## Testing & Verification

- Unit tests updated for router + settings paths.
- Manual checks:

## Notes