
Conversation

enfayz (Collaborator) commented Sep 20, 2025

This refactor streamlines how we build indexes, wire up APIs, and manage configuration/dependencies. It aligns the indexing router with the current VectorDBService API, tightens CORS and router registration, clarifies environment setup, and prunes our Python requirements for reliability across dev and prod.

Key Changes

Indexing Router

  • Migrates to the latest VectorDBService contracts (see the sketch after this list):

    • Initializes indexes via create_index()
    • Upserts embeddings via upsert_vectors()
  • Normalizes metadata preparation to support future filtering/search facets.

  • Adds clearer error handling and logs around index creation and batch upserts.
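
A minimal sketch of how these pieces fit together, assuming the VectorDBClient interface shown in the review below (no-arg constructor, create_index(), and upsert_vectors(vectors, ids)); chunk_text and embed_texts are hypothetical stand-ins for the real chunking and embedding helpers:

    import logging
    from uuid import uuid4

    from services.vector_db_service import VectorDBClient as VectorDBService

    logger = logging.getLogger(__name__)

    def index_document(text: str) -> int:
        vdb = VectorDBService()
        vdb.create_index()  # initialize/ensure the target index

        chunks = chunk_text(text)      # hypothetical chunking helper
        vectors = embed_texts(chunks)  # hypothetical embedding helper
        ids = [str(uuid4()) for _ in chunks]

        # Metadata is normalized up front for future filtering/search facets;
        # the current client signature does not accept it yet, so it is only prepared.
        metadata = [{"chunk_index": i, "length": len(c)} for i, c in enumerate(chunks)]

        try:
            vdb.upsert_vectors(vectors=vectors, ids=ids)
            logger.info("Upserted %d chunks", len(ids))
        except Exception:
            logger.exception("Batch upsert failed for %d chunks", len(ids))
            raise
        return len(ids)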

Main API Entrypoint

  • Ensures all routers are registered in a consistent order.
  • Hardens CORS defaults with a configurable allowlist, methods, and headers (sketched below).
  • Cleans up startup hooks and dependency injection.
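
A sketch of the hardened CORS wiring, assuming the comma-separated CORS_ALLOW_ORIGINS variable from .env.example; the method and header lists are illustrative defaults, not the project's actual values:

    import os

    from fastapi import FastAPI
    from fastapi.middleware.cors import CORSMiddleware

    app = FastAPI()

    # Explicit allowlist read from the environment, e.g.
    # CORS_ALLOW_ORIGINS=http://localhost:3000,https://yourdomain.com
    origins = [o.strip() for o in os.getenv("CORS_ALLOW_ORIGINS", "").split(",") if o.strip()]

    app.add_middleware(
        CORSMiddleware,
        allow_origins=origins,                     # explicit allowlist, no wildcard
        allow_methods=["GET", "POST", "OPTIONS"],  # illustrative
        allow_headers=["Authorization", "Content-Type"],
    )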

Environment & Settings

  • Updates .env.example with required/optional vars for search and vector DB providers (incl. API keys).
  • settings.py now treats external API keys as optional, with safe fallbacks and explicit validation messages (sketched below).
  • Better separation of env-specific config (local/dev/prod) and sensible defaults.
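
A hedged sketch of the optional-key handling, assuming a pydantic v2 BaseSettings-style settings.py; the loader is not part of this diff, so the field and method names below are illustrative:

    from typing import Optional

    from pydantic_settings import BaseSettings

    class Settings(BaseSettings):
        web_search_engine: str = "tavily"
        tavily_api_key: Optional[str] = None  # optional, with a safe None fallback
        max_fetch_concurrency: int = 4
        default_top_k_results: int = 8

        def require_web_search(self) -> None:
            # Fail fast with an explicit message instead of a deep stack trace later.
            if self.web_search_engine == "tavily" and not self.tavily_api_key:
                raise RuntimeError("TAVILY_API_KEY is required when WEB_SEARCH_ENGINE=tavily")

    settings = Settings()  # values are read from the environment / .env file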

Requirements

  • Deduplicates and pins key packages; removes stale/unused entries.
  • Groups dev-only tooling separately to avoid bloating production images (see the illustrative split below).
  • Aligns versions with current SDKs used by VectorDBService.
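
As an illustration of the split (the dev-requirements file name and the tools listed in it are hypothetical; the runtime pins are the ones introduced in this PR):

    # requirements.txt: runtime only, pinned ranges (other entries omitted)
    httpx[http2]>=0.27.0,<1.0.0
    trafilatura>=1.6.0,<2.0.0
    numpy>=1.26.0,<2.0.0

    # requirements-dev.txt: dev-only tooling, kept out of production images
    -r requirements.txt
    pytest
    ruff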

Breaking Changes

  • Indexing endpoints now rely on create_index()/upsert_vectors(); any ad-hoc calls to the previous methods should be updated (see the sketch after this list).
  • Environment validation is stricter: missing critical vars will surface clear errors at startup.
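
For callers still on the old ad-hoc interface (the review below flags several live call sites), the migration looks roughly like this; metadata and namespace are dropped until the client grows support for them, as the review suggests:

    # Before: old ad-hoc call, no longer supported
    vdb = VectorDBService()
    vdb.upsert(namespace=ns, ids=ids, vectors=vectors, metadata=metadata)

    # After: current contract
    vdb = VectorDBService()
    vdb.create_index()
    vdb.upsert_vectors(vectors=vectors, ids=ids)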

Migration / Setup

  1. Dependencies

    pip install -r requirements.txt
  2. Environment

    • Copy .env.example to .env and update the values.
    • Provide credentials for vector DB and (optionally) web search.
  3. Index Bootstrapping

    • If needed, re-run initial indexing to create/refresh the target index.
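
Putting the steps together (the last command is hypothetical; substitute the entrypoint your indexing flow actually exposes):

    pip install -r requirements.txt
    cp .env.example .env                 # then fill in vector DB / web search credentials
    python -m api.indexing_bootstrap     # hypothetical bootstrap entrypoint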

Testing & Verification

  • Unit tests updated for router + settings paths.

  • Manual checks:

    • Start API and confirm routers are reachable.
    • CORS preflight succeeds from the local frontend (see the curl probe below).
    • Index create + batch upsert completes without errors and metadata is stored.
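
A quick manual probe for the preflight check (ports assume a local API on 8000 and a frontend on 3000):

    curl -i -X OPTIONS http://localhost:8000/internal/ \
      -H "Origin: http://localhost:3000" \
      -H "Access-Control-Request-Method: POST"
    # Expect a 2xx response with Access-Control-Allow-Origin echoing the frontend origin.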

Notes

  • Logging around indexing is intentionally more verbose to aid observability; can be tuned via env.
  • Lays groundwork for metadata-aware querying and filterable search in a follow-up PR.

Summary by CodeRabbit

  • New Features
    • Added internal web search endpoints under /internal for answering from web content.
  • Configuration
    • Introduced environment variables for selecting search engine, API key, fetch concurrency, and default result count.
  • Chores
    • Updated dependencies, enabling HTTP/2 for HTTP client and adding libraries for content extraction and numeric processing.
  • Style
    • Minor .gitignore formatting cleanup.

- Added web search configuration variables to .env.example
- Updated .gitignore to include a new line for clarity
- Refactored import statement in indexing_router.py for consistency
- Included web answering router in main.py for improved routing structure
- Updated requirements.txt to include trafilatura and ensure httpx uses HTTP/2

coderabbitai bot commented Sep 20, 2025

Walkthrough

Introduces internal web search integration by adding a new router to FastAPI, updates vector DB import to use VectorDBClient via alias, adds environment variables for web search configuration, adjusts .gitignore spacing, and updates requirements to include httpx with HTTP/2, trafilatura, and numpy.

Changes

Cohort / File(s) | Summary of changes
Environment configuration (.env.example) | Added variables: WEB_SEARCH_ENGINE, TAVILY_API_KEY, MAX_FETCH_CONCURRENCY, DEFAULT_TOP_K_RESULTS. Preserved existing entries.
Git ignore housekeeping (.gitignore) | Inserted a blank line after .env. No rule changes.
API routing, web search (api/main.py) | Imported and mounted web_answering_router at prefix /internal with tag ["websearch"].
Vector DB client aliasing (api/indexing_router.py) | Replaced import: use VectorDBClient aliased as VectorDBService; runtime calls remain VectorDBService(...).
Dependencies (requirements.txt) | Updated httpx to httpx[http2]>=0.27.0,<1.0.0; added trafilatura>=1.6.0,<2.0.0 and numpy>=1.26.0,<2.0.0.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as FastAPI App
  participant Router as web_answering_router
  participant Search as WebSearch Engine (httpx[http2])
  participant Parser as trafilatura
  Note over API,Router: New internal endpoint ("/internal/…")
  Client->>API: HTTP request (web search answer)
  API->>Router: Route to handler
  Router->>Search: Query (engine=WEB_SEARCH_ENGINE, top_k=DEFAULT_TOP_K_RESULTS)
  Search-->>Router: Results (URLs, snippets)
  Router->>Parser: Fetch & extract content (max concurrency=MAX_FETCH_CONCURRENCY)
  Parser-->>Router: Cleaned text
  Router-->>Client: Aggregated answer (JSON)
  Note over Router,Search: Uses TAVILY_API_KEY when engine="tavily"

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • fehranbit

Poem

A whisk of whiskers, a hop in the night,
New routes burrow in, out of plain sight.
Tavily winds with HTTP/2 breeze,
Trafilatura nibbles pages with ease.
With keys in .env and clients aligned,
Results pop up—carrots well-refined! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title "Refactor: Improve Indexing, Environment, and Dependency Management" succinctly captures the primary areas changed in the PR (indexing refactors, .env updates, and dependency adjustments), which align with the file diffs and PR objectives. The phrasing is concise, specific, and relevant for a teammate scanning history, avoiding vague terms or extraneous details, and it accurately reflects the main intent without attempting to enumerate every small change.
  • Docstring Coverage: Passed. No functions found in the changes; docstring coverage check skipped.

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/indexing_router.py (2)

37-38: Size check counts characters, not bytes.

Use UTF‑8 byte length to enforce max_upload_mb.

Apply within this file (shown in the larger diff above):

-len(src["text"]) > settings.max_upload_mb * 1024 * 1024
+len(src["text"].encode("utf-8")) > settings.max_upload_mb * 1024 * 1024

26-83: Stale VectorDB upsert callsites found — update callers or restore compatibility

Found live callers that will break if upsert() was removed — update callers to the new API or reintroduce a compatible upsert wrapper.

  • services/vector_db_service.py:49 — self.index.upsert(...).
  • api/indexing_router.py:28,71 — vdb = VectorDBService(); vdb.upsert(namespace=..., ids=..., vectors=..., metadata=...).
  • api/endpoints/internal.py:109-110 — commented references to vector_db.upsert_vectors (reconcile comments with actual API).
  • tests/test_vector_db_service.py:18-29 — VectorDBClient/create_index usage; update tests if interface changed.
🧹 Nitpick comments (7)
api/indexing_router.py (4)

51-52: 415 status code likely incorrect for chunking errors.

When input is already plain text, a chunking failure is a 400/422, not “Unsupported Media Type”.

Consider:

-        except Exception:
-            raise HTTPException(status_code=415, detail="UNSUPPORTED_FILE_TYPE")
+        except Exception:
+            raise HTTPException(status_code=422, detail="CHUNKING_FAILED")

54-61: Metadata offsets look wrong/non-actionable.

char_span always starts at 0; if intended to be [start,end) within the original text, compute real spans. If not used yet, omit to avoid misleading consumers.

-            "char_span": [0, len(chunks[i])]
+            # Option A: compute real offsets if chunk_text can return them
+            # "char_span": [start_positions[i], end_positions[i]]
+            # Option B: omit until supported downstream
+            # (remove char_span key)

10-12: Remove unused imports.

extract_text_from_pdf, extract_text_from_docx, logging, os are unused.

-from services.parsing_service import extract_text_from_pdf, extract_text_from_docx
-import logging
-import os
+# remove unused imports; keep lean

26-83: Optional: plumb metadata/namespace into the client.

If you need metadata filters later, extend the client now.

Outside this file (services/vector_db_service.py):

@@ class VectorDBClient:
-    def upsert_vectors(self, vectors: List[List[float]], ids: List[str]):
+    def upsert_vectors(
+        self,
+        vectors: List[List[float]],
+        ids: List[str],
+        metadata: list[dict] | None = None,
+        namespace: str | None = None,
+    ):
@@
-        self.index.upsert(vectors=[(id, vec) for id, vec in zip(ids, vectors)])
+        if metadata and len(metadata) != len(ids):
+            raise ValueError("metadata length must match ids/vectors.")
+        if metadata:
+            payload = [
+                {"id": _id, "values": vec, "metadata": md}
+                for _id, vec, md in zip(ids, vectors, metadata)
+            ]
+        else:
+            payload = [{"id": _id, "values": vec} for _id, vec in zip(ids, vectors)]
+        self.index.upsert(vectors=payload, namespace=namespace)

Then in this router:

- vdb.upsert_vectors(vectors=vectors, ids=ids)
+ vdb.upsert_vectors(vectors=vectors, ids=ids, metadata=metadata, namespace=request.project_id)
requirements.txt (1)

16-16: Duplicate pydantic entry.

Line 16 repeats line 2; keep one to avoid drift.

- pydantic>=2.6.0,<3.0.0
.env.example (2)

7-7: Provide a safe default and hint for local CORS.

Consider a local-friendly default and comment to reduce setup friction.

-CORS_ALLOW_ORIGINS=
+# e.g. http://localhost:3000,https://yourdomain.com
+CORS_ALLOW_ORIGINS=http://localhost:3000

9-13: Minor dotenv hygiene: ordering + EOF newline.

Reorder keys alphabetically (if you run dotenv‑linter) and add a trailing newline.

-# Web search configuration
-WEB_SEARCH_ENGINE=tavily
-TAVILY_API_KEY=your_tavily_api_key_here
-MAX_FETCH_CONCURRENCY=4
-DEFAULT_TOP_K_RESULTS=8
+# Web search configuration
+DEFAULT_TOP_K_RESULTS=8
+MAX_FETCH_CONCURRENCY=4
+TAVILY_API_KEY=your_tavily_api_key_here
+WEB_SEARCH_ENGINE=tavily
+
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af0338c and 56c90a2.

📒 Files selected for processing (5)
  • .env.example (1 hunks)
  • .gitignore (1 hunks)
  • api/indexing_router.py (1 hunks)
  • api/main.py (1 hunks)
  • requirements.txt (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/indexing_router.py (1)
services/vector_db_service.py (1)
  • VectorDBClient (10-56)
🪛 dotenv-linter (3.3.0)
.env.example

[warning] 7-7: [UnorderedKey] The CORS_ALLOW_ORIGINS key should go before the POSTGRES_URI key
[warning] 11-11: [UnorderedKey] The TAVILY_API_KEY key should go before the WEB_SEARCH_ENGINE key
[warning] 12-12: [UnorderedKey] The MAX_FETCH_CONCURRENCY key should go before the TAVILY_API_KEY key
[warning] 13-13: [EndingBlankLine] No blank line at the end of the file
[warning] 13-13: [UnorderedKey] The DEFAULT_TOP_K_RESULTS key should go before the MAX_FETCH_CONCURRENCY key

🔇 Additional comments (5)
.gitignore (1)

4-4: LGTM on spacing.

The added blank line after .env is harmless and improves readability.

requirements.txt (2)

7-7: Good: enable HTTP/2 for httpx.

httpx[http2] is a sensible upgrade for concurrency.


17-18: Make trafilatura & numpy optional — move to extras or a separate requirements file

Repository search produced no matches for trafilatura or numpy; confirm whether these are required at runtime. If only used for web-answering, move them to a web extras file (e.g., requirements-web.txt or extras) and gate imports with try/except or a feature flag.

File: requirements.txt Lines: 17-18

trafilatura>=1.6.0,<2.0.0
numpy>=1.26.0,<2.0.0
api/main.py (2)

52-52: Route order/namespacing looks fine; tag is clear.

No blocking issues with including the websearch router under /internal.


49-49: Confirm router exists and is internal-only.

rg reported "api/endpoints/web_answering.py: No such file or directory". Confirm the correct import path or whether the module was renamed/removed; verify that the module exports a router object and that its endpoints require authentication (not publicly reachable). Suggested local checks: rg -n 'web_answering' -S, rg -nP '(?m)^router\s*=' -S, and rg -nP '@router\.(get|post)\(' -S.

fehranbit (Member) commented:

We need to search GitHub for an open-source solution. Let's discuss more in tomorrow's chat.

fehranbit merged commit 50f4748 into main on Sep 21, 2025 (1 check passed).