feat: add websift library implementation #1
Nihallllll wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Walkthrough
Introduces a comprehensive Web Intelligence library with web crawling, content extraction, text chunking, embeddings, vector storage, and semantic search capabilities. Includes CLI and REST API interfaces, multiple embedder backends, vector store implementations, caching layers, and extensive test coverage. Establishes project configuration, dependencies, and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant Crawler as Crawler
    participant Extractor as Extractor
    participant Chunker as Chunker
    participant Embedder as Embedder
    participant VectorStore as VectorStore
    Client->>Pipeline: index_url(url)
    Pipeline->>Crawler: crawl_url_async(url)
    Crawler-->>Pipeline: CrawlObject {html, status}
    Pipeline->>Extractor: extract_content(html, url)
    Extractor-->>Pipeline: ExtractedContent {text, title, word_count}
    Pipeline->>Chunker: chunk_text(text, doc_id, url)
    Chunker-->>Pipeline: List[Chunk]
    Pipeline->>Embedder: embed_batch(chunk texts)
    Embedder-->>Pipeline: List[embeddings]
    Pipeline->>VectorStore: add(vectors, metadatas, ids)
    VectorStore-->>Pipeline: success
    Pipeline-->>Client: indexed result
```

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant SearchProvider as SearchProvider
    participant Crawler as Crawler
    participant VectorStore as VectorStore
    participant Formatter as ContextFormatter
    Client->>Pipeline: search_web(query, max_results=5)
    Pipeline->>SearchProvider: search(query, max_results)
    SearchProvider-->>Pipeline: List[SearchResult {url, title, snippet}]
    Pipeline->>Crawler: crawl_urls_batch(urls)
    Crawler-->>Pipeline: List[CrawlObject]
    Note over Pipeline: Extract, chunk, embed (same as index flow)
    Pipeline->>VectorStore: add(vectors, metadatas)
    VectorStore-->>Pipeline: success
    Pipeline->>VectorStore: search(query_embedding, limit)
    VectorStore-->>Pipeline: List[matching chunks]
    Pipeline->>Formatter: format_context_numbered(chunks, query)
    Formatter-->>Pipeline: RetrievedContext
    Pipeline-->>Client: formatted context with sources
```

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant VectorStore as VectorStore
    participant Embedder as Embedder
    participant Formatter as ContextFormatter
    Client->>Pipeline: retrieve(query, output_format="numbered")
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>VectorStore: search(query_embedding, limit=5)
    VectorStore-->>Pipeline: List[chunks with metadata, scores]
    Pipeline->>Formatter: format_context_numbered(chunks, query)
    Formatter-->>Pipeline: RetrievedContext {context_text, sources, word_count}
    Pipeline-->>Client: RetrievedContext with formatted output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 51
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/quickstart.py`:
- Around line 6-53: The Groq LLM usage is unguarded and will cause ImportError
or require GROQ_API_KEY; wrap the import and LLM-related section so the core
FastPipeline (FastPipeline, pipeline.search_web, pipeline.index_url,
pipeline.retrieve) runs standalone: attempt to import ChatGroq in a try/except
and only execute the LLM block (creating llm via ChatGroq, calling
ctx.as_messages and llm.invoke, and the second invoke on ctx2.as_messages) if
the import succeeded and os.environ.get("GROQ_API_KEY") is present; if either
check fails, print a short notice that Groq is unavailable and skip the
LLM/example invocation while keeping the rest of the quickstart intact.
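The suggested guard could look like this minimal sketch. The `langchain_groq` import path and the model name are assumptions, and `ctx.as_messages` simply follows the names used in the comment above; the real quickstart may differ:

```python
import os

# Optional-dependency guard: the quickstart's core pipeline demo should run
# even when langchain_groq (assumed import path) or GROQ_API_KEY is missing.
try:
    from langchain_groq import ChatGroq
    GROQ_IMPORT_OK = True
except ImportError:
    GROQ_IMPORT_OK = False


def maybe_run_llm_demo(ctx):
    """Run the LLM part of the quickstart only when Groq is actually usable."""
    if not (GROQ_IMPORT_OK and os.environ.get("GROQ_API_KEY")):
        print("Groq unavailable (package or GROQ_API_KEY missing); skipping LLM demo.")
        return None
    llm = ChatGroq(model="llama-3.1-8b-instant")  # model name is illustrative
    return llm.invoke(ctx.as_messages())
```

The rest of the quickstart (search_web, index_url, retrieve) stays outside the guard and always runs.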
In `@main.py`:
- Around line 101-111: The delete branch for command == "delete" (handling
target, pipeline.delete_url and pipeline.delete_document) performs irreversible
deletes without confirmation; modify it to prompt the user to confirm the
destructive action (e.g., "Are you sure you want to delete <target>? Type 'yes'
to confirm:") and only call pipeline.delete_url or pipeline.delete_document if
the user types the explicit confirmation, or allow an explicit --force flag to
skip the prompt; ensure the confirmation text includes the target so a mistype
is obvious and abort (return) if the user does not confirm.
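A minimal confirmation helper in the spirit of this suggestion; the `input_fn` parameter exists only to make the sketch testable and is not part of the PR's code:

```python
def confirm_delete(target: str, force: bool = False, input_fn=input) -> bool:
    """Ask for explicit confirmation before a destructive delete.

    Returns True only when --force was passed or the user types 'yes'.
    The prompt names the target so a mistyped argument is obvious.
    """
    if force:
        return True
    reply = input_fn(f"Are you sure you want to delete {target}? Type 'yes' to confirm: ")
    return reply.strip().lower() == "yes"
```

The delete branch would then call `pipeline.delete_url` or `pipeline.delete_document` only when `confirm_delete(target, force=...)` returns True, and return otherwise.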
- Around line 67-70: The loop that parses args uses token value filtering which
removes query tokens equal to "--limit"/"-n" or the numeric value; instead, when
you detect the flag in the for i, a in enumerate(args) loop, record the exact
indices to remove (i and i+1), set limit = int(args[i+1]), and then rebuild
query from args[1:] by skipping tokens at those recorded indices (or build a new
list via enumerate and skip matching original indices) so you exclude the flag
and its value by position rather than by value; update the variables args, limit
and query accordingly (referencing args, limit, query, and the for i, a in
enumerate(args) loop).
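Index-based removal might be sketched like this (the function name is hypothetical, and for simplicity the sketch takes the arg list already stripped of the leading command token):

```python
def parse_search_args(args: list[str]) -> tuple[str, int]:
    """Extract --limit/-n by position so query words equal to '--limit'
    or to the numeric value are never dropped by accident."""
    limit = 5
    skip: set[int] = set()
    for i, a in enumerate(args):
        if a in ("--limit", "-n") and i + 1 < len(args):
            limit = int(args[i + 1])
            skip.update((i, i + 1))  # remove flag and value by index, not by value
    query = " ".join(a for i, a in enumerate(args) if i not in skip)
    return query, limit
```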
In `@MANIFEST.in`:
- Line 4: Update the MANIFEST.in line that currently reads "recursive-include
src *.py" to point at the actual package directory used in pyproject.toml
(packages = ["web_intelligence"]); replace "src" with "web_intelligence" so
Python sources are included in sdist (i.e., change the directive to
"recursive-include web_intelligence *.py" or equivalent to include the package's
source files).
In `@pyproject.toml`:
- Around line 44-48: The [project.urls] entries (Homepage, Documentation,
Repository, Issues) currently use placeholder "yourusername/web-intelligence"
links; replace each value with the actual project URLs (real GitHub org/user and
repo name or hosted docs site) so Homepage, Documentation, Repository and Issues
point to resolvable addresses and update Documentation to the real docs anchor
or URL.
- Around line 8-9: Replace the placeholder author metadata in pyproject.toml by
updating the authors table (the authors = [...] entry) to include the real
maintainer details: set the actual name and email for the package
owner/maintainer in the {name = "...", email = "..."} record (or add multiple
author records if needed) so the package metadata no longer contains "Your Name"
or "your.email@example.com".
- Around line 60-61: The comment claiming "Web search (DuckDuckGo, no API key
needed)" is misleading for the dependency line search = ["ddgs>=6.0.0"]; either
change the dependency to the DuckDuckGo-specific package (e.g., replace ddgs
with duckduckgo-search) if you require DuckDuckGo-only behavior, or update the
comment to accurately describe that ddgs is a metasearch aggregator (supports
multiple search sources) so the comment and dependency match; refer to the
dependency key search and the package name ddgs in the pyproject.toml when
making the change.
- Around line 92-93: The console script entry point "main:main" points to a
root-level main.py that won't be packaged (packages = ["web_intelligence"]);
move the root main.py into the web_intelligence package (e.g.,
web_intelligence/main.py or web_intelligence/cli.py) and update the
pyproject.toml entry under [project.scripts] to reference the package module
(for example change "main:main" to "web_intelligence.main:main" or
"web_intelligence.cli:main") so the entry point resolves to a module included in
the distributed package.
In `@README.md`:
- Around line 108-130: Add a blank line after each Markdown heading to satisfy
MD lint rules: insert an empty line after the headings "When you give it a URL
(`index_url`):", "When you ask a question (`retrieve`):", and "When you search
the web (`search_web`):" so the paragraph lists that follow render correctly;
update the README.md section titled "How It Works (Step by Step)" accordingly by
ensuring each heading is immediately followed by one blank line before its
numbered list/paragraph.
- Around line 213-237: The fenced code block showing the project tree in
README.md lacks a language identifier; update that block to use a plain-text
language tag (e.g., ```text or ```plaintext) so the directory tree (starting
with "web_intelligence/" and files like optimized_pipeline.py, async_crawler.py,
embedders/, vector_stores/, etc.) is rendered correctly and satisfies markdown
linting.
In `@requirements.txt`:
- Around line 1-5: Requirements entries are missing version constraints and
should match the minimum versions declared in pyproject.toml; update the
dependency lines for httpx[http2], beautifulsoup4, trafilatura and numpy to
include the same >= version pins used in pyproject.toml (e.g.,
httpx[http2]>=0.25.0 and beautifulsoup4>=4.12.0) so pip install -r
requirements.txt installs the same minima as pip install web-intelligence.
In `@tests/test_library.py`:
- Around line 389-440: The network tests in class TestPipelineIntegration are
running by default because `@pytest.mark.network` has no default exclusion; update
the test class to only run when an explicit flag/env var is set by replacing or
augmenting the marker with a skip-if guard (e.g., add
`@pytest.mark.skipif`(os.getenv("RUN_NETWORK_TESTS") is None, reason="network
tests disabled unless RUN_NETWORK_TESTS is set") and import os and pytest at
top) so normal pytest runs skip TestPipelineIntegration; alternatively, wrap the
class with pytest.skipif using the same condition or check inside
_make_pipeline/test methods to call pytest.skip when the env var is not present.
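The opt-in gate could be expressed as a reusable marker, roughly like this sketch (the env var name follows the comment; the test class body is a placeholder):

```python
import os

import pytest

# Opt-in gate: network tests run only when RUN_NETWORK_TESTS is set.
requires_network = pytest.mark.skipif(
    os.getenv("RUN_NETWORK_TESTS") is None,
    reason="network tests disabled unless RUN_NETWORK_TESTS is set",
)


@requires_network
class TestPipelineIntegration:
    def test_smoke(self):
        # real network-dependent assertions would live here
        assert True
```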
- Around line 166-209: The tests call auto_detect_embedder() unconditionally and
fail when no backend is installed (raising NoEmbedderError); update each test in
TestEmbedder (e.g., test_auto_detect, test_embed_single, test_embed_batch,
test_embed_batch_empty, test_embeddings_are_normalized, test_cache_stats) to
handle this by catching NoEmbedderError from web_intelligence.embedders and
calling pytest.skip(...) with a clear message (or use pytest.importorskip /
pytest.mark.skipif around the test) so the test is skipped in core-only
environments rather than failing.
In `@web_intelligence/_logging.py`:
- Around line 7-17: The setup_logging function should declare a return type (->
None) and reuse the module-level logger instead of calling logging.getLogger
again; update the signature to include the return type and replace the local
"root = logging.getLogger('web_intelligence')" with the existing module-level
"logger" variable, then run the same handler-existence check and setLevel on
that logger (e.g., check logger.handlers for an existing StreamHandler before
adding handler and call logger.setLevel(level)).
In `@web_intelligence/async_crawler.py`:
- Around line 70-77: The current rate_limiter semaphore (used around client.get
via rate_limiter.acquire()/release()) only limits concurrent in-flight requests,
not the actual requests-per-second and also forces min rate 1; replace the
semaphore approach with a time-based token-bucket or interval limiter that
supports fractional RPS (requests_per_second as float). Concretely, implement an
async TokenBucket or RateLimiter class that exposes an awaitable method like
wait_for_token() (instead of acquire()/release()) which blocks until a token is
available and refills tokens on a timed schedule (or enforces a per-request
interval = 1/requests_per_second); update all call sites that currently do
rate_limiter.acquire()/release() to await rate_limiter.wait_for_token() before
calling client.get so bursts are prevented and fractional RPS are honored.
Ensure constructor/creator of the limiter accepts float RPS and unit tests cover
low fractional rates (e.g., 0.5 RPS) and concurrent callers.
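One possible shape for such a limiter, as a self-contained sketch (class and method names follow the comment above, not the library's actual API):

```python
import asyncio
import time


class TokenBucket:
    """Async token-bucket limiter that honors fractional requests-per-second."""

    def __init__(self, requests_per_second: float, burst: int = 1):
        self.rate = float(requests_per_second)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def wait_for_token(self) -> None:
        """Block until a token is available; tokens refill with elapsed time."""
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                await asyncio.sleep((1.0 - self.tokens) / self.rate)
```

Call sites would then do `await rate_limiter.wait_for_token()` before `client.get` instead of `acquire()/release()`; holding the lock while sleeping also keeps concurrent waiters ordered.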
- Around line 83-94: The retry block currently logs "retrying" and awaits
asyncio.sleep even on the final attempt, causing an unnecessary delay and
misleading logs; update the block in async_crawler.py (inside the request/retry
loop handling response.status_code 429 or >=500) to check if attempt ==
max_retries and, in that case, set last_error and break/return immediately
without calling logger.warning or awaiting asyncio.sleep; otherwise compute wait
(using retry_after and 2 ** attempt), log via logger.warning and await
asyncio.sleep as before. Ensure you reference response.status_code, retry_after,
wait, logger.warning, last_error and asyncio.sleep when making the conditional
change.
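The conditional could be restructured roughly like this standalone sketch; `get` is an injected async callable standing in for `client.get`, and `base_delay` exists only to keep the example fast:

```python
import asyncio
import logging

logger = logging.getLogger("web_intelligence")


async def fetch_with_retries(get, url: str, max_retries: int = 3,
                             base_delay: float = 1.0):
    """Retry 429/5xx responses, but never log 'retrying' or sleep after
    the final attempt."""
    for attempt in range(max_retries + 1):
        response = await get(url)
        if response.status_code == 429 or response.status_code >= 500:
            if attempt == max_retries:
                break  # give up immediately: no misleading log, no wasted sleep
            wait = base_delay * (2 ** attempt)
            logger.warning("HTTP %s for %s, retrying in %.2fs",
                           response.status_code, url, wait)
            await asyncio.sleep(wait)
            continue
        return response
    return None  # caller records last_error and surfaces the failure
```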
In `@web_intelligence/cache.py`:
- Around line 68-73: The stats() method can raise FileNotFoundError due to a
race on self.index_file.stat(); update stats() to safely obtain index_size_bytes
by wrapping the stat/getsize call in a try/except (catch
OSError/FileNotFoundError) and default to 0 if it fails (instead of directly
calling self.index_file.stat() even when exists() returned True); keep the
returned keys ('total_cached', 'cache_dir') unchanged and reference self.index
(for total_cached) and self.cache_dir, self.index_file in your change so the fix
is localized to stats().
- Around line 126-153: The disk filenames are using a 16-character prefix of the
SHA256 hash while the in-memory keys use the full hash, which can cause
collisions; update the get and set methods to use the full hash returned by
_text_hash for cache_file names (or at minimum increase to a longer prefix like
32 chars) so disk filenames match the in-memory keys (refer to _text_hash, get,
set, cache_dir, cache, max_memory_items).
- Around line 35-41: The TTL check currently compares datetime.now() (naive) to
cached_at/fromisoformat (may be aware) causing TypeError; update the logic
around ttl_hours, cached_at and expiry to use timezone-aware UTC datetimes
consistently: parse cached_at with datetime.fromisoformat and if
cached_at.tzinfo is None set/assume UTC (or convert with
cached_at.replace(tzinfo=timezone.utc)), then compute expiry and compare against
datetime.now(timezone.utc); ensure any stored timestamps are treated as UTC so
comparisons between expiry and now are always timezone-aware.
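The aware-UTC comparison might look like this sketch (the helper name is hypothetical; the real check lives inline in cache.py):

```python
from datetime import datetime, timedelta, timezone


def is_expired(cached_at_iso: str, ttl_hours: float) -> bool:
    """TTL check done entirely in timezone-aware UTC, so naive and aware
    stored timestamps can never be compared against each other."""
    cached_at = datetime.fromisoformat(cached_at_iso)
    if cached_at.tzinfo is None:
        cached_at = cached_at.replace(tzinfo=timezone.utc)  # assume stored as UTC
    expiry = cached_at + timedelta(hours=ttl_hours)
    return datetime.now(timezone.utc) >= expiry
```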
In `@web_intelligence/chunker.py`:
- Around line 14-17: The current split_into_sentences uses a naive regex
(sentence_endings) that splits on any period+space and will break abbreviations
and decimals; replace this with a robust sentence tokenizer such as NLTK's
sent_tokenize (ensure punkt is downloaded) or spaCy's nlp(...).sents inside
split_into_sentences, or if you must keep the lightweight approach, update the
docstring of split_into_sentences to clearly state this limitation and add a
TODO to migrate to a proper tokenizer; refer to the split_into_sentences
function and the sentence_endings variable when making the change.
- Around line 77-78: Remove the dead helper function tokenizer from chunker.py:
delete the def tokenizer(text: str) -> List[str]: return text.split()
declaration and any direct references to it; also ensure no imports or __all__
exports refer to tokenizer (check __init__.py and top-of-file imports) so the
codebase no longer contains or exposes this unused function.
- Line 1: Replace the deprecated "from typing import List" import with the
built-in "list" type and update all type annotations that reference the imported
List to use the builtin list instead; search for occurrences of "List[" in this
file (e.g., in function signatures, return types, and variable annotations) and
change them to "list[...]" (and remove the import line "from typing import
List").
In `@web_intelligence/config.py`:
- Around line 27-28: The comparison currently uses "cast == bool" which should
use identity testing; update the type check in the branch that handles booleans
to use "cast is bool" instead of "==" so the condition reliably detects the bool
type for the conversion of val (the code path referencing cast and val).
- Around line 74-78: The ServerConfig defaults currently bind to all interfaces;
change the default host in the ServerConfig dataclass (the host field using
_env("WI_SERVER_HOST", "0.0.0.0")) to "127.0.0.1" so the server binds to
localhost by default while still allowing overrides via WI_SERVER_HOST; update
the host default value referenced in the ServerConfig class to
_env("WI_SERVER_HOST", "127.0.0.1").
In `@web_intelligence/context_formatter.py`:
- Around line 41-43: The list-comprehension building source_list can raise
KeyError because it accesses s['title'] and s['url'] directly; update the
comprehension in the source_list construction (referencing self.sources and
source_list) to defensively access values (e.g., s.get('title', '<untitled>')
and s.get('url', '<no-url>')) or filter out/skip malformed entries (e.g.,
include only if 'title' and 'url' in s) so missing keys don't raise exceptions
and the formatter yields safe fallback text or omits bad sources.
- Around line 155-162: The XML-like block built in the variable block uses
unescaped values (url, title, text) which can break the markup; update the code
that constructs block in context_formatter.py to XML-escape/HTML-escape those
interpolated fields (at least title and text, and url if it can contain special
chars) before formatting (e.g., use html.escape or xml.sax.saxutils.escape) and
add the appropriate import at the top; ensure the formatted relevance/score
remains numeric and not escaped.
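With `xml.sax.saxutils`, the block construction could escape its inputs along these lines (a sketch; the real field layout in context_formatter.py may differ):

```python
from xml.sax.saxutils import escape, quoteattr


def format_source_block(url: str, title: str, text: str, score: float) -> str:
    """Build the XML-like block with escaped interpolations; the numeric
    relevance score is formatted directly and never escaped."""
    safe_url = quoteattr(url)  # adds surrounding quotes and escapes &, <, "
    return (
        f"<source url={safe_url} relevance=\"{score:.3f}\">\n"
        f"  <title>{escape(title)}</title>\n"
        f"  <text>{escape(text)}</text>\n"
        f"</source>"
    )
```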
In `@web_intelligence/crawler.py`:
- Around line 6-14: Fix the CrawlObject dataclass annotations: change the type
of crawled_at from datetime to Optional[datetime] and keep its default = None,
and normalize spacing around all type annotations (e.g., use url: str, html:
str, status_code: int, success: bool, error: Optional[str] = None, crawled_at:
Optional[datetime] = None) so annotations follow the standard "name: type"
style; update imports if needed to ensure Optional and datetime are available
and used consistently in the CrawlObject definition.
- Line 31: The success check currently treats only HTTP 200 as success; update
the assignment that sets success (where response.status_code is used) to accept
the full 2xx range (e.g. 200 <= response.status_code < 300) or use response.ok
so other successful codes like 201/204 are allowed; locate the assignment to
success in crawler.py and replace the equality check against 200 with a
2xx-range check or response.ok.
- Around line 16-43: The crawl_url function currently catches a bare Exception
which will swallow critical signals; change the signature to include a return
type (e.g., def crawl_url(url: str, timeout: int = 10) -> CrawlObject:) and
replace the broad except with targeted httpx exceptions — for example catch
httpx.RequestError and httpx.TimeoutException (or httpx.ReadTimeout/WriteTimeout
if used) around the httpx.get call and return the error-wrapped CrawlObject
there; do not catch BaseException/Exception broadly (allow
KeyboardInterrupt/SystemExit to propagate) and ensure httpx exceptions are
imported so the specific except block references httpx.RequestError,
httpx.TimeoutException and handles building the same CrawlObject(error=str(e),
success=False, status_code=0, ...).
In `@web_intelligence/embedders/__init__.py`:
- Around line 9-20: The auto_detect_embedder function currently forwards
**kwargs directly into SentenceTransformerEmbedder and FastEmbedEmbedder which
can raise TypeError on unexpected params and short-circuit fallbacks; update
auto_detect_embedder to either (a) try/except TypeError around each constructor
call and continue to the next backend on TypeError, or (b) whitelist/filter
kwargs to the specific initializer signature of SentenceTransformerEmbedder
(referencing SentenceTransformerEmbedder.__init__) before calling it, and
similarly for FastEmbedEmbedder, so unexpected keys like base_url or api_key
won't break detection and will allow the FastEmbedEmbedder fallback to run.
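One way to whitelist kwargs against an initializer signature; `DemoEmbedder` is an illustrative stand-in, not the real backend class:

```python
import inspect


def filtered_kwargs(cls, kwargs: dict) -> dict:
    """Keep only the kwargs the class constructor accepts, so stray keys
    such as base_url or api_key cannot raise TypeError and abort fallback."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # constructor already swallows **kwargs
    return {k: v for k, v in kwargs.items() if k in params}


class DemoEmbedder:  # stand-in for SentenceTransformerEmbedder / FastEmbedEmbedder
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
```

auto_detect_embedder would then call each backend as `cls(**filtered_kwargs(cls, kwargs))`, optionally with a try/except TypeError around the call as a second line of defense.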
In `@web_intelligence/embedders/fastembed_embedder.py`:
- Around line 95-103: In the loop inside fastembed_embedder.py where you iterate
over texts_to_embed and vectors (for idx, (text, vector) in
enumerate(zip(texts_to_embed, vectors))): change the zip call to use strict=True
so mismatched lengths raise immediately (i.e., enumerate(zip(texts_to_embed,
vectors, strict=True))). Update that single line in the method that performs
embedding and maintains the cache (the block that sets
results[indices_to_embed[idx]] and calls self.cache.set) to use zip(...,
strict=True); ensure the project runs on Python 3.10+ if not already.
- Around line 17-23: The __init__ of FastembedEmbedder declares **kwargs but
never uses it; either remove it or forward it to the underlying TextEmbedding
constructor—prefer forwarding for compatibility. Modify
FastembedEmbedder.__init__ so any **kwargs passed are forwarded into the
TextEmbedding(...) call (i.e., include **kwargs when instantiating
TextEmbedding), ensuring the **kwargs parameter is actually consumed;
alternatively, if you want to drop it, remove **kwargs from the __init__
signature and any related unused references and update callers accordingly.
- Around line 65-103: The show_progress parameter on embed_batch is unused;
remove it from the method signature and any internal references (function
embed_batch in fastembed_embedder.py) so callers don't pass a misleading option,
and update any call sites to stop supplying show_progress; alternatively, if you
prefer to keep the API for compatibility, add a clear inline comment above
embed_batch and a docstring note stating that fastembed's _model.embed does not
support progress display and show_progress is a no-op, and ensure no attempt is
made to pass it into _model.embed.
In `@web_intelligence/embedders/openai_embedder.py`:
- Around line 45-46: The cache is currently shared across models and backends;
change the EmbeddingCache instantiation in OpenAIEmbedder (the self.cache
assignment) to namespace entries by backend and model so embeddings are not
mixed across different model/backends — e.g. derive a per-model/backend cache
subdir or pass a namespace/model identifier into EmbeddingCache (or update
EmbeddingCache to accept a backend/model key) using the embedder's backend
identifier and model name (from the same class where self.use_cache is set) so
cached keys include backend/model scope and prevent wrong-dimensionality hits.
In `@web_intelligence/fast_embedder.py`:
- Around line 39-40: The call torch.set_num_threads(torch.get_num_threads()) is
a no-op; either remove it or set a meaningful thread count. In fast_embedder.py,
update the device CPU branch (where device == "cpu") to either delete the
torch.set_num_threads line or set a concrete value (e.g., derive from
os.cpu_count() or an injected config/ENV value) using
torch.set_num_threads(<desired_count>) so CPU threading is actually configured;
ensure you import any module you use (e.g., os) and document/validate the chosen
thread count.
- Around line 136-174: The benchmark_embedder function creates on-disk cached
embeddings in ./data/cache/embeddings during the cache test and never removes
them; update benchmark_embedder to clean up that cache after the test by
removing the cache directory (e.g., using shutil.rmtree) or, better, by creating
and using a temporary cache directory for the FastEmbedder instance used in the
cache test and deleting it afterwards; reference the benchmark_embedder function
and the FastEmbedder instance named embedder_cached to locate where to
create/remove the cache and wrap the removal in a try/except to avoid crashing
if deletion fails.
In `@web_intelligence/optimized_pipeline.py`:
- Around line 533-546: The retrieve_async method currently just calls the
synchronous retrieve method, so it's not actually asynchronous; either document
this behavior or make it truly async: update retrieve_async (method
retrieve_async) to either add a clear docstring stating it delegates to
synchronous retrieve and does not perform I/O concurrently, or change its
implementation to run the blocking retrieve on a threadpool (e.g., use
asyncio.get_running_loop().run_in_executor(None, lambda: self.retrieve(...)) and
await the result) or refactor the underlying retrieval into an async helper
(e.g., async _retrieve_impl) and await that; ensure signatures and return type
(RetrievedContext) remain correct and preserve parameters (query, limit,
output_format, max_context_words, where_filter, min_score).
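The threadpool variant could be sketched as follows; `PipelineSketch` is a stand-in, not the real FastPipeline:

```python
import asyncio
from functools import partial


class PipelineSketch:
    """Stand-in showing how retrieve_async can offload the blocking retrieve."""

    def retrieve(self, query: str, limit: int = 5) -> str:
        # represents the real synchronous, potentially blocking retrieval
        return f"results for {query!r} (limit={limit})"

    async def retrieve_async(self, query: str, limit: int = 5) -> str:
        loop = asyncio.get_running_loop()
        # run the blocking call on the default threadpool so the event loop stays free
        return await loop.run_in_executor(
            None, partial(self.retrieve, query, limit=limit)
        )
```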
- Around line 489-496: In the except ImportError block that raises
SearchProviderError, chain the original ImportError when re-raising so debugging
preserves the original traceback: capture the caught exception (e.g., except
ImportError as err) and raise SearchProviderError(...) from err (or use from
None if you intentionally want to suppress the original), updating the raise in
the except block that references SearchProviderError.
- Line 279: The "indexed_at" timestamp is currently created with
datetime.now().isoformat(), which is naive; update the assignment that sets
"indexed_at" to produce a timezone-aware ISO timestamp (e.g.,
datetime.now(timezone.utc).isoformat() or
datetime.utcnow().replace(tzinfo=timezone.utc).isoformat()) and ensure timezone
is imported from datetime so the stored value includes UTC offset.
In `@web_intelligence/search_providers/base.py`:
- Line 4: The import list in this module includes an unused Optional symbol;
update the import statement that currently imports List, Optional, Protocol,
runtime_checkable to remove Optional so only used names remain (e.g., keep List,
Protocol, runtime_checkable) to eliminate the unused import warning in
web_intelligence.search_providers.base.
In `@web_intelligence/search_providers/duckduckgo_provider.py`:
- Around line 48-55: The loop building SearchResult objects currently uses
item.get("href", "") which yields empty URLs; update the iteration in
duckduckgo_provider.py (the block that iterates over raw and appends to results)
to validate each item before creating a SearchResult: check that href =
item.get("href") is truthy (non-empty) and optionally that title or body exist,
and skip the item if href is missing or empty so you do not append
SearchResult(url="") entries; keep using SearchResult(...) when validation
passes.
In `@web_intelligence/server.py`:
- Around line 158-159: The except block that currently does "except Exception as
e: raise HTTPException(status_code=500, detail=str(e))" loses the original
traceback; update that raise to use exception chaining by raising the
HTTPException from the caught exception (i.e., "raise HTTPException(... ) from
e") in the same except block so the original exception context is preserved for
debugging; locate the except in server.py where the HTTPException is raised and
apply this change to the handler/function containing that except.
- Around line 187-191: The current delete_by_url endpoint uses a DELETE method
with a request body (DeleteURLRequest), which may break clients; modify the
endpoint to avoid a DELETE body by either (A) changing it to accept the URL as a
query parameter (keep route "/documents/by-url", change signature to accept url:
str and call pipeline.delete_url(url)) or (B) switch the route to POST (e.g.,
"/documents/delete-by-url") and keep accepting DeleteURLRequest in the body;
update the `@app` decorator and the function signature (delete_by_url and any
client calls) accordingly and ensure the returned JSON stays {"deleted_chunks":
count, "url": url}.
- Around line 207-215: Update the start_server signature to use explicit
Optional type hints: import Optional from typing and change the parameters in
start_server (host, port, reload) to Optional[str], Optional[int], and
Optional[bool] respectively, keeping the same None defaults and existing logic
that falls back to default_config().server values; reference the start_server
function and its parameters when making this change.
In `@web_intelligence/vector_store.py`:
- Around line 29-37: The search method currently defines the parameter named
filter which causes TypeError when callers pass where_filter; rename the
parameter and its internal uses from filter to where_filter in the search method
signature and body (function search and its kwargs population: replace "filter"
parameter and kwargs["where"] assignment to use where_filter) so callers like
optimized_pipeline that call search(..., where_filter=...) work correctly;
ensure any internal references to the old name are updated to where_filter to
preserve behavior.
- Around line 22-27: The VectorStore.add wrapper currently only accepts the old
three-arg signature and must be made compatible with the new contract used by
optimized_pipeline (which calls vector_store.add(..., documents=chunk_texts)).
Update the add method (VectorStore.add) to accept a documents:
Optional[List[str]] parameter (or accept **kwargs and pop 'documents') while
preserving support for the existing vectors, metadatas, ids parameters, and
forward documents through to self.collection.add as documents=documents; ensure
type hints reflect Optional[List[str]] and that passing documents via keyword
works without raising TypeError so optimized_pipeline.py can pass chunk_texts
successfully.
In `@web_intelligence/vector_stores/base.py`:
- Line 1: The import and type annotations use deprecated typing.List and
typing.Dict; update the import line to remove List and Dict and change all
annotations in this module that reference List[...] and Dict[...] to use the
built-in generics list[...] and dict[...] (keep Optional, Protocol,
runtime_checkable from typing as-is); specifically update the top-level import
from "from typing import List, Dict, Optional, Protocol, runtime_checkable" and
any usages of List and Dict in functions/classes in this file to use list and
dict generics instead.
In `@web_intelligence/vector_stores/chroma_store.py`:
- Around line 106-136: get_document currently indexes results["documents"][i]
directly which can raise or return wrong values when documents is None or has a
different shape; update get_document (the collection.get handling logic) to
mirror the safe access used in search: extract docs = results.get("documents")
and if docs is truthy determine whether it's nested (docs[0] is list) or flat,
pick the correct element for each id and also guard against index errors (fall
back to results["metadatas"][i].get("text", "") when docs is None or
out-of-range); ensure you still build chunks from id, text, chunk_index and
word_count from results["metadatas"] and sort by chunk_index before returning
the assembled document dict.
- Around line 61-82: The current score calculation in the results formatting
loop (where you compute score = 1 - distance and compare to min_score) is wrong
for non-cosine metrics; update the code that builds formatted (and uses results
and min_score) to first read the collection's configured distance metric from
the Chroma collection object, then branch the distance→similarity mapping: for
"cosine" keep score = 1 - distance, for "ip" use score = -distance (negate), and
for "l2" use a bounded transform such as score = 1 / (1 + distance) (or a
sigmoid) before applying the min_score filter and appending to formatted; locate
this logic in the function that iterates over results["ids"][0] and replace the
single formula with the metric-aware mapping.
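The metric-aware mapping, extracted as a helper for clarity (a sketch; how the metric is read from the collection's configuration is left to the real code):

```python
def distance_to_score(distance: float, metric: str) -> float:
    """Map a Chroma distance to a similarity score per configured metric."""
    if metric == "cosine":
        return 1.0 - distance          # cosine distance lies in [0, 2]
    if metric == "ip":
        return -distance               # inner-product distances are negated similarities
    if metric == "l2":
        return 1.0 / (1.0 + distance)  # bounded transform of squared L2
    raise ValueError(f"unknown distance metric: {metric!r}")
```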
In `@web_intelligence/vector_stores/numpy_store.py`:
- Around line 54-71: Validate batch sizes and vector dimensions at the start of
add() before modifying state: ensure len(vectors) == len(ids) == len(metadatas),
and if documents is provided ensure len(documents) == len(ids); convert vectors
to new_vecs (np.array) and if self._vectors is not None check new_vecs.ndim == 2
and new_vecs.shape[1] == self._vectors.shape[1] (raise ValueError with a clear
message on mismatch) so you never call vstack() or extend lists when the batch
is malformed; only after all checks pass perform vstack/assignment to
self._vectors, extend self._ids, self._metadatas, self._documents, and call
_save().
- Around line 46-48: The _load method currently uses pickle.load on persist_path
which is unsafe; replace pickle-based deserialization with a safe format (e.g.,
JSON/MsgPack with explicit schema validation) or, if keeping pickle, add
cryptographic integrity and authenticity checks: compute/verify an HMAC or
digital signature for the file before calling pickle.load, fail loudly on
mismatch, and document the key storage. Update any companion save logic (e.g.,
where the store is written) to emit the safe format or to sign the payload so
_load can verify before deserializing.
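The HMAC option described above can be sketched as follows. This is a minimal illustration, not the store's actual persistence code: the function names (`save_signed`, `load_signed`) and the `STORE_HMAC_KEY` environment variable are hypothetical, and a real deployment must manage the key securely rather than fall back to a default.

```python
import hashlib
import hmac
import os
import pickle

# Hypothetical key source; real deployments need proper key management.
SECRET_KEY = os.environ.get("STORE_HMAC_KEY", "dev-only-key").encode()

def save_signed(path: str, obj) -> None:
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    with open(path, "wb") as f:
        f.write(tag + payload)  # 32-byte SHA-256 tag, then the pickle payload

def load_signed(path: str):
    with open(path, "rb") as f:
        blob = f.read()
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError(f"HMAC mismatch for {path}; refusing to unpickle")
    return pickle.loads(payload)  # only reached after the tag verifies
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels; a tampered payload fails loudly before `pickle.loads` ever runs.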
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e5ecf859-3050-475c-9ed0-02c87654db50
⛔ Files ignored due to path filters (36)
All 36 ignored files are compiled bytecode caches (`__pycache__/*.pyc`) under `tests/`, `web_intelligence/`, `web_intelligence/embedders/`, `web_intelligence/search_providers/`, and `web_intelligence/vector_stores/`; each is excluded by the `!**/*.pyc` path filter.
📒 Files selected for processing (37)
`.python-version`, `MANIFEST.in`, `README.md`, `examples/quickstart.py`, `main.py`, `pyproject.toml`, `pytest.ini`, `requirements.txt`, `tests/__init__.py`, `tests/test_library.py`, `web_intelligence/__init__.py`, `web_intelligence/_logging.py`, `web_intelligence/async_crawler.py`, `web_intelligence/cache.py`, `web_intelligence/chunker.py`, `web_intelligence/config.py`, `web_intelligence/context_formatter.py`, `web_intelligence/crawler.py`, `web_intelligence/embedders/__init__.py`, `web_intelligence/embedders/base.py`, `web_intelligence/embedders/fastembed_embedder.py`, `web_intelligence/embedders/ollama_embedder.py`, `web_intelligence/embedders/openai_embedder.py`, `web_intelligence/embedders/sentence_transformer.py`, `web_intelligence/exceptions.py`, `web_intelligence/extractor.py`, `web_intelligence/fast_embedder.py`, `web_intelligence/optimized_pipeline.py`, `web_intelligence/search_providers/__init__.py`, `web_intelligence/search_providers/base.py`, `web_intelligence/search_providers/duckduckgo_provider.py`, `web_intelligence/server.py`, `web_intelligence/vector_store.py`, `web_intelligence/vector_stores/__init__.py`, `web_intelligence/vector_stores/base.py`, `web_intelligence/vector_stores/chroma_store.py`, `web_intelligence/vector_stores/numpy_store.py`
from langchain_groq import ChatGroq


def main():
    print("Setting up Web Intelligence pipeline...")
    pipeline = FastPipeline()

    question = "what is lcm of 9 and 17"

    print(f"\nSearching the web for: '{question}'")
    print("(This searches DuckDuckGo → crawls pages → indexes → retrieves)\n")

    ctx = pipeline.search_web(question, max_results=3, limit=5)

    print("=" * 60)
    print("WEB INTELLIGENCE RESULTS")
    print("=" * 60)
    print(f"Sources found: {len(ctx.sources)}")
    for s in ctx.sources:
        print(f"  - {s['title']}: {s['url']}")
    print(f"Context chunks: {ctx.total_chunks}")
    print(f"Context words: {ctx.total_words}")
    print(f"\nContext preview (first 500 chars):")
    print(ctx.context_text[:500])
    print("...\n")

    print("=" * 60)
    print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
    print("=" * 60)

    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

    messages = ctx.as_messages()
    response = llm.invoke(messages)

    print(f"\nLLM ANSWER:\n")
    print(response.content)

    print("\n" + "=" * 60)
    print("BONUS: Index a specific page and ask a question")
    print("=" * 60)

    pipeline.index_url("https://en.wikipedia.org/wiki/Python_(programming_language)")
    ctx2 = pipeline.retrieve("Who created Python and when?")

    response2 = llm.invoke(ctx2.as_messages())
    print(f"\nQuestion: Who created Python and when?")
    print(f"Answer: {response2.content}")
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the repository structure to understand the context
git ls-files | head -30
Repository: AOSSIE-Org/WebSift
Length of output: 808
🏁 Script executed:
# Check if the quickstart file exists and its current state
fd quickstart.py . | head -5
Repository: AOSSIE-Org/WebSift
Length of output: 86
🏁 Script executed:
# List key configuration files
fd -e toml -e txt -e md . | grep -E "(pyproject|setup|requirements|README)" | head -20
Repository: AOSSIE-Org/WebSift
Length of output: 103
🏁 Script executed:
# Read the quickstart example to verify current state
cat -n examples/quickstart.py
Repository: AOSSIE-Org/WebSift
Length of output: 2061
🏁 Script executed:
# Read pyproject.toml to check dependency declarations
cat pyproject.toml
Repository: AOSSIE-Org/WebSift
Length of output: 2997
🏁 Script executed:
# Read README to check quickstart prerequisites and documentation
cat README.md
Repository: AOSSIE-Org/WebSift
Length of output: 6964
Gate the Groq LLM section behind an optional check.
The quickstart imports langchain_groq unconditionally at line 6, but langchain_groq is not a declared dependency in pyproject.toml. This will cause an ImportError on a fresh install. Additionally, the README states "No API key needed for the core library," yet the example requires GROQ_API_KEY to run, preventing users from trying the core search/index workflow. Conditionally import ChatGroq and check for both the package availability and GROQ_API_KEY before running the LLM sections. The core pipeline (search, index, retrieve) should work standalone.
Suggested fix
+import os
from dotenv import load_dotenv
load_dotenv()
from web_intelligence import FastPipeline
-from langchain_groq import ChatGroq
+try:
+ from langchain_groq import ChatGroq
+except ImportError:
+ ChatGroq = None
@@
- print("=" * 60)
- print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
- print("=" * 60)
-
- llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
-
- messages = ctx.as_messages()
- response = llm.invoke(messages)
-
- print(f"\nLLM ANSWER:\n")
- print(response.content)
+ llm = None
+ if ChatGroq is not None and os.getenv("GROQ_API_KEY"):
+ print("=" * 60)
+ print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
+ print("=" * 60)
+ llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
+ response = llm.invoke(ctx.as_messages())
+ print("\nLLM ANSWER:\n")
+ print(response.content)
+ else:
+ print("Skipping Groq demo. Install `langchain-groq` and set `GROQ_API_KEY` to enable it.")
@@
- response2 = llm.invoke(ctx2.as_messages())
- print(f"\nQuestion: Who created Python and when?")
- print(f"Answer: {response2.content}")
+ if llm is not None:
+ response2 = llm.invoke(ctx2.as_messages())
+ print("\nQuestion: Who created Python and when?")
+ print(f"Answer: {response2.content}")
🧰 Tools
🪛 Ruff (0.15.4)
[error] 28-28: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 41-41: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 52-52: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/quickstart.py` around lines 6 - 53, The Groq LLM usage is unguarded
and will cause ImportError or require GROQ_API_KEY; wrap the import and
LLM-related section so the core FastPipeline (FastPipeline, pipeline.search_web,
pipeline.index_url, pipeline.retrieve) runs standalone: attempt to import
ChatGroq in a try/except and only execute the LLM block (creating llm via
ChatGroq, calling ctx.as_messages and llm.invoke, and the second invoke on
ctx2.as_messages) if the import succeeded and os.environ.get("GROQ_API_KEY") is
present; if either check fails, print a short notice that Groq is unavailable
and skip the LLM/example invocation while keeping the rest of the quickstart
intact.
for i, a in enumerate(args):
    if a in ("--limit", "-n") and i + 1 < len(args):
        limit = int(args[i + 1])
        query = " ".join(a for a in args[1:] if a not in ("--limit", "-n", args[i + 1]))
The --limit parser corrupts queries containing the same token value.
Line 70 rebuilds `query` by filtering tokens by value, not by position. `python main.py search version 2 --limit 2` becomes `version`, and any literal `-n`/`--limit` token inside the query is also stripped.
Suggested fix
limit = 5
- for i, a in enumerate(args):
- if a in ("--limit", "-n") and i + 1 < len(args):
- limit = int(args[i + 1])
- query = " ".join(a for a in args[1:] if a not in ("--limit", "-n", args[i + 1]))
+ query_parts = []
+ i = 1
+ while i < len(args):
+ if args[i] in ("--limit", "-n"):
+ if i + 1 >= len(args):
+ print("Usage: python main.py search <query> [--limit N]")
+ return
+ limit = int(args[i + 1])
+ i += 2
+ continue
+ query_parts.append(args[i])
+ i += 1
+ query = " ".join(query_parts)
      results = pipeline.search(query, limit=limit)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 67 - 70, The loop that parses args uses token value
filtering which removes query tokens equal to "--limit"/"-n" or the numeric
value; instead, when you detect the flag in the for i, a in enumerate(args)
loop, record the exact indices to remove (i and i+1), set limit =
int(args[i+1]), and then rebuild query from args[1:] by skipping tokens at those
recorded indices (or build a new list via enumerate and skip matching original
indices) so you exclude the flag and its value by position rather than by value;
update the variables args, limit and query accordingly (referencing args, limit,
query, and the for i, a in enumerate(args) loop).
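An alternative to tracking indices by hand is to let `argparse` do the positional parsing; it consumes `--limit`/`-n` together with its value in place, so query tokens that merely look like the limit value survive. A sketch (the helper name `parse_search_args` is hypothetical, not part of `main.py`):

```python
import argparse

def parse_search_args(argv: list[str]) -> tuple[str, int]:
    # argparse consumes --limit/-n and its value positionally, so query
    # tokens that happen to equal the flag's value are left intact.
    parser = argparse.ArgumentParser(prog="main.py search", add_help=False)
    parser.add_argument("query", nargs="+", help="search query words")
    parser.add_argument("--limit", "-n", type=int, default=5)
    ns = parser.parse_args(argv)
    return " ".join(ns.query), ns.limit
```

With this, `python main.py search version 2 --limit 2` yields the query `"version 2"` with `limit=2`, and `argparse` reports a usage error itself when the flag's value is missing or non-numeric.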
elif command == "delete":
    if len(args) < 2:
        print("Usage: python main.py delete <doc_id or url>")
        return
    target = args[1]
    if target.startswith("http"):
        count = pipeline.delete_url(target)
        print(f"Deleted {count} chunks from {target}")
    else:
        ok = pipeline.delete_document(target)
        print(f"Deleted: {ok}")
Add confirmation before destructive deletes.
Unlike `clear`, this command deletes a document or every chunk for a URL immediately. One mistyped identifier becomes irreversible data loss from the CLI.
Suggested fix
elif command == "delete":
if len(args) < 2:
print("Usage: python main.py delete <doc_id or url>")
return
target = args[1]
+ confirm = input(f"Delete indexed data for '{target}'? Type 'yes' to confirm: ")
+ if confirm.strip().lower() != "yes":
+ print("Aborted.")
+ return
if target.startswith("http"):
count = pipeline.delete_url(target)
print(f"Deleted {count} chunks from {target}")
else:
ok = pipeline.delete_document(target)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif command == "delete":
    if len(args) < 2:
        print("Usage: python main.py delete <doc_id or url>")
        return
    target = args[1]
    confirm = input(f"Delete indexed data for '{target}'? Type 'yes' to confirm: ")
    if confirm.strip().lower() != "yes":
        print("Aborted.")
        return
    if target.startswith("http"):
        count = pipeline.delete_url(target)
        print(f"Deleted {count} chunks from {target}")
    else:
        ok = pipeline.delete_document(target)
        print(f"Deleted: {ok}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 101 - 111, The delete branch for command == "delete"
(handling target, pipeline.delete_url and pipeline.delete_document) performs
irreversible deletes without confirmation; modify it to prompt the user to
confirm the destructive action (e.g., "Are you sure you want to delete <target>?
Type 'yes' to confirm:") and only call pipeline.delete_url or
pipeline.delete_document if the user types the explicit confirmation, or allow
an explicit --force flag to skip the prompt; ensure the confirmation text
includes the target so a mistype is obvious and abort (return) if the user does
not confirm.
include README.md
include LICENSE
include pyproject.toml
recursive-include src *.py
Incorrect path will exclude all Python source files from sdist.
The `recursive-include src *.py` directive references a `src/` directory, but per `pyproject.toml` (line 100: `packages = ["web_intelligence"]`), the package is located at `web_intelligence/`, not `src/`. This will cause source distributions to be missing all Python code.
🐛 Proposed fix
include README.md
include LICENSE
include pyproject.toml
-recursive-include src *.py
+recursive-include web_intelligence *.py📝 Committable suggestion
include README.md
include LICENSE
include pyproject.toml
recursive-include web_intelligence *.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@MANIFEST.in` at line 4, Update the MANIFEST.in line that currently reads
"recursive-include src *.py" to point at the actual package directory used in
pyproject.toml (packages = ["web_intelligence"]); replace "src" with
"web_intelligence" so Python sources are included in sdist (i.e., change the
directive to "recursive-include web_intelligence *.py" or equivalent to include
the package's source files).
authors = [
    {name = "Your Name", email = "your.email@example.com"}
]
Update placeholder author information before release.
The author metadata contains placeholder values that should be replaced with actual information.
🔧 Update with actual values
authors = [
- {name = "Your Name", email = "your.email@example.com"}
+ {name = "Actual Name", email = "actual.email@example.com"}
]📝 Committable suggestion
authors = [
    {name = "Actual Name", email = "actual.email@example.com"}
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 8 - 9, Replace the placeholder author metadata
in pyproject.toml by updating the authors table (the authors = [...] entry) to
include the real maintainer details: set the actual name and email for the
package owner/maintainer in the {name = "...", email = "..."} record (or add
multiple author records if needed) so the package metadata no longer contains
"Your Name" or "your.email@example.com".
def search(self, query_vector: List[float], limit: int = 5,
           filter: Optional[Dict] = None,
           min_score: float = 0.0) -> List[Dict]:
    kwargs = {
        "query_embeddings": [query_vector],
        "n_results": limit,
    }
    if filter:
        kwargs["where"] = filter
Rename filter to where_filter to match current callers.
`web_intelligence/optimized_pipeline.py:498-509` now invokes `self.vector_store.search(..., where_filter=where_filter)`. Keeping the deprecated wrapper on `filter=` makes it incompatible with the rest of the package and raises `TypeError: unexpected keyword argument 'where_filter'` as soon as a filtered search is used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_store.py` around lines 29 - 37, The search method
currently defines the parameter named filter which causes TypeError when callers
pass where_filter; rename the parameter and its internal uses from filter to
where_filter in the search method signature and body (function search and its
kwargs population: replace "filter" parameter and kwargs["where"] assignment to
use where_filter) so callers like optimized_pipeline that call search(...,
where_filter=...) work correctly; ensure any internal references to the old name
are updated to where_filter to preserve behavior.
@@ -0,0 +1,32 @@
from typing import List, Dict, Optional, Protocol, runtime_checkable
🧹 Nitpick | 🔵 Trivial
Use built-in generics instead of deprecated typing.List and typing.Dict.
Since the project requires Python ≥3.10 (per `pyproject.toml`), use the built-in `list` and `dict` directly as generic types instead of the deprecated `typing.List` and `typing.Dict`.
♻️ Proposed fix
-from typing import List, Dict, Optional, Protocol, runtime_checkable
+from typing import Optional, Protocol, runtime_checkable
`@runtime_checkable`
class BaseVectorStore(Protocol):
def add(
self,
- vectors: List[List[float]],
- metadatas: List[Dict],
- ids: List[str],
- documents: Optional[List[str]] = None,
+ vectors: list[list[float]],
+ metadatas: list[dict],
+ ids: list[str],
+ documents: Optional[list[str]] = None,
) -> None: ...
def search(
self,
- query_vector: List[float],
+ query_vector: list[float],
limit: int = 5,
- where_filter: Optional[Dict] = None,
+ where_filter: Optional[dict] = None,
min_score: float = 0.0,
- ) -> List[Dict]: ...
+ ) -> list[dict]: ...
- def list_documents(self) -> List[Dict]: ...
+ def list_documents(self) -> list[dict]: ...
- def get_document(self, doc_id: str) -> Optional[Dict]: ...
+ def get_document(self, doc_id: str) -> Optional[dict]: ...
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 1-1: typing.List is deprecated, use list instead
(UP035)
[warning] 1-1: typing.Dict is deprecated, use dict instead
(UP035)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/base.py` at line 1, The import and type
annotations use deprecated typing.List and typing.Dict; update the import line
to remove List and Dict and change all annotations in this module that reference
List[...] and Dict[...] to use the built-in generics list[...] and dict[...]
(keep Optional, Protocol, runtime_checkable from typing as-is); specifically
update the top-level import from "from typing import List, Dict, Optional,
Protocol, runtime_checkable" and any usages of List and Dict in
functions/classes in this file to use list and dict generics instead.
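As a sanity check that the built-in generics work as expected on Python ≥3.10, here is a minimal sketch of such a Protocol with a structural match (the class names `VectorStoreProtocol` and `DummyStore` are illustrative only, not identifiers from this package). Note that `runtime_checkable` `isinstance` checks verify only method presence, not signatures:

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class VectorStoreProtocol(Protocol):
    """Structural interface written with built-in generics (Python >= 3.10)."""

    def add(
        self,
        vectors: list[list[float]],
        metadatas: list[dict],
        ids: list[str],
        documents: Optional[list[str]] = None,
    ) -> None: ...

    def search(
        self,
        query_vector: list[float],
        limit: int = 5,
        where_filter: Optional[dict] = None,
        min_score: float = 0.0,
    ) -> list[dict]: ...


class DummyStore:
    """No inheritance required: matching method names satisfy the Protocol."""

    def add(self, vectors, metadatas, ids, documents=None) -> None:
        pass

    def search(self, query_vector, limit=5, where_filter=None, min_score=0.0) -> list:
        return []
```

`isinstance(DummyStore(), VectorStoreProtocol)` holds because the class structurally matches; `object()` does not.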
formatted = []
for i in range(len(results["ids"][0])):
    score = 1 - results["distances"][0][i]
    if score < min_score:
        continue

    meta = results["metadatas"][0][i]
    text = ""
    if results.get("documents") and results["documents"][0]:
        text = results["documents"][0][i] or ""
    if not text:
        text = meta.get("text", "")

    formatted.append({
        "id": results["ids"][0][i],
        "text": text,
        "source": meta.get("url", ""),
        "score": score,
        "metadata": meta,
    })

return formatted
🧩 Analysis chain
🏁 Script executed:
# First, let's understand the context of this file
cd web_intelligence/vector_stores
cat -n chroma_store.py | head -100
Repository: AOSSIE-Org/WebSift
Length of output: 3861
🏁 Script executed:
# Search for ChromaDB initialization and collection configuration
rg "collection\|Collection" web_intelligence/vector_stores/chroma_store.py -B 2 -A 2
Repository: AOSSIE-Org/WebSift
Length of output: 44
🏁 Script executed:
# Look for any distance metric configuration in the codebase
rg "metric\|distance" web_intelligence/ -i --type py | grep -i "chroma\|metric"
Repository: AOSSIE-Org/WebSift
Length of output: 44
🏁 Script executed:
# Check how the ChromaDB client/collection is created
rg "chroma\|query" web_intelligence/vector_stores/chroma_store.py -B 3 -A 3
Repository: AOSSIE-Org/WebSift
Length of output: 44
🌐 Web query:
ChromaDB distance metrics L2 cosine IP how query returns distances
💡 Result:
In ChromaDB, collection.query(..., include=["distances"]) returns a distance value per returned neighbor (per query): results["distances"][query_idx][rank_idx]. Smaller = closer / more similar for all metrics (Chroma ranks by ascending “distance”). (tessl.io)
Which distance is returned?
The distance function is fixed per collection via the index space (hnsw:space / configuration={"hnsw":{"space": ...}}), not per-query. Possible values: l2, cosine, ip. (cookbook.chromadb.dev)
- `l2` (default): squared Euclidean distance. The returned value is the L2 distance squared (so it can look "large"; it is not square-rooted). (cookbook.chromadb.dev) If you want the Euclidean distance: `euclidean = sqrt(distance)`.
- `cosine`: cosine distance. The returned value is `distance = 1 - cosine_similarity`. (tessl.io) Convert back: `cosine_similarity = 1 - distance`.
- `ip`: inner-product "distance". The returned value is based on the inner product, with Chroma treating it as a distance by using the negative sign (so that "smaller is better" still works): `distance = -(q · x)`. (tessl.io) Convert back: `dot_product = -distance`.
Practical notes
- Distances are not normalized to 0–1 (except cosine distance, which is typically in `[0, 2]` depending on vector direction). (tessl.io)
- If you need "similarity scores" for UI thresholds, convert as above (cosine: `1 - d`, ip: `-d`; l2: often a custom transform). (tessl.io)
Citations:
- 1: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 2: https://cookbook.chromadb.dev/core/collections/?utm_source=openai
- 3: https://cookbook.chromadb.dev/core/collections/?utm_source=openai
- 4: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 5: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 6: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 7: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
Score calculation is incorrect for ChromaDB's default L2 distance metric.
The formula `score = 1 - distance` on line 63 produces incorrect similarity scores because:
- The collection is created without specifying a distance metric (line 26), defaulting to L2 (squared Euclidean distance)
- For L2, distance values can be arbitrarily large; `1 - distance` produces nonsensical negative or near-zero scores unrelated to similarity
- The formula only works correctly for cosine distance (`distance = 1 - cosine_similarity`); it fails for IP (inner product) and L2
The `min_score` threshold comparison on line 64 is also unreliable as a result.
Retrieve the collection's configured distance function and apply the appropriate transformation:
- Cosine: `score = 1 - distance` (current formula works)
- IP: `score = -distance` (negate to get similarity)
- L2: use a different scoring approach (e.g., `score = 1 / (1 + distance)` or a sigmoid transformation)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/chroma_store.py` around lines 61 - 82, The
current score calculation in the results formatting loop (where you compute
score = 1 - distance and compare to min_score) is wrong for non-cosine metrics;
update the code that builds formatted (and uses results and min_score) to first
read the collection's configured distance metric from the Chroma collection
object, then branch the distance→similarity mapping: for "cosine" keep score = 1
- distance, for "ip" use score = -distance (negate), and for "l2" use a bounded
transform such as score = 1 / (1 + distance) (or a sigmoid) before applying the
min_score filter and appending to formatted; locate this logic in the function
that iterates over results["ids"][0] and replace the single formula with the
metric-aware mapping.
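The metric-aware mapping this finding asks for can be sketched as a small helper. The function name `distance_to_score` is hypothetical, and reading the space via `collection.metadata.get("hnsw:space", "l2")` is an assumption about how the collection was created:

```python
def distance_to_score(distance: float, space: str) -> float:
    """Map a Chroma distance to a higher-is-better similarity score."""
    if space == "cosine":
        return 1.0 - distance          # cosine distance = 1 - cosine similarity
    if space == "ip":
        return -distance               # Chroma returns the negated inner product
    if space == "l2":
        return 1.0 / (1.0 + distance)  # bounded in (0, 1]; distance is squared L2
    raise ValueError(f"unknown hnsw space: {space!r}")
```

Only the cosine branch keeps `min_score` thresholds on a 0–1 similarity scale; for `ip` and `l2` the thresholds would need to be calibrated against the chosen transform.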
def get_document(self, doc_id: str) -> Optional[Dict]:
    results = self.collection.get(where={"doc_id": doc_id})
    if not results["ids"]:
        return None

    chunks = []
    for i, id_ in enumerate(results["ids"]):
        text = ""
        if results.get("documents") and results["documents"][i]:
            text = results["documents"][i]
        if not text:
            text = results["metadatas"][i].get("text", "")

        chunks.append({
            "id": id_,
            "text": text,
            "chunk_index": results["metadatas"][i].get("chunk_index", 0),
            "word_count": results["metadatas"][i].get("word_count", 0),
        })
    chunks.sort(key=lambda c: c["chunk_index"])

    meta = results["metadatas"][0]
    return {
        "doc_id": doc_id,
        "url": meta.get("url", ""),
        "title": meta.get("title", "Untitled"),
        "indexed_at": meta.get("indexed_at", ""),
        "chunk_count": len(chunks),
        "full_text": "\n\n".join(c["text"] for c in chunks),
        "chunks": chunks,
    }
🧹 Nitpick | 🔵 Trivial
Incorrect indexing for documents in get_document.
On line 114, `results["documents"][i]` treats `documents` as a flat list indexed by `i`. ChromaDB's `collection.get()` does return `documents` as a flat list (unlike `query()`, which returns nested lists), but the check `results.get("documents") and results["documents"][i]` may still fail if `documents` is `None` or if `i` is out of bounds when documents weren't stored.
The safer pattern used in search (checking for nested structure) is inconsistent here. Consider unifying the access pattern:
♻️ Proposed fix for consistent document access
for i, id_ in enumerate(results["ids"]):
text = ""
- if results.get("documents") and results["documents"][i]:
- text = results["documents"][i]
+ docs = results.get("documents")
+ if docs and i < len(docs) and docs[i]:
+ text = docs[i]
if not text:
text = results["metadatas"][i].get("text", "")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/chroma_store.py` around lines 106 - 136,
get_document currently indexes results["documents"][i] directly which can raise
or return wrong values when documents is None or has a different shape; update
get_document (the collection.get handling logic) to mirror the safe access used
in search: extract docs = results.get("documents") and if docs is truthy
determine whether it's nested (docs[0] is list) or flat, pick the correct
element for each id and also guard against index errors (fall back to
results["metadatas"][i].get("text", "") when docs is None or out-of-range);
ensure you still build chunks from id, text, chunk_index and word_count from
results["metadatas"] and sort by chunk_index before returning the assembled
document dict.
def add(
    self,
    vectors: List[List[float]],
    metadatas: List[Dict],
    ids: List[str],
    documents: Optional[List[str]] = None,
) -> None:
    new_vecs = np.array(vectors, dtype=np.float32)

    if self._vectors is None:
        self._vectors = new_vecs
    else:
        self._vectors = np.vstack([self._vectors, new_vecs])

    self._ids.extend(ids)
    self._metadatas.extend(metadatas)
    self._documents.extend(documents or [""] * len(ids))
    self._save()
Validate batch sizes and vector dimensions before mutating the store.
`add()` never checks that `vectors`, `metadatas`, `ids`, and `documents` have matching lengths, or that the new vectors match the existing embedding dimension. A shorter `documents` list will break later indexed reads, and a dimension mismatch will fail in `vstack()`. Reject malformed batches up front so the store cannot enter an inconsistent state.
Suggested fix
def add(
self,
vectors: List[List[float]],
metadatas: List[Dict],
ids: List[str],
documents: Optional[List[str]] = None,
) -> None:
+ if not (len(vectors) == len(metadatas) == len(ids)):
+ raise ValueError("vectors, metadatas, and ids must have the same length")
+ if documents is not None and len(documents) != len(ids):
+ raise ValueError("documents and ids must have the same length")
+
new_vecs = np.array(vectors, dtype=np.float32)
+ if new_vecs.ndim != 2:
+ raise ValueError("vectors must be a 2D array of embeddings")
+ if self._vectors is not None and new_vecs.shape[1] != self._vectors.shape[1]:
+ raise ValueError("embedding dimension mismatch")
if self._vectors is None:
self._vectors = new_vecs
else:
        self._vectors = np.vstack([self._vectors, new_vecs])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/numpy_store.py` around lines 54 - 71, Validate
batch sizes and vector dimensions at the start of add() before modifying state:
ensure len(vectors) == len(ids) == len(metadatas), and if documents is provided
ensure len(documents) == len(ids); convert vectors to new_vecs (np.array) and if
self._vectors is not None check new_vecs.ndim == 2 and new_vecs.shape[1] ==
self._vectors.shape[1] (raise ValueError with a clear message on mismatch) so
you never call vstack() or extend lists when the batch is malformed; only after
all checks pass perform vstack/assignment to self._vectors, extend self._ids,
self._metadatas, self._documents, and call _save().
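The key property of the fix is ordering: every validation must run before any state is mutated, so a rejected batch leaves the store untouched. A minimal, self-contained sketch of that pattern (the `_StoreSketch` class is a hypothetical stand-in for illustration, not the real `NumpyStore`):

```python
import numpy as np
from typing import Dict, List, Optional


class _StoreSketch:
    """Hypothetical stand-in showing validate-before-mutate in add()."""

    def __init__(self) -> None:
        self._vectors: Optional[np.ndarray] = None
        self._ids: List[str] = []
        self._metadatas: List[Dict] = []
        self._documents: List[str] = []

    def add(self, vectors, metadatas, ids, documents=None) -> None:
        # All checks run first; nothing below mutates state until they pass.
        if not (len(vectors) == len(metadatas) == len(ids)):
            raise ValueError("vectors, metadatas, and ids must have the same length")
        if documents is not None and len(documents) != len(ids):
            raise ValueError("documents and ids must have the same length")

        new_vecs = np.array(vectors, dtype=np.float32)
        if new_vecs.ndim != 2:
            raise ValueError("vectors must be a 2D array of embeddings")
        if self._vectors is not None and new_vecs.shape[1] != self._vectors.shape[1]:
            raise ValueError("embedding dimension mismatch")

        # Only now is it safe to mutate.
        self._vectors = (
            new_vecs if self._vectors is None
            else np.vstack([self._vectors, new_vecs])
        )
        self._ids.extend(ids)
        self._metadatas.extend(metadatas)
        self._documents.extend(documents or [""] * len(ids))


store = _StoreSketch()
store.add([[0.1, 0.2], [0.3, 0.4]], [{}, {}], ["a", "b"])

rejected = False
try:
    store.add([[0.1, 0.2, 0.3]], [{}], ["c"])  # 3-dim vector vs 2-dim store
except ValueError:
    rejected = True  # batch refused, store state unchanged
```

Because the dimension mismatch is raised before any `extend()` or `vstack()`, the failed call leaves `_ids` and `_vectors` exactly as they were.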
if cast == bool:
    return str(val).lower() in ("1", "true", "yes", "on")
🧹 Nitpick | 🔵 Trivial
Use is for type comparison instead of ==.
Per PEP8 and Python idioms, type comparisons should use is rather than ==.
♻️ Proposed fix
- if cast == bool:
+ if cast is bool:
    return str(val).lower() in ("1", "true", "yes", "on")
🧰 Tools
🪛 Ruff (0.15.4)
[error] 27-27: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/config.py` around lines 27 - 28, The comparison currently
uses "cast == bool" which should use identity testing; update the type check in
the branch that handles booleans to use "cast is bool" instead of "==" so the
condition reliably detects the bool type for the conversion of val (the code
path referencing cast and val).
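The special-case branch exists because calling the cast directly would be wrong for booleans: `bool("false")` is `True`, since any non-empty string is truthy. A small sketch showing the idiomatic `is` check in context (`_env_sketch` is a hypothetical stand-in for the library's `_env` helper):

```python
import os
from typing import Any, Callable


def _env_sketch(key: str, default: Any, cast: Callable = str) -> Any:
    """Hypothetical stand-in for the library's _env helper."""
    val = os.environ.get(key)
    if val is None:
        return default
    # `is` compares identity: `bool` is a single type object, so this is
    # the idiomatic check and sidesteps types that override __eq__.
    if cast is bool:
        return str(val).lower() in ("1", "true", "yes", "on")
    return cast(val)


os.environ["WI_DEMO_FLAG"] = "yes"
os.environ["WI_DEMO_PORT"] = "8080"
flag = _env_sketch("WI_DEMO_FLAG", False, bool)      # parsed, not bool("yes")
port = _env_sketch("WI_DEMO_PORT", 8000, int)        # cast applied normally
missing = _env_sketch("WI_DEMO_MISSING", 8000, int)  # falls back to default
```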
@dataclass
class ServerConfig:
    host: str = _env("WI_SERVER_HOST", "0.0.0.0")
    port: int = _env("WI_SERVER_PORT", 8000, int)
    reload: bool = _env("WI_SERVER_RELOAD", False, bool)
Consider defaulting to 127.0.0.1 for local-only deployments.
The default 0.0.0.0 binds to all network interfaces, making the server accessible from external hosts. Given the PR description emphasizes "runs 100% locally," a default of 127.0.0.1 (localhost only) would be more secure. Users who need external access can override via WI_SERVER_HOST.
🛡️ Proposed safer default
@dataclass
class ServerConfig:
- host: str = _env("WI_SERVER_HOST", "0.0.0.0")
+ host: str = _env("WI_SERVER_HOST", "127.0.0.1")
port: int = _env("WI_SERVER_PORT", 8000, int)
    reload: bool = _env("WI_SERVER_RELOAD", False, bool)
🧰 Tools
🪛 Ruff (0.15.4)
[error] 76-76: Possible binding to all interfaces
(S104)
[error] 78-78: Boolean positional value in function call
(FBT003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/config.py` around lines 74 - 78, The ServerConfig defaults
currently bind to all interfaces; change the default host in the ServerConfig
dataclass (the host field using _env("WI_SERVER_HOST", "0.0.0.0")) to
"127.0.0.1" so the server binds to localhost by default while still allowing
overrides via WI_SERVER_HOST; update the host default value referenced in the
ServerConfig class to _env("WI_SERVER_HOST", "127.0.0.1").
source_list = "\n".join(
    f" - [{s['title']}]({s['url']})" for s in self.sources
)
Potential KeyError if source dict is missing keys.
Line 42 accesses s['title'] and s['url'] directly without .get(). If a source dict is malformed or missing these keys, a KeyError will be raised.
🛡️ Proposed defensive fix
source_list = "\n".join(
- f" - [{s['title']}]({s['url']})" for s in self.sources
+ f" - [{s.get('title', 'Untitled')}]({s.get('url', '')})" for s in self.sources
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/context_formatter.py` around lines 41 - 43, The
list-comprehension building source_list can raise KeyError because it accesses
s['title'] and s['url'] directly; update the comprehension in the source_list
construction (referencing self.sources and source_list) to defensively access
values (e.g., s.get('title', '<untitled>') and s.get('url', '<no-url>')) or
filter out/skip malformed entries (e.g., include only if 'title' and 'url' in s)
so missing keys don't raise exceptions and the formatter yields safe fallback
text or omits bad sources.
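The difference is easy to demonstrate in isolation: with `dict.get()` and fallbacks, a malformed source entry degrades to placeholder text instead of aborting the whole formatting pass (the sample data below is illustrative, not from the codebase):

```python
sources = [
    {"title": "Docs", "url": "https://example.com/docs"},
    {"url": "https://example.com/no-title"},  # malformed: 'title' missing
]

# s['title'] would raise KeyError on the second entry; .get() with
# defaults keeps the join total over every source, good or bad.
source_list = "\n".join(
    f" - [{s.get('title', 'Untitled')}]({s.get('url', '')})" for s in sources
)
```

Filtering out entries missing both keys is an equally valid choice; the defensive access just preserves the "one line per source" invariant.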
block = (
    f"<document index=\"{i}\">\n"
    f"  <source>{url}</source>\n"
    f"  <title>{title}</title>\n"
    f"  <relevance>{score:.2f}</relevance>\n"
    f"  <content>{text}</content>\n"
    f"</document>"
)
Content is not XML-escaped, which may produce malformed output.
If chunk text contains characters like <, >, or &, the generated XML-like structure will be malformed. While this is primarily for LLM consumption rather than strict XML parsing, it could cause issues with some LLM prompting strategies that expect well-formed markup.
🛡️ Optional fix using html.escape
+from html import escape
+
...
block = (
f"<document index=\"{i}\">\n"
- f" <source>{url}</source>\n"
- f" <title>{title}</title>\n"
+ f" <source>{escape(url)}</source>\n"
+ f" <title>{escape(title)}</title>\n"
f" <relevance>{score:.2f}</relevance>\n"
- f" <content>{text}</content>\n"
+ f" <content>{escape(text)}</content>\n"
f"</document>"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/context_formatter.py` around lines 155 - 162, The XML-like
block built in the variable block uses unescaped values (url, title, text) which
can break the markup; update the code that constructs block in
context_formatter.py to XML-escape/HTML-escape those interpolated fields (at
least title and text, and url if it can contain special chars) before formatting
(e.g., use html.escape or xml.sax.saxutils.escape) and add the appropriate
import at the top; ensure the formatted relevance/score remains numeric and not
escaped.
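A quick standalone check of the escaping behavior (sample values are illustrative): `html.escape` rewrites `&`, `<`, `>`, and quotes as entities, so interpolated chunk text can no longer break the surrounding tags.

```python
from html import escape

url = "https://example.com/search?q=a&b"
title = "Operators: < <= & >"
text = "Use a < b && b > c"

# With escaping, the pseudo-XML stays well-formed even when the
# content itself contains markup characters.
block = (
    f"<source>{escape(url)}</source>\n"
    f"<title>{escape(title)}</title>\n"
    f"<content>{escape(text)}</content>"
)
```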
| "url": c.url, | ||
| "doc_id": doc_id, | ||
| "word_count": c.word_count, | ||
| "indexed_at": datetime.now().isoformat(), |
Use timezone-aware datetime for indexed_at timestamp.
datetime.now() returns a naive datetime without timezone information. This can cause issues when comparing timestamps across different systems or when the data is used in timezone-aware contexts.
♻️ Proposed fix
+from datetime import datetime, timezone
+
...
- "indexed_at": datetime.now().isoformat(),
+ "indexed_at": datetime.now(timezone.utc).isoformat(),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "indexed_at": datetime.now().isoformat(), | |
| from datetime import datetime, timezone | |
| ... | |
| "indexed_at": datetime.now(timezone.utc).isoformat(), |
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 279-279: datetime.datetime.now() called without a tz argument
(DTZ005)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/optimized_pipeline.py` at line 279, The "indexed_at"
timestamp is currently created with datetime.now().isoformat(), which is naive;
update the assignment that sets "indexed_at" to produce a timezone-aware ISO
timestamp (e.g., datetime.now(timezone.utc).isoformat() or
datetime.utcnow().replace(tzinfo=timezone.utc).isoformat()) and ensure timezone
is imported from datetime so the stored value includes UTC offset.
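The practical difference shows up in the serialized value: a naive timestamp carries no offset, so two systems in different zones cannot compare it reliably, while the aware form embeds `+00:00` in the ISO string.

```python
from datetime import datetime, timezone

naive = datetime.now()              # tzinfo is None: ambiguous across hosts
aware = datetime.now(timezone.utc)  # carries an explicit UTC offset

stamp = aware.isoformat()  # ends with "+00:00", e.g. "2024-01-01T12:00:00+00:00"
```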
async def retrieve_async(
    self,
    query: str,
    limit: Optional[int] = None,
    output_format: Literal["plain", "numbered", "structured"] = "numbered",
    max_context_words: Optional[int] = None,
    where_filter: Optional[Dict] = None,
    min_score: float = 0.0,
) -> RetrievedContext:
    return self.retrieve(
        query, limit=limit, output_format=output_format,
        max_context_words=max_context_words,
        where_filter=where_filter, min_score=min_score,
    )
🧹 Nitpick | 🔵 Trivial
retrieve_async is not truly asynchronous.
The retrieve_async() method simply calls the synchronous retrieve() method. This provides API consistency but doesn't yield async benefits. Consider adding a docstring noting this, or making it truly async if the underlying operations support it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/optimized_pipeline.py` around lines 533 - 546, The
retrieve_async method currently just calls the synchronous retrieve method, so
it's not actually asynchronous; either document this behavior or make it truly
async: update retrieve_async (method retrieve_async) to either add a clear
docstring stating it delegates to synchronous retrieve and does not perform I/O
concurrently, or change its implementation to run the blocking retrieve on a
threadpool (e.g., use asyncio.get_running_loop().run_in_executor(None, lambda:
self.retrieve(...)) and await the result) or refactor the underlying retrieval
into an async helper (e.g., async _retrieve_impl) and await that; ensure
signatures and return type (RetrievedContext) remain correct and preserve
parameters (query, limit, output_format, max_context_words, where_filter,
min_score).
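The `run_in_executor` option mentioned above can be sketched independently of the pipeline: the blocking call runs on the default threadpool, so the event loop stays free while it executes (`retrieve_sync` is a hypothetical stand-in for the real `retrieve`).

```python
import asyncio


def retrieve_sync(query: str) -> str:
    """Hypothetical stand-in for the blocking retrieve() call."""
    return f"results for {query!r}"


async def retrieve_async(query: str) -> str:
    # Off-load the synchronous call to the default threadpool executor
    # so the coroutine yields control while the work runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, retrieve_sync, query)


result = asyncio.run(retrieve_async("vector search"))
```

If the underlying vector-store and embedder calls ever grow native async variants, an `async _retrieve_impl` would be the cleaner refactor; the executor approach is the low-effort bridge in the meantime.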
except Exception as e:
    raise HTTPException(status_code=500, detail=str(e))
Improve exception handling with proper chaining.
The broad Exception catch loses the exception chain. Use raise ... from e to preserve the traceback for debugging.
♻️ Proposed fix
except Exception as e:
- raise HTTPException(status_code=500, detail=str(e))
+ raise HTTPException(status_code=500, detail=str(e)) from e
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 158-158: Do not catch blind exception: Exception
(BLE001)
[warning] 159-159: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 158 - 159, The except block that
currently does "except Exception as e: raise HTTPException(status_code=500,
detail=str(e))" loses the original traceback; update that raise to use exception
chaining by raising the HTTPException from the caught exception (i.e., "raise
HTTPException(... ) from e") in the same except block so the original exception
context is preserved for debugging; locate the except in server.py where the
HTTPException is raised and apply this change to the handler/function containing
that except.
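What `from e` buys you is visible on the exception object itself: the original error is attached as `__cause__`, so tracebacks print the real failure under the wrapper instead of the misleading "during handling of the above exception" message. A minimal illustration (plain exceptions, no FastAPI dependency):

```python
def parse(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as e:
        # `from e` links the original ValueError as __cause__, preserving
        # the full chain for debugging instead of discarding it.
        raise RuntimeError(f"bad input: {raw!r}") from e


cause = None
try:
    parse("not-a-number")
except RuntimeError as err:
    cause = err.__cause__  # the original ValueError
```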
@app.delete("/documents/by-url")
def delete_by_url(req: DeleteURLRequest):
    pipeline = get_pipeline()
    count = pipeline.delete_url(req.url)
    return {"deleted_chunks": count, "url": req.url}
🧹 Nitpick | 🔵 Trivial
DELETE with request body may cause client compatibility issues.
While HTTP allows request bodies in DELETE, some HTTP clients and proxies don't handle them well. Consider using a query parameter or POST method for better compatibility.
♻️ Alternative using query parameter
-@app.delete("/documents/by-url")
-def delete_by_url(req: DeleteURLRequest):
+@app.delete("/documents/by-url")
+def delete_by_url(url: str):
pipeline = get_pipeline()
- count = pipeline.delete_url(req.url)
- return {"deleted_chunks": count, "url": req.url}
+ count = pipeline.delete_url(url)
+ return {"deleted_chunks": count, "url": url}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 187 - 191, The current delete_by_url
endpoint uses a DELETE method with a request body (DeleteURLRequest), which may
break clients; modify the endpoint to avoid a DELETE body by either (A) changing
it to accept the URL as a query parameter (keep route "/documents/by-url",
change signature to accept url: str and call pipeline.delete_url(url)) or (B)
switch the route to POST (e.g., "/documents/delete-by-url") and keep accepting
DeleteURLRequest in the body; update the `@app` decorator and the function
signature (delete_by_url and any client calls) accordingly and ensure the
returned JSON stays {"deleted_chunks": count, "url": url}.
def start_server(host: str = None, port: int = None, reload: bool = None):
    import uvicorn
    config = default_config().server
    uvicorn.run(
        "web_intelligence.server:app",
        host=host or config.host,
        port=port or config.port,
        reload=reload if reload is not None else config.reload,
    )
🧹 Nitpick | 🔵 Trivial
Add explicit type hints for optional parameters.
The parameters use None defaults without explicit Optional type hints, which violates PEP 484.
♻️ Proposed fix
-def start_server(host: str = None, port: int = None, reload: bool = None):
+def start_server(
+ host: Optional[str] = None,
+ port: Optional[int] = None,
+ reload: Optional[bool] = None,
+) -> None:
    import uvicorn
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 207-207: Boolean-typed positional argument in function definition
(FBT001)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 207 - 215, Update the start_server
signature to use explicit Optional type hints: import Optional from typing and
change the parameters in start_server (host, port, reload) to Optional[str],
Optional[int], and Optional[bool] respectively, keeping the same None defaults
and existing logic that falls back to default_config().server values; reference
the start_server function and its parameters when making this change.
def _load(self) -> None:
    with open(self.persist_path, "rb") as f:
        data = pickle.load(f)
🧩 Analysis chain
🌐 Web query:
Python pickle.load() security risk untrusted user-controlled files arbitrary code execution
💡 Result:
pickle.load() / pickle.loads() is unsafe for untrusted or user-controlled input because a crafted pickle can cause arbitrary code execution during unpickling (this is a property of the format/design, not just a “bug”). The official Python docs explicitly warn: malicious pickle data “will execute arbitrary code during unpickling” and you should “never unpickle data received from an untrusted or unauthenticated source.” [1]
This is a classic insecure deserialization issue (CWE-502). If an attacker can control or tamper with the pickle bytes (e.g., uploaded files, data read from a DB/cache/message queue, downloaded “model” artifacts), they may be able to run code as the process user. [2][3]
Mitigations (typical guidance):
- Do not use pickle with untrusted input; use safer formats like JSON (and validate schemas) when possible. [1]
- If you must use pickle, only load data that is trusted and integrity-protected (e.g., authenticated/signature/HMAC so it can’t be modified). [1][2]
- Treat any “wrapper” APIs that use pickle under the hood (e.g., some ML/model loaders) as the same risk class. [1][3]
Sources
- Python docs: pickle — security warning about arbitrary code execution and "never unpickle … untrusted" data. [1]
- MITRE CWE-502: Deserialization of Untrusted Data (includes Python pickle as an example pattern). [2]
- CodeQL guidance: unsafe deserialization (pickle is “inherently unsafe” with untrusted data). [3]
🏁 Script executed:
# Examine the numpy_store.py file to understand context
cat -n web_intelligence/vector_stores/numpy_store.py | head -100
Repository: AOSSIE-Org/WebSift
Length of output: 3868
🏁 Script executed:
# Check for persist_path usage and configuration
rg "persist_path" --context=3Repository: AOSSIE-Org/WebSift
Length of output: 4792
🏁 Script executed:
# Look for any validation or sanitization
rg "_load|_save|pickle" web_intelligence/vector_stores/numpy_store.py -A 5 -B 2Repository: AOSSIE-Org/WebSift
Length of output: 1677
Replace pickle.load() with a secure serialization format or add integrity verification.
pickle.load() on persist_path is unsafe because a malicious or tampered pickle file will execute arbitrary code during unpickling. Since persist_path is user-configurable, this turns the store file into a code-execution sink. Switch to JSON (or another safe format) with schema validation, or if pickle is required, cryptographically verify the file with HMAC/signatures before loading.
🧰 Tools
🪛 Ruff (0.15.4)
[error] 48-48: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/numpy_store.py` around lines 46 - 48, The
_load method currently uses pickle.load on persist_path which is unsafe; replace
pickle-based deserialization with a safe format (e.g., JSON/MsgPack with
explicit schema validation) or, if keeping pickle, add cryptographic integrity
and authenticity checks: compute/verify an HMAC or digital signature for the
file before calling pickle.load, fail loudly on mismatch, and document the key
storage. Update any companion save logic (e.g., where the store is written) to
emit the safe format or to sign the payload so _load can verify before
deserializing.
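If pickle is kept, the HMAC variant suggested above can be sketched as follows: sign the pickled payload on save, verify the tag with a constant-time compare before any `pickle.loads()` on load, and refuse tampered files outright. The key below is a placeholder; real key storage and rotation are out of scope for this sketch.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"demo-key-change-me"  # placeholder: load from secure storage


def save_signed(obj) -> bytes:
    """Prefix the pickled payload with a 32-byte SHA-256 HMAC tag."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed(blob: bytes):
    """Verify the tag before deserializing; never unpickle unverified bytes."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed; refusing to unpickle")
    return pickle.loads(payload)


blob = save_signed({"ids": ["a"], "vectors": [[0.1, 0.2]]})
restored = load_signed(blob)

# Flip one byte of the payload: verification fails before pickle.loads runs.
tampered = blob[:32] + bytes([blob[32] ^ 0xFF]) + blob[33:]
blocked = False
try:
    load_signed(tampered)
except ValueError:
    blocked = True
```

This protects against tampering with the store file, but not against an attacker who also holds the key; switching the on-disk format to JSON (metadata) plus `np.save` (vectors) removes the deserialization sink entirely.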
Websift
Fetch the internet, serve it to any AI.
Web Intelligence is a Python library that crawls websites (or searches the web for you), extracts the useful text, and turns it into clean, formatted context that any LLM can read — OpenAI, Ollama, Groq, Anthropic, LangChain, or your own code. No API key needed for the core library.
What does it do?