feat: add websift library implementation #1
Nihallllll wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Walkthrough
Introduces a comprehensive Web Intelligence library with web crawling, content extraction, text chunking, embeddings, vector storage, and semantic search capabilities. Includes CLI and REST API interfaces, multiple embedder backends, vector store implementations, caching layers, and extensive test coverage. Establishes project configuration, dependencies, and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant Crawler as Crawler
    participant Extractor as Extractor
    participant Chunker as Chunker
    participant Embedder as Embedder
    participant VectorStore as VectorStore
    Client->>Pipeline: index_url(url)
    Pipeline->>Crawler: crawl_url_async(url)
    Crawler-->>Pipeline: CrawlObject {html, status}
    Pipeline->>Extractor: extract_content(html, url)
    Extractor-->>Pipeline: ExtractedContent {text, title, word_count}
    Pipeline->>Chunker: chunk_text(text, doc_id, url)
    Chunker-->>Pipeline: List[Chunk]
    Pipeline->>Embedder: embed_batch(chunk texts)
    Embedder-->>Pipeline: List[embeddings]
    Pipeline->>VectorStore: add(vectors, metadatas, ids)
    VectorStore-->>Pipeline: success
    Pipeline-->>Client: indexed result
```

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant SearchProvider as SearchProvider
    participant Crawler as Crawler
    participant VectorStore as VectorStore
    participant Formatter as ContextFormatter
    Client->>Pipeline: search_web(query, max_results=5)
    Pipeline->>SearchProvider: search(query, max_results)
    SearchProvider-->>Pipeline: List[SearchResult {url, title, snippet}]
    Pipeline->>Crawler: crawl_urls_batch(urls)
    Crawler-->>Pipeline: List[CrawlObject]
    Note over Pipeline: Extract, chunk, embed (same as index flow)
    Pipeline->>VectorStore: add(vectors, metadatas)
    VectorStore-->>Pipeline: success
    Pipeline->>VectorStore: search(query_embedding, limit)
    VectorStore-->>Pipeline: List[matching chunks]
    Pipeline->>Formatter: format_context_numbered(chunks, query)
    Formatter-->>Pipeline: RetrievedContext
    Pipeline-->>Client: formatted context with sources
```

```mermaid
sequenceDiagram
    participant Client
    participant Pipeline as FastPipeline
    participant VectorStore as VectorStore
    participant Embedder as Embedder
    participant Formatter as ContextFormatter
    Client->>Pipeline: retrieve(query, output_format="numbered")
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>VectorStore: search(query_embedding, limit=5)
    VectorStore-->>Pipeline: List[chunks with metadata, scores]
    Pipeline->>Formatter: format_context_numbered(chunks, query)
    Formatter-->>Pipeline: RetrievedContext {context_text, sources, word_count}
    Pipeline-->>Client: RetrievedContext with formatted output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 51
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/quickstart.py`:
- Around line 6-53: The Groq LLM usage is unguarded and will cause ImportError
or require GROQ_API_KEY; wrap the import and LLM-related section so the core
FastPipeline (FastPipeline, pipeline.search_web, pipeline.index_url,
pipeline.retrieve) runs standalone: attempt to import ChatGroq in a try/except
and only execute the LLM block (creating llm via ChatGroq, calling
ctx.as_messages and llm.invoke, and the second invoke on ctx2.as_messages) if
the import succeeded and os.environ.get("GROQ_API_KEY") is present; if either
check fails, print a short notice that Groq is unavailable and skip the
LLM/example invocation while keeping the rest of the quickstart intact.
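The suggested guard could look like this minimal sketch. The `langchain_groq` import path and the model name are assumptions, and `ctx.as_messages` simply follows the names used in the comment above; the real quickstart may differ:

```python
import os

# Optional-dependency guard: the quickstart's core pipeline demo should run
# even when langchain_groq (assumed import path) or GROQ_API_KEY is missing.
try:
    from langchain_groq import ChatGroq
    GROQ_IMPORT_OK = True
except ImportError:
    GROQ_IMPORT_OK = False


def maybe_run_llm_demo(ctx):
    """Run the LLM part of the quickstart only when Groq is actually usable."""
    if not (GROQ_IMPORT_OK and os.environ.get("GROQ_API_KEY")):
        print("Groq unavailable (package or GROQ_API_KEY missing); skipping LLM demo.")
        return None
    llm = ChatGroq(model="llama-3.1-8b-instant")  # model name is illustrative
    return llm.invoke(ctx.as_messages())
```

The rest of the quickstart (search_web, index_url, retrieve) stays outside the guard and always runs.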
In `@main.py`:
- Around line 101-111: The delete branch for command == "delete" (handling
target, pipeline.delete_url and pipeline.delete_document) performs irreversible
deletes without confirmation; modify it to prompt the user to confirm the
destructive action (e.g., "Are you sure you want to delete <target>? Type 'yes'
to confirm:") and only call pipeline.delete_url or pipeline.delete_document if
the user types the explicit confirmation, or allow an explicit --force flag to
skip the prompt; ensure the confirmation text includes the target so a mistype
is obvious and abort (return) if the user does not confirm.
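A minimal confirmation helper in the spirit of this suggestion; the `input_fn` parameter exists only to make the sketch testable and is not part of the PR's code:

```python
def confirm_delete(target: str, force: bool = False, input_fn=input) -> bool:
    """Ask for explicit confirmation before a destructive delete.

    Returns True only when --force was passed or the user types 'yes'.
    The prompt names the target so a mistyped argument is obvious.
    """
    if force:
        return True
    reply = input_fn(f"Are you sure you want to delete {target}? Type 'yes' to confirm: ")
    return reply.strip().lower() == "yes"
```

The delete branch would then call `pipeline.delete_url` or `pipeline.delete_document` only when `confirm_delete(target, force=...)` returns True, and return otherwise.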
- Around line 67-70: The loop that parses args uses token value filtering which
removes query tokens equal to "--limit"/"-n" or the numeric value; instead, when
you detect the flag in the for i, a in enumerate(args) loop, record the exact
indices to remove (i and i+1), set limit = int(args[i+1]), and then rebuild
query from args[1:] by skipping tokens at those recorded indices (or build a new
list via enumerate and skip matching original indices) so you exclude the flag
and its value by position rather than by value; update the variables args, limit
and query accordingly (referencing args, limit, query, and the for i, a in
enumerate(args) loop).
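Index-based removal might be sketched like this (the function name is hypothetical, and for simplicity the sketch takes the arg list already stripped of the leading command token):

```python
def parse_search_args(args: list[str]) -> tuple[str, int]:
    """Extract --limit/-n by position so query words equal to '--limit'
    or to the numeric value are never dropped by accident."""
    limit = 5
    skip: set[int] = set()
    for i, a in enumerate(args):
        if a in ("--limit", "-n") and i + 1 < len(args):
            limit = int(args[i + 1])
            skip.update((i, i + 1))  # remove flag and value by index, not by value
    query = " ".join(a for i, a in enumerate(args) if i not in skip)
    return query, limit
```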
In `@MANIFEST.in`:
- Line 4: Update the MANIFEST.in line that currently reads "recursive-include
src *.py" to point at the actual package directory used in pyproject.toml
(packages = ["web_intelligence"]); replace "src" with "web_intelligence" so
Python sources are included in sdist (i.e., change the directive to
"recursive-include web_intelligence *.py" or equivalent to include the package's
source files).
In `@pyproject.toml`:
- Around line 44-48: The [project.urls] entries (Homepage, Documentation,
Repository, Issues) currently use placeholder "yourusername/web-intelligence"
links; replace each value with the actual project URLs (real GitHub org/user and
repo name or hosted docs site) so Homepage, Documentation, Repository and Issues
point to resolvable addresses and update Documentation to the real docs anchor
or URL.
- Around line 8-9: Replace the placeholder author metadata in pyproject.toml by
updating the authors table (the authors = [...] entry) to include the real
maintainer details: set the actual name and email for the package
owner/maintainer in the {name = "...", email = "..."} record (or add multiple
author records if needed) so the package metadata no longer contains "Your Name"
or "your.email@example.com".
- Around line 60-61: The comment claiming "Web search (DuckDuckGo, no API key
needed)" is misleading for the dependency line search = ["ddgs>=6.0.0"]; either
change the dependency to the DuckDuckGo-specific package (e.g., replace ddgs
with duckduckgo-search) if you require DuckDuckGo-only behavior, or update the
comment to accurately describe that ddgs is a metasearch aggregator (supports
multiple search sources) so the comment and dependency match; refer to the
dependency key search and the package name ddgs in the pyproject.toml when
making the change.
- Around line 92-93: The console script entry point "main:main" points to a
root-level main.py that won't be packaged (packages = ["web_intelligence"]);
move the root main.py into the web_intelligence package (e.g.,
web_intelligence/main.py or web_intelligence/cli.py) and update the
pyproject.toml entry under [project.scripts] to reference the package module
(for example change "main:main" to "web_intelligence.main:main" or
"web_intelligence.cli:main") so the entry point resolves to a module included in
the distributed package.
In `@README.md`:
- Around line 108-130: Add a blank line after each Markdown heading to satisfy
MD lint rules: insert an empty line after the headings "When you give it a URL
(`index_url`):", "When you ask a question (`retrieve`):", and "When you search
the web (`search_web`):" so the paragraph lists that follow render correctly;
update the README.md section titled "How It Works (Step by Step)" accordingly by
ensuring each heading is immediately followed by one blank line before its
numbered list/paragraph.
- Around line 213-237: The fenced code block showing the project tree in
README.md lacks a language identifier; update that block to use a plain-text
language tag (e.g., ```text or ```plaintext) so the directory tree (starting
with "web_intelligence/" and files like optimized_pipeline.py, async_crawler.py,
embedders/, vector_stores/, etc.) is rendered correctly and satisfies markdown
linting.
In `@requirements.txt`:
- Around line 1-5: Requirements entries are missing version constraints and
should match the minimum versions declared in pyproject.toml; update the
dependency lines for httpx[http2], beautifulsoup4, trafilatura and numpy to
include the same >= version pins used in pyproject.toml (e.g.,
httpx[http2]>=0.25.0 and beautifulsoup4>=4.12.0) so pip install -r
requirements.txt installs the same minima as pip install web-intelligence.
In `@tests/test_library.py`:
- Around line 389-440: The network tests in class TestPipelineIntegration are
running by default because `@pytest.mark.network` has no default exclusion; update
the test class to only run when an explicit flag/env var is set by replacing or
augmenting the marker with a skip-if guard (e.g., add
`@pytest.mark.skipif`(os.getenv("RUN_NETWORK_TESTS") is None, reason="network
tests disabled unless RUN_NETWORK_TESTS is set") and import os and pytest at
top) so normal pytest runs skip TestPipelineIntegration; alternatively, wrap the
class with pytest.skipif using the same condition or check inside
_make_pipeline/test methods to call pytest.skip when the env var is not present.
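The opt-in gate could be expressed as a reusable marker, roughly like this sketch (the env var name follows the comment; the test class body is a placeholder):

```python
import os

import pytest

# Opt-in gate: network tests run only when RUN_NETWORK_TESTS is set.
requires_network = pytest.mark.skipif(
    os.getenv("RUN_NETWORK_TESTS") is None,
    reason="network tests disabled unless RUN_NETWORK_TESTS is set",
)


@requires_network
class TestPipelineIntegration:
    def test_smoke(self):
        # real network-dependent assertions would live here
        assert True
```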
- Around line 166-209: The tests call auto_detect_embedder() unconditionally and
fail when no backend is installed (raising NoEmbedderError); update each test in
TestEmbedder (e.g., test_auto_detect, test_embed_single, test_embed_batch,
test_embed_batch_empty, test_embeddings_are_normalized, test_cache_stats) to
handle this by catching NoEmbedderError from web_intelligence.embedders and
calling pytest.skip(...) with a clear message (or use pytest.importorskip /
pytest.mark.skipif around the test) so the test is skipped in core-only
environments rather than failing.
In `@web_intelligence/_logging.py`:
- Around line 7-17: The setup_logging function should declare a return type (->
None) and reuse the module-level logger instead of calling logging.getLogger
again; update the signature to include the return type and replace the local
"root = logging.getLogger('web_intelligence')" with the existing module-level
"logger" variable, then run the same handler-existence check and setLevel on
that logger (e.g., check logger.handlers for an existing StreamHandler before
adding handler and call logger.setLevel(level)).
In `@web_intelligence/async_crawler.py`:
- Around line 70-77: The current rate_limiter semaphore (used around client.get
via rate_limiter.acquire()/release()) only limits concurrent in-flight requests,
not the actual requests-per-second and also forces min rate 1; replace the
semaphore approach with a time-based token-bucket or interval limiter that
supports fractional RPS (requests_per_second as float). Concretely, implement an
async TokenBucket or RateLimiter class that exposes an awaitable method like
wait_for_token() (instead of acquire()/release()) which blocks until a token is
available and refills tokens on a timed schedule (or enforces a per-request
interval = 1/requests_per_second); update all call sites that currently do
rate_limiter.acquire()/release() to await rate_limiter.wait_for_token() before
calling client.get so bursts are prevented and fractional RPS are honored.
Ensure constructor/creator of the limiter accepts float RPS and unit tests cover
low fractional rates (e.g., 0.5 RPS) and concurrent callers.
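One possible shape for such a limiter, as a self-contained sketch (class and method names follow the comment above, not the library's actual API):

```python
import asyncio
import time


class TokenBucket:
    """Async token-bucket limiter that honors fractional requests-per-second."""

    def __init__(self, requests_per_second: float, burst: int = 1):
        self.rate = float(requests_per_second)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def wait_for_token(self) -> None:
        """Block until a token is available; tokens refill with elapsed time."""
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                await asyncio.sleep((1.0 - self.tokens) / self.rate)
```

Call sites would then do `await rate_limiter.wait_for_token()` before `client.get` instead of `acquire()/release()`; holding the lock while sleeping also keeps concurrent waiters ordered.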
- Around line 83-94: The retry block currently logs "retrying" and awaits
asyncio.sleep even on the final attempt, causing an unnecessary delay and
misleading logs; update the block in async_crawler.py (inside the request/retry
loop handling response.status_code 429 or >=500) to check if attempt ==
max_retries and, in that case, set last_error and break/return immediately
without calling logger.warning or awaiting asyncio.sleep; otherwise compute wait
(using retry_after and 2 ** attempt), log via logger.warning and await
asyncio.sleep as before. Ensure you reference response.status_code, retry_after,
wait, logger.warning, last_error and asyncio.sleep when making the conditional
change.
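The conditional could be restructured roughly like this standalone sketch; `get` is an injected async callable standing in for `client.get`, and `base_delay` exists only to keep the example fast:

```python
import asyncio
import logging

logger = logging.getLogger("web_intelligence")


async def fetch_with_retries(get, url: str, max_retries: int = 3,
                             base_delay: float = 1.0):
    """Retry 429/5xx responses, but never log 'retrying' or sleep after
    the final attempt."""
    for attempt in range(max_retries + 1):
        response = await get(url)
        if response.status_code == 429 or response.status_code >= 500:
            if attempt == max_retries:
                break  # give up immediately: no misleading log, no wasted sleep
            wait = base_delay * (2 ** attempt)
            logger.warning("HTTP %s for %s, retrying in %.2fs",
                           response.status_code, url, wait)
            await asyncio.sleep(wait)
            continue
        return response
    return None  # caller records last_error and surfaces the failure
```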
In `@web_intelligence/cache.py`:
- Around line 68-73: The stats() method can raise FileNotFoundError due to a
race on self.index_file.stat(); update stats() to safely obtain index_size_bytes
by wrapping the stat/getsize call in a try/except (catch
OSError/FileNotFoundError) and default to 0 if it fails (instead of directly
calling self.index_file.stat() even when exists() returned True); keep the
returned keys ('total_cached', 'cache_dir') unchanged and reference self.index
(for total_cached) and self.cache_dir, self.index_file in your change so the fix
is localized to stats().
- Around line 126-153: The disk filenames are using a 16-character prefix of the
SHA256 hash while the in-memory keys use the full hash, which can cause
collisions; update the get and set methods to use the full hash returned by
_text_hash for cache_file names (or at minimum increase to a longer prefix like
32 chars) so disk filenames match the in-memory keys (refer to _text_hash, get,
set, cache_dir, cache, max_memory_items).
- Around line 35-41: The TTL check currently compares datetime.now() (naive) to
cached_at/fromisoformat (may be aware) causing TypeError; update the logic
around ttl_hours, cached_at and expiry to use timezone-aware UTC datetimes
consistently: parse cached_at with datetime.fromisoformat and if
cached_at.tzinfo is None set/assume UTC (or convert with
cached_at.replace(tzinfo=timezone.utc)), then compute expiry and compare against
datetime.now(timezone.utc); ensure any stored timestamps are treated as UTC so
comparisons between expiry and now are always timezone-aware.
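The aware-UTC comparison might look like this sketch (the helper name is hypothetical; the real check lives inline in cache.py):

```python
from datetime import datetime, timedelta, timezone


def is_expired(cached_at_iso: str, ttl_hours: float) -> bool:
    """TTL check done entirely in timezone-aware UTC, so naive and aware
    stored timestamps can never be compared against each other."""
    cached_at = datetime.fromisoformat(cached_at_iso)
    if cached_at.tzinfo is None:
        cached_at = cached_at.replace(tzinfo=timezone.utc)  # assume stored as UTC
    expiry = cached_at + timedelta(hours=ttl_hours)
    return datetime.now(timezone.utc) >= expiry
```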
In `@web_intelligence/chunker.py`:
- Around line 14-17: The current split_into_sentences uses a naive regex
(sentence_endings) that splits on any period+space and will break abbreviations
and decimals; replace this with a robust sentence tokenizer such as NLTK's
sent_tokenize (ensure punkt is downloaded) or spaCy's nlp(...).sents inside
split_into_sentences, or if you must keep the lightweight approach, update the
docstring of split_into_sentences to clearly state this limitation and add a
TODO to migrate to a proper tokenizer; refer to the split_into_sentences
function and the sentence_endings variable when making the change.
- Around line 77-78: Remove the dead helper function tokenizer from chunker.py:
delete the def tokenizer(text: str) -> List[str]: return text.split()
declaration and any direct references to it; also ensure no imports or __all__
exports refer to tokenizer (check __init__.py and top-of-file imports) so the
codebase no longer contains or exposes this unused function.
- Line 1: Replace the deprecated "from typing import List" import with the
built-in "list" type and update all type annotations that reference the imported
List to use the builtin list instead; search for occurrences of "List[" in this
file (e.g., in function signatures, return types, and variable annotations) and
change them to "list[...]" (and remove the import line "from typing import
List").
In `@web_intelligence/config.py`:
- Around line 27-28: The comparison currently uses "cast == bool" which should
use identity testing; update the type check in the branch that handles booleans
to use "cast is bool" instead of "==" so the condition reliably detects the bool
type for the conversion of val (the code path referencing cast and val).
- Around line 74-78: The ServerConfig defaults currently bind to all interfaces;
change the default host in the ServerConfig dataclass (the host field using
_env("WI_SERVER_HOST", "0.0.0.0")) to "127.0.0.1" so the server binds to
localhost by default while still allowing overrides via WI_SERVER_HOST; update
the host default value referenced in the ServerConfig class to
_env("WI_SERVER_HOST", "127.0.0.1").
In `@web_intelligence/context_formatter.py`:
- Around line 41-43: The list-comprehension building source_list can raise
KeyError because it accesses s['title'] and s['url'] directly; update the
comprehension in the source_list construction (referencing self.sources and
source_list) to defensively access values (e.g., s.get('title', '<untitled>')
and s.get('url', '<no-url>')) or filter out/skip malformed entries (e.g.,
include only if 'title' and 'url' in s) so missing keys don't raise exceptions
and the formatter yields safe fallback text or omits bad sources.
- Around line 155-162: The XML-like block built in the variable block uses
unescaped values (url, title, text) which can break the markup; update the code
that constructs block in context_formatter.py to XML-escape/HTML-escape those
interpolated fields (at least title and text, and url if it can contain special
chars) before formatting (e.g., use html.escape or xml.sax.saxutils.escape) and
add the appropriate import at the top; ensure the formatted relevance/score
remains numeric and not escaped.
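With `xml.sax.saxutils`, the block construction could escape its inputs along these lines (a sketch; the real field layout in context_formatter.py may differ):

```python
from xml.sax.saxutils import escape, quoteattr


def format_source_block(url: str, title: str, text: str, score: float) -> str:
    """Build the XML-like block with escaped interpolations; the numeric
    relevance score is formatted directly and never escaped."""
    safe_url = quoteattr(url)  # adds surrounding quotes and escapes &, <, "
    return (
        f"<source url={safe_url} relevance=\"{score:.3f}\">\n"
        f"  <title>{escape(title)}</title>\n"
        f"  <text>{escape(text)}</text>\n"
        f"</source>"
    )
```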
In `@web_intelligence/crawler.py`:
- Around line 6-14: Fix the CrawlObject dataclass annotations: change the type
of crawled_at from datetime to Optional[datetime] and keep its default = None,
and normalize spacing around all type annotations (e.g., use url: str, html:
str, status_code: int, success: bool, error: Optional[str] = None, crawled_at:
Optional[datetime] = None) so annotations follow the standard "name: type"
style; update imports if needed to ensure Optional and datetime are available
and used consistently in the CrawlObject definition.
- Line 31: The success check currently treats only HTTP 200 as success; update
the assignment that sets success (where response.status_code is used) to accept
the full 2xx range (e.g. 200 <= response.status_code < 300) or use response.ok
so other successful codes like 201/204 are allowed; locate the assignment to
success in crawler.py and replace the equality check against 200 with a
2xx-range check or response.ok.
- Around line 16-43: The crawl_url function currently catches a bare Exception
which will swallow critical signals; change the signature to include a return
type (e.g., def crawl_url(url: str, timeout: int = 10) -> CrawlObject:) and
replace the broad except with targeted httpx exceptions — for example catch
httpx.RequestError and httpx.TimeoutException (or httpx.ReadTimeout/WriteTimeout
if used) around the httpx.get call and return the error-wrapped CrawlObject
there; do not catch BaseException/Exception broadly (allow
KeyboardInterrupt/SystemExit to propagate) and ensure httpx exceptions are
imported so the specific except block references httpx.RequestError,
httpx.TimeoutException and handles building the same CrawlObject(error=str(e),
success=False, status_code=0, ...).
In `@web_intelligence/embedders/__init__.py`:
- Around line 9-20: The auto_detect_embedder function currently forwards
**kwargs directly into SentenceTransformerEmbedder and FastEmbedEmbedder which
can raise TypeError on unexpected params and short-circuit fallbacks; update
auto_detect_embedder to either (a) try/except TypeError around each constructor
call and continue to the next backend on TypeError, or (b) whitelist/filter
kwargs to the specific initializer signature of SentenceTransformerEmbedder
(referencing SentenceTransformerEmbedder.__init__) before calling it, and
similarly for FastEmbedEmbedder, so unexpected keys like base_url or api_key
won't break detection and will allow the FastEmbedEmbedder fallback to run.
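One way to whitelist kwargs against an initializer signature; `DemoEmbedder` is an illustrative stand-in, not the real backend class:

```python
import inspect


def filtered_kwargs(cls, kwargs: dict) -> dict:
    """Keep only the kwargs the class constructor accepts, so stray keys
    such as base_url or api_key cannot raise TypeError and abort fallback."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # constructor already swallows **kwargs
    return {k: v for k, v in kwargs.items() if k in params}


class DemoEmbedder:  # stand-in for SentenceTransformerEmbedder / FastEmbedEmbedder
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
```

auto_detect_embedder would then call each backend as `cls(**filtered_kwargs(cls, kwargs))`, optionally with a try/except TypeError around the call as a second line of defense.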
In `@web_intelligence/embedders/fastembed_embedder.py`:
- Around line 95-103: In the loop inside fastembed_embedder.py where you iterate
over texts_to_embed and vectors (for idx, (text, vector) in
enumerate(zip(texts_to_embed, vectors))): change the zip call to use strict=True
so mismatched lengths raise immediately (i.e., enumerate(zip(texts_to_embed,
vectors, strict=True))). Update that single line in the method that performs
embedding and maintains the cache (the block that sets
results[indices_to_embed[idx]] and calls self.cache.set) to use zip(...,
strict=True); ensure the project runs on Python 3.10+ if not already.
- Around line 17-23: The __init__ of FastembedEmbedder declares **kwargs but
never uses it; either remove it or forward it to the underlying TextEmbedding
constructor—prefer forwarding for compatibility. Modify
FastembedEmbedder.__init__ so any **kwargs passed are forwarded into the
TextEmbedding(...) call (i.e., include **kwargs when instantiating
TextEmbedding), ensuring the **kwargs parameter is actually consumed;
alternatively, if you want to drop it, remove **kwargs from the __init__
signature and any related unused references and update callers accordingly.
- Around line 65-103: The show_progress parameter on embed_batch is unused;
remove it from the method signature and any internal references (function
embed_batch in fastembed_embedder.py) so callers don't pass a misleading option,
and update any call sites to stop supplying show_progress; alternatively, if you
prefer to keep the API for compatibility, add a clear inline comment above
embed_batch and a docstring note stating that fastembed's _model.embed does not
support progress display and show_progress is a no-op, and ensure no attempt is
made to pass it into _model.embed.
In `@web_intelligence/embedders/openai_embedder.py`:
- Around line 45-46: The cache is currently shared across models and backends;
change the EmbeddingCache instantiation in OpenAIEmbedder (the self.cache
assignment) to namespace entries by backend and model so embeddings are not
mixed across different model/backends — e.g. derive a per-model/backend cache
subdir or pass a namespace/model identifier into EmbeddingCache (or update
EmbeddingCache to accept a backend/model key) using the embedder's backend
identifier and model name (from the same class where self.use_cache is set) so
cached keys include backend/model scope and prevent wrong-dimensionality hits.
In `@web_intelligence/fast_embedder.py`:
- Around line 39-40: The call torch.set_num_threads(torch.get_num_threads()) is
a no-op; either remove it or set a meaningful thread count. In fast_embedder.py,
update the device CPU branch (where device == "cpu") to either delete the
torch.set_num_threads line or set a concrete value (e.g., derive from
os.cpu_count() or an injected config/ENV value) using
torch.set_num_threads(<desired_count>) so CPU threading is actually configured;
ensure you import any module you use (e.g., os) and document/validate the chosen
thread count.
- Around line 136-174: The benchmark_embedder function creates on-disk cached
embeddings in ./data/cache/embeddings during the cache test and never removes
them; update benchmark_embedder to clean up that cache after the test by
removing the cache directory (e.g., using shutil.rmtree) or, better, by creating
and using a temporary cache directory for the FastEmbedder instance used in the
cache test and deleting it afterwards; reference the benchmark_embedder function
and the FastEmbedder instance named embedder_cached to locate where to
create/remove the cache and wrap the removal in a try/except to avoid crashing
if deletion fails.
In `@web_intelligence/optimized_pipeline.py`:
- Around line 533-546: The retrieve_async method currently just calls the
synchronous retrieve method, so it's not actually asynchronous; either document
this behavior or make it truly async: update retrieve_async (method
retrieve_async) to either add a clear docstring stating it delegates to
synchronous retrieve and does not perform I/O concurrently, or change its
implementation to run the blocking retrieve on a threadpool (e.g., use
asyncio.get_running_loop().run_in_executor(None, lambda: self.retrieve(...)) and
await the result) or refactor the underlying retrieval into an async helper
(e.g., async _retrieve_impl) and await that; ensure signatures and return type
(RetrievedContext) remain correct and preserve parameters (query, limit,
output_format, max_context_words, where_filter, min_score).
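The threadpool variant could be sketched as follows; `PipelineSketch` is a stand-in, not the real FastPipeline:

```python
import asyncio
from functools import partial


class PipelineSketch:
    """Stand-in showing how retrieve_async can offload the blocking retrieve."""

    def retrieve(self, query: str, limit: int = 5) -> str:
        # represents the real synchronous, potentially blocking retrieval
        return f"results for {query!r} (limit={limit})"

    async def retrieve_async(self, query: str, limit: int = 5) -> str:
        loop = asyncio.get_running_loop()
        # run the blocking call on the default threadpool so the event loop stays free
        return await loop.run_in_executor(
            None, partial(self.retrieve, query, limit=limit)
        )
```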
- Around line 489-496: In the except ImportError block that raises
SearchProviderError, chain the original ImportError when re-raising so debugging
preserves the original traceback: capture the caught exception (e.g., except
ImportError as err) and raise SearchProviderError(...) from err (or use from
None if you intentionally want to suppress the original), updating the raise in
the except block that references SearchProviderError.
- Line 279: The "indexed_at" timestamp is currently created with
datetime.now().isoformat(), which is naive; update the assignment that sets
"indexed_at" to produce a timezone-aware ISO timestamp (e.g.,
datetime.now(timezone.utc).isoformat() or
datetime.utcnow().replace(tzinfo=timezone.utc).isoformat()) and ensure timezone
is imported from datetime so the stored value includes UTC offset.
In `@web_intelligence/search_providers/base.py`:
- Line 4: The import list in this module includes an unused Optional symbol;
update the import statement that currently imports List, Optional, Protocol,
runtime_checkable to remove Optional so only used names remain (e.g., keep List,
Protocol, runtime_checkable) to eliminate the unused import warning in
web_intelligence.search_providers.base.
In `@web_intelligence/search_providers/duckduckgo_provider.py`:
- Around line 48-55: The loop building SearchResult objects currently uses
item.get("href", "") which yields empty URLs; update the iteration in
duckduckgo_provider.py (the block that iterates over raw and appends to results)
to validate each item before creating a SearchResult: check that href =
item.get("href") is truthy (non-empty) and optionally that title or body exist,
and skip the item if href is missing or empty so you do not append
SearchResult(url="") entries; keep using SearchResult(...) when validation
passes.
In `@web_intelligence/server.py`:
- Around line 158-159: The except block that currently does "except Exception as
e: raise HTTPException(status_code=500, detail=str(e))" loses the original
traceback; update that raise to use exception chaining by raising the
HTTPException from the caught exception (i.e., "raise HTTPException(... ) from
e") in the same except block so the original exception context is preserved for
debugging; locate the except in server.py where the HTTPException is raised and
apply this change to the handler/function containing that except.
- Around line 187-191: The current delete_by_url endpoint uses a DELETE method
with a request body (DeleteURLRequest), which may break clients; modify the
endpoint to avoid a DELETE body by either (A) changing it to accept the URL as a
query parameter (keep route "/documents/by-url", change signature to accept url:
str and call pipeline.delete_url(url)) or (B) switch the route to POST (e.g.,
"/documents/delete-by-url") and keep accepting DeleteURLRequest in the body;
update the `@app` decorator and the function signature (delete_by_url and any
client calls) accordingly and ensure the returned JSON stays {"deleted_chunks":
count, "url": url}.
- Around line 207-215: Update the start_server signature to use explicit
Optional type hints: import Optional from typing and change the parameters in
start_server (host, port, reload) to Optional[str], Optional[int], and
Optional[bool] respectively, keeping the same None defaults and existing logic
that falls back to default_config().server values; reference the start_server
function and its parameters when making this change.
In `@web_intelligence/vector_store.py`:
- Around line 29-37: The search method currently defines the parameter named
filter which causes TypeError when callers pass where_filter; rename the
parameter and its internal uses from filter to where_filter in the search method
signature and body (function search and its kwargs population: replace "filter"
parameter and kwargs["where"] assignment to use where_filter) so callers like
optimized_pipeline that call search(..., where_filter=...) work correctly;
ensure any internal references to the old name are updated to where_filter to
preserve behavior.
- Around line 22-27: The VectorStore.add wrapper currently only accepts the old
three-arg signature and must be made compatible with the new contract used by
optimized_pipeline (which calls vector_store.add(..., documents=chunk_texts)).
Update the add method (VectorStore.add) to accept a documents:
Optional[List[str]] parameter (or accept **kwargs and pop 'documents') while
preserving support for the existing vectors, metadatas, ids parameters, and
forward documents through to self.collection.add as documents=documents; ensure
type hints reflect Optional[List[str]] and that passing documents via keyword
works without raising TypeError so optimized_pipeline.py can pass chunk_texts
successfully.
In `@web_intelligence/vector_stores/base.py`:
- Line 1: The import and type annotations use deprecated typing.List and
typing.Dict; update the import line to remove List and Dict and change all
annotations in this module that reference List[...] and Dict[...] to use the
built-in generics list[...] and dict[...] (keep Optional, Protocol,
runtime_checkable from typing as-is); specifically update the top-level import
from "from typing import List, Dict, Optional, Protocol, runtime_checkable" and
any usages of List and Dict in functions/classes in this file to use list and
dict generics instead.
In `@web_intelligence/vector_stores/chroma_store.py`:
- Around line 106-136: get_document currently indexes results["documents"][i]
directly which can raise or return wrong values when documents is None or has a
different shape; update get_document (the collection.get handling logic) to
mirror the safe access used in search: extract docs = results.get("documents")
and if docs is truthy determine whether it's nested (docs[0] is list) or flat,
pick the correct element for each id and also guard against index errors (fall
back to results["metadatas"][i].get("text", "") when docs is None or
out-of-range); ensure you still build chunks from id, text, chunk_index and
word_count from results["metadatas"] and sort by chunk_index before returning
the assembled document dict.
- Around line 61-82: The current score calculation in the results formatting
loop (where you compute score = 1 - distance and compare to min_score) is wrong
for non-cosine metrics; update the code that builds formatted (and uses results
and min_score) to first read the collection's configured distance metric from
the Chroma collection object, then branch the distance→similarity mapping: for
"cosine" keep score = 1 - distance, for "ip" use score = -distance (negate), and
for "l2" use a bounded transform such as score = 1 / (1 + distance) (or a
sigmoid) before applying the min_score filter and appending to formatted; locate
this logic in the function that iterates over results["ids"][0] and replace the
single formula with the metric-aware mapping.
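The metric-aware mapping, extracted as a helper for clarity (a sketch; how the metric is read from the collection's configuration is left to the real code):

```python
def distance_to_score(distance: float, metric: str) -> float:
    """Map a Chroma distance to a similarity score per configured metric."""
    if metric == "cosine":
        return 1.0 - distance          # cosine distance lies in [0, 2]
    if metric == "ip":
        return -distance               # inner-product distances are negated similarities
    if metric == "l2":
        return 1.0 / (1.0 + distance)  # bounded transform of squared L2
    raise ValueError(f"unknown distance metric: {metric!r}")
```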
In `@web_intelligence/vector_stores/numpy_store.py`:
- Around line 54-71: Validate batch sizes and vector dimensions at the start of
add() before modifying state: ensure len(vectors) == len(ids) == len(metadatas),
and if documents is provided ensure len(documents) == len(ids); convert vectors
to new_vecs (np.array) and if self._vectors is not None check new_vecs.ndim == 2
and new_vecs.shape[1] == self._vectors.shape[1] (raise ValueError with a clear
message on mismatch) so you never call vstack() or extend lists when the batch
is malformed; only after all checks pass perform vstack/assignment to
self._vectors, extend self._ids, self._metadatas, self._documents, and call
_save().
- Around line 46-48: The _load method currently uses pickle.load on persist_path
which is unsafe; replace pickle-based deserialization with a safe format (e.g.,
JSON/MsgPack with explicit schema validation) or, if keeping pickle, add
cryptographic integrity and authenticity checks: compute/verify an HMAC or
digital signature for the file before calling pickle.load, fail loudly on
mismatch, and document the key storage. Update any companion save logic (e.g.,
where the store is written) to emit the safe format or to sign the payload so
_load can verify before deserializing.
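The HMAC option described above can be sketched as follows. This is a minimal illustration, not the store's actual persistence code: the function names (`save_signed`, `load_signed`) and the `STORE_HMAC_KEY` environment variable are hypothetical, and a real deployment must manage the key securely rather than fall back to a default.

```python
import hashlib
import hmac
import os
import pickle

# Hypothetical key source; real deployments need proper key management.
SECRET_KEY = os.environ.get("STORE_HMAC_KEY", "dev-only-key").encode()

def save_signed(path: str, obj) -> None:
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    with open(path, "wb") as f:
        f.write(tag + payload)  # 32-byte SHA-256 tag, then the pickle payload

def load_signed(path: str):
    with open(path, "rb") as f:
        blob = f.read()
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError(f"HMAC mismatch for {path}; refusing to unpickle")
    return pickle.loads(payload)  # only reached after the tag verifies
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels; a tampered payload fails loudly before `pickle.loads` ever runs.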
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e5ecf859-3050-475c-9ed0-02c87654db50
⛔ Files ignored due to path filters (36)
All 36 ignored files are compiled bytecode caches (`__pycache__/*.pyc`) under `tests/`, `web_intelligence/`, `web_intelligence/embedders/`, `web_intelligence/search_providers/`, and `web_intelligence/vector_stores/`; each is excluded by the `!**/*.pyc` path filter.
📒 Files selected for processing (37)
`.python-version`, `MANIFEST.in`, `README.md`, `examples/quickstart.py`, `main.py`, `pyproject.toml`, `pytest.ini`, `requirements.txt`, `tests/__init__.py`, `tests/test_library.py`, `web_intelligence/__init__.py`, `web_intelligence/_logging.py`, `web_intelligence/async_crawler.py`, `web_intelligence/cache.py`, `web_intelligence/chunker.py`, `web_intelligence/config.py`, `web_intelligence/context_formatter.py`, `web_intelligence/crawler.py`, `web_intelligence/embedders/__init__.py`, `web_intelligence/embedders/base.py`, `web_intelligence/embedders/fastembed_embedder.py`, `web_intelligence/embedders/ollama_embedder.py`, `web_intelligence/embedders/openai_embedder.py`, `web_intelligence/embedders/sentence_transformer.py`, `web_intelligence/exceptions.py`, `web_intelligence/extractor.py`, `web_intelligence/fast_embedder.py`, `web_intelligence/optimized_pipeline.py`, `web_intelligence/search_providers/__init__.py`, `web_intelligence/search_providers/base.py`, `web_intelligence/search_providers/duckduckgo_provider.py`, `web_intelligence/server.py`, `web_intelligence/vector_store.py`, `web_intelligence/vector_stores/__init__.py`, `web_intelligence/vector_stores/base.py`, `web_intelligence/vector_stores/chroma_store.py`, `web_intelligence/vector_stores/numpy_store.py`
from langchain_groq import ChatGroq


def main():
    print("Setting up Web Intelligence pipeline...")
    pipeline = FastPipeline()

    question = "what is lcm of 9 and 17"

    print(f"\nSearching the web for: '{question}'")
    print("(This searches DuckDuckGo → crawls pages → indexes → retrieves)\n")

    ctx = pipeline.search_web(question, max_results=3, limit=5)

    print("=" * 60)
    print("WEB INTELLIGENCE RESULTS")
    print("=" * 60)
    print(f"Sources found: {len(ctx.sources)}")
    for s in ctx.sources:
        print(f"  - {s['title']}: {s['url']}")
    print(f"Context chunks: {ctx.total_chunks}")
    print(f"Context words: {ctx.total_words}")
    print(f"\nContext preview (first 500 chars):")
    print(ctx.context_text[:500])
    print("...\n")

    print("=" * 60)
    print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
    print("=" * 60)

    llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

    messages = ctx.as_messages()
    response = llm.invoke(messages)

    print(f"\nLLM ANSWER:\n")
    print(response.content)

    print("\n" + "=" * 60)
    print("BONUS: Index a specific page and ask a question")
    print("=" * 60)

    pipeline.index_url("https://en.wikipedia.org/wiki/Python_(programming_language)")
    ctx2 = pipeline.retrieve("Who created Python and when?")

    response2 = llm.invoke(ctx2.as_messages())
    print(f"\nQuestion: Who created Python and when?")
    print(f"Answer: {response2.content}")
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the repository structure to understand the context
git ls-files | head -30
Repository: AOSSIE-Org/WebSift
Length of output: 808
🏁 Script executed:
# Check if the quickstart file exists and its current state
fd quickstart.py . | head -5
Repository: AOSSIE-Org/WebSift
Length of output: 86
🏁 Script executed:
# List key configuration files
fd -e toml -e txt -e md . | grep -E "(pyproject|setup|requirements|README)" | head -20
Repository: AOSSIE-Org/WebSift
Length of output: 103
🏁 Script executed:
# Read the quickstart example to verify current state
cat -n examples/quickstart.py
Repository: AOSSIE-Org/WebSift
Length of output: 2061
🏁 Script executed:
# Read pyproject.toml to check dependency declarations
cat pyproject.toml
Repository: AOSSIE-Org/WebSift
Length of output: 2997
🏁 Script executed:
# Read README to check quickstart prerequisites and documentation
cat README.md
Repository: AOSSIE-Org/WebSift
Length of output: 6964
Gate the Groq LLM section behind an optional check.
The quickstart imports langchain_groq unconditionally at line 6, but langchain_groq is not a declared dependency in pyproject.toml. This will cause an ImportError on a fresh install. Additionally, the README states "No API key needed for the core library," yet the example requires GROQ_API_KEY to run, preventing users from trying the core search/index workflow. Conditionally import ChatGroq and check for both the package availability and GROQ_API_KEY before running the LLM sections. The core pipeline (search, index, retrieve) should work standalone.
Suggested fix
+import os
from dotenv import load_dotenv
load_dotenv()
from web_intelligence import FastPipeline
-from langchain_groq import ChatGroq
+try:
+ from langchain_groq import ChatGroq
+except ImportError:
+ ChatGroq = None
@@
- print("=" * 60)
- print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
- print("=" * 60)
-
- llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
-
- messages = ctx.as_messages()
- response = llm.invoke(messages)
-
- print(f"\nLLM ANSWER:\n")
- print(response.content)
+ llm = None
+ if ChatGroq is not None and os.getenv("GROQ_API_KEY"):
+ print("=" * 60)
+ print("SENDING TO GROQ LLM (llama-3.3-70b-versatile)")
+ print("=" * 60)
+ llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
+ response = llm.invoke(ctx.as_messages())
+ print("\nLLM ANSWER:\n")
+ print(response.content)
+ else:
+ print("Skipping Groq demo. Install `langchain-groq` and set `GROQ_API_KEY` to enable it.")
@@
- response2 = llm.invoke(ctx2.as_messages())
- print(f"\nQuestion: Who created Python and when?")
- print(f"Answer: {response2.content}")
+ if llm is not None:
+ response2 = llm.invoke(ctx2.as_messages())
+ print("\nQuestion: Who created Python and when?")
+ print(f"Answer: {response2.content}")
🧰 Tools
🪛 Ruff (0.15.4)
[error] 28-28: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 41-41: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 52-52: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/quickstart.py` around lines 6 - 53, The Groq LLM usage is unguarded
and will cause ImportError or require GROQ_API_KEY; wrap the import and
LLM-related section so the core FastPipeline (FastPipeline, pipeline.search_web,
pipeline.index_url, pipeline.retrieve) runs standalone: attempt to import
ChatGroq in a try/except and only execute the LLM block (creating llm via
ChatGroq, calling ctx.as_messages and llm.invoke, and the second invoke on
ctx2.as_messages) if the import succeeded and os.environ.get("GROQ_API_KEY") is
present; if either check fails, print a short notice that Groq is unavailable
and skip the LLM/example invocation while keeping the rest of the quickstart
intact.
for i, a in enumerate(args):
    if a in ("--limit", "-n") and i + 1 < len(args):
        limit = int(args[i + 1])
        query = " ".join(a for a in args[1:] if a not in ("--limit", "-n", args[i + 1]))
The --limit parser corrupts queries containing the same token value.
Line 70 rebuilds `query` by filtering tokens by value, not by position. `python main.py search version 2 --limit 2` becomes `version`, and any literal `-n`/`--limit` token inside the query is also stripped.
Suggested fix
limit = 5
- for i, a in enumerate(args):
- if a in ("--limit", "-n") and i + 1 < len(args):
- limit = int(args[i + 1])
- query = " ".join(a for a in args[1:] if a not in ("--limit", "-n", args[i + 1]))
+ query_parts = []
+ i = 1
+ while i < len(args):
+ if args[i] in ("--limit", "-n"):
+ if i + 1 >= len(args):
+ print("Usage: python main.py search <query> [--limit N]")
+ return
+ limit = int(args[i + 1])
+ i += 2
+ continue
+ query_parts.append(args[i])
+ i += 1
+ query = " ".join(query_parts)
      results = pipeline.search(query, limit=limit)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 67 - 70, The loop that parses args uses token value
filtering which removes query tokens equal to "--limit"/"-n" or the numeric
value; instead, when you detect the flag in the for i, a in enumerate(args)
loop, record the exact indices to remove (i and i+1), set limit =
int(args[i+1]), and then rebuild query from args[1:] by skipping tokens at those
recorded indices (or build a new list via enumerate and skip matching original
indices) so you exclude the flag and its value by position rather than by value;
update the variables args, limit and query accordingly (referencing args, limit,
query, and the for i, a in enumerate(args) loop).
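An alternative to tracking indices by hand is to let `argparse` do the positional parsing; it consumes `--limit`/`-n` together with its value in place, so query tokens that merely look like the limit value survive. A sketch (the helper name `parse_search_args` is hypothetical, not part of `main.py`):

```python
import argparse

def parse_search_args(argv: list[str]) -> tuple[str, int]:
    # argparse consumes --limit/-n and its value positionally, so query
    # tokens that happen to equal the flag's value are left intact.
    parser = argparse.ArgumentParser(prog="main.py search", add_help=False)
    parser.add_argument("query", nargs="+", help="search query words")
    parser.add_argument("--limit", "-n", type=int, default=5)
    ns = parser.parse_args(argv)
    return " ".join(ns.query), ns.limit
```

With this, `python main.py search version 2 --limit 2` yields the query `"version 2"` with `limit=2`, and `argparse` reports a usage error itself when the flag's value is missing or non-numeric.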
elif command == "delete":
    if len(args) < 2:
        print("Usage: python main.py delete <doc_id or url>")
        return
    target = args[1]
    if target.startswith("http"):
        count = pipeline.delete_url(target)
        print(f"Deleted {count} chunks from {target}")
    else:
        ok = pipeline.delete_document(target)
        print(f"Deleted: {ok}")
Add confirmation before destructive deletes.
Unlike `clear`, this command deletes a document or every chunk for a URL immediately. One mistyped identifier becomes irreversible data loss from the CLI.
Suggested fix
elif command == "delete":
if len(args) < 2:
print("Usage: python main.py delete <doc_id or url>")
return
target = args[1]
+ confirm = input(f"Delete indexed data for '{target}'? Type 'yes' to confirm: ")
+ if confirm.strip().lower() != "yes":
+ print("Aborted.")
+ return
if target.startswith("http"):
count = pipeline.delete_url(target)
print(f"Deleted {count} chunks from {target}")
else:
ok = pipeline.delete_document(target)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif command == "delete":
    if len(args) < 2:
        print("Usage: python main.py delete <doc_id or url>")
        return
    target = args[1]
    confirm = input(f"Delete indexed data for '{target}'? Type 'yes' to confirm: ")
    if confirm.strip().lower() != "yes":
        print("Aborted.")
        return
    if target.startswith("http"):
        count = pipeline.delete_url(target)
        print(f"Deleted {count} chunks from {target}")
    else:
        ok = pipeline.delete_document(target)
        print(f"Deleted: {ok}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 101 - 111, The delete branch for command == "delete"
(handling target, pipeline.delete_url and pipeline.delete_document) performs
irreversible deletes without confirmation; modify it to prompt the user to
confirm the destructive action (e.g., "Are you sure you want to delete <target>?
Type 'yes' to confirm:") and only call pipeline.delete_url or
pipeline.delete_document if the user types the explicit confirmation, or allow
an explicit --force flag to skip the prompt; ensure the confirmation text
includes the target so a mistype is obvious and abort (return) if the user does
not confirm.
include README.md
include LICENSE
include pyproject.toml
recursive-include src *.py
Incorrect path will exclude all Python source files from sdist.
The `recursive-include src *.py` directive references a `src/` directory, but per `pyproject.toml` (line 100: `packages = ["web_intelligence"]`), the package is located at `web_intelligence/`, not `src/`. This will cause source distributions to be missing all Python code.
🐛 Proposed fix
include README.md
include LICENSE
include pyproject.toml
-recursive-include src *.py
+recursive-include web_intelligence *.py📝 Committable suggestion
include README.md
include LICENSE
include pyproject.toml
recursive-include web_intelligence *.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@MANIFEST.in` at line 4, Update the MANIFEST.in line that currently reads
"recursive-include src *.py" to point at the actual package directory used in
pyproject.toml (packages = ["web_intelligence"]); replace "src" with
"web_intelligence" so Python sources are included in sdist (i.e., change the
directive to "recursive-include web_intelligence *.py" or equivalent to include
the package's source files).
authors = [
    {name = "Your Name", email = "your.email@example.com"}
]
Update placeholder author information before release.
The author metadata contains placeholder values that should be replaced with actual information.
🔧 Update with actual values
authors = [
- {name = "Your Name", email = "your.email@example.com"}
+ {name = "Actual Name", email = "actual.email@example.com"}
]📝 Committable suggestion
authors = [
    {name = "Actual Name", email = "actual.email@example.com"}
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 8 - 9, Replace the placeholder author metadata
in pyproject.toml by updating the authors table (the authors = [...] entry) to
include the real maintainer details: set the actual name and email for the
package owner/maintainer in the {name = "...", email = "..."} record (or add
multiple author records if needed) so the package metadata no longer contains
"Your Name" or "your.email@example.com".
def search(self, query_vector: List[float], limit: int = 5,
           filter: Optional[Dict] = None,
           min_score: float = 0.0) -> List[Dict]:
    kwargs = {
        "query_embeddings": [query_vector],
        "n_results": limit,
    }
    if filter:
        kwargs["where"] = filter
Rename filter to where_filter to match current callers.
`web_intelligence/optimized_pipeline.py:498-509` now invokes `self.vector_store.search(..., where_filter=where_filter)`. Keeping the deprecated wrapper on `filter=` makes it incompatible with the rest of the package and raises `TypeError: unexpected keyword argument 'where_filter'` as soon as a filtered search is used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_store.py` around lines 29 - 37, The search method
currently defines the parameter named filter which causes TypeError when callers
pass where_filter; rename the parameter and its internal uses from filter to
where_filter in the search method signature and body (function search and its
kwargs population: replace "filter" parameter and kwargs["where"] assignment to
use where_filter) so callers like optimized_pipeline that call search(...,
where_filter=...) work correctly; ensure any internal references to the old name
are updated to where_filter to preserve behavior.
@@ -0,0 +1,32 @@
from typing import List, Dict, Optional, Protocol, runtime_checkable
🧹 Nitpick | 🔵 Trivial
Use built-in generics instead of deprecated typing.List and typing.Dict.
Since the project requires Python ≥3.10 (per `pyproject.toml`), use the built-in `list` and `dict` directly as generic types instead of the deprecated `typing.List` and `typing.Dict`.
♻️ Proposed fix
-from typing import List, Dict, Optional, Protocol, runtime_checkable
+from typing import Optional, Protocol, runtime_checkable
`@runtime_checkable`
class BaseVectorStore(Protocol):
def add(
self,
- vectors: List[List[float]],
- metadatas: List[Dict],
- ids: List[str],
- documents: Optional[List[str]] = None,
+ vectors: list[list[float]],
+ metadatas: list[dict],
+ ids: list[str],
+ documents: Optional[list[str]] = None,
) -> None: ...
def search(
self,
- query_vector: List[float],
+ query_vector: list[float],
limit: int = 5,
- where_filter: Optional[Dict] = None,
+ where_filter: Optional[dict] = None,
min_score: float = 0.0,
- ) -> List[Dict]: ...
+ ) -> list[dict]: ...
- def list_documents(self) -> List[Dict]: ...
+ def list_documents(self) -> list[dict]: ...
- def get_document(self, doc_id: str) -> Optional[Dict]: ...
+ def get_document(self, doc_id: str) -> Optional[dict]: ...
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 1-1: typing.List is deprecated, use list instead
(UP035)
[warning] 1-1: typing.Dict is deprecated, use dict instead
(UP035)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/base.py` at line 1, The import and type
annotations use deprecated typing.List and typing.Dict; update the import line
to remove List and Dict and change all annotations in this module that reference
List[...] and Dict[...] to use the built-in generics list[...] and dict[...]
(keep Optional, Protocol, runtime_checkable from typing as-is); specifically
update the top-level import from "from typing import List, Dict, Optional,
Protocol, runtime_checkable" and any usages of List and Dict in
functions/classes in this file to use list and dict generics instead.
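As a sanity check that the built-in generics work as expected on Python ≥3.10, here is a minimal sketch of such a Protocol with a structural match (the class names `VectorStoreProtocol` and `DummyStore` are illustrative only, not identifiers from this package). Note that `runtime_checkable` `isinstance` checks verify only method presence, not signatures:

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class VectorStoreProtocol(Protocol):
    """Structural interface written with built-in generics (Python >= 3.10)."""

    def add(
        self,
        vectors: list[list[float]],
        metadatas: list[dict],
        ids: list[str],
        documents: Optional[list[str]] = None,
    ) -> None: ...

    def search(
        self,
        query_vector: list[float],
        limit: int = 5,
        where_filter: Optional[dict] = None,
        min_score: float = 0.0,
    ) -> list[dict]: ...


class DummyStore:
    """No inheritance required: matching method names satisfy the Protocol."""

    def add(self, vectors, metadatas, ids, documents=None) -> None:
        pass

    def search(self, query_vector, limit=5, where_filter=None, min_score=0.0) -> list:
        return []
```

`isinstance(DummyStore(), VectorStoreProtocol)` holds because the class structurally matches; `object()` does not.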
formatted = []
for i in range(len(results["ids"][0])):
    score = 1 - results["distances"][0][i]
    if score < min_score:
        continue

    meta = results["metadatas"][0][i]
    text = ""
    if results.get("documents") and results["documents"][0]:
        text = results["documents"][0][i] or ""
    if not text:
        text = meta.get("text", "")

    formatted.append({
        "id": results["ids"][0][i],
        "text": text,
        "source": meta.get("url", ""),
        "score": score,
        "metadata": meta,
    })

return formatted
🧩 Analysis chain
🏁 Script executed:
# First, let's understand the context of this file
cd web_intelligence/vector_stores
cat -n chroma_store.py | head -100
Repository: AOSSIE-Org/WebSift
Length of output: 3861
🏁 Script executed:
# Search for ChromaDB initialization and collection configuration
rg "collection\|Collection" web_intelligence/vector_stores/chroma_store.py -B 2 -A 2
Repository: AOSSIE-Org/WebSift
Length of output: 44
🏁 Script executed:
# Look for any distance metric configuration in the codebase
rg "metric\|distance" web_intelligence/ -i --type py | grep -i "chroma\|metric"
Repository: AOSSIE-Org/WebSift
Length of output: 44
🏁 Script executed:
# Check how the ChromaDB client/collection is created
rg "chroma\|query" web_intelligence/vector_stores/chroma_store.py -B 3 -A 3
Repository: AOSSIE-Org/WebSift
Length of output: 44
🌐 Web query:
ChromaDB distance metrics L2 cosine IP how query returns distances
💡 Result:
In ChromaDB, collection.query(..., include=["distances"]) returns a distance value per returned neighbor (per query): results["distances"][query_idx][rank_idx]. Smaller = closer / more similar for all metrics (Chroma ranks by ascending “distance”). (tessl.io)
Which distance is returned?
The distance function is fixed per collection via the index space (hnsw:space / configuration={"hnsw":{"space": ...}}), not per-query. Possible values: l2, cosine, ip. (cookbook.chromadb.dev)
- `l2` (default): squared Euclidean distance. The returned value is the L2 distance squared (so it can look "large"; it is not square-rooted). (cookbook.chromadb.dev) If you want the Euclidean distance: `euclidean = sqrt(distance)`.
- `cosine`: cosine distance. The returned value is `distance = 1 - cosine_similarity`. (tessl.io) Convert back: `cosine_similarity = 1 - distance`.
- `ip`: inner-product "distance". The returned value is based on the inner product, with Chroma treating it as a distance by using the negative sign (so that "smaller is better" still works): `distance = -(q · x)`. (tessl.io) Convert back: `dot_product = -distance`.
Practical notes
- Distances are not normalized to 0–1 (except cosine distance, which is typically in `[0, 2]` depending on vector direction). (tessl.io)
- If you need "similarity scores" for UI thresholds, convert as above (cosine: `1 - d`, ip: `-d`; l2: often a custom transform). (tessl.io)
Citations:
- 1: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 2: https://cookbook.chromadb.dev/core/collections/?utm_source=openai
- 3: https://cookbook.chromadb.dev/core/collections/?utm_source=openai
- 4: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 5: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 6: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
- 7: https://tessl.io/registry/tessl/pypi-chromadb/1.0.0/files/docs/queries.md
Score calculation is incorrect for ChromaDB's default L2 distance metric.
The formula `score = 1 - distance` on line 63 produces incorrect similarity scores because:
- The collection is created without specifying a distance metric (line 26), defaulting to L2 (squared Euclidean distance)
- For L2, distance values can be arbitrarily large; `1 - distance` produces nonsensical negative or near-zero scores unrelated to similarity
- The formula only works correctly for cosine distance (`distance = 1 - cosine_similarity`); it fails for IP (inner product) and L2
The `min_score` threshold comparison on line 64 is also unreliable as a result.
Retrieve the collection's configured distance function and apply the appropriate transformation:
- Cosine: `score = 1 - distance` (current formula works)
- IP: `score = -distance` (negate to get similarity)
- L2: use a different scoring approach (e.g., `score = 1 / (1 + distance)` or a sigmoid transformation)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/chroma_store.py` around lines 61 - 82, The
current score calculation in the results formatting loop (where you compute
score = 1 - distance and compare to min_score) is wrong for non-cosine metrics;
update the code that builds formatted (and uses results and min_score) to first
read the collection's configured distance metric from the Chroma collection
object, then branch the distance→similarity mapping: for "cosine" keep score = 1
- distance, for "ip" use score = -distance (negate), and for "l2" use a bounded
transform such as score = 1 / (1 + distance) (or a sigmoid) before applying the
min_score filter and appending to formatted; locate this logic in the function
that iterates over results["ids"][0] and replace the single formula with the
metric-aware mapping.
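The metric-aware mapping this finding asks for can be sketched as a small helper. The function name `distance_to_score` is hypothetical, and reading the space via `collection.metadata.get("hnsw:space", "l2")` is an assumption about how the collection was created:

```python
def distance_to_score(distance: float, space: str) -> float:
    """Map a Chroma distance to a higher-is-better similarity score."""
    if space == "cosine":
        return 1.0 - distance          # cosine distance = 1 - cosine similarity
    if space == "ip":
        return -distance               # Chroma returns the negated inner product
    if space == "l2":
        return 1.0 / (1.0 + distance)  # bounded in (0, 1]; distance is squared L2
    raise ValueError(f"unknown hnsw space: {space!r}")
```

Only the cosine branch keeps `min_score` thresholds on a 0–1 similarity scale; for `ip` and `l2` the thresholds would need to be calibrated against the chosen transform.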
def get_document(self, doc_id: str) -> Optional[Dict]:
    results = self.collection.get(where={"doc_id": doc_id})
    if not results["ids"]:
        return None

    chunks = []
    for i, id_ in enumerate(results["ids"]):
        text = ""
        if results.get("documents") and results["documents"][i]:
            text = results["documents"][i]
        if not text:
            text = results["metadatas"][i].get("text", "")

        chunks.append({
            "id": id_,
            "text": text,
            "chunk_index": results["metadatas"][i].get("chunk_index", 0),
            "word_count": results["metadatas"][i].get("word_count", 0),
        })
    chunks.sort(key=lambda c: c["chunk_index"])

    meta = results["metadatas"][0]
    return {
        "doc_id": doc_id,
        "url": meta.get("url", ""),
        "title": meta.get("title", "Untitled"),
        "indexed_at": meta.get("indexed_at", ""),
        "chunk_count": len(chunks),
        "full_text": "\n\n".join(c["text"] for c in chunks),
        "chunks": chunks,
    }
🧹 Nitpick | 🔵 Trivial
Incorrect indexing for documents in get_document.
On line 114, `results["documents"][i]` treats `documents` as a flat list indexed by `i`. ChromaDB's `collection.get()` does return `documents` as a flat list (unlike `query()`, which returns nested lists), but the check `results.get("documents") and results["documents"][i]` may still fail if `documents` is `None` or if `i` is out of bounds when documents weren't stored.
The safer pattern used in search (checking for nested structure) is inconsistent here. Consider unifying the access pattern:
♻️ Proposed fix for consistent document access
for i, id_ in enumerate(results["ids"]):
text = ""
- if results.get("documents") and results["documents"][i]:
- text = results["documents"][i]
+ docs = results.get("documents")
+ if docs and i < len(docs) and docs[i]:
+ text = docs[i]
if not text:
text = results["metadatas"][i].get("text", "")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/chroma_store.py` around lines 106 - 136,
get_document currently indexes results["documents"][i] directly which can raise
or return wrong values when documents is None or has a different shape; update
get_document (the collection.get handling logic) to mirror the safe access used
in search: extract docs = results.get("documents") and if docs is truthy
determine whether it's nested (docs[0] is list) or flat, pick the correct
element for each id and also guard against index errors (fall back to
results["metadatas"][i].get("text", "") when docs is None or out-of-range);
ensure you still build chunks from id, text, chunk_index and word_count from
results["metadatas"] and sort by chunk_index before returning the assembled
document dict.
def add(
    self,
    vectors: List[List[float]],
    metadatas: List[Dict],
    ids: List[str],
    documents: Optional[List[str]] = None,
) -> None:
    new_vecs = np.array(vectors, dtype=np.float32)

    if self._vectors is None:
        self._vectors = new_vecs
    else:
        self._vectors = np.vstack([self._vectors, new_vecs])

    self._ids.extend(ids)
    self._metadatas.extend(metadatas)
    self._documents.extend(documents or [""] * len(ids))
    self._save()
Validate batch sizes and vector dimensions before mutating the store.
`add()` never checks that `vectors`, `metadatas`, `ids`, and `documents` have matching lengths, or that the new vectors match the existing embedding dimension. A shorter `documents` list will break later indexed reads, and a dimension mismatch will fail in `vstack()`. Reject malformed batches up front so the store cannot enter an inconsistent state.
Suggested fix
def add(
self,
vectors: List[List[float]],
metadatas: List[Dict],
ids: List[str],
documents: Optional[List[str]] = None,
) -> None:
+ if not (len(vectors) == len(metadatas) == len(ids)):
+ raise ValueError("vectors, metadatas, and ids must have the same length")
+ if documents is not None and len(documents) != len(ids):
+ raise ValueError("documents and ids must have the same length")
+
new_vecs = np.array(vectors, dtype=np.float32)
+ if new_vecs.ndim != 2:
+ raise ValueError("vectors must be a 2D array of embeddings")
+ if self._vectors is not None and new_vecs.shape[1] != self._vectors.shape[1]:
+ raise ValueError("embedding dimension mismatch")
if self._vectors is None:
self._vectors = new_vecs
else:
        self._vectors = np.vstack([self._vectors, new_vecs])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/numpy_store.py` around lines 54 - 71, Validate
batch sizes and vector dimensions at the start of add() before modifying state:
ensure len(vectors) == len(ids) == len(metadatas), and if documents is provided
ensure len(documents) == len(ids); convert vectors to new_vecs (np.array) and if
self._vectors is not None check new_vecs.ndim == 2 and new_vecs.shape[1] ==
self._vectors.shape[1] (raise ValueError with a clear message on mismatch) so
you never call vstack() or extend lists when the batch is malformed; only after
all checks pass perform vstack/assignment to self._vectors, extend self._ids,
self._metadatas, self._documents, and call _save().
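The key property of the fix is ordering: every validation must run before any state is mutated, so a rejected batch leaves the store untouched. A minimal, self-contained sketch of that pattern (the `_StoreSketch` class is a hypothetical stand-in for illustration, not the real `NumpyStore`):

```python
import numpy as np
from typing import Dict, List, Optional


class _StoreSketch:
    """Hypothetical stand-in showing validate-before-mutate in add()."""

    def __init__(self) -> None:
        self._vectors: Optional[np.ndarray] = None
        self._ids: List[str] = []
        self._metadatas: List[Dict] = []
        self._documents: List[str] = []

    def add(self, vectors, metadatas, ids, documents=None) -> None:
        # All checks run first; nothing below mutates state until they pass.
        if not (len(vectors) == len(metadatas) == len(ids)):
            raise ValueError("vectors, metadatas, and ids must have the same length")
        if documents is not None and len(documents) != len(ids):
            raise ValueError("documents and ids must have the same length")

        new_vecs = np.array(vectors, dtype=np.float32)
        if new_vecs.ndim != 2:
            raise ValueError("vectors must be a 2D array of embeddings")
        if self._vectors is not None and new_vecs.shape[1] != self._vectors.shape[1]:
            raise ValueError("embedding dimension mismatch")

        # Only now is it safe to mutate.
        self._vectors = (
            new_vecs if self._vectors is None
            else np.vstack([self._vectors, new_vecs])
        )
        self._ids.extend(ids)
        self._metadatas.extend(metadatas)
        self._documents.extend(documents or [""] * len(ids))


store = _StoreSketch()
store.add([[0.1, 0.2], [0.3, 0.4]], [{}, {}], ["a", "b"])

rejected = False
try:
    store.add([[0.1, 0.2, 0.3]], [{}], ["c"])  # 3-dim vector vs 2-dim store
except ValueError:
    rejected = True  # batch refused, store state unchanged
```

Because the dimension mismatch is raised before any `extend()` or `vstack()`, the failed call leaves `_ids` and `_vectors` exactly as they were.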
if cast == bool:
    return str(val).lower() in ("1", "true", "yes", "on")
🧹 Nitpick | 🔵 Trivial
Use is for type comparison instead of ==.
Per PEP8 and Python idioms, type comparisons should use is rather than ==.
♻️ Proposed fix
- if cast == bool:
+ if cast is bool:
    return str(val).lower() in ("1", "true", "yes", "on")
🧰 Tools
🪛 Ruff (0.15.4)
[error] 27-27: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/config.py` around lines 27 - 28, The comparison currently
uses "cast == bool" which should use identity testing; update the type check in
the branch that handles booleans to use "cast is bool" instead of "==" so the
condition reliably detects the bool type for the conversion of val (the code
path referencing cast and val).
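The special-case branch exists because calling the cast directly would be wrong for booleans: `bool("false")` is `True`, since any non-empty string is truthy. A small sketch showing the idiomatic `is` check in context (`_env_sketch` is a hypothetical stand-in for the library's `_env` helper):

```python
import os
from typing import Any, Callable


def _env_sketch(key: str, default: Any, cast: Callable = str) -> Any:
    """Hypothetical stand-in for the library's _env helper."""
    val = os.environ.get(key)
    if val is None:
        return default
    # `is` compares identity: `bool` is a single type object, so this is
    # the idiomatic check and sidesteps types that override __eq__.
    if cast is bool:
        return str(val).lower() in ("1", "true", "yes", "on")
    return cast(val)


os.environ["WI_DEMO_FLAG"] = "yes"
os.environ["WI_DEMO_PORT"] = "8080"
flag = _env_sketch("WI_DEMO_FLAG", False, bool)      # parsed, not bool("yes")
port = _env_sketch("WI_DEMO_PORT", 8000, int)        # cast applied normally
missing = _env_sketch("WI_DEMO_MISSING", 8000, int)  # falls back to default
```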
@dataclass
class ServerConfig:
    host: str = _env("WI_SERVER_HOST", "0.0.0.0")
    port: int = _env("WI_SERVER_PORT", 8000, int)
    reload: bool = _env("WI_SERVER_RELOAD", False, bool)
Consider defaulting to 127.0.0.1 for local-only deployments.
The default 0.0.0.0 binds to all network interfaces, making the server accessible from external hosts. Given the PR description emphasizes "runs 100% locally," a default of 127.0.0.1 (localhost only) would be more secure. Users who need external access can override via WI_SERVER_HOST.
🛡️ Proposed safer default
@dataclass
class ServerConfig:
- host: str = _env("WI_SERVER_HOST", "0.0.0.0")
+ host: str = _env("WI_SERVER_HOST", "127.0.0.1")
port: int = _env("WI_SERVER_PORT", 8000, int)
    reload: bool = _env("WI_SERVER_RELOAD", False, bool)
🧰 Tools
🪛 Ruff (0.15.4)
[error] 76-76: Possible binding to all interfaces
(S104)
[error] 78-78: Boolean positional value in function call
(FBT003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/config.py` around lines 74 - 78, The ServerConfig defaults
currently bind to all interfaces; change the default host in the ServerConfig
dataclass (the host field using _env("WI_SERVER_HOST", "0.0.0.0")) to
"127.0.0.1" so the server binds to localhost by default while still allowing
overrides via WI_SERVER_HOST; update the host default value referenced in the
ServerConfig class to _env("WI_SERVER_HOST", "127.0.0.1").
source_list = "\n".join(
    f" - [{s['title']}]({s['url']})" for s in self.sources
)
Potential KeyError if source dict is missing keys.
Line 42 accesses s['title'] and s['url'] directly without .get(). If a source dict is malformed or missing these keys, a KeyError will be raised.
🛡️ Proposed defensive fix
source_list = "\n".join(
- f" - [{s['title']}]({s['url']})" for s in self.sources
+ f" - [{s.get('title', 'Untitled')}]({s.get('url', '')})" for s in self.sources
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/context_formatter.py` around lines 41 - 43, The
list-comprehension building source_list can raise KeyError because it accesses
s['title'] and s['url'] directly; update the comprehension in the source_list
construction (referencing self.sources and source_list) to defensively access
values (e.g., s.get('title', '<untitled>') and s.get('url', '<no-url>')) or
filter out/skip malformed entries (e.g., include only if 'title' and 'url' in s)
so missing keys don't raise exceptions and the formatter yields safe fallback
text or omits bad sources.
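The difference is easy to demonstrate in isolation: with `dict.get()` and fallbacks, a malformed source entry degrades to placeholder text instead of aborting the whole formatting pass (the sample data below is illustrative, not from the codebase):

```python
sources = [
    {"title": "Docs", "url": "https://example.com/docs"},
    {"url": "https://example.com/no-title"},  # malformed: 'title' missing
]

# s['title'] would raise KeyError on the second entry; .get() with
# defaults keeps the join total over every source, good or bad.
source_list = "\n".join(
    f" - [{s.get('title', 'Untitled')}]({s.get('url', '')})" for s in sources
)
```

Filtering out entries missing both keys is an equally valid choice; the defensive access just preserves the "one line per source" invariant.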
block = (
    f"<document index=\"{i}\">\n"
    f"  <source>{url}</source>\n"
    f"  <title>{title}</title>\n"
    f"  <relevance>{score:.2f}</relevance>\n"
    f"  <content>{text}</content>\n"
    f"</document>"
)
Content is not XML-escaped, which may produce malformed output.
If chunk text contains characters like <, >, or &, the generated XML-like structure will be malformed. While this is primarily for LLM consumption rather than strict XML parsing, it could cause issues with some LLM prompting strategies that expect well-formed markup.
🛡️ Optional fix using html.escape
+from html import escape
+
...
block = (
f"<document index=\"{i}\">\n"
- f" <source>{url}</source>\n"
- f" <title>{title}</title>\n"
+ f" <source>{escape(url)}</source>\n"
+ f" <title>{escape(title)}</title>\n"
f" <relevance>{score:.2f}</relevance>\n"
- f" <content>{text}</content>\n"
+ f" <content>{escape(text)}</content>\n"
f"</document>"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/context_formatter.py` around lines 155 - 162, The XML-like
block built in the variable block uses unescaped values (url, title, text) which
can break the markup; update the code that constructs block in
context_formatter.py to XML-escape/HTML-escape those interpolated fields (at
least title and text, and url if it can contain special chars) before formatting
(e.g., use html.escape or xml.sax.saxutils.escape) and add the appropriate
import at the top; ensure the formatted relevance/score remains numeric and not
escaped.
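A quick standalone check of the escaping behavior (sample values are illustrative): `html.escape` rewrites `&`, `<`, `>`, and quotes as entities, so interpolated chunk text can no longer break the surrounding tags.

```python
from html import escape

url = "https://example.com/search?q=a&b"
title = "Operators: < <= & >"
text = "Use a < b && b > c"

# With escaping, the pseudo-XML stays well-formed even when the
# content itself contains markup characters.
block = (
    f"<source>{escape(url)}</source>\n"
    f"<title>{escape(title)}</title>\n"
    f"<content>{escape(text)}</content>"
)
```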
| "url": c.url, | ||
| "doc_id": doc_id, | ||
| "word_count": c.word_count, | ||
| "indexed_at": datetime.now().isoformat(), |
Use timezone-aware datetime for indexed_at timestamp.
datetime.now() returns a naive datetime without timezone information. This can cause issues when comparing timestamps across different systems or when the data is used in timezone-aware contexts.
♻️ Proposed fix
+from datetime import datetime, timezone
+
...
- "indexed_at": datetime.now().isoformat(),
+ "indexed_at": datetime.now(timezone.utc).isoformat(),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "indexed_at": datetime.now().isoformat(), | |
| from datetime import datetime, timezone | |
| ... | |
| "indexed_at": datetime.now(timezone.utc).isoformat(), |
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 279-279: datetime.datetime.now() called without a tz argument
(DTZ005)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/optimized_pipeline.py` at line 279, The "indexed_at"
timestamp is currently created with datetime.now().isoformat(), which is naive;
update the assignment that sets "indexed_at" to produce a timezone-aware ISO
timestamp (e.g., datetime.now(timezone.utc).isoformat() or
datetime.utcnow().replace(tzinfo=timezone.utc).isoformat()) and ensure timezone
is imported from datetime so the stored value includes UTC offset.
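The practical difference shows up in the serialized value: a naive timestamp carries no offset, so two systems in different zones cannot compare it reliably, while the aware form embeds `+00:00` in the ISO string.

```python
from datetime import datetime, timezone

naive = datetime.now()              # tzinfo is None: ambiguous across hosts
aware = datetime.now(timezone.utc)  # carries an explicit UTC offset

stamp = aware.isoformat()  # ends with "+00:00", e.g. "2024-01-01T12:00:00+00:00"
```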
async def retrieve_async(
    self,
    query: str,
    limit: Optional[int] = None,
    output_format: Literal["plain", "numbered", "structured"] = "numbered",
    max_context_words: Optional[int] = None,
    where_filter: Optional[Dict] = None,
    min_score: float = 0.0,
) -> RetrievedContext:
    return self.retrieve(
        query, limit=limit, output_format=output_format,
        max_context_words=max_context_words,
        where_filter=where_filter, min_score=min_score,
    )
🧹 Nitpick | 🔵 Trivial
retrieve_async is not truly asynchronous.
The retrieve_async() method simply calls the synchronous retrieve() method. This provides API consistency but doesn't yield async benefits. Consider adding a docstring noting this, or making it truly async if the underlying operations support it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/optimized_pipeline.py` around lines 533 - 546, The
retrieve_async method currently just calls the synchronous retrieve method, so
it's not actually asynchronous; either document this behavior or make it truly
async: update retrieve_async (method retrieve_async) to either add a clear
docstring stating it delegates to synchronous retrieve and does not perform I/O
concurrently, or change its implementation to run the blocking retrieve on a
threadpool (e.g., use asyncio.get_running_loop().run_in_executor(None, lambda:
self.retrieve(...)) and await the result) or refactor the underlying retrieval
into an async helper (e.g., async _retrieve_impl) and await that; ensure
signatures and return type (RetrievedContext) remain correct and preserve
parameters (query, limit, output_format, max_context_words, where_filter,
min_score).
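The `run_in_executor` option mentioned above can be sketched independently of the pipeline: the blocking call runs on the default threadpool, so the event loop stays free while it executes (`retrieve_sync` is a hypothetical stand-in for the real `retrieve`).

```python
import asyncio


def retrieve_sync(query: str) -> str:
    """Hypothetical stand-in for the blocking retrieve() call."""
    return f"results for {query!r}"


async def retrieve_async(query: str) -> str:
    # Off-load the synchronous call to the default threadpool executor
    # so the coroutine yields control while the work runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, retrieve_sync, query)


result = asyncio.run(retrieve_async("vector search"))
```

If the underlying vector-store and embedder calls ever grow native async variants, an `async _retrieve_impl` would be the cleaner refactor; the executor approach is the low-effort bridge in the meantime.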
except Exception as e:
    raise HTTPException(status_code=500, detail=str(e))
Improve exception handling with proper chaining.
The broad Exception catch loses the exception chain. Use raise ... from e to preserve the traceback for debugging.
♻️ Proposed fix
except Exception as e:
- raise HTTPException(status_code=500, detail=str(e))
+ raise HTTPException(status_code=500, detail=str(e)) from e
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 158-158: Do not catch blind exception: Exception
(BLE001)
[warning] 159-159: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 158 - 159, The except block that
currently does "except Exception as e: raise HTTPException(status_code=500,
detail=str(e))" loses the original traceback; update that raise to use exception
chaining by raising the HTTPException from the caught exception (i.e., "raise
HTTPException(... ) from e") in the same except block so the original exception
context is preserved for debugging; locate the except in server.py where the
HTTPException is raised and apply this change to the handler/function containing
that except.
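What `from e` buys you is visible on the exception object itself: the original error is attached as `__cause__`, so tracebacks print the real failure under the wrapper instead of the misleading "during handling of the above exception" message. A minimal illustration (plain exceptions, no FastAPI dependency):

```python
def parse(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as e:
        # `from e` links the original ValueError as __cause__, preserving
        # the full chain for debugging instead of discarding it.
        raise RuntimeError(f"bad input: {raw!r}") from e


cause = None
try:
    parse("not-a-number")
except RuntimeError as err:
    cause = err.__cause__  # the original ValueError
```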
@app.delete("/documents/by-url")
def delete_by_url(req: DeleteURLRequest):
    pipeline = get_pipeline()
    count = pipeline.delete_url(req.url)
    return {"deleted_chunks": count, "url": req.url}
🧹 Nitpick | 🔵 Trivial
DELETE with request body may cause client compatibility issues.
While HTTP allows request bodies in DELETE, some HTTP clients and proxies don't handle them well. Consider using a query parameter or POST method for better compatibility.
♻️ Alternative using query parameter
-@app.delete("/documents/by-url")
-def delete_by_url(req: DeleteURLRequest):
+@app.delete("/documents/by-url")
+def delete_by_url(url: str):
pipeline = get_pipeline()
- count = pipeline.delete_url(req.url)
- return {"deleted_chunks": count, "url": req.url}
+ count = pipeline.delete_url(url)
+ return {"deleted_chunks": count, "url": url}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 187 - 191, The current delete_by_url
endpoint uses a DELETE method with a request body (DeleteURLRequest), which may
break clients; modify the endpoint to avoid a DELETE body by either (A) changing
it to accept the URL as a query parameter (keep route "/documents/by-url",
change signature to accept url: str and call pipeline.delete_url(url)) or (B)
switch the route to POST (e.g., "/documents/delete-by-url") and keep accepting
DeleteURLRequest in the body; update the `@app` decorator and the function
signature (delete_by_url and any client calls) accordingly and ensure the
returned JSON stays {"deleted_chunks": count, "url": url}.
def start_server(host: str = None, port: int = None, reload: bool = None):
    import uvicorn
    config = default_config().server
    uvicorn.run(
        "web_intelligence.server:app",
        host=host or config.host,
        port=port or config.port,
        reload=reload if reload is not None else config.reload,
    )
🧹 Nitpick | 🔵 Trivial
Add explicit type hints for optional parameters.
The parameters use None defaults without explicit Optional type hints, which violates PEP 484.
♻️ Proposed fix
-def start_server(host: str = None, port: int = None, reload: bool = None):
+def start_server(
+ host: Optional[str] = None,
+ port: Optional[int] = None,
+ reload: Optional[bool] = None,
+) -> None:
    import uvicorn
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 207-207: Boolean-typed positional argument in function definition
(FBT001)
[warning] 207-207: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/server.py` around lines 207 - 215, Update the start_server
signature to use explicit Optional type hints: import Optional from typing and
change the parameters in start_server (host, port, reload) to Optional[str],
Optional[int], and Optional[bool] respectively, keeping the same None defaults
and existing logic that falls back to default_config().server values; reference
the start_server function and its parameters when making this change.
def _load(self) -> None:
    with open(self.persist_path, "rb") as f:
        data = pickle.load(f)
🧩 Analysis chain
🌐 Web query:
Python pickle.load() security risk untrusted user-controlled files arbitrary code execution
💡 Result:
pickle.load() / pickle.loads() is unsafe for untrusted or user-controlled input because a crafted pickle can cause arbitrary code execution during unpickling (this is a property of the format/design, not just a “bug”). The official Python docs explicitly warn: malicious pickle data “will execute arbitrary code during unpickling” and you should “never unpickle data received from an untrusted or unauthenticated source.” [1]
This is a classic insecure deserialization issue (CWE-502). If an attacker can control or tamper with the pickle bytes (e.g., uploaded files, data read from a DB/cache/message queue, downloaded “model” artifacts), they may be able to run code as the process user. [2][3]
Mitigations (typical guidance):
- Do not use pickle with untrusted input; use safer formats like JSON (and validate schemas) when possible. [1]
- If you must use pickle, only load data that is trusted and integrity-protected (e.g., authenticated/signature/HMAC so it can’t be modified). [1][2]
- Treat any “wrapper” APIs that use pickle under the hood (e.g., some ML/model loaders) as the same risk class. [1][3]
Sources
- Python docs: pickle — security warning about arbitrary code execution and "never unpickle … untrusted" data. [1]
- MITRE CWE-502: Deserialization of Untrusted Data (includes Python pickle as an example pattern). [2]
- CodeQL guidance: unsafe deserialization (pickle is “inherently unsafe” with untrusted data). [3]
🏁 Script executed:
# Examine the numpy_store.py file to understand context
cat -n web_intelligence/vector_stores/numpy_store.py | head -100
Repository: AOSSIE-Org/WebSift
Length of output: 3868
🏁 Script executed:
# Check for persist_path usage and configuration
rg "persist_path" --context=3Repository: AOSSIE-Org/WebSift
Length of output: 4792
🏁 Script executed:
# Look for any validation or sanitization
rg "_load|_save|pickle" web_intelligence/vector_stores/numpy_store.py -A 5 -B 2Repository: AOSSIE-Org/WebSift
Length of output: 1677
Replace pickle.load() with a secure serialization format or add integrity verification.
pickle.load() on persist_path is unsafe because a malicious or tampered pickle file will execute arbitrary code during unpickling. Since persist_path is user-configurable, this turns the store file into a code-execution sink. Switch to JSON (or another safe format) with schema validation, or if pickle is required, cryptographically verify the file with HMAC/signatures before loading.
🧰 Tools
🪛 Ruff (0.15.4)
[error] 48-48: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web_intelligence/vector_stores/numpy_store.py` around lines 46 - 48, The
_load method currently uses pickle.load on persist_path which is unsafe; replace
pickle-based deserialization with a safe format (e.g., JSON/MsgPack with
explicit schema validation) or, if keeping pickle, add cryptographic integrity
and authenticity checks: compute/verify an HMAC or digital signature for the
file before calling pickle.load, fail loudly on mismatch, and document the key
storage. Update any companion save logic (e.g., where the store is written) to
emit the safe format or to sign the payload so _load can verify before
deserializing.
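If pickle is kept, the HMAC variant suggested above can be sketched as follows: sign the pickled payload on save, verify the tag with a constant-time compare before any `pickle.loads()` on load, and refuse tampered files outright. The key below is a placeholder; real key storage and rotation are out of scope for this sketch.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"demo-key-change-me"  # placeholder: load from secure storage


def save_signed(obj) -> bytes:
    """Prefix the pickled payload with a 32-byte SHA-256 HMAC tag."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed(blob: bytes):
    """Verify the tag before deserializing; never unpickle unverified bytes."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed; refusing to unpickle")
    return pickle.loads(payload)


blob = save_signed({"ids": ["a"], "vectors": [[0.1, 0.2]]})
restored = load_signed(blob)

# Flip one byte of the payload: verification fails before pickle.loads runs.
tampered = blob[:32] + bytes([blob[32] ^ 0xFF]) + blob[33:]
blocked = False
try:
    load_signed(tampered)
except ValueError:
    blocked = True
```

This protects against tampering with the store file, but not against an attacker who also holds the key; switching the on-disk format to JSON (metadata) plus `np.save` (vectors) removes the deserialization sink entirely.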
Websift
Fetch the internet, serve it to any AI.
Web Intelligence is a Python library that crawls websites (or searches the web for you), extracts the useful text, and turns it into clean, formatted context that any LLM can read — OpenAI, Ollama, Groq, Anthropic, LangChain, or your own code. No API key needed for the core library.
What does it do?