Tavily Web Search: Auth, Retries & Fallback #28
Conversation
- Added web search configuration variables to .env.example
- Updated .gitignore to include a new line for clarity
- Refactored import statement in indexing_router.py for consistency
- Included web answering router in main.py for improved routing structure
- Updated requirements.txt to include trafilatura and ensure httpx uses HTTP/2
…ngs for search configuration
…ror handling and HTML extraction
Walkthrough
Adds a new asynchronous WebFetchService and FetchedDoc dataclass that concurrently fetches URLs with semaphore-based concurrency limits, extracts title/site/text (trafilatura optional, fallback sanitizer), checks URL safety, logs timings/errors, and provides the fetch_urls and fetch_search_results public methods.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Caller
participant WFS as WebFetchService
participant Sem as Semaphore
participant HTTP as Remote HTTP
participant Ext as Extractor (trafilatura / fallback)
Caller->>WFS: fetch_urls(urls, timeout)
WFS->>Sem: acquire slot
loop for each URL (concurrent tasks)
WFS->>HTTP: GET url (timeout, headers)
alt Success (HTML)
HTTP-->>WFS: HTML response
WFS->>Ext: extract title & text
Ext-->>WFS: title, text
WFS-->>Caller: FetchedDoc(url, title, site_name, text, fetch_ms)
else Failure / non-HTML / timeout
HTTP-->>WFS: error/none
WFS-->>Caller: FetchedDoc(url, text="", fetch_ms)
end
end
WFS->>Sem: release slot
note right of WFS: logs timings, counts, fallbacks
sequenceDiagram
autonumber
actor Caller
participant WFS as WebFetchService
participant SR as Search Results
participant Merge as Snippet Merge
Caller->>WFS: fetch_search_results(results, preserve_snippets)
WFS->>SR: extract URLs + optional snippets
WFS->>WFS: fetch_urls(extracted_urls)
alt preserve_snippets and fetch failures
WFS->>Merge: fill missing text with snippets
Merge-->>Caller: FetchedDoc list (with snippets)
else
WFS-->>Caller: FetchedDoc list
end
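The first diagram maps onto a standard asyncio pattern: semaphore-bounded concurrent GETs that always return a lightweight result object, empty on failure. The snippet below is a minimal, hedged sketch of that pattern, not the PR's actual implementation; the FetchedDoc field names mirror the walkthrough, everything else (defaults, error handling, lack of extraction) is illustrative.
import asyncio
import time
from dataclasses import dataclass
from typing import List, Optional

import httpx

@dataclass
class FetchedDoc:
    # Field names follow the walkthrough/diagram; defaults are assumptions.
    url: str
    title: Optional[str] = None
    site_name: Optional[str] = None
    text: str = ""
    fetch_ms: int = 0

async def fetch_urls(urls: List[str], timeout: float = 10.0, max_concurrency: int = 5) -> List[FetchedDoc]:
    semaphore = asyncio.Semaphore(max_concurrency)  # bound concurrent requests

    async def fetch_one(client: httpx.AsyncClient, url: str) -> FetchedDoc:
        doc = FetchedDoc(url=url)
        start = time.time()
        async with semaphore:  # acquire a slot before hitting the network
            try:
                resp = await client.get(url, timeout=timeout)
                resp.raise_for_status()
                doc.text = resp.text  # the real service extracts title/text here
            except httpx.HTTPError:
                pass  # failure path: return the doc with empty text
        doc.fetch_ms = int((time.time() - start) * 1000)
        return doc

    async with httpx.AsyncClient(follow_redirects=True) as client:
        return list(await asyncio.gather(*(fetch_one(client, u) for u in urls)))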
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (8)
services/web_fetch_service.py (8)
53-55: Reuse a single AsyncClient for pooling and HTTP/2; expose close.
Per-request clients defeat pooling and add overhead.
    self.max_concurrency = max_concurrency or settings.max_fetch_concurrency
    self.semaphore = asyncio.Semaphore(self.max_concurrency)
    # Common HTML headers for the fetch requests
    self.headers = {
@@
-        async with httpx.AsyncClient(timeout=timeout_seconds, follow_redirects=True) as client:
-            response = await client.get(url, headers=self.headers)
+        # Reuse a single pooled client
+        response = await self._client.get(url, timeout=timeout_seconds)
Add outside the selected range:
# in __init__ after headers
self._client = httpx.AsyncClient(
    http2=True,
    follow_redirects=True,
    headers=self.headers,
    limits=httpx.Limits(
        max_connections=self.max_concurrency,
        max_keepalive_connections=self.max_concurrency,
    ),
)

# add to class
async def aclose(self) -> None:
    await self._client.aclose()

Also applies to: 163-165
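If the service owns the pooled client, something has to close it on shutdown. A hedged caller-side sketch of that lifecycle (constructor and method parameters here are assumptions based on this review, not confirmed signatures):
import asyncio

from services.web_fetch_service import WebFetchService

async def main() -> None:
    # Hypothetical caller-side lifecycle for a service owning one AsyncClient.
    service = WebFetchService(max_concurrency=5)
    try:
        docs = await service.fetch_urls(["https://example.com"])
        print(len(docs), "documents fetched")
    finally:
        await service.aclose()  # release pooled connections on shutdown

# asyncio.run(main())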
167-174: Skip non-HTML content early.
Avoid passing binary or JSON to HTML extractors.
-            html = response.text
+            ct = response.headers.get("Content-Type", "")
+            if "text/html" not in ct and "application/xhtml+xml" not in ct:
+                fetch_ms = int((time.time() - start_time) * 1000)
+                fetched_doc.fetch_ms = fetch_ms
+                logger.info("Skipping non-HTML content from %s (type=%s)", url, ct)
+                return fetched_doc
+            html = response.text
183-185: Use parameterized logs and exception logging (fixes RUF010/TRY400).
Avoid f-strings in logs; include tracebacks where useful.
- logger.info(f"Fetched {url} in {fetch_ms}ms, extracted {len(text)} chars") + logger.info("Fetched %s in %dms, extracted %d chars", url, fetch_ms, len(text)) @@ - logger.warning(f"HTTP error {status} fetching {url}: {str(e)}") + logger.warning("HTTP error %s fetching %s: %s", status, url, e) @@ - logger.warning(f"Request error fetching {url}: {str(e)}") + logger.warning("Request error fetching %s: %s", url, e) @@ - except Exception as e: - logger.warning(f"Error fetching {url}: {str(e)}") + except Exception: + logger.exception("Error fetching %s", url) @@ - logger.warning(f"Failed to fetch {url} after {fetch_ms}ms") + logger.warning("Failed to fetch %s after %dms", url, fetch_ms) @@ - except Exception as e: - logger.error(f"Error in fetch_urls: {str(e)}") + except Exception: + logger.exception("Error in fetch_urls") @@ - if isinstance(result, Exception): - logger.warning(f"Exception during fetch: {str(result)}") + if isinstance(result, Exception): + logger.warning("Exception during fetch: %r", result)Also applies to: 188-192, 197-198, 221-223, 228-231
108-116: Unescape HTML entities in titles.
Improves output fidelity.
-        if title_match:
-            title = title_match.group(1).strip()
-            # Clean up title
-            title = re.sub(r'\s+', ' ', title)
-            return title
+        if title_match:
+            title = re.sub(r'\s+', ' ', title_match.group(1).strip())
+            return html_lib.unescape(title)

Add import if missing:
+import html as html_lib
95-106: Use stdlib entity decoding instead of ad-hoc replacements.
Less brittle and more complete.
-        # Replace entities
-        html = re.sub(r'&nbsp;', ' ', html)
-        html = re.sub(r'&amp;', '&', html)
-        html = re.sub(r'&lt;', '<', html)
-        html = re.sub(r'&gt;', '>', html)
-        html = re.sub(r'&quot;', '"', html)
-        html = re.sub(r'&#\d+;', ' ', html)
+        # Decode HTML entities
+        html = html_lib.unescape(html)
133-136: Tighten trafilatura error logging level; avoid noisy warnings.
This is a normal fallback path; keep logs low-volume.
- logger.warning("trafilatura extraction failed, falling back to simple extraction") + logger.debug("trafilatura extraction returned no content; using fallback") @@ - except Exception as e: - logger.warning(f"trafilatura extraction error: {e}, falling back to simple extraction") + except Exception as e: + logger.debug("trafilatura extraction error; falling back: %s", e)
8-15: Imports needed for above changes.
Add stdlib modules for entity decoding and IP checks.
 import asyncio
 import logging
 import re
 import time
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Set
+from typing import Dict, List, Optional, Set
 from urllib.parse import urlparse
+import html as html_lib
+import ipaddress
235-274: Type safety: define a SearchResult protocol for fetch_search_results.
Clarify expected attributes without hard coupling to a concrete class.
Add outside the selected range:
from typing import Protocol, Iterable, Optional

class SearchResultLike(Protocol):
    url: str
    title: Optional[str]
    site_name: Optional[str]
    snippet: Optional[str]
    published_at: Optional[str]

Then adjust the signature:
-async def fetch_search_results(self, search_results: List, timeout_seconds: int = 10,
+async def fetch_search_results(self, search_results: Iterable[SearchResultLike], timeout_seconds: int = 10,
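Any object with the right attributes would satisfy such a protocol structurally. A hedged example; the TavilyResult dataclass below is illustrative, not the project's actual search-result type:
from dataclasses import dataclass
from typing import Optional

@dataclass
class TavilyResult:  # hypothetical concrete type; a structural match is enough
    url: str
    title: Optional[str] = None
    site_name: Optional[str] = None
    snippet: Optional[str] = None
    published_at: Optional[str] = None

results = [TavilyResult(url="https://example.com", snippet="Example snippet")]
# docs = await service.fetch_search_results(results, preserve_snippets=True)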
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/web_fetch_service.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
services/web_fetch_service.py
79-79: Consider moving this statement to an else block (TRY300)
80-80: Do not catch blind exception: Exception (BLE001)
134-134: Do not catch blind exception: Exception (BLE001)
188-188: Use explicit conversion flag — replace with conversion flag (RUF010)
190-190: Use explicit conversion flag — replace with conversion flag (RUF010)
191-191: Do not catch blind exception: Exception (BLE001)
192-192: Use explicit conversion flag — replace with conversion flag (RUF010)
220-220: Do not catch blind exception: Exception (BLE001)
221-221: Use logging.exception instead of logging.error — replace with exception (TRY400)
221-221: Use explicit conversion flag — replace with conversion flag (RUF010)
228-228: Use explicit conversion flag — replace with conversion flag (RUF010)
…cement/tavily-api-auth
…d allow httpx to manage encoding automatically
Actionable comments posted: 2
🧹 Nitpick comments (6)
services/web_fetch_service.py (6)
104-128: Prefer html.unescape over manual entity replacements.
Simpler and more complete entity handling; fewer edge cases.
Apply this diff:
@@
-import re
+import re
+import html as html_unescape
@@ def _sanitize_html_fallback(self, html: str) -> str:
-        # Replace entities
-        html = re.sub(r'&nbsp;', ' ', html)
-        html = re.sub(r'&amp;', '&', html)
-        html = re.sub(r'&lt;', '<', html)
-        html = re.sub(r'&gt;', '>', html)
-        html = re.sub(r'&quot;', '"', html)
-        html = re.sub(r'&#\d+;', ' ', html)
+        # Replace entities
+        html = html_unescape.unescape(html)

Also applies to: 8-16
213-219: Use lazy/structured logging instead of f-strings; elevate unexpected errors to exception logs.
Avoid formatting costs and keep tracebacks.
Apply this diff:
@@ - logger.warning(f"HTTP error {status} fetching {url}: {str(e)}") + logger.warning("HTTP error %s fetching %s: %s", status, url, e) @@ - logger.warning(f"Request error fetching {url}: {str(e)}") + logger.warning("Request error fetching %s: %s", url, e) @@ - logger.warning(f"Error fetching {url}: {str(e)}") + logger.exception("Unexpected error fetching %s", url) @@ - logger.error(f"Error in fetch_urls: {str(e)}") + logger.exception("Error in fetch_urls") @@ - logger.warning(f"Exception during fetch: {str(result)}") + logger.warning("Exception during fetch: %s", result)Also applies to: 248-248, 255-255, 210-211
129-137: Optional: fall back to OpenGraph og:title when <title> is missing.
Improves coverage for modern sites.
Sketch:
@@ def _extract_title(self, html: str) -> Optional[str]:
-        title_match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
+        title_match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
         if title_match:
@@
-        return None
+        og = re.search(r'<meta\s+property=["\']og:title["\']\s+content=["\'](.*?)["\']', html, re.IGNORECASE | re.DOTALL)
+        if og:
+            return re.sub(r'\s+', ' ', og.group(1).strip())
+        return None
227-261: Confirm intent: dropping docs without text.
fetch_urls filters out FetchedDoc without text, which suppresses non-HTML and tiny pages and enables snippet fallback later. If that’s intentional, add a brief docstring note; otherwise, consider returning all FetchedDoc and letting callers decide.
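The two behaviours under discussion differ only in a final filter. A minimal sketch, assuming the fetch step has already gathered FetchedDoc results into a list (function name and flag are hypothetical):
from typing import List

def keep_fetched(docs: List["FetchedDoc"], drop_empty: bool = True) -> List["FetchedDoc"]:
    if drop_empty:
        # Behaviour as described: drop documents that yielded no text,
        # so the snippet fallback in fetch_search_results can fill the gaps later.
        return [d for d in docs if d.text]
    # Alternative: return everything and let callers decide what to keep.
    return list(docs)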
262-301: Add a typed protocol for search_results to improve safety.
Prevents attribute errors and clarifies the contract.
Example:
@@
-from typing import Dict, List, Optional, Set
+from typing import Dict, List, Optional, Set, Protocol, Sequence
@@
+class SearchResultLike(Protocol):
+    url: str
+    title: Optional[str]
+    site_name: Optional[str]
+    snippet: Optional[str]
+    published_at: Optional[str]
@@
-    async def fetch_search_results(self, search_results: List, timeout_seconds: int = 10,
+    async def fetch_search_results(self, search_results: Sequence[SearchResultLike], timeout_seconds: int = 10,
                                    preserve_snippets: bool = True) -> List[FetchedDoc]:

Also applies to: 14-14
57-61: Optional: use a product UA; reduce per-URL logs to debug.
Consider a descriptive UA (e.g., “AuthormatonFetcher/1.0 (+site)”) and keep per-URL logs at debug to avoid noisy INFO in production.
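A hedged sketch of what that could look like; the UA string and contact URL are placeholders, not values from this PR:
import logging

logger = logging.getLogger(__name__)

# Descriptive product User-Agent instead of a spoofed browser string.
PRODUCT_HEADERS = {
    "User-Agent": "AuthormatonFetcher/1.0 (+https://example.com/bot)",  # placeholder
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def log_fetch(url: str, fetch_ms: int, n_chars: int) -> None:
    # Per-URL success details at DEBUG; reserve WARNING/ERROR for real problems.
    logger.debug("Fetched %s in %dms, extracted %d chars", url, fetch_ms, n_chars)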
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/web_fetch_service.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
services/web_fetch_service.py
73-73: Consider moving this statement to an else block (TRY300)
74-74: Do not catch blind exception: Exception (BLE001)
100-100: Consider moving this statement to an else block (TRY300)
101-101: Do not catch blind exception: Exception (BLE001)
155-155: Do not catch blind exception: Exception (BLE001)
215-215: Use explicit conversion flag — replace with conversion flag (RUF010)
217-217: Use explicit conversion flag — replace with conversion flag (RUF010)
218-218: Do not catch blind exception: Exception (BLE001)
219-219: Use explicit conversion flag — replace with conversion flag (RUF010)
247-247: Do not catch blind exception: Exception (BLE001)
248-248: Use logging.exception instead of logging.error — replace with exception (TRY400)
248-248: Use explicit conversion flag — replace with conversion flag (RUF010)
255-255: Use explicit conversion flag — replace with conversion flag (RUF010)
🔇 Additional comments (1)
services/web_fetch_service.py (1)
57-63: Headers cleanup looks good (matches prior feedback).
Dropping Accept-Encoding br and browser-only headers is correct; letting httpx negotiate/decompress is safer.
def __init__(self, max_concurrency: Optional[int] = None):
    """
    Initialize the web fetch service.

    Args:
        max_concurrency: Maximum number of concurrent requests
            (defaults to settings.max_fetch_concurrency)
    """
    self.max_concurrency = max_concurrency or settings.max_fetch_concurrency
    self.semaphore = asyncio.Semaphore(self.max_concurrency)
Block SSRF via open redirects; disable env proxies; reuse a single AsyncClient with limits.
Currently follow_redirects=True allows 3xx hops to localhost/IMDS/etc, bypassing your pre-check. Also, creating a new client per request is wasteful and inherits proxy env by default. Reuse one AsyncClient, set trust_env=False, and manually validate each redirect hop.
Apply this diff:
@@
-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin
@@ class WebFetchService:
- def __init__(self, max_concurrency: Optional[int] = None):
+ def __init__(self, max_concurrency: Optional[int] = None):
@@
- self.max_concurrency = max_concurrency or settings.max_fetch_concurrency
- self.semaphore = asyncio.Semaphore(self.max_concurrency)
+ raw = max_concurrency or getattr(settings, "max_fetch_concurrency", 10)
+ self.max_concurrency = max(1, raw)
+ self.semaphore = asyncio.Semaphore(self.max_concurrency)
@@
- # Common HTML headers for the fetch requests
+ # Common HTML headers for the fetch requests
self.headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5"
# Let httpx handle Accept-Encoding and compression automatically
}
+
+ # Reuse a single client; disable env proxies; bound pool size
+ self._limits = httpx.Limits(
+ max_connections=self.max_concurrency,
+ max_keepalive_connections=self.max_concurrency,
+ )
+ self._client = httpx.AsyncClient(
+ headers=self.headers,
+ http2=True,
+ limits=self._limits,
+ follow_redirects=False, # we validate/handle redirects manually
+ trust_env=False, # ignore HTTP(S)_PROXY/NO_PROXY
+ )
+ self._max_redirects = getattr(settings, "fetch_max_redirects", 5)
+ self._max_content_bytes = getattr(settings, "fetch_max_content_bytes", 2_000_000)
+
+ async def aclose(self) -> None:
+ await self._client.aclose()
@@
- try:
- async with httpx.AsyncClient(timeout=timeout_seconds, follow_redirects=True) as client:
- response = await client.get(url, headers=self.headers)
- response.raise_for_status()
-
- html = response.text
-
- # Extract title if not already provided
- title = self._extract_title(html)
-
- # Extract text content
- text = self._extract_text_from_html(html)
-
- # Update the fetched document
- fetched_doc.title = title
- fetched_doc.text = text
-
- # Calculate fetch time
- fetch_ms = int((time.time() - start_time) * 1000)
- fetched_doc.fetch_ms = fetch_ms
-
- logger.info(f"Fetched {url} in {fetch_ms}ms, extracted {len(text)} chars")
- return fetched_doc
+ try:
+ current_url = url
+ html = ""
+ # Manually follow and validate up to N redirects
+ for _ in range(self._max_redirects + 1):
+ if not self._is_url_allowed(current_url):
+ raise httpx.HTTPStatusError("Unsafe URL after redirect validation",
+ request=None, response=None)
+ # Stream to cap payload size
+ async with self._client.stream("GET", current_url, timeout=timeout_seconds) as response:
+ if 300 <= response.status_code < 400 and "location" in response.headers:
+ next_url = urljoin(current_url, response.headers["location"])
+ current_url = next_url
+ continue
+ response.raise_for_status()
+ ctype = response.headers.get("content-type", "")
+ if "html" not in ctype and "xml" not in ctype:
+ logger.debug("Skipping non-HTML content: %s (%s)", current_url, ctype)
+ break
+ buf = bytearray()
+ async for chunk in response.aiter_bytes():
+ buf.extend(chunk)
+ if len(buf) > self._max_content_bytes:
+ logger.warning("Aborting %s: response exceeded %d bytes", current_url, self._max_content_bytes)
+ break
+ if buf:
+ html = buf.decode("utf-8", errors="replace")
+ break
+
+ # Extract title/text if we have HTML
+ if html:
+ title = self._extract_title(html)
+ text = self._extract_text_from_html(html)
+ fetched_doc.title = title
+ fetched_doc.text = text
+
+ # Calculate fetch time
+ fetch_ms = int((time.time() - start_time) * 1000)
+ fetched_doc.fetch_ms = fetch_ms
+ logger.debug("Fetched %s in %dms, extracted %d chars", current_url, fetch_ms, len(fetched_doc.text))
+ return fetched_doc
Also applies to: 15-16, 161-213
🤖 Prompt for AI Agents
In services/web_fetch_service.py around lines 46-56 (also applies to lines 15-16
and 161-213): the service currently creates a new HTTP client per request,
inherits proxy environment variables, and uses follow_redirects=True which
allows open-redirect SSRF to internal addresses; fix by instantiating a single
httpx.AsyncClient in __init__ with appropriate limits (httpx.Limits or
equivalent) and trust_env=False, set follow_redirects=False, and reuse that
client for all requests; implement manual redirect handling when a 3xx response
is returned: read the Location header, resolve and validate each redirect hop
against the same SSRF/internal-host checks used for the initial URL (reject
local/IMDS/private IPs), enforce a maximum redirect count, and only follow
validated redirects using the single client; ensure the client is properly
closed on service shutdown.
def _is_url_allowed(self, url: str) -> bool:
    """
    Check if a URL is safe to fetch (SSRF protection).

    Args:
        url: The URL to validate

    Returns:
        True if URL is safe to fetch, False otherwise
    """
    try:
        p = urlparse(url)
        if p.scheme not in ("http", "https"):
            return False
        if not p.hostname or p.username or p.password:
            return False
        try:
            ip = ipaddress.ip_address(p.hostname)
            if not ip.is_global:
                return False
        except ValueError:
            # Hostname; DNS resolution checks can be added later if needed.
            pass
        return True
    except Exception:
        return False
SSRF guard is a good start; consider tightening hostname cases.
Add explicit blocks for localhost-style hostnames and common link-local names; document future DNS resolution plans to catch rebinds.
Example tweak:
@@ def _is_url_allowed(self, url: str) -> bool:
- if not p.hostname or p.username or p.password:
+ if not p.hostname or p.username or p.password:
return False
+ host = p.hostname.lower()
+ if host in {"localhost", "localhost.localdomain"} or host.endswith(".local"):
+ return False
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _is_url_allowed(self, url: str) -> bool:
    """
    Check if a URL is safe to fetch (SSRF protection).

    Args:
        url: The URL to validate

    Returns:
        True if URL is safe to fetch, False otherwise
    """
    try:
        p = urlparse(url)
        if p.scheme not in ("http", "https"):
            return False
        if not p.hostname or p.username or p.password:
            return False
        host = p.hostname.lower()
        if host in {"localhost", "localhost.localdomain"} or host.endswith(".local"):
            return False
        try:
            ip = ipaddress.ip_address(p.hostname)
            if not ip.is_global:
                return False
        except ValueError:
            # Hostname; DNS resolution checks can be added later if needed.
            pass
        return True
    except Exception:
        return False
🧰 Tools
🪛 Ruff (0.13.1)
100-100: Consider moving this statement to an else block (TRY300)
101-101: Do not catch blind exception: Exception (BLE001)
🤖 Prompt for AI Agents
In services/web_fetch_service.py around lines 77 to 103, the SSRF guard
currently allows hostnames that could still resolve to
loopback/link-local/private addresses; update the function to explicitly reject
common localhost-style hostnames and numeric edge cases by normalizing
p.hostname (lowercase, strip surrounding brackets for IPv6) and returning False
for literal names like "localhost", "ip6-localhost", "0.0.0.0" and for IPv4/IPv6
addresses in loopback, link-local (169.254.0.0/16 and fe80::/10), multicast, and
private ranges (10/8, 172.16/12, 192.168/16) using ipaddress checks
(is_loopback, is_link_local, is_private, is_multicast) after parsing the
hostname to an ip object; if hostname is non-numeric keep the existing behavior
but add a clear TODO comment that DNS resolution and rebind protection will be
implemented later.
Features
- TavilySearchProvider (services/web_search_service.py)
  - Authorization: Bearer <TAVILY_API_KEY>.
  - httpx with per-request timeout.
  - Results shaped as {url, title, site_name, snippet, published_at, score}.
- Settings & Fallback (config/settings.py)
  - WEB_SEARCH_ENGINE=tavily but key missing → log warning and fallback to dummy (no crash).
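The bullets above describe the provider's auth header, transport, and normalized result shape. Below is a minimal, hedged sketch of such a call; the endpoint path and the raw response field names are assumptions — only the Bearer auth, the httpx per-request timeout, and the normalized result keys come from this PR description.
from typing import Any, Dict, List

import httpx

TAVILY_ENDPOINT = "https://api.tavily.com/search"  # assumed endpoint path

async def tavily_search(query: str, api_key: str, timeout_seconds: int = 10) -> List[Dict[str, Any]]:
    """Query Tavily and normalize results to the shape listed in this PR description."""
    headers = {"Authorization": f"Bearer {api_key}"}  # auth scheme from the PR
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            TAVILY_ENDPOINT,
            json={"query": query},
            headers=headers,
            timeout=timeout_seconds,  # per-request timeout, as described
        )
        resp.raise_for_status()
        payload = resp.json()
    # Raw field names on the right-hand side are guesses, not documented here.
    return [
        {
            "url": item.get("url"),
            "title": item.get("title"),
            "site_name": item.get("site_name"),
            "snippet": item.get("content"),
            "published_at": item.get("published_date"),
            "score": item.get("score"),
        }
        for item in payload.get("results", [])
    ]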
Fixes
Refactors
Observability
Configuration
- WEB_SEARCH_ENGINE=tavily|dummy|...
- TAVILY_API_KEY required only when using Tavily; otherwise dummy is used.
Manual QA
Compatibility & Risks
Checklist
Summary by CodeRabbit