Skip to content

Halfax/hal-web-mcp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hal-web-mcp

Typed web tools for the Halfax AI agent. Replaces ad-hoc bash → curl with safer, smaller-model-friendly primitives.

Tools

Tool Purpose
web_get(url, max_bytes?, accept?, follow_redirects?, as_markdown?) Fetch a URL. HTML auto-converts to markdown by default.
web_head(url, follow_redirects?) HEAD-style probe — status + content-type, no body.
web_search(query, n?) Federated search via SearXNG.
web_extract(url, max_chars?) Fetch + Readability-style article extraction.
web_cache_clear(prefix?) Flush the response cache.
web_config() Diagnostic: dump effective config + warmup state.

Safety posture

  • SSRF guard: every URL is DNS-resolved up front and rejected if any resolved IP is loopback / RFC1918 / link-local / CGNAT / IPv6-private, or if the resolution returns a mix of public and private IPs (DNS rebinding setup).
  • Scheme allowlist: http and https only. file:, javascript:, gopher:, ftp:, data: are all rejected.
  • Size cap: streaming read with byte counter; hard-kills the connection at the limit instead of trusting Content-Length.
  • Redirect handling: bounded chain (default 5), each hop re-validated, https→http downgrade refused.
  • Per-domain rate limit: 1 req/s burst 5, defends against feedback loops where the agent hammers a single host.
  • Cache: GET/HEAD only, honors Cache-Control: no-store and private, TTL + LRU bound.

Setup

cd hal-web-mcp
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python -m pytest tests/   # 31 tests

The Halfax extension's MCP manager auto-spawns this server. To make search work, point the env at a SearXNG instance.

SearXNG

The production instance runs on Cygnus (http://192.168.0.241:8888, Netbird http://100.87.21.129:8888; moved from Betelgeuse 2026-05-30). To stand one up elsewhere:

# docker-compose.yml
services:
  searxng:
    # Digest-pinned tag, NEVER :latest (DECISIONS §12) — engine plugins break on
    # upstream changes. Bumps are quarterly + manual; test a new tag in a sidecar
    # against the JSON probe below before swapping production.
    image: searxng/searxng:2026.5.6-330d56bba
    container_name: searxng
    restart: unless-stopped
    ports:
      - "8888:8080"
    volumes:
      - ./searxng:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=http://<host-ip>:8888/
      - INSTANCE_NAME=halfax

After first run, edit ./searxng/settings.yml:

  • Add json to search.formats (default ships HTML-only)
  • Set server.secret_key to a random hex string

Sanity-check:

curl 'http://<host>:8888/search?q=test&format=json' | jq '.results[0].title'

Then set HALFAX_WEB_SEARXNG_URL (see env table). The MCP fires a warmup ping at startup and every 10 minutes, so the agent's first real search doesn't pay the cold-start cost.

Environment

Var Default Purpose
HALFAX_WEB_SEARCH_BACKEND searxng searxng | none
HALFAX_WEB_SEARXNG_URL (unset) e.g. http://100.87.21.129:8888 (Cygnus)
HALFAX_WEB_TIMEOUT 15 Total request timeout (s)
HALFAX_WEB_CONNECT_TIMEOUT 5 Connect timeout (s)
HALFAX_WEB_MAX_BYTES 2000000 Per-response size cap
HALFAX_WEB_MAX_REDIRECTS 5 Per-fetch redirect ceiling
HALFAX_WEB_MAX_CONCURRENT 4 Parallel-fetch semaphore
HALFAX_WEB_CACHE_DIR ~/.halfax-web-cache Disk cache root
HALFAX_WEB_CACHE_TTL 600 Cache entry TTL (s); 0 disables
HALFAX_WEB_CACHE_MAX_BYTES 100000000 LRU quota
HALFAX_WEB_ALLOW_DOMAINS (unset) Comma list. When set, only these domains.
HALFAX_WEB_DENY_DOMAINS (unset) Comma list. Always rejected. Overrides allow.
HALFAX_WEB_USER_AGENT Halfax-AI-Agent/<v> (+local) UA header
HALFAX_WEB_WARMUP_INTERVAL 600 SearXNG keepalive interval (s)
HALFAX_WEB_LOG_LEVEL INFO Stderr log verbosity

Limitations

  • No JavaScript rendering. SPA-only sites that don't ship server-rendered HTML produce empty/garbage output. web_extract (Readability) is the best fallback for those.
  • No request-body POST. web_get/web_head are for GET/HEAD only. Use bash + curl for POST/PUT to public APIs.
  • Cert validation is via the original hostname. We DNS-validate before the fetch but let httpx connect normally; a hostile DNS server could rebind between our resolve and httpx's, though that requires the attacker control DNS for a target the user already chose to query. A future pass can swap in a custom transport that uses the validated IP directly with proper SNI.

About

Web search/fetch MCP server behind a real SSRF wall — DNS-rebinding defense, private-range blocking, 31 tests.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages