pi-internet

Web search, content fetching, and research for Pi. Multi-provider search, specialized fetchers for GitHub/Reddit/Twitter/YouTube/PDF, and a scout subagent that keeps noise out of your context.

Install

pi install git:github.com/JJGO/pi-internet

Or try without installing:

pi -e git:github.com/JJGO/pi-internet

What You Get

`web_search`

Search the web using multiple providers in parallel. Results are deduplicated by URL and the richer snippet is kept.

Search for "typescript monorepo best practices 2025"

Primary providers run in parallel (default: Brave + Kagi)
Fallback providers fill gaps when primaries fail or return too few unique results (default: Tavily)
Brave rate-limit / usage-limit responses disable Brave for the rest of the current session
Override per-call with provider: "brave" / "kagi" / "tavily" — explicit provider selection does not auto-fallback
Default: 10 results (max 20) — each provider returns 10, merged and deduplicated
Freshness filter: day, week, month, year
Concise warnings are surfaced when a provider is disabled or a fallback had to fill results
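
The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the `SearchResult` shape, the `normalizeUrl` helper, and the choice to treat the longer snippet as the "richer" one are all assumptions made for the example.

```typescript
// Hypothetical result shape; the real package's types may differ.
interface SearchResult {
  url: string;
  title: string;
  snippet: string;
  provider: string;
}

// Normalize a URL so trivially different forms compare equal.
// Assumption: dropping the fragment and trailing slashes is enough for a sketch.
function normalizeUrl(raw: string): string {
  try {
    const u = new URL(raw);
    const path = u.pathname.replace(/\/+$/, "");
    return `${u.protocol}//${u.host}${path}${u.search}`;
  } catch {
    return raw.trim();
  }
}

// Merge provider result lists; for duplicate URLs keep the longer snippet
// (assumed here to be the "richer" one).
function mergeResults(lists: SearchResult[][], limit = 10): SearchResult[] {
  const byUrl = new Map<string, SearchResult>();
  for (const list of lists) {
    for (const r of list) {
      const key = normalizeUrl(r.url);
      const existing = byUrl.get(key);
      if (!existing) {
        byUrl.set(key, r);
      } else if (r.snippet.length > existing.snippet.length) {
        // Keep the first-seen position and title, swap in the longer snippet.
        byUrl.set(key, { ...existing, snippet: r.snippet });
      }
    }
  }
  return Array.from(byUrl.values()).slice(0, limit);
}
```

Keying the map on a normalized URL keeps the first provider's ordering while still letting a later provider improve the snippet.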

`fetch_url`

Fetch any URL and get clean, token-efficient markdown. Auto-detects content type:

| URL Type | Handler |
| --- | --- |
| GitHub | Clones the repo locally, returns tree + README. Use `read`/`bash` on the local path. |
| Reddit | Uses a Redlib-compatible proxy when one is configured. Structured posts + nested comments (depth 4). |
| Twitter/X | Uses a Nitter-compatible proxy when one is configured. Profiles, threads, tweets with RT/quote detection. |
| YouTube | Videos return transcripts via yt-dlp. Playlists and channels preview 25 entries inline and write the full list to cache. |
| PDF | Extracts text via unpdf. Large extractions are also saved to `~/Downloads/`. |
| HTML | Readability → RSC parser → Jina Reader fallback chain. |

Links stripped by default (saves ~50 tokens/link). Set includeLinks: true to keep.
CSS selector support: selector: ".docs-content" narrows extraction.
verbose: true lifts the per-source limits: full comment depth for Reddit, untruncated tweets for Twitter, and no internal entry cap for YouTube collections.
YouTube playlists/channels write full lists to ~/.cache/pi-internet/youtube-lists/. Default output previews the first 25 items inline.
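
As an illustration of the content-type detection, a hostname-based classifier might look like the sketch below. The `UrlKind` type and `classifyUrl` function are hypothetical names, and the real package may use other signals as well (for example the response `Content-Type` for PDFs), which this sketch does not attempt.

```typescript
type UrlKind = "github" | "reddit" | "twitter" | "youtube" | "pdf" | "html";

// True when host equals the domain or is a subdomain of it.
function isHost(host: string, domain: string): boolean {
  return host === domain || host.endsWith(`.${domain}`);
}

// Assumption: routing by hostname and file extension only.
function classifyUrl(raw: string): UrlKind {
  const u = new URL(raw);
  const host = u.hostname.toLowerCase();
  if (isHost(host, "github.com")) return "github";
  if (isHost(host, "reddit.com")) return "reddit";
  if (isHost(host, "twitter.com") || isHost(host, "x.com")) return "twitter";
  if (isHost(host, "youtube.com") || host === "youtu.be") return "youtube";
  if (u.pathname.toLowerCase().endsWith(".pdf")) return "pdf";
  // Everything else goes through the generic HTML chain.
  return "html";
}
```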

`web_research` (hidden by default)

Spawns a lightweight scout subagent that searches + fetches pages, then returns only relevant findings. All noise stays in the scout's disposable context.

The /toggle-research switch is session-only by design.

/toggle-research    # Enable the tool

The scout model is auto-detected from your current provider:

| Your Provider | Scout Uses |
| --- | --- |
| Anthropic | `claude-haiku-4-5` |
| OpenAI | `gpt-4.1-mini` |
| Google | `gemini-2.0-flash` |

Override per-call with model: "...".
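
The selection logic amounts to a lookup with an override, which could be sketched as below. `pickScoutModel` and the lowercase provider keys are hypothetical; the README does not say what happens for other providers, so this sketch simply returns `undefined` in that case.

```typescript
// Model names are taken from the table above; the key spelling is an assumption.
const SCOUT_MODELS: Record<string, string> = {
  anthropic: "claude-haiku-4-5",
  openai: "gpt-4.1-mini",
  google: "gemini-2.0-flash",
};

// A per-call override wins; otherwise fall back to the provider default.
function pickScoutModel(provider: string, override?: string): string | undefined {
  if (override) return override;
  return SCOUT_MODELS[provider.toLowerCase()];
}
```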

Configuration

Settings live in Pi's settings files (~/.pi/agent/settings.json or .pi/settings.json):

{
  "piInternet": {
    "searchProviders": ["brave", "kagi"],
    "fallbackProviders": ["tavily"],
    "reddit": {
      "commentDepth": 4
    },
    "twitter": {},
    "github": {
      "enabled": true,
      "maxRepoSizeMB": 350,
      "clonePath": "/absolute/path/to/github-repos",
      "refreshTtlMs": 300000
    },
    "youtube": {
      "enabled": true
    },
    "fetch": {
      "includeLinks": false,
      "timeoutMs": 30000,
      "socksProxy": "socks5h://127.0.0.1:25344"
    }
  }
}

fetch.socksProxy is optional and disabled by default. When set, pi-internet routes its outbound HTTP requests through that SOCKS proxy.

github.clonePath defaults to ~/.cache/pi-internet/github-repos when omitted.

github.refreshTtlMs defaults to 300000 (5 minutes). Cached repos are refreshed with git fetch + hard reset when they are older than the TTL. If the cached clone has local edits, pi-internet keeps it untouched and creates a fresh sibling clone instead.
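
The refresh rule can be read as a small decision function. The sketch below is an interpretation, not the package's code: `decideRefresh` and the `RefreshAction` names are hypothetical, and it assumes the local-edits check only matters once the clone is older than the TTL.

```typescript
type RefreshAction = "reuse" | "refresh" | "clone-sibling";

function decideRefresh(
  lastFetchedMs: number,
  nowMs: number,
  hasLocalEdits: boolean,
  ttlMs = 300_000, // 5 minutes, the documented default
): RefreshAction {
  const stale = nowMs - lastFetchedMs > ttlMs;
  if (!stale) return "reuse";
  // A stale clone with local edits is left untouched; a fresh sibling is cloned.
  // A clean stale clone gets git fetch + hard reset.
  return hasLocalEdits ? "clone-sibling" : "refresh";
}
```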

piWebSurf is still accepted as a legacy config key for backward compatibility, but piInternet is preferred.
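
A minimal sketch of how the legacy key and the documented defaults might be resolved is shown below. `pickSection` and `resolveGithub` are hypothetical helpers, the home directory is passed in to keep the example self-contained, and the sketch assumes `piInternet` replaces `piWebSurf` entirely when both are present rather than merging them.

```typescript
type Settings = Record<string, unknown>;

interface GithubPaths {
  clonePath: string;
  refreshTtlMs: number;
}

// Prefer the current key, fall back to the legacy one.
function pickSection(settings: Settings): Record<string, any> {
  const section = settings["piInternet"] ?? settings["piWebSurf"] ?? {};
  return section as Record<string, any>;
}

// Apply the documented defaults for the two GitHub options.
function resolveGithub(settings: Settings, homeDir: string): GithubPaths {
  const github = pickSection(settings).github ?? {};
  return {
    clonePath: github.clonePath ?? `${homeDir}/.cache/pi-internet/github-repos`,
    refreshTtlMs: github.refreshTtlMs ?? 300_000,
  };
}
```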

Environment variables

Search providers

| Provider | Env Var |
| --- | --- |
| Brave Search | `BRAVE_API_KEY` |
| Kagi | `KAGI_SESSION_TOKEN` |
| Tavily | `TAVILY_API_KEY` |

Kagi also checks ~/.pi/kagi-search.json and ~/.kagi_session_token as fallbacks.

Optional proxies

| Feature | Env Var | Value |
| --- | --- | --- |
| Package-wide SOCKS proxy for outbound HTTP | `PI_INTERNET_SOCKS_PROXY` | Full proxy URL, e.g. `socks5h://127.0.0.1:25344` |
| Reddit via Redlib-compatible proxy | `PI_INTERNET_REDLIB_PROXY` | Hostname only, e.g. `redlib.example.com` |
| Twitter/X via Nitter-compatible proxy | `PI_INTERNET_NITTER_PROXY` | Hostname only, e.g. `nitter.example.com` |

PI_INTERNET_SOCKS_PROXY overrides piInternet.fetch.socksProxy when both are set.

If these env vars are unset, SOCKS proxying stays disabled (unless piInternet.fetch.socksProxy is set) and Reddit/X URLs fall through to regular HTTP fetching.
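
The precedence and fall-through rules above could be sketched like this. `resolveSocksProxy` and `rewriteHost` are hypothetical helpers, and the sketch assumes the Redlib/Nitter proxy is reached over HTTPS with the original path kept as-is, which the README does not confirm.

```typescript
interface ProxyEnv {
  PI_INTERNET_SOCKS_PROXY?: string;
  PI_INTERNET_REDLIB_PROXY?: string;
  PI_INTERNET_NITTER_PROXY?: string;
}

// The env var overrides the config value; with neither set, proxying stays off.
function resolveSocksProxy(env: ProxyEnv, configured?: string): string | undefined {
  return env.PI_INTERNET_SOCKS_PROXY || configured || undefined;
}

// Point a Reddit or X URL at the proxy hostname, or return it unchanged
// when no proxy is configured (regular HTTP fetching).
function rewriteHost(raw: string, proxyHost?: string): string {
  if (!proxyHost) return raw;
  const u = new URL(raw);
  u.protocol = "https:"; // assumption: the proxy serves HTTPS
  u.host = proxyHost;
  return u.toString();
}
```

For example, `rewriteHost("https://www.reddit.com/r/typescript/", env.PI_INTERNET_REDLIB_PROXY)` keeps the path and only swaps the host.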

External dependencies

| Binary | Required For |
| --- | --- |
| `yt-dlp` | YouTube transcripts and playlist/channel metadata |
| `git` or `gh` | GitHub cloning |

Commands

| Command | Description |
| --- | --- |
| `/search-providers` | List configured providers, availability, and session-disabled status |
| `/kagi-login` | Set Kagi session token interactively |
| `/toggle-research` | Show/hide the `web_research` tool for the current session |

How It Works

web_search(query)
  → Run primary providers in parallel (Brave + Kagi)
  → Merge: deduplicate by URL, keep richer snippet
  → If primaries fail or return too few unique results → try fallback providers (Tavily) to fill the gap
  → If Brave returns a rate-limit or usage-limit response → disable Brave for the rest of the session

fetch_url(url)
  → Reddit?    Configured Redlib-compatible proxy → parse posts/comments → render markdown
  → Twitter?   Configured Nitter-compatible proxy → parse tweets/profile → render markdown
  → GitHub?    Clone repo → tree + README + file content
  → YouTube?   Video → yt-dlp subtitles → parse VTT → timestamped transcript
               Playlist/channel → yt-dlp flat JSON → first 25 inline + full list file
  → PDF?       unpdf extraction → inline markdown (+ save large outputs)
  → HTTP?      Readability → RSC parser → Jina Reader fallback

web_research(task)
  → Spawn scout: pi --mode json --no-session -e <this-ext>
  → Scout searches + fetches + analyzes
  → Returns only relevant findings
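
The `web_search` flow above can be sketched as follows. This is an illustration under stated assumptions: `Provider`, `RateLimitError`, and `searchWithFallback` are hypothetical names, and the sketch session-disables any provider that throws the rate-limit error, whereas the README only describes that behavior for Brave.

```typescript
interface Hit { url: string; snippet: string }
interface Provider { name: string; search: (query: string) => Promise<Hit[]> }

// Hypothetical error a provider adapter throws on rate-limit / usage-limit responses.
class RateLimitError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RateLimitError";
  }
}

// Providers disabled for the remainder of the session.
const disabledThisSession = new Set<string>();

async function runProvider(p: Provider, query: string, warnings: string[]): Promise<Hit[]> {
  if (disabledThisSession.has(p.name)) return [];
  try {
    return await p.search(query);
  } catch (err) {
    if (err instanceof Error && err.name === "RateLimitError") {
      disabledThisSession.add(p.name);
      warnings.push(`${p.name} disabled for the rest of this session`);
    } else {
      warnings.push(`${p.name} failed`);
    }
    return [];
  }
}

async function searchWithFallback(
  query: string,
  primaries: Provider[],
  fallbacks: Provider[],
  wanted = 10,
): Promise<{ hits: Hit[]; warnings: string[] }> {
  const warnings: string[] = [];
  const unique = new Map<string, Hit>();
  const add = (hits: Hit[]): void => {
    for (const h of hits) if (!unique.has(h.url)) unique.set(h.url, h);
  };

  // Primaries run in parallel; failures resolve to empty lists.
  const lists = await Promise.all(primaries.map((p) => runProvider(p, query, warnings)));
  lists.forEach(add);

  // Fallbacks are tried one by one, only while results are still too few.
  for (const fb of fallbacks) {
    if (unique.size >= wanted) break;
    const before = unique.size;
    add(await runProvider(fb, query, warnings));
    if (unique.size > before) warnings.push(`${fb.name} filled ${unique.size - before} result(s)`);
  }
  return { hits: Array.from(unique.values()).slice(0, wanted), warnings };
}
```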

Token Efficiency

Links stripped from extracted content by default
Images always stripped
Reddit comments capped at depth 4 with truncation notices
Tweet content truncated to 500 chars in non-verbose mode
Output truncated to Pi's standard limits (50KB / 2000 lines)
Social proxy auto-disables on failure (falls through to regular HTTP fetching for the rest of the session)
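
To make the output limit concrete, a truncation step might look like the sketch below. It is only an illustration: `truncateOutput` is a hypothetical function, 50KB is assumed to mean 50 × 1024 bytes of UTF-8, and how Pi itself applies its limits is not described here.

```typescript
function truncateOutput(
  text: string,
  maxBytes = 50 * 1024, // assumption: 50KB counted as UTF-8 bytes
  maxLines = 2000,
): { text: string; truncated: boolean } {
  const lines = text.split("\n");
  let truncated = lines.length > maxLines;
  let out = truncated ? lines.slice(0, maxLines).join("\n") : text;

  const bytes = new TextEncoder().encode(out);
  if (bytes.length > maxBytes) {
    // Cut at the byte limit, then drop any partial character left at the end.
    out = new TextDecoder().decode(bytes.slice(0, maxBytes)).replace(/\uFFFD+$/, "");
    truncated = true;
  }
  return { text: out, truncated };
}
```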

License

MIT

Provenance

This project blends original code with ideas and implementation patterns adapted from a few earlier Pi and web-fetch projects.

Search builds on patterns from pi-websearch, pi-web-access, and pi-kagi-search for provider routing, result normalization, and Kagi session-based scraping.
Fetchers reuse or adapt ideas from pi-web-access and pi-fetch for GitHub extraction, Readability/RSC/Jina fallback behavior, and PDF extraction.
Research/scout mode borrows the disposable subagent pattern from pi-surf.
Utilities and glue code were simplified, consolidated, or rewritten to fit this package, especially around config loading, markdown rendering, routing, and packaging.

Where code was adapted, comments in the relevant source files point back to the upstream project or module.
