Skip to content

Sophomoresty/bpc-fetch

Repository files navigation

bpc-fetch logo

bpc-fetch

Bypass paywall sites — search, discover, and batch fetch articles as Markdown.

中文文档936 SitesFeaturesInstallUsageCredits

LINUX DO Community


What is this?

A command-line tool that fetches full-text articles from 936 paywalled news sites and saves them as clean Markdown with images. It replicates the bypass logic of the Bypass Paywalls Clean browser extension, but runs headlessly — no browser extension needed.

Supported Sites

936 sites across 40+ countries. Highlights:

Category Sites
Financial The Economist, Financial Times, Bloomberg, WSJ, Reuters, Forbes, Business Insider
US News New York Times, Washington Post, LA Times, Chicago Tribune, Politico
UK/EU The Telegraph, The Times, Der Spiegel, Le Monde, El País, Corriere della Sera
Tech/Science Wired, The Atlantic, Nature, Science, Scientific American, MIT Tech Review
Magazines The New Yorker, Vanity Fair, Vogue, National Geographic, Esquire
German 76 sites (FAZ, Handelsblatt, Süddeutsche Zeitung...)
French 69 sites (Le Figaro, Libération, Les Echos...)
More Netherlands 30, Italy 28, Spain 26, Belgium 22, Australia 39...

Run bpc-fetch sites to see the full list.

Features

  • Full bypass coverage — Replicates all BPC extension strategies: custom User-Agent, Googlebot/Bingbot spoofing, referer manipulation, Playwright JS interception, archive.org fallback
  • Auto fallback chain — Each URL tries the optimal strategy, degrades gracefully until content is retrieved
  • Article discovery — Find recent articles via RSS, sitemap, or browser-rendered homepage
  • Cross-site crawl — Search + time filter + batch download in one command
  • Agent-friendly — JSON stdout, stderr progress, next_command hints in every response
  • Windows exe — Single-file distribution via PyInstaller, auto-downloads Chromium on first run

Installation

pip (recommended)

pip install bpc-fetch
playwright install chromium

From source

git clone https://github.com/user/bpc-fetch.git
cd bpc-fetch
pip install -e .
playwright install chromium

Windows exe

Download bpc-fetch.exe from Releases, then:

bpc-fetch.exe install-browser
bpc-fetch.exe doctor

Usage

# Check setup
bpc-fetch doctor

# List supported sites
bpc-fetch sites --filter economist

# Discover today's articles from a site
bpc-fetch discover economist.com --since today

# Fetch a single article
bpc-fetch fetch "https://www.economist.com/leaders/2024/01/01/example" --out-dir ./articles

# Batch fetch from URL list
bpc-fetch batch --file urls.txt --out-dir ./articles

# Cross-site crawl: keyword + time range
bpc-fetch crawl "AI regulation" --sites economist.com,ft.com --since 7d --out-dir ./ai-articles

Output format

article-title/
├── article-title.md      # YAML frontmatter + full text + image refs
└── images/
    ├── img_000_abc1.jpg
    └── img_001_def2.png

Agent integration

All commands output JSON. Use --compact for minimal output:

bpc-fetch discover ft.com --since today --compact
# → {"ok": true, "domain": "ft.com", "count": 15, "articles": [...], "next_command": "bpc-fetch batch ..."}

Bypass Strategies

Strategy Sites Method
ua:custom 7 Custom User-Agent string (Liskov, Google-InspectionTool, etc.)
ua:googlebot 85 Googlebot User-Agent
ua:facebookbot 5 Facebook crawler UA
referer:google 2 Google referer header
block_js 425 Playwright blocks paywall scripts via Page.route()
archive 274 Fetch from archive.org/archive.is
cookies 138 Access without tracking cookies

Building Windows exe

pip install pyinstaller
python build/build_win.py
# Output: dist/bpc-fetch.exe

Credits

This tool is built on top of the bypass logic from:

  • Bypass Paywalls Clean by magnolia1234 — the original browser extension that provides the site database and bypass strategies. All credit for the paywall bypass research goes to the BPC project maintainers.
  • LINUX DO — Community support and feedback.

License

MIT — see LICENSE.

The data/sites.js file is from the Bypass Paywalls Clean project (MIT License).

About

Bypass paywall sites — search, discover, and batch fetch articles as Markdown. 936 sites supported.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors