CLI crawler for collecting Xiaohongshu creator contact leads from profile bios and recommendation chains, with SQLite persistence and nightly automation.
Install the published CLI:
```bash
uv tool install red-crawler==0.1.3
```

Install the Playwright browser runtime:

```bash
red-crawler install-browsers
```

For local development from a checkout:

```bash
uv sync
uv run playwright install chromium
```

Run without logging in. If Xiaohongshu shows a login popup, the crawler tries to close it and continue.
Collect creators from the Xiaohongshu fashion homefeed:
```bash
red-crawler crawl-homefeed \
  --max-accounts 20 \
  --db-path "./data/red_crawler.db" \
  --output-dir "./output"
```

The default homefeed URL is https://www.xiaohongshu.com/explore?channel_id=homefeed.fashion_v3. The crawler reads each card's author link and opens the user profile, not the note page.
Run a manual crawl from a known user profile:
```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --max-accounts 20 \
  --max-depth 2 \
  --gender-filter "女" \
  --db-path "./data/red_crawler.db" \
  --output-dir "./output"
```

`crawl-seed` defaults to safe mode, adding slower request pacing and dwell/scroll delays that look more like a normal browsing session. Use `--no-safe-mode` only if you explicitly want a faster run.
Use --gender-filter "男" or --gender-filter "女" to keep only inferred male or female accounts in the exported and persisted crawl results.
Use Bright Data Browser API instead of launching local Chromium:
```bash
export BRIGHT_DATA_BROWSER_API_AUTH="SBR_ZONE_FULL_USERNAME:SBR_ZONE_PASSWORD"
red-crawler crawl-search \
  --search-term "抗痘博主" \
  --browser-mode bright-data \
  --output-dir "./output"
```

You can also pass the full CDP endpoint directly:
```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --browser-mode bright-data \
  --browser-endpoint "wss://SBR_ZONE_FULL_USERNAME:SBR_ZONE_PASSWORD@brd.superproxy.io:9222"
```

`crawl-seed` now does both:

- exports `accounts.csv`, `contact_leads.csv`, and `run_report.json`
- upserts the same result into SQLite
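To spot-check what was upserted, the standard `sqlite3` shell (installed separately) can list the tables in the crawler database without assuming anything about its schema:

```bash
# Show the tables red-crawler created in its SQLite database
sqlite3 ./data/red_crawler.db ".tables"
```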
Bright Data Browser API mode rotates by opening a fresh browser session on retry. If your Bright Data username, password, or endpoint contains {session}, red-crawler replaces it with a random session id for each browser session:
```bash
red-crawler crawl-homefeed \
  --browser-mode bright-data \
  --browser-auth "brd-customer-xxx-zone-xxx-session-{session}:PASSWORD" \
  --rotation-mode session \
  --rotation-retries 2
```

In local browser mode, red-crawler cannot rotate the machine's real IP by itself. Provide one proxy with `--proxy` or a newline-delimited proxy pool with `--proxy-list`; session rotation will launch a new Chromium session with the next proxy after a 403 or 429.
```bash
red-crawler crawl-homefeed \
  --proxy-list "./proxies.txt" \
  --rotation-mode session \
  --rotation-retries 3 \
  --output-dir "./output"
```

Proxy entries can be host:port, http://user:pass@host:port, or socks5://host:port. By default, each proxy maps deterministically to one User-Agent, Accept-Language, and Sec-CH-UA header set, so the same outbound IP does not appear with a different browser fingerprint on a later retry. Direct local mode uses one stable local fingerprint. Use `--no-randomize-headers` only for debugging.
Run a manual crawl for one explicit search term without a seed_url:
```bash
red-crawler crawl-search \
  --search-term "抗痘博主" \
  --max-accounts 20 \
  --search-scroll-rounds 8 \
  --creator-only \
  --min-followers 5000 \
  --min-relevance-score 0.7 \
  --db-path "./data/red_crawler.db" \
  --output-dir "./output"
```

To cover as many creators as possible under a given search term, raise `--search-scroll-rounds` and `--max-accounts`, and narrow the results back down with `--creator-only`, `--min-followers`, and `--min-relevance-score`. "Full" coverage here is best effort only: the platform's search results are not a stable, exhaustive interface, and scrolling too deep noticeably increases the chance of triggering anti-bot controls.
Optional note-page expansion:
```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --include-note-recommendations
```

List high-quality contactable creators from the SQLite database:
```bash
red-crawler list-contactable \
  --db-path "./data/red_crawler.db" \
  --min-relevance-score 0.7 \
  --limit 20
```

Run nightly auto-collection with queue, search bootstrap, seed promotion, and daily report output:
```bash
red-crawler collect-nightly \
  --db-path "./data/red_crawler.db" \
  --report-dir "./reports" \
  --cache-dir "./.cache/red-crawler" \
  --crawl-budget 12 \
  --daily-account-budget 12 \
  --daily-search-term-budget 2
```

`collect-nightly` now enforces a daily budget across all runs started on the same UTC day. If you schedule multiple slots, later runs automatically shrink or skip once the daily profile/search-term budget is used up.
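As a sketch, two cron slots on a Linux host would draw from the same daily budget; this assumes `red-crawler` is on cron's PATH, the working-directory path is a placeholder, and slot times follow the host's cron timezone (macOS users can use the launchd template described below instead):

```bash
# Append two nightly slots to the current user's crontab; the later run shrinks or
# skips automatically once the profile/search-term budget for that UTC day is spent.
( crontab -l 2>/dev/null; cat <<'EOF'
30 2 * * * cd /path/to/workdir && red-crawler collect-nightly --db-path "./data/red_crawler.db" --report-dir "./reports" --cache-dir "./.cache/red-crawler" --crawl-budget 12 --daily-account-budget 12 --daily-search-term-budget 2
30 4 * * * cd /path/to/workdir && red-crawler collect-nightly --db-path "./data/red_crawler.db" --report-dir "./reports" --cache-dir "./.cache/red-crawler" --crawl-budget 12 --daily-account-budget 12 --daily-search-term-budget 2
EOF
) | crontab -
```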
Run the same discovery flow manually without a seed_url:
```bash
red-crawler crawl-discover \
  --db-path "./data/red_crawler.db" \
  --report-dir "./reports" \
  --cache-dir "./.cache/red-crawler" \
  --crawl-budget 6
```

Export a weekly growth report and a contactable creator CSV:
```bash
red-crawler report-weekly \
  --db-path "./data/red_crawler.db" \
  --report-dir "./reports" \
  --days 7
```

Key outputs:

- manual crawl: `accounts.csv`, `contact_leads.csv`, `run_report.json`
- nightly automation: `reports/daily-run-report.json`, `reports/weekly-growth-report.json`, `reports/contactable_creators.csv`
- SQLite database: `data/red_crawler.db`
The OpenClaw skill for this project lives at openclaw-skills/red-crawler-ops/.
To install it from a local path, point OpenClaw at that folder, or copy the skill directory into your OpenClaw skills location and register the same path.
Use the OpenClaw skill actions in this order:
- `bootstrap` validates a local working directory and can run Chromium installation when explicitly requested.
- `crawl_homefeed` collects from the default fashion homefeed without requiring login.
- `login` creates an optional Playwright storage state explicitly.
- `crawl_seed` and `collect_nightly` can run without `--storage-state`; pass one only when you want to reuse an authenticated session.
- `report_weekly` and `list_contactable` run from the SQLite database and do not require `--storage-state`.
For long crawls, pass run_mode: background. The skill returns a job_id immediately, writes job state under ./.openclaw/red-crawler, and maintains HEARTBEAT.md for OpenClaw heartbeat polling. Use job_status, job_logs, or job_stop with the returned job_id for manual follow-up. After OpenClaw reports a pending heartbeat event to the user, call ack_event with its event_id to avoid duplicate notifications.
The skill does not clone repositories or create login sessions implicitly. Install the red-crawler CLI package first, point workspace_path at a local working directory, and run bootstrap only for reviewed local setup steps.
The package builds as a standard Python wheel and source distribution:
```bash
uv build
```

See docs/publishing.md for the release checklist and PyPI/TestPyPI commands.
For macOS local scheduling, use the template at docs/launchd/red-crawler.collect-nightly.plist.
Replace the placeholder paths:
`__WORKDIR__`, `__UV_BIN__`, `__STORAGE_STATE__`, `__DB_PATH__`, `__REPORT_DIR__`, `__CACHE_DIR__`, `__LOG_DIR__`
Then load it with:
```bash
launchctl unload ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist 2>/dev/null || true
cp docs/launchd/red-crawler.collect-nightly.plist ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist
launchctl load ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist
```
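To confirm the agent is registered after loading, `launchctl list` can be filtered for the label; this assumes the label inside the plist matches the template's file name:

```bash
# Check that the nightly agent shows up in the per-user launchd list
launchctl list | grep com.red-crawler.collect-nightly
```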