crawl-serve

把任意 URL 转换成结构化 JSON：自动判定「文章详情页」还是「文章列表页」，分别返回 markdown 或 items[]。仅以 HTTP 接口形式暴露。

抓取统一走 Playwright
文章模式基于 Readability + Turndown 转 Markdown
列表模式基于重复兄弟启发式提取卡片

安装

npm install
npx playwright install chromium

安装 npm install 时会自动尝试 playwright install chromium；若网络受限请手动执行上面第二行。

启动服务

# 启动chrome
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome-debug-profile

npm run build && npm run serve:built
# 或开发态
npm run serve

默认监听 http://0.0.0.0:3000，可通过环境变量覆盖：

变量	默认值	说明
`PORT`	`3000`	监听端口
`HOST`	`0.0.0.0`	监听地址
`MAX_BODY_BYTES`	`1048576`	POST body 上限（字节）
`REQUEST_TIMEOUT_MS`	`60000`	单请求最长处理时间

API

`GET /health`

健康检查，返回 {"ok":true}。

`POST /crawl` / `GET /crawl`

入参

{
  "url": "https://example.com/article",
  "mode": "auto"
}

字段	类型	必填	说明
`url`	`string`	✅	目标爬取地址
`mode`	`'article' \| 'feed' \| 'auto'`	❌	预设模式，默认 `auto`（自动检测）

GET 请求可改为 ?url=...&mode=auto 的形式。

返回

{
  title: string;            // 被爬网站的标题
  mode: 'article' | 'feed'; // 实际命中的模式
  items?: Array<{
    title: string;          // 文章标题
    summary: string;        // 文章小结（没有为空字符串）
    link: string;           // 文章详情页链接
    date: string;           // 文章日期，取最接近的字符串
    tag: string[];          // 文章标签
  }>;                       // 仅 mode === 'feed' 时有意义，article 模式下为 []
  markdown?: string;        // 仅 mode === 'article' 时返回页面主体内容
}

示例

curl -s -X POST http://localhost:3000/crawl \
  -H 'content-type: application/json' \
  -d '{"url":"https://example.com/post/1"}' | jq

curl -s "http://localhost:3000/crawl?url=https://news.example.com/&mode=feed" | jq

作为库使用

import { crawl } from 'html-to-markdown';

const result = await crawl({ url: 'https://example.com/post/1', mode: 'auto' });
console.log(result.title, result.mode, result.markdown ?? result.items);

项目结构

src/
  fetcher.ts  — Playwright + stealth 抓取层
  article.ts  — 文章详情页 → { title, markdown }
  feed.ts     — 列表页 → { title, items[] }
  index.ts    — 主流程编排（crawl）
  server.ts   — HTTP JSON 接口

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
.gitignore		.gitignore
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
test.md		test.md
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

crawl-serve

安装

启动服务

API

`GET /health`

`POST /crawl` / `GET /crawl`

入参

返回

示例

作为库使用

项目结构

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

crawl-serve

安装

启动服务

API

GET /health

POST /crawl / GET /crawl

入参

返回

示例

作为库使用

项目结构

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`GET /health`

`POST /crawl` / `GET /crawl`

Packages