# Chromium / Puppeteer site crawler


This crawler performs a breadth-first search (BFS) starting from a given site entry point. It never leaves the entry point's domain and never crawls a page more than once. Given a shared Redis host or cluster, the crawl can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with its URL, its outbound URLs, and its radius (link depth) from the origin.

## Installation

```sh
yarn
```

## Usage

### Basic

```sh
./crawl -u https://www.dadoune.com
```

### Distributed

```sh
# Terminal 1
./crawl -u https://www.dadoune.com
# Terminal 2
./crawl -r
```
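A minimal sketch of why a shared frontier makes the crawl distributable: every worker pops from, and pushes into, the same deduplicating queue. An in-memory class stands in for Redis here, and `SharedFrontier` and `worker` are illustrative names; the real crawler's Redis data layout may differ.

```javascript
// In-memory stand-in for the shared Redis frontier (assumption: the real
// crawler keeps the queue and the visited set in Redis).
class SharedFrontier {
  constructor() {
    this.queue = [];
    this.seen = new Set(); // dedupes URLs across ALL workers
    this.pending = 0; // URLs currently being fetched by some worker
  }
  push(url) {
    if (!this.seen.has(url)) {
      this.seen.add(url);
      this.queue.push(url);
    }
  }
  pop() {
    return this.queue.shift(); // undefined when empty
  }
}

// Each worker process/machine runs this same loop against the shared frontier.
async function worker(name, frontier, fetchLinks, log) {
  for (;;) {
    const url = frontier.pop();
    if (url === undefined) {
      if (frontier.pending === 0) return; // crawl fully drained
      await new Promise((r) => setTimeout(r, 1)); // another worker may add more
      continue;
    }
    frontier.pending++;
    log.push(`${name} crawled ${url}`);
    for (const link of await fetchLinks(url)) frontier.push(link);
    frontier.pending--;
  }
}
```

Because deduplication happens at push time, adding a second worker with `-r` only speeds the crawl up; no page is ever fetched twice.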

### Debug

```sh
DEBUG=crawler:* ./crawl -u https://www.dadoune.com
```

## Options

- `--maxRadius` or `-m`: the maximum link depth the crawler will explore from the entry URL.
- `--resume` or `-r`: resume a crawl after a process exits prematurely, or add more crawler processes to an existing crawl.
- `--url` or `-u`: the entry point URL that kicks the crawl off.
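As a hypothetical sketch, parsing the flags above could look like the following (assumption: the actual `crawl` script may parse its arguments differently, e.g. with a CLI library):

```javascript
// Illustrative parser for the documented flags; defaults are assumptions.
function parseArgs(argv) {
  const opts = { maxRadius: Infinity, resume: false, url: null };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case "--maxRadius":
      case "-m":
        opts.maxRadius = Number(argv[++i]); // depth limit from the entry URL
        break;
      case "--resume":
      case "-r":
        opts.resume = true; // attach to an existing crawl
        break;
      case "--url":
      case "-u":
        opts.url = argv[++i]; // crawl entry point
        break;
    }
  }
  return opts;
}
```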