Chromium / Puppeteer site crawler

styled with prettier

This crawler performs a breadth-first search (BFS) starting from a given site entry point. It never leaves the entry point's domain and never crawls a page more than once. Given a shared Redis host or cluster, the crawl can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with its URL, its outbound URLs, and its radius (link depth) from the origin.

Installation

yarn

Usage

Basic

./crawl -u https://www.dadoune.com

Distributed

# Terminal 1
./crawl -u https://www.dadoune.com
# Terminal 2
./crawl -r

Debug

DEBUG=crawler:* ./crawl -u https://www.dadoune.com

Options

  • --maxRadius or -m: the maximum link depth the crawler will explore from the entry URL.
  • --resume or -r: resume a crawl after a process exits prematurely, or add additional crawlers to an existing crawl.
  • --url or -u: the entry-point URL that kicks off the crawl.
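For illustration, the three flags above could be handled by a minimal hand-rolled parser like the one below. This is a hypothetical sketch; the actual crawler may use a CLI library, and the defaults shown here are assumptions.

```javascript
// Hypothetical sketch of parsing the crawler's CLI flags (not the repo's code).
function parseArgs(argv) {
  const opts = { maxRadius: Infinity, resume: false, url: null }; // assumed defaults
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--maxRadius' || arg === '-m') opts.maxRadius = Number(argv[++i]);
    else if (arg === '--resume' || arg === '-r') opts.resume = true;
    else if (arg === '--url' || arg === '-u') opts.url = argv[++i];
  }
  return opts;
}
```

With `./crawl -u https://www.dadoune.com -m 3`, such a parser would yield a URL of `https://www.dadoune.com` and a maximum radius of 3.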
