Chromium / Puppeteer site crawler

styled with prettier

This crawler performs a breadth-first search (BFS) starting from a given site entry point. It will not leave the entry point's domain, and it will not crawl a page more than once. Given a shared Redis host or cluster, the crawler can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with its URL, its outbound URLs, and its radius (link depth) from the origin.
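A minimal sketch of the BFS bookkeeping described above: a visited map prevents crawling any page twice, and each discovered page records its radius from the entry point. This is an illustration only, not the crawler's actual code; the real crawler discovers links with Puppeteer and shares its visited state via Redis, whereas here the link graph is a plain in-memory object (and the same-domain filter is omitted for brevity). The function name `bfsCrawl` and the example URLs are hypothetical.

```javascript
// Hypothetical sketch of the crawl's BFS logic. The real crawler
// fetches outbound links with Puppeteer and deduplicates via a
// shared Redis store; here `linkGraph` maps url -> [outbound urls].
function bfsCrawl(linkGraph, entryUrl, maxRadius = Infinity) {
  const pages = new Map(); // url -> radius (link depth) from the entry point
  const queue = [[entryUrl, 0]]; // FIFO queue drives the breadth-first order
  while (queue.length > 0) {
    const [url, radius] = queue.shift();
    // Never crawl a page more than once; never exceed the max radius.
    if (pages.has(url) || radius > maxRadius) continue;
    pages.set(url, radius);
    for (const outbound of linkGraph[url] || []) {
      if (!pages.has(outbound)) queue.push([outbound, radius + 1]);
    }
  }
  return pages;
}
```

Because the traversal is breadth-first, the first time a page is reached is always along a shortest link path, so the recorded radius is minimal.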

Installation

yarn

Usage

Basic

./crawl -u https://www.dadoune.com

Distributed

# Terminal 1
./crawl -u https://www.dadoune.com
# Terminal 2
./crawl -r

Debug

DEBUG=crawler:* ./crawl -u https://www.dadoune.com

Options

  • --maxRadius or -m: the maximum link depth the crawler will explore from the entry URL.
  • --resume or -r: resume crawling after a process exited prematurely, or add additional crawlers to an existing crawl.
  • --url or -u: the entry point URL that kicks off the crawl.
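The flags above can be combined. For example, to limit the crawl to pages within two links of the entry point (the URL here is just the example used throughout this README):

```shell
# Crawl no deeper than a radius of 2 from the entry URL
./crawl -u https://www.dadoune.com -m 2
```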
