# Chromium / Puppeteer site crawler


This crawler performs a breadth-first search (BFS) starting from a given site entry point. It never leaves the entry point's domain and never crawls a page more than once. Given a shared Redis host or cluster, the crawl can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with its URL, its outbound URLs, and its radius (link depth) from the origin.

## Installation

```sh
yarn
```

## Usage

### Basic

```sh
./crawl -u https://www.dadoune.com
```

### Distributed

```sh
# Terminal 1
./crawl -u https://www.dadoune.com
# Terminal 2
./crawl -r
```
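A minimal sketch of why a shared frontier makes the crawl distributable: every worker pops from, and pushes into, the same deduplicating queue. An in-memory class stands in for Redis here, and `SharedFrontier` and `worker` are illustrative names; the real crawler's Redis data layout may differ.

```javascript
// In-memory stand-in for the shared Redis frontier (assumption: the real
// crawler keeps the queue and the visited set in Redis).
class SharedFrontier {
  constructor() {
    this.queue = [];
    this.seen = new Set(); // dedupes URLs across ALL workers
    this.pending = 0; // URLs currently being fetched by some worker
  }
  push(url) {
    if (!this.seen.has(url)) {
      this.seen.add(url);
      this.queue.push(url);
    }
  }
  pop() {
    return this.queue.shift(); // undefined when empty
  }
}

// Each worker process/machine runs this same loop against the shared frontier.
async function worker(name, frontier, fetchLinks, log) {
  for (;;) {
    const url = frontier.pop();
    if (url === undefined) {
      if (frontier.pending === 0) return; // crawl fully drained
      await new Promise((r) => setTimeout(r, 1)); // another worker may add more
      continue;
    }
    frontier.pending++;
    log.push(`${name} crawled ${url}`);
    for (const link of await fetchLinks(url)) frontier.push(link);
    frontier.pending--;
  }
}
```

Because deduplication happens at push time, adding a second worker with `-r` only speeds the crawl up; no page is ever fetched twice.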

### Debug

```sh
DEBUG=crawler:* ./crawl -u https://www.dadoune.com
```

## Options

- `--maxRadius` or `-m`: the maximum link depth the crawler will explore from the entry URL.
- `--resume` or `-r`: resume a crawl after a process exits prematurely, or add more crawler processes to an existing crawl.
- `--url` or `-u`: the entry point URL that kicks the crawl off.
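As a hypothetical sketch, parsing the flags above could look like the following (assumption: the actual `crawl` script may parse its arguments differently, e.g. with a CLI library):

```javascript
// Illustrative parser for the documented flags; defaults are assumptions.
function parseArgs(argv) {
  const opts = { maxRadius: Infinity, resume: false, url: null };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case "--maxRadius":
      case "-m":
        opts.maxRadius = Number(argv[++i]); // depth limit from the entry URL
        break;
      case "--resume":
      case "-r":
        opts.resume = true; // attach to an existing crawl
        break;
      case "--url":
      case "-u":
        opts.url = argv[++i]; // crawl entry point
        break;
    }
  }
  return opts;
}
```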