HTTP Crawler
crawler

This is a simple site crawler that uses Selenium to work more efficiently with dynamic targets.

Features

  1. Uses Selenium to parse dynamic client-side content.
  2. Parses the sitemap and robots.txt to find additional data.
  3. Accepts a user/pass pair or a cookie string for authentication.
  4. Exports results to CSV and HAR formats.
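The crawler itself is written in Crystal, but the robots.txt/sitemap discovery in feature 2 can be illustrated language-agnostically. This is a minimal Python sketch using the standard library's `urllib.robotparser`; the robots.txt content and URLs are invented for the example and are not part of this project.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration: one disallowed path and one sitemap.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler can use this to skip disallowed paths...
print(rp.can_fetch("*", "https://www.example.com/admin/"))  # False
print(rp.can_fetch("*", "https://www.example.com/about"))   # True

# ...and to discover sitemap URLs that list additional pages to visit.
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```

In a real run the robots.txt would be fetched from the target host rather than parsed from a string.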

Installation

Prerequisites

  1. selenium-server-standalone running and bound to the default port, e.g.:
     java -jar selenium-server-standalone-3.141.5.jar
  2. Firefox and geckodriver installed.

Then build the binary with shards:

shards build

Usage

bin/crawler crawl target=https://www.example.com max_depth=1 max_urls=5
=> example.com.csv # list of all URLs and forms
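The README only says the exported CSV lists "all URLs and forms" and does not document its columns. As a hedged sketch of consuming such an export, here is a short Python snippet; the column names ("url", "method") and the sample rows are assumptions for illustration only, not the crawler's actual format.

```python
import csv
import io

# Stand-in for example.com.csv; the header and rows are hypothetical.
sample = io.StringIO(
    "url,method\n"
    "https://www.example.com/,GET\n"
    "https://www.example.com/login,POST\n"
)

# Read the export and collect the discovered URLs.
rows = list(csv.DictReader(sample))
urls = [row["url"] for row in rows]
print(urls)
```

With a real export you would open the file instead of the in-memory sample and adapt the column names to whatever header the crawler writes.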

Development

  • Make the crawler more robust and handle edge cases.
  • Add export to HAR support.
  • Add support for user/pass and cookie authentication (just add CLI options).

Contributing

  1. Fork it (https://github.com/NeuraLegion/crawler/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors