This is a simple site crawler that uses Selenium to work more efficiently with dynamic targets.
- Uses Selenium to parse dynamic client-side content.
- Parses sitemaps and robots.txt to discover additional URLs.
- Allows passing a user/pass pair or a cookie string for authentication.
- Allows exporting to CSV and HAR formats.
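The sitemap and robots.txt discovery mentioned above can be sketched with the Python standard library. This is only an illustration, not the crawler's actual code; the robots.txt and sitemap bodies are made-up samples, and the helper names are assumptions.

```python
# Sketch: discovering extra URLs from robots.txt and a sitemap, using only
# the Python standard library. The robots.txt/sitemap contents are inlined
# samples; a real crawler would fetch them from the target site.
import urllib.robotparser
import xml.etree.ElementTree as ET

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>
"""

def sitemaps_from_robots(robots_txt: str) -> list:
    """Return the Sitemap URLs declared in a robots.txt body."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.site_maps() or []

def urls_from_sitemap(sitemap_xml: str) -> list:
    """Extract every <loc> entry from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

print(sitemaps_from_robots(ROBOTS_TXT))
print(urls_from_sitemap(SITEMAP_XML))
```

Each sitemap URL found in robots.txt would be fetched and fed through `urls_from_sitemap`, and the resulting URLs added to the crawl queue.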
- Selenium standalone server running and bound to the default port:
java -jar selenium-server-standalone-3.141.5.jar
- Have Firefox and geckodriver installed.
bin/crawler crawl target=https://www.example.com max_depth=1 max_urls=5 => example.com.csv # list of all URLs and forms
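The CSV output listed above can be sketched as follows. The column names and sample rows are assumptions for illustration, not the crawler's actual schema.

```python
# Sketch of CSV output with one row per discovered URL or form.
# The "findings" rows and column names below are made-up examples.
import csv
import io

findings = [
    {"type": "url",  "url": "https://www.example.com/",      "method": "GET"},
    {"type": "form", "url": "https://www.example.com/login", "method": "POST"},
]

def to_csv(rows) -> str:
    """Serialize crawl findings to a CSV string with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["type", "url", "method"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(findings))
```

Writing to an in-memory buffer keeps the serializer easy to test; the CLI would write the same string to `example.com.csv`.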
- Make it more robust and handle edge cases.
- Add HAR export support.
- Add support for user/pass and cookie authentication (just add CLI options).
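For the HAR export item above, a minimal HAR 1.2 document is plain JSON with a fixed structure. The sketch below shows that shape; the entry is a made-up example request, and the `to_har` helper is an assumption, not existing crawler code.

```python
# Sketch: serializing captured requests as a minimal HAR 1.2 document.
# Field names follow the HAR 1.2 format; the sample entry is fabricated.
import json

def to_har(entries) -> str:
    """Wrap a list of HAR entry dicts in a HAR 1.2 log envelope."""
    har = {
        "log": {
            "version": "1.2",
            "creator": {"name": "crawler", "version": "0.1"},
            "entries": entries,
        }
    }
    return json.dumps(har, indent=2)

entry = {
    "startedDateTime": "2019-01-01T00:00:00.000Z",
    "time": 12,
    "request": {"method": "GET", "url": "https://www.example.com/",
                "httpVersion": "HTTP/1.1", "headers": [], "queryString": [],
                "cookies": [], "headersSize": -1, "bodySize": -1},
    "response": {"status": 200, "statusText": "OK", "httpVersion": "HTTP/1.1",
                 "headers": [], "cookies": [],
                 "content": {"size": 0, "mimeType": "text/html"},
                 "redirectURL": "", "headersSize": -1, "bodySize": -1},
    "cache": {},
    "timings": {"send": 0, "wait": 12, "receive": 0},
}

print(to_har([entry]))
```

Because HAR is just JSON, the export can reuse whatever request/response data the crawler already records per page.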
- Fork it (https://github.com/NeuraLegion/crawler/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
- Bar Hofesh - creator and maintainer