WebCrawl is written in C#. Repository layout: Backend, Crawler, GuiCrawler, README.md.
WebCrawl is a web site spiderer I wrote many years ago after being frustrated by httrack, which is supposed to be the best. Httrack took far too much CPU time in GUI mode, didn't give good enough feedback in console mode, and didn't provide enough control overall. So I wrote a web crawler backend and a console frontend for it. It is much easier to use than httrack, and it has a number of features that httrack doesn't, such as rewriting URLs and page content on the fly with regular expressions or arbitrary processing code, which enables a few interesting and important scenarios. I use it to mirror sites for offline reading, mainly so that I can keep a copy even if the site gets taken down.
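The on-the-fly rewriting idea can be sketched with a plain regex replacement. This is not WebCrawl's actual API; the class, method, and host name below are hypothetical, just to illustrate mapping absolute links on a mirrored host to local relative paths:

```csharp
using System;
using System.Text.RegularExpressions;

public class RewriteDemo
{
    // Hypothetical rewrite rule: strip the mirrored host's prefix so
    // "https://example.com/docs/page.html" becomes "docs/page.html",
    // a path that works when browsing the mirror offline.
    public static string RewriteUrl(string url)
    {
        return Regex.Replace(url, @"^https?://example\.com/", "");
    }

    public static void Main()
    {
        Console.WriteLine(RewriteUrl("https://example.com/docs/page.html"));
        // prints "docs/page.html"
    }
}
```

In WebCrawl the same kind of rule can also be applied to page content, and arbitrary processing code can be plugged in where a single regex isn't enough.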

Maybe httrack has improved since I wrote this; I don't know.
