WebCrawl is a web site spiderer I wrote many years ago after being frustrated by httrack, which is supposed to be the best. Httrack was taking far too much CPU time in GUI mode, didn't provide good enough feedback in console mode, and didn't provide enough control overall. So I wrote a web crawler backend and a console frontend for it. It is much easier to use than httrack, and it has a number of features that httrack doesn't, such as rewriting URLs and page content on the fly with regular expressions or arbitrary processing code, which enables a few interesting and important scenarios. I use it to mirror sites for offline reading, mainly so that I can ensure that I still have a copy even if the site gets taken down.
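
To make the rewriting idea concrete, here is a minimal sketch in Python of what on-the-fly URL and content rewriting can look like inside a crawler. This is not WebCrawl's actual code or API; every name below (the rule table, `rewrite_content`, the `extra_hook` callable) is a hypothetical illustration of applying regex rules or arbitrary processing code to each page as it is fetched.

```python
# Hypothetical sketch only -- none of these names come from WebCrawl.
# It illustrates regex rules plus an arbitrary processing hook applied
# to every page body before it is written to the mirror.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Each rule is (pattern, replacement) applied to page bodies,
# e.g. redirecting absolute links into the local mirror tree.
REWRITE_RULES = [
    (re.compile(r'https?://example\.com/'), 'mirror/'),
]

def rewrite_content(html: str, extra_hook=None) -> str:
    """Apply the regex rules, then an optional arbitrary processing hook."""
    for pattern, replacement in REWRITE_RULES:
        html = pattern.sub(replacement, html)
    if extra_hook is not None:
        html = extra_hook(html)
    return html

def crawl(start_url: str, max_pages: int = 10) -> dict[str, str]:
    """Fetch pages breadth-first, rewriting each body as it arrives."""
    seen, queue, pages = set(), [start_url], {}
    link_re = re.compile(r'href="([^"#]+)"')
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        with urlopen(url) as resp:
            body = resp.read().decode('utf-8', errors='replace')
        pages[url] = rewrite_content(body)  # rewrite before storing
        for href in link_re.findall(body):
            queue.append(urljoin(url, href))
    return pages
```

The point of the hook is that anything too irregular for a regex (fixing broken markup, stripping trackers, injecting a banner) can be handled by handing the crawler an arbitrary function instead of extending a pattern language.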

Maybe httrack has improved since I wrote this. I don't know.
