
AdamMil/WebCrawl

WebCrawl is a website spider I wrote many years ago after being frustrated by httrack, which is supposed to be the best. Httrack took far too much CPU time in GUI mode, didn't provide good enough feedback in console mode, and didn't provide enough control overall, so I wrote a web crawler backend and a console frontend for it. It is much easier to use than httrack, and it has a number of features that httrack lacks, such as rewriting URLs and page content on the fly with regular expressions or arbitrary processing code, which enables a few interesting and important scenarios. I use it to mirror sites for offline reading, mainly to ensure that I still have a copy even if the site gets taken down.

Maybe httrack has improved since I wrote this. I don't know.
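
To give a rough idea of what rewriting URLs on the fly with regular expressions can look like, here is a minimal Python sketch. It is not WebCrawl's actual code or API: the rule list, the function names `rewrite_url` and `rewrite_links`, and the example URLs are all assumptions made up for illustration.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement) pairs applied in order.
# These are illustrative only and are not taken from WebCrawl.
URL_RULES = [
    (re.compile(r"^http://"), "https://"),   # upgrade the scheme
    (re.compile(r"/index\.html$"), "/"),     # drop a trailing index.html
]

# Matches href="..." or src='...' attributes. A regex is a simplification;
# a real crawler would more likely use an HTML parser.
LINK_ATTR = re.compile(r"""(\b(?:href|src)=)(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_url(url, rules=URL_RULES):
    """Apply each rewrite rule to the URL in turn and return the result."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def rewrite_links(html, rules=URL_RULES):
    """Rewrite the URL inside every href/src attribute found in the HTML."""
    def fix(match):
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{rewrite_url(url, rules)}{quote}"
    return LINK_ATTR.sub(fix, html)


if __name__ == "__main__":
    print(rewrite_links('<a href="http://example.com/docs/index.html">Docs</a>'))
```

Keeping the rules as data means the same pass can be reused for page content as well as URLs; the "arbitrary processing code" case would amount to swapping the rule loop for a user-supplied function.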
