ArchiveBot, an IRC bot for archiving websites
Latest commit d895c28 Feb 11, 2017 @yipdw yipdw committed on GitHub Merge pull request #238 from falconkirtaran/master
Add to igsets and improve pipeline documentation


1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling.  ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.
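
As a rough illustration of that split, here is a minimal Python sketch.
It is not ArchiveBot's actual code: the Job, ControlNode, and Crawler
names, the "!archive <url>" command syntax, and the in-process queue are
all assumptions made for this example, and the crawl itself is a stub.

```python
import queue
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Job:
    """One archive request (hypothetical structure)."""
    url: str
    requested_by: str


def parse_command(nick: str, message: str) -> Optional[Job]:
    """Turn an IRC line into a Job, or None if it is not a valid request.

    The "!archive <url>" syntax is assumed here for illustration.
    """
    parts = message.split()
    if len(parts) != 2 or parts[0] != "!archive":
        return None
    parsed = urlparse(parts[1])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return Job(url=parts[1], requested_by=nick)


class ControlNode:
    """Accepts commands and does the bookkeeping; it never crawls."""

    def __init__(self):
        self.pending = queue.Queue()  # jobs waiting for a crawler
        self.status = {}              # url -> "queued" / "crawling" / "done"

    def handle_message(self, nick: str, message: str) -> str:
        job = parse_command(nick, message)
        if job is None:
            return f"{nick}: sorry, I did not understand that."
        self.pending.put(job)
        self.status[job.url] = "queued"
        return f"{nick}: queued {job.url}"


class Crawler:
    """Takes jobs from the control node and reports progress back."""

    def __init__(self, control: ControlNode):
        self.control = control

    def run_once(self) -> bool:
        """Process one pending job; return False if none was waiting."""
        try:
            job = self.control.pending.get_nowait()
        except queue.Empty:
            return False
        self.control.status[job.url] = "crawling"
        # A real crawler would fetch pages and write a WARC here.
        self.control.status[job.url] = "done"
        return True
```

In this sketch the control node only parses commands and tracks job
state, while the crawler pulls work from it, so crawlers could be added
or removed without touching the IRC-facing side.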

User's guide:
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server.  This means it has a poor distribution story.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended for
personal use, grab-site.  You can find it at

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid
* Cinch
* CouchDB
* Ember.js
* Redis
* Seesaw

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.