ArchiveBot, an IRC bot for archiving websites

1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling.  ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.
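
As a rough illustration of that split, here is a minimal Python sketch.
It is not ArchiveBot's actual code: the class names, the "!archive"
command syntax, and the in-memory queue are all assumptions made for
the example, and the crawl itself is stubbed out.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """One archiving request (hypothetical structure)."""
    url: str
    requested_by: str
    status: str = "pending"


class ControlNode:
    """Stand-in for the control node: reads chat lines, keeps the books."""

    def __init__(self) -> None:
        # An in-memory queue is an assumption made to keep the sketch
        # self-contained; it is not how the real system stores jobs.
        self.pending: "queue.Queue[Job]" = queue.Queue()
        self.jobs: List[Job] = []

    def handle_message(self, nick: str, text: str) -> Optional[Job]:
        # Assumed command form: "!archive <url>". Anything else is ignored.
        parts = text.split()
        if len(parts) != 2 or parts[0] != "!archive":
            return None
        job = Job(url=parts[1], requested_by=nick)
        self.jobs.append(job)
        self.pending.put(job)
        return job


class Crawler:
    """Stand-in for a crawler: takes the next pending job and works on it."""

    def __init__(self, control: ControlNode) -> None:
        self.control = control

    def run_once(self) -> Optional[Job]:
        try:
            job = self.control.pending.get_nowait()
        except queue.Empty:
            return None
        # A real crawler would fetch the site and write a WARC here.
        job.status = "completed"
        return job


control = ControlNode()
control.handle_message("SketchCow", "!archive http://example.com/")
control.handle_message("SketchCow", "hello everyone")  # not a command
crawler = Crawler(control)
finished = crawler.run_once()
print(finished)
```

The point of the sketch is only the division of labour: the control
node never crawls, and the crawler never talks to the channel.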

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server.  This means it has a poor distribution story.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended for
personal use.  You can find it at https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

 vim:ts=2:sw=2:tw=72:et