To run the pipeline, you will need:
- a Python 3.3+ installation
- Pip (for Python 3.3+)
- seesaw (automatically installed by Pip)
- rsync
- wpull (automatically installed by Pip)
- youtube-dl
- PhantomJS 2.1.1 (or 1.9.8 if that doesn't work for you)
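Before going further, it can help to confirm the basic tools are visible. A quick check, assuming everything was installed onto your PATH:

```shell
# Report which of the core tools are present on PATH
for tool in python3 pip3 rsync; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```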
Quick install, for Debian and Debian-esque systems like Ubuntu:
sudo apt-get update
sudo apt-get install build-essential python3-dev python3-pip \
libxml2-dev libxslt-dev zlib1g-dev libssl-dev libsqlite3-dev \
libffi-dev git tmux fontconfig-config fonts-dejavu-core \
libfontconfig1 libjpeg-turbo8 libjpeg8 lsof ffmpeg youtube-dl \
autossh rsync
pip3 install --upgrade pip
NOTE: Installing phantomjs from apt often fails with an error on newer
versions of Ubuntu. If that happens, install it manually:
bzip2 -d phantomjs-2.1.1-linux-x86_64.tar.bz2
tar -xvf phantomjs-2.1.1-linux-x86_64.tar
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/bin/phantomjs
(You can also place it in /opt. You will need the path later.)
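Once the binary is copied (or /opt is on your PATH), you can confirm the install worked:

```shell
# Should print 2.1.1 if the manual install worked; adjust PATH first if
# you placed PhantomJS under /opt instead of /usr/bin
if command -v phantomjs >/dev/null 2>&1; then
    phantomjs --version
else
    echo "phantomjs not on PATH yet"
fi
```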
After you've installed all the software, set up a dedicated account
for ArchiveBot:
adduser archivebot
(You may also want to add the user to the sudo group.)
Log out of the server and log back in as user archivebot.
Then generate an SSH key pair:
ssh-keygen
[keep hitting Enter to accept the defaults]
cat ~/.ssh/
At this point, copy the public key output from your screen (it should start
with "ssh-rsa" followed by a bunch of letters and numbers), and put it in
an e-mail to David Yip (yipdw), letting him know that you're setting up a new
ArchiveBot pipeline, and that this is your new server's public key. Also let
him know a username you'd like for yourself, if you don't already have one.
He will set things up so your new pipeline server can coordinate with the
others, and will be allowed to upload finished WARCs to the Internet Archive.
Okay, back to the server stuff:
cd ~/
git clone
cd ArchiveBot
git submodule update --init
pip3 install --user -r pipeline/requirements.txt
If you get any error messages at this point, try to fix them before
continuing, as there may be incompatibilities between what ArchiveBot
expects and what your server actually has.
As user archivebot, in the FIRST tmux session:
autossh -C -L -N
As user archivebot, in the SECOND tmux session:
cd ~/ArchiveBot/pipeline
mkdir -p ~/warcs4fos
export RSYNC_URL=rsync://
export REDIS_URL=redis://
export FINISHED_WARCS_DIR=$HOME/warcs4fos
export PATH=$PATH:/opt/phantomjs-2.1.1-linux-x86_64/bin/
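The rsync:// and redis:// URLs above are site-specific; you'll get the details when your pipeline is registered. Before launching, a quick sanity check that nothing was left empty:

```shell
# Warn about any required variable that is still unset or empty
[ -n "$RSYNC_URL" ]          || echo "RSYNC_URL is not set" >&2
[ -n "$REDIS_URL" ]          || echo "REDIS_URL is not set" >&2
[ -n "$FINISHED_WARCS_DIR" ] || echo "FINISHED_WARCS_DIR is not set" >&2
```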
Now, think up a name for this new ArchiveBot pipeline. It will
appear on the publicly available pipeline status dashboard. It will
go in the command you enter next:
~/.local/bin/run-pipeline3 --disable-web-server \
--concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \
tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
You can adjust the number of jobs your server can handle in
--concurrent as needed.
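The tee at the end of the run command writes everything to a UTC-timestamped log file. You can preview the filename pattern it produces:

```shell
# Prints something like pipeline-2024-01-31T12_00_00Z.log
echo "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
```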
If you want your pipeline to only handle !ao/!archiveonly jobs, run it
with the AO_ONLY environment variable set:
AO_ONLY=1 ~/.local/bin/run-pipeline3 \
--disable-web-server --concurrent 2 \
YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \
tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
or, equivalently:
export AO_ONLY=1
~/.local/bin/run-pipeline3 --disable-web-server \
--concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \
tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
If your pipeline has large amounts of disk space (at least 100 GB dedicated to
ArchiveBot's processing), set the LARGE environment variable in the same way
as AO_ONLY above. Your pipeline will then accept jobs queued with the --large
option.
If you are getting errors about wpull, you may need to create a symbolic
link to it, like this:
ln -s /usr/bin/wpull /home/archivebot/ArchiveBot/pipeline/wpull
(Adjust the /home/YOUR_USER_HERE/YOUR-DIRECTORY/ path to match your setup.)
As user archivebot, in the THIRD tmux session:
export RSYNC_URL=rsync://
~/ArchiveBot/uploader/ $HOME/warcs4fos
If you start multiple pipelines, you can safely point them to the
same FINISHED_WARCS_DIR and run just one uploader.
Check out the ArchiveBot dashboard to make sure everything is
working like it ought to:
To gracefully stop the pipeline:
touch ~/ArchiveBot/pipeline/STOP
To gracefully stop the uploader, hit ctrl-c in its tmux session.
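One assumption worth noting: a leftover STOP file will be seen by a freshly started pipeline too, so clear it before restarting:

```shell
# Remove the leftover STOP marker before starting the pipeline again
rm -f ~/ArchiveBot/pipeline/STOP
```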
To upgrade, run:
pip3 install --user --upgrade -r pipeline/requirements.txt
youtube-dl is a command-line program for downloading videos from YouTube,
Vimeo, and other websites that feature embedded videos.
It is supposed to be installed on your system automatically through
requirements.txt, but just in case that doesn't work, here's how you can
get it installed:
sudo apt-get install python3-pip
pip3 install --upgrade youtube_dl
Or, for older versions of Python:
sudo apt-get install python-pip
pip install --upgrade youtube_dl
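Either way, you can confirm the pipeline will be able to find it:

```shell
# Print the installed version, or a note if it isn't on PATH yet
if command -v youtube-dl >/dev/null 2>&1; then
    youtube-dl --version
else
    echo "youtube-dl not on PATH yet"
fi
```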
PhantomJS is a command-line program that can more fully evaluate webpages,
including their JavaScript, and scrape the page content more like a human
viewing the page than like a bot. This is especially
important if the page has comments or other features that are activated
only by scrolling down the page (such as Twitter timelines). Right now,
ArchiveBot requires PhantomJS version 1.9.8 or 2.1.1.
In case it doesn't install or work for you automatically, here's a link to
a Gist containing instructions for forcing 1.9.8 to install on your system:
Note that this Gist assumes that your user has sudo privileges.
** STEP 6: Operate the Pipeline **
Some pointers for pipeline operators:
You can find the process ID for a job by running ps aux | grep $jobid.
That job has a job directory, which you can find in the data directory;
you can also get it out of ps. The job directory is wpull's scratch
space, where it puts files it's downloading and where it assembles its
WARC. It will move the WARC into the uploader folder when it reaches
the designated size.
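For example, to peek at a job's scratch space (the job id here is hypothetical, and the data directory path assumes the layout described in this document):

```shell
jobid=aaaabbbbccccdddd   # hypothetical; use the real id from the dashboard
# List the job's directory under the data dir, if it exists
ls -lh ~/ArchiveBot/pipeline/data/ 2>/dev/null | grep "$jobid" \
    || echo "no directory found for $jobid"
```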
If a job becomes stuck, find the process ID of its wpull instance and
kill it with kill -9. The pipeline will move the completed WARC and
upload it, and complete the job (you may want to note in #archivebot
that you did this). The job may be re-queued.
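A sketch of that, using the same ps-and-grep approach as above; the job id is hypothetical:

```shell
jobid=aaaabbbbccccdddd   # hypothetical; substitute the stuck job's id
# Find the PID of that job's wpull instance (grep -v grep drops our own grep)
pid=$(ps aux | grep wpull | grep "$jobid" | grep -v grep | awk '{print $2}' | head -n 1)
if [ -n "$pid" ]; then
    kill -9 "$pid"       # force-kill that job's wpull instance
else
    echo "no wpull process found for $jobid"
fi
```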
If you stop the pipeline or it crashes, you should remove the job
directories under pipeline/data, and clean out /tmp. The WARCs in the
pipeline directory are almost certainly incomplete and should not be
uploaded. The jobs cannot currently be resumed, and so the data dir and
/tmp are just consuming space.
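A cleanup sketch, assuming the default paths from this document; double-check the pipeline is actually stopped before deleting anything:

```shell
# See what the leftover job directories are consuming, then remove them
du -sh ~/ArchiveBot/pipeline/data/* 2>/dev/null
rm -rf ~/ArchiveBot/pipeline/data/*
# wpull also leaves temporary files in /tmp; review and remove those too
ls /tmp
```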
If the pipeline runs out of disk space, it will be unable to do any useful
work, and jobs will lock up or fail. In this case, check that the
uploader is functioning; if it is, use the du command in the data
directory to see what is taking up space. If the culprit is wpull.log,
truncate it to zero bytes (don't rm it; the space won't be freed while
wpull still holds the file open) to free up a little space.
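Truncating in place can be done like this (the log name is taken from the description above; run it from inside the job's directory):

```shell
# Shrink the log to zero bytes without removing the open file
truncate -s 0 wpull.log
```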
If the pipeline runs out of RAM, you will likely have to kill the job
that is consuming all the RAM; wpull instances will pause to avoid the
OOM killer being run. Consider creating a small swap file if your VM
does not have any swap.
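A sketch for adding a 1 GB swap file (the size is an assumption; scale it to your VM):

```shell
# Allocate, protect, format, and enable a swap file (requires root)
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# To make it persistent across reboots, add a line to /etc/fstab:
#   /swapfile none swap sw 0 0
```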