Archiving Public Content [Default]
This is the default (lax) mode, intended for archiving public (non-secret) URLs without authenticating the headless browser. Use this mode if you're archiving news articles, audio, video, browser bookmarks, etc. to a folder published on your webserver. This allows you to access and link to content on http://your.archive.com/archive... after the originals go down.
This mode should not be used for archiving entire browser history or authenticated private content like Google Docs, paywalled content, invite-only subreddits, private photo share urls, etc.
Archiving Private Content
WARNING! Advanced users only
ArchiveBox is able to archive content that requires authentication or cookies, but it comes with some caveats. Create dedicated logins for archiving to access paywalled content, private forums, LAN-only content, etc. then share them with ArchiveBox via Chrome profile + cookies.txt file.
To get started, set the Chrome user data directory option and COOKIES_FILE to point to a Chrome user folder that has your sessions and a wget cookies.txt file respectively.
If you're importing private links or authenticated content, you probably don't want to share your archive folder publicly on a webserver, so don't follow the Publishing Your Archive instructions unless you are only serving it on a trusted LAN or have some sort of authentication in front of it. Make sure to point ArchiveBox to an output folder with conservative permissions, as it may contain archived content with secret session tokens or pieces of your user data. You may also wish to encrypt the archive using an encrypted disk image or filesystem like ZFS as it will contain all requests and response data, including session keys, user data, usernames, etc.
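As a minimal sketch of "conservative permissions", the helper below creates an output folder that only its owner can read or enter. The function name and the example path are hypothetical, not part of ArchiveBox.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create an archive output folder only the owner can access.
make_private_dir() {
  local dir="$1"
  # Run mkdir under umask 077 in a subshell so any newly created
  # parent folders are owner-only too, without changing the caller's umask.
  (umask 077 && mkdir -p "$dir") || return 1
  # chmod covers the case where the folder already existed with looser permissions.
  chmod 700 "$dir"
}

# Example (placeholder path):
# make_private_dir "$HOME/archivebox-data"
```

Point ArchiveBox at the resulting folder and run it as the user that owns it.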
⚠️ Things to watch out for: ⚠️
- any cookies / secret state present in a Chrome user profile or cookies.txt file may be reflected in server responses and saved in the Snapshot output (e.g. in headers.json), making it visible in cleartext to anyone viewing the Snapshot (don't use your personal Chrome profile for archiving, or people viewing your archive can then authenticate as you!)
- any secret tokens embedded in URLs (e.g. secret invite links, Google Doc URLs, etc.) will be visible on archive.org, as the URLs are not filtered when saving to archive.org (submitting to Archive.org can be disabled entirely in the config)
- the domain portion of archived URLs is sent to a favicon service in order to retrieve an icon more reliably than a janky internal implementation would be able to (if leaking domains is a concern, favicon fetching can be disabled entirely in the config)
- viewing malicious archived JS saved verbatim with the Wget extractor could allow an attacker to access your other archive items and the admin interface (viewed WGET-archived JS currently executes on the same origin as the admin panel and a fix is pending; set SAVE_WGET=False to disable WGET saving entirely, or avoid viewing WGET Snapshot output directly in a browser)
An example of a session cookie reflected in
headers.json visible in the archive.
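To audit existing Snapshots for this kind of leak, a rough sketch like the following lists every headers.json that mentions a cookie header. It assumes header names appear as plain text inside headers.json and that Snapshots live under ./archive; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list headers.json files under an archive folder
# that mention cookies (matches both "Cookie" and "Set-Cookie").
find_cookie_headers() {
  local archive_dir="${1:-./archive}"
  # -r recurse, -i case-insensitive, -l print only the names of matching files,
  # --include limits the search to files named headers.json
  grep -r -i -l --include='headers.json' -e 'cookie' "$archive_dir"
}

# Example:
# find_cookie_headers ./archive
```

Any file it prints deserves a manual look before the archive is shared with anyone else.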
Do not run as root
Do not run ArchiveBox as root for a number of reasons:
- Chrome will execute as root and fail immediately because Chrome sandboxing is pointless when the data directory is opened as root (do not set CHROME_SANDBOX=False just to bypass that error!)
- All dependencies will be run as root; if any of them has a vulnerability that's exploited by sites you're archiving, you're opening yourself up to full system compromise
- ArchiveBox does lots of HTML parsing, filesystem access, and shell command execution. A bug in any one of those subsystems could potentially lead to deleted/damaged data on your hard drive, or full system compromise unless restricted to a user that only has permissions to access the directories needed
- Do you really trust a project created by a Github user called
😉? Why give a random program off the internet root access to your entire system? (I don't have malicious intent, I'm just saying in principle you should not be running random Github projects as root)
Instead, you should run ArchiveBox as your normal user, or create a user with less privileged access:
groupadd -r archivebox   # useradd -g requires the group to exist already
useradd -r -g archivebox -G audio,video archivebox   # the audio & video groups are used by chrome
mkdir -p /home/archivebox/data
chown -R archivebox:archivebox /home/archivebox
...
sudo -u archivebox archivebox add ...
Previously, a footgun was provided for anyone who absolutely had to run it as root: setting ALLOW_ROOT=True via environment variable or in your ArchiveBox.conf file. This footgun option was removed (I'm sorry, the support burden of helping people who messed up their systems by running this as root was too high).
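If you wrap ArchiveBox in your own cron jobs or scripts, a small guard can refuse to continue when invoked as root. This is only a sketch; both function names are hypothetical and are not ArchiveBox features.

```bash
#!/usr/bin/env bash
# Hypothetical guard for wrapper scripts.
# is_root: succeeds if the given UID (default: the current user's) is 0.
is_root() {
  [ "${1:-$(id -u)}" -eq 0 ]
}

# run_unprivileged: run the given command only when not root.
run_unprivileged() {
  if is_root; then
    echo "Refusing to run as root; switch to a dedicated non-root user." >&2
    return 1
  fi
  "$@"
}

# Example:
# run_unprivileged archivebox add ...
```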
The ArchiveBox database is an unencrypted, uncompressed SQLite3 index.sqlite3 file on disk, and as such does not require an authenticated admin SQL login to access (as PostgreSQL/MySQL would). Make sure to protect your database file adequately, as anyone who can read it can read your entire collection contents. Passwords for the admin users are stored as salted, PBKDF2-hashed strings in the database.
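A quick way to confirm the database file is not readable by other users is to inspect its mode bits. The sketch below assumes the file sits at ./index.sqlite3 unless a path is given; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if group and others have no access to the file.
db_is_private() {
  local db="${1:-./index.sqlite3}"
  # In the ls mode string (e.g. -rw-------), characters 5-10 are the
  # group and other permission bits; all dashes means no access.
  [ "$(ls -ld "$db" | cut -c5-10)" = "------" ]
}

# Example:
# db_is_private ./index.sqlite3 || chmod 600 ./index.sqlite3
```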
How much are you planning to archive? Only a few bookmarked articles, or thousands of pages of browsing history a day? If it's only 1-50 pages a day, you can probably just stick it in a normal folder on your hard drive, but if you want to go over 100 pages a day, you will likely want to put your archive on a compressed/deduplicated/encrypted disk image or filesystem like ZFS. Other distributed/networked/checksummed filesystems that have also been reported to work (but are not technically officially supported) include SMB, NFS, Ceph, Unraid, and BTRFS. Make sure the filesystem you're using supports FSYNC. Some filesystems are unable to store more than a certain number of directory entries, and your total number of snapshots in
./archive may be capped as a result. Some other filesystems continue to function but suffer degraded performance when the directory entry count gets too high. Generally this isn't an issue unless you have more than ~20,000 Snapshot folders in ./archive.
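To see how close a collection is to that range, you can count the immediate subfolders of the archive directory. A minimal sketch, assuming one folder per Snapshot directly under ./archive (the function name is hypothetical):

```bash
#!/usr/bin/env bash
# Hypothetical helper: count Snapshot folders directly under the archive directory.
count_snapshot_dirs() {
  local archive_dir="${1:-./archive}"
  # Only immediate child directories are counted, not nested folders or files.
  # tr strips the padding some wc implementations add around the number.
  find "$archive_dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d '[:space:]'
}

# Example:
# count_snapshot_dirs ./archive
```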
Unless --delete is passed to archivebox remove, Snapshots removed from the index remain in the filesystem, and their ./archive/<timestamp> folders need to be deleted manually to be fully removed. Imported URLs are also recorded separately in ./logs and in the Sonic full-text index under ./sonic, and these should be removed manually as well to clear all traces of a URL added by accident. You can search the filesystem for a URL you're trying to remove using
grep -a -r "https://example.com/url/to/search/for".
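The search can be wrapped in a small helper that prints only the names of files still containing the URL. This sketch adds -F so that characters such as "." and "?" in the URL are matched literally rather than as a pattern; the function name and default directory are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files under a data folder that still contain a URL.
find_url_traces() {
  local url="$1" data_dir="${2:-.}"
  # -a treat binary files as text, -r recurse, -l list file names only,
  # -F match the URL as a fixed string instead of a regular expression
  grep -a -r -l -F -e "$url" "$data_dir"
}

# Example:
# find_url_traces "https://example.com/url/to/search/for" .
```

An empty result (with a non-zero exit status from grep) means no file under that folder contains the URL as plain text.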
Consider carefully what permissions to apply to your archive folder. Limit access to the fewest possible users by checking folder ownership and setting OUTPUT_PERMISSIONS accordingly. Generally the archive/ folder and ArchiveBox.conf file must be owned and writable by the archivebox user or a dedicated non-root user.
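Checking ownership can be done with find. The sketch below lists everything under a data folder that is not owned by the expected user; the function name is hypothetical, and the archivebox user name in the example is taken from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files and folders NOT owned by the given user.
list_not_owned_by() {
  local user="$1" data_dir="${2:-.}"
  # "! -user" inverts the ownership test; a numeric UID is accepted as well.
  find "$data_dir" ! -user "$user"
}

# Example:
# list_not_owned_by archivebox /home/archivebox/data
```

No output means every entry under the folder belongs to that user.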
Are you publishing your archive? If so, make sure you use the built-in archivebox server or only serve the static export as static HTML (don't accidentally serve it as PHP or CGI, or you may execute malicious archived files by accident). Regardless of how you serve it, make sure to put it on its own domain not shared with other services. This avoids cookies leaking between your main domain and domains hosting content you don't control. A common practice is to put user-provided / untrusted archived content on completely separate top-level domains from anything else (as Google and GitHub do for user-provided content).
Published archives automatically include a robots.txt with Disallow: / to block search engines from indexing them. You may still wish to publish your contact info in the index footer using FOOTER_INFO so that you can respond to any DMCA and copyright takedown notices if you accidentally rehost copyrighted content.
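If you serve a static export yourself, it may be worth confirming that the blanket rule is actually present. A simplified sketch, assuming the rule lives in a robots.txt file whose path you pass in; it only looks for a bare "Disallow: /" line and does not check which User-agent group the line belongs to. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed if the file contains a blanket "Disallow: /" line.
robots_blocks_all() {
  local robots_file="$1"
  # -q quiet, -i case-insensitive, -E extended regex.
  # The pattern requires the path to be exactly "/" so "Disallow: /private" does not match.
  grep -q -i -E '^[[:space:]]*Disallow:[[:space:]]*/[[:space:]]*$' "$robots_file"
}

# Example (placeholder path):
# robots_blocks_all /path/to/published/robots.txt && echo "indexing is blocked"
```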