This is a reader for Heritrix (archive.org) site archives created with the MirrorWriter module.
It can read archives stored in the .zip, .rar, .tar.bz2, and .tar.gz formats, though .rar is the only recommended format in most situations. (It can also, of course, read the normal directories written by MirrorWriter.)
MirrorReader works with most MirrorWriter-written archives out of the box, but includes two built-in utilities that can be used to clean up such archives:
- fileNameFixer.php can be used to scan MirrorWriter-written directories and combine files and directories that should have the same name. (For instance, if MirrorWriter encounters a file "a" and then a file "a/b", it will write the file "a" normally, but then create a directory "a1/" to store the "b" file. fileNameFixer.php will move "a" into "a1/" as "index.html" and then rename "a1/" to "a/".)
- spider.php does basically the same thing Heritrix does (scanning and downloading copies of websites), but operates on existing archives instead. It scans the HTML of all files in an archive and looks for broken links, so if Heritrix missed a file, spider.php will typically download the file itself. (The most common case was, until recently, Heritrix missing files located in the srcset HTML attribute.)
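The merge described for fileNameFixer.php can be sketched as follows. This is a minimal illustration in Python, not the PHP code of fileNameFixer.php itself. The function name `merge_split_entry`, the fixed "1" suffix (taken from the "a1/" example above), and the choice to skip directories that already contain an index.html are all assumptions made for the sketch.

```python
import os
import shutil


def merge_split_entry(parent: str, name: str) -> bool:
    """Merge file `name` and directory `name1` under `parent` into directory `name`.

    Returns True if a merge was performed, False if the pattern was not found.
    """
    file_path = os.path.join(parent, name)
    dir_path = os.path.join(parent, name + "1")  # e.g. "a1/" next to the file "a"

    # Only act on the exact pattern: a plain file plus a sibling "<name>1" directory.
    if not (os.path.isfile(file_path) and os.path.isdir(dir_path)):
        return False

    index_path = os.path.join(dir_path, "index.html")
    if os.path.exists(index_path):
        # Assumption: never overwrite an existing index.html; leave it for manual review.
        return False

    shutil.move(file_path, index_path)  # "a"   -> "a1/index.html"
    os.rename(dir_path, file_path)      # "a1/" -> "a/"
    return True
```

Calling `merge_split_entry(root, "a")` on a directory containing "a" and "a1/b" would leave "a/index.html" and "a/b" behind. A real tool would presumably walk the whole archive and apply this check to every entry.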
How it Parses Files
- HTML - Almost all HTML is correctly processed, including malformed HTML. The following HTML elements are processed:
  - <link>'s href attribute (if <link>'s type attribute is "text/css" and rel attribute is "stylesheet")
  - <script>'s src attribute
  - <applet>'s src attributes
  - <img>'s srcset attribute
  - <area>'s href attribute
  - <th>'s background attribute
  - <option>'s value attribute, on a per-site basis
  - <style> tags are processed for CSS.
- CSS - CSS processing is simple: MirrorReader searches for url() references and replaces the URL inside each one. data: URIs are left untouched.
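Two of the steps above, pulling URLs out of a srcset attribute and rewriting url() references while skipping data: URIs, can be sketched as follows. This is an illustrative Python sketch, not MirrorReader's PHP code. The names `srcset_urls`, `rewrite_css_urls`, and `CSS_URL` are hypothetical, as is the regular expression. The srcset parser is deliberately simplified and assumes URLs contain no commas.

```python
import re

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def srcset_urls(srcset):
    """Return the URLs in a srcset value, dropping descriptors such as "2x" or "480w"."""
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])  # first token is the URL, the rest is the descriptor
    return urls


def rewrite_css_urls(css, rewrite):
    """Apply `rewrite` to every url() target in `css`, leaving data: URIs unchanged."""
    def replace(match):
        quote, target = match.group(1), match.group(2)
        if target.lower().startswith("data:"):
            return match.group(0)  # inline data, nothing to fetch or rewrite
        return "url(" + quote + rewrite(target) + quote + ")"

    return CSS_URL.sub(replace, css)
```

Here `rewrite` stands in for whatever maps an original URL to its location in the archive; for example, `rewrite_css_urls(css, lambda u: "/mirror/" + u)` would prefix every non-data URL with a hypothetical "/mirror/" path.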