This document describes various tests that were done to assess the quality of the NL-menu capture (i.e. the WARC generated by wget, as descibed here).
Using warctools' warcvalid tool:
warcvalid nl-menu.warc.gz
This doesn't result in any errors.
Number of files in (extracted) ISO image:
find /var/www/www.nl-menu.nl -type f | wc -l
Result:
85644
Number of files scraped by wget (from dir tree created by wget):
find /home/johan/NL-menu/warc-wget-noextensionadjust/www.nl-menu.nl -type f | wc -l
Result:
84976
Number of files scraped by wget (from cdx file, counting lines with substring " 200 ", which should identify all sucessfully scraped files):
grep " 200 " nl-menu.cdx | wc -l
Result:
84976
Which (as expected) is identical to the count from the fs. Difference with respect to ISO image: 668 files. These files are part of the ISO, but they weren't scraped by wget.
Detailed comparison:
diff --brief -r /var/www/www.nl-menu.nl /home/johan/NL-menu/warc-wget-noextensionadjust/www.nl-menu.nl/ | grep "Only in /var/www/" > diffdir.txt
Result here. In particular, the following items are missing in the wget crawled version1:
- 499 .gif files
- 83 .html files
- 36 .txt files
Not entirely clear why this happens, could be orphaned resources that are not referenced by the site.
One possible explanation for the missing files is that they are not referenced by any of the html files (or, to be more precise, the html files that are discovered by wget by crawling from the root document).
As a first attempt to test this, we can search for references to the names of the missing files inside the html. For instance, using the grep tool this is how we can count all references to "1580.html":
grep -r -F "1580.html" /var/www/www.nl-menu.nl/ | wc -l
This returns 110 references, whereas:
grep -r -F "frameset_zoekresultaten.html" /var/www/www.nl-menu.nl/ | wc -l
returns 0.
The following script does this for the names2 of all missing files:
Result here.
From the results we see that 573 (90 %) of all missing file names have 0 references . This also explains why they were not discovered by wget: these files are simply not used by any of the content that results from crawling from the root document. For the remaining 64 files one of the following things happen:
-
nlmenu.nl/admin/nr-sects.txt: several html files reference the name of this file in a comment (so it makes sense that the file itself is not included in the crawl) -
nlmenu.en/resources/alfabet_provincies.html: referenced innlmenu.en/fset/provincie.html, but this file is missing by itself! -
nlmenu.nl/admin/opdracht_verzonden.html: referenced innlmenu.nl/admin/mailing.htmlin the value attribute of an input tag:<input type="hidden" name="nextpage" value="http://www.nl-menu.nl/nlmenu.nl/admin/opdracht_verzonden.html">So it seems wget doesn't parse input tags.
-
nlmenu.nl/images/alfabet_a.gif: referenced in various files (e.g.nlmenu.en/resources/old/alfabet_sites.html) as an argument of of JavaScript function:<area shape="rect" coords="22,0,34,13" href="/nlmenu.en/w3all/www_a.html" target="inhoud" onClick="MM_swapImage('document.alfabetje','document.alfabetje','/nlmenu.nl/images/alfabet_a.gif')">So this doesn't appear to be picked up by weget as a dependency either (totally there are 51 gif files in the same directory, which are all not included in the crawl for the same reason).
-
nlmenu.nl/images/gastenboek2.gif: referenced in various html files under directoriesnlmenu.en/admin/gastenboek/andnlmenu.nl/admin/gastenboek/. But neither of these directories (nor their content) is present in the (wget) archived version. The NL-menu homepage also doesn't appear to link to any guestbook feature, so this might be an orphaned section of the site (there are 2 more gifs here that are part of the guestbook). -
nlmenu.nl/images/pijlbeneden.gif: referenced by 222 html files (e.g.nlmenu.en/sections/315/361/361.html) as a JavaScript variable:var expandedWidget = "/nlmenu.nl/images/pijlbeneden.gif" -
nlmenu.nl/resources/marge.html: referenced in 6 html files (amongst whichnlmenu.nl/fset/admin.html) within a frame definition:<frame src="/nlmenu.nl/resources/marge.html" scrolling="no" noresize name="marge" marginwidth="6" marginheight="1">Question: does wget even handle frames? How?
All of the following pages don't work, and give error "The url http://www.nl-menu.nl/nlmenu.nl/fset/ could not be found in this collection":
Having arrived on this page, the other links at the right hand menu (FAQ, colofon, etc) don't work either! But behaviour seems to depend on previous page we arrived from. Very strange.
On "live" site, going here:
http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html
Redirects to:
This doesn't happen in the archived site! The resource is present in the WARC though:
warcdump NL-menu.warc.gz > NL-menu-dump.txt
grep "http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html" NL-menu-dump.txt
Result:
WARC-Target-URI:<http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html>
WARC-Target-URI:http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html
Looking at the source of the page, there's this:
<SCRIPT LANGUAGE="JavaScript">
<!--
var navPrinting = false;
if ((navigator.appName + navigator.appVersion.substring(0, 1)) == "Netscape4") {
navPrinting = (self.innerHeight == 0) && (self.innerWidth == 0);}
if ((self.name != 'text') && (self.location.protocol != "file:") && !navPrinting)
if (top.location.href == location.href) {
// deze pagina opnieuw openen, maar dan binnen frameset
top.location.href = "http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html?" + unescape(document.URL);
}
// -->
</SCRIPT>
So, the JavaScript re-opens the page within a frame set. So perhaps the problem occurs because the JavaScript fails to run in the archived version? Possibly related to this: pywb actually uses JavaScript to render each archived page inside an iframe. For example, "view source" on the archived homepage produces this (which is not the NL-menu source!) :
<!DOCTYPE html>
<html>
<head>
<style>
html, body
{
height: 100%;
margin: 0px;
padding: 0px;
border: 0px;
overflow: hidden;
}
</style>
<script src='http://localhost:8080/static/wb_frame.js'> </script>
<!-- default banner, create through js -->
<script src='http://localhost:8080/static/default_banner.js'> </script>
<link rel='stylesheet' href='http://localhost:8080/static/default_banner.css'/>
</head>
<body style="margin: 0px; padding: 0px;">
<div id="wb_iframe_div">
<iframe id="replay_iframe" frameborder="0" seamless="seamless" scrolling="yes" class="wb_iframe"></iframe>
</div>
<script>
var cframe = new ContentFrame({"url": "http://www.nl-menu.nl/nlmenu.nl/nlmenu.shtml" + window.location.hash,
"prefix": "http://localhost:8080/my-web-archive/",
"request_ts": "20040123200406",
"iframe": "#replay_iframe"});
</script>
</body>
</html>
Also tried: disable JavaScript in browser on "live" site: page is still displayed!
BUT: if I am on one of the category pages, e.g.:
and then click on "aanmelden" (right-hand menu), the page loads normaly, even though the URL is identical!! Again, opening the URL in a new tab still produces the error.
Open http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html: works on "live" site, fails on archived site.
Same as above (JavaScript).
The WARC was crawled from a locally reconstructed version of the site and not from the live web. This is something that should somehow be recorded in metadata. Using the warcdump tool from warctools:
warcdump NL-menu.warc.gz > NL-menu-dump.txt
Example record:
archive record at NL-menu.warc.gz:778658
Headers:
WARC-Type:request
WARC-Target-URI:<http://www.nl-menu.nl/nlmenu.nl/new/home.html>
Content-Type:application/http;msgtype=request
WARC-Date:2004-01-23T20:04:06Z
WARC-Record-ID:<urn:uuid:a733b6c2-f31f-4bb1-822c-63fa08cdb2e2>
WARC-IP-Address:127.0.0.1
WARC-Warcinfo-ID:<urn:uuid:8a56cc17-3a38-4146-8594-3f0f39d31d51>
WARC-Block-Digest:sha1:23FDBRG7W5PPQHPA7TZOJCFNFEV76X55
Content-Length:230
Content Headers:
Content-Type : application/http;msgtype=request
Content-Length : 230
Content:
GET /nlmenu\x2Enl/new/home\x2Ehtml HTTP/1\x2E1\xD\xAReferer\x3A http\x3A//www\x2Enl\x2Dmenu\x2Enl/nlmenu\x2Enl/resources/linkermenu\x2Ehtml\xD\xAUser\x2DAgent\x3A Wget/1\x2E19 \x28linux\x2Dgnu\x29\xD\xAAccept\x3A \x2A/\x2A\xD\xAAccept\x2DEncoding\x3A identity\xD\xAHost\x3A www\x2Enl\x2Dmenu\x2Enl\xD\xAConnection\x3A Keep\x2DAlive\xD\xA\xD\xA
...
Note this line:
WARC-IP-Address:127.0.0.1
The field WARC-IP-Address is defined in the WARC specification as:
The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content. An IPv4 address shall be written as a “dotted quad”; an IPv6 address shall be written as specified in [RFC4291]. For a HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record’s target-Uri.
In this case, from the value 127.0.0.1 (=localhost) we can see that the files inside the warc originate from a local copy.
Footnotes
-
The total number of items in the diff file is 637; expected number is 668! No idea why. ↩
-
Actually: names including the file paths, relative to the site root (e.g. nlmenu.nl/images/alfabet_dgeo.gif). Omitting the file path results in false positives, because the NL-menu directory tree contains many identically-named files that are in different directories. ↩