Quality assessment of archived site

This document describes various tests that were done to assess the quality of the NL-menu capture (i.e. the WARC generated by wget, as described here).

Validate the WARC

Using warctools' warcvalid tool:

warcvalid nl-menu.warc.gz

This doesn't result in any errors.
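
As an additional quick sanity check, the number of response records could also be counted directly from the record-compressed WARC (a rough, hypothetical check, assuming GNU zcat and grep):

# Rough count of response records; -a treats the decompressed stream as text.
# Lines inside response bodies that happen to start with "WARC-Type: response"
# would be counted too, so treat the result as an approximation.
zcat nl-menu.warc.gz | grep -ac "^WARC-Type: *response"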

Compare number of files in archive against ISO image

Number of files in (extracted) ISO image:

find /var/www/www.nl-menu.nl -type f | wc -l

Result:

85644

Number of files scraped by wget (from dir tree created by wget):

find /home/johan/NL-menu/warc-wget-noextensionadjust/www.nl-menu.nl -type f | wc -l

Result:

84976

Number of files scraped by wget (from the cdx file, counting lines with substring " 200 ", which should identify all successfully scraped files):

grep " 200 " nl-menu.cdx | wc -l

Result:

84976

This is (as expected) identical to the count from the file system. The difference with respect to the ISO image is 668 files: these files are part of the ISO, but they weren't scraped by wget.
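
As an aside, matching on the literal substring " 200 " could in principle also hit other CDX fields; a stricter (hypothetical) variant, assuming the common 11-field CDX layout in which the HTTP status code is the fifth field, would be:

# Count CDX entries whose status field is 200; NR > 1 skips the " CDX ..."
# header line (assumed to be present).
awk 'NR > 1 && $5 == "200"' nl-menu.cdx | wc -l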

Detailed comparison:

diff --brief -r /var/www/www.nl-menu.nl /home/johan/NL-menu/warc-wget-noextensionadjust/www.nl-menu.nl/ | grep "Only in /var/www/" > diffdir.txt

Result here. In particular, the following items are missing from the wget-crawled version[^2]:

  • 499 .gif files
  • 83 .html files
  • 36 .txt files

It is not entirely clear why this happens; these could be orphaned resources that are not referenced by the site.
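
The per-extension counts above can be reproduced from diffdir.txt with something like the following (a hypothetical sketch that tallies the trailing extension of each "Only in ..." line; entries without an extension are simply skipped):

# Tally the missing items in diffdir.txt by file extension.
grep -o '\.[a-zA-Z0-9]*$' diffdir.txt | sort | uniq -c | sort -rn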

Search for references to missing files in html

One possible explanation for the missing files is that they are not referenced by any of the html files (or, to be more precise, the html files that are discovered by wget by crawling from the root document).

As a first attempt to test this, we can search the html for references to the names of the missing files. For instance, using grep, this is how we can count all references to "1580.html":

grep -r -F "1580.html" /var/www/www.nl-menu.nl/ | wc -l

This returns 110 references, whereas:

grep -r -F "frameset_zoekresultaten.html" /var/www/www.nl-menu.nl/ | wc -l

returns 0.

The following script does this for the names[^3] of all missing files:

checkmissingitems.sh

Result here.
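
The actual script is linked above; a minimal sketch of the approach (assuming a hypothetical input file missingfiles.txt with one site-relative path per line) could look like this:

#!/bin/bash
# For each missing file (path relative to the site root), count the lines in
# the extracted ISO image that reference it, mirroring the grep used above.
while read -r relpath; do
    count=$(grep -r -F "$relpath" /var/www/www.nl-menu.nl/ | wc -l)
    echo "$relpath,$count"
done < missingfiles.txt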

From the results we see that 573 (90%) of all missing file names have 0 references. This also explains why they were not discovered by wget: these files are simply not used by any of the content that results from crawling from the root document. For each of the remaining 64 files, one of the following things happens:

  • nlmenu.nl/admin/nr-sects.txt: several html files reference the name of this file in a comment (so it makes sense that the file itself is not included in the crawl)

  • nlmenu.en/resources/alfabet_provincies.html: referenced in nlmenu.en/fset/provincie.html, but that file is itself missing!

  • nlmenu.nl/admin/opdracht_verzonden.html: referenced in nlmenu.nl/admin/mailing.html in the value attribute of an input tag:

    <input type="hidden" name="nextpage" value="http://www.nl-menu.nl/nlmenu.nl/admin/opdracht_verzonden.html">

    So it seems wget doesn't follow URLs in the value attribute of input tags.

  • nlmenu.nl/images/alfabet_a.gif: referenced in various files (e.g. nlmenu.en/resources/old/alfabet_sites.html) as an argument of a JavaScript function:

    <area shape="rect" coords="22,0,34,13" href="/nlmenu.en/w3all/www_a.html" target="inhoud" onClick="MM_swapImage('document.alfabetje','document.alfabetje','/nlmenu.nl/images/alfabet_a.gif')">

    So this doesn't appear to be picked up by wget as a dependency either (in total there are 51 gif files in the same directory, none of which are included in the crawl for the same reason).

  • nlmenu.nl/images/gastenboek2.gif: referenced in various html files under the directories nlmenu.en/admin/gastenboek/ and nlmenu.nl/admin/gastenboek/. But neither of these directories (nor their content) is present in the (wget) archived version. The NL-menu homepage also doesn't appear to link to any guestbook feature, so this might be an orphaned section of the site (there are 2 more gifs here that are part of the guestbook).

  • nlmenu.nl/images/pijlbeneden.gif: referenced by 222 html files (e.g. nlmenu.en/sections/315/361/361.html) as a JavaScript variable:

    var expandedWidget = "/nlmenu.nl/images/pijlbeneden.gif"

  • nlmenu.nl/resources/marge.html: referenced in 6 html files (amongst which nlmenu.nl/fset/admin.html) within a frame definition:

    <frame src="/nlmenu.nl/resources/marge.html" scrolling="no" noresize name="marge" marginwidth="6" marginheight="1">

    Question: does wget even handle frames? How? (A possible way to test this is sketched after this list.)
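
One way to answer the frame question empirically (a hypothetical test, assuming the locally reconstructed site is still reachable at www.nl-menu.nl, and ideally using the same wget options as the original crawl) would be to crawl just one page that contains the frame definition and check whether the frame target is fetched:

# Crawl nlmenu.nl/fset/admin.html one level deep into a scratch directory
# and check whether the frame target marge.html was picked up.
wget --recursive --level=1 --page-requisites --directory-prefix=/tmp/frametest \
     http://www.nl-menu.nl/nlmenu.nl/fset/admin.html
find /tmp/frametest -name "marge.html"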

Pages/resources that are not available in Pywb

None of the following pages work; each gives the error "The url http://www.nl-menu.nl/nlmenu.nl/fset/ could not be found in this collection":

"aanmelden" / "wijzigen" from home page, right-hand menu:

http://localhost:8080/my-web-archive/20040123200406/http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html?http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html

Having arrived on this page, the other links in the right-hand menu (FAQ, colofon, etc.) don't work either! The behaviour seems to depend on the page we arrived from. Very strange.

On "live" site, going here:

http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html

Redirects to:

http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html?http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html

This doesn't happen in the archived site! The resource is present in the WARC though:

warcdump NL-menu.warc.gz > NL-menu-dump.txt
grep "http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html" NL-menu-dump.txt

Result:

WARC-Target-URI:<http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html>
WARC-Target-URI:http://www.nl-menu.nl/nlmenu.nl/admin/aanmeldform.html
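
As an additional check, the CDX index can also be consulted (a hypothetical step) to see what was recorded for this URL, including its status code:

# Look up the aanmeldform.html capture in the CDX index.
grep "aanmeldform.html" nl-menu.cdx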

Looking at the source of the page, there's this:

<SCRIPT LANGUAGE="JavaScript">
<!--
        var navPrinting = false;
        if ((navigator.appName + navigator.appVersion.substring(0, 1)) == "Netscape4") {
            navPrinting = (self.innerHeight == 0) && (self.innerWidth == 0);}
        if ((self.name != 'text') && (self.location.protocol != "file:") && !navPrinting)
        if (top.location.href == location.href) {
                // deze pagina opnieuw openen, maar dan binnen frameset
                top.location.href = "http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html?" + unescape(document.URL);
        }
// -->
</SCRIPT>

So the JavaScript re-opens the page within a frameset. Perhaps the problem occurs because the JavaScript fails to run in the archived version? Possibly related to this: pywb itself uses JavaScript to render each archived page inside an iframe. For example, "view source" on the archived homepage produces this (which is not the NL-menu source!):

<!DOCTYPE html>
<html>
<head>
<style>
html, body
{
height: 100%;
margin: 0px;
padding: 0px;
border: 0px;
overflow: hidden;
}

</style>
<script src='http://localhost:8080/static/wb_frame.js'> </script>

<!-- default banner, create through js -->
<script src='http://localhost:8080/static/default_banner.js'> </script>
<link rel='stylesheet' href='http://localhost:8080/static/default_banner.css'/>


</head>
<body style="margin: 0px; padding: 0px;">
<div id="wb_iframe_div">
<iframe id="replay_iframe" frameborder="0" seamless="seamless" scrolling="yes" class="wb_iframe"></iframe>
</div>
<script>
var cframe = new ContentFrame({"url": "http://www.nl-menu.nl/nlmenu.nl/nlmenu.shtml" + window.location.hash,
                                "prefix": "http://localhost:8080/my-web-archive/",
                                "request_ts": "20040123200406",
                                "iframe": "#replay_iframe"});

</script>
</body>
</html>

Also tried: disabling JavaScript in the browser on the "live" site; the page is still displayed!

BUT: if I am on one of the category pages, e.g.:

http://localhost:8080/my-web-archive/20040123200406/http://www.nl-menu.nl/nlmenu.nl/fset/bedrijven.html

and then click on "aanmelden" (right-hand menu), the page loads normally, even though the URL is identical!! Again, opening the URL in a new tab still produces the error.

Opening http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html works on the "live" site, but fails on the archived site.

"digitalisering" (NL homepage, bottom-left under "Nieuwe rubrieken"):

http://localhost:8080/my-web-archive/20040123200406/http://www.nl-menu.nl/nlmenu.nl/fset/zoekenplus.html?http://www.nl-menu.nl/nlmenu.nl/sections/236/1868.html

Same as above (JavaScript).

Info on origin of files in WARC in metadata

The WARC was crawled from a locally reconstructed version of the site and not from the live web. This is something that should somehow be recorded in metadata. Using the warcdump tool from warctools:

warcdump NL-menu.warc.gz > NL-menu-dump.txt

Example record:

archive record at NL-menu.warc.gz:778658
Headers:
    WARC-Type:request
    WARC-Target-URI:<http://www.nl-menu.nl/nlmenu.nl/new/home.html>
    Content-Type:application/http;msgtype=request
    WARC-Date:2004-01-23T20:04:06Z
    WARC-Record-ID:<urn:uuid:a733b6c2-f31f-4bb1-822c-63fa08cdb2e2>
    WARC-IP-Address:127.0.0.1
    WARC-Warcinfo-ID:<urn:uuid:8a56cc17-3a38-4146-8594-3f0f39d31d51>
    WARC-Block-Digest:sha1:23FDBRG7W5PPQHPA7TZOJCFNFEV76X55
    Content-Length:230
Content Headers:
    Content-Type : application/http;msgtype=request
    Content-Length : 230
Content:
    GET /nlmenu\x2Enl/new/home\x2Ehtml HTTP/1\x2E1\xD\xAReferer\x3A http\x3A//www\x2Enl\x2Dmenu\x2Enl/nlmenu\x2Enl/resources/linkermenu\x2Ehtml\xD\xAUser\x2DAgent\x3A Wget/1\x2E19 \x28linux\x2Dgnu\x29\xD\xAAccept\x3A \x2A/\x2A\xD\xAAccept\x2DEncoding\x3A identity\xD\xAHost\x3A www\x2Enl\x2Dmenu\x2Enl\xD\xAConnection\x3A Keep\x2DAlive\xD\xA\xD\xA
    ...

Note this line:

WARC-IP-Address:127.0.0.1

The field WARC-IP-Address is defined in the WARC specification as:

The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content. An IPv4 address shall be written as a “dotted quad”; an IPv6 address shall be written as specified in [RFC4291]. For a HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record’s target-Uri.

In this case, the value 127.0.0.1 (i.e. localhost) shows that the files inside the WARC originate from a local copy.
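
To check that this holds for every record in the WARC, and not just this one, the dump can be tallied by IP address (a quick sketch using the NL-menu-dump.txt file generated above):

# List all distinct WARC-IP-Address values in the dump, with counts;
# ideally only 127.0.0.1 should show up.
grep "WARC-IP-Address" NL-menu-dump.txt | sort | uniq -c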

[^2]: The total number of items in the diff file is 637; the expected number is 668! No idea why.

[^3]: Actually: names including the file paths, relative to the site root (e.g. nlmenu.nl/images/alfabet_dgeo.gif). Omitting the file path results in false positives, because the NL-menu directory tree contains many identically-named files that are in different directories.