Unix Utility Scripts

Alex Osborne edited this page Jul 4, 2018 · 3 revisions

Heritrix comes bundled with Unix utility scripts.

manifest_bundle.pl

This script bundles all resources referenced in the crawl manifest file. A bundle is a tarball, optionally compressed with gzip. The directory structure of the tarball is:

  • Top level directory (crawl name)

  • Three default subdirectories

  • Any other arbitrary subdirectories

Script Usage

    manifest_bundle.pl crawl_name manifest_file -f output_tar_file -z [ -flag directory]
    -f output tar file. If omitted, output goes to stdout.
    -z compress the tar file with gzip.
    -flag is any upper case letter. The default flags C, L, and R are set to
    configuration, logs and reports.

manifest_bundle.pl Example

    manifest_bundle.pl testcrawl crawl-manifest.txt -f /0/testcrawl/manifest-bundle.tar.gz -z -F filters

For the example above, the tarball will contain the following directory structure:

    |- testcrawl
           |- configurations
           |- logs
           |- reports
           |- filters
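
The layout above can be illustrated with a short Python sketch. It is not the Perl script itself, only an approximation under stated assumptions: the manifest is taken to be already parsed into hypothetical (flag, path) pairs, the default directory names are copied from the example tree, and entries whose flag has no directory are skipped, which may differ from what manifest_bundle.pl does.

```python
import io
import os
import tarfile

# Default flag-to-directory mapping, copied from the example tree (an assumption).
DEFAULT_DIRS = {"C": "configurations", "L": "logs", "R": "reports"}


def bundle(crawl_name, entries, out, compress=False, extra_dirs=None):
    """Write files into a tar archive laid out as crawl_name/<subdir>/<file>.

    entries    -- iterable of (flag, path) pairs; a hypothetical, already
                  parsed form of the manifest, not the real file format.
    out        -- binary file object that receives the archive.
    compress   -- gzip the archive when True (compare the -z option).
    extra_dirs -- additional flag -> directory names (compare -flag directory).
    """
    dirs = dict(DEFAULT_DIRS)
    dirs.update(extra_dirs or {})
    mode = "w:gz" if compress else "w"
    with tarfile.open(fileobj=out, mode=mode) as tar:
        for flag, path in entries:
            subdir = dirs.get(flag)
            if subdir is None:
                continue  # sketch-only choice: skip flags with no directory
            arcname = "/".join([crawl_name, subdir, os.path.basename(path)])
            tar.add(path, arcname=arcname)
```

In this sketch, passing extra_dirs={"F": "filters"} plays the role of -F filters in the example, and compress=True plays the role of -z.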

hoppath.pl

This Perl script, found in $HERITRIX_HOME/bin, recreates the hop path to the specified URI. The hop path is the path of links (URIs) that were followed to get to the specified URI.

Script Usage

hoppath.pl crawl.log URI_PREFIX
  crawl.log   Full path to the Heritrix crawl.log instance.
  URI_PREFIX  URI we're querying about. Must begin with 'http(s)://' or 'dns:'.
              Wrap this parameter in quotes to avoid shell interpretation
              of any '&' present in URI_PREFIX.

hoppath.pl Example

hoppath.pl crawl.log 'http://www.house.gov/'

hoppath.pl Result

2004-02-25-02-36-06 - http://www.house.gov/house/MemberWWW_by_State.html
2004-02-25-02-36-06 L http://wwws.house.gov/search97cgi/s97_cgi
2004-02-25-03-30-38 L http://www.house.gov/

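The L in the second column is the type of hop followed to reach that URI (L denotes a link). The first URI in the path has no incoming hop and shows - instead.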
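
The way a hop path can be rebuilt from a crawl log is sketched below in Python. This is an illustration, not a port of hoppath.pl: the function name hop_path is hypothetical, it assumes a whitespace-separated crawl.log in which the first six fields are timestamp, status code, size, URI, discovery path and referrer, and it matches the URI exactly rather than by prefix.

```python
def hop_path(log_lines, uri):
    """Return [(timestamp, hop_type, uri), ...] from the start of the path to uri.

    Assumes each log line starts with these six whitespace-separated fields:
    timestamp, status, size, URI, discovery path, referrer.
    """
    # Index the first entry seen for each URI:
    # uri -> (timestamp, discovery path, referrer).
    seen = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete log entry
        timestamp, _status, _size, line_uri, path, via = fields[:6]
        seen.setdefault(line_uri, (timestamp, path, via))

    # Walk backwards through the referrers until one is not in the log.
    chain = []
    visited = set()
    current = uri
    while current in seen and current not in visited:
        visited.add(current)  # guard against referrer cycles
        timestamp, path, via = seen[current]
        hop_type = path[-1]   # last character of the discovery path
        chain.append((timestamp, hop_type, current))
        current = via

    chain.reverse()           # report the path starting from its first URI
    return chain
```

The sketch keeps only the first log entry per URI and stops when a referrer is missing from the log, so a truncated log yields a partial path rather than an error.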

RecoveryLogMapper

The org.archive.crawler.util.RecoveryLogMapper Java class, contributed by Mike Schwartz, is similar to the hoppath.pl script. It parses a Heritrix recovery log file and builds maps that let a caller look up any seed URI and get back a list of all URIs successfully crawled from that seed. It can also find the seed URI from which any crawled URI was captured.
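
The two lookups described above amount to a pair of maps, which the following Python sketch illustrates. It does not parse the recovery log format and does not mirror the API of the Java class: the function build_seed_maps and its input, a list of hypothetical (uri, via) pairs for successfully crawled URIs in log order with via set to None for a seed, are assumptions made for the example.

```python
from collections import defaultdict


def build_seed_maps(records):
    """Build seed lookups from (uri, via) pairs given in log order.

    records -- hypothetical pre-parsed entries for successfully crawled
               URIs; via is the referring URI, or None for a seed.
    Returns (seed_of, crawled_from):
      seed_of      -- URI -> seed it was reached from
      crawled_from -- seed -> list of URIs crawled from that seed
    """
    seed_of = {}
    crawled_from = defaultdict(list)
    for uri, via in records:
        if uri in seed_of:
            continue                 # keep the first discovery only
        # A seed maps to itself; any other URI inherits its referrer's seed.
        seed = uri if via is None else seed_of.get(via)
        if seed is None:
            continue                 # referrer unknown, cannot attribute
        seed_of[uri] = seed
        crawled_from[seed].append(uri)
    return seed_of, dict(crawled_from)
```

Here crawled_from answers "which URIs were crawled from this seed", and seed_of answers "which seed led to this URI".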
