
monitrix is a monitoring and analytics frontend for the Heritrix Web crawler. monitrix ingests the log file of a (running or completed) Heritrix crawl, and computes various near-realtime statistics about the crawl as a whole, as well as about the individual hosts that were crawled.

This page provides a high-level overview of what happens under the hood of monitrix. For a detailed tour of the source code (project layout, package structure, implementation classes, etc.) see this page: Project Layout

The overall integration architecture is illustrated below, showing how monitrix can be used to explore the data in the crawl logs, and how it can link to OpenWayback in order to inspect individual items.

[Figure: QA Integration]

This figure also illustrates that, in the future, Heritrix3 itself could be configured to write directly to Cassandra instead of having to re-parse the data out of the log files.

Core Datamodel Concepts

During ingest, monitrix processes the log information and aggregates it into several database tables ('collections', in MongoDB terminology). Internally, these tables are referred to as: the Crawl Log, the Known Host List, the Crawl Stats, the Alert Log and the Virus Log.

Crawl Log

The Crawl Log is simply a 1:1 translation of the original crawl log file into the monitrix backend. In addition to retaining the original log entries, monitrix also processes the entries to extract the host names and subdomains from the original crawl URLs (using a utility from the Google Guava libraries, plus some additional processing). The Crawl Log is indexed by timestamp and hostname.
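
To illustrate the extraction step, here is a minimal sketch, assuming the Guava utility in question is InternetDomainName (the surrounding code and example URL are illustrative only; monitrix's actual processing involves additional steps):

```java
import com.google.common.net.InternetDomainName;

import java.net.URI;

public class HostExtractor {

    public static void main(String[] args) {
        String crawlUrl = "http://news.weather.co.uk/today/index.html";

        String host = URI.create(crawlUrl).getHost();      // "news.weather.co.uk"
        InternetDomainName idn = InternetDomainName.from(host);

        // topPrivateDomain() respects the public-suffix list, so "co.uk" is
        // treated as a suffix rather than as a registrable domain of its own.
        // (It throws if the host is not under a recognised public suffix.)
        String domain = idn.topPrivateDomain().toString();  // "weather.co.uk"

        // Everything to the left of the registered domain is the subdomain
        // part (here: "news.", including the trailing separator).
        String subdomain = host.substring(0, host.length() - domain.length());

        System.out.println(host + " -> domain: " + domain + ", subdomain: " + subdomain);
    }
}
```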

Known Host List

The Known Host List maintains basic metadata about the hosts encountered during the crawl. (At the time of this writing: first & last access time, subdomains.) Known Hosts are indexed by time of last access and hostname.
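
Maintaining this metadata boils down to one upsert per log entry. The sketch below is illustrative rather than monitrix's actual code: the collection and field names (known_hosts, firstAccess, lastAccess, subdomains) are assumptions, and it uses the current MongoDB Java driver:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.*;

public class KnownHostUpdater {
    public static void main(String[] args) {
        MongoCollection<Document> knownHosts = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("monitrix")
                .getCollection("known_hosts");

        String host = "weather.co.uk";
        String subdomain = "news";
        long timestamp = System.currentTimeMillis();

        // One upsert per log entry: $min/$max keep first & last access time
        // current, $addToSet collects the subdomains seen so far.
        knownHosts.updateOne(
                eq("host", host),
                combine(
                        min("firstAccess", timestamp),
                        max("lastAccess", timestamp),
                        addToSet("subdomains", subdomain)),
                new UpdateOptions().upsert(true));
    }
}
```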

A note on host name indexing: MongoDB does not support full-text indexing as known from, e.g., Solr. monitrix tokenizes the hostnames during indexing (i.e. splits them at '.', '-' and '_' characters), but beyond that, searches will only yield results for exact token matches. For example, a search for 'weather' would return hosts such as 'weather.co.uk' or 'spain-weather.co.uk', but not 'accuweather.com'.
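
A minimal sketch of that tokenization, here written with Guava's Splitter (the exact mechanism monitrix uses may differ, but the splitting rule is as described above):

```java
import com.google.common.base.CharMatcher;
import com.google.common.base.Splitter;

import java.util.List;

public class HostnameTokenizer {

    // Splits a hostname at '.', '-' and '_'; the resulting tokens are what
    // an exact-match search is run against.
    static List<String> tokenize(String hostname) {
        return Splitter.on(CharMatcher.anyOf(".-_"))
                .omitEmptyStrings()
                .splitToList(hostname);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("spain-weather.co.uk")); // [spain, weather, co, uk]
        System.out.println(tokenize("accuweather.com"));     // [accuweather, com] -- no 'weather' token
    }
}
```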

Crawl Stats

The Crawl Stats table is an essential helper table that enables monitrix to produce timeseries graphs. During ingest, monitrix aggregates several time-variant properties into a pre-defined raster. These properties include, e.g., the number of hosts crawled per time unit, the data volume downloaded per time unit, etc. The raster resolution is (at the time of writing) one minute. When live timeseries visualizations are generated later on, this raster provides the base resolution from which monitrix resamples the data to the desired output resolution. Datapoints in the Crawl Stats table ("Crawl Stats Units") are indexed by timestamp.
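
The pre-aggregation amounts to truncating each log timestamp to the start of its raster cell and incrementing counters on the corresponding Crawl Stats Unit. A hedged sketch, again with assumed collection and field names (crawl_stats, urlsCrawled, downloadVolume) and the one-minute raster mentioned above:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.combine;
import static com.mongodb.client.model.Updates.inc;

public class CrawlStatsAggregator {

    private static final long RASTER_MILLIS = 60_000; // one-minute base resolution

    // Truncates a log timestamp down to the start of its raster cell.
    static long toRasterCell(long timestamp) {
        return timestamp - (timestamp % RASTER_MILLIS);
    }

    public static void main(String[] args) {
        MongoCollection<Document> crawlStats = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("monitrix")
                .getCollection("crawl_stats");

        long logTimestamp = System.currentTimeMillis();
        long downloadedBytes = 48_213;

        // One Crawl Stats Unit per raster cell, updated via $inc; timeseries
        // views later resample these fixed-resolution datapoints.
        crawlStats.updateOne(
                eq("timestamp", toRasterCell(logTimestamp)),
                combine(inc("urlsCrawled", 1), inc("downloadVolume", downloadedBytes)),
                new UpdateOptions().upsert(true));
    }
}
```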

Alert Log

The alert log is (unsurprisingly) a list of alerts recorded during ingest. Alerts are generated at various points during the ingest procedure, and are indexed by time of occurrence and the name of the offending host.

Virus Log

The virus log keeps track of the viruses that were detected during the crawl.

Near-Realtime Incremental Ingest

monitrix ingests log file entries in batches to avoid excessive processing overhead. When 'attaching' the monitrix importer to a log, it will ingest the whole contents of the log in one go (caution: this might take a while!). Once it reaches the end of the file, it will suspend reading for a pre-set interval (currently set to 15 seconds). After that, it will continue and import all log entries that have been written in the meantime. Updates are effective in monitrix immediately. (But keep in mind that for timeseries graphs, the finest possible update granularity is bounded by the base resolution of the pre-aggregation raster!)
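
In pseudocode, this is a simple tail-follow loop: read to the end of the file, process the batch, sleep, repeat. The self-contained sketch below illustrates the idea; the 15-second interval matches the text, but everything else (file name, the handleBatch hook) is a hypothetical stand-in for the actual importer:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LogFollower {

    private static final long POLL_INTERVAL_MILLIS = 15_000; // pre-set re-read interval

    public static void main(String[] args) throws IOException, InterruptedException {
        try (BufferedReader reader = new BufferedReader(new FileReader("crawl.log"))) {
            while (true) {
                // Read everything written since the last pass (on the first
                // pass this is the whole file, which can take a while).
                List<String> batch = new ArrayList<>();
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                }
                if (!batch.isEmpty()) {
                    handleBatch(batch);
                }
                // End of file reached: suspend reading, then continue.
                Thread.sleep(POLL_INTERVAL_MILLIS);
            }
        }
    }

    // Hypothetical hook; in monitrix this is where entries are parsed and
    // the Crawl Log, Known Host List, Crawl Stats etc. are updated.
    static void handleBatch(List<String> entries) {
        System.out.println("Ingested " + entries.size() + " new log entries");
    }
}
```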