Release Notes 1.14.4

Alex Osborne edited this page Jul 4, 2018 · 2 revisions

Release Notes - 1.14.4 (May 2010)

These are the project wiki Release Notes for the 1.14.4 release.

Release 1.14.4 is a 'micro' release with a number of small bug fixes and newly requested features.

The 1.14.4 release is now available at TK.

Notable Changes

Support for FTP transactions in WARC records (HER-1577)

Heritrix now supports recording full FTP transactions in WARC records. For each FTP URL retrieved, the control conversation is recorded in a WARC metadata record with Content-Type: application/ftp; msgtype=control-conversation, the payload data is recorded in a WARC resource record with Content-Type: application/ftp; msgtype=payload-data, and FTP fetch metadata (as well as outlinks) are recorded in a corresponding WARC metadata record.
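The three records described above can be pictured with a small sketch. It is not Heritrix code: the `PlannedRecord` type and `plan_ftp_records` function are hypothetical names used only for illustration, and the Content-Type of the third record is left unset because these notes do not state it.

```python
from typing import List, NamedTuple, Optional


class PlannedRecord(NamedTuple):
    """One WARC record expected for a fetched FTP URL (illustrative only)."""
    target_uri: str
    warc_type: str               # "metadata" or "resource"
    content_type: Optional[str]  # None where the release notes give no value
    holds: str                   # what the record body contains


def plan_ftp_records(url: str) -> List[PlannedRecord]:
    """List the records the release notes describe for a single FTP fetch."""
    return [
        PlannedRecord(url, "metadata",
                      "application/ftp; msgtype=control-conversation",
                      "FTP control conversation"),
        PlannedRecord(url, "resource",
                      "application/ftp; msgtype=payload-data",
                      "retrieved payload data"),
        # Content-Type for this record is not given in the notes.
        PlannedRecord(url, "metadata", None,
                      "FTP fetch metadata and outlinks"),
    ]


for record in plan_ftp_records("ftp://ftp.example.org/pub/file.txt"):
    print(record.warc_type, record.content_type, record.holds, sep=" | ")
```

The example URL is a placeholder; any FTP URL would produce the same three-record layout.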

Other WARC corrections (HER-1659)

Written WARC files now consistently identify as WARC version "1.0" (HER-1648) and will grow to the 1GB size recommended by the specification.
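One way to spot-check the declared version is to read the first line of a record, which in the WARC format is a version line such as `WARC/1.0`. The following is a minimal sketch, assuming an uncompressed byte stream (a gzipped WARC would need to be decompressed first); `warc_version` is a hypothetical helper, not part of Heritrix.

```python
import io
from typing import BinaryIO


def warc_version(stream: BinaryIO) -> str:
    """Return the version declared on the first line of a WARC record."""
    first_line = stream.readline().decode("ascii").rstrip("\r\n")
    if not first_line.startswith("WARC/"):
        raise ValueError("not a WARC version line: %r" % first_line)
    return first_line[len("WARC/"):]


# A tiny in-memory record header stands in for a real file here.
sample = io.BytesIO(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n")
print(warc_version(sample))
```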

Windows annoyances fixed (HER-510, HER-1622, HER-1625)

Several problems that caused errors when using Heritrix on Windows, related to improper quoting or path separators, have been corrected.

Seeds with Internationalized Domain Names (IDN) better supported (HER-1711)

Encoding problems which interfered with specification of some Internationalized Domain Name seeds have been corrected.
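For readers unfamiliar with IDN, the hostname of such a seed has a Unicode form and an ASCII ("punycode") form. The sketch below shows the conversion using Python's built-in `idna` codec; it only illustrates the encoding involved and makes no claim about how Heritrix handles seeds internally. The function name `to_ascii_host` and the sample hostname are assumptions for the example.

```python
def to_ascii_host(host: str) -> str:
    """Convert a possibly internationalized hostname to its ASCII form."""
    # Python's "idna" codec applies the IDNA ToASCII operation per label.
    return host.encode("idna").decode("ascii")


print(to_ascii_host("bücher.example"))  # xn--bcher-kva.example
print(to_ascii_host("example.org"))     # plain ASCII names pass through
```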

Hosts report expanded to include novel/duplicate bytes/URLs counts (HER-1650)

Crawl statistics now collect, and the 'Hosts' report includes, counts of the URLs and total content byte-sizes deemed either 'novel' or 'duplicate' by the duplication-reduction/persist-history mechanisms, if enabled on a crawl.
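The kind of per-host tally this adds can be sketched as follows. This is an illustration only, not the Heritrix statistics code: the `HostTally` class, the `tally_by_host` function, and the `(url, size, is_duplicate)` input shape are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit


@dataclass
class HostTally:
    novel_urls: int = 0
    novel_bytes: int = 0
    duplicate_urls: int = 0
    duplicate_bytes: int = 0


def tally_by_host(fetches: Iterable[Tuple[str, int, bool]]) -> Dict[str, HostTally]:
    """Sum URL and byte counts per host, split into novel and duplicate."""
    tallies: Dict[str, HostTally] = defaultdict(HostTally)
    for url, size, is_duplicate in fetches:
        host = urlsplit(url).hostname or ""
        tally = tallies[host]
        if is_duplicate:
            tally.duplicate_urls += 1
            tally.duplicate_bytes += size
        else:
            tally.novel_urls += 1
            tally.novel_bytes += size
    return dict(tallies)


# Made-up sample data: two novel fetches and one duplicate on one host.
report = tally_by_host([
    ("http://example.org/a", 1200, False),
    ("http://example.org/b", 800, False),
    ("http://example.org/a", 1200, True),
])
print(report["example.org"])
```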

Trailing '*' tolerated in robots.txt Disallow/Allow rules (HER-1620)

Heritrix will now tolerate a trailing '*' wildcard sometimes added by webmasters (though not necessary) in their robots.txt Disallow/Allow rules. (Leading or internal wildcards are not yet supported.)
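Because robots.txt rules are already prefix matches, a trailing '*' adds nothing and can simply be ignored. A minimal sketch of that idea, assuming plain prefix matching and hypothetical function names (this is not the Heritrix implementation):

```python
def normalize_rule_path(rule_path: str) -> str:
    """Drop a single trailing '*' from a Disallow/Allow path, if present."""
    return rule_path[:-1] if rule_path.endswith("*") else rule_path


def rule_matches(rule_path: str, url_path: str) -> bool:
    """Prefix-match a URL path against a rule, tolerating a trailing '*'."""
    if rule_path == "":
        # An empty rule value restricts nothing.
        return False
    return url_path.startswith(normalize_rule_path(rule_path))


print(rule_matches("/private*", "/private/data.html"))  # same result as "/private"
print(rule_matches("/private", "/private/data.html"))
print(rule_matches("/private*", "/public/index.html"))
```

Leading or internal wildcards are deliberately not handled here, matching the limitation noted above.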

CachedBdbMap changes, replacement (HER-1677, HER-1658, HER-1705, HER-1609)

A number of performance, memory-retention, and deadlock-risk issues occasionally affecting the implementation class CachedBdbMap were identified. Fixes have been applied, and the class has also been replaced with a simpler implementation, ObjectIdentityBdbCache, focused specifically on Heritrix's common use cases.

Additional contributors

In addition to the usual suspects, this release includes contributed fixes or functionality from:

  • Paul Baclace
  • Sergey Khenkin

All Tracked Changes

The following 44 tracked issues are recorded as addressed in this 1.14.4 release:

https://webarchive.jira.com/secure/ReleaseNote.jspa?projectId=10021&version=10105

Type | Key | Summary | Status
Improvement | HER-1756 | improved crawl status reporting: more definitive FINISHED-after-logging of crawling | Resolved
Improvement | HER-1754 | treat robots.txt non-response as 404 (optionally?) | Resolved
Bug | HER-1748 | Arc2Warc should generally write WARC "response" records for ARC HTTP responses | Resolved
Bug | HER-1710 | on problem loading persist log, close reader | Resolved
Bug | HER-1705 | 'harmless' low-memory-canary causing heap reference leak/OOME | Resolved
Bug | HER-1700 | PersistLogProcessor, PersistLoadProcessor don't fully respect "enabled" property | Resolved
Bug | HER-1697 | h1 - what about ftp dedupe? | Resolved
Bug | HER-1685 | seed report says "0 NOTCRAWLED" for all seeds | Resolved
Bug | HER-1683 | inconsistent host classification for dns: urls | Resolved
Bug | HER-1677 | threads stuck in CachedBdbMap.get/_getMem | Resolved
Bug | HER-1675 | FetchStats/CrawlSubstats not tallying fetchResponses correctly; as a result QuotaEnforcer doesn't work | Resolved
Bug | HER-1666 | seed redirect url sometimes not recorded as seed when seed also has a regular link to the redirect url | Resolved
Bug | HER-1662 | should use fully qualified hostname | Resolved
Bug | HER-1659 | make default WARC size comply with spec; adjust default pool size for fewer odd-sized (W)ARCs | Resolved
Bug | HER-1658 | CachedBdbMaps not expunging as expected (especially StatisticsTracker.processedSeedsRecords) | Resolved
New Feature | HER-1650 | novel/duplicate urls/bytes in the host report | Resolved
Bug | HER-1649 | com.sleepycat.je.DatabaseException when finishing crawl with dump-pending-at-close enabled | Resolved
Bug | HER-1648 | harmonize declared WARC versions in protocol, warcinfo record metadata | Resolved
Bug | HER-1644 | crawls stopped immediately after starting do not finish / clean up properly | Resolved
New Feature | HER-1640 | digest ftp content | Resolved

Showing 20 out of 43 issues
