
Release Notes Heritrix 3.4.0 20200304

Alex Osborne edited this page Mar 5, 2020 · 4 revisions

Summary of changes since Heritrix 3.4.0-20190418. See the full changelog for more details.


This release updates the Berkeley Database from the very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but it also means that Heritrix state files from previous versions are not compatible with this version. In other words:

Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!


Additions

  • ExtractorYoutubeDL enables the discovery of video URLs using the external tool youtube-dl. #257 (nlevitt)
  • WARC writing is now configurable with its own processor chain, making it easier to write extra records. #285 (nlevitt)
  • MatchesListDecideRule gained a timeoutPerRegexSeconds option to help debug runaway regular expressions. #290 (csrster)
  • Added support for forced queue assignment and parallel queues. #286 (adam-miller)
  • JDK 11 is now supported. #269-#273 (ato)
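
To illustrate the kind of problem the timeoutPerRegexSeconds option addresses, here is a minimal, self-contained Java sketch of one common way to bound the time spent on a single regular expression match: wrapping the input in a CharSequence that checks a deadline on every character read. This is not Heritrix's implementation; the names RegexTimeoutSketch, DeadlineCharSequence, RegexTimeoutException and matchesWithTimeout are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTimeoutSketch {

    /** Hypothetical exception thrown once the time budget is exhausted. */
    static class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** Wraps an input and checks a deadline each time the regex engine reads a character. */
    static class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns whether the whole input matches, or throws if the budget runs out first. */
    static boolean matchesWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(input, deadline));
        return matcher.matches();
    }

    public static void main(String[] args) {
        // A well-behaved pattern finishes long before the deadline.
        System.out.println(matchesWithTimeout(Pattern.compile("a+b"), "aaab", 1000));

        // Nested quantifiers are a classic source of heavy backtracking;
        // whether this input actually times out depends on the JDK's regex engine.
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            input.append('a');
        }
        input.append('c');
        try {
            System.out.println(matchesWithTimeout(Pattern.compile("(a+)+b"), input, 200));
        } catch (RegexTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

The deadline check lives in charAt because the regex engine calls it constantly while backtracking, so a runaway match is interrupted without needing a second thread. Logging which pattern triggered the exception is what makes such a timeout useful for debugging.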

Changes

  • BDB was upgraded to 7.5.11. See the warning at the top. #281 (anjackson)
  • Heritrix now uses Guava's bloom filter and base32 encoder. #300, #304 (hennekey)

Removals

  • JDK 7 is no longer supported. #269-#273 (ato)

Bugfixes

  • A number of performance and reliability improvements were made to the unit tests.
