Alex Osborne edited this page Sep 1, 2022 · 4 revisions

If your project, institution, or company uses Heritrix, please feel free to add details below.

If you'd like to cite Heritrix you can reference the following paper:

Mohr, G., Stack, M., Ranitovic, I., Avery, D., & Kimpton, M. (2004, July). An Introduction to Heritrix. In 4th International Web Archiving Workshop (pp. 109-115).

@inproceedings{mohr2004introduction,
  title={An Introduction to {H}eritrix},
  author={Mohr, Gordon and Stack, Michael and Ranitovic, Igor and Avery, Dan and Kimpton, Michele},
  booktitle={4th International Web Archiving Workshop},
  pages={109--115},
  year={2004},
  organization={Citeseer}
}

Libraries, Archives, and Memory Institutions

  • Internet Archive - leads open-source development of Heritrix; uses Heritrix as the crawler for numerous focused/thematic and broad crawling projects, including the Archive-It service and crawls feeding our Wayback Machine and partner collections. Versions most often used: Heritrix 1.14.4-SNAPSHOT, 3.0.0-SNAPSHOT. 
  • The British Library - uses Heritrix 1.14.3 as the crawler for our Domain Research Project. Collaborated with the National Library of New Zealand to develop the Web Curator Tool which also uses Heritrix as the underlying crawler and is presently used to drive the UK Web Archive.
  • The Library of Congress - works with the Internet Archive to help build focused and thematic collection Web archives. The Library's Web Archiving Team has also begun in-house crawling, currently using Heritrix 3.0.X, for selected projects.
  • California Digital Library, Web Archiving Service
  • BNCF, Biblioteca Nazionale Centrale Firenze - uses Heritrix 3 and WARC 1.0 for its program to archive electronic doctoral theses from university repositories.
  • Smithsonian Institution Archives - is testing Heritrix 1.14.3 on a Windows machine to capture its numerous websites and social networking sites.
  • Netarchive.dk uses Heritrix 1.14.3 integrated within NetarchiveSuite to harvest the Danish internet.
  • National and University Library of Iceland - has actively participated in the development of Heritrix since the project's inception. Has used Heritrix 1 for domain and targeted crawls since 2004, and Heritrix 3 since 2010.
  • The French National Library (BnF) - has used Heritrix for production crawls since the end of 2006. It performed the first French national domain crawl with Heritrix/NetarchiveSuite in spring 2010.
  • The Austrian National Library - has used Heritrix since the beginning of Web@rchive Austria in 2008.
  • The Biblioteca de Catalunya (BC), the National Library of Catalonia - initiated a project called PADICAT (Digital Heritage of Catalonia) in June 2005. Uses Heritrix 1.14.4 as the crawler for the .CAT top-level domain, for selective collection of the websites of Catalan organizations, and for focused harvesting of public events. These crawls feed Wayback 1.4.2 (URL search) and WERA (keyword search) and are available in open access through PADICAT.
  • [add yourself here]

Companies and Commercial Projects

  • neofonie - uses Heritrix 1.14.4 and 3.0.x to gather unstructured data. After the data has been processed, enriched and classified, it is then used in search engines and web applications.
  • Dataclip - uses Heritrix 3 to crawl the top 10 million business websites to offer sales and marketing intelligence based on what web technologies are used by those organizations.
  • TNR Global - uses Heritrix to crawl web content. See, for example, slide #12 in this presentation: Migration from Fast ESP to Lucene Solr - Michael McIntosh
  • York University Libraries - uses Heritrix 3 to crawl local web content for common records schedule.
  • [add yourself here]

Research Projects

  • CiteSeerX - uses Heritrix 1.14.4 and 3.0.x to crawl open access academic documents online.
  • Web Archiving Integration Layer (WAIL) bundles a pre-configured Heritrix 3.2.0 binary with other personal web archiving tools into a native application and provides a graphical user interface for access.
  • [add yourself here]

Other

  • [add yourself here]
