
Only Store Successful HTML Pages

Alex Osborne edited this page Jul 4, 2018 · 2 revisions

Suppose you want to capture only the first 50 pages encountered from a set of seeds, and to archive only those pages that return a 200 response code and a text/html MIME type. Additionally, you want Heritrix to look for links only in HTML resources.

To examine only HTML documents for links, remove the following extractors, which tell Heritrix to look for links in style sheets, JavaScript, and Flash files:

  • ExtractorCss
  • ExtractorJs
  • ExtractorSwf

Leave ExtractorHttp in place, since it is useful for locating resources that can only be reached through a redirect (301 or 302 response code).
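
As a rough illustration, the fetch chain might look like the sketch below once the three extractors are removed. The bean ids and class name shown here (fetchProcessors, extractorHttp, extractorHtml, and so on) are assumptions based on the default Heritrix 3 crawler-beans.cxml profile; your configuration may list different or additional processors, so check your own file rather than copying this verbatim.

```xml
<!-- Sketch only: bean ids assume the default Heritrix 3 profile. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preselector"/>
      <ref bean="preconditions"/>
      <ref bean="fetchDns"/>
      <ref bean="fetchHttp"/>
      <!-- Kept: discovers URIs reached through 301/302 redirects. -->
      <ref bean="extractorHttp"/>
      <!-- Kept: discovers links inside HTML documents. -->
      <ref bean="extractorHtml"/>
      <!-- The refs to extractorCss, extractorJs and extractorSwf
           have been deleted from this list. -->
    </list>
  </property>
</bean>
```

Removing only the references from the chain is enough to stop those extractors from running; the bean definitions themselves can stay in the file unused.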

You can limit the number of URIs downloaded by setting the maxDocumentsDownload property on the crawlLimiter bean. Setting the value to 50, however, will probably not give the intended result: every DNS response and robots.txt file also counts toward the limit. To leave room for them, set the value to 50 * number of seeds * 2.
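
A minimal sketch of that setting, assuming a crawl with 3 seeds (a made-up number for illustration), so the limit works out to 50 * 3 * 2 = 300. The class name is the one used for the crawlLimiter bean in the default Heritrix 3 profile and is an assumption here; keep whatever class and other properties your own configuration already declares.

```xml
<!-- Sketch only: assumes 3 seeds, giving 50 * 3 * 2 = 300. -->
<bean id="crawlLimiter"
      class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- Counts every downloaded URI, including DNS lookups and robots.txt. -->
  <property name="maxDocumentsDownload" value="300"/>
</bean>
```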
