Alex Osborne edited this page Jul 4, 2018 · 3 revisions

Must reads

Nice to reads

Information on Java with respect to Heritrix/crawling

The following studies are available on the archive-crawler Yahoo Group files page; they may be outdated with respect to current Java and our implementation choices:

  • G. B. Reddy, study of synchronous vs. asynchronous I/O in Java
  • G. B. Reddy, study of multi-threaded DNS performance in Java

Others

Relevant specifications

Attachments

crawler-requirements-2003-03.htm (text/html)
Mohr-et-al-2004.pdf (application/pdf)
1998-Cho-efficient.pdf (application/pdf)
1999-Heydon-javalimits.pdf (application/pdf)
1999-Hirai-webbase.pdf (application/pdf)
1999-Mercator.pdf (application/pdf)
2000-Broder-webgraph.pdf (application/pdf)
2000-Cho-incremental.pdf (application/pdf)
2001-Abiteboul-crawlorder.pdf (application/pdf)
2001-Arasu-search.pdf (application/pdf)
2001-Najork-breadthfirst.pdf (application/pdf)
2001-Najork-highperf.pdf (application/pdf)
2002-Guillaume-webgraph.pdf (application/pdf)
2008-IRLBot.pdf (application/pdf)
2008-Olston-recrawl.pdf (application/pdf)
2002-Shkapenyuk-polybot.pdf (application/pdf)
