
Question (saving content as plain text, recrawling) #525

Answered by ato
zhaobm asked this question in Q&A
  1. Heritrix doesn't currently have a built-in way to extract text from HTML pages.
  2. As far as I know, the easiest way to recrawl pages is to run the job again (with deduplication configured if desired). The wiki mentions some built-in support for continuous recrawling, but I'm not sure whether it was ever completed or how functional it is.
  3. This sounds like a configuration error, but without more details I can't tell what the problem is.
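
Regarding point 1: since the text extraction has to happen outside Heritrix, one option is to post-process the crawled HTML yourself. Below is a minimal sketch in Python using only the standard library's `html.parser`. It is not part of Heritrix, and the names `TextExtractor` and `html_to_text` are made up for illustration. It assumes you have already pulled the HTML of a page out of the crawl output as a string, and it makes no attempt to handle malformed markup or character encoding detection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>"
    print(html_to_text(sample))
```

A dedicated HTML-to-text library will usually give cleaner results on real pages; this only shows the general shape of the post-processing step.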

Answer selected by ato

This discussion was converted from issue #485 on September 30, 2022 00:45.