
2.x branch sitemaps #718

Merged: 22 commits into Norconex:2.x-branch on Oct 25, 2020

Conversation

simonwibberley

A package of 'improvements' to sitemap handling, including:

  • a 'from' heuristic that skips <loc> entries whose lastmod is earlier than the provided timestamp. Includes a refactored sitemap parse that evaluates each sitemap node at the end of its element, allowing the heuristic to work on sitemap indexes too.
  • optional error escalation: exceptions are propagated instead of logged
  • more defensive content-type checking: a null check (a sketch follows below)
  • accepting a '204 No Content' response as successful
  • an XML stream filter to remove/ignore invalid XML characters; currently unescaped control characters (including a bare '&') and illegal whitespace before the XML declaration (see the sketch just below)
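
A minimal sketch of what such a stream filter can look like (the class name is hypothetical and this is not the PR's actual code; it covers only the illegal-character part, not ampersand escaping):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical name, not the PR's actual class. Strips characters
// that are not legal in XML 1.0 before they reach the parser.
public class XmlCharSanitizingReader extends FilterReader {

    public XmlCharSanitizingReader(Reader in) {
        super(in);
    }

    // Legal XML 1.0 character ranges. Note this checks UTF-16 code
    // units individually, so supplementary characters (surrogate
    // pairs) would also be dropped; acceptable for a sketch only.
    private static boolean isLegalXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isLegalXmlChar(c));
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int kept = 0;
        // Keep reading until at least one legal character survives,
        // or the underlying stream is exhausted.
        while (kept == 0) {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            for (int i = off; i < off + n; i++) {
                if (isLegalXmlChar(cbuf[i])) {
                    cbuf[off + kept++] = cbuf[i];
                }
            }
        }
        return kept;
    }
}
```

Wrapping the sitemap Reader in a filter like this lets the XML parser survive feeds that would otherwise abort with "invalid XML character" errors.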

Also, somewhat accidentally, this includes an option to skip references we already know would be beyond the depth limit.
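
Returning to the content-type and '204' items in the list above, the checks are roughly of this shape (a sketch assuming the Apache HttpClient 4 API used by the 2.x collector; the accept policy shown is illustrative, not the PR's exact logic):

```java
import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;

// Sketch only: illustrates the null check and 204 handling,
// not the PR's exact accept/reject policy.
static boolean isUsableSitemapResponse(HttpResponse response) {
    int status = response.getStatusLine().getStatusCode();
    // Treat "204 No Content" as successful (just an empty sitemap).
    if (status == HttpStatus.SC_NO_CONTENT) {
        return true;
    }
    if (status != HttpStatus.SC_OK) {
        return false;
    }
    // Null-check the Content-Type header before dereferencing it,
    // since some servers omit it entirely.
    Header contentType = response.getFirstHeader("Content-Type");
    return contentType == null
            || contentType.getValue().contains("xml");
}
```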

Apologies for the messy commit history! I attempted to cherry-pick only the relevant changes.

Regards,

Simon

@simonwibberley (Author)

@essiembre Happy to fill in any gaps if you point me in the right direction.

@essiembre (Contributor)

Woah! Thanks for your contribution. I will merge it and also check if some of it can make it to version 3 (master branch). I'll let you know if I encounter issues.

I have only one question/comment so far. The HTTP Collector already supports filtering documents on dates (DateMetadataFilter, in the Importer module) and also supports not recrawling documents until a certain elapsed time has passed. These two features combined can give you the same end result as the "from" you added. Did you add the "from" because you were not familiar with the other options, or because you find it more convenient to filter right at the source?

I can see that rejecting sitemap entries as the sitemap is being read could significantly reduce the number of entries being queued for processing. For that alone, I think it is a worthwhile addition, but I am curious to find out more about your use case.
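
To make that concrete, here is a minimal sketch of rejecting entries while the sitemap is read, evaluating at the end of each <url> element so both <loc> and <lastmod> have been seen (StAX-based; the class and method names are hypothetical, not the PR's actual code):

```java
import java.io.InputStream;
import java.time.ZonedDateTime;
import java.util.function.Consumer;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class; shows the "evaluate at end-of-element" idea.
public final class SitemapFromFilter {

    public static void parse(InputStream in, ZonedDateTime from,
            Consumer<String> onAcceptedLoc) throws XMLStreamException {
        XMLStreamReader xml =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        String element = null;
        String loc = null;
        ZonedDateTime lastmod = null;
        while (xml.hasNext()) {
            switch (xml.next()) {
            case XMLStreamConstants.START_ELEMENT:
                element = xml.getLocalName();
                if ("url".equals(element)) {
                    loc = null;
                    lastmod = null;
                }
                break;
            case XMLStreamConstants.CHARACTERS: {
                String text = xml.getText().trim();
                if (text.isEmpty()) {
                    break;
                }
                if ("loc".equals(element)) {
                    loc = text;
                } else if ("lastmod".equals(element)) {
                    // Real sitemaps allow date-only W3C values;
                    // a production parser must be more lenient.
                    lastmod = ZonedDateTime.parse(text);
                }
                break;
            }
            case XMLStreamConstants.END_ELEMENT:
                // Decide only once the whole <url> has been seen.
                // Entries without a lastmod are kept, since nothing
                // proves they are older than the cutoff.
                if ("url".equals(xml.getLocalName()) && loc != null
                        && (lastmod == null || !lastmod.isBefore(from))) {
                    onAcceptedLoc.accept(loc);
                }
                element = null;
                break;
            default:
                break;
            }
        }
    }
}
```

A sitemap index can be treated the same way at the end of each <sitemap> element, which is what lets whole sub-sitemaps be skipped before they are ever downloaded.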

@essiembre essiembre merged commit 2d294b8 into Norconex:2.x-branch Oct 25, 2020
@essiembre (Contributor)

One feature I think has no effect is the "quitAtDepth" option on the LinkExtractorStage: URLs that are too deep are already rejected before link extraction takes place. If you have evidence to the contrary, please let me know how to reproduce it, as it would be a bug IMO.

essiembre added a commit that referenced this pull request Oct 26, 2020
@simonwibberley (Author) commented Oct 29, 2020

@essiembre great, glad to contribute! Indeed, I wanted to prevent unnecessary downloading of ancient sitemaps, which can be large. One of our use cases is to frequently fetch the latest articles from news sites using only sitemaps; we want to run this on a managed schedule instead of recrawling based on update frequency and the like (I'm not very familiar with that mode of operation yet).

FYI, some more sitemap-related patches to PR soon...
