
2.x branch sitemaps #718

Merged: 22 commits into Norconex:2.x-branch on Oct 25, 2020

Conversation

simonwibberley

A package of 'improvements' to sitemap handling, including:

  • a 'from' heuristic that skips <loc> entries whose lastmod is earlier than the provided timestamp. Includes a refactored sitemap parse that evaluates each sitemap node at the end of its element, allowing the heuristic to work on sitemap indexes too.
  • optional error escalation: exceptions are propagated instead of logged
  • more defensive content-type checking: a null check (a sketch follows below)
  • accepting a '204 No Content' response as successful
  • an XML stream filter to remove/ignore invalid XML characters; currently unescaped control characters (including a bare '&') and illegal whitespace before the XML declaration (see the sketch just below)
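
A minimal sketch of what such a stream filter can look like (the class name is hypothetical and this is not the PR's actual code; it covers only the illegal-character part, not ampersand escaping):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical name, not the PR's actual class. Strips characters
// that are not legal in XML 1.0 before they reach the parser.
public class XmlCharSanitizingReader extends FilterReader {

    public XmlCharSanitizingReader(Reader in) {
        super(in);
    }

    // Legal XML 1.0 character ranges. Note this checks UTF-16 code
    // units individually, so supplementary characters (surrogate
    // pairs) would also be dropped; acceptable for a sketch only.
    private static boolean isLegalXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isLegalXmlChar(c));
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int kept = 0;
        // Keep reading until at least one legal character survives,
        // or the underlying stream is exhausted.
        while (kept == 0) {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            for (int i = off; i < off + n; i++) {
                if (isLegalXmlChar(cbuf[i])) {
                    cbuf[off + kept++] = cbuf[i];
                }
            }
        }
        return kept;
    }
}
```

Wrapping the sitemap Reader in a filter like this lets the XML parser survive feeds that would otherwise abort with "invalid XML character" errors.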

Also, somewhat accidentally, this includes an option to skip references we already know would be beyond the depth limit.
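
Returning to the content-type and '204' items in the list above, the checks are roughly of this shape (a sketch assuming the Apache HttpClient 4 API used by the 2.x collector; the accept policy shown is illustrative, not the PR's exact logic):

```java
import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;

// Sketch only: illustrates the null check and 204 handling,
// not the PR's exact accept/reject policy.
static boolean isUsableSitemapResponse(HttpResponse response) {
    int status = response.getStatusLine().getStatusCode();
    // Treat "204 No Content" as successful (just an empty sitemap).
    if (status == HttpStatus.SC_NO_CONTENT) {
        return true;
    }
    if (status != HttpStatus.SC_OK) {
        return false;
    }
    // Null-check the Content-Type header before dereferencing it,
    // since some servers omit it entirely.
    Header contentType = response.getFirstHeader("Content-Type");
    return contentType == null
            || contentType.getValue().contains("xml");
}
```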

Apologies for the messy commit history! I attempted to cherry-pick only the relevant changes.

Regards,

Simon

@simonwibberley (Author)

@essiembre Happy to fill in any gaps if you point me in the right direction.

@essiembre (Contributor)

Woah! Thanks for your contribution. I will merge it and also check if some of it can make it to version 3 (master branch). I'll let you know if I encounter issues.

I have only one question/comment so far. The HTTP Collector already supports filtering documents on dates (DateMetadataFilter, in the Importer module) and also supports not recrawling documents until a certain elapsed time has passed. These two features combined can give you the same end result as the "from" you added. Did you add the "from" because you were not familiar with the other options, or because you find it more convenient to filter right at the source?

I can see that rejecting sitemap entries as the sitemap is being read could significantly reduce the number of entries being queued for processing. For that alone, I think it is a worthwhile addition, but I am curious to find out more about your use case.
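
To make that concrete, here is a minimal sketch of rejecting entries while the sitemap is read, evaluating at the end of each <url> element so both <loc> and <lastmod> have been seen (StAX-based; the class and method names are hypothetical, not the PR's actual code):

```java
import java.io.InputStream;
import java.time.ZonedDateTime;
import java.util.function.Consumer;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class; shows the "evaluate at end-of-element" idea.
public final class SitemapFromFilter {

    public static void parse(InputStream in, ZonedDateTime from,
            Consumer<String> onAcceptedLoc) throws XMLStreamException {
        XMLStreamReader xml =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        String element = null;
        String loc = null;
        ZonedDateTime lastmod = null;
        while (xml.hasNext()) {
            switch (xml.next()) {
            case XMLStreamConstants.START_ELEMENT:
                element = xml.getLocalName();
                if ("url".equals(element)) {
                    loc = null;
                    lastmod = null;
                }
                break;
            case XMLStreamConstants.CHARACTERS: {
                String text = xml.getText().trim();
                if (text.isEmpty()) {
                    break;
                }
                if ("loc".equals(element)) {
                    loc = text;
                } else if ("lastmod".equals(element)) {
                    // Real sitemaps allow date-only W3C values;
                    // a production parser must be more lenient.
                    lastmod = ZonedDateTime.parse(text);
                }
                break;
            }
            case XMLStreamConstants.END_ELEMENT:
                // Decide only once the whole <url> has been seen.
                // Entries without a lastmod are kept, since nothing
                // proves they are older than the cutoff.
                if ("url".equals(xml.getLocalName()) && loc != null
                        && (lastmod == null || !lastmod.isBefore(from))) {
                    onAcceptedLoc.accept(loc);
                }
                element = null;
                break;
            default:
                break;
            }
        }
    }
}
```

A sitemap index can be treated the same way at the end of each <sitemap> element, which is what lets whole sub-sitemaps be skipped before they are ever downloaded.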

@essiembre essiembre merged commit 2d294b8 into Norconex:2.x-branch Oct 25, 2020
@essiembre (Contributor)

One feature I think has no effect is the "quitAtDepth" option on the LinkExtractorStage: URLs that are too deep are already rejected before link extraction takes place. If you have evidence to the contrary, please let me know how to reproduce it, as it would be a bug IMO.

essiembre added a commit that referenced this pull request Oct 26, 2020
@simonwibberley (Author) commented Oct 29, 2020

@essiembre great, glad to contribute! Indeed, I wanted to prevent unnecessary downloading of ancient sitemaps, which can be large. One of our use cases is to frequently fetch the latest articles from news sites using only sitemaps; we want to run this on a managed schedule instead of recrawling based on update frequency and the like (I'm not very familiar with that mode of operation yet).

FYI, some more sitemap-related patches to PR soon...
