recrawlableResolver does not work as expected #741
Comments
I found the part of the configuration that causes the issue. It's
Could you please take a look? Thanks! |
Odd, technically, the link extraction should not occur on a premature document. Can you please share a complete configuration that reproduces the issue? |
here we go: https://0x0.st/-q8z.xml |
Thanks for sharing your file. I was able to reproduce the issue with it. It was tied to pages containing links to self. Such links were added as a child link to process even if the "parent" (i.e., itself) was identified as premature. I just made a new 2.x snapshot release with a fix for it. Please confirm. |
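To make the cause described above concrete, here is a minimal Python sketch of the idea. It is not Norconex code: the function `links_to_queue`, the `premature_urls` set, and the example URLs are hypothetical names chosen for illustration. The assumption is simply that a link extracted from a page, including a link to the page itself, should not be queued if its target was already identified as premature.

```python
def links_to_queue(extracted_links, premature_urls):
    """Return the extracted links that may be queued for processing.

    Hypothetical helper, not part of any real crawler API.
    Links whose target was already rejected as premature are skipped,
    which also covers a page that links to itself.
    """
    queued = []
    for link in extracted_links:
        if link in premature_urls:
            # The target (possibly the parent page itself) is premature:
            # do not re-add it as a child link.
            continue
        queued.append(link)
    return queued


# Hypothetical example: page /a is premature and contains a link to itself.
page = "https://example.com/a"
links = [page, "https://example.com/b"]
result = links_to_queue(links, premature_urls={page})
```

In this sketch only `https://example.com/b` ends up in `result`; the self-link is dropped because its target is in the premature set.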
Thank you very much for the quick fix! Really appreciate that! I just tested and noticed a new issue with the sitemap: before (not sure what snapshot it is -
latest snapshot:
It looks like the latest snapshot cannot fetch the sitemap. Could you please take a look? Thanks a lot! Update: I just realized that there is a similar issue, #738. |
Since the sitemap issue is tracked in #738, I will close this one. I am assuming the "premature" issues are fixed? If not, feel free to reopen or create a new ticket. |
I just tested the latest snapshot
As you can see from the following, the page gets crawled twice when the crawlstore does not exist yet, and every subsequent crawl produces one PREMATURE and one ADD:
Could you please reopen this ticket? It seems I do not have permission to do that. |
processed normally (not through a redirect). #741
I just made a new snapshot with a fix. I could not reproduce the issue with it. Please confirm. |
yes, the snapshot |
Hello Pascal,
some pages are still being crawled despite the recrawlableResolver policy, e.g.:
The crawler rejected this URL, but still fetched and committed it.
Expected behaviour: do not process a URL after REJECTED_PREMATURE.
Please let me know, if you need the whole config.
Thanks a lot!
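The expected behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the collector's implementation: `resolve_recrawl`, `crawl_once`, the `PROCESS` status, and the `min_frequency` parameter are hypothetical names, and the sketch assumes a URL is premature when less than a minimum interval has passed since its last crawl. Only the `REJECTED_PREMATURE` status name comes from this report.

```python
from datetime import datetime, timedelta

PROCESS = "PROCESS"  # hypothetical status name
REJECTED_PREMATURE = "REJECTED_PREMATURE"  # status name from the report


def resolve_recrawl(last_crawled, now, min_frequency):
    """Decide whether a URL may be recrawled (hypothetical helper)."""
    if last_crawled is None:
        # Never crawled before: always process.
        return PROCESS
    if now - last_crawled < min_frequency:
        # Assumed rule: too soon since the last crawl.
        return REJECTED_PREMATURE
    return PROCESS


def crawl_once(url, last_crawled, now, min_frequency, fetch, commit):
    """Fetch and commit a URL only if it is not premature."""
    status = resolve_recrawl(last_crawled, now, min_frequency)
    if status == REJECTED_PREMATURE:
        # Expected behaviour: stop here, no fetch and no commit.
        return status
    commit(url, fetch(url))
    return status
```

With this ordering, a rejected URL never reaches `fetch` or `commit`; the behaviour reported above corresponds to the rejection being logged while those later steps still run.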