Cannot crawl all urls from a sitemap #758
Comments
There are a few reasons this can happen. For instance, maybe a URL did not get updated because the sitemap indicated it had not changed since the previous crawl. What do the logs say about those URLs?
To rule out the last-modified date as the cause, I duplicated the application, cleared all of the caches, and then ran the test again.
What about the logs? Increase the verbosity if you have to, and look for what happened to the missing URLs. With the proper log level, every URL encountered should have an entry in the logs.
I tried to increase the verbosity by changing the loggers below to DEBUG
But I still cannot find the missing URLs, such as https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13 (I searched the log for it by keyword). I confirm that the URL above is in the sitemap. The full log is attached.
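For anyone repeating this check, the comparison between the sitemap and the log can be scripted. The following is only a rough sketch in Python using the standard library; it is not part of the crawler, and the function names `sitemap_urls` and `urls_missing_from_log` are hypothetical. It assumes the sitemap uses the standard sitemaps.org namespace and that the log contains each fetched URL verbatim. Because it uses a plain substring match, a URL that is a prefix of another logged URL would be counted as present.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed; adjust if the sitemap declares another one).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> value found in the sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_missing_from_log(sitemap_xml, log_text):
    """Return the sitemap URLs that never appear anywhere in the log text."""
    return [url for url in sitemap_urls(sitemap_xml) if url not in log_text]
```

Calling `urls_missing_from_log` with the sitemap content and the full crawler log gives the list of URLs to look into first.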
I was able to reproduce with what you shared. It turns out having
My client is using version 2.8.2-SNAPSHOT and found that some URLs were not updated in the search engine.
For example: https://store.acer.com/de-de/nitro-5-gaming-notebook-an517-51-schwarz-13
I checked and found that the crawler did not fetch this URL, even though it is included in the sitemap.
My client does not want to make major changes to the crawler program.
Is there any workaround or hotfix for this version?
The config is below: