recrawlableResolver does not work as expected #741

Closed
jetnet opened this issue Mar 18, 2021 · 9 comments

@jetnet

jetnet commented Mar 18, 2021

Hello Pascal,

Some pages are still being crawled despite the recrawlableResolver policy, e.g.:

<recrawlableResolver class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver" sitemapSupport="last">
    <minFrequency applyTo="reference" value="1d">.*</minFrequency>
</recrawlableResolver>
  • Start the 1st crawl and check a URL:
$ grep //schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/ latest/logs/schule.fragfinn.de.log
schule.fragfinn.de: 2021-03-18 21:16:36 INFO - DOCUMENT_METADATA_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:16:36 INFO -          DOCUMENT_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:16:36 INFO -       CREATED_ROBOTS_META: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:16:36 INFO -            URLS_EXTRACTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:16:37 INFO -         DOCUMENT_IMPORTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:16:37 INFO -    DOCUMENT_COMMITTED_ADD: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
  • Start the 2nd crawl right after and check the same URL:
$ grep //schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/ latest/logs/schule.fragfinn.de.log
schule.fragfinn.de: 2021-03-18 21:17:09 INFO -        REJECTED_PREMATURE: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:09 INFO - DOCUMENT_METADATA_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:10 INFO -          DOCUMENT_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:10 INFO -       CREATED_ROBOTS_META: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:10 INFO -            URLS_EXTRACTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:14 INFO -         DOCUMENT_IMPORTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-18 21:17:14 INFO -    DOCUMENT_COMMITTED_ADD: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/

The URL was rejected as premature, but it was still fetched and committed.
Expected behaviour: do not process it after REJECTED_PREMATURE
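
The expectation above can be sketched roughly as follows. This is only an illustration of the policy, not Norconex code: it assumes that value="1d" means one day and that the regular expression has to match the whole reference, and `is_premature` is a made-up name.

```python
import re
from datetime import datetime, timedelta


def is_premature(reference, last_crawled, now,
                 pattern=r".*", min_frequency=timedelta(days=1)):
    """Return True if the reference should be skipped in this crawl.

    Assumption: a reference is "premature" when it matches the pattern
    and less than min_frequency has elapsed since it was last crawled.
    """
    if re.fullmatch(pattern, reference) is None:
        return False  # the minFrequency rule does not apply to this reference
    return now - last_crawled < min_frequency


url = "https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/"
first_crawl = datetime(2021, 3, 18, 21, 16, 36)
second_crawl = datetime(2021, 3, 18, 21, 17, 9)
skip_on_second_crawl = is_premature(url, first_crawl, second_crawl)
```

With the timestamps from the two crawls above (21:16:36 and 21:17:09), the second visit falls well inside the one-day window, so the page should be skipped entirely rather than fetched and committed again.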

Please let me know if you need the whole config.
Thanks a lot!

@jetnet
Author

jetnet commented Mar 19, 2021

I found the part of the configuration that causes the issue. It is the TikaLinkExtractor in the following snippet:

<linkExtractors>
    <!-- Tika link extractor fetches "alt" data from images -->
    <extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor" ignoreNofollow="false"/>
    <!-- GenericLinkExtractor used to extract links from the following tags -->
    <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
        <tags>
            <tag name="frame" attribute="src" />
            <tag name="iframe" attribute="src" />
            <tag name="meta" attribute="http-equiv" />
            <tag name="script" attribute="src" />
        </tags>
    </extractor>
</linkExtractors>

Could you please take a look? Thanks!

@essiembre
Contributor

Odd, technically, the link extraction should not occur on a premature document. Can you please share a complete configuration that reproduces the issue?

@jetnet
Author

jetnet commented Mar 22, 2021

Here we go: https://0x0.st/-q8z.xml

@essiembre
Contributor

Thanks for sharing your file. I was able to reproduce the issue with it. It was tied to pages containing links to self. Such links were added as a child link to process even if the "parent" (i.e., itself) was identified as premature. I just made a new 2.x snapshot release with a fix for it. Please confirm.
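
As a rough illustration of the failure mode described above (a toy model, not the actual Norconex code), the following sketch shows how a page that links to itself can end up queued again as its own child, and how skipping the self-link for a premature parent avoids that. The function name, its parameters, and the shape of the fix are all assumptions made for this example.

```python
def children_to_queue(parent, child_links, parent_is_premature, skip_self):
    """Return the child links that would be queued for processing.

    skip_self=False models the reported behaviour: a link pointing back
    to the parent is queued even though the parent itself was premature.
    skip_self=True models an assumed fix that drops such a self-link.
    """
    queued = []
    for link in child_links:
        if skip_self and parent_is_premature and link == parent:
            continue  # a premature page must not re-queue itself
        queued.append(link)
    return queued


page = "https://example.com/a/"  # hypothetical page that links to itself
links = [page, "https://example.com/b/"]

before_fix = children_to_queue(page, links, True, skip_self=False)
after_fix = children_to_queue(page, links, True, skip_self=True)
```

In this model `before_fix` still contains the page itself, which would correspond to the DOCUMENT_FETCHED and DOCUMENT_COMMITTED_ADD events that followed REJECTED_PREMATURE, while `after_fix` keeps only the other link.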

@jetnet
Author

jetnet commented Mar 23, 2021

Thank you very much for the quick fix! Really appreciate that!

I just tested and noticed a new issue with the sitemap:

Before (not sure which snapshot it is: 403017 Jun 7 2020 lib/norconex-collector-http-2.9.1-SNAPSHOT.jar):

schule.fragfinn.de: 2021-03-22 11:09:43 ERROR - Could not obtain sitemap: https://www.fragfinn.de/sitemap.xml.  Expected status code 200, but got 301
schule.fragfinn.de: 2021-03-22 11:09:43 ERROR - Could not obtain sitemap: https://schule.fragfinn.de/sitemap.xml.  Expected status code 200, but got 301
schule.fragfinn.de: 2021-03-22 11:09:43 INFO - Resolving sitemap: https://schule.fragfinn.de/sitemap_index.xml
schule.fragfinn.de: 2021-03-22 11:09:43 INFO - Resolving sitemap: https://schule.fragfinn.de/page-sitemap.xml

latest snapshot:

schule.fragfinn.de: 2021-03-23 09:47:37 ERROR - Could not obtain sitemap: https://www.fragfinn.de/sitemap.xml.  Expected status code 200, but got 301
schule.fragfinn.de: 2021-03-23 09:47:37 ERROR - Could not obtain sitemap: https://schule.fragfinn.de/sitemap.xml.  Expected status code 200, but got 301
schule.fragfinn.de: 2021-03-23 09:47:37 INFO - Resolving sitemap: https://schule.fragfinn.de/sitemap_index.xml
schule.fragfinn.de: 2021-03-23 09:47:37 ERROR - Cannot fetch sitemap: https://schule.fragfinn.de/sitemap_index.xml (java.lang.NullPointerException)

Looks like the latest snapshot cannot fetch the sitemap. Could you please take a look? Thanks a lot!

Update: I just realized that there is a similar issue, #738.

@essiembre
Contributor

Since the sitemap issue is tracked in #738, I will close this one.

I am assuming the "premature" issues are fixed? If not, feel free to reopen or create a new ticket.

@jetnet
Author

jetnet commented Mar 31, 2021

I just tested the latest snapshot:

-rw-r--r-- 1 crawler crawler 199934 Mar 29 22:32 norconex-collector-core-1.10.1-SNAPSHOT.jar
-rw-r--r-- 1 crawler crawler 407983 Mar 29 22:35 norconex-collector-http-2.9.1-SNAPSHOT.jar

As you can see from the following, the page gets crawled twice when the crawlstore is not there, and every subsequent crawl produces one REJECTED_PREMATURE and one DOCUMENT_COMMITTED_ADD:

$ tail -F latest/logs/schule.fragfinn.de.log | grep //schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/

tail: 'latest/logs/schule.fragfinn.de.log' has become inaccessible: No such file or directory
tail: 'latest/logs/schule.fragfinn.de.log' has appeared;  following new file
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:          DOCUMENT_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:       CREATED_ROBOTS_META: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:            URLS_EXTRACTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:         DOCUMENT_IMPORTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:    DOCUMENT_COMMITTED_ADD: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:          DOCUMENT_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:       CREATED_ROBOTS_META: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:            URLS_EXTRACTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:         DOCUMENT_IMPORTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:25 INFO - schule.fragfinn.de:    DOCUMENT_COMMITTED_ADD: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/

tail: 'latest/logs/schule.fragfinn.de.log' has become inaccessible: No such file or directory
tail: 'latest/logs/schule.fragfinn.de.log' has appeared;  following new file
schule.fragfinn.de: 2021-03-31 09:47:47 INFO - schule.fragfinn.de:        REJECTED_PREMATURE: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:50 INFO - schule.fragfinn.de:          DOCUMENT_FETCHED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:50 INFO - schule.fragfinn.de:       CREATED_ROBOTS_META: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:50 INFO - schule.fragfinn.de:            URLS_EXTRACTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:50 INFO - schule.fragfinn.de:         DOCUMENT_IMPORTED: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
schule.fragfinn.de: 2021-03-31 09:47:50 INFO - schule.fragfinn.de:    DOCUMENT_COMMITTED_ADD: https://schule.fragfinn.de/macht-mit-finns-freundin-braucht-einen-namen/
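
To check a whole log rather than grepping for a single URL, a small script along these lines could list every reference that was logged as REJECTED_PREMATURE and still reached DOCUMENT_FETCHED in the same run. It is a sketch that assumes the log layout shown above (an upper-case event name, a colon, then the URL); the function name is made up.

```python
import re
from collections import defaultdict

# Assumed layout: "... INFO - <EVENT_NAME>: <url>", as in the excerpts above.
EVENT_RE = re.compile(r"\b([A-Z][A-Z_]+):\s+(https?://\S+)")


def premature_but_fetched(lines):
    """Return the URLs logged as REJECTED_PREMATURE that were also fetched."""
    events = defaultdict(set)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            event, url = match.groups()
            events[url].add(event)
    return sorted(
        url for url, seen in events.items()
        if {"REJECTED_PREMATURE", "DOCUMENT_FETCHED"} <= seen
    )
```

The function accepts any iterable of log lines, so it could be fed an open file such as latest/logs/schule.fragfinn.de.log. An empty result would suggest that no premature reference was processed in that run.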

Could you please reopen this ticket? It seems I do not have permission to do that.
Thanks a lot!

@essiembre
Contributor

I just made a new snapshot with a fix. I could not reproduce the issue with it. Please confirm.

@jetnet
Author

jetnet commented Apr 6, 2021

Yes, the snapshot norconex-collector-http-2.9.1-20210406.043458-18.zip works as expected!
Thank you very much!
