Minimal Example does not work... #252

Closed
liar666 opened this issue Jun 7, 2016 · 2 comments

liar666 commented Jun 7, 2016

I've:

The output was:

[non-job]: 2016-06-07 16:47:20 INFO - Starting execution.
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex HTTP Collector 2.5.0 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Collector Core 1.5.0 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Importer 2.5.2 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex JEF 4.0.7 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Running Norconex Minimum Test Page: BEGIN (Tue Jun 07 16:47:20 CEST 2016)
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: RobotsTxt support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: RobotsMeta support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: Sitemap support: false
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: Canonical links support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: User-Agent: <None specified>
Norconex Minimum Test Page: 2016-06-07 16:47:21 INFO - Norconex Minimum Test Page: Initializing sitemap store...
Norconex Minimum Test Page: 2016-06-07 16:47:21 INFO - Norconex Minimum Test Page: Done initializing sitemap store.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - 1 start URLs identified.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -           CRAWLER_STARTED
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawling references...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -       REJECTED_REDIRECTED: http://www.norconex.com/product/collector-http-test/minimum.php
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -           REJECTED_FILTER: https://www.norconex.com/product/collector-http-test/minimum.php
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Re-processing orphan references (if any)...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Reprocessed 0 orphan references...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler finishing: committing documents.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: 1 reference(s) processed.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -          CRAWLER_FINISHED
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler completed.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler executed in 2 seconds.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Running Norconex Minimum Test Page: END (Tue Jun 07 16:47:20 CEST 2016)

  • went to look at the results dir
  • I see no "crawledFiles" directory, contrary to what the docs (http://www.norconex.com/collectors/collector-http/getting-started) say:
    $ ls examples-output/minimum/
    total 24K
    drwx------ 6 gm gm 4.0K Jun 7 16:47 .
    drwx------ 3 gm gm 4.0K Jun 7 16:47 ..
    drwx------ 3 gm gm 4.0K Jun 7 16:47 crawlstore
    drwx------ 3 gm gm 4.0K Jun 7 16:47 logs
    drwx------ 3 gm gm 4.0K Jun 7 16:47 progress
    drwx------ 3 gm gm 4.0K Jun 7 16:47 sitemaps

liar666 commented Jun 7, 2016

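OK, found the problem: the start URL begins with http://, which redirects to https:// when accessed, and the redirected https:// URL is then rejected by the reference filter (see REJECTED_REDIRECTED followed by REJECTED_FILTER in the log above).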

Modifying the example with the following lines did the trick for me:

https://www.norconex.com/product/collector-http-test/minimum.php

  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https?://www\.norconex\.com/.*
    </filter>
  </referenceFilters>
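
To illustrate why the `https?` in the pattern matters, here is a minimal Python sketch that checks both URLs from the log against the regex above. It is only an approximation: it assumes the filter requires the whole URL to match (modeled here with `re.fullmatch`), and the http-only pattern is a hypothetical stand-in for contrast, since the original sample's filter is not quoted in this issue. The `is_included` helper is made up for this example and is not part of Norconex.

```python
import re

# Pattern copied from the filter shown above.
FIXED_PATTERN = r"https?://www\.norconex\.com/.*"
# Hypothetical http-only pattern, assumed for contrast only; the original
# sample's filter is not quoted in this issue.
HTTP_ONLY_PATTERN = r"http://www\.norconex\.com/.*"


def is_included(pattern: str, url: str) -> bool:
    """Return True when the whole URL matches the include pattern.

    Full-string matching is an assumption about how the filter behaves.
    """
    return re.fullmatch(pattern, url) is not None


# Both URLs are taken from the crawler log in the first comment.
start_url = "http://www.norconex.com/product/collector-http-test/minimum.php"
redirect_target = "https://www.norconex.com/product/collector-http-test/minimum.php"

for name, pattern in [("http-only", HTTP_ONLY_PATTERN), ("fixed", FIXED_PATTERN)]:
    print(name, is_included(pattern, start_url), is_included(pattern, redirect_target))
```

Under these assumptions, the http-only pattern accepts the start URL but not its https redirect target, while the pattern with `https?` accepts both.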

liar666 closed this as completed Jun 7, 2016
@essiembre
Contributor

I have updated the sample configuration files to now point to https instead of http (for the next release).

I have already updated the online copies to reflect this:

Thanks for reporting this.
