Expected
The collector should not follow this link, or any link whose URI scheme it cannot actually process.
Actual
The collector tries to follow the tel: link.
INFO [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO [JobSuite] JEF work directory is: ./progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] No previous execution detected.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO [HttpCrawler] test-crawler: RobotsTxt support: true
INFO [HttpCrawler] test-crawler: RobotsMeta support: true
INFO [HttpCrawler] test-crawler: Sitemap support: true
INFO [HttpCrawler] test-crawler: Canonical links support: true
INFO [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO [SitemapStore] test-crawler: Initializing sitemap store...
INFO [SitemapStore] test-crawler: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] test-crawler: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://herimedia.com/norconex-test/phone.html
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://herimedia.com/norconex-test/phone.html
INFO [CrawlerEventManager] URLS_EXTRACTED: https://herimedia.com/norconex-test/phone.html
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://herimedia.com/norconex-test/phone.html
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://herimedia.com/norconex-test/phone.html
INFO [CrawlerEventManager] REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123
INFO [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] test-crawler: Crawler completed.
INFO [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)
Note the REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123 message.
I did not upgrade the TikaLinkExtractor with the same ability, since people may want to use the Tika implementation for what it is supposed to do out of the box (which seems to be extracting URIs for all schemes).
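To illustrate the kind of scheme check discussed here, the following is a minimal sketch in plain Java. It is not the Norconex implementation: the class name `SchemeFilter`, its methods, the supported-scheme list, and the `other.html` link are all assumptions made for the example. It skips any href that carries a scheme other than http or https, so `tel:123` is never resolved against the page URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Norconex HTTP Collector.
public class SchemeFilter {

    // Schemes this sketch treats as fetchable (an assumption).
    private static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("http", "https"));

    /** True if the href is a relative reference or uses a supported scheme. */
    public static boolean isCrawlable(String href) {
        try {
            URI uri = new URI(href.trim());
            String scheme = uri.getScheme();
            // No scheme means a relative reference, which is resolved later.
            return scheme == null
                    || SUPPORTED.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Hrefs that do not parse as URIs are skipped in this sketch.
            return false;
        }
    }

    /** Resolves only the crawlable hrefs against the page URL. */
    public static List<String> resolveCrawlable(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> resolved = new ArrayList<>();
        for (String href : hrefs) {
            if (isCrawlable(href)) {
                resolved.add(base.resolve(href.trim()).toString());
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // "other.html" is a made-up link used only for illustration.
        List<String> hrefs = Arrays.asList("other.html", "tel:123", "mailto:a@b.c");
        System.out.println(resolveCrawlable(
                "https://herimedia.com/norconex-test/phone.html", hrefs));
    }
}
```

The check runs on the raw href before resolution, because once `tel:123` has been turned into `https://herimedia.com/norconex-test/tel:123` the original scheme is no longer visible to a later filter.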
Given
A page linking to a tel: URI.
And the following config: