Tries to follow links with "tel:" schema #212

niels · 2016-01-08T15:25:17Z

Given

A page linking to a tel: URI:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Norconex test</title>
  </head>

  <body>
    <a href="tel:123">Phone Number</a>
  </body>
</html>

And the following config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url>https://herimedia.com/norconex-test/phone.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Expected

The collector should not follow this link, or any other link whose URI scheme it can't actually process.

Actual

The collector tries to follow the tel: link.

INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: true
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] test-crawler: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123
INFO  [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO  [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO  [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)

Note the REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123 message.
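
For illustration: the rejected URL looks like what you get when the href is appended to the page's directory as if it were a relative path. The sketch below uses only java.net.URI from the JDK and is not Norconex code; the class name TelResolutionSketch and its helper methods are made up for this example. It shows that a standards-based resolver treats tel:123 as an absolute URI with its own scheme and leaves it untouched, while a genuinely relative href is resolved against the page URL.

```java
import java.net.URI;

class TelResolutionSketch {

    /** Returns the URI scheme of an href, or null if the href is relative. */
    static String schemeOf(String href) {
        return URI.create(href).getScheme();
    }

    /** Resolves an href against the URL of the page it was found on. */
    static URI resolve(URI base, String href) {
        // URI.resolve returns the given URI unchanged when it is already absolute.
        return base.resolve(URI.create(href));
    }

    public static void main(String[] args) {
        URI base = URI.create("https://herimedia.com/norconex-test/phone.html");

        System.out.println(schemeOf("tel:123"));          // prints "tel"
        System.out.println(resolve(base, "tel:123"));     // prints "tel:123"
        System.out.println(resolve(base, "other.html"));  // prints "https://herimedia.com/norconex-test/other.html"
    }
}
```

The third call is only there for contrast (other.html is a hypothetical page): it has no scheme, so it inherits the base URL's scheme and directory.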


essiembre · 2016-01-15T02:29:49Z

By default, GenericLinkExtractor now only handles these URL schemes: http, https, and ftp. This can be overridden like this:

  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
     <schemes>http, https, somefunnyone</schemes>
  </extractor>
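
To make the idea of a scheme allow-list concrete, here is a minimal standalone sketch in plain Java. It is not the GenericLinkExtractor implementation and does not use any Norconex API; the class SchemeFilterSketch and its accepts method are hypothetical names, and treating scheme-less hrefs as acceptable relative links is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeFilterSketch {

    // RFC 3986 scheme syntax: a letter, then letters, digits, "+", "-" or ".", then ":".
    private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private final Set<String> allowed = new HashSet<>();

    SchemeFilterSketch(String... schemes) {
        for (String scheme : schemes) {
            allowed.add(scheme.trim().toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if the href should be kept for crawling. */
    boolean accepts(String href) {
        Matcher m = SCHEME.matcher(href.trim());
        if (!m.find()) {
            // No scheme: a relative link, which inherits the scheme of its page.
            return true;
        }
        return allowed.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SchemeFilterSketch filter = new SchemeFilterSketch("http", "https", "ftp");
        String[] hrefs = {"tel:123", "mailto:a@example.com", "page.html", "https://example.com/"};
        for (String href : hrefs) {
            System.out.println(href + " -> " + (filter.accepts(href) ? "follow" : "skip"));
        }
    }
}
```

With the default list from the comment above, tel: and mailto: links are skipped, while relative and https links are followed.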

I did not upgrade the TikaLinkExtractor with the same ability, since people may want to use the Tika implementation for what it is supposed to do out of the box (which seems to be extracting URIs for all schemes).

This has been added to the latest snapshot.

Because this new logic will extract fewer links (which is what we want), I hope it won't cause regressions for some people.

niels · 2016-01-20T11:34:55Z

Confirmed working. Thank you!

essiembre added a commit that referenced this issue Jan 15, 2016

GenericLinkExtractor now only supports these URI schemes: http, https, ftp. It is possible to overwrite these default with #setSchemes(String[]). Github #212.

04c1eb8

essiembre added the resolved label Jan 15, 2016

essiembre added this to the 2.4.0 milestone Jan 15, 2016

essiembre self-assigned this Jan 15, 2016

niels closed this as completed Jan 20, 2016
