Tries to follow links with "tel:" schema #212

Closed
niels opened this issue Jan 8, 2016 · 2 comments

niels commented Jan 8, 2016

Given

A page linking to a tel: URI:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Norconex test</title>
  </head>

  <body>
    <a href="tel:123">Phone Number</a>
  </body>
</html>

And the following config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url>https://herimedia.com/norconex-test/phone.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Expected

The collector should not follow this link, nor links using any other URI scheme it can't actually process.

Actual

The collector tries to follow the tel: link, treating it as a path relative to the page URL.

INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: true
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] test-crawler: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123
INFO  [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO  [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO  [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)

Note the REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123 message.
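
For illustration, the rejected reference is what you get when `tel:123` is treated as a relative path and appended to the page's directory. The sketch below uses only `java.net.URI`; it is not Norconex code, the class and method names are hypothetical, and `naiveJoin` is only an assumption about how such a URL could arise. Standard URI resolution leaves `tel:123` untouched because it already carries a scheme:

```java
import java.net.URI;

public class TelResolutionDemo {

    /** Resolves an href against the page URL using standard URI rules. */
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(URI.create(href)).toString();
    }

    /**
     * Naive joining that ignores any scheme in the href.
     * Assumption: shown only to reproduce the URL seen in the log.
     */
    public static String naiveJoin(String pageUrl, String href) {
        return pageUrl.substring(0, pageUrl.lastIndexOf('/') + 1) + href;
    }

    public static void main(String[] args) {
        String page = "https://herimedia.com/norconex-test/phone.html";
        System.out.println(resolve(page, "tel:123"));    // tel:123
        System.out.println(naiveJoin(page, "tel:123"));  // https://herimedia.com/norconex-test/tel:123
        System.out.println(resolve(page, "other.html")); // https://herimedia.com/norconex-test/other.html
    }
}
```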

essiembre added a commit that referenced this issue Jan 15, 2016
ftp. It is possible to overwrite these default with
#setSchemes(String[]). Github #212.
@essiembre
Contributor

By default, GenericLinkExtractor now only handles these URL schemes: http, https, and ftp. This can be overridden like this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
  <schemes>http, https, somefunnyone</schemes>
</extractor>
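
The idea behind the scheme restriction can be sketched in plain Java. This is not the GenericLinkExtractor source: `SchemeFilter` and `isAllowed` are hypothetical names, and letting scheme-less (relative) hrefs through is an assumption of this sketch. The scheme list passed in `main` mirrors the defaults named above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SchemeFilter {

    private final Set<String> schemes = new HashSet<>();

    /** Builds a filter from the allowed schemes, compared case-insensitively. */
    public SchemeFilter(String... allowed) {
        for (String s : allowed) {
            schemes.add(s.trim().toLowerCase(Locale.ROOT));
        }
    }

    /** True if the href has no scheme (relative link) or an allowed scheme. */
    public boolean isAllowed(String href) {
        try {
            String scheme = new URI(href.trim()).getScheme();
            return scheme == null
                    || schemes.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Assumption: hrefs that do not parse as URIs are simply skipped.
            return false;
        }
    }

    public static void main(String[] args) {
        SchemeFilter filter = new SchemeFilter("http", "https", "ftp");
        System.out.println(filter.isAllowed("tel:123"));             // false
        System.out.println(filter.isAllowed("https://example.com/")); // true
        System.out.println(filter.isAllowed("other.html"));          // true
    }
}
```

With a filter like this applied before URL resolution, the `tel:123` link from the test page would be dropped instead of being queued.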

I did not give TikaLinkExtractor the same ability, since people may want to use the Tika implementation for what it is supposed to do out of the box (which seems to be extracting URIs for all schemes).

This has been added to the latest snapshot.

Because this new logic will extract fewer links (which is what we want), I hope it won't cause regressions for some people.

@essiembre essiembre added this to the 2.4.0 milestone Jan 15, 2016
@essiembre essiembre self-assigned this Jan 15, 2016
@niels
Author

niels commented Jan 20, 2016

Confirmed working. Thank you!

@niels niels closed this as completed Jan 20, 2016