HstsResolver mishandles country code second-level domains #785

Open
blue-jam opened this issue May 5, 2022 · 2 comments

blue-jam commented May 5, 2022

Summary

HstsResolver mishandles country code second-level domains (e.g. co.jp): it emits a WARN log and fails to check HSTS support correctly.

Reproduction

Run a collector with start URL = https://www.ipsj.or.jp/english/index.html.

Actual behavior

HstsResolver tries to communicate with or.jp and emits a WARN message:

WARN HstsResolver - Attempt to verify if the site supports Strict-Transport-Security (HSTS) failed for domain "or.jp". We'll assumume HSTS is not supported for all URLs on that domain.
  java.net.UnknownHostException: co.jp: No address associated with hostname
  at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
  at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929) ~[?:?]
  at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519) ~[?:?]
  at java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) ~[?:?]
  at java.net.InetAddress.getAllByName0(InetAddress.java:1509) ~[?:?]
  at java.net.InetAddress.getAllByName(InetAddress.java:1368) ~[?:?]
  at java.net.InetAddress.getAllByName(InetAddress.java:1302) ~[?:?]
  at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
  at com.norconex.collector.http.fetch.util.HstsResolver.lambda$resolveHstsSupport$1(HstsResolver.java:105) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at java.util.HashMap.computeIfAbsent(HashMap.java:1134) ~[?:?]
  at com.norconex.collector.http.fetch.util.HstsResolver.resolveHstsSupport(HstsResolver.java:100) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.util.HstsResolver.resolve(HstsResolver.java:77) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.impl.GenericHttpFetcher.fetch(GenericHttpFetcher.java:399) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.HttpFetchClient.fetch(HttpFetchClient.java:102) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:99) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DelayResolverStage.executeStage(HttpImporterPipeline.java:89) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) ~[norconex-commons-lang-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:611) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:923) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
  at java.lang.Thread.run(Thread.java:829) ~[?:?]

Expected behavior

HstsResolver should try to communicate with ipsj.or.jp.

Resources

  • The Public Suffix List lists the suffixes under which Internet users can (or historically could) directly register names (not just country-specific ones). It also provides information about Java libraries.

essiembre added the bug label Jun 12, 2022
essiembre added a commit that referenced this issue Jun 12, 2022
@essiembre
Contributor

A new snapshot release was just made with a fix: HstsResolver now considers the "effective" top-level domain for a URL instead of just the last two parts of the domain. It uses the Public Suffix List, as you suggested.
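
To illustrate the difference between the two strategies, here is a minimal, self-contained sketch. It is not the code from the fix: the class name, the method names, and the tiny hard-coded suffix set are all assumptions made for illustration. The real Public Suffix List contains thousands of rules, including wildcard and exception rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

public class RegistrableDomainSketch {

    // Hypothetical hand-picked excerpt, NOT the real Public Suffix List.
    private static final Set<String> SUFFIXES =
            Set.of("jp", "co.jp", "or.jp", "com", "org");

    /** Naive approach: keep only the last two labels of the host. */
    public static String lastTwoLabels(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        if (labels.length <= 2) {
            return host.toLowerCase(Locale.ROOT);
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /**
     * Suffix-aware approach: find the longest known public suffix and
     * prepend one more label. Returns null if the host is itself a
     * public suffix or no known suffix matches.
     */
    public static String registrableDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Candidates are tried from longest to shortest.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return i == 0 ? null : labels[i - 1] + "." + candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String host = "www.ipsj.or.jp";
        System.out.println(lastTwoLabels(host));     // prints "or.jp"
        System.out.println(registrableDomain(host)); // prints "ipsj.or.jp"
    }
}
```

For production use, a maintained library that bundles the full list is a safer choice than a hand-rolled lookup like this one.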

That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (it times out) when trying to resolve HSTS with a HEAD request.
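
As a rough sketch of what such a probe looks like, the following sends a HEAD request and checks for the Strict-Transport-Security response header. It is an assumption-laden illustration, not the HstsResolver source: the stack trace above shows the collector using Apache HttpClient, whereas this sketch uses the JDK's java.net.http client, and the class name, method names, and 5-second timeouts are made up. Like the WARN message above, it treats any failure as "HSTS not supported".

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

public class HstsProbeSketch {

    /** True if a non-blank Strict-Transport-Security value is present. */
    public static boolean advertisesHsts(Optional<String> headerValue) {
        return headerValue.map(String::trim)
                .filter(v -> !v.isEmpty())
                .isPresent();
    }

    /** Sends a HEAD request to https://domain/ and inspects the header. */
    public static boolean supportsHsts(String domain) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5)) // assumed timeout
                .build();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("https://" + domain + "/"))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .timeout(Duration.ofSeconds(5)) // assumed timeout
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return advertisesHsts(
                    response.headers().firstValue("Strict-Transport-Security"));
        } catch (IOException e) {
            // Unknown host, timeout, TLS failure: assume no HSTS.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Only touches the network when a domain is passed explicitly.
        if (args.length > 0) {
            System.out.println(args[0] + " supports HSTS: " + supportsHsts(args[0]));
        }
    }
}
```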

To ensure only https URLs get crawled for your site, I can think of two options:

  1. Update the website so HSTS can be resolved against the top-level domain ipsj.or.jp.
  2. Update your crawler configuration to set disableHSTS to true on the GenericHttpFetcher and enforce https using the GenericURLNormalizer.
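
To make the second option concrete, this is a sketch of what "enforcing https" during URL normalization amounts to. It does not use the Norconex GenericURLNormalizer API; the class and method names are hypothetical, and only the JDK's java.net.URI is used. It rewrites the http scheme to https and drops an explicit port 80, which would otherwise point the https URL at the wrong port.

```java
import java.net.URI;

public class SecureSchemeSketch {

    /** Rewrites an http URL to https; other URLs are returned unchanged. */
    public static String toHttps(String url) {
        URI uri = URI.create(url);
        if (!"http".equalsIgnoreCase(uri.getScheme()) || uri.getHost() == null) {
            return url;
        }
        String authority = uri.getRawAuthority();
        if (uri.getPort() == 80) {
            // Strip the explicit default http port.
            authority = authority.substring(0, authority.lastIndexOf(':'));
        }
        // Reassemble from raw (still encoded) parts so escapes are preserved.
        StringBuilder sb = new StringBuilder("https://").append(authority);
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "https://www.ipsj.or.jp/english/index.html"
        System.out.println(toHttps("http://www.ipsj.or.jp/english/index.html"));
    }
}
```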

@blue-jam
Author

Thank you very much for fixing it.

> That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (it times out) when trying to resolve HSTS with a HEAD request.

Actually, the URL I shared was just an example that I picked at random from sites I am familiar with. Still, your suggestions for avoiding the remaining warning are very helpful.

I'm looking forward to a new release with the fix.
