Hi, I've been running a crawler for days now. Apparently, a timeout occurred on one page and the crawler has been stopped for more than 2 hours...
Is that the expected/normal behaviour? Isn't the problematic URL supposed to be put back in the queue so the crawler can continue its job?
For more information, you'll find an extract of the log below:
Oct 20, 2016 12:02:16 PM ....
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.freepatentsonline.com/2529443.html
ERROR [GenericDocumentFetcher] Cannot fetch document: http://www.freepatentsonline.com/2061740.html (Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
INFO [CrawlerEventManager] REJECTED_ERROR: http://www.freepatentsonline.com/2061740.html
ERROR [AbstractCrawler] Freepatentsonline: Could not process document: http://www.freepatentsonline.com/2061740.html (org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
com.norconex.collector.core.CollectorException: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:171)
at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:300)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:488)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:378)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:736)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:110)
... 11 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
... 22 more
It's now: Oct 20, 2016 14:16:56 PM ....
It probably hangs because the thread with the stalled connection keeps waiting for it to return. I recommend you try explicitly configuring timeout settings on the HTTP connections being made. There are a handful of timeout-related configuration options in GenericHttpClientFactory, which go in your crawler config.
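As a rough sketch of what explicit timeouts could look like in the crawler config: the element names below are written from memory of the 2.x GenericHttpClientFactory documentation, so treat them as assumptions and check them against the docs for the version you run. The millisecond values are purely illustrative, not recommendations.

```xml
<!-- Sketch only: goes inside a <crawler> (or <crawlerDefaults>) section.
     Element names are assumed from the 2.x GenericHttpClientFactory docs;
     verify them for your version. All values are in milliseconds. -->
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- Maximum time to wait for a connection to be established. -->
  <connectionTimeout>30000</connectionTimeout>
  <!-- Maximum period of inactivity while waiting for data on an open connection. -->
  <socketTimeout>30000</socketTimeout>
  <!-- Maximum time to wait for a connection from the connection pool. -->
  <connectionRequestTimeout>30000</connectionRequestTimeout>
</httpClientFactory>
```

With all three set, a thread waiting on a stalled connection should give up after the configured delay rather than wait indefinitely, and the crawler can then move on to the next queued URL.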