Allow to connect all threads to a single domain #594

Closed
1 commit proposed for merge into base: master


orliac commented Jul 20, 2018

Until now, HttpProtocol.java has had a hard limit of 20 connections per route:
CONNECTION_MANAGER.setDefaultMaxPerRoute(20);

This caused problems when crawling a very limited number of domains with large values for the configuration parameters fetcher.threads.per.queue and fetcher.threads.number (e.g. several hundred threads for both). The error observed was of this type:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:292) 

With the proposed solution we potentially allow all threads to connect to a single domain.

@jnioche jnioche added this to the 1.11 milestone Jul 20, 2018

@jnioche jnioche added the core label Jul 20, 2018

Member

jnioche commented Jul 20, 2018

Thanks @orliac, I'd rather it were set to the value of fetcher.threads.per.queue when that value is > 20.
Setting it to the value of maxFetchThreads is overkill, since in most cases crawlers behave politely and limit the number of threads per queue to 1.
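
For illustration, the sizing rule suggested here could be sketched roughly as follows. This is a minimal sketch and not the code that was eventually committed: the class name PoolSizing, the constant DEFAULT_MAX_PER_ROUTE and the method maxPerRoute are hypothetical, and the threadsPerQueue argument is assumed to hold the configured value of fetcher.threads.per.queue.

```java
public class PoolSizing {

    // The hard limit quoted in this pull request (hypothetical constant name).
    static final int DEFAULT_MAX_PER_ROUTE = 20;

    // Keep the default of 20 unless fetcher.threads.per.queue is larger,
    // in which case the pool allows one connection per thread in the queue.
    static int maxPerRoute(int threadsPerQueue) {
        return Math.max(DEFAULT_MAX_PER_ROUTE, threadsPerQueue);
    }

    public static void main(String[] args) {
        // A polite crawl with one thread per queue keeps the default limit.
        System.out.println(maxPerRoute(1));
        // An aggressive crawl on a few domains raises the limit to match.
        System.out.println(maxPerRoute(300));
        // The result would then be handed to the connection manager, e.g.
        // CONNECTION_MANAGER.setDefaultMaxPerRoute(maxPerRoute(threadsPerQueue));
    }
}
```

With this rule a crawl that leaves fetcher.threads.per.queue at 1 keeps the existing pool size, while a crawl configured with several hundred threads per queue no longer has threads waiting on the pool for a connection to the same host.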

orliac commented Jul 20, 2018

Hi @jnioche, perfect, thank you!

@jnioche jnioche closed this in c572c9f Aug 15, 2018
