Implement connection pooling #20
RxClient should be able to support connection pooling. We can use this issue to discuss the requirements.
Here are the high level requirements I would like to propose:

Types of pool
Two types of pool should be included:

RouteSpecificPool
This pool contains channels that are connected to a specific host/port. The pool capacity is reached when the total number of channels, including checked-out (in use) channels and idle channels, has reached a configurable limit. In this case, no new channels are allowed to be created. An instance of this pool can be associated with an instance of RxClient and use its Bootstrap to create new channels if necessary.

GeneralPool
This pool contains one or more of the route specific pools. The pool capacity is reached for a route if either the route specific pool for that route has reached its capacity or the total number of channels across all routes has reached a configurable limit. In this case, no new channels for that route can be created.

Channel life cycle management
The following are the possible states of a channel:

Additional internal states required for channels in the channel pool:

Pool operations

Request a channel
The pool checks if there is an idle channel. If so, it returns the (already realized) ChannelFuture associated with that channel. Otherwise, if the pool still has capacity, it uses the client's Bootstrap to create a new channel and returns the ChannelFuture from Bootstrap.connect(). If the pool has reached capacity for the route and all channels are in use, a ChannelFuture filled with a failure cause is returned.

Return a channel
Once the client has determined that it is done with the channel, it can return the channel to the pool. The pool will clear its pipeline and attachments and make sure it is reusable by any client. If not, the channel will be deleted.

Clean up of channels
The pool can delete idle channels periodically once they have exceeded their idle timeout. It can also do this during request/return operations if it finds that a channel is no longer reusable.

Metrics
The pool should provide APIs to get the total capacity, the number of channels in use, the number of idle channels, and statistics that reflect pool reuse.
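To make the proposed contract concrete, here is a minimal sketch of what a route specific pool could look like. The class and method names (RouteSpecificPool, requestChannel, returnChannel) are hypothetical and illustrative only, not the eventual RxNetty implementation:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed route specific pool.
public class RouteSpecificPool {

    private final Bootstrap bootstrap;        // client's Bootstrap, used to create new channels
    private final String host;
    private final int port;
    private final int maxConnections;         // in-use + idle channels allowed for this route
    private final AtomicInteger totalChannels = new AtomicInteger();
    private final Queue<Channel> idleChannels = new ConcurrentLinkedQueue<>();

    public RouteSpecificPool(Bootstrap bootstrap, String host, int port, int maxConnections) {
        this.bootstrap = bootstrap;
        this.host = host;
        this.port = port;
        this.maxConnections = maxConnections;
    }

    /** Request a channel: reuse an idle one, create a new one, or fail when at capacity. */
    public ChannelFuture requestChannel() {
        Channel idle;
        while ((idle = idleChannels.poll()) != null) {
            if (idle.isActive()) {
                return idle.newSucceededFuture();   // already-realized future for the reused channel
            }
            totalChannels.decrementAndGet();        // stale channel, drop it
            idle.close();
        }
        if (totalChannels.incrementAndGet() <= maxConnections) {
            return bootstrap.connect(host, port);
        }
        totalChannels.decrementAndGet();
        // Capacity reached and no idle channel: a real pool would report this
        // failure asynchronously rather than throwing.
        throw new IllegalStateException("Pool exhausted for " + host + ":" + port);
    }

    /** Return a channel: keep it for reuse if it is still healthy, otherwise discard it. */
    public void returnChannel(Channel channel) {
        if (channel.isActive()) {
            idleChannels.offer(channel);
        } else {
            totalChannels.decrementAndGet();
            channel.close();
        }
    }
}
```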
@allenxwang what will be the reason someone will specify the total connection limit for a general pool? For the idle timeout, since the server will close the connection after the timeout, we would not have to keep track of that timeout, right?

In order to create a ChannelFuture you need a valid channel, so in the case where you cannot create a new channel, you cannot return a ChannelFuture.

When you say clear the pipeline, does that mean removing any changes done to the pipeline during execution? If yes, then we would have to track what changes were made to the pipeline during execution.
The total connection limit for a general pool applies to a client (like the one in Ribbon) that talks to a server farm consisting of homogeneous servers. It stands for the maximum concurrent connections allowed from one client host to the server farm. It also helps to limit the impact of resource/connection leaks.

The HTTP specification states that both the client and server may choose to time out the connection: "When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate." So I think it is prudent on the client side to close the channel if it is not closed by the server after the specified timeout. The Apache HttpClient connection pool does the same thing: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html

I think using an Rx API on the getConnection operation might be a good idea. I will give it a try.

The pool should remove all channel handlers after a channel is returned to the pool. Ideally, the same channel should be reusable by another RxClient, as if the channel had just been created without any prior state. However, I realized that it might be difficult to remove all the channel attributes, as I did not find an API to get all the attributes. Any suggestions?

When the pool finds the channel is not reusable, for example no longer active, it should close the channel and remove it from the pool. The channel should no longer be accessible and can be garbage collected.
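As an illustration of the "remove all channel handlers" step, here is a sketch of how the pipeline could be stripped back to an empty state when a channel is returned. clearPipeline is a hypothetical helper; channel attributes are deliberately not handled, since as noted above there is no API to enumerate them:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelPipeline;

// Sketch: remove all user-added handlers so the channel looks freshly created.
// Netty keeps internal head/tail contexts, so pipeline.last() returns null
// once every user-added handler has been removed.
static void clearPipeline(Channel channel) {
    ChannelPipeline pipeline = channel.pipeline();
    while (pipeline.last() != null) {
        pipeline.removeLast();
    }
}
```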
Generally, if someone has to configure this number, would it be at least: number of servers * max connections per route?

The quote from the spec does not talk about the keep-alive timeout (the keep-alive timeout attribute is not a standard); it talks about general client and server disconnections. We should not completely base our design on Apache HttpClient, as some of its design decisions come from the fact that its I/O is blocking, so the connection state is only correctly asserted when a blocking operation is issued on the connection. In the async world, Netty gives a callback as soon as a channel is disconnected, so we do not need these kinds of safety nets.

For HTTP, outside the RxNetty code there isn't an easy way to manipulate the pipeline and add attributes. So, unless someone is working inside the pipeline, the code would not be able to change the state of the pipeline/attributes. I guess this can be handled better by a best practice of "cleaning any dynamic state at the end of execution" rather than by creating a framework to stop people from doing this. Down the line, if we see that there are a lot of pitfalls in this approach, maybe we can create such a framework. At this point it looks like an edge case to me. Thoughts?
No, in all cases in production (Zuul, for example) the max total connections is much less than servers * per-route max connections. Since the load balancer selects servers in round-robin fashion most of the time, the per-route max connections only matter when a route goes bad, for example when a problem on a specific server causes a lot of read timeouts. This does not happen very often. But the total connection limit can be used as a way to shed load and put a cap on outbound sockets if the whole farm behaves badly, or if the client does not time out appropriately when waiting for a response. So the max total connections matters more than the per-route max connections.

The measure will be useful for the case where the server wants a shorter timeout than the idle timeout configured on the client side, but somehow the server does not close the connection. Then the client can close the connection on behalf of the server.
I did not completely understand this. Can you elaborate?
Agreed.
So that I understand this correctly: if we take an example of a server farm of 3 servers with a max per route of 5 and max total connections of 9, then no server can have more than 3 concurrent connections, assuming round-robin load balancing. In such a case, what is the use of max per route?

Just so that we are talking about the same thing: this keep-alive timeout is essentially the max idle time between two requests on a persistent connection. I am also assuming that all connection pools will have a max idle time for every connection.
For our internal usage, max per route is not really useful, given that there is a load balancer that uses a round-robin policy and will skip hosts with a high concurrent request count. So for our use case, we can potentially consolidate the RouteSpecificPool and GeneralPool into one type that only limits total connections, regardless of whether the connections are for one route or not. The per-route max connections will be useful if the pool is shared by RxClients that talk to heterogeneous servers without a load balancer. In that case, we want a total connection limit on the pool to cap the total outbound sockets created, and we do not want one or two bad servers to occupy the majority of the available connections in the pool, causing connection problems for others.
Not necessarily. I don't think the client has the obligation to close a channel at the exact moment the idle timeout has passed. We can take the following two approaches:

1. Check the idle timeout lazily, when a connection is requested, and close any idle connections that have exceeded it.
2. Run a clean-up task that periodically scans the pool and closes idle connections that have exceeded the timeout.

I think we can start with just the first approach. Under this approach, checking the idle timeout imposed from the server side does not create any overhead other than having to store the timeout value somewhere.
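A sketch of this first approach, assuming the pool records a hypothetical lastReturnTime attribute when a channel is returned and checks it lazily on the next request (the class and field names are illustrative, not the actual implementation):

```java
import io.netty.channel.Channel;
import io.netty.util.AttributeKey;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the "check on request" approach. LAST_RETURN_TIME is a hypothetical
// attribute that the pool sets when a channel is returned.
public class LazyIdleCheck {

    private static final AttributeKey<Long> LAST_RETURN_TIME =
            AttributeKey.valueOf("lastReturnTime");

    private final Queue<Channel> idleChannels = new ConcurrentLinkedQueue<>();

    /**
     * Poll idle channels lazily: anything that has been idle longer than the
     * effective timeout (the smaller of the client-side setting and any timeout
     * advertised by the server) is closed instead of being reused.
     */
    public Channel pollReusableChannel(long idleTimeoutMillis) {
        Channel candidate;
        while ((candidate = idleChannels.poll()) != null) {
            Long returnedAt = candidate.attr(LAST_RETURN_TIME).get();
            boolean expired = returnedAt != null
                    && System.currentTimeMillis() - returnedAt > idleTimeoutMillis;
            if (candidate.isActive() && !expired) {
                return candidate;      // still fresh, reuse it
            }
            candidate.close();         // expired or dead, discard and try the next one
        }
        return null;                   // caller falls back to creating a new channel
    }

    /** Record the return time so the next request can evaluate the idle timeout. */
    public void returnChannel(Channel channel) {
        channel.attr(LAST_RETURN_TIME).set(System.currentTimeMillis());
        idleChannels.offer(channel);
    }
}
```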
I see. One thing that is not clear to me: if I am defining the total connection count on a pool of auto-scaled servers, how will I determine the optimal number of concurrent connections? If I understand it correctly, it will depend on the number of servers, correct?

Sure, if we are never doing the second approach, then it does not hurt to honor this timeout too. However, if we say that sometime down the line we are going to have a "cleanup thread" to remove idle connections, we should see whether this will be optimal to support.
Yes. The max total connections should be changeable. The Apache connection pool also supports setting the max total connections after it is created. Ribbon currently uses this feature to optionally change the pool size when the load balancer senses that the server list size is bigger than the current pool size.
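A sketch of how a changeable max total connections limit could be modeled, assuming a simple counter-based limiter (ConnectionLimiter and its method names are hypothetical, not Ribbon's or Apache HttpClient's API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a resizable total-connection limiter. The pool consults it before
// opening a new channel; the load balancer can raise the limit when the server
// list grows.
public class ConnectionLimiter {

    private volatile int maxTotalConnections;
    private final AtomicInteger acquired = new AtomicInteger();

    public ConnectionLimiter(int maxTotalConnections) {
        this.maxTotalConnections = maxTotalConnections;
    }

    /** Called by the pool before creating a channel; false means the pool is full. */
    public boolean tryAcquire() {
        if (acquired.incrementAndGet() <= maxTotalConnections) {
            return true;
        }
        acquired.decrementAndGet();
        return false;
    }

    /** Called when a channel is closed and removed from the pool. */
    public void release() {
        acquired.decrementAndGet();
    }

    /** Allows the limit to be raised (or lowered) after the pool is created. */
    public void setMaxTotalConnections(int maxTotalConnections) {
        this.maxTotalConnections = maxTotalConnections;
    }
}
```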
The "cleanup thread" approach is optional, and will never replace the first approach. We have this enabled for RestClient, but it does not seem to be beneficial. Another problem with "cleanup thread" is that it will cause concurrency issues as it tries to access the pool, particularly all the idle channels, when other threads try to get an idle connection. It brings more complexity as the access pattern of the two are very different. So we should avoid it if we can. |
It looks to me that defining max connections per server suits the needs better than defining it for the whole farm (which is auto-scaled), as a user can determine that number more precisely than for the whole farm, which is variable and a function of the number of instances.

The connection pool already has to be thread safe, as it is always accessed concurrently, so I am not sure the cleanup thread adds to that complexity. If you are referring to iterating over all connections, any concurrent data structure will give you a thread-safe iterator. So I am not sure that is added complexity either.
I talked to @mhawthorne and he indicated that connections per server is not really used by Zuul. In practice, people usually set the connection pool size large enough to accommodate auto scaling. I would start with a pool implementation that supports max total connections and see if we need the other variation after we test it in Zuul.

My understanding is that clients do not have to enforce the idle timeout precisely. The client checks the idle timeout and skips and closes connections whose idle timeout has passed, as a way to defend itself against using a connection that might not be reusable.
Sorry for jumping into this so late.

connection pooling
The only reason we bound the connection pool size at all is to eagerly detect misconfigurations that would cause us to create an unlimited number of connections (aka a connection leak). Otherwise we wouldn't care. Configuring the connection limit per server doesn't feel right to me for a few reasons:

enforcing idle timeout in client
I think we definitely want cleanup of idle connections in the client, if only because the load balancer may shift away from certain servers, and the connections to those servers will sit idle and take up space. I don't think the idle timeout has to be particularly precise; if the timeout is 5 minutes and we kill the connection at 5 minutes and 30 seconds, that seems fine.

The implementation here can get tricky. I agree with @NiteshKant that a timeout thread per connection feels excessive, but a single cleanup thread also brings in some concurrency issues if you attempt to kill a connection that is in use, at which point you wouldn't want to kill it anymore. Maybe the cleanup thread can submit a callable/task that fires the next time a particular idle connection becomes idle again, which could be immediately?
The current design for idle connection clean-up is that it piggybacks on the thread that requests a connection. This is based on the following assumptions:

The other factor that will help reduce idle connections is that servers can close idle connections, and the RxNetty client will automatically handle this event without needing a separate thread in the pool.
Talking to a few folks, it seems that having a thread to periodically clean up idle connections helps the server reduce its concurrent connections and release resources, so that it performs better when there is a surge of requests. I plan to add a periodic idle connection clean-up task. The only concurrency issue is that the clean-up task and the connection request thread may compete for an idle connection at the same time. Since we are using a ConcurrentLinkedQueue for the idle connections and both operations just poll from the queue, this should be thread safe.
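A sketch of what such a periodic clean-up task could look like under these assumptions; the class name, attribute name, and scheduler choice are all hypothetical, and both this task and the request path only poll()/offer() on the ConcurrentLinkedQueue, so no extra locking is needed:

```java
import io.netty.channel.Channel;
import io.netty.util.AttributeKey;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a periodic idle-connection cleaner. LAST_RETURN_TIME is a
// hypothetical attribute set when a channel is returned to the pool.
public class IdleConnectionCleaner {

    private static final AttributeKey<Long> LAST_RETURN_TIME =
            AttributeKey.valueOf("lastReturnTime");

    private final Queue<Channel> idleChannels = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(long cleanupIntervalMillis, long idleTimeoutMillis) {
        scheduler.scheduleWithFixedDelay(() -> cleanup(idleTimeoutMillis),
                cleanupIntervalMillis, cleanupIntervalMillis, TimeUnit.MILLISECONDS);
    }

    private void cleanup(long idleTimeoutMillis) {
        int size = idleChannels.size();          // snapshot so each channel is visited roughly once
        for (int i = 0; i < size; i++) {
            Channel channel = idleChannels.poll();
            if (channel == null) {
                break;                           // request threads drained the queue first
            }
            if (!channel.isActive() || isExpired(channel, idleTimeoutMillis)) {
                channel.close();                 // expired or dead, drop it
            } else {
                idleChannels.offer(channel);     // still usable, put it back for reuse
            }
        }
    }

    private boolean isExpired(Channel channel, long idleTimeoutMillis) {
        Long returnedAt = channel.attr(LAST_RETURN_TIME).get();
        return returnedAt != null
                && System.currentTimeMillis() - returnedAt > idleTimeoutMillis;
    }
}
```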
Sorry for the delayed reply. @mhawthorne I agree with your point that having a limit on the number of connections interferes with the routing logic for more potent servers if the number is too pessimistic/restrictive. However, the argument is the same for putting a number on the total connections, if it is set to too restrictive a value. The argument here is more from the point of view of connection pool implementations: whether they should expose parameters like max connections per route and max total connections. I think there are use cases for both of them, and there should be a way to set both to an arbitrarily high number, signifying no limit.
Implemented with pull request #95