
Implement connection pooling #20

Closed
NiteshKant opened this issue Jan 28, 2014 · 16 comments

@NiteshKant
Member

RxClient should be able to support connection pooling. We can use this issue to discuss the requirements.

@allenxwang

Here are the high level requirements I would like to propose:

Types of pool

Two types of pool should be included:

RouteSpecificPool

This pool contains channels that are connected to a specific host/port. The pool capacity is reached when the total number of channels, counting both checked-out (in-use) channels and idle channels, reaches a configurable number. Once at capacity, no new channels are allowed to be created.

An instance of this pool can be associated with an instance of RxClient and use its Bootstrap to create new channels when necessary.
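
For illustration, here is a minimal sketch of what such a route-scoped pool could look like. The class shape, names, and counting logic are assumptions of this sketch, not the final implementation:

```java
// Hypothetical sketch of the proposed RouteSpecificPool; names and counting
// logic are illustrative, not the RxNetty implementation.
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class RouteSpecificPool {
    private final String host;
    private final int port;
    private final int maxChannels;     // configurable per-route capacity
    private final Bootstrap bootstrap; // shared with the owning RxClient
    private final AtomicInteger totalChannels = new AtomicInteger(); // in use + idle
    private final Queue<ChannelFuture> idleChannels =
            new ConcurrentLinkedQueue<ChannelFuture>();

    public RouteSpecificPool(String host, int port, int maxChannels, Bootstrap bootstrap) {
        this.host = host;
        this.port = port;
        this.maxChannels = maxChannels;
        this.bootstrap = bootstrap;
    }

    /** Idle channel if available, a new one if under capacity, otherwise null. */
    public ChannelFuture acquire() {
        ChannelFuture idle = idleChannels.poll();
        if (idle != null) {
            return idle; // already realized; the channel was connected earlier
        }
        // Reserve a slot before connecting so concurrent acquires cannot overshoot.
        if (totalChannels.incrementAndGet() > maxChannels) {
            totalChannels.decrementAndGet();
            return null; // capacity reached: checked-out + idle == maxChannels
        }
        return bootstrap.connect(host, port);
    }

    int totalChannels() { return totalChannels.get(); }

    boolean isAtCapacity() { return totalChannels.get() >= maxChannels; }
}
```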

GeneralPool

This pool contains one or more of the route-specific pools. The pool capacity is reached for a route if

  • the total number of channels (idle and in use) across all the underlying RouteSpecificPools has reached a configurable number, or
  • the capacity of the RouteSpecificPool that corresponds to the route has been reached.

In this case, no new channels for that route can be created.
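
A minimal sketch of this capacity rule, assuming the RouteSpecificPool sketch above with its totalChannels() and isAtCapacity() accessors (all names here are placeholders):

```java
// Illustrative sketch of the proposed GeneralPool capacity rule; names are
// assumptions, not the RxNetty implementation.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GeneralPool {
    private final int maxTotalChannels; // limit across all routes
    private final Map<String, RouteSpecificPool> routePools =
            new ConcurrentHashMap<String, RouteSpecificPool>();

    public GeneralPool(int maxTotalChannels) {
        this.maxTotalChannels = maxTotalChannels;
    }

    /** A new channel for a route is allowed only if neither limit is reached. */
    boolean canCreateChannel(RouteSpecificPool routePool) {
        int total = 0;
        for (RouteSpecificPool pool : routePools.values()) {
            total += pool.totalChannels(); // idle + in use, per route
        }
        return total < maxTotalChannels && !routePool.isAtCapacity();
    }
}
```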

Channel life cycle management

The following are the possible states of a channel:

  • Created: A channel for a route may be created if the pool has not reached capacity for that route but all existing channels are currently in use. A channel in this state automatically transitions to the "In Use" state.
  • In Use: The channel is used by a client.
  • Idle: The channel is returned by a client and is ready to be checked out again.
  • Deleted: The channel is no longer reusable and is closed and permanently removed from the pool.
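
For reference, the four states above as a simple enum sketch (the name is illustrative):

```java
// The four proposed life-cycle states, as a simple enum sketch.
public enum PooledChannelState {
    CREATED,  // just connected; transitions immediately to IN_USE
    IN_USE,   // checked out by a client
    IDLE,     // returned to the pool, ready to be checked out again
    DELETED   // closed and permanently removed from the pool
}
```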

Additional internal state required for channels in the channel pool

  • ChannelFuture: this is the object obtained from Bootstrap.connect() when the channel is first created. The same object is returned whenever the client requests a channel.
  • idle: indicates whether the channel is currently in use. (The implementation may also choose not to keep this flag and instead keep a separate queue for idle channels.)
  • Timestamp of when channel starts being idle
  • Timestamp of when channel is in use
  • Idle timeout: this can be a configured maximum idle timeout, or come from the last HTTP response header "Keep-Alive: timeout=..."
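
This bookkeeping could live in a small per-channel holder along these lines (names are illustrative assumptions):

```java
// Hypothetical per-channel bookkeeping matching the fields listed above.
import io.netty.channel.ChannelFuture;

public class PooledChannelAttributes {
    final ChannelFuture connectFuture; // from Bootstrap.connect(); returned on every request
    volatile boolean idle;             // or keep a separate queue of idle channels instead
    volatile long idleSinceMillis;     // when the channel last became idle
    volatile long inUseSinceMillis;    // when the channel was last checked out
    volatile long idleTimeoutMillis;   // configured max, or from "Keep-Alive: timeout=..."

    PooledChannelAttributes(ChannelFuture connectFuture, long idleTimeoutMillis) {
        this.connectFuture = connectFuture;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }
}
```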

Pool operations

Request a channel

The pool checks whether there is an idle channel. If so, it returns the (already realized) ChannelFuture associated with that channel. Otherwise, if it still has capacity, it uses the client's Bootstrap to create a new channel and returns the ChannelFuture from Bootstrap.connect(). If the pool has reached capacity for the route and all channels are in use, a ChannelFuture filled with a failure cause is returned.

Return a channel

Once the client has determined that it is done with the channel, it can return the channel to the pool. The pool will clear the channel's pipeline and attachments and make sure it is reusable by any client. If it is not, the channel will be deleted.
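
Continuing the RouteSpecificPool sketch above, the return operation might look like this; the handler name is a hypothetical example (Netty offers no API to enumerate attributes, so known handlers/keys must be cleared explicitly):

```java
// Continuation of the RouteSpecificPool sketch; names are illustrative.
public void release(ChannelFuture cf) {
    Channel channel = cf.channel();
    if (!channel.isActive()) {
        channel.close();                  // "Deleted": no longer reusable
        totalChannels.decrementAndGet();
        return;
    }
    // Remove any handler added during the request; "per-request-handler" is a
    // hypothetical name used only for this sketch.
    if (channel.pipeline().get("per-request-handler") != null) {
        channel.pipeline().remove("per-request-handler");
    }
    idleChannels.offer(cf);               // "Idle": ready to be checked out again
}
```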

Clean up of channels

The pool can periodically delete idle channels that have exceeded their idle timeout. It can also do so during a request/return operation if it finds that a channel is no longer reusable.

Metrics

The pool should provide APIs to get the total capacity, the number of channels in use, the number of idle channels, and statistics that reflect pool reuse.
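
A hypothetical metrics surface matching this requirement could look like the following (all names are assumptions):

```java
// Hypothetical metrics API for the pool, per the requirement above.
public interface PoolMetrics {
    int maxTotal();         // configured capacity
    int inUseCount();       // channels currently checked out
    int idleCount();        // channels waiting for reuse
    long reuseCount();      // acquires satisfied by an idle channel
    long newChannelCount(); // acquires that had to create a new channel
}
```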

@NiteshKant
Member Author

@allenxwang what would be the reason for someone to specify the total connection limit for a general pool?

For the idle timeout, since the server will close the connection after the timeout, we would not have to keep track of that timeout, right?

If the pool has reached capacity for the route and all channels are in use, a ChannelFuture filled with a failure cause is returned.

In order to create a ChannelFuture you would need a valid channel, so in the case where you cannot create a new channel, you cannot return a ChannelFuture.
Actually, so that we do not mix two asynchronous paradigms, it is better to use Observable instead of Future in the connection pool API. So, the return type of getConnection can be Observable.
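
Continuing the earlier pool sketch, a hypothetical Observable-based acquire (RxJava 1.x style) could look like this; note how the capacity-exhausted case becomes an error Observable rather than a failed ChannelFuture:

```java
// Sketch only; the pool internals (idleChannels, totalChannels, maxChannels,
// bootstrap, host, port) come from the earlier hypothetical sketches.
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import rx.Observable;
import rx.Subscriber;

public Observable<Channel> getConnection() {
    ChannelFuture idle = idleChannels.poll();
    if (idle != null) {
        return Observable.just(idle.channel()); // already connected, reuse it
    }
    if (totalChannels.incrementAndGet() > maxChannels) {
        totalChannels.decrementAndGet();
        // No channel exists here, so a failed ChannelFuture cannot be built;
        // an Observable can carry the failure directly.
        return Observable.error(new IllegalStateException("pool exhausted"));
    }
    return Observable.create(new Observable.OnSubscribe<Channel>() {
        @Override
        public void call(final Subscriber<? super Channel> subscriber) {
            bootstrap.connect(host, port).addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture future) {
                    if (future.isSuccess()) {
                        subscriber.onNext(future.channel());
                        subscriber.onCompleted();
                    } else {
                        subscriber.onError(future.cause());
                    }
                }
            });
        }
    });
}
```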

The pool will clear the channel's pipeline and attachments and make sure it is reusable by any client. If it is not, the channel will be deleted.

When you say clear the pipeline, does that mean removing any changes made to the pipeline during execution? If so, wouldn't we have to track what changes were made to the pipeline during execution?
What does channel deletion mean in this context?

@allenxwang

The total connection limit for the general pool applies to a client (like the one in Ribbon) that talks to a server farm consisting of homogeneous servers. It stands for the maximum number of concurrent connections allowed from one client host to the server farm. It also helps to limit the impact of resource/connection leaks.

The HTTP specification states that both the client and server may choose to timeout the connection:

"When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. "

So I think it is prudent for the client side to close the channel if it is not closed by the server after the specified timeout. The Apache HttpClient connection pool does the same thing.

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html
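
For illustration, a sketch of extracting the timeout from a header such as "Keep-Alive: timeout=5, max=100", falling back to the configured default; this parsing is an assumption of the sketch, not an RxNetty API:

```java
// Hypothetical parse of the non-standard Keep-Alive header's timeout parameter.
static long keepAliveTimeoutMillis(String keepAliveHeader, long defaultMillis) {
    if (keepAliveHeader == null) {
        return defaultMillis; // no header: use the pool's configured timeout
    }
    for (String part : keepAliveHeader.split(",")) {
        String[] kv = part.trim().split("=");
        if (kv.length == 2 && "timeout".equalsIgnoreCase(kv[0].trim())) {
            try {
                return Long.parseLong(kv[1].trim()) * 1000L; // header value is in seconds
            } catch (NumberFormatException e) {
                return defaultMillis; // malformed value: fall back
            }
        }
    }
    return defaultMillis; // no timeout parameter present
}
```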

I think using the Rx API for the getConnection operation might be a good idea. I will give it a try.

The pool should remove all channel handlers after a channel is returned to the pool. Ideally, the same channel should be reusable by another RxClient, as if the channel were just created, without any prior state. However, I realized that it might be difficult to remove all the channel attributes, as I did not find an API to get all the attributes. Any suggestions?

When the pool finds that a channel is not reusable, for example no longer active, it should close the channel and remove it from the pool. The channel should then no longer be accessible and can be GCed.

@NiteshKant
Member Author

The total connection limit for the general pool applies to a client (like the one in Ribbon) that talks to a server farm consisting of homogeneous servers. It stands for the maximum number of concurrent connections allowed from one client host to the server farm. It also helps to limit the impact of resource/connection leaks.

Generally, if someone has to configure this number, would it be at least: number of servers * max connections per route?

So I think it is prudent for the client side to close the channel if it is not closed by the server after the specified timeout. The Apache HttpClient connection pool does the same thing.

The quote from the spec does not talk about the keep-alive timeout (the keep-alive timeout attribute is not a standard); it talks about general client and server disconnections. We should not base our design entirely on Apache HttpClient, as some of its design decisions come from the fact that its I/O is blocking, so the connection state can only be asserted correctly when a blocking operation is issued on the connection. In the async world, Netty gives a callback as soon as a channel is disconnected, so we do not need this kind of safety net.
Varying timeouts based on server keep-alive headers will lead to a varying number of timer threads per connection, which seems unnecessary to me. The worst that can happen in this scenario is that the connection expires only after the idle timeout configured for the pool.

the same channel should be reusable by another RxClient, as if the channel were just created, without any prior state

For HTTP, outside RxNetty code there isn't an easy way to manipulate the pipeline and add attributes. So, unless someone is working inside the pipeline, the code would not be able to change the state of the pipeline/attributes. I guess, this can be handled better by a best practice of "cleaning any dynamic state at the end of execution" as compared to creating a framework to stop people from doing this. Down the line, if we see that there are a lot of pitfalls in this approach, maybe we can create such a framework. At this point, it looks to me like an edge case. Thoughts?

@allenxwang

Generally, if someone has to configure this number, would it be at least: number of servers * max connections per route?

No, in all production cases (Zuul, for example) the max total connections is much less than servers * per-route max connections. Since the load balancer selects servers in round-robin fashion most of the time, the per-route max connections only matters when a route goes bad, for example, a problem on a specific server that causes a lot of read timeouts. This does not happen very often. But the total connection limit can be used as a way to shed load and put a cap on outbound sockets if the whole farm behaves badly, or if the client does not time out appropriately when waiting for responses. So the max total connections matters more than the per-route max connections.

In the async world, Netty gives a callback as soon as a channel is disconnected, so we do not need this kind of safety net.

The measure is useful for the case where the server wants a shorter timeout than the idle timeout configured on the client side, but somehow does not close the connection. The client can then close the connection on behalf of the server.

Varying timeouts based on server keep-alive headers will lead to a varying number of timer threads per connection, which seems unnecessary to me. The worst that can happen in this scenario is that the connection expires only after the idle timeout configured for the pool.

I did not completely understand this. Can you elaborate?

I guess, this can be handled better by a best practice of "cleaning any dynamic state at the end of execution" as compared to creating a framework to stop people from doing this.

Agreed.

@NiteshKant
Member Author

So the max total connections matters more than the per-route max connections.

Just so I understand this correctly: take the example of a server farm of 3 servers with a max per route of 5 and a total max of 9 connections. This means that no server can have more than 3 concurrent connections, assuming round-robin load balancing. In such a case, what is the use of max per route?
OTOH, if for some reason connection counts are skewed towards one server (e.g., due to high response times), then the other servers (which are behaving well) get penalized by this total connection count.
I am still not sure I understand your reasoning here.

The client can then close the connection on behalf of the server.

Just so that we are talking about the same thing: this keep-alive timeout is essentially the max idle time between two requests on a persistent connection. I am also assuming that all connection pools will have a max idle time for every connection.
Now, my argument is: what is the benefit of the client trying to keep up with this server-defined timeout, which can change per connection, in the event that the server does not honor this timeout?
If the server does not disconnect after such a timeout, it will still serve any request sent on that connection. There is no contract as such that the client must not use this connection. If the server does disconnect, the client will automatically learn of the disconnection and discard the connection.
OTOH, what is the cost of honoring such a timeout? In order to honor such a timeout, one would have to have a timer thread (one per such connection) to check for this timeout. This could soon explode the number of threads (even with a thread pool). Weighing the positive against the negative, this does not seem to me to add value.

@allenxwang

In such a case, what is the use of max per route?

For our internal usage, max per route is not really useful, given that there is a load balancer that uses a round-robin policy and will skip hosts with a high concurrent request count. So for our use case, we could potentially consolidate the RouteSpecificPool and GeneralPool into one type that only limits total connections, regardless of whether the connections are for one route or not.

The per-route max connections will be useful if the pool is shared by RxClients that talk to heterogeneous servers without a load balancer. In this case, we want a total connection limit on the pool to cap the total outbound sockets created, and we do not want one or two bad servers to occupy the majority of the available connections in the pool, causing connection problems for the others.

In order to honor such a timeout, one would have to have a timer thread (one per such connection) to check for this timeout.

Not necessarily. I don't think the client is obligated to close a channel at the exact moment the idle timeout passes. We can take either of the following two approaches:

  • When a request for a channel comes in and an idle channel is picked, we check whether the idle timeout has passed for that channel. If so, we close it and pick the next idle one.
  • Schedule a task at a certain interval to go over all idle channels and close those whose idle timeout has passed.

I think we can start with just the first approach, sketched below. Under this approach, checking the idle timeout imposed by the server creates no overhead beyond storing the timeout value somewhere.
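
Continuing the earlier sketches, the first approach could look roughly like this; attributesOf() is a hypothetical lookup of the per-channel state:

```java
// Sketch of approach 1: validate idle channels at acquire time, skipping and
// closing expired ones. Continues the hypothetical RouteSpecificPool sketch.
ChannelFuture pollUsableIdleChannel() {
    ChannelFuture cf;
    while ((cf = idleChannels.poll()) != null) {
        PooledChannelAttributes state = attributesOf(cf.channel()); // hypothetical
        long idleFor = System.currentTimeMillis() - state.idleSinceMillis;
        if (cf.channel().isActive() && idleFor <= state.idleTimeoutMillis) {
            return cf;                       // still fresh: hand it out
        }
        cf.channel().close();                // expired or dead: delete it
        totalChannels.decrementAndGet();
    }
    return null;                             // caller may create a new channel
}
```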

@NiteshKant
Member Author

For our internal usage, max per route is not really useful, given that there is a load balancer that uses a round-robin policy and will skip hosts with a high concurrent request count.

I see. One thing that is not clear to me: if I am defining the total connection count for a pool of auto-scaled servers, how will I determine the optimal number of concurrent connections? If I understand it correctly, it will depend on the number of servers, correct?

I think we can start with just the first approach.

Sure, if we are never doing the second approach, then it does not hurt to honor this timeout too. However, if we say that sometime down the line we are going to have a "cleanup thread" to remove idle connections, we should check whether this remains practical to support.
The benefit of the background thread is that, with a bunch of idle connections, request processing is not penalized; without a background thread we would spend more time on the request path finding a valid connection.

@allenxwang

If I understand it correctly, it will depend on the number of servers, correct?

Yes. The max total connections should be changeable. The Apache connection pool also supports setting the max total connections after it is created. Ribbon currently uses this feature to optionally change the pool size when the load balancer senses that the server list is bigger than the current pool size.

Sure, if we are never doing the second approach, then it does not hurt to honor this timeout too.

The "cleanup thread" approach is optional, and will never replace the first approach. We have this enabled for RestClient, but it does not seem to be beneficial.

Another problem with the "cleanup thread" is that it can cause concurrency issues as it accesses the pool, particularly all the idle channels, while other threads try to get an idle connection. It brings more complexity, as the access patterns of the two are very different. So we should avoid it if we can.

@NiteshKant
Member Author

Ribbon currently uses this feature to optionally change the pool size when the load balancer senses that the server list is bigger than the current pool size.

It looks to me like defining max connections per server suits the need better than defining it for the whole (auto-scaled) farm, as a user can determine that number more precisely than for the whole farm, which is variable and a function of the number of instances.
Yes, maybe we can have heuristics to optimize the max connections for the farm at runtime, but that seems necessary only because we are asking the user to determine a number that is inherently difficult to assert.

Another problem with the "cleanup thread" is that it can cause concurrency issues

The connection pool is inherently thread-safe, as it is always accessed concurrently, so I am not sure the cleanup thread adds to that complexity. If you are referring to iterating over all connections, any concurrent data structure will give you a thread-safe iterator. So I am not sure that is added complexity either.
Having said that, I do agree that we can implement this later. The point was that if we ever think of implementing this cleanup process, having a timeout value per connection cannot be honored accurately without the overhead of every connection having its own timeout thread. OTOH, if we do not have a thread per connection, the cleanup thread will run at the pool's timeout interval and a per-connection timeout will not mean much.

@allenxwang

It looks to me like defining max connections per server suits the need better than defining it for the whole (auto-scaled) farm, as a user can determine that number more precisely than for the whole farm, which is variable and a function of the number of instances.

I talked to @mhawthorne and he indicated that connections per server is not really used by Zuul. In practice, people usually set the connection pool size large enough to accommodate auto scaling.

I would start with pool implementations that support max total connections and see if we need the other variation after we test it in Zuul.

The point was that if we ever think of implementing this cleanup process, having a timeout value per connection cannot be honored accurately without the overhead of every connection having its own timeout thread.

My understanding is that the client does not have to enforce the idle timeout precisely. It checks the idle timeout, and skips and closes connections whose idle timeout has passed, as a way to defend itself from using a connection that might not be reusable.

@mhawthorne

sorry for jumping into this so late.

connection pooling

the only reason we bound the connection pool size at all is to eagerly detect misconfigurations that would cause us to create an unlimited number of connections (aka connection leak). otherwise we wouldn't care. configuring the connection limit per-server doesn't feel right to me for a few reasons:

  1. we will undoubtedly be experimenting with heterogeneous clusters moving forward, whether changing instance type, NIO vs. BIO, etc. I'd rather let the faster servers absorb as much traffic as possible without artificial constraints, and if it causes problems we will add some protective routing logic in Zuul.

  2. at some point I think our load balancing is going to advance past pure round-robin. the thought here is to let the load balancer do the hard work of choosing the target servers and prevent connection pool management from interfering with the decision.

enforcing idle timeout in client

I think we definitely want cleanup of idle connections in the client, if only because the load balancer may shift away from certain servers, and the connections to those servers will sit idle and take up space. I don't think the idle timeout has to be particularly precise - if the timeout is 5 minutes and we kill the connection at 5 minutes and 30 seconds, that seems fine. the implementation here can get tricky. I agree with @NiteshKant that a timeout thread per connection feels excessive, but a single cleanup thread also brings in some concurrency issues if you attempt to kill a connection that is in use, at which point you wouldn't want to kill it anymore. Maybe the cleanup thread can submit a callable/task that fires the next time a particular idle connection becomes idle again, which could be immediately?

@allenxwang

The current design for idle connection cleanup is that it piggybacks on the thread that requests a connection. This is based on the following assumptions:

  • All idle connections are on a single queue and are easy to find and traverse
  • Calculating whether a connection has passed its idle time is as trivial as computing x + y > z
  • Closing an idle connection is done asynchronously and is trivial

The other factor that helps reduce idle connections is that servers can close idle connections, and the RxNetty client will automatically handle this event without needing a separate thread in the pool.

@allenxwang

Talking to a few folks, it seems that having a thread that periodically cleans up idle connections helps the server reduce its concurrent connections and release resources, so that it performs better when there is a surge of requests.

I plan to add the periodic idle connection cleanup task; a rough sketch follows. The only concurrency issue is that the cleanup task and the connection request thread may compete for an idle connection at the same time. Since we are using a ConcurrentLinkedQueue for the idle connections and both operations just poll from the queue, this should be thread-safe.
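
Continuing the earlier hypothetical sketches, such a task could look like this; both it and the acquire path only poll() from the ConcurrentLinkedQueue, so a channel is never handed to two owners:

```java
// Sketch of the periodic cleanup task; pool internals come from the earlier
// hypothetical sketches, and attributesOf() is a hypothetical lookup.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

private final ScheduledExecutorService cleaner =
        Executors.newSingleThreadScheduledExecutor();

void scheduleIdleCleanup(long intervalSeconds) {
    cleaner.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
            int toInspect = idleChannels.size();   // snapshot; O(n), fine for a sketch
            for (int i = 0; i < toInspect; i++) {
                ChannelFuture cf = idleChannels.poll();
                if (cf == null) {
                    break;                         // a request thread drained the queue
                }
                PooledChannelAttributes state = attributesOf(cf.channel());
                long idleFor = System.currentTimeMillis() - state.idleSinceMillis;
                if (!cf.channel().isActive() || idleFor > state.idleTimeoutMillis) {
                    cf.channel().close();          // expired or dead: delete it
                    totalChannels.decrementAndGet();
                } else {
                    idleChannels.offer(cf);        // still fresh: put it back
                }
            }
        }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
}
```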

@NiteshKant
Member Author

Sorry for the delayed reply.

@mhawthorne I agree with your point that a limit on the number of connections interferes with routing logic for more capable servers if the number is too pessimistic/restrictive. However, the same argument applies to putting a number on total connections, if it is set to too restrictive a value.
In our architecture, throttling of concurrent requests belongs in the Hystrix layer, and the connection count would always be set to a much higher number than the allowed Hystrix concurrent request count. The connection pool max, whether per server/route or total, would never be reached unless there is a leak.

The argument here is more from the point of view of connection pool implementations: whether they should expose parameters like max connections per route and max total connections. I think there are use cases for both, and there should be a way to set either to an arbitrarily high number signifying no limit.

@NiteshKant
Member Author

Implemented with pull request #95
