New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rails 4.0.0 and 4.0.1 PostgreSQLAdapter does not properly check whether a connection is active #12867
Comments
Hi @tenderlove, Your change to |
Reaping happens in a background thread. We can't send any data over the socket because there are no locks around the file descriptor (that I know of). Aaron Patterson
|
Awesome bug report @aetherknight. I'm also experiencing a variant of this problem in my application - the |
Here's the related error which can occur when calling
|
So I think there may be a few possible approaches to fixing this:
|
For the later approach, |
Yep. The former definitely makes more sense to me. |
@aetherknight I agree, we should go with solution #1. Seems the mysql adapter has a setting to automatically do this, but the pg adapter does not. We'll have to rescue exceptions wherever we call methods on the connection, then call PGConn#reset. Though we should probably do a loop and PGConn#reset_set + PGConn#reset_poll so we don't block the process. |
Will another stacktrace help out here? |
Same issue, using sidekiq but not sure that is the main cause of problems. |
Is there a temporarily fix for this issue? Stacktrace: |
@rubytastic You could determine why your db connections are dying so quickly. I don't have any pgsql experience but I know MySQL has a timeout setting where the server will kill any clients that haven't used the socket in that time period. |
For the short term, we monkeypatched The PostgreSQLAdapter to use the older On Wed, Dec 4, 2013 at 1:37 PM, Mike Perham notifications@github.comwrote:
|
@aetherknight Unfortunately that's not an option for Sidekiq users. |
@mperham Ah, good point. |
@aetherknight So switching back to mysql, for now would be the best option to go or is there a monkey patch that could be implemented to fix the problem temporarily |
@tenderlove by my reading, solution 2 is necessary. If you can't tell whether a connection is #active?, you either never reap it, or you regularly reap (stale) still-active connections -- and so the server could end up with many more connections than your pool limit. But then, maybe I'm just missing something. Because as far as I can see, neither of the mysql adapters are any safer from a reaper in a different thread than the SELECTive postgres version. |
@matthewd I am beginning to think that both solutions are ultimately necessary:
Also, if I understand the current reaper code correctly, it will only reap a connection if it is currently in use, it has been longer than dead_connection_timeout since it was checked out, and |
All this stems from the fact that the connection is not self-repairing, yes? Dalli and Redis client connections are pooled with my connection_pool gem all the time. connection_pool documents that all connections should be self-repairing so that the pool doesn't need to worry about reaping or repairing.
|
@mperham not just self repairing. If a thread checks out a connection, then dies before checking the connection back in, what happens? This is why the reaper checks on "long lived" connections. |
@tenderlove If a thread somehow does not check in connections, they'd quickly exhaust. What's wrong with that? |
@mperham yes, that will leak a connection if done inside a Thread.new { } (and you don't specifically check it back in) |
Yep, that form of API should be deprecated in preference for |
@aetherknight Yeah, I could definitely get behind an idea of arranging automatic reconnection iff you're performing a SELECT (or BEGIN), and there isn't [already] supposed to be a transaction open. If you're trying to send an INSERT, say, you'd want to be very sure the exception meant "driver failed to send the query", and not "driver lost connection while waiting to hear what happened", before silently retrying. |
@tenderlove can/does the pool keep track of which thread checked out the connection? We can determine pretty easily that the thread's gone away... though whether you trust that it's safe to return the connection to the pool, vs just tell it to close, is another matter. Or is there a worry about code that goes out of its way to check out a connection in one thread, hand it to another, then have the first thread die? |
@matthewd The pool does keep track of which thread checks out the connection, but you shouldn't rely on it (it uses the object id of the thread). We used to loop through every thread in the system, check it's status, then check the pool, but I'd prefer a solution that doesn't depend on doing that. We could probably stuff the thread in a hash or something when it checks out, then just have the reaper check connections for those threads. We shouldn't worry about one thread passing the connection to another. I don't think we support that (I think it's documented too). Also, we should probably split this in to two issues. One for self-repairing connections, and one for dealing with long lived and possibly dead threads. I tried adding code for self-repairing connections, but TBH I can't repro this issue and I don't want to rescue from just any old PG exception. |
@tenderlove I did a thing: matthewd/rails@master...pg_verify Does that look at all sane? As far as I can see, those two tests both need to pass for the reaper and checkout_and_verify to both function correctly... and no less-invasive solution comes to mind. |
It looks sane, but I think you can make the locks more granular, and possibly remove some locks. For example, the locks should probably be inside the "log" calls. The connection shouldn't be shared among threads (except the reaper). Does the reaper impact the statement cache? I'm not sure we need that lock. Shouldn't there be a lock in "active?" Looks good! Aaron Patterson
|
@matthewd ConnectionTimeoutError. I am getting this errors on a external script using active_record. https://gist.github.com/flujan/9621099 I installed the master rails from github but don't how to require the master "active_record" on it. |
We get this error using mysql on |
Like above folks it still seems broken. This problem started when upgrading to rails 3 to rails 4 after that it just has timed out for me totally and I have reverted back to PHP for several projects. Please anyone track this down I really wan't my rails development env back! Is there anything we can do beside posting or configs and stack traces to make this problem fixed sooner>? |
Interestingly this morning I was having asset issues. I ran my app in production mode locally and discovered that because my error controller wasn't handling requests for non html format those assets would return a 500 instead of a 404. After refreshing the page 4 or 5 times while tailing my log, the page would stop loading and good old This is on Rails 4.0.3 running on puma 2.8.0. Don't know if this info will be useful to anyone. |
@pawel2105 Rails 4.0.4.rc1 Tried using Thin but couldn't validate if the problem persist since I have rails 4 server side events in my project which is hard to remove from it. Perhaps its an idea to testdrive if the problem would persist if using a non puma server. |
I just take the tip from @matthewd and put a larger checkout on my external script. Put a :checkout_timeout of 60 seconds... The error happens with less frequency but still occurs. |
@flujan
|
@rubytastic since we do not have knowledge enough to fix it and the issue is closed we will need to wait other bugs to happen and hope that the fix of one of them, fixes this one too. With ruby > 2.0 the use of threads will improve and this kind of bug will surely happen more. |
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection.
any update on possible solutions? |
@blklane this thread ended up discussing two very different issues... which one are you experiencing, and on which Rails version? |
@matthewd 4.0.1 and ruby 2.0 with the following error
|
@blklane 4.1 revised the connection pool handling to address some sources of starvation (around thread death), so you could try that... but the most common cause I've seen has been people genuinely starving the pool: running more DB-touching threads than they have connections in their pool. And there's not currently much we can do about that. 😕 |
@blklane Rails 4.0.4+ made a change to the Postgresql connection adapter to allow the normal request-response cycle to correctly detect when a connection is no longer active again, which should help the ConnectionPool self-heal after a DB connection dies, which might help you. However, there are still some other DB connection issues 4.0.x that could lead to the ConnectionPool exhausting its pool of possible connections. Part of a gist I posted further up attempts to address a scenario where if an error occurs during checkout, the ConnectionPool cleans up the connection that failed to be fully checked out: https://gist.github.com/aetherknight/3d0af086a92bac8f3ee9#file-active_record_patches-rb-L76 I haven't looked into Rails 4.1's connection management yet. |
I believe this is related to this issue: I run into pool starvation when I perform any database operations while handling errors (404 not found) via config.exceptions_app. A refresh of a 404 page 4 or 5 times seems to starve my connection pool. I workaround it by removing DB operations from the error pages. Posting here in case anyone else sees this. Let me know if anyone wants more information on my setup for this. |
I still have this issue with the latest rails, ruby, postgres, all stable versions. |
Hi, My error message : INFO -- : Started POST "/my_ctrl/import" for X.X.X.X at 2014-06-06 15:38:52 +0200 With this infrastructure (also tested with rb 1.9.3 and rails 4.0.4) : @rubytastic : Your postgresql is on the same server of rails app ? Ps : I'm using Apartment. (gem for multi-tenant App). |
[Note : I work with @rivsc] I added these lines at the end of /etc/sysctl.conf on machine pgpool2 (cf. the infrastructure schema from @rivsc comment above)
And I pushed config changes to procfs to avoid a reboot : According to @rivsc, it solved the problem. On my opinion, that tends to indicate that ActiveRecord and/or the PG Gem don't close connections correctly. |
I'm locking this issue, because it's in danger of becoming a catch-all for every possible issue involving database connections and PostgreSQL.
If you're experiencing either of the above problems on a version I've claimed it should be fixed, or some other problem not described above, please open a new issue, so it can be properly tracked and dealt with. If you're having an issue with connection management and you're using a gem that deliberately overrides ActiveRecord's connection management, please attempt to reproduce the problem without that gem first. |
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection. Conflicts: activerecord/CHANGELOG.md
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection. Conflicts: activerecord/CHANGELOG.md activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection. Conflicts: activerecord/CHANGELOG.md
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection. Conflicts: activerecord/CHANGELOG.md activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb
.. not a general timeout. Now, if a thread checks out a connection then dies, we can immediately recover that connection and re-use it. This should alleviate the pool exhaustion discussed in rails#12867. More importantly, it entirely avoids the potential issues of the reaper attempting to check whether connections are still active: as long as the owning thread is alive, the connection is its business alone. As a no-op reap is now trivial (only entails checking a thread status per connection), we can also perform one in-line any time we decide to sleep for a connection. Conflicts: activerecord/CHANGELOG.md activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb Conflicts: activerecord/CHANGELOG.md
TL;DR: Rails 4's PostgreSQLAdapter does not properly check whether a connection has died or not because
connect_poll
does not detect if the underlying TCP socket has closed.Rails 4.0.0 and 4.0.1 do not properly detect whether a PostgreSQL connection still alive or not before using it. A previously good connection that has been disconnected by the remote server or by an intermediate proxy will not be detected by rails, which will result in queries failing. An example error that we ran into:
This error began occurring when we began using Rails 4.0.0 and 4.0.1 instead of Rails 3.2.15. Our database connections were being terminated by an HAProxy in our setup that terminates idle TCP connections, which Rails 3.2 had no issue with --- it would detect the dead connection and establish a new connection without issue. Rails 4 on the other hand does not. A subsequent request cycle caused a new connection to be created, which would then work as long as that connection was not killed as well.
It turns out, Rails already has a unit test to verify that a dead connection is properly detected, but it is "skipped" by default because it requires a human to manually restart the PostgreSQL database (although there are ways this could be automated). When ActiveRecord::PostgresqlConnectionTest#test_reconnection_after_actual_disconnection_with_verify is run in Rails 4.0.1:
It fails, along with the tests performed after it.
The above error, as well as the failed test, are caused by PostgreSQLAdapter not properly discovering when the underlying TCP connection closes: libpq ran into an EOF, which occurs if the underlying TCP socket has closed.
When the connection pool determines whether a given database connection is still valid, it calls PostgreSQLAdapter#active?:
Ruby-pg's
PGConn#connect_poll
is a thin wrapper aroundPQconnectPoll()
from libpq. However,PQconnectPoll()
is meant for setting up a Postgres client connection in a non-blocking manner after the connection's socket is ready for reading/writing (as determined byselect()
orpoll()
) --- it does not actually perform any health checks on the connection. In fact, if the last known state of the connection is good, it immediately returns:(From src/interface/libpq/fe-connect.c in the postgresql source)
This means that
connect_poll
is not sufficient for detecting an inactive connection that has been closed by the other end. The Rails 3.2.15 version ofactive?
successfully detects the lost TCP connection by actually performing a database query, which exercises the connection and would fail if the connection were terminated:This code change was introduced in 34c7e73
To Fix
There is no easy way to determine if a TCP socket is dead other than by trying to use the connection (as was done in Rails 3.2.15 and earlier, and is also still used by the mysql adapters).
Traditionally, a remotely closed TCP connection can be detected by calling
recv
on the socket. If recv returns a string of length 0, then the remote side has closed the connection. If the socket is non-blocking but has not been closed, it will return an EAGAIN error. However, this approach does not work if there is data buffered on the TCP stream. I attempted to come up with a fix that usesrecv_nonblock
to peek ahead in the stream:However, if the server closes the connection, as it does during
test_reconnection_after_actual_disconnection_with_verify
, it sends a final message to the client that won't have been consumed yet whensocket_alive?
runs:There is actually a deeper problem here involving a time-of-check-to-time-of-use (TOCTTOU) weakness where the connection might die after checking the connection but before trying to use it. A better solution would be to detect the dead connection while trying to use it, and to then attempt to reestablish the connection when it dies. ActiveRecord should only let the error bubble up if the reconnect attempt fails.
The text was updated successfully, but these errors were encountered: