
Rails 4.0.0 and 4.0.1 PostgreSQLAdapter does not properly check whether a connection is active #12867

Closed
aetherknight opened this issue Nov 12, 2013 · 103 comments · Fixed by #13981

Comments

@aetherknight

TL;DR: Rails 4's PostgreSQLAdapter does not properly check whether a connection has died, because connect_poll does not detect that the underlying TCP socket has closed.

Rails 4.0.0 and 4.0.1 do not properly detect whether a PostgreSQL connection is still alive before using it. A previously good connection that has been disconnected by the remote server or by an intermediate proxy will not be detected by Rails, which results in queries failing. An example error that we ran into:

    PG::UnableToSend: SSL SYSCALL error: EOF detected

    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:774→ exec
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:774→ exec_no_cache
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:138→ block in exec_query
    activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:435→ block in log
    activesupport-4.0.1/lib/active_support/notifications/instrumenter.rb:20→ instrument
    activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:430→ log
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:137→ exec_query
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:921→ column_definitions
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/schema_statements.rb:174→ columns
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:114→ block in prepare_default_proc
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:56→ yield
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:56→ columns
    activerecord-4.0.1/lib/active_record/model_schema.rb:208→ columns
    activerecord-4.0.1/lib/active_record/model_schema.rb:217→ columns_hash
    activerecord-4.0.1/lib/active_record/relation/delegation.rb:14→ columns_hash
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:278→ find_one
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:268→ find_with_ids
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:35→ find
    activerecord-deprecated_finders-1.0.3/lib/active_record/deprecated_finders/relation.rb:122→ find
    activerecord-4.0.1/lib/active_record/relation.rb:334→ update
    activerecord-4.0.1/lib/active_record/querying.rb:7→ update
    (our controller method)
    ...

This error began occurring when we moved from Rails 3.2.15 to Rails 4.0.0 and then 4.0.1. Our database connections were being terminated by an HAProxy instance in our setup that kills idle TCP connections. Rails 3.2 had no issue with this --- it would detect the dead connection and establish a new one without issue. Rails 4, on the other hand, does not. A subsequent request cycle caused a new connection to be created, which would then work until that connection was killed as well.

It turns out Rails already has a unit test to verify that a dead connection is properly detected, but it is "skipped" by default because it requires a human to manually restart the PostgreSQL database (although there are ways this could be automated). When ActiveRecord::PostgresqlConnectionTest#test_reconnection_after_actual_disconnection_with_verify is run against Rails 4.0.1:

cd activerecord
$EDITOR test/config.yml # Set with_manual_interventions to true instead of false
ARCONN=postgresql ruby -Itest test/cases/adapters/postgresql/connection_test.rb

It fails, along with the tests performed after it.

The above error, as well as the failed test, are caused by the PostgreSQLAdapter not noticing when the underlying TCP connection closes: in the error above, libpq ran into an EOF because the underlying TCP socket had been closed.

When the connection pool determines whether a given database connection is still valid, it calls PostgreSQLAdapter#active?:

      # Is this connection alive and ready for queries?
      def active?
        @connection.connect_poll != PG::PGRES_POLLING_FAILED
      rescue PGError
        false
      end

Ruby-pg's PGConn#connect_poll is a thin wrapper around PQconnectPoll() from libpq. However, PQconnectPoll() is meant for setting up a Postgres client connection in a non-blocking manner after the connection's socket is ready for reading/writing (as determined by select() or poll()) --- it does not actually perform any health checks on the connection. In fact, if the last known state of the connection is good, it immediately returns:

(From src/interfaces/libpq/fe-connect.c in the postgresql source)

PostgresPollingStatusType
PQconnectPoll(PGconn *conn)
{
    PGresult   *res;
    char        sebuf[256];
    int         optval;

    if (conn == NULL)
        return PGRES_POLLING_FAILED;

    /* Get the new data */
    switch (conn->status)
    {
            /*
             * We really shouldn't have been polled in these two cases, but we
             * can handle it.
             */
        case CONNECTION_BAD:
            return PGRES_POLLING_FAILED;
        case CONNECTION_OK:
            return PGRES_POLLING_OK;

...

This means that connect_poll is not sufficient for detecting an inactive connection that has been closed by the other end. The Rails 3.2.15 version of active? successfully detects the lost TCP connection by actually performing a database query, which exercises the connection and would fail if the connection were terminated:

      # Is this connection alive and ready for queries?
      def active?
        @connection.query 'SELECT 1'
        true
      rescue PGError
        false
      end

This code change was introduced in 34c7e73.

To Fix

There is no easy way to determine whether a TCP socket is dead other than by trying to use the connection (as was done in Rails 3.2.15 and earlier, and as the mysql adapters still do).

Traditionally, a remotely closed TCP connection can be detected by calling recv on the socket: if recv returns a string of length 0, the remote side has closed the connection, while a non-blocking recv on a still-open socket with no pending data fails with EAGAIN. However, this approach does not work if there is data buffered on the TCP stream. I attempted to come up with a fix that uses recv_nonblock to peek ahead in the stream:

      def socket_alive?
        socket = Socket.for_fd(@connection.socket)
        socket.autoclose = false
        socket.recv_nonblock(1, Socket::MSG_PEEK) != ''
      rescue Errno::EAGAIN
        true
      end

      # Is this connection alive and ready for queries?
      def active?
        socket_alive? && @connection.connect_poll != PG::PGRES_POLLING_FAILED
      rescue PGError
        false
      end

However, if the server closes the connection, as it does during test_reconnection_after_actual_disconnection_with_verify, it sends a final message to the client that won't have been consumed yet when socket_alive? runs:

[3] pry(#<ActiveRecord::ConnectionAdapters::PostgreSQLAdapter>)> socket.recv_nonblock(1000, Socket::MSG_PEEK)
=> "E\x00\x00\x00mSFATAL\x00C57P01\x00Mterminating connection due to administrator command\x00Fpostgres.c\x00L2855\x00RProcessInterrupts\x00\x00"

There is actually a deeper problem here: a time-of-check-to-time-of-use (TOCTTOU) weakness, where the connection might die after it is checked but before it is used. A better solution would be to detect the dead connection while trying to use it, and to then attempt to reestablish the connection when it dies. ActiveRecord should only let the error bubble up if the reconnect attempt fails.
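
A minimal sketch of that recover-on-use approach (hypothetical helper, not ActiveRecord's actual API; it assumes ruby-pg's PGConn#reset and #status):

    # Hypothetical wrapper, not AR's real API: run the block; if it fails
    # because the connection itself is dead, reset once and retry. The error
    # only bubbles up when the reconnect (or the retried block) also fails.
    def with_auto_reconnect
      yield
    rescue PGError
      raise if @connection.status == PG::CONNECTION_OK # genuine query error
      @connection.reset                                # reestablish the session
      yield                                            # retry once
    end

    # Roughly where it would hook in:
    def exec_no_cache(sql, binds)
      with_auto_reconnect { @connection.async_exec(sql) }
    end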

@aetherknight
Author

Hi @tenderlove, your change to #active? in the mentioned commit (34c7e73) 8 months ago has been causing us major pain when recovering from dead database connections. What sort of trouble did a SELECT 1 query cause during the connection pool's reaping?

@tenderlove
Member

Reaping happens in a background thread. We can't send any data over the socket because there are no locks around the file descriptor (that I know of).


@jonleighton
Member

Awesome bug report @aetherknight. I'm also experiencing a variant of this problem in my application - the StatementPool#connection_active? method is also broken in a similar way.

@jonleighton
Member

Here's a related error, which can occur when calling establish_connection:

PG::UnableToSend: SSL SYSCALL error: Connection timed out 
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `exec`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `dealloc`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:500:in `block in clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `each_value`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:563:in `clear_cache!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract_adapter.rb:339:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:588:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:310:in `block (2 levels) in disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `each`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `block in disconnect!`
/usr/local/rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/monitor.rb:211:in `mon_synchronize`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:306:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:564:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:88:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:45:in `establish_connection`

@aetherknight
Author

So I think there may be a few possible approaches to fixing this:

  1. Implement correct TCP connection handling. When a connection abnormally closes during use, AR should attempt to reconnect automatically, and only raise an error if it can't reestablish the connection or if an additional error occurs during the reconnect attempt. The goal here is also to eliminate the TOCTTOU issue that exists between checking whether a connection is still good and then trying to do work on it. This approach would make the #active? and #connection_active? methods not particularly meaningful anymore (since you discover a dead connection by attempting to do work on it), and reaping stale connections in the ConnectionManager would need to be based only on connections that weren't checked back in (the current logic only reaps a connection that is in use, is stale, and is no longer active).
  2. Wrap uses of the PG connection with a monitor/lock within the PostgreSQLAdapter to prevent two or more threads (the worker/request-handling thread and the reaper thread) from doing work on the PG connection object at the same time --- see the sketch below. #active? and #connection_active? would then probably need to handle timeouts on the lock, or their caller would need to handle that case.
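
A minimal sketch of option 2 (illustrative names only, not ActiveRecord's actual internals):

    require "monitor"

    # Illustrative only: serialize every use of the raw PG connection so the
    # worker thread and the reaper thread can't interleave protocol traffic
    # on the same socket. With the lock held, a real query becomes a safe
    # health check again.
    class LockedAdapter
      def initialize(connection)
        @connection = connection # a PG::Connection
        @lock       = Monitor.new
      end

      def active?
        @lock.synchronize do
          @connection.query "SELECT 1"
          true
        end
      rescue PGError
        false
      end

      def execute(sql)
        @lock.synchronize { @connection.async_exec(sql) }
      end
    end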

@aetherknight
Author

For the latter approach, #active? and #connection_active? would then do work on the connection (e.g., go back to SELECT 1) within the synchronize/monitor/lock code.

@jonleighton
Member

Yep. The former definitely makes more sense to me.

@tenderlove
Member

@aetherknight I agree, we should go with solution #1. Seems the mysql adapter has a setting to automatically do this, but the pg adapter does not. We'll have to rescue exceptions wherever we call methods on the connection, then call PGConn#reset. Though we should probably do a loop with PGConn#reset_start + PGConn#reset_poll so we don't block the process.
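
A sketch of that loop (not ActiveRecord code; it assumes ruby-pg's #reset_start, #reset_poll, and #socket_io --- older pg versions only expose the raw file descriptor via #socket):

    require "pg"

    # Drive libpq's PQresetStart/PQresetPoll state machine with IO.select so
    # the reset never blocks the whole process.
    def reset_without_blocking(conn, timeout = 5)
      conn.reset_start                   # begin a non-blocking reset
      status = PG::PGRES_POLLING_WRITING # libpq: wait for writability first
      loop do
        case status
        when PG::PGRES_POLLING_OK     then return true
        when PG::PGRES_POLLING_FAILED then return false
        when PG::PGRES_POLLING_READING
          IO.select([conn.socket_io], nil, nil, timeout) or return false
        when PG::PGRES_POLLING_WRITING
          IO.select(nil, [conn.socket_io], nil, timeout) or return false
        end
        status = conn.reset_poll         # advance the state machine
      end
    end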

@bradleypriest
Contributor

Will another stacktrace help out here?
I'm getting intermittent Ruby segfaults in the same place with a Capybara test suite: https://gist.github.com/bradleypriest/7660331

@rubytastic

Same issue here; I'm using Sidekiq, but I'm not sure it is the main cause of the problems.
Is a fix already pending? My app is totally broken and raises ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within ... after about one or two page reloads.

@rubytastic

Is there a temporary fix for this issue?
It's preventing my app from running at all, locally and in production.
I tried increasing the PG pool size and the timeout; everything fails, and the problem comes back every time.
The connection pool is now at 100, which still doesn't help --- there are only 10 active PG connections.
How is the rest of the world running their Rails apps on PG while this issue persists?

Stacktrace:
https://gist.github.com/rubytastic/7786482

@mperham
Contributor

mperham commented Dec 4, 2013

@rubytastic You could determine why your db connections are dying so quickly. I don't have any pgsql experience but I know MySQL has a timeout setting where the server will kill any clients that haven't used the socket in that time period.

@aetherknight
Author

For the short term, we monkeypatched the PostgreSQLAdapter to use the older version of #active? that performed a SELECT 1 query, since we aren't using Ruby threading to handle multiple requests, nor are we running the reaper in its own thread.
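
A sketch of that kind of monkeypatch (illustrative --- the actual patch may differ in detail):

    # Restore the Rails 3.2 behavior of exercising the connection with a real
    # query. Only safe when no other thread (e.g. a reaper thread) can touch
    # the same connection concurrently.
    ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.class_eval do
      def active?
        @connection.query "SELECT 1"
        true
      rescue PGError
        false
      end
    end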


@mperham
Contributor

mperham commented Dec 4, 2013

@aetherknight Unfortunately that's not an option for Sidekiq users.

@aetherknight
Author

@mperham Ah, good point.

@rubytastic

@aetherknight
Total PG newbie here --- I just switched from MySQL.
Am I understanding correctly that there is no solid solution, nor will one be implemented in the short term? Then this is a serious issue for the PG and Sidekiq(?) combination, and I wonder why so few people have this issue.

So is switching back to MySQL the best option for now, or is there a monkey patch that could be applied to fix the problem temporarily?

@matthewd
Member

@tenderlove by my reading, solution 2 is necessary.

If you can't tell whether a connection is #active?, you either never reap it, or you regularly reap (stale) still-active connections -- and so the server could end up with many more connections than your pool limit.

But then, maybe I'm just missing something, because as far as I can see, neither of the mysql adapters is any safer from a reaper in a different thread than the SELECTive postgres version.

@aetherknight
Author

@matthewd I am beginning to think that both solutions are ultimately necessary:

#active? is fundamentally a bad way of determining whether any network connection is still alive, because there is a race condition between when #active? is called and when the connection is then used (this does not apply to the reaper's use of #active?, but it does apply to other uses such as #verify!). This can only be solved by solution 1.

Also, if I understand the current reaper code correctly, it will only reap a connection if it is currently in use, it has been checked out for longer than dead_connection_timeout, and #active? reports that the connection is dead. That means it will not reap (close and remove from the connection pool) a dangling connection from a dead thread until the OS, the network, or the remote server kills the connection --- which can take a minute or more of inactivity. However, this may be the only way to determine whether a connection is dangling without pulling it out from under a long-lived thread (as may be the case with sidekiq jobs). Which sounds like we also need solution 2, in order to allow the reaper to safely detect dead connections.
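
For reference, the reaping logic in question looks approximately like this (paraphrased from the Rails 4.0 ConnectionPool; details may differ):

    # A connection is reclaimed only if it is checked out, has been out for
    # longer than dead_connection_timeout, AND #active? reports it dead.
    def reap
      stale = Time.now - @dead_connection_timeout
      connections.dup.each do |conn|
        if conn.in_use? && stale > conn.last_use && !conn.active?
          remove conn
        end
      end
    end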

@mperham
Contributor

mperham commented Dec 15, 2013

All this stems from the fact that the connection is not self-repairing, yes? Dalli and Redis client connections are pooled with my connection_pool gem all the time. connection_pool documents that all connections should be self-repairing so that the pool doesn't need to worry about reaping or repairing.
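
A sketch of that self-repairing pattern (hypothetical wrapper, not a real gem API):

    require "pg"

    # The connection object heals itself on failure, so the pool never needs
    # to check liveness or reap: a dead connection repairs on its next use.
    class SelfRepairingPG
      def initialize(conninfo)
        @conninfo = conninfo
        @conn     = PG.connect(conninfo)
      end

      def query(sql)
        @conn.query(sql)
      rescue PGError
        @conn = PG.connect(@conninfo) # reconnect once...
        @conn.query(sql)              # ...and retry; raise if it fails again
      end
    end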


@tenderlove
Member

@mperham not just self-repairing. If a thread checks out a connection, then dies before checking the connection back in, what happens? This is why the reaper checks on "long lived" connections.

@mperham
Contributor

mperham commented Dec 16, 2013

@tenderlove If a thread somehow does not check in its connections, they'd quickly be exhausted. What's wrong with that? connection_pool only documents block usage (with_connection {|conn| ... }) to ensure the connection is automatically checked back in. Which AR API can leak connections? Are you talking about conn = Model.connection?

@tenderlove
Member

@mperham yes, that will leak a connection if done inside a Thread.new { } (and you don't specifically check it back in)
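
To illustrate the leak (Model stands in for any ActiveRecord model):

    # Implicit checkout: the connection becomes keyed to the spawned thread,
    # and nothing returns it to the pool when the thread exits.
    Thread.new do
      Model.connection.execute("SELECT 1") # leaks a pool slot
    end

    # The block form checks the connection back in automatically:
    Thread.new do
      ActiveRecord::Base.connection_pool.with_connection do |conn|
        conn.execute("SELECT 1")
      end
    end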

@mperham
Contributor

mperham commented Dec 16, 2013

Yep, that form of API should be deprecated in preference for with_connection in Rails 5 IMO. I know that doesn't help today though. :-/

@matthewd
Member

@aetherknight Yeah, I could definitely get behind an idea of arranging automatic reconnection iff you're performing a SELECT (or BEGIN), and there isn't [already] supposed to be a transaction open.

If you're trying to send an INSERT, say, you'd want to be very sure the exception meant "driver failed to send the query", and not "driver lost connection while waiting to hear what happened", before silently retrying.
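
In code, that policy might look something like this (hypothetical helper and flag, not AR's API):

    # Retry only when the statement is known to be side-effect free and no
    # transaction is supposed to be open. Otherwise the exception could mean
    # "lost the connection after the server acted on the query", and a
    # silent retry would repeat the work.
    def execute_with_retry(sql, idempotent: false)
      execute(sql)
    rescue ActiveRecord::StatementInvalid
      raise unless idempotent && !transaction_open?
      reconnect! # adapters already implement reconnect!
      execute(sql)
    end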

@matthewd
Member

@tenderlove can/does the pool keep track of which thread checked out the connection? We can determine pretty easily that the thread's gone away... though whether you trust that it's safe to return the connection to the pool, vs just tell it to close, is another matter.

Or is there a worry about code that goes out of its way to check out a connection in one thread, hand it to another, then have the first thread die?

@tenderlove
Member

@matthewd The pool does keep track of which thread checks out the connection, but you shouldn't rely on it (it uses the object id of the thread). We used to loop through every thread in the system, check its status, then check the pool, but I'd prefer a solution that doesn't depend on doing that. We could probably stuff the thread in a hash or something when it checks out, then just have the reaper check connections for those threads.
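
That bookkeeping might look roughly like this (an illustrative sketch, not the eventual implementation):

    # Remember the owning thread at checkout; the reaper can then reclaim
    # any connection whose owner is no longer alive.
    def checkout
      conn = acquire_connection
      @owners[conn] = Thread.current
      conn
    end

    def reap
      dead = @owners.reject { |_conn, owner| owner.alive? }
      dead.each_key do |conn|
        remove(conn) # owner died without checking the connection back in
        @owners.delete(conn)
      end
    end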

We shouldn't worry about one thread passing the connection to another. I don't think we support that (I think it's documented too).

Also, we should probably split this into two issues: one for self-repairing connections, and one for dealing with long-lived and possibly dead threads. I tried adding code for self-repairing connections, but TBH I can't repro this issue and I don't want to rescue from just any old PG exception.

@matthewd
Member

@tenderlove I did a thing: matthewd/rails@master...pg_verify

Does that look at all sane?

As far as I can see, those two tests both need to pass for the reaper and checkout_and_verify to both function correctly... and no less-invasive solution comes to mind.

@tenderlove
Member

It looks sane, but I think you can make the locks more granular, and possibly remove some locks. For example, the locks should probably be inside the "log" calls.

The connection shouldn't be shared among threads (except the reaper). Does the reaper impact the statement cache? I'm not sure we need that lock.

Shouldn't there be a lock in "active?" as well?

Looks good!


@flujan

flujan commented Mar 18, 2014

@matthewd ConnectionTimeoutError.

I am getting these errors in an external script using active_record.

https://gist.github.com/flujan/9621099

I installed the master Rails from GitHub, but I don't know how to require the master "active_record" in it.

@anazar

anazar commented Mar 18, 2014

We get this error using mysql on 4.1.0.rc1

@rubytastic

Like the folks above, I still see this as broken.
I have the same problem with Postgres and MySQL; it's in the Rails 4 ActiveRecord layer. How can an issue like this persist (or get introduced) for two to four months now, with so few people having the problem? This is a real Rails killer for me, and sadly I don't have the core competence to fix it or to supply thoughts on how to resolve it.

This problem started when upgrading from Rails 3 to Rails 4; after that everything just timed out for me, and I have reverted to PHP for several projects. Please, anyone, track this down --- I really want my Rails development environment back! Is there anything we can do besides posting our configs and stack traces to get this fixed sooner?

@pawel2105
Contributor

Interestingly this morning I was having asset issues.

I ran my app in production mode locally and discovered that, because my error controller wasn't handling requests for non-HTML formats, requests for those assets would return a 500 instead of a 404. After refreshing the page 4 or 5 times while tailing my log, the page would stop loading and good old ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.001 seconds) reared its ugly head again.

This is on Rails 4.0.3 running on puma 2.8.0.

Don't know if this info will be useful to anyone.

@rubytastic

@pawel2105
Your comment made me think about puma --- I use it without much thought in every project. So I switched my server to thin in development mode.

Rails 4.0.4.rc1
thin 1.6.2 codename Doc Brown
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0]

I tried using thin but couldn't verify whether the problem persists, since my project uses Rails 4 server-sent events, which are hard to remove. Perhaps it's worth test-driving a non-puma server to see if the problem persists there.

@flujan

flujan commented Mar 19, 2014

I just took the tip from @matthewd and put a larger checkout timeout on my external script --- a :checkout_timeout of 60 seconds. The error happens less frequently but still occurs.

@rubytastic

@flujan
Tried that also, with all the bleeding-edge latest versions as of today, but the problem still persists:

ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within 5.000 seconds (waited 5.002 seconds):
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:190:in `block in wait_poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `wait_poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:136:in `block in poll'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:146:in `synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:134:in `poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:423:in `acquire_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:356:in `block in checkout'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:355:in `checkout'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:265:in `block in connection'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:264:in `connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:546:in `retrieve_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_handling.rb:79:in `retrieve_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_handling.rb:53:in `connection'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:51:in `restore_query_cache_settings'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:43:in `rescue in call'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:32:in `call'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:626:in `call'
  activerecord (4.0.4.rc1) lib/active_record/migration.rb:373:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
  activesupport (4.0.4.rc1) lib/active_support/callbacks.rb:373:in `_run__4419992542473468413__call__callbacks'
  activesupport (4.0.4.rc1) lib/active_support/callbacks.rb:80:in `run_callbacks'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/callbacks.rb:27:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/reloader.rb:64:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/remote_ip.rb:76:in `call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:84:in `protected_app_call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:79:in `better_errors_call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:56:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:38:in `call_app'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:20:in `block in call'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:68:in `block in tagged'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:26:in `tagged'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:68:in `tagged'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:20:in `call'
  quiet_assets (1.0.2) lib/quiet_assets.rb:18:in `call_with_quiet_assets'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/request_id.rb:21:in `call'
  rack (1.5.2) lib/rack/methodoverride.rb:21:in `call'
  rack (1.5.2) lib/rack/runtime.rb:17:in `call'
  activesupport (4.0.4.rc1) lib/active_support/cache/strategy/local_cache.rb:83:in `call'
  rack-livereload (0.3.15) lib/rack/livereload.rb:23:in `_call'
  rack-livereload (0.3.15) lib/rack/livereload.rb:14:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/static.rb:64:in `call'
  rack (1.5.2) lib/rack/sendfile.rb:112:in `call'
  railties (4.0.4.rc1) lib/rails/engine.rb:511:in `call'
  railties (4.0.4.rc1) lib/rails/application.rb:97:in `call'
  rack (1.5.2) lib/rack/content_length.rb:14:in `call'
  puma (2.8.1) lib/puma/server.rb:490:in `handle_request'
  puma (2.8.1) lib/puma/server.rb:361:in `process_client'
  puma (2.8.1) lib/puma/server.rb:254:in `block in run'
  puma (2.8.1) lib/puma/thread_pool.rb:92:in `block in spawn_thread'

@flujan

flujan commented Mar 21, 2014

@rubytastic since we do not have enough knowledge to fix it and the issue is closed, we will need to wait for other bugs to happen and hope that the fix for one of them fixes this one too.

With Ruby > 2.0 the use of threads will grow, and this kind of bug will surely happen more often.

chris-maginatics pushed a commit to maginatics/rails that referenced this issue Apr 10, 2014
.. not a general timeout.

Now, if a thread checks out a connection then dies, we can immediately
recover that connection and re-use it.

This should alleviate the pool exhaustion discussed in rails#12867. More
importantly, it entirely avoids the potential issues of the reaper
attempting to check whether connections are still active: as long as the
owning thread is alive, the connection is its business alone.

As a no-op reap is now trivial (only entails checking a thread status
per connection), we can also perform one in-line any time we decide to
sleep for a connection.
@barnett

barnett commented May 8, 2014

any update on possible solutions?

@matthewd
Member

matthewd commented May 8, 2014

@blklane this thread ended up discussing two very different issues... which one are you experiencing, and on which Rails version?

@barnett

barnett commented May 8, 2014

@matthewd 4.0.1 and ruby 2.0 with the following error

 Error during failsafe response: could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)
2014-05-08T14:08:49.009040+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:190:in `block in wait_poll'
2014-05-08T14:08:49.009042+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `loop'
2014-05-08T14:08:49.009044+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `wait_poll'
2014-05-08T14:08:49.009046+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:136:in `block in poll'

@matthewd
Member

matthewd commented May 8, 2014

@blklane 4.1 revised the connection pool handling to address some sources of starvation (around thread death), so you could try that... but the most common cause I've seen has been people genuinely starving the pool: running more DB-touching threads than they have connections in their pool. And there's not currently much we can do about that. 😕

@aetherknight
Author

@blklane Rails 4.0.4+ changed the PostgreSQL connection adapter so that the normal request-response cycle correctly detects when a connection is no longer active, which should help the ConnectionPool self-heal after a DB connection dies.

However, there are still some other DB connection issues in 4.0.x that could lead to the ConnectionPool exhausting its pool of possible connections. Part of a gist I posted further up attempts to address a scenario where, if an error occurs during checkout, the ConnectionPool cleans up the connection that failed to be fully checked out: https://gist.github.com/aetherknight/3d0af086a92bac8f3ee9#file-active_record_patches-rb-L76

I haven't looked into Rails 4.1's connection management yet.

@adailey14

I believe this is related to this issue:

I run into pool starvation when I perform any database operations while handling errors (404 Not Found) via config.exceptions_app. Refreshing a 404 page 4 or 5 times seems to starve my connection pool. I work around it by removing DB operations from the error pages. Posting here in case anyone else sees this.

Let me know if anyone wants more information on my setup for this.

@rubytastic

I still have this issue with the latest stable versions of Rails, Ruby, and Postgres.

@rivsc

rivsc commented Jun 6, 2014

Hi,

My error message :

INFO -- : Started POST "/my_ctrl/import" for X.X.X.X at 2014-06-06 15:38:52 +0200
INFO -- : Processing by MyController#import as HTML
INFO -- : Parameters: {...}
ERROR -- : PG::UnableToSend: : INSERT INTO "..."
ERROR -- : PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
INFO -- : Completed 500 Internal Server Error in 5115ms
FATAL -- :
ActiveRecord::StatementInvalid (PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK):

With this infrastructure (also tested with Ruby 1.9.3 and Rails 4.0.4):

[infrastructure diagram]

@rubytastic: Is your PostgreSQL on the same server as the Rails app?

PS: I'm using Apartment (a gem for multi-tenant apps).

@zedalaye

[Note: I work with @rivsc]

I added these lines at the end of /etc/sysctl.conf on the pgpool2 machine (cf. the infrastructure diagram in @rivsc's comment above):

# KeepAlives tuning to detect dead PG connections faster
net.ipv4.tcp_keepalive_time = 3
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9

And I pushed the config changes through procfs to avoid a reboot:
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 9 > /proc/sys/net/ipv4/tcp_keepalive_probes

According to @rivsc, it solved the problem. In my opinion, that tends to indicate that ActiveRecord and/or the pg gem don't close connections correctly.

@rails locked and limited conversation to collaborators Jun 21, 2014
@matthewd
Member

I'm locking this issue, because it's in danger of becoming a catch-all for every possible issue involving database connections and PostgreSQL.

  • PG::UnableToSend: SSL SYSCALL error / PG::UnableToSend in general, should be fixed as of 4.0.4 / 4.1.0 (Reaper has access to threadsafe active? call #13981)
  • ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within 5.000 seconds (pool exhaustion) should no longer occur in a properly behaving application as of master / future 4.2.0 (Track owning thread for connection pool #14360). It's still quite possible to exhaust your pool if you have more threads than connections, and those threads are hanging on to the connections for too long.

If you're experiencing either of the above problems on a version where I've claimed it should be fixed, or some other problem not described above, please open a new issue so it can be properly tracked and dealt with.

If you're having an issue with connection management and you're using a gem that deliberately overrides ActiveRecord's connection management, please attempt to reproduce the problem without that gem first.

jigish pushed a commit to jigish/rails that referenced this issue Aug 8, 2014
jigish pushed a commit to jigish/rails that referenced this issue Sep 9, 2014
jigish pushed a commit to jigish/rails that referenced this issue Sep 9, 2014
uar-david-litvak pushed a commit to FanaticalPlatform/rails that referenced this issue Sep 17, 2014
jigish pushed a commit to jigish/rails that referenced this issue Oct 13, 2014