gevent killed greenlet causes pymysql to continue on the broken connection #275

Open
lxyu opened this Issue Dec 5, 2014 · 26 comments


@lxyu
lxyu commented Dec 5, 2014

Bug reported in the sqlalchemy issue tracker, but from @zzzeek's point of view, it's a pymysql-side bug.

This causes a serious bug, a partially succeeding transaction, when a greenlet times out and is killed by gevent.

Original issue here: https://bitbucket.org/zzzeek/sqlalchemy/issue/3258/a-gevent-killed-greenlet-causes-pymysql

Reproduction script:

@lxyu
lxyu commented Dec 5, 2014

similar (or maybe the same) issues: #234, #238, #260

The reproduction script uses the latest stable versions of pymysql and sqlalchemy; this bug still exists in v0.6.2.

@lxyu
lxyu commented Dec 5, 2014

I just found that v0.6.3 was released a few days ago. I've tested it, but it's still broken.

@zzzeek
Contributor
zzzeek commented Dec 5, 2014

just an FYI mysql-connector has the same issue: https://gist.github.com/zzzeek/d196fa91c40cb515365e

I think there is something about gevent and/or MySQL's server protocol that makes this happen. Or all the MySQL drivers are just extremely similar to each other.

@methane
Contributor
methane commented Dec 6, 2014

I think this is not a bug but just a limitation.
The MySQL protocol is request-response style, and PyMySQL doesn't support anything like pipelining.
Until the previous request has finished successfully, you should not issue any new request, including a rollback.

So you should just call con.close() and not call any other methods after that.

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

It should be straightforward to track the state of the conversation, such that when rollback() is called and the conversation is not in the expected state, an error is raised and the connection is closed instead.
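A minimal sketch of the kind of tracking described here (hypothetical class names, not pymysql internals): the connection records whether a reply is still owed on the wire, and refuses further commands once the conversation is out of step.

```python
class OutOfSync(Exception):
    pass

class ConversationState:
    """Track whether a server reply is still owed on the wire."""
    IDLE, AWAITING_REPLY, BROKEN = "idle", "awaiting_reply", "broken"

    def __init__(self):
        self.state = self.IDLE

    def before_send(self):
        # A new command is only legal once the previous reply was read.
        if self.state != self.IDLE:
            self.state = self.BROKEN
            raise OutOfSync("command sent before previous reply was read")
        self.state = self.AWAITING_REPLY

    def after_reply(self):
        self.state = self.IDLE

# Normal cycle: send, then read the reply.
conv = ConversationState()
conv.before_send()
conv.after_reply()

# A timeout abandons the read between send and reply...
conv.before_send()
# ...so the next command (e.g. a rollback) fails loudly instead of
# silently consuming the stale reply.
try:
    conv.before_send()
    failed_loudly = False
except OutOfSync:
    failed_loudly = True
```

Once the state reaches BROKEN, no further messages would be allowed on the socket; closing is the only remaining option.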

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

what's happening here is that there's a connection in such an invalid state that if you call "SELECT connection_id()" on it, it barfs; yet if you don't do that, somehow it's able to accept an INSERT statement and a commit. That seems like a really deep issue in MySQL server's protocol.

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

also I really don't understand how in the first place we get to line 118 at https://gist.github.com/zzzeek/d196fa91c40cb515365e#file-gistfile1-py-L118 with this bad connection.

@methane
Contributor
methane commented Dec 6, 2014

@zzzeek OK, PyMySQL should raise an error when the command is out of sync (a command sent before the previous reply was read).

> also I really don't understand how in the first place we get to line 118 at https://gist.github.com/zzzeek/d196fa91c40cb515365e#file-gistfile1-py-L118 with this bad connection.

I can't follow what you're saying. Are you really referring to line 118?

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

I mean that there is a flow-control issue, or something that makes it look like one. We get to line 118 on a connection where we should have left the block due to the timeout.

@cactus
Contributor
cactus commented Dec 6, 2014

@zzzeek You are using random.random, which of course may return very small values. You may end up with very tiny amounts of sleep, not enough to trigger the gevent timeout condition.

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

@cactus I am not referring to cases where the timeout has not occurred. We are referring to cases where either the timeout has occurred or it has not, yet the connection that is present here has somehow been corrupted by timeouts that occurred elsewhere.

It reaches line 118 with a connection for which either line 115 has not executed, or the statement in progress on line 115 has been corrupted, such that the table_a row from line 115 is no longer present and "SELECT connection_id()" now raises an exception. If the timeout has not occurred, then there is no issue at line 118.

@cactus
Contributor
cactus commented Dec 6, 2014

OK, I ran the code. Things generally proceed fine. After a while I start seeing a few $'s with the FK constraint uncommented. If I uncomment the verify-connection part, after a long while I see some errors on rollback. Turning the $'s and @'s into logs, I can see the errors.

$'s are FK constraint failures, and @'s are "Command Out of Sync" on rollbacks.

Consider the following scenario. Assume the greenlet makes it to line 119, and the timeout fires right in the middle of I/O (after it sends the query but before it can read the response). There is now unread response data in the socket, the MYSQL_RESULT, which contains things like affected rows.

socket = | MYSQL_RESULT(insert2) | 

On the next time through the main loop, the first execute(insert) sends the query, and mysql responds with a MYSQL_RESULT. There are now two unread responses in the socket.

socket = | MYSQL_RESULT(insert2) | MYSQL_RESULT(insert1) |

We haven't timed out though, so the driver reads the previous result packet. It was an insert, so the response is of the appropriate type (just from the wrong insert). So now we have:

socket = | MYSQL_RESULT(insert1) |

So now we hit the verify_connection_id part, which sends a select. Now the socket has something like this in it.

socket = | MYSQL_RESULT(insert1) | MYSQL_RESULT(select) |

The driver reads the response packet (again more or less compatible structure), and we get this:

socket = | MYSQL_RESULT(select) |

The code now sends a commit in the if block, to which the mysql server responds with, let's say, an OK packet. This is what the socket may look like:

socket = | MYSQL_RESULT(select) | OK |

The driver goes to read that response here, and instead of getting an OK packet, gets the result packet garbage. I believe a similar series of events can occur with a rollback operation as well, since it expects an OK packet response from the server as well. The error on rollback was actually the most common one I saw, so that likely happens more often with the code sample.
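The off-by-one effect of this walkthrough can be simulated with the socket modeled as a plain FIFO of responses (the send/read_reply helpers below are illustrative, not the pymysql API):

```python
from collections import deque

# Model the socket as a FIFO of unread server responses: each command
# appends one reply, each read pops the oldest reply.
socket_fifo = deque()

def send(label):                         # hypothetical helper, not pymysql
    socket_fifo.append("RESULT(%s)" % label)

def read_reply():                        # hypothetical helper, not pymysql
    return socket_fifo.popleft()

send("insert2")                          # timeout fires: reply never read
send("insert1")                          # next trip through the main loop
stale1 = read_reply()                    # driver thinks this answers insert1

send("select_connection_id")             # the verify-connection step
stale2 = read_reply()                    # again one reply behind

send("commit")                           # commit goes out...
socket_fifo.pop()                        # ...server's reply is an OK packet
socket_fifo.append("OK(commit)")
garbage = read_reply()                   # commit reads the SELECT result
```

Every read returns the reply to the *previous* command, which is exactly the "compatible-enough garbage" behavior described above.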

To show a bit more of what is happening underneath, if I change connection._read_ok_packet to the following...

    def _read_ok_packet(self):
        pkt = self._read_packet()
        if not pkt.is_ok_packet():
            # Dump whatever actually arrived before bailing out.
            dump_packet(pkt.get_all_data())
            raise OperationalError(2014, "Command Out of Sync")
        ok = OKPacketWrapper(pkt)
        self.server_status = ok.server_status
        return True

...I get this output:

packet length: 1
method call[1]: _read_ok_packet
method call[2]: commit
method call[3]: transaction_kill_worker
method call[4]: run
01                                               .
----------------------------------------------------------------------------------------

packet length: 37
method call[1]: _read_ok_packet
method call[2]: rollback
method call[3]: transaction_kill_worker
method call[4]: run
03 64 65 66 00 00 00 0F 63 6F 6E 6E 65 63 74 69  . d e f . . . . c o n n e c t i
6F 6E 5F 69 64 28 29 00 0C 3F 00 15 00 00 00 08  o n _ i d . . . . . . . . . . .
A1 00 00 00 00                                   . . . . .
----------------------------------------------------------------------------------------

@packet length: 5
method call[1]: _read_ok_packet
method call[2]: rollback
method call[3]: return_conn
method call[4]: transaction_kill_worker
method call[5]: run
----------------------------------------------------------------------------------------
FE 00 00 01 00                                   . . . . .
----------------------------------------------------------------------------------------

ERROR:foo:Exception during rollback
Traceback (most recent call last):
  File "test.py", line 42, in return_conn
    conn.rollback()
  File "/home/elij/projects/00venv/lib/python2.7/site-packages/pymysql/connections.py", line 680, in rollback
    self._read_ok_packet()
  File "/home/elij/projects/00venv/lib/python2.7/site-packages/pymysql/connections.py", line 661, in _read_ok_packet
    raise OperationalError(2014, "Command Out of Sync")
OperationalError: (2014, 'Command Out of Sync')

This seems to support my hypothesis: the socket simply becomes decoupled from the connection state because the gevent timeout aborts after a query is sent but before the socket data is fully read.

If you know the conn is dirty due to enforced timeout, the socket should probably either be fully cleaned, or the connection just thrown away entirely.

At $dayjob I have code that heavily uses gevent and pymysql, and in cases where we enforce a timeout like your code does, we indeed just chuck the conn and build an entirely new one. We also do a rollback when inserting a conn back into our pool, but that is more to avoid programmatic mistakes with devs forgetting to commit a transaction.
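A sketch of that discard-on-timeout check-in, assuming a hypothetical home-grown pool (Pool and FakeConn below are illustrative, not a real library):

```python
# Hypothetical check-in logic: a connection whose greenlet was killed
# by a timeout is closed and replaced, never returned to the pool as-is.

class Pool:
    def __init__(self, factory, size=5):
        self._factory = factory                  # builds a fresh connection
        self._idle = [factory() for _ in range(size)]

    def checkout(self):
        return self._idle.pop()

    def checkin(self, conn, timed_out=False):
        if timed_out:
            # The socket state is unknowable after an abrupt kill:
            # discard the connection and build a new one.
            try:
                conn.close()
            except Exception:
                pass
            conn = self._factory()
        else:
            # Cheap guard against forgotten commits, per the comment above.
            conn.rollback()
        self._idle.append(conn)

class FakeConn:                                  # stand-in for a DBAPI conn
    def __init__(self):
        self.closed = self.rolled_back = False
    def close(self):
        self.closed = True
    def rollback(self):
        self.rolled_back = True

pool = Pool(FakeConn, size=1)
conn = pool.checkout()
pool.checkin(conn, timed_out=True)               # conn is closed and replaced
replacement = pool.checkout()
```

The key design point is that the timed-out connection never re-enters circulation; only a freshly built one does.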

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

I'm having trouble with the part where we just "throw the connection away". If I call conn.close() and try to recycle it when the timeout occurs, the whole script runs out of connections, as though the socket isn't really going away. Not sure what that's about (try to make the program do this; if you get it to work, show me how).

@cactus
Contributor
cactus commented Dec 6, 2014

Another option may be to add a "scrub socket" step that sets the socket to nonblocking, reads until it can't read any longer, then restores the socket options. The danger there is that some mysql data may show up later if the server is very busy and slower to respond (as you would expect during exactly the kind of scenario where a gevent timeout fires, of course!).
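A sketch of such a scrub step using only the standard socket module (scrub_socket is a hypothetical helper, and the caveat about late-arriving data applies in full):

```python
import socket
import time

def scrub_socket(sock):
    """Best-effort drain of unread bytes; restores blocking mode after.

    Inherently racy: a busy server may deliver more response data after
    the drain finishes, so this is not a safe substitute for reconnecting.
    """
    sock.setblocking(False)
    try:
        while True:
            if not sock.recv(4096):
                break                    # peer closed the socket
    except BlockingIOError:
        pass                             # buffer is empty (for now)
    finally:
        sock.setblocking(True)

# Demo with a local socket pair standing in for the MySQL connection.
server, client = socket.socketpair()
server.sendall(b"stale MYSQL_RESULT bytes")
time.sleep(0.05)                         # let the bytes arrive locally
scrub_socket(client)

client.settimeout(0.1)
try:
    leftover = client.recv(4096)         # nothing should remain
except socket.timeout:
    leftover = b""
```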

@zzzeek
Contributor
zzzeek commented Dec 6, 2014

Oh, I get it. Here's mine; I just needed to make sure the pool was capped at 50: https://gist.github.com/zzzeek/4cb5615faebd1bcb686e

@methane
Contributor
methane commented Dec 7, 2014

Since I'm not good at English, I can't follow this discussion.
What is PyMySQL's problem?

@cactus
Contributor
cactus commented Dec 8, 2014

@methane When a gevent timeout interrupts pymysql after it sends a query, but before it reads the response, the response is left unread in the socket. A subsequent query ends up reading the previous server response out of the socket, which may cause an error or silently be wrong, depending on what the previous query was.

I don't think there is a clean/safe way for the driver to gracefully recover from such a situation, and instead of re-using the connection the user should close the connection and make a new one if they are using gevent timeouts to abruptly abort normal processing of responses.

@zzzeek
Contributor
zzzeek commented Dec 12, 2014

I don't think the connection should gracefully recover, but it should definitely stop emitting SQL. The connection should at all times match the messages it sends/receives over the socket against an internal state machine, and any time these don't match, no further messages should proceed on the socket, full stop. I'm surprised this is not how this is already architected.

MySQLdb does the right thing in a plain threaded scenario at least: you get "commands out of sync".

@zzzeek
Contributor
zzzeek commented Dec 12, 2014

So it's the "silently be wrong" part that I guess is where things may or may not be possible here. I've yet to dig into MySQL's protocol. I might suggest that when COMMIT is called, at least, an additional message verifying that things are what they seem would be worthwhile. Better yet, when ROLLBACK is called, do a "live" check right there; it is OK for ROLLBACK to be expensive when there was actually state in the transaction (note: not when there was no transaction present, because we do need cheap no-op rollbacks).

The fact that "SELECT connection_id()" blows up yet the connection can still commit data seems very strange.

@cactus
Contributor
cactus commented Dec 12, 2014

Well, the problem is the mysql protocol is a bit lacking in this regard. It appears there is no way to associate a statement with a server response, other than serially. If you externally break serial processing (send a command but fail to read the response), there is no way to safely resynchronize other than to give up and start fresh.

Maybe if the mysql protocol returned OK packets with some kind of incrementing counter, a local association could be maintained, but it apparently does not.

As far as MySQLdb doing the right thing: I don't think you can stop it from reading the response due to a network event (it doesn't support preemption or gevent, as far as I recall), so I don't believe it can fall out of sync with regard to itself and the socket state. It looks like it calls mysql_real_query on its own internally managed connection. Further, as seen here, you can still make it do the wrong thing if you use it incorrectly.

That said, perhaps pymysql could have a per connection counter, incremented under a mutex lock when a command is sent that expects a response, and decremented when said response is read. If you try sending a command when the counter is non-zero, it could honk loudly. This could prevent things getting out of sync.

EDIT: Since pymysql conns are not supposed to be shared, I don't think a mutex would actually be required.
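A sketch of that per-connection counter (ReplyCounter is a hypothetical name, not pymysql's real internals): +1 when a command expecting a reply is sent, -1 when the reply is fully read; a non-zero counter at send time means an earlier reply was abandoned, e.g. by a gevent timeout.

```python
class CommandOutOfSync(Exception):
    pass

class ReplyCounter:
    """Count replies owed by the server for this connection."""

    def __init__(self):
        self.pending = 0

    def command_sent(self):
        # Sending while a reply is outstanding means an earlier read
        # was abandoned; honk loudly instead of desyncing.
        if self.pending:
            raise CommandOutOfSync(
                "%d unread reply(ies) still on the wire" % self.pending)
        self.pending += 1

    def reply_read(self):
        self.pending -= 1

guard = ReplyCounter()
guard.command_sent()
guard.reply_read()          # normal request/response cycle

guard.command_sent()        # timeout kills the greenlet before the read...
try:
    guard.command_sent()    # ...so the next command fails immediately
    honked = False
except CommandOutOfSync:
    honked = True
```

As noted in the edit above, since pymysql connections are not supposed to be shared across threads, no mutex is needed around the counter.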

@zzzeek
Contributor
zzzeek commented Dec 12, 2014

I've spent hours playing with that script. It's still confusing to me what the specific steps are that make it illustrate this; every time I try to simplify the timeout logic to narrow down the exact sequence of steps, the error goes away. I still don't understand what state we are in such that the "table_b" row goes in and the "table_a" row isn't there, unless the idea is that the "table_a" INSERT silently failed because the connection was in the wrong state from the last run; but every time I try to add more things around it to prove that, it falls apart.

So yeah I don't really know what to do here. We rely upon rollback() definitely working or explicitly failing. I don't really feel like building gevent APIs, documentation and warnings all throughout my downstream libraries that suggest "if you don't do this, you'll get erroneous data committed!", especially since I don't really understand the specifics of how it's even happening.

@cactus
Contributor
cactus commented Dec 13, 2014

@zzzeek here is the shortest version I can reproduce the error with. https://gist.github.com/cactus/3540d8034e280097071b

If you run it, you should also see the DEBUG output of pymysql (debug turned on just after conn setup).

@zzzeek
Contributor
zzzeek commented Dec 13, 2014

I want the version where rollback() silently fails, the next INSERT statement invoked then quasi-silently fails, returning 0 for cursor.lastrowid, then the next INSERT + COMMIT in the same "transaction" succeed. That's pretty much what's happening in the bigger test case.

@cactus
Contributor
cactus commented Dec 13, 2014

@zzzeek not sure if this is what you are looking for, but here it is: https://gist.github.com/cactus/4359c4992267cb3aff73

This is just more convoluted, though; the previous example neatly distills the problem. Adding further wonky state (additional timed-out and piled-up unread packets in the socket) may result in different perceived behavior for the end user of the db connection, but I don't know that it adds any clarity to the underlying root cause.

@wooparadog
wooparadog commented Jun 29, 2016 edited

After some experimenting: the reason this happens is that gevent.Timeout is actually a BaseException, which is rarely handled by libraries.

The temporary fix is very simple: https://gist.github.com/a9755c0a5f256cd533c590dd5369ae1e

With this modified version of pymysql, the problem won't happen with the reproduction script listed in the issue description.


edit:

There should be a nicer way to handle this exception. But the cause is gevent.Timeout being a BaseException.
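The BaseException point can be demonstrated without gevent at all; FakeTimeout below is a stand-in for gevent.Timeout, and driver_query is an illustrative sketch, not pymysql code:

```python
# gevent.Timeout derives from BaseException, so a driver's
# `except Exception:` cleanup path never runs when the timeout fires.

class FakeTimeout(BaseException):        # mimics gevent.Timeout's base class
    pass

events = []

def driver_query():
    try:
        events.append("query sent")
        raise FakeTimeout()              # greenlet killed mid-query
    except Exception:
        events.append("driver cleanup")  # never runs: not an Exception
        raise

try:
    driver_query()
except BaseException:
    events.append("escaped to caller")   # socket left with an unread reply
```

The query is sent and the exception sails straight past the driver's cleanup, which is exactly how the unread response ends up stranded in the socket.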

@methane methane added this to the After 1.0 milestone Jul 28, 2016