gevent killed greenlet causes pymysql to continue on the broken connection #275
I just noticed there was a release of v0.6.3 a few days ago. I've tested it, but it's still broken.
Just an FYI: mysql-connector has the same issue: https://gist.github.com/zzzeek/d196fa91c40cb515365e I think there is something about gevent and/or MySQL's server protocol that makes this happen. Or all the MySQL drivers are just extremely similar to each other.
I think this is not a bug but just a limitation. So you should just close the connection and open a new one when this happens.
It should be straightforward for state within the conversation to be tracked, such that when rollback() is called, if the conversation is not in the expected state, an error is raised and the connection is closed instead.
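This is not something pymysql does today; as a rough illustration of the "track the conversation state" idea, here is a minimal application-level sketch. The class, its method names, and the choice of InterfaceError are assumptions, not pymysql API.

```python
import pymysql

class GuardedConnection:
    # Mark the connection dirty whenever a command is interrupted, and refuse
    # to emit further SQL (including ROLLBACK) once it is dirty.
    def __init__(self, **connect_kwargs):
        self._conn = pymysql.connect(**connect_kwargs)
        self._dirty = False

    def execute(self, sql, args=None):
        if self._dirty:
            raise pymysql.err.InterfaceError("connection is out of sync; discard it")
        cur = self._conn.cursor()
        try:
            cur.execute(sql, args)
            return cur
        except BaseException:
            # A gevent.Timeout is a BaseException and can fire between the send
            # and the read; after that the socket contents no longer match what
            # the driver thinks the conversation state is.
            self._dirty = True
            self._conn.close()
            raise

    def rollback(self):
        if self._dirty:
            raise pymysql.err.InterfaceError("connection is out of sync; discard it")
        self._conn.rollback()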
What's happening here is that there's a connection in such an invalid state that if you call "SELECT connection_id()" on it, it barfs; yet, if you don't do that, it is somehow able to accept an INSERT statement and a commit. That seems like at least a really deep issue in MySQL server's protocol.
Also, I really don't understand how we get to line 118 at https://gist.github.com/zzzeek/d196fa91c40cb515365e#file-gistfile1-py-L118 with this bad connection in the first place.
@zzzeek OK, PyMySQL should raise an error when commands are out of sync (sending a command before reading the previous reply).
I can't follow what you're saying, though. Are you really referring to L118?
I mean that there is a flow-control issue, or something that makes it look like one. We get to line 118 on a connection where we should have left the block due to the timeout.
@zzzeek You are using random.random, which of course may return very small values. You may end up with very tiny amounts of sleep, not enough to trigger the gevent timeout condition.
@cactus I am not referring to cases where the timeout has not occurred. We are referring to cases where either the timeout has occurred or it has not, yet the connection present here has somehow been corrupted by timeouts that occurred elsewhere. It reaches line 118 with a connection for which either line 115 has not executed, or for which the connection in progress on line 115 has been corrupted, such that the table_a row from line 115 is no longer present, and for which "SELECT connection_id()" now raises an exception. If the timeout has not occurred, then there is no issue at line 118.
OK, I ran the code. Things generally seem to proceed fine. After a while I start seeing a few $'s, which are FK constraint failures, and @'s, which are "Command Out of Sync" errors on rollbacks. Consider the following scenario. Assume the greenlet makes it to line 119 before the timeout fires, let's say right in the middle of IO (after it sends the query but before it can read the response). There is now unread response data in the socket, the MYSQL_RESULT, which contains things like affected rows.
On the next pass through the main loop, the first execute(insert) sends its query, and MySQL responds with a MYSQL_RESULT. There are now two unread responses in the socket.
We haven't timed out this time, though, so the driver reads the previous result packet. It was an insert, so the response is of the appropriate type (just from the wrong insert). So now we have one unread response left in the socket: the result for the insert we just sent.
Next we hit the verify_connection_id part, which sends a select. Now the socket has something like this in it: the insert's result packet, followed by the select's result packet.
The driver reads the next response packet (again a more or less compatible structure), so it gets the insert's result where it expected the select's, leaving the select's result still unread.
The code now sends a commit in the if block, which the MySQL server responds to with, let's say, an OK packet. The socket now contains the leftover select result followed by that OK packet.
The driver goes to read that response here, and instead of getting an OK packet, it gets the leftover result-packet garbage. I believe a similar series of events can occur with a rollback operation as well, since it also expects an OK packet response from the server. The error on rollback was actually the most common one I saw, so that likely happens more often with the code sample. To show a bit more of what is happening underneath, if I change connection._read_ok_packet to the following:

```python
def _read_ok_packet(self):
    pkt = self._read_packet()
    if not pkt.is_ok_packet():
        dump_packet(pkt.get_all_data())
        raise OperationalError(2014, "Command Out of Sync")
    ok = OKPacketWrapper(pkt)
    self.server_status = ok.server_status
    return True
```

...I get output dumping a stale result packet at the point where an OK packet was expected.
This seems to support my hypothesis that the socket has simply become decoupled from the connection state, due to the gevent timeout aborting after a query is sent but before the socket data is fully read. If you know the conn is dirty because of an enforced timeout, the socket should probably either be fully drained, or the connection just thrown away entirely. At $dayjob I have code that heavily uses gevent and pymysql, and in cases where we enforce a timeout like your code does, we indeed just chuck the conn and build an entirely new one. We also do a rollback when putting a conn back into our pool, but that is more to guard against programmatic mistakes, with devs forgetting to commit a transaction.
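For what the "chuck the conn and build a new one" pattern can look like in practice, here is a minimal sketch, assuming a gevent-monkey-patched environment and a hypothetical pool object with checkout/checkin/discard methods; none of these names come from pymysql or the thread's gists.

```python
import gevent
import pymysql

def run_query_with_timeout(pool, sql, args=None, timeout=5.0):
    conn = pool.checkout()                 # hypothetical pool API
    try:
        with gevent.Timeout(timeout):
            with conn.cursor() as cur:
                cur.execute(sql, args)
                rows = cur.fetchall()
        conn.rollback()                    # clear any implicit transaction state
        pool.checkin(conn)                 # conversation is still in sync; reuse it
        return rows
    except gevent.Timeout:
        # The server's reply may still be sitting unread in the socket,
        # so the connection can no longer be trusted; throw it away.
        conn.close()
        pool.discard(conn)                 # hypothetical pool API
        raise
```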
I'm having trouble getting the part where we just "throw the connection away": if I call conn.close() and try to recycle it when the timeout occurs, the whole script runs out of connections, as though the socket isn't really going away. Not sure what that's about (try to make the program do this; if you get it to work, show me how).
Another option may be to add a 'scrub socket' step that sets the socket to nonblocking, reads until it can't read any longer, then restores the socket options. The danger there is that some MySQL data may show up later, if the server is very busy and slow to respond (as you would expect during exactly the kind of scenario where a gevent timeout would fire in the first place, of course!).
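A sketch of that "scrub socket" idea, written against a plain socket object; for a pymysql connection the socket in question is a private attribute of the Connection, so treat this as illustrative only.

```python
import socket

def scrub_socket(sock):
    # Drain whatever the server has already sent, then restore blocking mode.
    sock.setblocking(False)
    try:
        while sock.recv(4096):             # b'' means the server closed the socket
            pass
    except (BlockingIOError, socket.error):
        pass                               # nothing left to read right now
    finally:
        sock.setblocking(True)
```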
Oh, I get it. Here's mine; I just needed to make sure the pool was capped at 50: https://gist.github.com/zzzeek/4cb5615faebd1bcb686e
Since I'm not good at English, I can't follow this discussion.
@methane When a gevent timeout interrupts pymysql after it sends a query but before it reads the response, the response is left unread in the socket. A subsequent query then ends up reading the previous server response out of the socket, which may cause an error or silently return the wrong thing, depending on what the previous query was. I don't think there is a clean or safe way for the driver to gracefully recover from such a situation; instead of re-using the connection, the user should close it and make a new one if they are using gevent timeouts to abruptly abort normal processing of responses.
I don't think the connection should gracefully recover, but it should definitely stop emitting SQL. The connection should at all times be matching the messages it sends/receives over the socket against an internal state machine, and any time these don't match, no further messages should proceed on the socket, full stop. I'm surprised this is not how this is already architected. MySQLdb does the right thing in a plain threaded scenario at least: you get "commands out of sync".
so it's the "silently be wrong" part that I guess is where things may or may not be possible here. I've yet to dig into MySQL's protocol as of yet. I might suggest that when COMMIT is called, at least, that an additional message verifying that things are what they seem would be worthwhile. Better yet, when ROLLBACK is called, it is OK for ROLLBACK to be expensive when there was actually state in the transaction, do a "live" check right there (note, not when there was no transaction present, b.c. we do need cheap no-op rollbacks). The fact that "SELECT connection_id()" blows up yet the connection can still commit data seems very strange. |
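A rough sketch of that "do a live check on an expensive rollback" suggestion, using pymysql's ping() as the verification step; whether ping() reliably detects this particular out-of-sync state is an assumption, so this is illustrative only.

```python
import pymysql

def checked_rollback(conn, had_transaction_state):
    # Only pay for the verification when the transaction actually did work;
    # cheap no-op rollbacks stay cheap.
    if had_transaction_state:
        try:
            conn.ping(reconnect=False)     # should fail loudly if the conversation is broken
        except pymysql.MySQLError:
            conn.close()
            raise
    conn.rollback()
```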
Well, the problem is that the MySQL protocol is a bit lacking in this regard. It appears there is no way to associate a statement with a server response other than serially. If you externally break serial processing (send a command but fail to read the response), there is no way to safely resynchronize other than to give up and start fresh. Maybe if the MySQL protocol returned OK packets with some kind of incrementing counter, some local association could be maintained, but it apparently does not. As far as MySQLdb doing the right thing, I don't think you can stop it from reading the response to a networking event (it doesn't support preemption nor gevent, as far as I recall), so I don't believe it can fall out of sync with regard to itself and the socket state. It looks like it calls mysql_real_query on its own internally managed connection. Further, as seen here, you can still make it do the wrong thing if you use it incorrectly. That said, perhaps pymysql could have a per-connection counter, incremented under a mutex lock when a command is sent that expects a response, and decremented when said response is read. If you try sending a command when the counter is non-zero, it could honk loudly. This could prevent things getting out of sync. EDIT: Since pymysql conns are not supposed to be shared, I don't think a mutex would actually be required.
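A minimal sketch of that per-connection counter idea, done as a wrapper rather than inside the driver; the class name and the decision to skip the mutex (per the EDIT above) are assumptions.

```python
import pymysql

class SyncCheckedConnection:
    def __init__(self, **connect_kwargs):
        self._conn = pymysql.connect(**connect_kwargs)
        self._awaiting_reply = 0           # commands sent whose reply has not been read

    def query(self, sql, args=None):
        if self._awaiting_reply:
            # Honk loudly instead of reading some earlier command's reply.
            raise pymysql.err.InterfaceError("command out of sync; close this connection")
        cur = self._conn.cursor()
        self._awaiting_reply += 1          # a reply is now owed by the server
        cur.execute(sql, args)             # sends the command and reads its reply
        self._awaiting_reply -= 1          # only reached once the reply was consumed
        return cur.fetchall()
```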
I've spent hours playing with that script. It's still confusing me what the specific steps are that make it illustrate this, because every time I try to simplify the timeout logic to narrow down the exact sequence of steps, the error goes away. I still don't understand what state we are in such that the "table_b" row is going in and the "table_a" row isn't there, unless the idea is that the "table_a" INSERT silently failed because the connection was in the wrong state from the last run; but every time I try to add more things around it to prove that, it falls apart. So yeah, I don't really know what to do here. We rely upon rollback() definitely working or explicitly failing. I don't really feel like building gevent APIs, documentation, and warnings all throughout my downstream libraries that say "if you don't do this, you'll get erroneous data committed!", especially since I don't really understand the specifics of how it's even happening.
@zzzeek Here is the shortest version I can reproduce the error with: https://gist.github.com/cactus/3540d8034e280097071b If you run it, you should also see the DEBUG output of pymysql (debug is turned on just after the conn setup).
I want the version where rollback() silently fails, the next INSERT statement then quasi-silently fails, returning 0 for cursor.lastrowid, and then the next INSERT + COMMIT in the same "transaction" succeed. That's pretty much what's happening in the bigger test case.
@zzzeek Not sure if this is what you are looking for, but here it is: https://gist.github.com/cactus/4359c4992267cb3aff73 This one is just more convoluted, though; the previous example neatly distills the problem. Adding further wonky state (additional timed-out and piled-up unread packets in the socket) may result in different perceived behavior for the end user of the db connection, but I don't know that it adds any clarity to the underlying root cause.
After some experimentation, the reason this happens is that the gevent timeout interrupts pymysql after a command has been sent but before its response has been read, leaving the connection out of sync. The temporary fix is very simple: https://gist.github.com/a9755c0a5f256cd533c590dd5369ae1e With this modified version of pymysql, the problem won't happen with the reproduction script listed in the issue description. Edit: there should be a nicer way to handle this exception, but the cause is the interrupted read described above.
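The gist itself is no longer available (see the next comment), so as a stand-in, here is a hedged sketch of the general shape of such a guard at the application level; it is not the contents of the gist, and the helper name is made up.

```python
import contextlib

@contextlib.contextmanager
def discard_on_interrupt(conn):
    # If anything interrupts work on the connection, including gevent.Timeout
    # (a BaseException), close it so later reuse fails loudly instead of
    # silently reading a stale reply.
    try:
        yield conn
    except BaseException:
        conn.close()
        raise
```

Usage would be along the lines of `with gevent.Timeout(2), discard_on_interrupt(conn): ...` around any block that may be aborted mid-conversation.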
The link is broken, could you update it?
Bug reported in the SQLAlchemy issue tracker, but from @zzzeek's point of view, it's a PyMySQL-side bug.
This causes a serious bug, a partially succeeding transaction, when a greenlet times out and is killed by gevent.
Original issue here: https://bitbucket.org/zzzeek/sqlalchemy/issue/3258/a-gevent-killed-greenlet-causes-pymysql
Reproduce script:
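The original reproduction script attached to the issue is not preserved here; the following is only a rough sketch, based on the discussion above, of the pattern it exercises (a gevent timeout aborting a query after it has been sent, with the same connection then reused). The table name, credentials, and timings are all made up.

```python
from gevent import monkey
monkey.patch_all()                         # pymysql's socket IO must yield to the gevent hub

import gevent
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="", db="test")

while True:
    try:
        with gevent.Timeout(0.01):         # deliberately tight timeout
            with conn.cursor() as cur:
                cur.execute("INSERT INTO table_a (data) VALUES ('a')")
                cur.execute("SELECT SLEEP(1)")   # likely interrupted before its reply is read
        conn.commit()
    except gevent.Timeout:
        # Reusing the connection here is exactly what goes wrong: the aborted
        # query's reply may still be sitting unread in the socket.
        conn.rollback()
```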