Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exceptions in Socket _recv_loop may cause thread to die silently #28

Open
APCBoston opened this issue Jun 13, 2023 · 12 comments
Open

Exceptions in Socket _recv_loop may cause thread to die silently #28

APCBoston opened this issue Jun 13, 2023 · 12 comments

Comments

@APCBoston
Copy link
Contributor

See Paramiko Transport constructor at:

https://github.com/paramiko/paramiko/blob/main/paramiko/transport.py#LL454-L457

The result of this is that netconf_client ssh connections that should time out are susceptible to hang forever instead.

Working on further investigation/resolution, putting this here for situational awareness.

@APCBoston
Copy link
Contributor Author

We do pass a timeout to paramiko's transport.open_session(), but I'm still seeing it not time out when it should...

@JennToo
Copy link
Member

JennToo commented Jun 13, 2023

The code you linked there seems to be the opposite, it sets the timeout very low.

One point of confusion here, where you're expecting to see something timeout is it because the TCP session dies or something else? I would assume that setting timeouts on the socket only affects TCP-level stuckness. But if the SSH server or NETCONF server itself took a long time to reply, that wouldn't get caught by socket-level timeouts.

I could be wrong there, I haven't done socket programming in a while. But that's how I'd expect socket-level timeouts to work.

@JennToo
Copy link
Member

JennToo commented Jun 13, 2023

To elaborate further: if it's e.g. the NETCONF server getting stuck at the NC protocol level, the TCP stream will still get ACK'd by the kernel and as far as TCP is concerned everything is fine. Eventually if the underlying service isn't grabbing stuff off the socket, the socket's buffer might fill up and then the kernel will stop ACK'ing stuff, but that might not happen if it's a low-volume connection.

@APCBoston
Copy link
Contributor Author

@JennToo the operative bit that caught my eye in the linked code is that the socket timeout exception is never propagated.

What I'm seeing in my testing is that whether I kill the NETCONF server while keeping the TCP socket open or take the device down entirely, subsequent NETCONF RPCs hang indefinitely.

It looks like this is because (and I'm sort of live-blogging my troubleshooting in this thread) Transport.open_channel() accepts a timeout, but only uses it for the initial connection. After that, its default behavior is a blocking socket with no timeout. (See here, and here ).

@JennToo
Copy link
Member

JennToo commented Jun 13, 2023

Oh that's interesting. Are you using the ncclient-adapter manager from https://github.com/ADTRAN/netconf_client/blob/main/netconf_client/ncclient.py ?

We're setting a timeout there on the futures returned by this library's session handler, and that should actually be catching this too. That one should be catching a timeout regardless of which protocol layer things get stuck at. There might be a bug with the timeout logic in this library though.

@APCBoston
Copy link
Contributor Author

Yes we are using the manager, but it's not catching it. I believe the reason is related to the fact that paramiko is spawning threads rather than using asyncio futures, but I haven't looked at that bit of the code in detail (yet). I think I should be able to have a patch out later today regardless.

@JennToo
Copy link
Member

JennToo commented Jun 13, 2023

I did a re-read of how we do the timeout logic in the manager object and I didn't spot any obvious bugs. It is a little complicated though, so there could certainly be something wrong with it.

@JennToo
Copy link
Member

JennToo commented Jun 13, 2023

We actually spawn a thread too, all the NC-level protocol stuff happens on that thread and the future is what the client thread is using.

I guess it is possible though that something (possibly paramiko) is making a blocking call into some C thing with the GIL held. That would prevent the other python threads from running

@APCBoston
Copy link
Contributor Author

My hypothesis is that there's a simple solution of calling channel.settimeout(general_timeout) in connect_ssh. Testing that now...

@APCBoston
Copy link
Contributor Author

Quick update: channel.settimeout() didn't hurt anything, but I ultimately discovered that a timeout error was not my friction point. Instead, I have a much odder problem, one that none of the Python experts in my own network can make heads or tails of, and that (I say this advisedly) could be a bug in the Python interpreter itself (CPython 3.10, CPython 3.11).

The thing I ran into is described below, though it is not -this- issue and may not be a netconf_client issue at all, I'm including it here for reference and in case anybody coming across this has the answer. I'm inclined to leave this issue open until I've generated a socket timeout and watched it get handled correctly or incorrectly, which is in my near-term (4-8 week) backlog.

Weird error details below the cut

I had an asyncio coroutine calling netconf_client 's Manager.dispatch() method to send an RPC. This in turn filters down through a series of netconf_client and paramiko calls to a paramiko Transport object, which does the socket-level stuff in a separate thread. (Transport in fact subclasses threading.Thread). netconf_client creates a concurrent.futures.Future (which I have learned is similar but not the same as an AIO Future) to await the response from the device.

I found that when the device was dead, the following were true:

  1. Paramiko instantly raised a socket.error exception and passed it back to netconf_client.
  2. netconf_client caught the exception, logged, and re-raised
  3. If I provoked the behavior through the REPL or pdb debugging session, that same exception would immediately propagate up to that REPL or debugger session, where I'd see it in the terminal. You know, like you expect exceptions to do. Stack trace and all. (I have since learned that asyncio provides a separate async REPL. Haven't tried this yet)
  4. HOWEVER, if the same code was called from an aio coroutine, that coroutine would hang forever. If I subsequently broke in with the debugger, I could see an event loop churning forever, doing nothing.
  5. BUT WAIT, THERE's MORE: when I added an explicit handler for that specific exception (except sock.error: ... ), in the coroutine function, the behavior returned to normal , the exception was caught and handled per that code block, etc etc

Thanks to #5, I got my code to work, but it bugs me that I don't understand in the slightest why 3-5 happen the way they do.

@JennToo
Copy link
Member

JennToo commented Jun 15, 2023

Wow! Yeah that's very strange indeed.

Some day I'd like for this library to be properly AIO-aware and compatible. It was written well before that stuff got standardized, but long term it'd be good to just make it async-native. I guess we'd also need (or at least want) an async-friendly SSH library too though or it'd be a bit moot.

Realistically it'd be nearly a rewrite though, and at least for the way our company uses this library (mainly just for integration testing), it probably won't get priority any time soon.

I'll change this bug's title to reflect what you found and leave it open, just in case anyone else ever stumbles into this. But it sounds like there's not much we can do to fix it within this library.

@JennToo JennToo changed the title Paramiko doesn't propagate timeout errors, so our timeouts don't work right Paramiko doesn't propagate timeout errors, so our timeouts don't work right when used in async co-routine Jun 15, 2023
@APCBoston APCBoston changed the title Paramiko doesn't propagate timeout errors, so our timeouts don't work right when used in async co-routine Exceptions in Socket _recv_loop may cause thread to die silently Aug 18, 2023
@APCBoston
Copy link
Contributor Author

APCBoston commented Aug 18, 2023

Okay, this came back up for me earlier this month and after a lot of digging I believe I have found the ur-source of the problem at https://github.com/ADTRAN/netconf_client/blob/280d9d6e19828ae7c96d359ee3e2729b44e63a48/netconf_client/session.py#L107C1-L115C1.

Things I have learned:

  • It has nothing to do with AIO after all; I can reproduce the issue in a synchronous version
  • When Session._recv_loop() attempts to read from a socket that is associated with a dead device, that socket raises an exception. The Session emits an info level log and the thread managing the recv_loop exits, but the exception is not propagated.
  • Because no exception is propagated from either Session or paramiko and the Session object still exists, it will continue accepting new RPC requests (and other method calls)
  • Paramiko can wind up in an infinite loop on the sendall call if the TCP socket attached to the dead device times out repeatedly. https://github.com/ADTRAN/netconf_client/blob/280d9d6e19828ae7c96d359ee3e2729b44e63a48/netconf_client/session.py#L107C1-L115C1

In my case, I found that an adequate workaround was to call Session.thread.is_alive() before making a periodic RPC to check if the device is still there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants