ssh: Fix stealing of 'EXIT' messages if a client is trapping exits #8226
CT Test Results: 2 files, 29 suites, 17m 19s ⏱️ For more details on these failures, see this check. Results for commit a6482bd. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. (Erlang/OTP Github Action Bot)
I'm glad not to be the only one 😅
36295ec to ea38c67
Added two tests (one for |
lib/ssh/src/ssh_acceptor.erl (outdated)
handle_connection(Address, Port, _Peer, Options, Socket, _MaxSessions, _NumSessions, ParallelLogin)
  when ParallelLogin == true ->
    Parent = self(),
    Ref = make_ref(),
    Pid = spawn_link(
I think spawning a process here that is not part of the supervisor tree, and linking it to the user process, is problematic.
takeover(ConnPid, _, Socket, Options) ->
takeover(ParentPid, ConnPid, _, Socket, Options) ->
Here we will be changing the controlling process a second time. I think we should write the code in such a way that we only need to do it once. The acceptor process, which should be in the supervisor tree, should create the socket, start the appropriate subsystem dynamic supervisor tree, retrieve the pid of the connection_handler process, and transfer the ownership once.
@Maria-12648430 I added some thoughts on how I think it should work.
Also, I think it should work to have several accepting processes: the first one will win, and the others might get the next connection attempt.
Thanks @IngelaAndin, I will take a closer look next week, I'm currently on a short vacation 🙂 |
@IngelaAndin your suggestions make sense to me and I think we should go that way. However, this makes it a larger refactoring (vs a rather simple bugfix) which will take some time to work out. |
@Maria-12648430 no problem Maria, I think it is important to fix things the "tm correct way", because otherwise it is mainly a question of time until some new problem pops up. I believe in fixing the problem, not only shutting the symptoms up. We very much appreciate you taking the time, as we alas do not always have enough resources to fix everything as swiftly as we may desire.
Do you mean to have that in the scenario with parallel_login == true or parallel_login == false?
I agree with suggestions from Ingela.
I think instruction in ssh_acceptor could be removed. |
end.

%%--------------------------------------------------------------------
%% Issue #8223
If you write GH-8223, it will be clear that you're referring to a GitHub issue.
This is pretty hard stuff 😰 Everything happens all over the place and gets pushed around to everywhere else, or so it seems 🤪 This will take a while to sort out and not break anything. |
ea38c67 to 6e82e50
@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think. We put another simple_one_for_one supervisor We also introduced a new value for the The overall behavior is as follows, depending on the value given for
The outlined behavior introduces no backwards incompatibilities. Moreover, the approach solves the issue (#8223) that gave rise to this PR, and as far as we can tell it incorporates the suggestions made before. There is one failing test, though: |
6e82e50 to dc8d099
Co-authored-by: Jan Uhlig <juhlig@hnc-agency.org>
dc8d099 to fd5a27d
listen(Port, Options) ->
    {_, Callback, _} = ?GET_OPT(transport, Options),
    SockOpts = ?GET_OPT(socket_options, Options) ++ [{active, false}, {reuseaddr, true}],
This was [{active, false}, {reuseaddr, true} | ?GET_OPT(socket_options, Options)] in ssh_acceptor, which is wrong. For one, it would prevent the usage of the option inet_backend, since this option must be the first in the list. For another, in options given to gen_tcp or ssl, the last option wins, such that if for example {active, true} was in the list of given socket options, it would be set to {active, true} despite the tacked-on {active, false}.
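To illustrate the ordering argument above, here is a minimal sketch that only builds the two option lists; it does not open a socket. The module and function names (sockopts_order, forced_first/1, forced_last/1) are hypothetical and not part of ssh. That the last occurrence wins is taken from the comment above rather than verified here.

```erlang
-module(sockopts_order).
-export([forced_first/1, forced_last/1]).

%% Old ordering (as previously in ssh_acceptor): the forced options are
%% prepended. A user-supplied inet_backend option is no longer first, and
%% a user-supplied {active, true} ends up later in the list than the
%% forced {active, false}.
forced_first(UserOpts) ->
    [{active, false}, {reuseaddr, true} | UserOpts].

%% New ordering (as in this PR): the forced options are appended, so the
%% user-supplied options keep their positions (inet_backend stays first)
%% and the forced values are the last occurrences in the list.
forced_last(UserOpts) ->
    UserOpts ++ [{active, false}, {reuseaddr, true}].
```

With UserOpts = [{inet_backend, socket}, {active, true}], forced_last/1 keeps inet_backend at the head and places {active, false} after {active, true}, while forced_first/1 does the opposite on both counts.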
Thanks a lot for the work. We have planned a code review and will get back to you.
I think the test looks like a white-box test, and it is probably trying to test that there is no process leak if the client gives up the connection after sending its hello message. It should probably be checked differently, as you have changed the supervisor tree. I like the way that you describe in the documentation what should happen. I have not fully understood the code yet. I think I need to look at it as a whole; the diff is not enough.
@IngelaAndin looking forward to your review 🙂 I'll see what I can do about that test next week. Btw, I got a notification by email about a comment from you regarding |
@Maria-12648430 yes, it was because I thought maybe my comment was premature, so I deleted it for the moment. It is true that processes under a simple_one_for_one supervisor would benefit from using proc_lib:spawn_link and then behaviour:enter_loop instead of proc_lib:start_link, as they are independent of each other and do not need to wait for each other's init, so we can skip the sending and receiving of the ack message. The question is whether I had put this comment in the right place. So now I just made a general comment instead.
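A minimal sketch of the proc_lib:spawn_link plus enter_loop pattern mentioned above, using gen_server as the behaviour. The module enter_loop_worker and its functions are hypothetical and not taken from ssh; it only shows that start_link/1 returns without waiting for an init ack.

```erlang
-module(enter_loop_worker).
-behaviour(gen_server).

-export([start_link/1, enter/1, get_arg/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Returns as soon as the process is spawned. Unlike proc_lib:start_link,
%% there is no ack message, so the caller (for example a simple_one_for_one
%% supervisor) does not wait for the initialization to finish.
start_link(Arg) ->
    Pid = proc_lib:spawn_link(?MODULE, enter, [Arg]),
    {ok, Pid}.

%% Runs in the new process: initialize, then enter the gen_server loop.
enter(Arg) ->
    {ok, State} = init(Arg),
    gen_server:enter_loop(?MODULE, [], State).

get_arg(Pid) ->
    gen_server:call(Pid, get_arg).

init(Arg) ->
    {ok, #{arg => Arg}}.

handle_call(get_arg, _From, #{arg := Arg} = State) ->
    {reply, Arg, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The trade-off is that a failing initialization is no longer reported through the return value of start_link/1; the caller only learns about it through the link.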
case inet:sockname(LSock) of
    {ok, {_, Port}} ->
        %% A usable, open LSock
        spawn_link(fun() ->
I realize this is probably the fault of the original design, but transferring ownership of the listen socket from the ssh user process context feels like a bad idea. Why not make the listen call in a dedicated acceptor process? In the current solution that would be in the ssh_acceptor_subsup process.
I was thinking the same thing, but see here:
Lines 599 to 613 in 9859b72
daemon(Host0, Port0, UserOptions0) when 0 =< Port0, Port0 =< 65535,
                                        Host0 == any ; Host0 == loopback ; is_tuple(Host0) ->
    try
        {Host1, UserOptions} = handle_daemon_args(Host0, UserOptions0),
        #{} = Options0 = ssh_options:handle_options(server, UserOptions),
        %% We need to open the listen socket here before start of the system supervisor. That
        %% is because Port0 might be 0, or if an FD is provided in the Options0, in which case
        %% the real listening port will be known only after the gen_tcp:listen call.
        maybe_open_listen_socket(Host1, Port0, Options0)
    of
        {Host, Port, ListenSocket, Options1} ->
            try
                %% Now Host,Port is what to use for the supervisor to register its name,
                %% and ListenSocket, if provided, is for listening on connections. But
                %% it is still owned by self()...
Hum, well, maybe we need it then. I am still not fond of making an explicit receive in the user's process like that without more control, like in gen_server:call for instance.
I'm not a huge fan of that, either... Maybe a greater refactoring effort is in order then, but I think that should involve more planning.
Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)
I think that we should be able to move away from the above approach. The concept of a provided fd is inherited from httpd_sup, and we should be able to handle it the same way, that is, the port is also specified but not opened, as it is pre-opened. And the case where the port is 0 is only for testing; I think we should rather report the chosen port back to the caller than do it the way it is done now.
We are still stuck on this one test. It just doesn't add up.
Before the structural changes in this PR, when parallel login was enabled, the acceptor would spawn_link a new process to do the authentication and handshake, which would trap exits. If the acceptor died during this, it would notice the exit via the link and stop the connection handler process again. This is what it looks like the test covers: it sends an exit to the acceptor (which is not trapping exits), then checks that there are no processes in handshake and no connection handler processes.
Notably, when parallel login was disabled, the implementation would not stop a connection handler process hanging in handshake. The acceptor process was doing the authentication and handshake by itself (i.e., not in a spawned process), and the death of the acceptor would just stop the process. The handshake might time out, which would be kinda ok, but if it succeeded later there would be an orphaned connection handler process. Notably again, this can also happen when a user starts a client via
With the changes in this PR, there are now multiple acceptors handling authentication and handshake when parallel login is enabled, each handling authentication and handshake by itself like the single acceptor in the old implementation with parallel login disabled. When the acceptor receives an exit and dies, the connection handler is orphaned the same way.
The acceptor process could be changed to trap exits and shut down connections stuck in handshake, but as the selective receive in the handshake does not have a clause, this could only be done after the handshake eventually succeeded or timed out. Adding another clause is also not a good option, since the same function is also used in the user process (which gave rise to the initial issue). So it looks like this cannot be solved without more structural changes, that is, another layer of separate processes independent from the acceptor and user processes.
But as I already said above, this gets out of hand and turns into a bigger refactoring which requires better planning. What to do? 😰 (@IngelaAndin @u3s) |
I guess the only "good enough bugfix" that we can reasonably aim for now, without going into large-scale restructuring, would be what we had in the beginning: Revert the structural changes we have now (bring them back later), instead keep the spawn_linking of a process by the acceptor, but nail the receive of the |
@juhlig @Maria-12648430 We discussed it at a meeting today, and we feel that we would rather avoid making a good-enough bug fix, as it might introduce new strangeness. We would rather continue working on a rewrite in the OTP-27 track that could possibly be backported later when it has proven itself.
In general, I think all processes should be under a supervisor unless they are really temporary and simple, so we know they will terminate. An example of that is start_connection_tree in tls_gen_connection. This is needed for the dynamic TLS supervisor tree with a significant child to work correctly, as otherwise it could become inconsistent if the user process dies in the middle of starting. If we need to synchronize something with the user's process and it is not through a behavior call, it should be tagged with the pid of the process we are synchronizing with. I think the transport accept call should be done by an acceptor process, and the handshake should be done by a new connection handler process. Timeouts should be handled on the server side of things. All non-temporary processes should be implemented as behaviors. And as discussed before, I think we can avoid the user process opening listen sockets.
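As a minimal sketch of the tagging idea above (not code from ssh or ssl): a receive that only matches messages tagged with the pid of the process being synchronized with leaves every other message, including 'EXIT' messages of a user process that traps exits, in the mailbox. The module tagged_sync and its function are hypothetical.

```erlang
-module(tagged_sync).
-export([start_and_wait/1]).

%% Spawns a helper and waits for its ready message. The receive pattern is
%% bound to the helper's pid, so unrelated messages in the caller's mailbox
%% (for example {'EXIT', OtherPid, Reason}) are not consumed.
start_and_wait(Timeout) ->
    Caller = self(),
    Pid = spawn(fun() -> Caller ! {self(), started} end),
    receive
        {Pid, started} ->
            {ok, Pid}
    after Timeout ->
        {error, timeout}
    end.
```

A behavior call such as gen_server:call achieves the same effect with a unique reference instead of a pid.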
Ok @IngelaAndin, I agree 🙂 Shall we close this PR then? |
@Maria-12648430 sure, we can close it, and we are hoping you replace it with a new and even better one :) We really appreciate your help with this. We are kind of swamped right now in the security protocol area, so to speak.
Ok then, I'll dig into the code to better understand the status quo, see what I can do, and come back when I have something to show 😃 |
Fixes #8223.
I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅