Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ssh: Fix stealing of 'EXIT' messages if a client is trapping exits #8226

Conversation

Maria-12648430
Copy link
Contributor

Fixes #8223.

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅

Copy link
Contributor

github-actions bot commented Mar 5, 2024

CT Test Results

    2 files     29 suites   17m 19s ⏱️
  460 tests   454 ✅  5 💤 1 ❌
1 665 runs  1 639 ✅ 25 💤 1 ❌

For more details on these failures, see this check.

Results for commit a6482bd.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@u3s u3s self-assigned this Mar 5, 2024
@u3s u3s added the team:PS Assigned to OTP team PS label Mar 5, 2024
@u3s
Copy link
Contributor

u3s commented Mar 5, 2024

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅

I'm glad not to be the only one 😅

  • in case of doubts, I would pick ssh_connection_SUITE as it seems a safe choice as everything in ssh is based on some connections ;-)

@Maria-12648430
Copy link
Contributor Author

Added two tests (one for ssh:connect and another for ssh:daemon) in the last commit.

@Maria-12648430 Maria-12648430 marked this pull request as draft March 6, 2024 16:05

handle_connection(Address, Port, _Peer, Options, Socket, _MaxSessions, _NumSessions, ParallelLogin)
when ParallelLogin == true ->
Parent = self(),
Ref = make_ref(),
Pid = spawn_link(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spawning this process here that is not part of the supervisor tree and linking it to the user process I think is problematic.


takeover(ConnPid, _, Socket, Options) ->
takeover(ParentPid, ConnPid, _, Socket, Options) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we will be changing controlling process a second time. I think we should write the code in such a way that we only need to do it once. The acceptor process that should be in the supervisor tree should create the socket and start the appropriate subsystem dynamic supervisor tree and retrieve the pid of the connection_handler procsee and transfer the ownership once.

@IngelaAndin
Copy link
Contributor

@Maria-12648430 I add some thoughts on how I think it should work.

@IngelaAndin
Copy link
Contributor

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

@Maria-12648430
Copy link
Contributor Author

@Maria-12648430 I add some thoughts on how I think it should work.

Thanks @IngelaAndin, I will take a closer look next week, I'm currently on a short vacation 🙂

@IngelaAndin IngelaAndin added the waiting waiting for changes/input from author label Mar 19, 2024
@Maria-12648430
Copy link
Contributor Author

@IngelaAndin your suggestions make sense to me and I think we should go that way. However, this makes it a larger refactoring (vs a rather simple bugfix) which will take some time to work out.

@IngelaAndin
Copy link
Contributor

@Maria-12648430 no problem Maria, I think it is important to fix things the "tm correct way" because otherwise it is mainly a question of time until some new problem pops up. I believe in fixing the problem not only shutting the symptoms up. We very much apricate you taking the time, as we alas not always have enough resources to fix everything as swiftly as we may desire.
It was an unfortunate mistake I think this design, that probably slipped through as we might have missed to review some code, but code reviewing activities are also a tradeoff between other things that needs doing, so we just have to deal with it.

@u3s
Copy link
Contributor

u3s commented Mar 20, 2024

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false?

@u3s
Copy link
Contributor

u3s commented Mar 21, 2024

I agree with suggestions from Ingela.
When browsing code I also found code line which is probably redundant.

user_pid option is added twice - probably safe to remove (tm) ;-) if turns out related to this work.
ssh_acceptor:handle_connection/4 invokes ?PUT_INTERNAL_OPT([{user_pid, which is then repeated in ssh_system_sup:start_subsystem/4

I think instruction in ssh_acceptor could be removed.

end.

%%--------------------------------------------------------------------
%% Issue #8223
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you write GH-8223 it will be clear you're referring to github issue.

@Maria-12648430
Copy link
Contributor Author

This is pretty hard stuff 😰 Everything happens all over the place and gets pushed around to everywhere else, or so it seems 🤪 This will take a while to sort out and not break anything.

@IngelaAndin
Copy link
Contributor

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false?
I guess it makes little sense for parallel_login false! I think this option makes little sense at all. I think you could always have parallel login and put the limit to one to get false behavior.

@Maria-12648430
Copy link
Contributor Author

Maria-12648430 commented Mar 25, 2024

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

We put another simple_one_for_one supervisor ssh_acceptor_subsup between ssh_acceptor_sup and ssh_acceptor. It sets up the listening socket for a daemon (this was formerly in ssh_acceptor). The actual acceptor processes are children of this supervisor.

We also introduced a new value for the parallel_login option in addition to true and false which can be a positive integer.

The overall behavior is as follows, depending on the value given for parallel_login:

  • If parallel_login is a number N (false is synonymous to N=1), that number of ssh_acceptor children are started initially. Each of those acceptors runs in a loop that waits for a connection, handles it, then wraps around to waiting for the next connection. This way, up to N login attempts can be handled simultaneously; if there are more than N simultaneous login attempts, they have to wait.
  • If parallel_login is true (in the sense of an unlimited number of simultaneous login attempts), only 1 ssh_acceptor child is started initially. As soon as this acceptor gets a connection, it tells the supervisor to start another acceptor before handling the connection and finally exiting (instead of looping around to wait for another connection). The freshly started acceptor will behave in the same way.

The outlined behavior introduces no backwards incompatibilities. parallel_login values false and true remain with the same outward behavior, and the newly introduced value N provides a good "middle ground" between sequential accepting and unlimited all-out parallelism.

Moreover, the approach solves the issue (#8223) that gave raise to this PR, and as far as we can tell incorporates the suggestions made before.

There is one failing test, though: ssh_protocol_SUITE:client_close_after_hello/1 (there may be more, some are being skipped on my machine, don't know about those). I don't really grasp what this one is supposed to be testing, and if it still applies with the changes made. I would be grateful if you could have a look at it.

@Maria-12648430 Maria-12648430 marked this pull request as ready for review March 25, 2024 11:33
Co-authored-by: Jan Uhlig <juhlig@hnc-agency.org>

listen(Port, Options) ->
{_, Callback, _} = ?GET_OPT(transport, Options),
SockOpts = ?GET_OPT(socket_options, Options) ++ [{active, false}, {reuseaddr, true}],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was [{active, false}, {reuseaddr,true} | ?GET_OPT(socket_options, Options)] in ssh_acceptor, which is wrong. For one, it would prevent the usage of option inet_backend since this option must be the first in the list. For another, in options given to gen_tcp or ssl, the last option wins, such that if for example {active, true} was in the list of given socket options, it would be set to {active, true} despite the tacked-on {active, false}.

@u3s
Copy link
Contributor

u3s commented Apr 3, 2024

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

Thanks a lot for work. We planned code review and will get back.

@IngelaAndin IngelaAndin removed the waiting waiting for changes/input from author label Apr 9, 2024
@IngelaAndin
Copy link
Contributor

I think the test looks like a white-box test, and probably it trying to test that there is no process leak if the client gives up the connection after sending its hello message. Probably it should be checked differently as you have changed the supervisor tree.

I like the way that you describe in the documentation what should happen. I have not full understood the code yet. I think I need to look at it as whole the diff is not enough.

@Maria-12648430
Copy link
Contributor Author

@IngelaAndin looking forward to your review 🙂 I'll see what I can do about that test next week.

Btw, I got a notification by email about a comment from you regarding proc_lib:spawn_link vs proc_lib:start_link+proc_lib:init_ack in ssh_acceptor, but strangely I can't find it here in Github 🤔

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Apr 10, 2024

@Maria-12648430 yes it was because I thought maybe my comment was premature, so I deleted it for the moment. It is true that processes under a simple_one_for_one supervisor would benefit from using proc_lib:spawn_link and then use behaviour:enter_loop instead of proc_lib:start_link as they are independent of each other and do not need to wait for each others init, so we can skip send and receiving of the ack message. Question is, if I had put this comment in the right place. So now I just made a general comment instead.

case inet:sockname(LSock) of
{ok, {_, Port}} ->
%% A usable, open LSock
spawn_link(fun() ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is probably the fault of the original design, but transferring ownership of the listen socket from ssh user process context feels like a bad idea. Why not make the listen call in a dedicated acceptor process? In current solution that would be in the ssh_acceptor_subsup process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same thing, but see here:

otp/lib/ssh/src/ssh.erl

Lines 599 to 613 in 9859b72

daemon(Host0, Port0, UserOptions0) when 0 =< Port0, Port0 =< 65535,
Host0 == any ; Host0 == loopback ; is_tuple(Host0) ->
try
{Host1, UserOptions} = handle_daemon_args(Host0, UserOptions0),
#{} = Options0 = ssh_options:handle_options(server, UserOptions),
%% We need to open the listen socket here before start of the system supervisor. That
%% is because Port0 might be 0, or if an FD is provided in the Options0, in which case
%% the real listening port will be known only after the gen_tcp:listen call.
maybe_open_listen_socket(Host1, Port0, Options0)
of
{Host, Port, ListenSocket, Options1} ->
try
%% Now Host,Port is what to use for the supervisor to register its name,
%% and ListenSocket, if provided, is for listening on connections. But
%% it is still owned by self()...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, well maybe we need it then. I am still not fond of making an explicit receive in the users process like that without more control like in gen_server:call for instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of that, either... Maybe a greater refactoring effort is in order then, but I think that should involve more planning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should be able to move away from the above approach. Th concept of fd is provided is inherited from httpd_sup and we should be able to handle it like that, that is port is also specified but not opened as it is preopend. And the case of 0 is the port is only for testing and I think we should rather report the the chosen port back to the caller then the way it is done now.

@Maria-12648430
Copy link
Contributor Author

Maria-12648430 commented Apr 15, 2024

We are still stuck on this one test. It just doesn't add up.

Before the structural changes in this PR, when parallel login was enabled, the acceptor would spawn_link a new process to do the authentication and handshake, which would trap exits. If the acceptor died during this, it would notice the exit via the link and stop the connection handler process again. This is what it looks like the test covers: it sends an exit to the acceptor (which is not trapping exits), then checks that there are no processes in handshake and no connection handler processes.

Notably, when parallel login was disabled, the implementation would not stop a connection handler process hanging in handshake. The acceptor process was doing the authentication and handshake by itself (ie, not in a spawned process), and the death of the acceptor would just stop the process. The handshake might time out, that would be kinda ok, but if it succeeded later there would be an orphaned connection handler process.

Notably again, this can also happen when a user starts a client via connect. If the client dies when the connection is in handshake, the connection handler process will not be stopped.

With the changes in this PR, there are now multiple acceptors handling authorization and handshake when parallel login is enabled, each handling authentication and handshake by itself like the single acceptor in the old implementation with parallel login disabled. When the acceptor receives an exit and dies, the connection handler is orphaned the same way.

The acceptor process could be changed to trap exits and shut down connections stuck in handshake, but as the selective receive in the handshake does not have a clause, this could only be done after the handshake eventually succeeded or timed out. Adding another clause is also not a good option since the same function is also used in the user process (which gave raise to the initial issue). So it looks like this cannot be solved without more structural changes, that is, another layer of separate processes independent from the acceptor and user processes. But as I already said above, this gets out of hand and turns into a bigger refactoring which requires better planning. What to do? 😰 (@IngelaAndin @u3s)

@juhlig
Copy link
Contributor

juhlig commented Apr 15, 2024

What to do?

Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)

I guess the only "good enough bugfix" that we can reasonably aim for now, without going into large-scale restructuring, would be what we had in the beginning: Revert the structural changes we have now (bring them back later), instead keep the spawn_linking of a process by the acceptor, but nail the receive of the 'EXIT' in handshake down to the acceptor pid 🤷‍♂️

@IngelaAndin
Copy link
Contributor

@juhlig @Maria-12648430 We discussed it at meeting today and we feel that we will try avoiding making a good enough bug-fix as it might introduce new strangeness. We rather like to continue working on a rewrite in the OTP-27 track that possible could be backported later when it has proven itself.

@IngelaAndin
Copy link
Contributor

In general I think all process should be under a supervisor unless they are really temporary and simple so we know they will terminate. An example of that is start_connection_tree in tls_gen_connection. This is needed for the dynaminc TLS supervisor tree with a significant child to work correctly as otherwise it could become inconsistent if the user process dies in the middle of starting.

If we need to synchronize something with users process and it is not trough a behavior call it should be tagged with the pid of the process we are synchronizing with.

I think the transport accept call should be done by an acceptor process and the handshake shall be done by a new connection handler process.

Timeouts should be handled on the server side of things.

All non temporary process should be implemented as behaviors.

And as discussed before I think we can avoid user process opening listen sockets.

@Maria-12648430
Copy link
Contributor Author

Ok @IngelaAndin, I agree 🙂 Shall we close this PR then?

@IngelaAndin
Copy link
Contributor

@Maria-12648430 sure we can close it and we are hoping you replace it with a new and even better one :) We really apricate your help with this. We are kind of swamped right now in the security protocol area so to speak.

@Maria-12648430
Copy link
Contributor Author

Ok then, I'll dig into the code to better understand the status quo, see what I can do, and come back when I have something to show 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:PS Assigned to OTP team PS
Projects
None yet
4 participants