`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits #8226

Maria-12648430 · 2024-03-05T14:12:42Z

Fixes #8223.

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅

github-actions · 2024-03-05T14:14:16Z

CT Test Results

2 files 29 suites 17m 19s ⏱️
460 tests 454 ✅ 5 💤 1 ❌
1 665 runs 1 639 ✅ 25 💤 1 ❌

For more details on these failures, see this check.

Results for commit a6482bd.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

u3s · 2024-03-05T15:01:47Z

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅

I'm glad not to be the only one 😅

in case of doubts, I would pick ssh_connection_SUITE as it seems a safe choice as everything in ssh is based on some connections ;-)

Maria-12648430 · 2024-03-06T13:08:34Z

Added two tests (one for ssh:connect and another for ssh:daemon) in the last commit.

IngelaAndin · 2024-03-14T16:09:04Z

lib/ssh/src/ssh_acceptor.erl


 handle_connection(Address, Port, _Peer, Options, Socket, _MaxSessions, _NumSessions, ParallelLogin)
  when ParallelLogin == true ->
+    Parent = self(),
    Ref = make_ref(),
    Pid = spawn_link(


Spawning this process here that is not part of the supervisor tree and linking it to the user process I think is problematic.

IngelaAndin · 2024-03-15T09:23:12Z

lib/ssh/src/ssh_connection_handler.erl


-takeover(ConnPid, _, Socket, Options) ->
+takeover(ParentPid, ConnPid, _, Socket, Options) ->


Here we will be changing controlling process a second time. I think we should write the code in such a way that we only need to do it once. The acceptor process that should be in the supervisor tree should create the socket and start the appropriate subsystem dynamic supervisor tree and retrieve the pid of the connection_handler procsee and transfer the ownership once.

IngelaAndin · 2024-03-15T09:42:37Z

@Maria-12648430 I add some thoughts on how I think it should work.

IngelaAndin · 2024-03-15T09:51:23Z

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Maria-12648430 · 2024-03-16T08:13:21Z

@Maria-12648430 I add some thoughts on how I think it should work.

Thanks @IngelaAndin, I will take a closer look next week, I'm currently on a short vacation 🙂

Maria-12648430 · 2024-03-20T07:54:50Z

@IngelaAndin your suggestions make sense to me and I think we should go that way. However, this makes it a larger refactoring (vs a rather simple bugfix) which will take some time to work out.

IngelaAndin · 2024-03-20T09:19:08Z

@Maria-12648430 no problem Maria, I think it is important to fix things the "tm correct way" because otherwise it is mainly a question of time until some new problem pops up. I believe in fixing the problem not only shutting the symptoms up. We very much apricate you taking the time, as we alas not always have enough resources to fix everything as swiftly as we may desire.
It was an unfortunate mistake I think this design, that probably slipped through as we might have missed to review some code, but code reviewing activities are also a tradeoff between other things that needs doing, so we just have to deal with it.

u3s · 2024-03-20T15:53:30Z

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false?

u3s · 2024-03-21T10:56:28Z

I agree with suggestions from Ingela.
When browsing code I also found code line which is probably redundant.

user_pid option is added twice - probably safe to remove (tm) ;-) if turns out related to this work.
ssh_acceptor:handle_connection/4 invokes ?PUT_INTERNAL_OPT([{user_pid, which is then repeated in ssh_system_sup:start_subsystem/4

I think instruction in ssh_acceptor could be removed.

u3s · 2024-03-21T14:12:35Z

lib/ssh/test/ssh_connection_SUITE.erl

+    end.
+
+%%--------------------------------------------------------------------
+%% Issue #8223


if you write GH-8223 it will be clear you're referring to github issue.

Maria-12648430 · 2024-03-21T16:05:18Z

This is pretty hard stuff 😰 Everything happens all over the place and gets pushed around to everywhere else, or so it seems 🤪 This will take a while to sort out and not break anything.

IngelaAndin · 2024-03-22T16:45:11Z

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false?
I guess it makes little sense for parallel_login false! I think this option makes little sense at all. I think you could always have parallel login and put the limit to one to get false behavior.

Maria-12648430 · 2024-03-25T11:33:12Z

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

We put another simple_one_for_one supervisor ssh_acceptor_subsup between ssh_acceptor_sup and ssh_acceptor. It sets up the listening socket for a daemon (this was formerly in ssh_acceptor). The actual acceptor processes are children of this supervisor.

We also introduced a new value for the parallel_login option in addition to true and false which can be a positive integer.

The overall behavior is as follows, depending on the value given for parallel_login:

If parallel_login is a number N (false is synonymous to N=1), that number of ssh_acceptor children are started initially. Each of those acceptors runs in a loop that waits for a connection, handles it, then wraps around to waiting for the next connection. This way, up to N login attempts can be handled simultaneously; if there are more than N simultaneous login attempts, they have to wait.
If parallel_login is true (in the sense of an unlimited number of simultaneous login attempts), only 1 ssh_acceptor child is started initially. As soon as this acceptor gets a connection, it tells the supervisor to start another acceptor before handling the connection and finally exiting (instead of looping around to wait for another connection). The freshly started acceptor will behave in the same way.

The outlined behavior introduces no backwards incompatibilities. parallel_login values false and true remain with the same outward behavior, and the newly introduced value N provides a good "middle ground" between sequential accepting and unlimited all-out parallelism.

Moreover, the approach solves the issue (#8223) that gave raise to this PR, and as far as we can tell incorporates the suggestions made before.

There is one failing test, though: ssh_protocol_SUITE:client_close_after_hello/1 (there may be more, some are being skipped on my machine, don't know about those). I don't really grasp what this one is supposed to be testing, and if it still applies with the changes made. I would be grateful if you could have a look at it.

Co-authored-by: Jan Uhlig <juhlig@hnc-agency.org>

Maria-12648430 · 2024-03-25T13:21:12Z

lib/ssh/src/ssh_acceptor_subsup.erl

+
+listen(Port, Options) ->
+    {_, Callback, _} = ?GET_OPT(transport, Options),
+    SockOpts = ?GET_OPT(socket_options, Options) ++ [{active, false}, {reuseaddr, true}],


This was [{active, false}, {reuseaddr,true} | ?GET_OPT(socket_options, Options)] in ssh_acceptor, which is wrong. For one, it would prevent the usage of option inet_backend since this option must be the first in the list. For another, in options given to gen_tcp or ssl, the last option wins, such that if for example {active, true} was in the list of given socket options, it would be set to {active, true} despite the tacked-on {active, false}.

u3s · 2024-04-03T07:51:29Z

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

Thanks a lot for work. We planned code review and will get back.

IngelaAndin · 2024-04-09T14:56:51Z

I think the test looks like a white-box test, and probably it trying to test that there is no process leak if the client gives up the connection after sending its hello message. Probably it should be checked differently as you have changed the supervisor tree.

I like the way that you describe in the documentation what should happen. I have not full understood the code yet. I think I need to look at it as whole the diff is not enough.

Maria-12648430 · 2024-04-10T06:35:05Z

@IngelaAndin looking forward to your review 🙂 I'll see what I can do about that test next week.

Btw, I got a notification by email about a comment from you regarding proc_lib:spawn_link vs proc_lib:start_link+proc_lib:init_ack in ssh_acceptor, but strangely I can't find it here in Github 🤔

IngelaAndin · 2024-04-10T07:05:24Z

@Maria-12648430 yes it was because I thought maybe my comment was premature, so I deleted it for the moment. It is true that processes under a simple_one_for_one supervisor would benefit from using proc_lib:spawn_link and then use behaviour:enter_loop instead of proc_lib:start_link as they are independent of each other and do not need to wait for each others init, so we can skip send and receiving of the ack message. Question is, if I had put this comment in the right place. So now I just made a general comment instead.

IngelaAndin · 2024-04-10T09:04:11Z

lib/ssh/src/ssh_acceptor_subsup.erl

+            case inet:sockname(LSock) of
+                {ok, {_, Port}} ->
+                    %% A usable, open LSock
+                    spawn_link(fun() ->


I realize this is probably the fault of the original design, but transferring ownership of the listen socket from ssh user process context feels like a bad idea. Why not make the listen call in a dedicated acceptor process? In current solution that would be in the ssh_acceptor_subsup process.

I was thinking the same thing, but see here:

otp/lib/ssh/src/ssh.erl

Lines 599 to 613 in 9859b72

daemon(Host0, Port0, UserOptions0) when 0 =< Port0, Port0 =< 65535,

Host0 == any ; Host0 == loopback ; is_tuple(Host0) ->

try

{Host1, UserOptions} = handle_daemon_args(Host0, UserOptions0),

#{} = Options0 = ssh_options:handle_options(server, UserOptions),

%% We need to open the listen socket here before start of the system supervisor. That

%% is because Port0 might be 0, or if an FD is provided in the Options0, in which case

%% the real listening port will be known only after the gen_tcp:listen call.

maybe_open_listen_socket(Host1, Port0, Options0)

of

{Host, Port, ListenSocket, Options1} ->

try

%% Now Host,Port is what to use for the supervisor to register its name,

%% and ListenSocket, if provided, is for listening on connections. But

%% it is still owned by self()...

Hum, well maybe we need it then. I am still not fond of making an explicit receive in the users process like that without more control like in gen_server:call for instance.

I'm not a huge fan of that, either... Maybe a greater refactoring effort is in order then, but I think that should involve more planning.

Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)

I think that we should be able to move away from the above approach. Th concept of fd is provided is inherited from httpd_sup and we should be able to handle it like that, that is port is also specified but not opened as it is preopend. And the case of 0 is the port is only for testing and I think we should rather report the the chosen port back to the caller then the way it is done now.

lib/ssh/src/ssh_acceptor_subsup.erl

Maria-12648430 · 2024-04-15T14:38:14Z

We are still stuck on this one test. It just doesn't add up.

Before the structural changes in this PR, when parallel login was enabled, the acceptor would spawn_link a new process to do the authentication and handshake, which would trap exits. If the acceptor died during this, it would notice the exit via the link and stop the connection handler process again. This is what it looks like the test covers: it sends an exit to the acceptor (which is not trapping exits), then checks that there are no processes in handshake and no connection handler processes.

Notably, when parallel login was disabled, the implementation would not stop a connection handler process hanging in handshake. The acceptor process was doing the authentication and handshake by itself (ie, not in a spawned process), and the death of the acceptor would just stop the process. The handshake might time out, that would be kinda ok, but if it succeeded later there would be an orphaned connection handler process.

Notably again, this can also happen when a user starts a client via connect. If the client dies when the connection is in handshake, the connection handler process will not be stopped.

With the changes in this PR, there are now multiple acceptors handling authorization and handshake when parallel login is enabled, each handling authentication and handshake by itself like the single acceptor in the old implementation with parallel login disabled. When the acceptor receives an exit and dies, the connection handler is orphaned the same way.

The acceptor process could be changed to trap exits and shut down connections stuck in handshake, but as the selective receive in the handshake does not have a clause, this could only be done after the handshake eventually succeeded or timed out. Adding another clause is also not a good option since the same function is also used in the user process (which gave raise to the initial issue). So it looks like this cannot be solved without more structural changes, that is, another layer of separate processes independent from the acceptor and user processes. But as I already said above, this gets out of hand and turns into a bigger refactoring which requires better planning. What to do? 😰 (@IngelaAndin @u3s)

juhlig · 2024-04-15T14:57:09Z

What to do?

Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)

I guess the only "good enough bugfix" that we can reasonably aim for now, without going into large-scale restructuring, would be what we had in the beginning: Revert the structural changes we have now (bring them back later), instead keep the spawn_linking of a process by the acceptor, but nail the receive of the 'EXIT' in handshake down to the acceptor pid 🤷‍♂️

IngelaAndin · 2024-04-16T13:28:53Z

@juhlig @Maria-12648430 We discussed it at meeting today and we feel that we will try avoiding making a good enough bug-fix as it might introduce new strangeness. We rather like to continue working on a rewrite in the OTP-27 track that possible could be backported later when it has proven itself.

IngelaAndin · 2024-04-17T07:12:52Z

In general I think all process should be under a supervisor unless they are really temporary and simple so we know they will terminate. An example of that is start_connection_tree in tls_gen_connection. This is needed for the dynaminc TLS supervisor tree with a significant child to work correctly as otherwise it could become inconsistent if the user process dies in the middle of starting.

If we need to synchronize something with users process and it is not trough a behavior call it should be tagged with the pid of the process we are synchronizing with.

I think the transport accept call should be done by an acceptor process and the handshake shall be done by a new connection handler process.

Timeouts should be handled on the server side of things.

All non temporary process should be implemented as behaviors.

And as discussed before I think we can avoid user process opening listen sockets.

Maria-12648430 · 2024-04-17T14:16:06Z

Ok @IngelaAndin, I agree 🙂 Shall we close this PR then?

IngelaAndin · 2024-04-17T15:17:27Z

@Maria-12648430 sure we can close it and we are hoping you replace it with a new and even better one :) We really apricate your help with this. We are kind of swamped right now in the security protocol area so to speak.

Maria-12648430 · 2024-04-18T13:06:10Z

Ok then, I'll dig into the code to better understand the status quo, see what I can do, and come back when I have something to show 😃

u3s self-assigned this Mar 5, 2024

u3s added the team:PS Assigned to OTP team PS label Mar 5, 2024

Maria-12648430 mentioned this pull request Mar 6, 2024

In OTP 26, Erlang's SSH client fails to connect if it receives an {'EXIT', Pid, normal} message from an unrelated process. #8223

Open

Maria-12648430 force-pushed the fix_ssh_connect_trapping_exits branch from 36295ec to ea38c67 Compare March 6, 2024 13:07

Maria-12648430 marked this pull request as draft March 6, 2024 16:05

IngelaAndin reviewed Mar 14, 2024

View reviewed changes

IngelaAndin reviewed Mar 15, 2024

View reviewed changes

IngelaAndin added the waiting waiting for changes/input from author label Mar 19, 2024

u3s reviewed Mar 21, 2024

View reviewed changes

Maria-12648430 force-pushed the fix_ssh_connect_trapping_exits branch from ea38c67 to 6e82e50 Compare March 25, 2024 11:02

Maria-12648430 marked this pull request as ready for review March 25, 2024 11:33

Maria-12648430 force-pushed the fix_ssh_connect_trapping_exits branch from 6e82e50 to dc8d099 Compare March 25, 2024 12:27

ssh: Fix stealing of EXIT messages if a client is trapping exits

fd5a27d

Co-authored-by: Jan Uhlig <juhlig@hnc-agency.org>

Maria-12648430 force-pushed the fix_ssh_connect_trapping_exits branch from dc8d099 to fd5a27d Compare March 25, 2024 12:30

Maria-12648430 commented Mar 25, 2024

View reviewed changes

IngelaAndin removed the waiting waiting for changes/input from author label Apr 9, 2024

u3s mentioned this pull request Apr 9, 2024

ssh: start monitor ConnPid before casting socket_control #8310

Open

IngelaAndin reviewed Apr 10, 2024

View reviewed changes

lib/ssh/src/ssh_acceptor_subsup.erl Outdated Show resolved Hide resolved

Implement suggestions from review

a6482bd

Maria-12648430 closed this Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits #8226

`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits #8226

Maria-12648430 commented Mar 5, 2024

github-actions bot commented Mar 5, 2024 •

edited

u3s commented Mar 5, 2024

Maria-12648430 commented Mar 6, 2024

IngelaAndin Mar 14, 2024

IngelaAndin Mar 15, 2024

IngelaAndin commented Mar 15, 2024

IngelaAndin commented Mar 15, 2024

Maria-12648430 commented Mar 16, 2024

Maria-12648430 commented Mar 20, 2024

IngelaAndin commented Mar 20, 2024

u3s commented Mar 20, 2024

u3s commented Mar 21, 2024

u3s Mar 21, 2024

Maria-12648430 commented Mar 21, 2024

IngelaAndin commented Mar 22, 2024

Maria-12648430 commented Mar 25, 2024 •

edited

Maria-12648430 Mar 25, 2024 •

edited

u3s commented Apr 3, 2024

IngelaAndin commented Apr 9, 2024

Maria-12648430 commented Apr 10, 2024

IngelaAndin commented Apr 10, 2024 •

edited

IngelaAndin Apr 10, 2024

Maria-12648430 Apr 11, 2024

IngelaAndin Apr 12, 2024

Maria-12648430 Apr 15, 2024 •

edited

IngelaAndin Apr 15, 2024

IngelaAndin Apr 16, 2024 •

edited

Maria-12648430 commented Apr 15, 2024 •

edited

juhlig commented Apr 15, 2024

IngelaAndin commented Apr 16, 2024

IngelaAndin commented Apr 17, 2024

Maria-12648430 commented Apr 17, 2024

IngelaAndin commented Apr 17, 2024

Maria-12648430 commented Apr 18, 2024


		takeover(ConnPid, _, Socket, Options) ->
		takeover(ParentPid, ConnPid, _, Socket, Options) ->

	daemon(Host0, Port0, UserOptions0) when 0 =< Port0, Port0 =< 65535,
	Host0 == any ; Host0 == loopback ; is_tuple(Host0) ->
	try
	{Host1, UserOptions} = handle_daemon_args(Host0, UserOptions0),
	#{} = Options0 = ssh_options:handle_options(server, UserOptions),
	%% We need to open the listen socket here before start of the system supervisor. That
	%% is because Port0 might be 0, or if an FD is provided in the Options0, in which case
	%% the real listening port will be known only after the gen_tcp:listen call.
	maybe_open_listen_socket(Host1, Port0, Options0)
	of
	{Host, Port, ListenSocket, Options1} ->
	try
	%% Now Host,Port is what to use for the supervisor to register its name,
	%% and ListenSocket, if provided, is for listening on connections. But
	%% it is still owned by self()...

ssh: Fix stealing of 'EXIT' messages if a client is trapping exits #8226

ssh: Fix stealing of 'EXIT' messages if a client is trapping exits #8226

Conversation

Maria-12648430 commented Mar 5, 2024

github-actions bot commented Mar 5, 2024 • edited

CT Test Results

Artifacts

u3s commented Mar 5, 2024

Maria-12648430 commented Mar 6, 2024

IngelaAndin Mar 14, 2024

Choose a reason for hiding this comment

IngelaAndin Mar 15, 2024

Choose a reason for hiding this comment

IngelaAndin commented Mar 15, 2024

IngelaAndin commented Mar 15, 2024

Maria-12648430 commented Mar 16, 2024

Maria-12648430 commented Mar 20, 2024

IngelaAndin commented Mar 20, 2024

u3s commented Mar 20, 2024

u3s commented Mar 21, 2024

u3s Mar 21, 2024

Choose a reason for hiding this comment

Maria-12648430 commented Mar 21, 2024

IngelaAndin commented Mar 22, 2024

Maria-12648430 commented Mar 25, 2024 • edited

Maria-12648430 Mar 25, 2024 • edited

Choose a reason for hiding this comment

u3s commented Apr 3, 2024

IngelaAndin commented Apr 9, 2024

Maria-12648430 commented Apr 10, 2024

IngelaAndin commented Apr 10, 2024 • edited

IngelaAndin Apr 10, 2024

Choose a reason for hiding this comment

Maria-12648430 Apr 11, 2024

Choose a reason for hiding this comment

IngelaAndin Apr 12, 2024

Choose a reason for hiding this comment

Maria-12648430 Apr 15, 2024 • edited

Choose a reason for hiding this comment

IngelaAndin Apr 15, 2024

Choose a reason for hiding this comment

IngelaAndin Apr 16, 2024 • edited

Choose a reason for hiding this comment

Maria-12648430 commented Apr 15, 2024 • edited

juhlig commented Apr 15, 2024

IngelaAndin commented Apr 16, 2024

IngelaAndin commented Apr 17, 2024

Maria-12648430 commented Apr 17, 2024

IngelaAndin commented Apr 17, 2024

Maria-12648430 commented Apr 18, 2024

`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits #8226

`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits #8226

github-actions bot commented Mar 5, 2024 •

edited

Maria-12648430 commented Mar 25, 2024 •

edited

Maria-12648430 Mar 25, 2024 •

edited

IngelaAndin commented Apr 10, 2024 •

edited

Maria-12648430 Apr 15, 2024 •

edited

IngelaAndin Apr 16, 2024 •

edited

Maria-12648430 commented Apr 15, 2024 •

edited