Skip to content

Commit

Permalink
crimson/net: allow a Message to be resent.
Browse files Browse the repository at this point in the history
Recetly the `crimson::mon::Client` has started resending
messages when session is being reopened which may happen
during recovery from a network issue.

A crash documented below has been observed in teuthology
testing. It looks the fixes in `mon::Client` unveiled
a problem in the messenger -- it assumes that a `Message`
instance shall not be sent twice. This stays in conflict
with the behaviour of `mon::Client` about e.g. `MMonCommand`.

```
INFO  2021-03-02 14:29:01,192 [shard 0] monc - handle_mon_map: renewed tickets
DEBUG 2021-03-02 14:29:01,192 [shard 0] ms - [osd.2(client) v2:172.21.15.57:6804/34494@55832 >> mon.0 v2:172.21.15.57:3300/0] <== #2 === auth_reply(proto 2 0 (0) Success) v1 (18)
INFO  2021-03-02 14:29:01,192 [shard 0] monc - handle_auth_reply mon v2:172.21.15.57:6804/34494 => v2:172.21.15.57:3300/0 returns auth_reply(proto 2 0 (0) Success) v1: 0
INFO  2021-03-02 14:29:01,192 [shard 0] monc - handle_auth_reply
INFO  2021-03-02 14:29:01,192 [shard 0] monc - do_auth_single: mon v2:172.21.15.57:6804/34494 => v2:172.21.15.57:3300/0 returns auth_reply(proto 2 0 (0) Success) v1: 0
ERROR 2021-03-02 14:29:01,192 [shard 0] none - /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1345-g1dc78fd5/rpm/el8/BUILD/ceph-17.0.0-1345-g1dc78fd5/src/crimson/net/ProtocolV2.cc:1828 : In function 'crimson::net::ProtocolV2::do_sweep_messages(const std::deque<boost::intrusive_ptr<Message> >&, size_t, bool, std::optional<utime_t>, bool)::<lambda(const MessageRef&)>', ceph_assert(%s)
!msg->get_seq() && "message already has seq"
Aborting on shard 0.
Backtrace:
  0x00000000013c2bbc
  0x0000000001384d10
  0x0000000001385012
  0x00000000013850d2
  /lib64/libpthread.so.0+0x0000000000012b1f
  /lib64/libc.so.6+0x00000000000377fe
  /lib64/libc.so.6+0x0000000000021c34
  0x00000000005e2e98
  0x00000000005e2ee0
  0x0000000000dfb215
  0x0000000000def167
  0x0000000000df0854
  0x0000000000df100e
  0x0000000000df165c
  0x0000000000de97e9
  0x0000000000d8c3fd
  0x0000000000d8c5bf
  0x0000000000d85ba5
  0x0000000001381237
  0x00000000013815a2
  0x00000000013ae735
  0x000000000134b1d7
  0x0000000000661de5
  /lib64/libc.so.6+0x00000000000237b2
  0x00000000006b256d
daemon-helper: command crashed with signal 6
```

Processing the backtrace with thes`seastar-addr2line` confirms the hypothesis.

```
[Backtrace ceph#8]
ceph::__ceph_assert_fail(ceph::assert_data const&) at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/common/assert.cc:14

[Backtrace ceph#9]
operator() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/ProtocolV2.cc:1828
 (inlined by) ?? at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/bits/stl_algo.h:3876
 (inlined by) crimson::net::ProtocolV2::do_sweep_messages(std::deque<boost::intrusive_ptr<Message>, std::allocator<boost::intrusive_ptr<Message> > > const&, unsigned lo
ng, bool, std::optional<utime_t>, bool) at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/ProtocolV2.cc:1848

[Backtrace ceph#10]
operator() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:235

[Backtrace ceph#11]
crimson::net::Protocol::do_write_dispatch_sweep() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/future.hh:2135
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/loop.hh:118
 (inlined by) crimson::net::Protocol::do_write_dispatch_sweep() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:217

[Backtrace ceph#12]
operator() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:312
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/future.hh:2135
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/future.hh:2166
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/gate.hh:126
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/gate.hh:144
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/common/gated.h:38
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/common/gated.h:23
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:311
 (inlined by) crimson::net::Protocol::write_event() at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:298

[Backtrace ceph#13]
crimson::net::Protocol::send(boost::intrusive_ptr<Message>) at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/Protocol.cc:97

[Backtrace ceph#14]
crimson::net::SocketConnection::send(boost::intrusive_ptr<Message>) at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/net/SocketConnection.cc:75

[Backtrace ceph#15]
crimson::mon::Client::send_message(boost::intrusive_ptr<Message>) at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/mon/MonClient.cc:1024

[Backtrace ceph#16]
operator()<crimson::mon::Client::mon_command_t> at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/crimson/mon/MonClient.cc:1045
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/future.hh:2135
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/future.hh:2166
 (inlined by) ?? at /usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/loop.hh:549
 (inlined by) parallel_for_each_impl<std::vector<crimson::mon::Client::mon_command_t>&, crimson::mon::Client::on_session_opened()::<lambda()>::<lambda(auto:82&)> > at $
usr/src/debug/ceph-17.0.0-1345.g1dc78fd5.el8.x86_64/src/seastar/include/seastar/core/loop.hh:594
```

In classical OSD the `Message` serialization code is free
from the assertion:

```cpp
ssize_t ProtocolV2::write_message(Message *m, bool more) {
  FUNCTRACE(cct);
  ceph_assert(connection->center->in_thread());
  m->set_seq(++out_seq);
  // ...
}
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
  • Loading branch information
rzarzynski committed Mar 2, 2021
1 parent 150868a commit 7bca260
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion src/crimson/net/ProtocolV2.cc
Expand Up @@ -1825,7 +1825,9 @@ ceph::bufferlist ProtocolV2::do_sweep_messages(

msg->encode(conn.features, 0);

ceph_assert(!msg->get_seq() && "message already has seq");
if (const auto seq = msg->get_seq(); seq) {
logger().info("{} already has seq={}, reassigning", *msg, seq);
}
msg->set_seq(++conn.out_seq);

ceph_msg_header &header = msg->get_header();
Expand Down

0 comments on commit 7bca260

Please sign in to comment.