Add mechanism for recovering from disconnections in conferences #1069
zugz merged 1 commit into TokTok:master
iphydf
left a comment
I'll review the logic more in-depth later.
This can be a regular assignment.
This can be "const int" if you use a ternary expression.
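A minimal sketch of what this suggestion amounts to. The function names and the values 100 and 500 are hypothetical and are not taken from the PR. Assigning in both branches of an if/else forces the variable to stay mutable, while a single ternary initialiser lets it be declared const.

```c
#include <stdbool.h>

/* Hypothetical example, not code from the PR: the variable has to be
 * mutable because it is assigned after its declaration. */
static int delay_with_if(bool connected)
{
    int delay;

    if (connected) {
        delay = 100;
    } else {
        delay = 500;
    }

    return delay;
}

/* Same result, but a single ternary initialiser allows "const int". */
static int delay_with_ternary(bool connected)
{
    const int delay = connected ? 100 : 500;
    return delay;
}
```

Both functions return the same values; the second one simply makes it impossible to reassign the variable by accident later in the function.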
Done all. Thanks. I also made some more things const.
a14947c to 1c1ebd5
+The intention is to recover seemlessly from splits in the group, the most
seamlessly
Oops! Done.
iphydf
left a comment
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)
auto_tests/conference_test.c, line 107 at r4 (raw file):
    uint16_t disconnect_toxes[NUM_DISCONNECT];
    for (uint16_t i = 0; i < NUM_DISCONNECT; ++i) {
NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?
auto_tests/conference_test.c, line 112 at r4 (raw file):
    }
    for (uint16_t j = 0; j < 70 * 5; ++j) {
What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.
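One way to read "wait for events" is to loop until a condition holds, with an upper bound so the test still terminates if the condition is never met. The sketch below assumes hypothetical names throughout (iterate_until, step_cb, done_cb, Fake_Net); the real test would iterate Tox instances and check their connection state instead of the stand-in counter used here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback types; the real test drives Tox instances. */
typedef void step_cb(void *userdata);
typedef bool done_cb(void *userdata);

/* Run step() until done() reports success or max_steps is exhausted.
 * Returns true if the condition was met, false on timeout. */
static bool iterate_until(step_cb *step, done_cb *done, void *userdata, uint32_t max_steps)
{
    for (uint32_t i = 0; i < max_steps; ++i) {
        if (done(userdata)) {
            return true;
        }

        step(userdata);
    }

    return done(userdata);
}

/* Stand-in for a network: it counts as "all disconnected" after a
 * fixed number of steps. */
typedef struct Fake_Net {
    uint32_t steps;
    uint32_t steps_needed;
} Fake_Net;

static void fake_step(void *userdata)
{
    Fake_Net *net = (Fake_Net *)userdata;
    ++net->steps;
}

static bool fake_all_disconnected(void *userdata)
{
    const Fake_Net *net = (const Fake_Net *)userdata;
    return net->steps >= net->steps_needed;
}
```

Compared with a fixed count such as 70 * 5, a loop of this shape stops as soon as the expected state is reached, and the bound only matters in the failure case.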
auto_tests/conference_test.c, line 128 at r4 (raw file):
    }
    c_sleep(200);
I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).
auto_tests/conference_test.c, line 141 at r4 (raw file):
    printf("reconnecting toxes\n");
    for (uint16_t j = 0; j < 120 * 5; ++j) {
Should we also wait for an event here?
1fc620c to 93b6d54
I don't know.. bool disconnected[NUM_GROUP_TOX] = {0}; does the same. You may have a reason to have this explicit loop. What's the reason?
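For reference, a small sketch of the equivalence being pointed out. The value given to NUM_GROUP_TOX and the helper names are assumptions made for illustration; only the array declaration comes from the comment. In C, an initialiser list that is shorter than the array leaves the remaining elements zero-initialised, so = {0} sets every element to false.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for illustration; the real constant lives in the test. */
#define NUM_GROUP_TOX 16

/* Count how many flags are set. */
static size_t count_set(const bool *flags, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            ++count;
        }
    }

    return count;
}

/* Initialiser form: elements not listed are zero-initialised. */
static size_t set_after_initialiser(void)
{
    bool disconnected[NUM_GROUP_TOX] = {0};
    return count_set(disconnected, NUM_GROUP_TOX);
}

/* Explicit loop form: same end state, more code. */
static size_t set_after_loop(void)
{
    bool disconnected[NUM_GROUP_TOX];

    for (size_t i = 0; i < NUM_GROUP_TOX; ++i) {
        disconnected[i] = false;
    }

    return count_set(disconnected, NUM_GROUP_TOX);
}
```

The explicit loop is still needed when the array has to be reset after its declaration, which may be the reason it was written that way.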
It took me about 30 seconds to understand this loop. I might be incompetent, but more incompetent people are going to read this and not understand it within 10 seconds. Perhaps you can factor this out into a function with a name (and maybe a comment) to clarify what this does.
It might be worth asserting that this isn't 0 before decrementing. I'm conflicted, because I soon want assert to mean "provably true" and this assertion won't be easy to prove. Just something to consider.
Instead, I rearranged things to call the callbacks after we deal with g->frozen. So now we know g->numfrozen > 0 because we just checked that get_frozen_index didn't return -1. It makes more sense this way anyway.
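A rough sketch of the ordering described here, under stated assumptions: the struct layouts, the constants, the signature of get_frozen_index and the helper unfreeze_peer are all hypothetical, and only the names g->frozen, g->numfrozen and get_frozen_index appear in the thread. The point is that once the lookup has returned a valid index, the list is known to be non-empty, so the decrement cannot underflow.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes and layouts, not the definitions from group.c. */
#define PK_SIZE 32
#define MAX_FROZEN 8

typedef struct Frozen_Peer {
    uint8_t real_pk[PK_SIZE];
} Frozen_Peer;

typedef struct Group {
    Frozen_Peer frozen[MAX_FROZEN];
    uint32_t numfrozen;
} Group;

/* Return the index of the frozen peer with this key, or -1 if absent. */
static int get_frozen_index(const Group *g, const uint8_t *real_pk)
{
    for (uint32_t i = 0; i < g->numfrozen; ++i) {
        if (memcmp(g->frozen[i].real_pk, real_pk, PK_SIZE) == 0) {
            return (int)i;
        }
    }

    return -1;
}

/* Remove a peer from the frozen list; returns false if it was not there. */
static bool unfreeze_peer(Group *g, const uint8_t *real_pk)
{
    const int index = get_frozen_index(g, real_pk);

    if (index == -1) {
        return false;
    }

    /* A valid index implies numfrozen > 0, so this cannot underflow. */
    --g->numfrozen;

    if ((uint32_t)index != g->numfrozen) {
        /* Fill the gap with the last entry. */
        g->frozen[index] = g->frozen[g->numfrozen];
    }

    return true;
}
```

In this arrangement any callbacks would run after the list has been updated, so they never observe a half-updated count.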
iphydf
left a comment
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
toxcore/group.c, line 1269 at r1 (raw file):
Previously, iphydf wrote…
const
No const?
No const?
Done.
I tried running it now and ran into a few problems:
I also see the ID warning during auto_conference_test after "letting random toxes timeout".
Thanks for testing.
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1378 is too large"
I don't see what would cause this. Is it really introduced by minpgc?
core/toxlogger.cpp:57 : Warning: "WARN :network.c:networking_poll:654: [33] -- Packet has no handler"
WARNING network.c:550 sendpacket: unknown address type: 0
Interesting. I don't see either of these during the
auto_conference_test, but travis gets the latter at least. Do you get
either with master?
4. qTox segfaults on close (minpgc toxcore)
double free or corruption (out)
Aargh. That one was my fault. Thanks for finding it! Should be fixed
now.
5. Probably a qTox issue, qTox segfaults on leaving group (old toxcore, pgc group members)
Well that certainly isn't minpgc's fault - the client shouldn't segfault
whatever peers do. I can't imagine what could be causing it.
hard to say, but going back and forth I do not see it with master but do see it with this branch. I will test a bit more to make sure.
no, didn't see either with master. I think the "Packet has no handler" is actually blocking functionality, if I close and re-open my client using this branch, I don't rejoin the group, but I see those warnings instead (I think). I have some non-default network settings in client, i.e. ipv4 only, I will test full default and see if I can still repro.
Agreed, just got a repro on stock toxcore, guaranteed qTox issue.
no, didn't see either with master. [...] "Packet has no handler"
That one really makes no sense to me. It should be triggered only if you
receive a packet with an unknown packet kind. minpgc doesn't add any new
packet kinds (only a new crypto packet id), and nor can I imagine why
anything it does would solicit any non-standard packets from other
peers.
if I close and re-open my client using this branch, I don't rejoin the
group
That's expected! You need to have a friend in the group who's also
running minpgc for the persistence to work, and even then it doesn't
work if you restart the client rather than just lose your network
connection, because saving and loading isn't implemented (yet).
Ok more testing and some root causing:
I tested again stopping one of the clients in the group with gdb until they were viewed as offline instead of restarting the client, and you're right that that works. I was under the impression that the other minpgc peer would re-include the re-connecting minpgc peer, but it makes sense that the peer who left needs to maintain the state as well. In that case testing looks good from my end. I'll continue reading through the changes tomorrow.
* Wednesday, 2018-08-22 at 23:44 -0700 - Anthony Bilinski <notifications@github.com>:
4. May have been the same as 5? Saw this on stock toxcore.
Ah yes, that would make sense. Well, it made me find and fix a bug in
minpgc anyway!
5. Looks like it actually is toxcore
Yes. #1117 should fix it.
In that case testing looks good from my end. I'll continue reading
through the changes tomorrow.
Great. Thanks for testing and reading! I've just pushed a couple more
little changes, and there's one more thing I know needs doing (there's
a FIXME in the code for it), but otherwise to my knowledge it's bug-free
and working.
cb731c6 to e3441ea
c718ae0 to 14a58e4
Codecov Report
@@           Coverage Diff            @@
##           master   #1069    +/-   ##
========================================
+ Coverage    82.5%   83.1%   +0.5%
========================================
  Files          82      82
  Lines       14476   14593    +117
========================================
+ Hits        11948   12131    +183
+ Misses       2528    2462     -66
This is now ready for final review.
sudden6
left a comment
Reviewed 3 of 15 files at r6.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)
toxcore/group.c, line 1320 at r6 (raw file):
     * return -1 otherwise. */
    static int try_send_rejoin(Group_Chats *g_c, uint32_t groupnumber, const uint8_t *real_pk)
use true/false for success/failure?
toxcore/group.c, line 1898 at r6 (raw file):
    }
    static int handle_packet_rejoin(Group_Chats *g_c, int friendcon_id, const uint8_t *data, uint16_t length,
use true/false for success/failure?
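To make the two conventions concrete, here is a small sketch using hypothetical stand-in functions (none of these names or signatures are from group.c). With an int return, callers compare against a sentinel such as 0 or -1; with a bool return, the result can be used directly as a condition.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not the functions from group.c. */

/* int convention: 0 on success, -1 on failure. */
static int send_packet_int(int connection_id)
{
    if (connection_id < 0) {
        return -1;
    }

    return 0;
}

/* bool convention: true on success, false on failure. */
static bool send_packet_bool(int connection_id)
{
    return connection_id >= 0;
}

/* Call site for the int version: needs a comparison against 0. */
static bool rejoin_int(int connection_id)
{
    return send_packet_int(connection_id) == 0;
}

/* Call site for the bool version: the result is the condition. */
static bool rejoin_bool(int connection_id)
{
    return send_packet_bool(connection_id);
}
```

The bool form is easier to read at the call site, but it only pays off if neighbouring functions that share a dispatch path use the same convention.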
zugz
left a comment
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf, @zugz, and @sudden6)
auto_tests/conference_test.c, line 107 at r4 (raw file):
Previously, iphydf wrote…
NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?
Not any more.
auto_tests/conference_test.c, line 112 at r4 (raw file):
Previously, iphydf wrote…
What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.
Done.
auto_tests/conference_test.c, line 128 at r4 (raw file):
Previously, iphydf wrote…
I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).
Done.
auto_tests/conference_test.c, line 141 at r4 (raw file):
Previously, iphydf wrote…
Should we also wait for an event here?
Done.
toxcore/group.c, line 1320 at r6 (raw file):
Previously, sudden6 wrote…
use true/false for success/failure?
Done.
toxcore/group.c, line 1898 at r6 (raw file):
Previously, sudden6 wrote…
use true/false for success/failure?
Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.
sudden6
left a comment
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
toxcore/group.c, line 1898 at r6 (raw file):
Previously, zugz (zugz) wrote…
Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.
ok
robinlinden
left a comment
Reviewed 2 of 5 files at r1, 1 of 4 files at r3, 2 of 3 files at r5, 14 of 15 files at r6, 1 of 1 files at r7, 1 of 1 files at r8.
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
sudden6
left a comment
Reviewable status: complete! 2 change requests, 1 of 1 approvals obtained (waiting on @iphydf)
* add freezing and unfreezing of peers
* add rejoin packet
* revise handling of temporary invited connections
* rename "peer kill" packet to "peer leave" packet
* test rejoining in conference test
* use custom clock in conference test