{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":28449431,"defaultBranch":"master","name":"scylladb","ownerLogin":"scylladb","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2014-12-24T13:16:33.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/14364730?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1715133371.0","currentOid":""},"activityList":{"items":[{"before":"1cb959fc846ad4b16c00d447c948e78fd928df55","after":"b68c06cc3ade3e523344b20e914a0f217d3d4a80","ref":"refs/heads/branch-5.2","pushedAt":"2024-05-08T20:54:13.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"scylladb-promoter","name":null,"path":"/scylladb-promoter","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/36883350?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\nCloses #18558","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"905b8f59bd5a922a4b0bd812f150223337461bb3","after":"ed89deab408c47443408cdfd60380b5a27766322","ref":"refs/heads/branch-5.4","pushedAt":"2024-05-08T16:52:54.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"scylladb-promoter","name":null,"path":"/scylladb-promoter","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/36883350?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\nCloses scylladb/scylladb#18559","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"ec124958f6c8d1be200a09f9765fc55c42be723c","after":"70258b2e9abe9de81731426cecd1b6a04075cec6","ref":"refs/heads/next","pushedAt":"2024-05-08T14:15:09.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"build: cmake: build async_utils.cc\n\nasync_utils.cc was introduced in e1411f39, so let's\nupdate the cmake building system to build it. without\nwhich, we'd run into link failure like:\n\n```\nld.lld: error: undefined symbol: to_mutation_gently(canonical_mutation const&, seastar::lw_shared_ptr<schema const>)\n>>> referenced by storage_service.cc\n>>>               storage_service.cc.o:(service::storage_service::merge_topology_snapshot(service::raft_snapshot)) in archive service/Dev/libservice.a\n>>> referenced by group0_state_machine.cc\n>>>               group0_state_machine.cc.o:(service::write_mutations_to_database(service::storage_proxy&, gms::inet_address, std::vector<canonical_mutation, std::allocator<canonical_mutation>>)) inarchive service/Dev/libservice.a\n>>> referenced by group0_state_machine.cc\n>>>               group0_state_machine.cc.o:(service::write_mutations_to_database(service::storage_proxy&, gms::inet_address, std::vector<canonical_mutation, std::allocator<canonical_mutation>>) (.resume)) in archive service/Dev/libservice.a\n>>> referenced 1 more times\n```\n\nSigned-off-by: Kefu Chai <kefu.chai@scylladb.com>\n\nCloses scylladb/scylladb#18524","shortMessageHtmlLink":"build: cmake: build async_utils.cc"}},{"before":"7568bcdb02273f0aa419a6af8b7474cad01a60c5","after":"ec124958f6c8d1be200a09f9765fc55c42be723c","ref":"refs/heads/next","pushedAt":"2024-05-08T14:13:22.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"build: cmake: mark abseil include SYSTEM\n\nthis change is a followup of 0b0e661a. it helps to ensure that the header files in\nabseil submodule have higher priority when the compiler includes abseil headers\nwhen building with CMake.\n\nSigned-off-by: Kefu Chai <kefu.chai@scylladb.com>\n\nCloses scylladb/scylladb#18523","shortMessageHtmlLink":"build: cmake: mark abseil include SYSTEM"}},{"before":"2ba68c9b86f537386df2c60dd82f285e2a15eb7b","after":"7568bcdb02273f0aa419a6af8b7474cad01a60c5","ref":"refs/heads/next","pushedAt":"2024-05-08T14:07:20.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"db,service: fix typos in comments\n\nSigned-off-by: Kefu Chai <kefu.chai@scylladb.com>\n\nCloses scylladb/scylladb#18567","shortMessageHtmlLink":"db,service: fix typos in comments"}},{"before":"c68a0fbedad7f259a5ebf9b1ae477aff29a804a5","after":"2ba68c9b86f537386df2c60dd82f285e2a15eb7b","ref":"refs/heads/next","pushedAt":"2024-05-08T14:05:35.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"doc: add OS support in version 6.0\n\nThis commit adds OS support in version 6.0.\nIn addition, it removes the information about version 5.2, as this version is no longer supported, according to our policy.\n\nCloses scylladb/scylladb#18562","shortMessageHtmlLink":"doc: add OS support in version 6.0"}},{"before":"432d01a57021e200fad0f5b6087ad2ed2051f408","after":"c68a0fbedad7f259a5ebf9b1ae477aff29a804a5","ref":"refs/heads/next","pushedAt":"2024-05-08T13:51:20.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"doc: update Consistent Topology with Raft\n\nThis PR:\n- Removes the `.. only:: opensource` directive from Consistent Topology with Raft.\n  This feature is no longer an Open Source-only experimental feature.\n- Removes redundant version-specific information.\n- Moves the necessary version-specific information to a separate file.\n\nThis is a follow-up to https://github.com/scylladb/scylladb/pull/18285/commits/55b011902e20472ef5ea7273a28f0a8dd85cb41b.\n\nRefs https://github.com/scylladb/scylladb/pull/18285/\n\nCloses scylladb/scylladb#18553","shortMessageHtmlLink":"doc: update Consistent Topology with Raft"}},{"before":"1cb959fc846ad4b16c00d447c948e78fd928df55","after":"b68c06cc3ade3e523344b20e914a0f217d3d4a80","ref":"refs/heads/next-5.2","pushedAt":"2024-05-08T13:47:05.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\nCloses #18558","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"7de6347eb7e0607ce2eabfaca7c1d541e4828e48","after":"432d01a57021e200fad0f5b6087ad2ed2051f408","ref":"refs/heads/next","pushedAt":"2024-05-08T13:44:35.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"commitlog: Fix request_controller semaphore accounting.\n\nFixes #18488\n\nDue to the discrepancy between bytes added to CL and bytes written to disk\n(due to CRC sector overhead), we fail to account for the proper byte count\nwhen issuing account_memory_usage in allocate (using bytes added) and in\ncycle:s notify_memory_written (disk bytes written).\n\nThis leads us to slowly, but surely, add to the semaphore all the time.\nEventually rendering it useless.\n\nAlso, terminate call would _not_ take any of this into account,\nand the chunk overhead there would cause a (smaller) discrepancy\nas well.\n\nFix by simply ensuring that buffer alloc handles its byte usage,\nthen accounting based on buffer position, not input byte size.\n\nCloses scylladb/scylladb#18489","shortMessageHtmlLink":"commitlog: Fix request_controller semaphore accounting."}},{"before":"9a0dd462746bde3a78213da6851ae4d114336254","after":"7de6347eb7e0607ce2eabfaca7c1d541e4828e48","ref":"refs/heads/next","pushedAt":"2024-05-08T13:43:26.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov\n\nSome time ago #16558 was merged that moved view builder drain into generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found. View builder drain was moved from \"before stopping messaging service to \"after\" it, and view update write handlers in proxy hanged for hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes and kill scylla, then complain about it and fail.\n\nThis PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.\n\nCloses scylladb/scylladb#18408\n\n* github.com:scylladb/scylladb:\n  Reapply \"Merge 'Drain view_builder in generic drain' from ScyllaDB\"\n  view: Abort pending view updates when draining","shortMessageHtmlLink":"Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov"}},{"before":"7da4b307403b70246c60b4416965cd57331b11c7","after":"9a0dd462746bde3a78213da6851ae4d114336254","ref":"refs/heads/next","pushedAt":"2024-05-08T13:38:38.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"tasks: use default task_ttl in scylla.yaml\n\nCurrently default task_ttl_in_seconds is 0, but scylla.yaml changes\nthe value to 10.\n\nChange task_ttl_in_seconds in scylla.yaml to 0, so that there are\nconsistent defaults. Comment it out.\n\nFixes: #16714.\n\nCloses scylladb/scylladb#18495","shortMessageHtmlLink":"tasks: use default task_ttl in scylla.yaml"}},{"before":"9a366ec207ce1bd417b862cc509933d03733806d","after":"7da4b307403b70246c60b4416965cd57331b11c7","ref":"refs/heads/next","pushedAt":"2024-05-08T13:37:12.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"Merge 'alternator: fix REST API access to an Alternator LSI' from Nadav Har'El\n\nThe name of the Scylla table backing an Alternator LSI looks like `basename:!lsiname`. Some REST API clients (including Scylla Manager) when they send a \"!\" character in the REST API request path may decide to \"URL encode\" it - convert it to `%21`.\n\nBecause of a Seastar bug (https://github.com/scylladb/seastar/issues/725) Scylla's REST API server forgets to do the URL decoding on the path part of the request, which leads to the REST API request failing to address the LSI table.\n\nThe first patch in this PR fixes the bug by using a new Seastar API introduced in https://github.com/scylladb/seastar/pull/2125 that does the URL decoding as appropriate. The second patch in the PR is a new test for this bug, which fails without the fix, and passes afterwards.\n\nFixes #5883.\n\nCloses scylladb/scylladb#18286\n\n* github.com:scylladb/scylladb:\n  test/alternator: test addressing LSI using REST API\n  REST API: stop using deprecated, buggy, path parameter","shortMessageHtmlLink":"Merge 'alternator: fix REST API access to an Alternator LSI' from Nad…"}},{"before":"6e63a9b293f0a12faf6b7c4a277d2f9111de6444","after":"9a366ec207ce1bd417b862cc509933d03733806d","ref":"refs/heads/next","pushedAt":"2024-05-08T13:34:53.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"nyh","name":"Nadav Har'El","path":"/nyh","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/584227?s=80&v=4"},"commit":{"message":"docs/dev/object_stroage.md: convert example AWS keys to be more innocent\n\nSomeone thought that they actually represent real keys (the 'EXAMPLE' in their name was not enough).\nConverted them to be as clear as can be, example data.\n\nSigned-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>\n\nCloses scylladb/scylladb#18565","shortMessageHtmlLink":"docs/dev/object_stroage.md: convert example AWS keys to be more innocent"}},{"before":"66e1f302d411e916c4bfc41929509798e83ade27","after":"6e63a9b293f0a12faf6b7c4a277d2f9111de6444","ref":"refs/heads/next","pushedAt":"2024-05-08T13:18:09.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"repair: Add ranges_parallelism option support for tablet\n\nThe ranges_parallelism option is introduced in commit 9b3fd9407b0.\nCurrently, this option works for vnode table repair only.\n\nThis patch enables it for tablet repair, since it is useful for\ntablet repair too.\n\nFixes #18383\n\nCloses scylladb/scylladb#18385","shortMessageHtmlLink":"repair: Add ranges_parallelism option support for tablet"}},{"before":"3ed4a7a8e3e71b6cc6a6f0d6d6e3a28697e64724","after":"66e1f302d411e916c4bfc41929509798e83ade27","ref":"refs/heads/next","pushedAt":"2024-05-08T13:12:34.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"storage_proxy: cas: reject for tablets-enabled tables\n\nCurrently, LWT is not supported with tablets.\nIn particular the interaction between paxos and tablet\nmigration is not handled yet.\n\nTherefore, it is better to outright reject LWT queries\nfor tablets-enabled tables rather than support them\nin a flaky way.\n\nThis commit also marks tests that depend on LWT\nas expeced to fail.\n\nFixes scylladb/scylladb#18066\n\nSigned-off-by: Benny Halevy <bhalevy@scylladb.com>\n\nCloses scylladb/scylladb#18103","shortMessageHtmlLink":"storage_proxy: cas: reject for tablets-enabled tables"}},{"before":"49307242d2f2f9333c62c884c64b20696ee4e51f","after":"3ed4a7a8e3e71b6cc6a6f0d6d6e3a28697e64724","ref":"refs/heads/next","pushedAt":"2024-05-08T13:00:48.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"raft topology: join_token_ring: prevent shutdown hangs\n\nShutdown of a bootstrapping node could hang on\n`_topology_state_machine.event.when()` in\n`wait_for_topology_request_completion`. It caused\nscylladb/scylladb#17246 and scylladb/scylladb#17608.\n\nOn a normal node, `wait_for_group0_stop` would prevent it, but this\nfunction won't be called before we join group 0. Solve it by adding\na new subscriber to `_abort_source`.\n\nAdditionally, trigger `_group0_as` to prevent other hang scenarios.\n\nNote that if both the new subscriber and `wait_for_group0_stop` are\ncalled, nothing will break. `abort_source::request_abort` and\n`conditional_variable::broken` can be called multiple times.\n\nThe raft-based topology is moved out of experimental in 6.0, no need\nto backport the patch.\n\nFixes scylladb/scylladb#17246\nFixes scylladb/scylladb#17608\n\nCloses scylladb/scylladb#18549","shortMessageHtmlLink":"raft topology: join_token_ring: prevent shutdown hangs"}},{"before":"3f11e455e363a2ee4e2d5312a9327bc2fa6157e8","after":"49307242d2f2f9333c62c884c64b20696ee4e51f","ref":"refs/heads/next","pushedAt":"2024-05-08T12:59:53.000Z","pushType":"push","commitsCount":6,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"Merge 'sstables: add dead row count when issuing warning to system.large_partitions' from Ferenc Szili\n\nThis is the second half of the fix for issue #13968. The first half is already merged with PR #18346\n\nScylla issues warnings for partitions containing more rows than a configured threshold. The warning is issued by inserting a row into the `system.large_partitions` table. This row contains the information about the partition for which the warning is issued: keyspace, table, sstable, partition key and size, compaction time and the number of rows in the partition. A previous PR #18346 also added range tombstone count to this row.\n\nThis change adds a new counter for dead rows to the large_partitions table.\n\nThis change also adds cluster feature protection for writing into these new counters. This is needed in case a cluster is in the process of being upgraded to this new version, after which an upgraded node writes data with the new schema into `system.large_partitions`, and finally a node is then rolled back to an old version. This node will then revert the schema to the old version, but the written sstables will still contain data with the new counters, causing any readers of this table to throw errors when they encounter these cells.\n\nThis is an enhancement, and backporting is not needed.\n\nFixes #13968\n\nCloses scylladb/scylladb#18458\n\n* github.com:scylladb/scylladb:\n  sstable: added test for counting dead rows\n  sstable: added docs for system.large_partitions.dead_rows\n  sstable: added cluster feature for dead rows and range tombstones\n  sstable: write dead_rows count to system.large_partitions\n  sstable: added counter for dead rows","shortMessageHtmlLink":"Merge 'sstables: add dead row count when issuing warning to system.la…"}},{"before":"905b8f59bd5a922a4b0bd812f150223337461bb3","after":"ed89deab408c47443408cdfd60380b5a27766322","ref":"refs/heads/next-5.4","pushedAt":"2024-05-08T12:57:16.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\nCloses scylladb/scylladb#18559","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"9fdacb25fca8c085732dacd8bebb48384b2c86ab","after":"3f11e455e363a2ee4e2d5312a9327bc2fa6157e8","ref":"refs/heads/next","pushedAt":"2024-05-08T12:57:13.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"docs: change \"create an issue\" github label to \"type/documentation\"\n\nCloses scylladb/scylladb#18550","shortMessageHtmlLink":"docs: change \"create an issue\" github label to \"type/documentation\""}},{"before":"98f7b205a7218928da976f5cef5054b50957be09","after":"9fdacb25fca8c085732dacd8bebb48384b2c86ab","ref":"refs/heads/next","pushedAt":"2024-05-08T12:50:11.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":".github: add clang-tidy workflow\n\nclang-tidy is a tool provided by Clang to perform static analysis on\nC++ source files. here, we are mostly intersted in using its\nhttps://clang.llvm.org/extra/clang-tidy/checks/bugprone/use-after-move.html\ncheck to reveal the potential issues.\n\nthis workflow is added to run clang-tidy when building the tree, so\nthat the warnings from clang-tidy can be noticed by developers.\n\na dedicated action is added so other github workflow can reuse it to\nsetup the building environment in an ubuntu:jammy runner.\n\nclang-tidy-matcher.json is added to annotate the change, so that the\nwarnings are more visible with github webpage.\n\nSigned-off-by: Kefu Chai <kefu.chai@scylladb.com>\n\nCloses scylladb/scylladb#18342","shortMessageHtmlLink":".github: add clang-tidy workflow"}},{"before":"f0002746746272d7696291521e421115136acaba","after":"98f7b205a7218928da976f5cef5054b50957be09","ref":"refs/heads/next","pushedAt":"2024-05-08T10:42:08.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"denesb","name":"Botond Dénes","path":"/denesb","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1389273?s=80&v=4"},"commit":{"message":"docs: add swagger ui extension\n\nRenders the API Reference from api/api-doc using Swagger UI 2.2.10.\n\naddress comments\n\nCloses scylladb/scylladb#18253","shortMessageHtmlLink":"docs: add swagger ui extension"}},{"before":"4dcae6638073036bd92baa7507e60f585df2763b","after":"f0002746746272d7696291521e421115136acaba","ref":"refs/heads/next","pushedAt":"2024-05-08T08:52:44.000Z","pushType":"push","commitsCount":54,"pusher":{"login":"avikivity","name":"Avi Kivity","path":"/avikivity","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1017210?s=80&v=4"},"commit":{"message":"Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec\n\nThis is needed to avoid severe imbalance between shards which can\nhappen when some table grows and is split. The inter-node balance can\nbe equal, so inter-node migration cannot fix the imbalance. Also, if RF=N\nthen there is not even a possibility of moving tablets around to fix the imbalance.\nThe only way to bring the system to balance is to move tablets within the nodes.\n\nThe system is not prepared for intra-node migration currently. Request coordination\nis host-based, while for intra-node migration it should be (also) shard-based.\nThe solution employed here is to keep the coordination between nodes as-is,\nand for intra-node migration storage_proxy-level coordinator is not aware of\nthe migration (no pending host). The replica-side request handler will be a\nsecond-level coordinator which routes requests to shards, similar to how\nthe first-level coordinator routes them to hosts.\n\nTablet sharder is adjusted to handle intra-migration where a tablet\ncan have two replicas on the same host. For reads, sharder uses the\nread selector to resolve the conflict. For writes, the write selector\nis used.\n\nThe old shard_of() API is kept to represent shard for reads, and new\nmethod is introduced to query the shards for writing:\nshard_for_writes(). All writers should be switched to that API, which\nis not done in this patch yet.\n\nThe request handler on replica side acts as a second-level\ncoordinator, using sharder to determine routing to shards. A given\nsharder has a scope of a single topology version, a single\neffective_replication_map_ptr, which should be kept alive during\nwrites.\n\nperf-simple-query test results show no signs of regression:\n\nCommand: perf-simple-query -c1 -m1G --write --tablets --duration=10\n\nBefore:\n\n> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)\n> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)\n> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)\n> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)\n> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)\n> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)\n> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)\n> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)\n> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)\n> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)\n>\n> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)\n> median absolute deviation: 243.04\n> maximum: 87756.72\n> minimum: 83294.81\n>\n\nAfter:\n\n> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)\n> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)\n> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)\n> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)\n> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)\n> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)\n> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)\n> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)\n> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)\n> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)\n>\n> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)\n> median absolute deviation: 1007.41\n> maximum: 90306.31\n> minimum: 85523.06\n\nCommand (reads): perf-simple-query -c1 -m1G  --tablets --duration=10\n\nBefore:\n\n> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)\n> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)\n> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)\n> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)\n> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)\n> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)\n> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)\n> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)\n> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)\n> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)\n>\n> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)\n> median absolute deviation: 571.20\n> maximum: 97549.23\n> minimum: 94031.94\n>\n\nAfter:\n\n> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)\n> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)\n> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)\n> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)\n> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)\n> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)\n> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)\n> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)\n> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)\n> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)\n>\n> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)\n> median absolute deviation: 200.33\n> maximum: 101559.09\n> minimum: 99794.67\n>\n\nFixes #16594\n\nCloses scylladb/scylladb#18026\n\n* github.com:scylladb/scylladb:\n  Implement fast streaming for intra-node migration\n  test: tablets_test: Test sharding during intra-node migration\n  test: tablets_test: Check sharding also on the pending host\n  test: py: tablets: Test writes concurrent with migration\n  test: py: tablets: Test crash during intra-node migration\n  api, storage_service: Introduce API to wait for topology to quiesce\n  dht, replica: Remove deprecated sharder APIs\n  test: Avoid using deprecated sharded API\n  db: do_apply_many() avoid deprecated sharded API\n  replica: mutation_dump: Avoid deprecated sharder API\n  repair: Avoid deprecated sharder API\n  table: Remove optimization which returns empty reader when key is not owned by the shard\n  dht: is_single_shard: Avoid deprecated sharder API\n  dht: split_range_to_single_shard: Work with static_sharder only\n  dht: ring_position_range_sharder: Avoid deprecated sharder APIs\n  dht: token: Avoid use of deprecated sharder API by switching to static_sharder\n  selective_token_sharder: Avoid use of deprecated sharder API\n  docs: Document tablet sharding vs tablet replica placement\n  readers/multishard.cc: use shard_for_reads() instead of shard_of()\n  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()\n  storage_proxy: Avoid shard_of() use in choose_rate_limit_info()\n  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()\n  storage_proxy: Prepare mutate_hint() for intra-node tablet migration\n  commitlog_replayer: Avoid deprecated sharder::shard_of()\n  lwt: Avoid deprecated sharder::shard_of()\n  compaction: Avoid deprecated sharder::shard_of()\n  dht: Extract dht::static_sharder\n  replica: Deprecate table::shard_of()\n  locator: Deprecate effective_replication_map::shard_of()\n  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard\n  tests: tablets: py: Add intra-node migration test\n  tests: tablets: Test that drained nodes are not balanced internally\n  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load\n  tests: tablets: Verify that disabling balancing results in no intra-node migrations\n  tests: tablets: Check that nodes are internally balanced\n  tests: tablets: Improve debuggability by showing which rows are missing\n  tablets, storage_service: Support intra-node migration in move_tablet() API\n  tablet_allocator: Generate intra-node migration plan\n  tablet_allocator: Extract make_internode_plan()\n  tablet_allocator: Maintain candidate list and shard tablet count for target nodes\n  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions\n  tablets, streaming: Implement tablet streaming for intra-node migration\n  dht, auto_refreshing_sharder: Allow overriding write selector\n  multishard_writer: Handle intra-node migration\n  storage_proxy: Handle intra-node tablet migration for writes\n  tablets: Get rid of tablet_map::get_shard()\n  tablets: Avoid tablet_map::get_shard in cleanup\n  tablets: test: Use sharder instead of tablet_map::get_shard()\n  tablets: tablet_sharder: Allow working with non-local host\n  sharding: Prepare for intra-node-migration\n  docs: Document sharder use for tablets\n  tablets: Introduce tablet transition kind for intra-node migration\n  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()","shortMessageHtmlLink":"Merge 'Balance tablets within nodes (intra-node migration)' from Toma…"}},{"before":"52a9aeeed6eb938de020f2d89ed6fd649e896d3d","after":"d71fd881481a2cdb8a62bbc9ba5382c791e8b895","ref":"refs/heads/mergify/copy/branch-5.2/pr-18443","pushedAt":"2024-05-08T08:25:37.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"cfc6e07097f6ec262e3833b2b6ccca2dc7e29fd1","after":"97f85f7616b69119043bd6375715e2b6ada2422b","ref":"refs/heads/mergify/copy/branch-5.4/pr-18443","pushedAt":"2024-05-08T08:08:26.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"180cb7a2b9dd6b7c9f50883419156350311889ea","after":"4dcae6638073036bd92baa7507e60f585df2763b","ref":"refs/heads/next","pushedAt":"2024-05-08T07:45:15.000Z","pushType":"push","commitsCount":4,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"Merge 'test: {auth,topology}: use manager.rolling_restart' from Piotr Dulikowski\n\nInstead of performing a rolling restart by calling `restart` in a loop over every node in the cluster, use the dedicated\n`manager.rolling_restart` function. This method waits until all other nodes see the currently processed node as up or down before proceeding to the next step. Not doing so may lead to surprising behavior.\n\nIn particular, in scylladb/scylladb#18369, a test failed shortly after restarting three nodes. Because nodes were restarted one after another too fast, when the third node was restarted it didn't send a notification to the second node because it still didn't know that the second node was alive. This led the second node to notice that the third node restarted by observing that it incremented its generation in gossip (it restarted too fast to be marked as down by the failure detector). In turn, this caused the second node to send \"third node down\" and \"third node up\" notifications to the driver in a quick succession, causing it to drop and reestablish all connections to that node. However, this happened _after_ rolling upgrade finished and _after_ the test logic confirmed that all nodes were alive. When the notifications were sent to the driver, the test was executing some statements necessary for the test to pass - as they broke, the test failed.\n\nFixes: scylladb/scylladb#18369\n\nCloses scylladb/scylladb#18379\n\n* github.com:scylladb/scylladb:\n  test: get rid of server-side server_restart\n  test: util: get rid of the `restart` helper\n  test: {auth,topology}: use manager.rolling_restart","shortMessageHtmlLink":"Merge 'test: {auth,topology}: use manager.rolling_restart' from Piotr…"}},{"before":"03818c4aa96de33b40a336278dfb99bd797f6b53","after":"180cb7a2b9dd6b7c9f50883419156350311889ea","ref":"refs/heads/next","pushedAt":"2024-05-08T07:40:49.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"kbr-scylla","name":"Kamil Braun","path":"/kbr-scylla","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120044486?s=80&v=4"},"commit":{"message":"storage_service: notify lifecycle subs only after token metadata update\n\nCurrently, in raft mode, when raft topology is reloaded from disk or a\nnotification is received from gossip about an endpoint change, token\nmetadata is updated accordingly. While updating token metadata we detect\nwhether some nodes are joining or are leaving and we notify endpoint\nlifecycle subscribers if such an event occurs. These notifications are\nfired _before_ we finish updating token metadata and before the updated\nversion is globally available.\n\nThis behavior, for \"node leaving\" notifications specifically, was not\npresent in legacy topology mode. Hinted handoff depends on token\nmetadata being updated before it is notified about a leaving node (we\nhad a similar issue before: scylladb/scylladb#5087, and we fixed it by\nenforcing this property). Because this is not true right now for raft\nmode, this causes the hint draining logic not to work properly - when a\nnode leaves the cluster, there should be an attempt to send out hints\nfor that node, but instead hints are not sent out and are kept on disk.\n\nIn order to fix the issue with hints, postpone notifying endpoint\nlifecycle subscribers about joined and left nodes only after the final\ntoken metadata is computed and replicated to all shards.\n\nFixes: scylladb/scylladb#17023\n\nCloses scylladb/scylladb#18377","shortMessageHtmlLink":"storage_service: notify lifecycle subs only after token metadata update"}},{"before":null,"after":"52a9aeeed6eb938de020f2d89ed6fd649e896d3d","ref":"refs/heads/mergify/copy/branch-5.2/pr-18443","pushedAt":"2024-05-08T01:56:11.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"mergify[bot]","name":null,"path":"/apps/mergify","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/10562?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\n# Conflicts:\n#\tdb/config.cc\n#\tservice/raft/raft_group_registry.cc\n#\ttest/lib/cql_test_env.cc\n#\ttest/topology_custom/test_raft_no_quorum.py","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":null,"after":"cfc6e07097f6ec262e3833b2b6ccca2dc7e29fd1","ref":"refs/heads/mergify/copy/branch-5.4/pr-18443","pushedAt":"2024-05-08T01:56:11.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"mergify[bot]","name":null,"path":"/apps/mergify","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/10562?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\n(cherry picked from commit 8df6d10e883f0965b16f75bd0217e5676eaa4c04)\n\n# Conflicts:\n#\tservice/raft/raft_group_registry.cc\n#\ttest/topology_custom/test_raft_no_quorum.py","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"98367cb6a112272c7ec461bc97f04ad78ba65a93","after":"03818c4aa96de33b40a336278dfb99bd797f6b53","ref":"refs/heads/master","pushedAt":"2024-05-08T01:54:12.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"scylladb-promoter","name":null,"path":"/scylladb-promoter","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/36883350?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\nCloses scylladb/scylladb#18443","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}},{"before":"98367cb6a112272c7ec461bc97f04ad78ba65a93","after":"03818c4aa96de33b40a336278dfb99bd797f6b53","ref":"refs/heads/next","pushedAt":"2024-05-07T21:40:31.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"tgrabiec","name":"Tomasz Grabiec","path":"/tgrabiec","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/283695?s=80&v=4"},"commit":{"message":"direct_failure_detector: increase ping timeout and make it tunable\n\nThe direct failure detector design is simplistic. It sends pings\nsequentially and times out listeners that reached the threshold (i.e.\ndidn't hear from a given endpoint for too long) in-between pings.\n\nGiven the sequential nature, the previous ping must finish so the next\nping can start. We timeout pings that take too long. The timeout was\nhardcoded and set to 300ms. This is too low for wide-area setups --\nlatencies across the Earth can indeed go up to 300ms. 3 subsequent timed\nout pings to a given node were sufficient for the Raft listener to \"mark\nserver as down\" (the listener used a threshold of 1s).\n\nIncrease the ping timeout to 600ms which should be enough even for\npinging the opposite side of Earth, and make it tunable.\n\nIncrease the Raft listener threshold from 1s to 2s. Without the\nincreased threshold, one timed out ping would be enough to mark the\nserver as down. Increasing it to 2s requires 3 timed out pings which\nmakes it more robust in presence of transient network hiccups.\n\nIn the future we'll most likely want to decrease the Raft listener\nthreshold again, if we use Raft for data path -- so leader elections\nstart quickly after leader failures. (Faster than 2s). To do that we'll\nhave to improve the design of the direct failure detector.\n\nRef: scylladb/scylladb#16410\nFixes: scylladb/scylladb#16607\n\n---\n\nI tested the change manually using `tc qdisc ... netem delay`, setting\nnetwork delay on local setup to ~300ms with jitter. Without the change,\nthe result is as observed in scylladb/scylladb#16410: interleaving\n```\nraft_group_registry - marking Raft server ... as dead for Raft groups\nraft_group_registry - marking Raft server ... as alive for Raft groups\n```\nhappening once every few seconds. The \"marking as dead\" happens whenever\nwe get 3 subsequent failed pings, which is happens with certain (high)\nprobability depending on the latency jitter. Then as soon as we get a\nsuccessful ping, we mark server back as alive.\n\nWith the change, the phenomenon no longer appears.\n\nCloses scylladb/scylladb#18443","shortMessageHtmlLink":"direct_failure_detector: increase ping timeout and make it tunable"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAERQi5FAA","startCursor":null,"endCursor":null}},"title":"Activity · scylladb/scylladb"}