Improved location claim algorithm (#1008)
* Configuration and module alignment

Add claim function and target_n_val configuration to cuttlefish (see the riak.conf sketch below).

Move modules around to make it more obvious where the functions used in membership reside.
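
For reference, a hedged sketch of how these settings might appear in riak.conf (values illustrative; the option names are defined in priv/riak_core.schema later in this diff):

    ## illustrative riak.conf excerpt - not part of this commit's diff
    choose_claim_fun = choose_claim_v4
    target_n_val = 4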

* Remove deprecated v1 claim/wants

* Update framework for Claim

Stops fake wants being required to prompt claim on a location change.

Allow a claim module to implement a sort_members_for_choose(Ring, Members, Owners) -> SortedMembers function, to pre-sort the members passed into claim_rebalance (an illustrative sketch follows below).

Add further specs.
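
A minimal illustrative sketch of such a callback (not the shipped implementation), here simply ordering members by how few partitions they currently own:

    %% Illustrative sketch only - not the riak_core implementation.
    %% Members :: [node()], Owners :: [{Index :: non_neg_integer(), node()}]
    sort_members_for_choose(_Ring, Members, Owners) ->
        Counts =
            lists:foldl(
                fun({_Idx, N}, Acc) ->
                    maps:update_with(N, fun(C) -> C + 1 end, 1, Acc)
                end,
                #{},
                Owners),
        lists:sort(
            fun(A, B) -> maps:get(A, Counts, 0) =< maps:get(B, Counts, 0) end,
            Members).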

* Add choose_claim_v4

* Location claim improvements

The location claim is improved so that it will try to balance the spread of vnodes if it reaches the end of the ring and is still unbalanced.

It also uses a stronger meets_target_n check to fall back to sequential_claim more reliably on incorrect spacing (of vnodes across nodes, though not yet across locations); a simplified sketch of the spacing check follows.
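
A simplified sketch of the kind of spacing check involved (not the actual meets_target_n code): treat the ownership list as circular and require that no node owns two partitions within any window of TargetN consecutive positions.

    %% Simplified sketch only - not the riak_core implementation.
    spacing_ok(Owners, TargetN) ->
        Nodes = [N || {_Idx, N} <- Owners],
        %% wrap the tail so windows crossing the end of the ring are checked
        Wrapped = Nodes ++ lists:sublist(Nodes, TargetN - 1),
        check_windows(Wrapped, length(Nodes), TargetN).

    check_windows(_Nodes, 0, _TargetN) ->
        true;
    check_windows([_ | Rest] = Nodes, Remaining, TargetN) ->
        Window = lists:sublist(Nodes, TargetN),
        case length(lists:usort(Window)) =:= length(Window) of
            true  -> check_windows(Rest, Remaining - 1, TargetN);
            false -> false
        end.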

* Refinements to claim_v4

Extended the potential for testing by determining which nodes are safe to add to and remove from loops, rather than simply relying on sequential order.

* Correction following removal of log

* Count remove list not Excess to determine loops

* Better order of initial striping

Resolves some test issues. Also tries harder to do safe removals when looking back in the ring (as opposed to the removal of other additions).

* A new claim algorithm (#1003)

* Support two transition changes

Where the second transition is triggered by a change of location, we need to ensure that the location_changed status update is recognised in the ring.

* Unrelated fix to remove reference to gen_fsm_compat

* unrelated fix to get rid of deprecation warning

* Testing claim

* The new claim algorithm as a purely functional algorithm

* add new entry for version 5 claiming

* Refactor v5 into v4

* move the impossible-config test to the place where we may actually enter recursion

* Documentation

The algorithm is described in more detail in a markdown document (docs/claim-version4.md).

* Allow configurations with zero nodes in location for better placement update

This works better when a location is emptied of nodes, resulting in fewer transfers.

* Keep the order of nodes to avoid a back-translation issue

---------

Co-authored-by: Martin Sumner <martin.sumner@adaptip.co.uk>

* Claim API requires export of 2-arity choose function as well as 3-arity
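
A hedged sketch of the module shape this implies (module name and default parameters here are illustrative, not the riak_core source):

    -module(example_claim).
    -export([choose_claim_v4/2, choose_claim_v4/3]).

    %% 2-arity form: fill in default parameters.
    choose_claim_v4(Ring, Node) ->
        choose_claim_v4(Ring, Node, []).

    %% 3-arity form: the full claim implementation.
    choose_claim_v4(Ring, Node, _Params) ->
        %% ... compute and return the updated ring ...
        Ring.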

* Always return indices

Otherwise, if all vnodes have become excluded there is no escape from this condition (unless other traffic can trigger the creation of vnodes).  This is helpful in situations where transfers are performed on standby clusters with no other traffic.

This commit also logs a timing of the claim each time it is called (see the timing sketch below).
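
An illustrative way to record such a timing (placeholder names; the real code may log differently):

    %% Illustrative only - placeholder names, not the riak_core code.
    time_claim(Mod, Fun, Ring, Node) ->
        {TimeUS, NewRing} = timer:tc(Mod, Fun, [Ring, Node]),
        logger:info("Claim ~w:~w took ~w ms", [Mod, Fun, TimeUS div 1000]),
        NewRing.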

* Calculate swaps only once

* Remember v4 solutions via claimant

To allow the riak_core_ring_manager and riak_core_claimant to remember v4 solutions, these are shared via the state of the claimant (a minimal memoisation sketch follows).
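
A minimal memoisation sketch under that design (hypothetical names, not the actual claimant code): keep a map of previously computed v4 solutions in process state, keyed by the claim inputs.

    %% Hypothetical sketch - not the actual riak_core_claimant code.
    maybe_cached_claim(ClaimMod, Ring, Node, Cache) ->
        Key = {riak_core_ring:all_owners(Ring), Node},
        case maps:find(Key, Cache) of
            {ok, NewRing} ->
                {NewRing, Cache};
            error ->
                NewRing = ClaimMod:choose_claim_v4(Ring, Node),
                {NewRing, maps:put(Key, NewRing, Cache)}
        end.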

* Long-running tests

* Adding an extra test (#1004)

* Add an extra test to show owners may stay the same if only location changes

This is not always the case, but holds when there is a solution in the first place

* Fix type error that dialyzer could not find

* Introduce necessary conditions to fall back to version 2

* update tests

* Check whether it is worth using brute force

* make historic values the norm

* Introduce nvals map type

* Take the number of nodes into account when checking the brute-force condition

* Property to evaluate skipping brute force strategy

* QuickCheck property starts with choosing ring size.

* Remove fallback for necessary conditions

* Filter tests to work around flakiness

* In order to re-run the test suite, remove the strict precondition

* Check in test suite

* Replace claim_suite.suite with the larger claim.suite

* Sometimes it is worth using brute_force to reach zero node violations

* better documentation of the binring algorithm

* Run property with a sufficient condition

* Revert "Long-running tests"

This reverts commit 6527033.

* Test adjustments

* Test adjustments

* Add support for configured target_location_n_val

* Memoise fixes

The cache of v4 solutions is required by the ring_manager and the claimant - so specifically update both of these processes each time. Otherwise the cache will be missed when the ring_manager calls riak_core_claimant:ring_changed/2.

There is a fix to the last-gasp check before writing the ring file. The prune_write_notify_ring function does not care if the write of a ring errors - so return an error from this function rather than crashing the ring manager. This otherwise causes instability in location tests.

* Example configurations saved in source format (#1005)

* Remove pre-computed test suite

* cleanup

* Make claim_eqc tests not fail on weird configs by supplying a diverse list of options

* Add full-rebalance for v4

On a failure of simple_transfer, the leave call will fall back to sequential_claim, which is part of the v2 claim family. Now that we have v4, if v4 is configured the leave should call it, as v4 does handle leaves.

* Support leave in prop_claim

* Update - to use correct claim_fun on leave

Property temporarily changed to consider only failures with locations

* Use application env to read target_n_val (#1007)

* Use application env to read target_n_val

* Re-introduce v2 in riak_core_claim_eqc

* Move precondition to postcondition to also test less perfect cases

* Fixed error in transfer_node usage

* cleanup: stop using remove_from_cluster/5

* Add warning if simple_transfer produces unbalanced result

In this case, full_rebalance_onleave should be enabled.

* only_swap/swap_only confusion

Add a recommendation to use full_rebalance_onleave for locations (riak.conf sketch below).
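
A hedged riak.conf sketch of that recommendation (illustrative, not taken from this commit):

    ## illustrative riak.conf excerpt - when locations are in use
    choose_claim_fun = choose_claim_v4
    full_rebalance_onleave = on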

* Update riak_core_claim_eqc.erl

* Mas i1001 docupdate (#1009)

* Change doc

Change the introduction to refer to vnodes and nodes. Remove the recommendation not to vary location_n_val and node_n_val.

* Update comments on having different target n_vals

* Further doc updates

* Update docs/claim-version4.md

* Update docs/claim-version4.md

* Update docs/claim-version4.md

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

* Update docs/claim-version4.md

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

---------

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

---------

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>
martinsumner and ThomasArts committed Jun 12, 2023
1 parent b1696b3 commit 7b04e2c
Showing 23 changed files with 4,109 additions and 807 deletions.
224 changes: 224 additions & 0 deletions docs/claim-version4.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions eqc/hashtree_eqc.erl
@@ -190,7 +190,7 @@ command(_S = #state{started = true, tree_id = TreeId,
%% to make sure the iterator code is fully exercised.
%%
%% Store the hashtree records in the process dictionary under keys 't1' and 't2'.
%%
%%
start(Params, [TreeId | ExtraIds], Tree1OpenOrEmpty, Tree2OpenOrEmpty) ->
{Segments, Width, MemLevels} = Params,
%% Return now so we can store symbolic value in procdict in next_state call
@@ -271,7 +271,7 @@ update_snapshot(T, S) ->
ok.


%%
%%
%% Wrap the hashtree:update_perform call and erase the snapshot hashtree state.
%% Should only happen if a snapshot state exists.
%%
@@ -490,7 +490,7 @@ next_state(S,_R,{call, _, local_compare1, []}) ->
%%
prop_correct() ->
?SETUP(fun() ->
application:set_env(lager, handlers, [{lager_console_backend, info}]),
application:set_env(lager, handlers, [{level, info}]),
application:ensure_started(syntax_tools),
application:ensure_started(compiler),
application:ensure_started(goldrush),
Expand Down Expand Up @@ -531,7 +531,7 @@ prop_correct() ->
Res0
end,
%% Clean up after the test
case Res of
case Res of
ok -> % if all went well, remove leveldb files
catch cleanup_hashtree(get(t1)),
catch cleanup_hashtree(get(t2));
6 changes: 3 additions & 3 deletions eqc/new_cluster_membership_model_eqc.erl
@@ -1608,14 +1608,14 @@ handle_down_nodes(CState, Next) ->

claim_until_balanced(Ring, Node) ->
%%{WMod, WFun} = app_helper:get_env(riak_core, wants_claim_fun),
{WMod, WFun} = {riak_core_claim, default_wants_claim},
{WMod, WFun} = {riak_core_membership_claim, default_wants_claim},
NeedsIndexes = apply(WMod, WFun, [Ring, Node]),
case NeedsIndexes of
no ->
Ring;
{yes, _NumToClaim} ->
%%{CMod, CFun} = app_helper:get_env(riak_core, choose_claim_fun),
{CMod, CFun} = {riak_core_claim, default_choose_claim},
{CMod, CFun} = {riak_core_membership_claim, default_choose_claim},
NewRing = CMod:CFun(Ring, Node),
claim_until_balanced(NewRing, Node)
end.
@@ -1682,7 +1682,7 @@ remove_from_cluster(Ring, ExitingNode) ->
end,
Ring,
AllOwners),
riak_core_claim:claim_rebalance_n(TempRing, Other)
riak_core_membership_claim:claim_rebalance_n(TempRing, Other)
end,

ExitRing.
80 changes: 80 additions & 0 deletions priv/riak_core.schema
@@ -227,6 +227,83 @@
hidden
]}.

%% @doc Choose claim function
%% Claim function to be used when handling joins to the cluster.
%% There are three supported functions:
%% - choose_claim_v2 (the default) designed for environments without location
%% awareness as a requirement
%% - choose_claim_v3 (deprecated) a claim function which treats claim as an
%% optimisation problem. It creates a number of possible claim plans and
%% evaluates them for violations, balance and diversity, choosing the 'best'
%% plan. claim_v3 is not location aware
%% - choose_claim_v4 a claim algorithm which refactors v2 to improve location
%% awareness
{mapping, "choose_claim_fun", "riak_core.choose_claim_fun", [
{commented, "choose_claim_v2"},
{datatype, {enum, [choose_claim_v2, choose_claim_v3, choose_claim_v4]}},
merge
]}.

%% @doc Target N Val for Cluster Administration
%% Cluster change operations such as joins and leaves will use a target_n_val
%% to control spacing of preflists across physical nodes. The default value
%% is 4, which is the default bucket property n_val + 1. This means that
%% the target for a cluster change operation is to make sure that all preflists
%% of n_val 3 are on 3 separate physical devices, even when a single failure
%% has occurred.
%% If the target_n_val is not met by a cluster change operation, the failure is
%% not blocking - a warning will be printed in the cluster plan, but the plan
%% will not be prevented from being committed.
%% In some cases, by reducing the target_n_val it may be possible to reduce the
%% number of transfers necessary to complete a cluster change operation.
%% In clusters with a large number of nodes, larger target_n_val values can be
%% supported, and may result in a better spread of load across the cluster
%% when node failure occurs.
{mapping, "target_n_val", "riak_core.target_n_val", [
{datatype, integer},
{default, 4},
{validators, ["target_nval_max", "target_nval_min"]},
{commented, 4}
]}.

%% target_n_val validators
{validator, "target_nval_max",
"7 and larger are supported, but considered advanced config",
fun(Size) ->
Size =< 6
end}.

{validator, "target_nval_min", "must be at least 1",
fun(Size) ->
Size >= 1
end}.

%% @doc Target Location N Val for Cluster Administration
%% Cluster change operations such as joins and leaves will use a
%% target_location_n_val to control spacing of preflists across locations. This
%% is to support clusters which have a concept of `location` failure as well as
%% Node failure (e.g. rack awareness is required, or support for AWS placement
%% groups).
%% In this case, nodes are assigned to locations, and as well as supporting
%% the splitting of data replicas across nodes, attempts will also be made
%% during cluster change operations to split preflists across locations.
%% If the target_location_n_val is not met by a cluster change operation, the failure is
%% not blocking - a warning will be printed in the cluster plan, but the plan
%% will not be prevented from being committed.
%% In some cases, by reducing the target_location_n_val it may be possible to
%% reduce the number of transfers necessary to complete a cluster change
%% operation.
%% In clusters with a large number of nodes, larger target_location_n_val
%% values can be supported.
%% If the target_location_n_val is greater than the target_n_val, the target_n_val
%% will be used.
{mapping, "target_location_n_val", "riak_core.target_location_n_val", [
{datatype, integer},
{default, 3},
{validators, ["target_nval_max", "target_nval_min"]},
{commented, 3}
]}.

%% @doc On cluster leave - force full rebalance partitions
%% By default on a cluster leave there will first be an attempt to handoff
%% vnodes to safe (in terms of target_n_val) locations. In small clusters,
@@ -237,6 +314,9 @@
%% all nodes.
%% Please carefully consider any cluster plan created with this option before
%% committing
%% If cluster planning is being done with locations enabled, then `full_rebalance_onleave`
%% should also be enabled. With claim_v4 this should result in a cluster
%% plan which is correct, but also relatively efficient.
{mapping, "full_rebalance_onleave", "riak_core.full_rebalance_onleave", [
{datatype, flag},
{default, off}
6 changes: 4 additions & 2 deletions src/riak_core.app.src
@@ -42,8 +42,10 @@
{target_n_val, 4},

%% Default claims functions
{wants_claim_fun, {riak_core_claim, default_wants_claim}},
{choose_claim_fun, {riak_core_claim, default_choose_claim}},
{wants_claim_fun,
{riak_core_membership_claim, default_wants_claim}},
{choose_claim_fun,
{riak_core_membership_claim, default_choose_claim}},

%% Vnode inactivity timeout (how often to check if fallback vnodes
%% should return their data) in ms.
