Cluster Metadata Internals

The following describes Cluster Metadata from an implementor's perspective. It is intended for those wishing to familiarize themselves with the implementation in order to make changes or to support the subsystem. It is not aimed at those who simply wish to use Cluster Metadata, although they may still find it helpful.

1. Structure

The Cluster Metadata subsystem is divided into several parts spread across multiple modules. These parts can be divided into three categories: representation, storage and broadcast. Unifying them is the API, which presents the subsystem as one cohesive unit to clients.

1.1 Representation

Cluster Metadata stores and represents values internally in a different form than the one presented to clients. Most notably, clients are never exposed to the logical clocks stored alongside values, while internally these clocks are used heavily.

The riak_core_metadata_object module is responsible for defining the internal representation -- referred to as the "Metadata Object". In addition, it provides a set of operations that can be performed on Metadata Objects. These operations include conflict resolution, reconciling differing versions of an object, and accessing the values, the clock and additional information about the object. Outside this module, the structure of the Metadata Object is opaque, and all operations on objects should go through this module.

1.1.1 Logical Clocks

The logical clocks used by Cluster Metadata are Dotted Version Vectors. The implementation is contained within dvvset.erl.
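
As a rough illustration of how these clocks carry values and causal history, here is a minimal sketch, assuming the dvvset API of new/1, update/2 and values/1; the value and actor used here are arbitrary:

Clock0 = dvvset:new(<<"v1">>),          %% a new clock carrying a single value and an empty causal history
Clock1 = dvvset:update(Clock0, node()), %% advance the causal history, tagging this node as the actor
[<<"v1">>] = dvvset:values(Clock1).     %% the value is preserved; its causal history is now non-empty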

1.2 Storage

Data is stored both in-memory, using ets, and on-disk, using dets. Each node locally manages its own storage via the riak_core_metadata_manager, or "Metadata Manager", process running on it.

The Metadata Manager is not responsible for replicating data, nor is it aware of other nodes. However, it is aware that writes are causally related by their logical clocks, and it uses this information to make local storage decisions, e.g. when an object is causally older than the value already stored. Data read out of the Metadata Manager is a Metadata Object, while data written is either a Metadata Object or a raw value and a view of the logical clock (the view contains enough detail to reconstruct the information in the original clock).
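
For example, a raw value can be written through the Metadata Manager together with a clock view (or undefined when no prior clock is known), and the stored Metadata Object can then be read back. This is a minimal sketch; the prefix and key are hypothetical:

PKey = {{mymod, config}, mykey},                                   %% hypothetical {FullPrefix, Key}
riak_core_metadata_manager:put(PKey, undefined, <<"some value">>), %% raw value written with no clock view
Object = riak_core_metadata_manager:get(PKey).                     %% opaque Metadata Object, clock included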

1.2.1 Metadata Hashtrees

In addition to storing Metadata Objects, the subsystem maintains Hash Trees that are the basis for determining differences between the data stored on two nodes and that can be used to determine when individual keys or groups of keys change.

The Hash Trees that are used are implemented in the hashtree_tree module. More details on its implementation can be found there.

Each node maintains its own hashtree_tree via the riak_core_metadata_hashtree, or Metadata Hashtree, process. This process is responsible for safely applying different operations on the tree, in addition to exposing information needed elsewhere in the subsystem and to clients. It also periodically updates the tree, when not in use by other parts of the subsystem, so that clients using it to detect changed keys can do so without forcing the update themselves.

Metadata Hashtrees are only maintained for the lifetime of a running node. They are rebuilt each time a node starts and are never rebuilt while the node continues to run. The initial design of this part of the system called for memory-backed Hash Trees. However, they are currently backed by LevelDB and are cleared on graceful shutdown and each time the node starts. The entire tree is stored within one LevelDB instance.

The Metadata Manager updates the hashes for keys using riak_core_metadata_hashtree in its write path. The upper levels of the tree will only reflect those changes when the tree is updated -- either periodically or when needed by another part of the subsystem.
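
A sketch of how a client might use this to detect changes under a prefix without forcing the update itself, assuming riak_core_metadata_hashtree:prefix_hash/1 returns the hash covering a given prefix (the prefix here is hypothetical):

Prefix = {mymod, config},                                %% hypothetical metadata prefix
Hash0 = riak_core_metadata_hashtree:prefix_hash(Prefix),
%% ... later, after the periodic update has had a chance to run ...
case riak_core_metadata_hashtree:prefix_hash(Prefix) of
    Hash0 -> unchanged;
    _Other -> changed
end.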

1.3 Broadcast

The above details focus on Cluster Metadata from the point of view of a single node. It is the broadcast mechanism that puts the "Cluster" in "Cluster Metadata". This mechanism was built to be general and is in and of itself a separate subsystem. More details about how it works can be found in the documentation about riak_core_broadcast. The following describes how Cluster Metadata uses this subsystem.

The riak_core_broadcast_handler behaviour for Cluster Metadata is the Metadata Manager. In addition to the simple get/put interface, it implements all the required broadcast functions.

The majority of the riak_core_broadcast_handler functions are well-defined; however, exchange/1 allows a lot of leeway in how to implement the repair of data between the local node and a remote peer. To implement exchange/1, Cluster Metadata uses the Metadata Hashtrees on both nodes to determine the differences between the data stored on each, and the existing riak_core_metadata_manager interface to repair those differences. The implementation of this logic can be found in riak_core_metadata_exchange_fsm.

1.4 API

The API, contained within the riak_core_metadata module, ties together the above parts and is the only interface that should be used by clients. It is responsible not only for hiding the logical clock and internal representation of the object but also for performing writes via the Metadata Manager and submitting broadcasts of updates and resolved values.
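
As a minimal sketch of typical client usage (the prefix, key and value are hypothetical):

riak_core_metadata:put({mymod, config}, mykey, <<"v1">>),
<<"v1">> = riak_core_metadata:get({mymod, config}, mykey).

Note that get here returns a plain value (or undefined), not the internal Metadata Object described in section 1.1.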

2. Future Work & Known Issues

Known outstanding work and issues can be found under the riak_core repository's issue list. All issues pertaining to this subsystem have (and should be created with) the Cluster Metadata tag.

3. Tips for Debugging and Resolving Issues

The following is a list of debugging and testing tips as well as ways to resolve unforeseen issues. It is by no means a complete list but it is, hopefully, a growing one.

3.1 Viewing the Broadcast Tree

It may be useful to know which nodes are being communicated with when replicating updates across the cluster for a given node. riak_core_broadcast's debug_get_peers/2,3 and debug_get_tree/2 expose this information without needing to inspect the process state of riak_core_broadcast. The documentation of those functions has more details.

3.2 Retrieving Objects not Values

The API hides the internal representation of objects from the client. However, when debugging and testing, it is sometimes useful to see what the Metadata Manager actually has stored, especially the logical clock. In this case, riak_core_metadata_manager:get/1,2 can be used to retrieve objects locally or from the Metadata Manager on another node.

3.3 Forcing Conflicting Writes

Although not typically a good thing to do, it is useful for testing purposes to know how to force the writing of siblings. Using the Metadata Manager, this can be done easily by writing values with no logical clock:

riak_core_metadata_manager:put({FullPrefix, Key}, undefined, sibling1value),
riak_core_metadata_manager:put({FullPrefix, Key}, undefined, sibling2value).

The two calls can be run on the same node or on different nodes. If run on different nodes, the siblings won't appear until after an exchange has completed.

3.4 Managing Exchanges

Most of the following is general to the broadcast mechanism but is useful within the context of Cluster Metadata.

3.4.1 Querying Exchanges Started by a Node

The functions riak_core_broadcast:exchanges/0,1 can be used to determine which exchanges are running that were started on the current or given node, respectively. These functions return a list of 4-tuples. The 4-tuple contains the name of the module used as the riak_core_broadcast_handler for the exchange (always riak_core_metadata_manager for Cluster Metadata), the node that is the peer of the exchange, a unique reference representing the exchange, and the process id of the exchange fsm doing the work.
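
For example, assuming at least one exchange started by the local node is currently running, the list can be pattern matched directly:

%% each element is {HandlerMod, Peer, Ref, Pid}
[{_Mod, Peer, _Ref, Pid} | _] = riak_core_broadcast:exchanges(),
io:format("exchange with ~p running as ~p~n", [Peer, Pid]).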

3.4.2 Canceling Exchanges

Running exchanges may be aborted using riak_core_broadcast:cancel_exchanges/1, whose argument takes several forms.

To cancel all Cluster Metadata exchanges started by the node where this command is run:

riak_core_broadcast:cancel_exchanges({mod, riak_core_metadata_manager}).

To cancel all exchanges with a given peer that were started by the node where this command is run:

riak_core_broadcast:cancel_exchanges({peer, Peer}).

where Peer is the Erlang node name of the remote node.

To cancel a specific exchange started by the node where this command is run:

riak_core_broadcast:cancel_exchanges(PidOrRef).

where PidOrRef is a process id or unique reference found in a 4-tuple returned from riak_core_broadcast:exchanges/0.
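
Putting the two together, the pid can be taken straight from the exchange list (a sketch assuming at least one exchange started by this node is running):

[{_Mod, _Peer, _Ref, Pid} | _] = riak_core_broadcast:exchanges(),
riak_core_broadcast:cancel_exchanges(Pid).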

3.4.3 Manually Triggering Exchanges

If it is necessary to ensure that the data stored in Cluster Metadata is the same between two nodes, an exchange can be triggered manually. These exchanges are normally triggered by the broadcast mechanism, so doing this should typically not be necessary. However, it can be done by running the following from an Erlang node running Cluster Metadata:

riak_core_metadata_manager:exchange(Peer)

where Peer is the Erlang node name of another node in the cluster. This will trigger an exchange between the node the command was run on and Peer. The function does not block until the exchange completes; it only starts it. If the exchange was started successfully, {ok, Pid} will be returned, where Pid is the process id of the exchange fsm doing the work in the background. If the exchange could not be started, {error, Reason} will be returned.
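
A small sketch of triggering an exchange and checking whether it started, with Peer bound to the remote node's name:

case riak_core_metadata_manager:exchange(Peer) of
    {ok, Pid} ->
        %% the exchange fsm is now running in the background as Pid
        {started, Pid};
    {error, Reason} ->
        {not_started, Reason}
end.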

3.5 Repairing DETS Files

DETS files may need to be repaired if a node exits abnormally and the Metadata Manager was not allowed to properly close its files. When the Metadata Manager restarts, it will repair the files as they are opened (this is the default DETS behavior).

More details on DETS repair can be found in the Erlang/OTP dets documentation.

3.6 Removing Metadata Hashtree Files

Metadata Hashtrees are meant to be ephemeral, so they may be deleted without affecting a cluster. The node must first be shut down and then restarted. This ensures that the Metadata Hashtree process does not crash when the files are deleted and that the tree is rebuilt when the node starts back up.