<!DOCTYPE html>
<html>
<head>
<title>Dynamo</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link rel="stylesheet" media="all" href="docco.css" />
</head>
<body>
<div id="container">
<div id="background"></div>
<table cellpadding="0" cellspacing="0">
<thead>
<tr>
<th class="docs">
<h1>dynamo.js</h1>
</th>
<th class="code"></th>
</tr>
</thead>
<tbody>
<tr id="section-0"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-0">¶</a></div><p><strong>Hear Ye! Hear Ye! This is not done. It's a work in progress.</strong>
This is a modified version of Amazon's Dynamo Paper, annotated with information about Riak and how its design compares to Dynamo.
Basho Technologies built Riak based on many (but not all) of the ideas and design decisions set forth in this paper. We
often get questions about how closely we adhered to the principles and design decisions
put forth in the paper. I thought it would be worthwhile to annotate it with
Riak specifics.</p></td><td class="code"><p></p></td><tr id="section-1"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-1">¶</a></div><p>In the right column, you'll find the paper reprinted in its entirety, images and all.
In this, the left column, you have Riak specifics that relate to a given section of the paper;
anything from links to the Riak wiki, to code references, to explanations of why and how
we did what we did when we did it. There is also some work to do to make Riak more like Dynamo in some
ways. This is noted, too.</p></td><td class="code"><p></p></td><tr id="section-2"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-2">¶</a></div><p>The goal for this resource is to simplify the Dynamo paper in the context of Riak and better introduce Riak's
design principles to developers and technologists. I hope you enjoy it and find it useful. If there's something
you believe needs changing, drop me a note or <a href="#">submit a pull request</a>.</p></td><td class="code"><p></p></td><tr id="section-3"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-3">¶</a></div><p><a href="https://twitter.com/pharkmillups">Mark</a></p></td><td class="code"><p></p></td><tr id="section-4"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-4">¶</a></div><p><hr></p></td><td class="code"><p>
Dynamo: Amazon’s Highly Available Key-value Store
</p><p></p></td><tr id="section-5"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-5">¶</a></div><p>
This paper was first released in ... and was popularized on the blog of Werner Vogels. Since then, a
large number of databases have been inspired (either entirely or partially) by this paper.
In addition to Riak, Cassandra and Voldemort come to mind. Some of you may also remember
Dynomite (which predates all of these). I'm sure there are more.
</p></td><td class="code"><p>
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
</p><p>
Amazon.com
</p><p>
Abstract
</p><p>
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the
largest e-commerce operations in the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com platform, which provides services for
many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters around the world. At this scale,
small and large components fail continuously and the way persistent state is managed in the
face of these failures drives the reliability and scalability of the software systems.
</p><p>
This paper presents the design and implementation of Dynamo, a highly available key-value
storage system that some of Amazon’s core services use to provide an “always-on” experience.
To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios.
It makes extensive use of object versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.
</p><p>
Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management; D.4.5 [Operating Systems]: Reliability; D.4.2
[Operating Systems]: Performance;
</p><p>
General Terms
Algorithms, Management, Measurement, Performance, Design, Reliability.
</p><p>
1. Introduction
</p><p>
Amazon runs a world-wide e-commerce platform that serves tens of millions of customers at peak times using
tens of thousands of servers located in many data centers around the world. There are strict operational
requirements on Amazon’s platform in terms of performance, reliability and efficiency, and to support
continuous growth the platform needs to be highly scalable. Reliability is one of the most important
requirements because even the slightest outage has significant financial consequences and impacts customer
trust. In addition, to support continuous growth, the platform needs to be highly scalable.
</p><p>
One of the lessons our organization has learned from operating Amazon’s platform is that the reliability
and scalability of a system is dependent on how its application state is managed. Amazon uses a highly
decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In
this environment there is a particular need for storage technologies that are always available. For
example, customers should be able to view and add items to their shopping cart even if disks are failing,
network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service
responsible for managing shopping carts requires that it can always write to and read from its data
store, and that its data needs to be available across multiple data centers.
</p><p>
Dealing with failures in an infrastructure comprised of millions of components is our standard
mode of operation; there are always a small but significant number of server and network components
that are failing at any given time. As such Amazon’s software systems need to be constructed in
a manner that treats failure handling as the normal case without impacting availability or
performance.
</p><p>
To meet the reliability and scaling needs, Amazon has developed a number of storage
technologies, of which the Amazon Simple Storage Service (also available outside of Amazon and
known as Amazon S3), is probably the best known. This paper presents the design and implementation of
Dynamo, another highly available and scalable distributed data store built for Amazon’s platform.
Dynamo is used to manage the state of services that have very high reliability requirements and need
tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.
Amazon’s platform has a very diverse set of applications with different storage requirements. A select
set of applications requires a storage technology that is flexible enough to let application designers
configure their data store appropriately based on these tradeoffs to achieve high availability and
guaranteed performance in the most cost effective manner.
</p><p>
There are many services on Amazon’s platform that only need primary-key access to a data store.
For many services, such as those that provide best seller lists, shopping carts, customer
preferences, session management, sales rank, and product catalog, the common pattern of
using a relational database would lead to inefficiencies and limit scale and availability.
Dynamo provides a simple primary-key only interface to meet the requirements of these applications.
</p><p></p></td><tr id="section-6"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-6">¶</a></div><p>
Like Dynamo, Riak employs consistent hashing to partition and replicate data around the ring. For the consistent hashing that takes place in riak_core, Basho chose the SHA1 hash.
<ul>
<li><a href="http://wiki.basho.com/display/RIAK/Riak+Glossary#RiakGlossary-ConsistentHashing">Consistent Hashing on the Riak wiki</a></li>
</ul>
Riak uses vector clocks for object versioning. Scroll down to section 4.4 to read up on this in depth.
Riak makes use of gossiping in the same way that Dynamo does: to communicate ring state and node membership.
<ul>
<li><a href="https://wiki.basho.com/display/RIAK/Riak+Glossary#RiakGlossary-Gossiping">Gossip Protocol on the Riak Wiki</a></li>
</ul>
And nodes can be added and removed from your Riak cluster as needed.</p></td><td class="code"><p>Dynamo uses a synthesis of well known techniques to achieve scalability and availability:
Data is partitioned and replicated using consistent hashing [10], and consistency is facilitated
by object versioning [12]. The consistency among replicas during updates is maintained by a
quorum-like technique and a decentralized replica synchronization protocol. Dynamo employs a gossip
based distributed failure detection and membership protocol. Dynamo is a completely decentralized
system with minimal need for manual administration. Storage nodes can be added and removed from
Dynamo without requiring any manual partitioning or redistribution.
</p><p>
In the past year, Dynamo has been the underlying storage technology for a number of the
core services in Amazon’s e-commerce platform. It was able to scale to extreme peak loads
efficiently without any downtime during the busy holiday shopping season. For example, the service
that maintains shopping cart (Shopping Cart Service) served tens of millions requests that resulted
in well over 3 million checkouts in a single day and the service that manages session state handled
hundreds of thousands of concurrently active sessions.
The main contribution of this work for the research community is the evaluation of how different
techniques can be combined to provide a single highly-available system. It demonstrates that an
eventually-consistent storage system can be used in production with demanding applications. It
also provides insight into the tuning of these techniques to meet the requirements of production
systems with very strict performance demands.
The paper is structured as follows. Section 2 presents the background and Section 3 presents
the related work. Section 4 presents the system design and Section 5 describes the implementation.
Section 6 details the experiences and insights gained by running Dynamo in production and Section 7
concludes the paper. There are a number of places in this paper where additional information may
have been appropriate but where protecting Amazon’s business interests require us to reduce some
level of detail. For this reason, the intra- and inter-datacenter latencies in section 6, the
absolute request rates in section 6.2 and outage lengths and workloads in section 6.3 are
provided through aggregate measures instead of absolute details.
</p><p></p></td><tr id="section-7"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-7">¶</a></div><p>
<strong>Brief Background on Riak:</strong>
<hr></p></td><td class="code"><p>
2. Background
</p><p></p></td><tr id="section-8"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-8">¶</a></div><p>Basho Technologies started to develop Riak back in 2007 to solve an internal problem. We were,
at the time, building a web application that would require a database layer that afforded higher
availability and scale-out properties than any technology we knew of. So we (primarily Justin Sheehy,
Andy Gross, and Bryan Fink at the time) rolled our own.
After using Riak in production for several successful applications that generated revenue, we decided
to open source it and share our creation with the world.
</p></td><td class="code"><p>
Amazon’s e-commerce platform is composed of hundreds of services that work in concert
to deliver functionality ranging from recommendations to order fulfillment to fraud detection.
Each service is exposed through a well defined interface and is accessible over the network.
These services are hosted in an infrastructure that consists of tens of thousands of servers
located across many data centers world-wide. Some of these services are stateless (i.e.,
services which aggregate responses from other services) and some are stateful (i.e., a
service that generates its response by executing business logic on its state stored in
persistent store).
</p><p>
Traditionally production systems store their state in relational databases.
For many of the more common usage patterns of state persistence, however, a
relational database is a solution that is far from ideal. Most of these services only store and
retrieve data by primary key and do not require the complex querying and management functionality
offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled
personnel for its operation, making it a very inefficient solution. In addition, the available replication
technologies are limited and typically choose consistency over availability. Although many advances have
been made in the recent years, it is still not easy to scale-out databases or use smart partitioning
schemes for load balancing.
</p><p></p></td><tr id="section-9"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-9">¶</a></div><p>Riak is a highly available, scalable, open source key/value database. These notes
describe where Riak's design decisions emulated and diverged from Dynamo's (as described in this paper).
Riak offers several query methods in addition to the standard key/value interface,
is made to be highly available, is efficient in its resource usage, and has a simple scale-out
story to accompany data and traffic growth.</p></td><td class="code"><p>
This paper describes Dynamo, a highly available data storage technology
that addresses the needs of these important classes of services. Dynamo has a simple key/value
interface, is highly available with a clearly defined consistency window, is efficient in its
resource usage, and has a simple scale out scheme to address growth in data set size or request
rates. Each service that uses Dynamo runs its own Dynamo instances.</p></td><tr id="section-10"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-10">¶</a></div><p></p></td><td class="code"><p>2.1 System Assumptions and Requirements
</p><p>
The storage system for this class of services has the following requirements:
</p><p></p></td><tr id="section-11"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-11">¶</a></div><p><strong>Riak's Query Model</strong>
We've extended Dynamo's proposed query model in several ways. Currently Riak offers:
<ol>
<li>Standard key/value access (GET, PUT, DELETE)</li>
<li>MapReduce querying </li>
<li>Links and Link Walking (which is analogous to relationships in GraphDBs and is a form
of MapReduce under the hood)</li>
<li>Secondary Indexing</li>
<li>Full-text Search</li>
</ol>
Riak's realistic object size limit is around 5MB. Basho does offer a commercial extension to Riak called
<a href="#">Riak CS</a> that can store objects well into the 10s of GB range at the time of this writing.</p>
</p><p>
Query Model: simple read and write operations to a data item that is uniquely identified by a key.
State is stored as binary objects (i.e., blobs) identified by unique keys. No operations span multiple
data items and there is no need for relational schema. This requirement is based on the observation that
a significant portion of Amazon’s services can work with this simple query model and do not need any
relational schema. Dynamo targets applications that need to store objects that are relatively small
(usually less than 1 MB).
</p><p></p></td><tr id="section-12"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-12">¶</a></div><p><strong>ACID Properties:</strong>
Riak offers no traditional "ACID" semantics around transactions. Instead, it's built to
be "evenutually consistent." We did this because we were of the opinion (and our users proved this)
that most applications don't require heavy transactions. (Even ATMs are eventually consistent.)
<br/>
<br/>
Riak stresses availability and puts the power in the hands of developers and ops professionals to
tweak and tune the consistency requirements.
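</p>
<p>As a quick, hedged sketch of what that tuning looks like in practice (the bucket, key, and node
address here are hypothetical), reads and writes can carry per-request quorum parameters over the
HTTP API: r is how many replicas must answer a read, w how many must acknowledge a write.</p>
<pre><code>import requests

# Hypothetical bucket/key on a local node with the default HTTP interface.
url = "http://localhost:8098/riak/carts/user-123"
resp = requests.get(url, params={"r": 2})   # 2 replicas must answer the read
requests.put(url, data=resp.content, params={"w": 2},  # 2 must ack the write
             headers={"Content-Type": "application/octet-stream"})
</code></pre><p>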
</p></td><td class="code"><p>
ACID Properties: ACID (Atomicity, Consistency, Isolation, Durability) is a
set of properties that guarantee that database transactions are processed reliably. In the context of
databases, a single logical operation on the data is called a transaction. Experience at Amazon has
shown that data stores that provide ACID guarantees tend to have poor availability. This has been widely
acknowledged by both the industry and academia [5]. Dynamo targets applications that operate with weaker
consistency (the “C” in ACID) if this results in high availability. Dynamo does not provide any isolation
guarantees and permits only single key updates.
</p><p></p></td><tr id="section-13"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-13">¶</a></div><p><strong>Efficiency</strong>
Agreed. Riak is made to (and will!) scale linearly on commodity hardware (often called "pizza boxes").
<ul><li><a href="http://wiki.basho.com/display/RIAK/Hosting+and+Server+Configuration#HostingandServerConfiguration-Hardware">More on Hardware</a></li></ul>
</p></td><td class="code"><p>
Efficiency: The system needs to function on a commodity hardware infrastructure. In Amazon’s platform, services have stringent latency requirements which are
in general measured at the 99.9th percentile of the distribution. Given that state access plays a crucial
role in service operation the storage system must be capable of meeting such stringent SLAs (see Section
2.2 below). Services must be able to configure Dynamo such that they consistently achieve their latency and
throughput requirements. The tradeoffs are in performance, cost efficiency, availability, and durability
guarantees.
</p><p>
Other Assumptions: Dynamo is used only by Amazon’s internal services. Its operation environment is assumed
to be non-hostile and there are no security related requirements such as authentication and authorization.
Moreover, since each service uses its distinct instance of Dynamo, its initial design targets a scale of
up to hundreds of storage hosts. We will discuss the scalability limitations of Dynamo and possible
scalability related extensions in later sections.
</p><p></p></td><tr id="section-14"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-14">¶</a></div><p><strong>Riak Loves SLAs</strong></p></td><td class="code"><p>
2.2 Service Level Agreements (SLA)
</p><p></p></td><tr id="section-15"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-15">¶</a></div><p>
Much like Amazon built Dynamo to guarantee their applications were always available to retail shoppers,
the design decisions in Riak were taken to ensure that developers could sleep well knowing that their
database would always be available to serve requests.
Many of our clients and open source users have explicit uptime agreements related to their applications and
services built on Riak. This was not an accident.
</p></td><td class="code"><p>
To guarantee that the application can deliver its functionality in a bounded time, each and every
dependency in the platform needs to deliver its functionality with even tighter bounds. Clients and
services engage in a Service Level Agreement (SLA), a formally negotiated contract where a client and
a service agree on several system-related characteristics, which most prominently include the client’s
expected request rate distribution for a particular API and the expected service latency under those
conditions. An example of a simple SLA is a service guaranteeing that it will provide a response within
300ms for 99.9% of its requests for a peak client load of 500 requests per second.
</p><p>
In Amazon’s decentralized service oriented infrastructure, SLAs play an important role. For example
a page request to one of the e-commerce sites typically requires the rendering engine to construct
its response by sending requests to over 150 services. These services often have multiple dependencies,
which frequently are other services, and as such it is not uncommon for the call graph of an application
to have more than one level. To ensure that the page rendering engine can maintain a clear bound on page
delivery each service within the call chain must obey its performance contract.
</p><p>
Figure 1 shows an abstract view of the architecture of Amazon’s platform, where dynamic web content
is generated by page rendering components which in turn query many other services. A service can use
different data stores to manage its state and these data stores are only accessible within its service
boundaries. Some services act as aggregators by using several other services to produce a composite
response. Typically, the aggregator services are stateless, although they use extensive caching.
</p><p>
Figure 1: Service-oriented architecture of Amazon’s platform.
</p><p>
A common approach in the industry for forming a performance oriented SLA is to describe it using average,
median and expected variance. At Amazon we have found that these metrics are not good enough if the goal
is to build a system where all customers have a good experience, rather than just the majority. For
example if extensive personalization techniques are used then customers with longer histories require
more processing which impacts performance at the high-end of the distribution. An SLA stated in terms
of mean or median response times will not address the performance of this important customer segment.
To address this issue, at Amazon, SLAs are expressed and measured at the 99.9th percentile of the
distribution. The choice for 99.9% over an even higher percentile has been made based on a cost-benefit
analysis which demonstrated a significant increase in cost to improve performance that much. Experiences
with Amazon’s production systems have shown that this approach provides a better overall experience
compared to those systems that meet SLAs defined based on the mean or median.
</p><p>
In this paper there are many references to this 99.9th percentile of distributions, which reflects
Amazon engineers’ relentless focus on performance from the perspective of the customers’ experience.
Many papers report on averages, so these are included where it makes sense for comparison purposes.
Nevertheless, Amazon’s engineering and optimization efforts are not focused on averages. Several
techniques, such as the load balanced selection of write coordinators, are purely targeted at
controlling performance at the 99.9th percentile.
</p><p>
Storage systems often play an important role in establishing a service’s SLA, especially if the
business logic is relatively lightweight, as is the case for many Amazon services. State management
then becomes the main component of a service’s SLA. One of the main design considerations for Dynamo
is to give services control over their system properties, such as durability and consistency,
and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.
</p><p></p></td><tr id="section-16"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-16">¶</a></div><p><strong>Riak's Design Considerations</strong></p></td><td class="code"><p>
2.3 Design Considerations
</p><p></p></td><tr id="section-17"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-17">¶</a></div><p>Availability under any circumstances was something we stressed when
designing Riak, too. Most databases didn't enable developers to do this in a simple way,
so we set out to change this.</p></td><td class="code"><p>
Data replication algorithms used in commercial systems traditionally perform synchronous replica
coordination in order to provide a strongly consistent data access interface. To achieve this level of
consistency, these algorithms are forced to tradeoff the availability of the data under certain failure
scenarios. For instance, rather than dealing with the uncertainty of the correctness of an answer, the
data is made unavailable until it is absolutely certain that it is correct. From the very early replicated
database works, it is well known that when dealing with the possibility of network failures, strong
consistency and high data availability cannot be achieved simultaneously [2, 11]. As such systems and
applications need to be aware which properties can be achieved under which conditions.
</p><p></p></td><tr id="section-18"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-18">¶</a></div><p>Remember Eventual Consistency? We followed Dynamo's lead here and
made sure that Riak could withstand network, server and other failures
by sacrificing absolute consistency and building in mechanisms to rectify
object conflicts.</p></td><td class="code"><p>
For systems prone to server and network failures, availability can be increased by using optimistic
replication techniques, where changes are allowed to propagate to replicas in the background, and
concurrent, disconnected work is tolerated. The challenge with this approach is that it can lead to
conflicting changes which must be detected and resolved. This process of conflict resolution introduces
two problems: when to resolve them and who resolves them. Dynamo is designed to be an eventually
consistent data store; that is all updates reach all replicas eventually.
</p><p></p></td><tr id="section-19"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-19">¶</a></div><p>Ditto</p></td><td class="code"><p>
An important design consideration is to decide when to perform the process of resolving update
conflicts, i.e., whether conflicts should be resolved during reads or writes. Many traditional
data stores execute conflict resolution during writes and keep the read complexity simple [7].
In such systems, writes may be rejected if the data store cannot reach all (or a majority of)
the replicas at a given time. On the other hand, Dynamo targets the design space of an “always
writeable” data store (i.e., a data store that is highly available for writes). For a number of
Amazon services, rejecting customer updates could result in a poor customer experience. For
instance, the shopping cart service must allow customers to add and remove items from their
shopping cart even amidst network and server failures. This requirement forces us to push the
complexity of conflict resolution to the reads in order to ensure that writes are never rejected.
</p><p></p></td><tr id="section-20"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-20">¶</a></div><p>No conflict here (pun intended). Riak follows this approach to conflict resolution.</p></td><td class="code"><p>
The next design choice is who performs the process of conflict resolution. This can be done by
the data store or the application. If conflict resolution is done by the data store, its choices
are rather limited. In such cases, the data store can only use simple policies, such as “last
write wins” [22], to resolve conflicting updates. On the other hand, since the application is
aware of the data schema it can decide on the conflict resolution method that is best suited for
its client’s experience. For instance, the application that maintains customer shopping carts
can choose to “merge” the conflicting versions and return a single unified shopping cart. Despite
this flexibility, some application developers may not want to write their own conflict resolution
mechanisms and choose to push it down to the data store, which in turn chooses a simple policy
such as “last write wins”.
</p><p>
Other key principles embraced in the design are:
</p><p></p></td><tr id="section-21"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-21">¶</a></div><p>We refer to hosts as "nodes", too. Riak provides a simple set of commands to start
and join nodes to a running cluster. With proper capacity planning, this process should
be painless for the ops team and devs, and imperceptible to the client.</p></td><td class="code"><p>
Incremental scalability: Dynamo should be able to scale out one storage host (henceforth, referred
to as “node”) at a time, with minimal impact on both operators of the system and the system itself.
</p><p></p></td><tr id="section-22"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-22">¶</a></div><p>Again, we agree. Each storage node is the same as its neighbors. Any node can coordinate
a request and, in the event that a node goes down, its neighbors can cover for it until
it's restarted or decommissioned.</p></td><td class="code"><p>
Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers;
there should be no distinguished node or nodes that take special roles or extra set of responsibilities.
In our experience, symmetry simplifies the process of system provisioning and maintenance.
</p><p></p></td><tr id="section-23"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-23">¶</a></div><p>A Riak cluster is completely decentralized. No single node is special and this leads to
no single points of failure.</p></td><td class="code"><p>
Decentralization: An extension of symmetry, the design should favor decentralized peer-to-peer
techniques over centralized control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more
available system.
</p><p></p></td><tr id="section-24"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-24">¶</a></div><p>Riak agrees.</p></td><td class="code"><p>
Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it
runs on. e.g. the work distribution must be proportional to the capabilities of the individual
servers. This is essential in adding new nodes with higher capacity without having to upgrade all
hosts at once.
</p><p></p></td><tr id="section-25"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-25">¶</a></div><p>This section is a wonderful survey of the prior art in distributed storage that informed Dynamo's
design. It isn't required reading to understand Riak, but do give it a look when you have a moment.</p></td><td class="code"><p>
3. Related Work
</p><p>
3.1 Peer to Peer Systems
</p><p>
There are several peer-to-peer (P2P) systems that have looked at the problem of data storage and distribution.
The first generation of P2P systems, such as Freenet and Gnutella, were predominantly used as file sharing
systems. These were examples of unstructured P2P networks where the overlay links between peers were
established arbitrarily. In these networks, a search query is usually flooded through the network to
find as many peers as possible that share the data. P2P systems evolved to the next generation into
what is widely known as structured P2P networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to some peer that has the desired data.
Systems like Pastry [16] and Chord [20] use routing mechanisms to ensure that queries can be answered
within a bounded number of hops. To reduce the additional latency introduced by multi-hop routing, some
P2P systems (e.g., [14]) employ O(1) routing where each peer maintains enough routing information locally
so that it can route requests (to access a data item) to the appropriate peer within a constant number of hops.
</p><p>
Various storage systems, such as Oceanstore [9] and PAST [17] were built on top of these routing
overlays. Oceanstore provides a global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for concurrent updates while avoiding many
of the problems inherent with wide-area locking, it uses an update model based on conflict resolution.
Conflict resolution was introduced in [21] to reduce the number of transaction aborts. Oceanstore resolves
conflicts by processing a series of updates, choosing a total order among them, and then applying them
atomically in that order. It is built for an environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple abstraction layer on top of Pastry for persistent
and immutable objects. It assumes that the application can build the necessary storage semantics (such as
mutable files) on top of it.
</p><p>
3.2 Distributed File Systems and Databases
</p><p>
Distributing data for performance, availability and durability has been widely studied in
the file system and database systems community. Compared to P2P storage systems that only
support flat namespaces, distributed file systems typically support hierarchical namespaces.
Systems like Ficus [15] and Coda [19] replicate files for high availability at the expense of
consistency. Update conflicts are typically managed using specialized conflict resolution procedures.
The Farsite system [1] is a distributed file system that does not use any centralized server like NFS.
Farsite achieves high availability and scalability using replication. The Google File System [6] is
another distributed file system built for hosting the state of Google’s internal applications. GFS
uses a simple design with a single master server for hosting the entire metadata and where the data
is split into chunks and stored in chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data consistency [21].
</p><p>
Among these systems, Bayou, Coda and Ficus allow disconnected operations and are resilient to
issues such as network partitions and outages. These systems differ on their conflict resolution
procedures. For instance, Coda and Ficus perform system level conflict resolution and Bayou allows
application level resolution. All of them, however, guarantee eventual consistency. Similar to these
systems, Dynamo allows read and write operations to continue even during network partitions and
resolves updated conflicts using different conflict resolution mechanisms. Distributed block storage
systems like FAB [18] split large size objects into smaller blocks and stores each block in a highly
available manner. In comparison to these systems, a key-value store is more suitable in this case
because: (a) it is intended to store relatively small objects (size < 1M) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a wide-area distributed storage
system designed to handle multiple server failures [23]. It uses a secure log to preserve data
integrity, replicates each log on multiple servers for durability, and uses Byzantine fault tolerance
protocols to ensure data consistency. In contrast to Antiquity, Dynamo does not focus on the problem
of data integrity and security and is built for a trusted environment. Bigtable is a distributed
storage system for managing structured data. It maintains a sparse, multi-dimensional sorted map
and allows applications to access their data using multiple attributes [2]. Compared to Bigtable,
Dynamo targets applications that require only key/value access with primary focus on high availability
where updates are not rejected even in the wake of network partitions or server failures.
</p><p>
Traditional replicated relational database systems focus on the problem of guaranteeing strong
consistency to replicated data. Although strong consistency provides the application writer a
convenient programming model, these systems are limited in scalability and availability [7]. These
systems are not capable of handling network partitions because they typically provide strong
consistency guarantees.
</p><p>
3.3 Discussion
</p><p>
Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements.
First, Dynamo is targeted mainly at applications that need an “always writeable” data store where no
updates are rejected due to failures or concurrent writes. This is a crucial requirement for many Amazon
applications. Second, as noted earlier, Dynamo is built for an infrastructure within a single administrative
domain where all nodes are assumed to be trusted. Third, applications that use Dynamo do not require
support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported
by traditional databases). Fourth, Dynamo is built for latency sensitive applications that require at
least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these
stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes
(which is the typical design adopted by several distributed hash table systems such as Chord and Pastry).
This is because multi-hop routing increases variability in response times, thereby increasing the latency
at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains enough
routing information locally to route a request to the appropriate node directly.
</p><p></p></td><tr id="section-26"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-26">¶</a></div><p><strong>System Architecture</strong></p></td><td class="code"><p>
4. System Architecture
</p><p></p></td><tr id="section-27"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-27">¶</a></div><p>This is truly the meat of the Dynamo paper. Stick around. It gets good. Trust me.</p></td><td class="code"><p>
The architecture of a storage system that needs to operate in a production setting is complex. In addition
to the actual data persistence component, the system needs to have scalable and robust solutions for load
balancing, membership and failure detection, failure recovery, replica synchronization, overload handling,
state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and
alarming, and configuration management. Describing the details of each of the solutions is not possible,
so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication,
versioning, membership, failure handling and scaling. Table 1 presents a summary of the list of techniques
Dynamo uses and their respective advantages.
</p><p>
Table 1: Summary of techniques used in Dynamo and their advantages.
</p><p>
</p><p>
4.1 System Interface
</p><p></p></td><tr id="section-28"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-28">¶</a></div><p>
Whereas Dynamo only has the concept of keys, we added a higher level of organization called a "bucket."
Keys are stored in buckets and buckets are the level at which several Riak properties can be configured
(primarily the "N" value, or the replication value.) In addition to the bucket+key identifier and value, Riak
will also return the associated metadata for a given object with each get or put.
Riak has two APIs (a short usage sketch follows the list below):
<ul>
<li><a href="#">Riak's HTTP API</a></li>
<li><a href="#">Riak's Protocol Buffers API</a></li>
</ul>
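</p>
<p>Here's a minimal, hedged sketch of the same get/put-with-context idea over Riak's HTTP API. The
node address and bucket/key are hypothetical; the opaque context rides along in the
<code>X-Riak-Vclock</code> header:</p>
<pre><code>import requests

BASE = "http://localhost:8098/riak"  # assumed default HTTP interface

def get(bucket, key):
    r = requests.get("%s/%s/%s" % (BASE, bucket, key))
    r.raise_for_status()
    # The context (cf. Dynamo's put(key, context, object)) travels
    # in the X-Riak-Vclock header.
    return r.content, r.headers.get("X-Riak-Vclock")

def put(bucket, key, value, vclock=None):
    headers = {"Content-Type": "application/octet-stream"}
    if vclock:
        headers["X-Riak-Vclock"] = vclock  # echo the read context back
    requests.put("%s/%s/%s" % (BASE, bucket, key), data=value, headers=headers)

# value, ctx = get("carts", "user-123")  # then: put("carts", "user-123", value, ctx)
</code></pre><p>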
</p></td><td class="code"><p>
Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get()
and put(). The get(key) operation locates the object replicas associated with the key in the storage
system and returns a single object or a list of objects with conflicting versions along with a context.
The put(key, context, object) operation determines where the replicas of the object should be placed based
on the associated key, and writes the replicas to disk. The context encodes system metadata about the
object that is opaque to the caller and includes information such as the version of the object. The context
information is stored along with the object so that the system can verify the validity of the context object
supplied in the put request.
</p><p></p></td><tr id="section-29"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-29">¶</a></div><p>Riak concatenates the bucket with the key and runs it through the SHA1 hash to generate a 160-bit identifier
which is then used to determine where in the database each datum is stored. Riak treats data as an opaque
binary, thus enabling users to store virtually anything.</p>
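<p>A minimal sketch of the idea in Python (the exact byte encoding of the bucket/key pair inside
Riak is an implementation detail, so the separator below is illustrative):</p>
<pre><code>import hashlib

RING_TOP = 2 ** 160  # SHA-1 yields 160 bits, so the ring has 2^160 positions

def ring_position(bucket, key):
    # Hash the bucket/key pair to an integer position on the ring.
    digest = hashlib.sha1(bucket + b"/" + key).digest()
    return int.from_bytes(digest, "big")  # always a value below RING_TOP

print(ring_position(b"carts", b"user-123"))
</code></pre></td><td class="code"><p>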
Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies a MD5
hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are
responsible for serving the key.
</p><p></p></td><tr id="section-30"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-30">¶</a></div><p><strong>Partitioning in Riak</strong></p></td><td class="code"><p>
4.2 Partitioning Algorithm
</p><p></p></td><tr id="section-31"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-31">¶</a></div><p>As mentioned above, Riak uses consistent hashing to distribute data around the ring to the partitions responsible
for storing it. The ring has a maximum key space of 2^160. Each bucket+key (and its associated value)
is hashed to a location on the ring.
Riak also breaks the ring into a set number of partitions. This number is configured when a cluster is first built.
Each node is responsible for storing the data hashed to a set number of partitions,
and each storage node will, optimistically, handle an equal number of partitions.
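</p>
<p>A tiny sketch of the position-to-partition mapping (the partition count is whatever the cluster
was created with; 64 here is just a common default):</p>
<pre><code>RING_TOP = 2 ** 160     # size of the SHA-1 key space
NUM_PARTITIONS = 64     # fixed at cluster creation; 64 is a common default

def partition_index(position):
    # Each partition owns an equal, contiguous slice of the ring.
    return position // (RING_TOP // NUM_PARTITIONS)
</code></pre><p>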
</p></td><td class="code"><p>
One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism
to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo’s partitioning
scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10],
the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value
wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which
represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the
data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with
a position larger than the item’s position. Thus, each node becomes responsible for the region in the ring between
it and its predecessor node on the ring. The principle advantage of consistent hashing is that departure or
arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
</p><p></p></td><tr id="section-32"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-32">¶</a></div><p>
Riak also has the concept of virtual nodes, and they are used to the same end as they are in Dynamo.
Physical storage nodes are responsible for partitions, and each partition is managed by a vnode.
</p></td><td class="code"><p>
The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each
node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the
heterogeneity in the performance of nodes. To address these issues, Dynamo uses a variant of consistent hashing
(similar to the one used in [10, 20]): instead of mapping a node to a single point in the circle, each node gets
assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node
looks like a single node in the system, but each node can be responsible for more than one virtual node.
Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”)
in the ring. The process of fine-tuning Dynamo’s partitioning scheme is discussed in Section 6.
</p><p></p></td><tr id="section-33"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-33">¶</a></div><p>All of these properties for vnodes in Dynamo hold true for Riak, too.</p></td><td class="code"><p>
Using virtual nodes has the following advantages:
</p><p>
If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is
evenly dispersed across the remaining available nodes.
</p><p>
When a node becomes available again, or a new node is added to the system, the newly available node accepts
a roughly equivalent amount of load from each of the other available nodes.
</p><p></p></td><tr id="section-34"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-34">¶</a></div><p>
<u>Further Reading on Partitioning in Riak</u>
<ul>
<li><a href ="#">All about the Riak Ring</a></li>
</ul>
</p></td><td class="code"><p>
The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for
heterogeneity in the physical infrastructure.
</p><p></p></td><tr id="section-35"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-35">¶</a></div><p><strong>Replication</strong></p></td><td class="code"><p>
4.3 Replication
</p><p></p></td><tr id="section-36"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-36">¶</a></div><p>Replication in Riak, like in Dynamo, is fundamental and automatic. Remember above I introduced the
concept of a bucket? In Riak, the replication parameter, "N" (also called "n_val"),
is configurable at the bucket level. The default n_val in Riak is 3, meaning that
out of the box Riak will store three replicas of your data on three different partitions on the ring.</p>
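<p>As a hedged sketch (the bucket name is made up, and the endpoint shape is assumed from the HTTP
API of this era), setting n_val on a bucket might look like:</p>
<pre><code>import json
import requests

# Set the replication factor for a hypothetical "carts" bucket.
requests.put(
    "http://localhost:8098/riak/carts",
    data=json.dumps({"props": {"n_val": 3}}),
    headers={"Content-Type": "application/json"},
)
</code></pre></td><td class="code"><p>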
</p><p>
To achieve high availability and durability, Dynamo replicates its data on multiple hosts. Each data item is
replicated at N hosts, where N is a parameter configured “per-instance”. Each key, k, is assigned to a coordinator
node (described in the previous section). The coordinator is in charge of the replication of the data items that
fall within its range. In addition to locally storing each key within its range, the coordinator replicates
these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is
responsible for the region of the ring between it and its Nth predecessor. In Figure 2, node B replicates
the key k at nodes C and D in addition to storing it locally. Node D will store the keys that fall in the
ranges (A, B], (B, C], and (C, D].
</p><p>
</p><p>
Figure 2: Partitioning and replication of keys in Dynamo ring.
</p><p></p></td><tr id="section-37"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-37">¶</a></div><p>This diagram is applicable to Riak and the manner in which it replicates data.
The preference list is present in Riak, too, and is the reason why any node in the ring
can coordinate a request. The node receives a request, consults the preference list,
and routes the request accordingly.</p>
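<p>A small sketch of preference-list construction under these rules (the ring and node names are
made up; this illustrates the skip-duplicate-owners walk the paper describes, not Riak's actual code):</p>
<pre><code>def preference_list(position, ring, n):
    # ring: list of (ring_position, physical_node) pairs, sorted by position.
    # Walk clockwise from `position`, skipping vnodes whose physical
    # owner is already represented, until n distinct nodes are chosen.
    start = next((i for i, (pos, _) in enumerate(ring) if pos >= position), 0)
    chosen, owners = [], set()
    for step in range(len(ring)):
        pos, node = ring[(start + step) % len(ring)]
        if node not in owners:
            owners.add(node)
            chosen.append((pos, node))
            if len(chosen) == n:
                break
    return chosen

# Tiny ring: four vnodes spread over three physical nodes.
ring = [(10, "a"), (20, "b"), (30, "a"), (40, "c")]
print(preference_list(25, ring, 2))  # [(30, 'a'), (40, 'c')]
</code></pre></td><td class="code"><p>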
The list of nodes that is responsible for storing a particular key is called the preference list. The system
is designed, as will be explained in Section 4.8, so that every node in the system can determine which nodes
should be in this list for any particular key. To account for node failures, preference list contains more
than N nodes. Note that with the use of virtual nodes, it is possible that the first N successor positions
for a particular key may be owned by less than N distinct physical nodes (i.e. a node may hold more than one
of the first N positions). To address this, the preference list for a key is constructed by skipping
positions in the ring to ensure that the list contains only distinct physical nodes.
</p><p></p></td><tr id="section-38"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-38">¶</a></div><p>Data Versioning</p></td><td class="code"><p>
4.4 Data Versioning
</p><p></p></td><tr id="section-39"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-39">¶</a></div><p>Riak is an "eventually consistent" database. All replication is done asynchronously, which,
as you'd expect, could result in a datum being returned to the client that is out of
date. But don't worry. We built in some mechanisms to address this.</p></td><td class="code"><p>
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas
asynchronously. A put() call may return to its caller before the update has been applied at all the
replicas, which can result in scenarios where a subsequent get() operation may return an object that
does not have the latest updates. If there are no failures then there is a bound on the update
propagation times. However, under certain failure scenarios (e.g., server outages or network partitions),
updates may not arrive at all replicas for an extended period of time.
</p><p></p></td><tr id="section-40"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-40">¶</a></div><p>Much like Dynamo was suited to the design of the shopping cart, Riak, and its tradeoffs,
are appropriate for a certain set of use cases. We happen to feel that <em>most</em> use cases
can tolerate some level of eventual consistency.</p></td><td class="code"><p>
There is a category of applications in Amazon’s platform that can tolerate such inconsistencies and
can be constructed to operate under these conditions. For example, the shopping cart application requires
that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart
is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful
and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of
the cart, which itself may contain changes that should be preserved. Note that both “add to cart” and
“delete item from cart” operations are translated into put requests to Dynamo. When a customer wants to
add an item to (or remove from) a shopping cart and the latest version is not available, the item is added
to (or removed from) the older version and the divergent versions are reconciled later.
</p><p></p></td><tr id="section-41"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-41">¶</a></div><p>The same holds true for Riak. If, by way of some failure and concurrent update
(rare but quite possible), there come to exist multiple versions of the same object,
Riak will push this decision down to the client (who are we to tell you which is the
authoritative object?). All that said, if your application doesn't need this level of
version control, we enable you to turn the usage of vector clocks on and off at the bucket
level.</p></td><td class="code"><p>
In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and
immutable version of the data. It allows for multiple versions of an object to be present in the system at
the same time. Most of the time, new versions subsume the previous version(s), and the system itself can
determine the authoritative version (syntactic reconciliation). However, version branching may happen, in
the presence of failures combined with concurrent updates, resulting in conflicting versions of an object.
In these cases, the system cannot reconcile the multiple versions of the same object and the client must
perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic
reconciliation). A typical example of a collapse operation is “merging” different versions of a customer’s
shopping cart. Using this reconciliation mechanism, an “add to cart” operation is never lost. However,
deleted items can resurface.
</p><p></p></td><tr id="section-42"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-42">¶</a></div><p>Ditto</p></td><td class="code"><p>
It is important to understand that certain failure modes can potentially result in the system having not
just two but several versions of the same data. Updates in the presence of network partitions and node
failures can potentially result in an object having distinct version sub-histories, which the system will
need to reconcile in the future. This requires us to design applications that explicitly acknowledge the
possibility of multiple versions of the same data (in order to never lose any updates).
</p><p></p></td><tr id="section-43"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-43">¶</a></div><p>As you've probably figured out already, Riak uses vector clocks for object versioning, too.
Here's a whole host of resources to keep you busy for a while:
<ul>
<li><a href ="http://wiki.basho.com/display/RIAK/Riak+Glossary#RiakGlossary-VectorClock">Vector Clocks on the Riak Wiki</a></li>
<li><a href="http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/">Why Vector Clocks are Easy</a></li>
<li><a href="http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/">Why Vector Clocks are Hard</a></li>
<li><a href="http://en.wikipedia.org/wiki/Vector_clock">Vector Clocks on Wikipedia</a></li>
</ul>
</p></td><td class="code"><p>
Dynamo uses vector clocks [12] in order to capture causality between different versions of the same object.
A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every
version of every object. One can determine whether two versions of an object are on parallel branches
or have a causal ordering, by examining their vector clocks. If every counter in the first object’s clock
is less than or equal to the corresponding counter in the second clock, then the first is an ancestor of the
second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require
reconciliation.
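</p><p>
To make the comparison rule concrete, here is a minimal sketch in Python, representing a clock
as a dict mapping node to counter (the representation and function names are ours, not Dynamo’s):
</p><pre><code>def descends(a, b):
    """True if clock a is a descendant of (or equal to) clock b, i.e. no
    counter in b exceeds the matching counter in a."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def concurrent(a, b):
    """Two clocks conflict when neither descends from the other."""
    return not descends(a, b) and not descends(b, a)

# [(Sx, 2)] descends from [(Sx, 1)]: the older version can be forgotten.
assert descends({"Sx": 2}, {"Sx": 1})
# [(Sx, 2), (Sy, 1)] and [(Sx, 2), (Sz, 1)] require reconciliation.
assert concurrent({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1})
</code></pre><p>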
</p><p>
In Dynamo, when a client wishes to update an object, it must specify which version it is updating.
This is done by passing the context it obtained from an earlier read operation, which contains the
vector clock information. Upon processing a read request, if Dynamo has access to multiple branches
that cannot be syntactically reconciled, it will return all the objects at the leaves, with the
corresponding version information in the context. An update using this context is considered to have
reconciled the divergent versions and the branches are collapsed into a single new version.
</p><p>
Figure 3: Version evolution of an object over time.
</p><p>
To illustrate the use of vector clocks, let us consider the example shown in Figure 3. A client writes a
new object. The node (say Sx) that handles the write for this key increases its sequence number and uses
it to create the data's vector clock. The system now has the object D1 and its associated clock [(Sx, 1)].
The client updates the object. Assume the same node handles this request as well. The system now also has
object D2 and its associated clock [(Sx, 2)]. D2 descends from D1 and therefore overwrites D1; however,
there may be replicas of D1 lingering at nodes that have not yet seen D2. Let us assume that the same
client updates the object again and a different server (say Sy) handles the request. The system now has
data D3 and its associated clock [(Sx, 2), (Sy, 1)].
</p><p>
Next assume a different client reads D2 and then tries to update it, and another node (say Sz) does the
write. The system now has D4 (descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A node that is
aware of D1 or D2 could determine, upon receiving D4 and its clock, that D1 and D2 are overwritten by the
new data and can be garbage collected. A node that is aware of D3 and receives D4 will find that there
is no causal relation between them. In other words, there are changes in D3 and D4 that are not reflected
in each other. Both versions of the data must be kept and presented to a client (upon a read) for
semantic reconciliation.
</p><p>
Now assume some client reads both D3 and D4 (the context will reflect that
both values were found by the read). The read's context is a summary of the clocks of D3 and D4, namely
[(Sx, 2), (Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node Sx
coordinates the write, Sx will update its sequence number in the clock. The new
data D5 will have the following clock: [(Sx, 3), (Sy, 1), (Sz, 1)].
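</p><p>
The whole Figure 3 scenario can be replayed with the dict representation from the sketch above;
coordinate_write is our name for the counter increment the coordinating node performs:
</p><pre><code>def coordinate_write(node, context):
    """The coordinating node bumps its own counter in the clock summarized
    by the read context."""
    clock = dict(context)
    clock[node] = clock.get(node, 0) + 1
    return clock

d1 = coordinate_write("Sx", {})    # [(Sx, 1)]
d2 = coordinate_write("Sx", d1)    # [(Sx, 2)], subsumes D1
d3 = coordinate_write("Sy", d2)    # [(Sx, 2), (Sy, 1)]
d4 = coordinate_write("Sz", d2)    # [(Sx, 2), (Sz, 1)], a branch off D2

# A read that returns both leaves carries a context summarizing their clocks.
context = {n: max(d3.get(n, 0), d4.get(n, 0)) for n in d3.keys() | d4.keys()}
d5 = coordinate_write("Sx", context)
assert d5 == {"Sx": 3, "Sy": 1, "Sz": 1}    # matches the paper's D5
</code></pre><p>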
</p><p>
</p><p></p></td><tr id="section-44"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-44">¶</a></div><p>Riak does a certain amount of vector clock pruning to ensure their growth is kept under control.</p></td><td class="code"><p>
A possible issue with vector clocks is that the size of vector clocks may grow if many servers
coordinate the writes to an object. In practice, this is not likely because the writes are usually
handled by one of the top N nodes in the preference list. In case of network partitions or multiple
server failures, write requests may be handled by nodes that are not in the top N nodes in the
preference list causing the size of vector clock to grow. In these scenarios, it is desirable to limit
the size of vector clock. To this end, Dynamo employs the following clock truncation scheme: Along with
each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the
data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10),
the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies
in reconciliation as the descendant relationships cannot be derived accurately. However, this problem
has not surfaced in production and therefore this issue has not been thoroughly investigated.
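</p><p>
A sketch of that truncation rule, storing a wall-clock timestamp beside each (node, counter)
pair; the triple layout and threshold handling here are our assumptions:
</p><pre><code>import time

THRESHOLD = 10  # the example threshold from the paper

def touch(clock, node):
    """Record a write coordinated by node; once the clock exceeds the
    threshold, drop the least recently updated pair."""
    counter, _ = clock.get(node, (0, 0.0))
    clock[node] = (counter + 1, time.time())
    if len(clock) > THRESHOLD:
        oldest = min(clock, key=lambda n: clock[n][1])
        del clock[oldest]  # may lose ancestry information, as noted above
    return clock
</code></pre><p>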
</p><p></p></td><tr id="section-45"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-45">¶</a></div><p><strong>Execution of get () and put () operations</strong></p></td><td class="code"><p>
4.5 Execution of get () and put () operations
</p><p></p></td><tr id="section-46"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-46">¶</a></div><p>Any node in the Riak ring can coordinate a request. The Riak information in this section
applies to a failure-free environment.</p></td><td class="code"><p>
Any storage node in Dynamo is eligible to receive client get and put operations for any key. In this
section, for sake of simplicity, we describe how these operations are performed in a failure-free environment
and in the subsequent section we describe how read and write operations are executed during failures.
</p><p>
Both get and put operations are invoked using Amazon’s infrastructure-specific request processing framework
over HTTP. There are two strategies that a client can use to select a node: (1) route its request through a
generic load balancer that will select a node based on load information, or (2) use a partition-aware client
library that routes requests directly to the appropriate coordinator nodes. The advantage of the first
approach is that the client does not have to link any code specific to Dynamo in its application, whereas
the second strategy can achieve lower latency because it skips a potential forwarding step.
</p><p>
A node handling a read or write operation is known as the coordinator. Typically, this is the first
among the top N nodes in the preference list. If the requests are received through a load balancer,
requests to access a key may be routed to any random node in the ring. In this scenario, the node that
receives the request will not coordinate it if the node is not in the top N of the requested key’s
preference list. Instead, that node will forward the request to the first among the top N nodes in
the preference list.
</p><p>
Read and write operations involve the first N healthy nodes in the preference list, skipping over
those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key’s preference
list are accessed. When there are node failures or network partitions, nodes that are lower ranked in
the preference list are accessed.
</p><p></p></td><tr id="section-47"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-47">¶</a></div><p>Riak makes use of the same values. But, thanks to our concept of buckets, we made it a bit more
customizable. The default R and W values are set at the bucket level but can be configured at
the request level if the developer deems it necessary for certain data.
"Quorum" as described in Dynamo is the default setting in Riak.
Some more resources on R and W:
<ul>
<li><a href ="http://wiki.basho.com/display/RIAK/REST+API">Riak's REST API</a></li>
<li><a href="http://wiki.basho.com/display/RIAK/An+Introduction+to+Riak#AnIntroductiontoRiak-ReadingData"> Reading and Writing Data</a></li>
</ul></p></td><td class="code"><p>To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in
quorum systems. This protocol has two key configurable values: R and W. R is the minimum number of nodes
that must participate in a successful read operation. W is the minimum number of nodes that must
participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like
system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W)
replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
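</p><p>
The quorum arithmetic is easy to check directly (a sketch; the function name is ours):
</p><pre><code>def is_quorum_like(n, r, w):
    """With R + W greater than N, any read set and write set for a key must
    overlap in at least one replica, so a read sees the latest successful
    write in the failure-free case."""
    return r + w > n

assert is_quorum_like(3, 2, 2)       # a common Dynamo configuration (Section 6)
assert not is_quorum_like(3, 1, 1)   # low latency, but reads can miss writes
</code></pre><p>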
</p><p></p></td><tr id="section-48"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-48">¶</a></div><p>Check on this</p></td><td class="code"><p>Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and
writes the new version locally. The coordinator then sends the new version (along with the new vector clock)
to the N highest-ranked reachable nodes. If at least W-1 nodes respond, the write is considered successful.
</p><p></p></td><tr id="section-49"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-49">¶</a></div><p>Check on this</p></td><td class="code"><p>Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the
N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses
before returning the result to the client. If the coordinator ends up gathering multiple versions of
the data, it returns all the versions it deems to be causally unrelated. The divergent versions are
then reconciled and the reconciled version superseding the current versions is written back.
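</p><p>
Both operations can be compressed into a sketch from the coordinator’s point of view. Here
replicas is a list of handles to the N highest-ranked reachable nodes with put/get methods;
concurrency, timeouts and retries are omitted, and all names are ours:
</p><pre><code>def descends(a, b):  # as in the earlier vector clock sketch
    return all(a.get(n, 0) >= c for n, c in b.items())

def coordinate_put(replicas, key, value, clock, w):
    """The coordinator's local write counts as the first ack; the write
    succeeds once W-1 of the other replicas have also acknowledged."""
    acks = 1                       # the coordinator's own local write
    for node in replicas:
        if acks >= w:
            break
        if node.put(key, value, clock):
            acks += 1
    return acks >= w

def coordinate_get(replicas, key, r):
    """Collect R node responses, then return every causally unrelated version."""
    versions, answered = [], 0
    for node in replicas:
        if answered >= r:
            break
        got = node.get(key)        # a list of (value, clock) leaves, or None
        if got is not None:
            versions.extend(got)
            answered += 1
    # keep only the leaves: versions no other returned clock strictly dominates
    return [(v, c) for v, c in versions
            if not any(c2 != c and descends(c2, c) for _, c2 in versions)]
</code></pre><p>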
</p><p></p></td><tr id="section-50"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-50">¶</a></div><p>Annotate this</p></td><td class="code"><p>
4.6 Handling Failures: Hinted Handoff
</p><p>
If Dynamo used a traditional quorum approach it would be unavailable during server failures and network
partitions, and would have reduced durability even under the simplest of failure conditions. To remedy
this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write
operations are performed on the first N healthy nodes from the preference list, which may not always be
the first N nodes encountered while walking the consistent hashing ring.
</p><p>
Consider the example of Dynamo configuration given in Figure 2 with N=3. In this example, if node
A is temporarily down or unreachable during a write operation then a replica that would normally
have lived on A will now be sent to node D. This is done to maintain the desired availability and
durability guarantees. The replica sent to D will have a hint in its metadata that suggests which
node was the intended recipient of the replica (in this case A). Nodes that receive hinted replicas
will keep them in a separate local database that is scanned periodically. Upon detecting that A has
recovered, D will attempt to deliver the replica to A. Once the transfer succeeds, D may delete the
object from its local store without decreasing the total number of replicas in the system.
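</p><p>
The bookkeeping behind hinted handoff fits in a small sketch: hinted replicas live in a side
table keyed by their intended owner, and a periodic scan attempts delivery. The class, the
reachable predicate and the node handles are our assumptions:
</p><pre><code>class HintedStore:
    """Replicas accepted on behalf of an unreachable node, kept in a
    separate local table until they can be handed back."""

    def __init__(self):
        self.hinted = {}    # intended node -> {key: value}

    def accept(self, intended_node, key, value):
        self.hinted.setdefault(intended_node, {})[key] = value

    def scan_and_deliver(self, reachable):
        """Called periodically; reachable(node) and node.put are assumed
        primitives of the surrounding system."""
        for node in list(self.hinted):
            if not reachable(node):
                continue
            pending = self.hinted[node]
            for key, value in list(pending.items()):
                if node.put(key, value):    # transfer succeeded
                    del pending[key]        # safe to drop the local copy
            if not pending:
                del self.hinted[node]
</code></pre><p>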
</p><p>
Using hinted handoff, Dynamo ensures that the read and write operations are not failed due to temporary
node or network failures. Applications that need the highest level of availability can set W to 1,
which ensures that a write is accepted as long as a single node in the system has durably written
the key to its local store. Thus, the write request is only rejected if all nodes in the system
are unavailable. However, in practice, most Amazon services in production set a higher W to meet
the desired level of durability. A more detailed discussion of configuring N, R and W follows in section 6.
</p><p>
It is imperative that a highly available storage system be capable of handling the failure of
entire data centers. Data center failures happen due to power outages, cooling failures, network
failures, and natural disasters. Dynamo is configured such that each object is replicated across
multiple data centers. In essence, the preference list of a key is constructed such that the storage
nodes are spread across multiple data centers. These datacenters are connected through high speed
network links. This scheme of replicating across multiple datacenters allows us to handle entire data
center failures without a data outage.
</p><p></p></td><tr id="section-51"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-51">¶</a></div><p>All of this</p></td><td class="code"><p>
4.7 Handling permanent failures: Replica synchronization
</p><p>
Hinted handoff works best if the system membership churn is low and node failures are transient.
There are scenarios under which hinted replicas become unavailable before they can be returned to
the original replica node. To handle this and other threats to durability, Dynamo implements an
anti-entropy (replica synchronization) protocol to keep the replicas synchronized.
</p><p>
To detect the inconsistencies between replicas faster and to minimize the amount of transferred data,
Dynamo uses Merkle trees [13]. A Merkle tree is a hash tree where leaves are hashes of the values of
individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal
advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring
nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the
amount of data that needs to be transferred while checking for inconsistencies among replicas. For
instance, if the hash values of the roots of two trees are equal, then the values of the leaf nodes
in the tree are equal and the nodes require no synchronization. If not, it implies that the values
of some replicas are different. In such cases, the nodes may exchange the hash values of children and
the process continues until it reaches the leaves of the trees, at which point the hosts can identify
the keys that are “out of sync”. Merkle trees minimize the amount of data that needs to be transferred
for synchronization and reduce the number of disk reads performed during the anti-entropy process.
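</p><p>
A toy version of that exchange, building a binary hash tree over a sorted key set and
descending only into branches whose hashes differ; it assumes both replicas host the same
key range, and every name in it is ours:
</p><pre><code>import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build(keys, store):
    """Leaves hash individual values; parents hash their children."""
    if len(keys) == 1:
        return {"hash": h(store[keys[0]]), "keys": keys}
    mid = len(keys) // 2
    left, right = build(keys[:mid], store), build(keys[mid:], store)
    return {"hash": h(left["hash"] + right["hash"]),
            "keys": keys, "left": left, "right": right}

def out_of_sync(a, b):
    """Return the keys whose values differ, exchanging only hashes on the way down."""
    if a["hash"] == b["hash"]:
        return []               # identical subtrees: no synchronization needed
    if "left" not in a:
        return a["keys"]        # a divergent leaf
    return out_of_sync(a["left"], b["left"]) + out_of_sync(a["right"], b["right"])

keys = ["k1", "k2", "k3", "k4"]
replica1 = {"k1": b"a", "k2": b"b", "k3": b"c", "k4": b"d"}
replica2 = {"k1": b"a", "k2": b"b", "k3": b"X", "k4": b"d"}
assert out_of_sync(build(keys, replica1), build(keys, replica2)) == ["k3"]
</code></pre><p>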
</p><p>
Dynamo uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree
for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare
whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of
the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the
tree traversal scheme described above the nodes determine if they have any differences and perform
the appropriate synchronization action. The disadvantage with this scheme is that many key ranges
change when a node joins or leaves the system thereby requiring the tree(s) to be recalculated.
This issue is addressed, however, by the refined partitioning scheme described in Section 6.2.
</p><p></p></td><tr id="section-52"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-52">¶</a></div><p>All of 4.8</p></td><td class="code"><p>
4.8 Membership and Failure Detection
</p><p>
4.8.1 Ring Membership
</p><p>
In Amazon’s environment node outages (due to failures and maintenance tasks) are often transient but may
last for extended intervals. A node outage rarely signifies a permanent departure and therefore should
not result in rebalancing of the partition assignment or repair of the unreachable replicas. Similarly,
manual error could result in the unintentional startup of new Dynamo nodes. For these reasons, it was
deemed appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a
Dynamo ring. An administrator uses a command line tool or a browser to connect to a Dynamo node and
issue a membership change to join a node to a ring or remove a node from a ring. The node that serves
the request writes the membership change and its time of issue to persistent store. The membership
changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol
propagates membership changes and maintains an eventually consistent view of membership. Each node
contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted
membership change histories.
</p><p>
When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent
hash space) and maps nodes to their respective token sets. The mapping is persisted on disk and initially
contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled
during the same communication exchange that reconciles the membership change histories. Therefore,
partitioning and placement information also propagates via the gossip-based protocol and each storage
node is aware of the token ranges handled by its peers. This allows each node to forward a key’s read/write
operations to the right set of nodes directly.
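</p><p>
One way to picture the reconciliation step: each node persists a timestamped history of
membership changes and, once a second, merges histories with a randomly chosen peer. The
data layout below is our assumption, not Dynamo’s wire format:
</p><pre><code>import random

class Membership:
    def __init__(self):
        self.history = {}    # node -> (timestamp, "join" or "leave")

    def record(self, node, ts, change):
        """Keep the most recent change seen for each node."""
        if node not in self.history or ts > self.history[node][0]:
            self.history[node] = (ts, change)

    def gossip_with(self, peer):
        """Exchange histories in both directions; repeated rounds drive
        every view toward the same eventually consistent membership."""
        for node, (ts, change) in list(peer.history.items()):
            self.record(node, ts, change)
        for node, (ts, change) in list(self.history.items()):
            peer.record(node, ts, change)

def gossip_round(nodes):
    """The once-a-second step: every node reconciles with a random peer."""
    for member in nodes:
        member.gossip_with(random.choice(nodes))
</code></pre><p>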
</p><p>
4.8.2 External Discovery
</p><p>
The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For
example, the administrator could contact node A to join A to the ring, then contact node B to join
B to the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet
neither would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play
the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all
nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are
highly unlikely. Seeds can be obtained either from static configuration or from a configuration
service. Typically seeds are fully functional nodes in the Dynamo ring.
</p><p>
4.8.3 Failure Detection
</p><p>
Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during
get() and put() operations and when transferring partitions and hinted replicas. For the purpose of
avoiding failed attempts at communication, a purely local notion of failure detection is entirely
sufficient: node A may consider node B failed if node B does not respond to node A’s messages (even
if B is responsive to node C’s messages). In the presence of a steady rate of client requests generating
inter-node communication in the Dynamo ring, a node A quickly discovers that a node B is unresponsive
when B fails to respond to a message; node A then uses alternate nodes to service requests that map to
B's partitions; A periodically retries B to check for the latter's recovery. In the absence of client
requests to drive traffic between two nodes, neither node really needs to know whether the other is
reachable and responsive.
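</p><p>
That purely local notion is little more than a timeout table with periodic retries; a sketch,
with arbitrary timing values:
</p><pre><code>import time

class LocalFailureDetector:
    """Node A's private view of which peers answer A's messages. No global
    agreement is needed: B may be down for A while still up for C."""

    def __init__(self, retry_interval=5.0):
        self.down_since = {}    # peer -> time we last gave up on it
        self.retry_interval = retry_interval

    def mark_unresponsive(self, peer):
        self.down_since[peer] = time.time()

    def mark_responsive(self, peer):
        self.down_since.pop(peer, None)

    def should_try(self, peer):
        """Use the peer normally, or probe it again once the retry window passes."""
        failed_at = self.down_since.get(peer)
        return failed_at is None or time.time() - failed_at >= self.retry_interval
</code></pre><p>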
</p><p>
Decentralized failure detection protocols use a simple gossip-style protocol that enables each node
in the system to learn about the arrival (or departure) of other nodes. For detailed information on
decentralized failure detectors and the parameters affecting their accuracy, the interested reader
is referred to [8]. Early designs of Dynamo used a decentralized failure detector to maintain a globally
consistent view of failure state. Later it was determined that the explicit node join and leave methods
obviate the need for a global view of failure state. This is because nodes are notified of permanent
node additions and removals by the explicit node join and leave methods and temporary node failures are
detected by the individual nodes when they fail to communicate with others (while forwarding requests).
</p><p></p></td><tr id="section-53"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-53">¶</a></div><p>Annoate all of 4.9</p></td><td class="code"><p>
4.9 Adding/Removing Storage Nodes
</p><p>
When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly
scattered on the ring. For every key range that is assigned to node X, there may be a number of nodes
(less than or equal to N) that are currently in charge of handling keys that fall within its token range.
Due to the allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and
these nodes transfer those keys to X. Let us consider a simple bootstrapping scenario where node X is
added to the ring shown in Figure 2 between A and B. When X is added to the system, it is in charge of
storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer have
to store the keys in these respective ranges. Therefore, nodes B, C, and D will offer the appropriate
sets of keys to X and, upon confirmation, transfer them. When a node is removed from the system, the reallocation of
keys happens in a reverse process.
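</p><p>
The bootstrapping example can be checked with a few lines over a toy ring. With N=3, a node
replicates the three ranges ending at itself and its two predecessors’ tokens; the helper
below is our construction, with Figure 2’s nodes as single-letter tokens:
</p><pre><code>def ranges_for(node, ring, n=3):
    """The key ranges a node replicates: the n ranges ending at the tokens
    from its (n-1)th predecessor up to itself, walking clockwise."""
    i = ring.index(node)
    return [(ring[(i - k - 1) % len(ring)], ring[(i - k) % len(ring)])
            for k in reversed(range(n))]

ring = ["A", "B", "C", "D", "E", "F", "G"]              # Figure 2, clockwise
ring_with_x = ["A", "X", "B", "C", "D", "E", "F", "G"]  # X joins between A and B

# X takes over (F, G], (G, A] and (A, X], exactly as in the example above;
# B, C and D shed those ranges (D's old (A, B] is split, X taking (A, X]).
assert ranges_for("X", ring_with_x) == [("F", "G"), ("G", "A"), ("A", "X")]
</code></pre><p>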
</p><p>
Operational experience has shown that this approach distributes the load of key distribution uniformly
across the storage nodes, which is important to meet the latency requirements and to ensure fast
bootstrapping. Finally, by adding a confirmation round between the source and the destination, it
is made sure that the destination node does not receive any duplicate transfers for a given key range.
</p><p>
5. Implementation
</p><p>
In Dynamo, each storage node has three main software components: request coordination, membership and
failure detection, and a local persistence engine. All these components are implemented in Java.
</p><p></p></td><tr id="section-54"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-54">¶</a></div><p>We use Bitcask as default and also offer x, y and z.
Why do we use these? pros, cons</p></td><td class="code"><p>Dynamo’s local persistence component allows for different storage engines to be plugged in.
Engines that are in use are Berkeley Database (BDB) Transactional Data Store, BDB Java Edition,
MySQL, and an in-memory buffer with persistent backing store. The main reason for designing a
pluggable persistence component is to choose the storage engine best suited for an application’s
access patterns. For instance, BDB can handle objects typically in the order of tens of kilobytes
whereas MySQL can handle objects of larger sizes. Applications choose Dynamo’s local persistence
engine based on their object size distribution. The majority of Dynamo’s production instances use
BDB Transactional Data Store.
</p><p>
The request coordination component is built on top of an event-driven messaging substrate where the
message processing pipeline is split into multiple stages similar to the SEDA architecture [24]. All
communications are implemented using Java NIO channels. The coordinator executes the read and write
requests on behalf of clients by collecting data from one or more nodes (in the case of reads) or
storing data at one or more nodes (for writes). Each client request results in the creation of a
state machine on the node that received the client request. The state machine contains all the
logic for identifying the nodes responsible for a key, sending the requests, waiting for responses,
potentially doing retries, processing the replies and packaging the response to the client. Each
state machine instance handles exactly one client request. For instance, a read operation implements
the following state machine: (i) send read requests to the nodes, (ii) wait for minimum number of
required responses, (iii) if too few replies were received within a given time bound, fail the
request, (iv) otherwise gather all the data versions and determine the ones to be returned and
(v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context
that contains the vector clock that subsumes all the remaining versions. For the sake of brevity the
failure handling and retry states are left out.
</p><p>
After the read response has been returned to the caller the state machine waits for a small period of
time to receive any outstanding responses. If stale versions were returned in any of the responses,
the coordinator updates those nodes with the latest version. This process is called read repair because
it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy
protocol from having to do it.
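</p><p>
Read repair amounts to one more pass over the replies after the client has been answered;
a sketch, reusing the descends comparison from the earlier vector clock sketch, with our names:
</p><pre><code>def read_repair(responses, winner_value, winner_clock):
    """responses maps each node that replied to the (value, clock) it
    returned; push the winning version to any replica it supersedes."""
    def descends(a, b):  # as in the earlier vector clock sketch
        return all(a.get(n, 0) >= c for n, c in b.items())

    for node, (_, clock) in responses.items():
        if clock != winner_clock and descends(winner_clock, clock):
            node.put(winner_value, winner_clock)    # opportunistic repair
</code></pre><p>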
</p><p>
As noted earlier, write requests are coordinated by one of the top N nodes in the preference list.
Although it is desirable always to have the first node among the top N to coordinate the writes thereby
serializing all writes at a single location, this approach has led to uneven load distribution resulting
in SLA violations. This is because the request load is not uniformly distributed across objects. To counter
this, any of the top N nodes in the preference list is allowed to coordinate the writes. In particular,
since each write usually follows a read operation, the coordinator for a write is chosen to be the node
that replied fastest to the previous read operation, which is stored in the context information of the
request. This optimization enables us to pick the node that has the data that was read by the preceding
read operation thereby increasing the chances of getting “read-your-writes” consistency. It also reduces
variability in the performance of the request handling which improves the performance at the 99.9 percentile.
</p><p>
6. Experiences & Lessons Learned
</p><p>
Dynamo is used by several services with different configurations. These instances differ by their
version reconciliation logic, and read/write quorum characteristics. The following are the main
patterns in which Dynamo is used:
</p><p></p></td><tr id="section-55"><td class="docs"><div class="pilwrap"><a class="pilcrow" href="#section-55">¶</a></div><p>This paragraph</p></td><td class="code"><p>
Business logic specific reconciliation: This is a popular use case for Dynamo. Each data object is
replicated across multiple nodes. In case of divergent versions, the client application performs its
own reconciliation logic. The shopping cart service discussed earlier is a prime example of this
category. Its business logic reconciles objects by merging different versions of a customer’s
shopping cart.
</p><p>
Timestamp based reconciliation: This case differs from the previous one only in the reconciliation
mechanism. In case of divergent versions, Dynamo performs simple timestamp based reconciliation
logic of “last write wins”; i.e., the object with the largest physical timestamp value is chosen
as the correct version. The service that maintains customers’ session information is a good example
of a service that uses this mode.
</p><p>
High performance read engine: While Dynamo is built to be an “always writeable” data store, a few
services are tuning its quorum characteristics and using it as a high performance read engine.
Typically, these services have a high read request rate and only a small number of updates.
In this configuration, typically R is set to be 1 and W to be N. For these services, Dynamo
provides the ability to partition and replicate their data across multiple nodes thereby offering
incremental scalability. Some of these instances function as the authoritative persistence cache
for data stored in more heavy weight backing stores. Services that maintain product catalog and
promotional items fit in this category.
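</p><p>
The first two patterns above differ only in the reconciliation function, and both fit in a
few lines; the cart layout here is our illustration, not the production service’s schema:
</p><pre><code>def last_write_wins(versions):
    """Timestamp-based reconciliation: keep the version with the largest
    physical timestamp and discard the rest."""
    return max(versions, key=lambda v: v["timestamp"])

def merge_carts(versions):
    """Business-logic reconciliation for the cart: the union of items, so
    an "add to cart" is never lost (deleted items may resurface)."""
    merged = set()
    for v in versions:
        merged.update(v["items"])
    return {"items": sorted(merged)}

carts = [{"items": ["book", "dvd"], "timestamp": 1},
         {"items": ["book", "toaster"], "timestamp": 2}]
assert merge_carts(carts) == {"items": ["book", "dvd", "toaster"]}
assert last_write_wins(carts)["items"] == ["book", "toaster"]
</code></pre><p>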
</p><p>
The main advantage of Dynamo is that its client applications can tune the values of N, R and W to
achieve their desired levels of performance, availability and durability. For instance, the value
of N determines the durability of each object. A typical value of N used by Dynamo’s users is 3.
</p><p>
The values of W and R impact object availability, durability and consistency. For instance, if W is set
to 1, then the system will never reject a write request as long as there is at least one node in the
system that can successfully process a write request. However, low values of W and R can increase the
risk of inconsistency as write requests are deemed successful and returned to the clients even if they
are not processed by a majority of the replicas. This also introduces a vulnerability window for
durability when a write request is successfully returned to the client even though it has been persisted
at only a small number of nodes.
</p><p>
Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not
necessarily true here. For instance, the vulnerability window for durability can be decreased by
increasing W. This may increase the probability of rejecting requests (thereby decreasing
availability) because more storage hosts need to be alive to process a write request.
</p><p>
The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2). These values are
chosen to meet the necessary levels of performance, durability, consistency, and availability SLAs.
</p><p>
All the measurements presented in this section were taken on a live system operating with a configuration
of (3,2,2) and running a couple hundred nodes with homogeneous hardware configurations. As mentioned earlier,
each instance of Dynamo contains nodes that are located in multiple datacenters. These datacenters are
typically connected through high speed network links. Recall that to generate a successful get (or put)
response R (or W) nodes need to respond to the coordinator. Clearly, the network latencies between
datacenters affect the response time and the nodes (and their datacenter locations) are chosen such
that the applications’ target SLAs are met.
</p><p>
6.1 Balancing Performance and Durability
</p><p>
While Dynamo’s principal design goal is to build a highly available data store, performance is an
equally important criterion in Amazon’s platform. As noted earlier, to provide a consistent customer
experience, Amazon’s services set their performance targets at higher percentiles (such as the 99.9th
or 99.99th percentiles). A typical SLA required of services that use Dynamo is that 99.9% of the
read and write requests execute within 300ms.
</p><p>
Since Dynamo is run on standard commodity hardware components that have far less I/O throughput
than high-end enterprise servers, providing consistently high performance for read and write operations
is a non-trivial task. The involvement of multiple storage nodes in read and write operations makes
it even more challenging, since the performance of these operations is limited by the slowest of the
R or W replicas. Figure 4 shows the average and 99.9th percentile latencies of Dynamo’s read and write
operations during a period of 30 days. As seen in the figure, the latencies exhibit a clear diurnal
pattern which is a result of the diurnal pattern in the incoming request rate (i.e., there is a
significant difference in request rate between the daytime and night). Moreover, the write latencies
are higher than read latencies because write operations always result in disk access.
Also, the 99.9th percentile latencies are around 200 ms and are an order of magnitude higher than
the averages. This is because the 99.9th percentile latencies are affected by several factors such
as variability in request load, object sizes, and locality patterns.
</p><p>
Figure 4: Average and 99.9 percentiles of latencies for read and write requests during our peak
request season of December 2006. The intervals between consecutive ticks in the x-axis correspond
to 12 hours. Latencies follow a diurnal pattern similar to the request rate and 99.9 percentile
latencies are an order of magnitude higher than averages.
</p><p>
While this level of performance is acceptable for a number of services, a few customer-facing
services required higher levels of performance. For these services, Dynamo provides the ability
to trade-off durability guarantees for performance. In the optimization each storage node maintains
an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically