%% This file is a portion of the source for Revised Edition 1.1 of
%% Operating Systems and Middleware: Supporting Controlled
%% Interaction, Copyright 2011 by Max Hailperin. This work is
%% licensed under the Creative Commons Attribution-ShareAlike 3.0
%% Unported License. To view a copy of this license, visit
%% http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to
%% Creative Commons, 171 Second Street, Suite 300, San Francisco,
%% California, 94105, USA.
\chapter{Atomic Transactions}
\label{transactions-chapter}
\section{Introduction}
In Chapter~\ref{synchronization-chapter}, I described mutual exclusion as a mechanism
for ensuring that an object undergoes a sequence of
invariant-preserving transformations and hence is left in a state
where the invariant holds. (Such states are called
\vocab{consistent} states.) In particular, this was the idea behind
monitors. Any monitor object is constructed in a consistent state.
Any public operation on the monitor object will work correctly when
invoked in a consistent state and will reestablish the invariant
before returning. No interleaving of actions from different monitor
operations is allowed, so the monitor's state advances from one
consistent state to the next.

In this chapter, I will continue on the same theme of
invariant-preserving state transformations. This time through,
though, I will address two issues I ignored in
Chapter~\ref{synchronization-chapter}:
\begin{enumerate}
\item
Some invariants span multiple objects; rather than transforming a
single object from a consistent state to another consistent state, you may
need to transform a whole system of objects from one consistent state
to the next. For example, suppose you use objects to form a rooted
tree, with each object knowing its parent and its children, as shown
in Figure~\ref{scan-5-1}.
\begin{figure}
\centerline{\includegraphics{hail_f0501}}
%\centerline{\epsfbox{scan-5-1.eps}}
\caption{Rooted trees with pointers to children and parents: (a)
example satisfying the invariant; (b) invariant violated because
E's parent is now C, but E is still a child of D and not of C;
(c) invariant restored because the only child pointer leading to E
again agrees with E's parent pointer. The complete transformation
from Part (a) to Part (c) requires modifications to nodes C, D, and E.}
\label{scan-5-1}
\end{figure}
An
invariant is that X has Y as a child if and only if Y has X as its
parent. An operation to move a node to a new position in the tree
would need to change three objects (the node, the old parent, and the
new parent) in order to preserve the invariant.
\item
Under exceptional circumstances an operation may \vocab{fail}, that is,
be forced to give up after doing only part of its invariant-preserving
transformation. For example, some necessary resource may be
unavailable, the user may press a Cancel button, the input may fail a
validity check, or a hardware failure may occur. Nonetheless, the
system should be left in a consistent state.
\end{enumerate}
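The tree invariant in item 1 can be made concrete with a short sketch. The following Python fragment (with illustrative class and function names, not drawn from the text) shows why moving a node is inherently a three-object update: stopping after only one or two of the assignments in \texttt{move} leaves the invariant violated, as in part (b) of the figure.

```python
# Sketch of the tree invariant from item 1: X has Y as a child if and
# only if Y's parent is X.  Moving a node must update three objects.
# Class and function names here are illustrative only.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

def move(node, new_parent):
    # All three updates must happen as one unit; a stop partway
    # through leaves the invariant violated.
    node.parent.children.remove(node)   # update the old parent
    new_parent.children.append(node)    # update the new parent
    node.parent = new_parent            # update the node itself

def invariant_holds(nodes):
    return all(child.parent is n
               for n in nodes for child in n.children)

root, c, d, e = Node("root"), Node("C"), Node("D"), Node("E")
for child, parent in [(c, root), (d, root), (e, d)]:
    child.parent = parent
    parent.children.append(child)

move(e, c)  # E moves from under D to under C
assert invariant_holds([root, c, d, e])
```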
An \foldvocab{atomic}{transaction} is an operation that takes a system
from an observable initial state to an observable final state, without
any intermediate states being observable or perturbable by other
atomic transactions. If a system starts with a consistent initial
state and modifies that state using only invariant-preserving
atomic transactions, the state will remain consistent.

Atomicity\index{atomic} must be preserved in the face of both
concurrency and failures. That is, no transaction may interact with a
concurrently running transaction nor may any transaction see an
intermediate state left behind by a failed transaction. The former
requirement is known as \vocab{isolation}. The latter requirement
lacks a generally agreed-upon name; I will call it \vocab{failure atomicity}.

Often, atomic transactions are simply called \vocabs{transaction}. In
fact, according to many authors, atomicity is part of the definition
of a transaction. Unfortunately, there are other authors for whom
transactions need not be atomic. Because of this lack of agreement on
the nomenclature, I have introduced this chapter with the full phrase
``atomic transactions'' to make my focus clear. Henceforth, I will
skip the modifier ``atomic'' and use only ``transactions,'' with the
understanding that they are atomic unless otherwise specified.

Many transaction systems require not only atomicity, but also
\vocabindex{durability}{durable}. A transaction is durable if the
state of a successfully completed transaction remains intact, even if
the system crashes afterward and has to be rebooted. Each successful
transaction ends with an explicit \vocab{commit} action, which
signifies that the consistent final state has been established and
should be made visible to other transactions. With durable
transactions, if the system crashes after the commit action, the final
transformed state will be intact after system restart. If the crash
occurs before the commit action, the system will be back in the
initial, unchanged state after restart.

Note that failure atomicity is slightly simpler for nondurable
transactions. Atomicity across system crashes and restarts is easy to
arrange: by clearing all memory on restart, you can guarantee that no
partially updated state is visible after the restart---no updates at
all, partial or otherwise, will remain. This clearing of memory will
happen automatically if the computer's main semiconductor DRAM memory
is used, because that memory is \vocab{volatile}, that is, it does not
survive reboots. (Strictly speaking, volatility means the memory does
not survive a loss of power; reboots with the power left on generally
clear volatile memory as well, however.)

Even nondurable transactions must ensure failure atomicity
for less dramatic failures in which the system is not rebooted. For
example, a transaction might do some updates, then discover invalid
input and respond by bailing out.
To take another example, recovering from a
detected deadlock might entail aborting one of the deadlocked
transactions. Both situations can be handled using an
explicit \vocab{abort} action, which indicates the transaction should
be terminated with no visible change made to the state. Any changes
already made must be concealed, by undoing them.
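One simple way to conceal a failed transaction's updates is an undo log: before each write, the old value is recorded, and abort restores the recorded values in reverse order. The following Python fragment is a minimal sketch of the idea (the names and the dictionary-based storage are illustrative only; real systems are considerably more elaborate).

```python
# Sketch: concealing a transaction's changes on abort by undoing them.
# Before each update, the old value is recorded; abort replays the
# records in reverse order.  Illustrative, not a real API.

class Transaction:
    def __init__(self):
        self.undo_log = []  # list of (object, key, old_value)

    def write(self, obj, key, value):
        self.undo_log.append((obj, key, obj[key]))
        obj[key] = value

    def abort(self):
        # Undo in reverse order, so the earliest values win.
        for obj, key, old in reversed(self.undo_log):
            obj[key] = old
        self.undo_log.clear()

account = {"balance": 750}
tx = Transaction()
tx.write(account, "balance", 650)   # debit $100...
tx.abort()                          # ...then bail out
assert account["balance"] == 750    # no visible change remains
```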
In 1983, \index{Harder, Theo@H{\"a}rder, Theo}H{\"a}rder and
\index{Reuter, Andreas}Reuter coined a catchy phrase by saying that
whether a system supports transactions is ``the \index{ACID}ACID test of the
system's quality.'' The ACID acronym indicates that transactions are
\vocab{atomic}, \vocab{consistent}, \vocab{isolated}, and
\vocab{durable}. This acronym is quite popular, but somewhat
redundant. As you have seen, a transaction system really only provides
two properties: atomicity and durability. Consistency is a property
of system states---a state is consistent if the invariants hold.
Transactions that are written correctly (so each preserves invariants)
will leave the state consistent if they execute atomically. Isolation
simply is another name for atomicity in the face of concurrency:
concurrent transactions must not interact.

The properties of atomicity and durability refer to transactions,
independent of the objects on which the transactions operate.
Returning to the earlier rooted tree example of moving a node to a new
position, a transaction might modify the node, the old parent, and the
new parent, all within one atomic unit. This stands in contrast
to monitors, each of which controls a single
object.

To obtain the requisite atomicity with monitors, the whole tree could
be a single monitor object, instead of having one monitor per node. The tree
monitor would
have an operation to move one of its nodes. In general, this approach
is difficult to reconcile with modularity. Moreover, lumping lots of
data into one monitor creates a performance problem. Making the
whole system (or a large chunk of it) into one monitor would prevent
any concurrency. Yet it ought to be possible to concurrently move two
nodes in different parts of a tree. Atomic transactions allow
concurrency of this sort while still protecting the entire
transformation of the system's state.

This point is worth emphasizing. Although the system's state remains
consistent \emph{as though} only one transaction were executed at a
time, transactions in fact execute concurrently, for performance
reasons. The transaction system is responsible for maintaining
atomicity in the face of concurrency. That is, it must ensure that
transactions don't interact with one another, even when running
concurrently. Often the system will achieve this isolation by
ensuring that no transaction reads from any data object being
modified by another transaction. Enforcing this restriction entails introducing
synchronization that limits, but does not completely eliminate, the
concurrency.

In Section~\ref{trasactions-examples-section}, I will sketch several examples of the ways in
which transactions are used by middleware and operating systems to
support application programs. Thereafter, I present techniques used
to make transactions work, divided into three sections. First, Section~\ref{atomicity-mechanisms-section}
explains basic techniques for ensuring the atomicity of transactions,
without addressing durability. Second, Section~\ref{wal} explains how
the mechanism used to ensure failure atomicity can be extended to also
support durability. Third,
Section~\ref{additional-transaction-mechanisms-section} explains a few additional mechanisms
to provide increased concurrency and
coordinate multiple participants cooperating on a single transaction.
Finally, Section~\ref{security-and-transactions-section} is devoted to security issues.
The chapter concludes with exercises, exploration and programming projects, and notes.
\section{Example Applications of Transactions}\label{trasactions-examples-section}
The transaction concept is much more pervasive in middleware than in
operating systems. Therefore, of the three examples presented in the
following subsections, the first two are from middleware systems.
Sections \ref{db-transaction-section} and
\ref{transactions-message-queuing-systems-section} explain the two
most long-standing middleware applications, namely database systems
and message-queuing systems.

Moving into the operating systems arena, Section~\ref{jfs-transactions-section}
explains the role that transactions play in
journaled file systems, which are the current dominant form of file
system.
\subsection{Database Systems}\label{db-transaction-section}
The transaction concept is most strongly rooted in \vocabs{database
system}; for decades, every serious database system has provided
transactions as a service to application programmers. Database
systems are an extremely important form of middleware, used in almost
every enterprise information system. Like all middleware,
database systems are built on top of operating system services, rather
than raw hardware, while providing general-purpose services
to application software. Some of those services are synchronization
services: just as an operating system provides mutexes, a database
system provides transactions.

On the other hand, transaction services are not the central, defining
mission of a database system. Instead, database systems are primarily
concerned with providing persistent data storage and convenient means
for accessing the stored data. Nonetheless, my goal in this chapter is to show how transactions fit
into relational database systems. I will cover just enough of the SQL
language used by such systems to enable you to try out the example on
a real system. In particular, I show the example using the Oracle
database system.

Relational database systems manipulate tables of data. In Chapter~\ref{synchronization-chapter}'s
discussion of deadlock detection, I showed a
simple example from the Oracle database system involving two
accounts with account numbers 1 and 2. The scenario
(as shown in Figure~\ref{oracle-deadlock} on
page~\pageref{oracle-deadlock}) involved transferring money from each
account to the other, by updating the balance of each account. Thus,
that example involved a table called \verb|accounts| with two columns,
\verb|account_number| and \verb|balance|. That table can be created
with the SQL command shown here:
\begin{verbatim}
create table accounts (
account_number int primary key,
balance int);
\end{verbatim}
Similarly, you can initialize account~1 to \$750 and account~2 to
\$2250 by using the following commands:
\begin{verbatim}
insert into accounts values (1, 750);
insert into accounts values (2, 2250);
\end{verbatim}
At this point, you can look at the table with the \verb|select| command:
\begin{verbatim}
select * from accounts;
\end{verbatim}
and get the following reply:
\begin{verbatim}
ACCOUNT_NUMBER BALANCE
-------------- ----------
1 750
2 2250
\end{verbatim}
(If you are using a relational database other than Oracle, the format
of the table may be slightly different. Of course, other aspects of
the example may differ as well, particularly the deadlock detection
response.)

At this point, to replicate the deadlock detection example from
Figure~\ref{oracle-deadlock}, you will need to open up two different
sessions connected to the database, each in its own window. In the
first session, you can debit \$100 from account~1, and in the second
session you can debit \$250 from account~2. (See
page~\pageref{oracle-deadlock} for the specific SQL commands.) Now
in session one, try to credit the \$100 into account~2; this is blocked,
because the other session has locked account~2. Similarly, session two
is blocked trying to credit its \$250 into account~1, creating a
deadlock, as illustrated in Figure~\ref{scan-5-2}.
\begin{figure}
\centerline{\includegraphics{hail_f0502}}
%\centerline{\epsfysize=5in\epsfbox{scan-5-2.eps}}
\caption{Two transfer transactions deadlock when each waits for
exclusive access to the account for which the other already has
obtained exclusive access. In this diagram, the vertical dimension
represents the passage of time.}
\label{scan-5-2}
\end{figure}

As you saw, Oracle detects the deadlock and chooses to cause
session one's update request to fail.
Having made it through all this prerequisite setup, you are in a
position to see the role that transactions play in situations such as
this. Each of the two sessions is processing its own transaction.
Recall that session one has already debited \$100 from account~1 but
finds itself unable to credit the \$100 into account~2. The
transaction cannot make forward progress, but on the other hand, you
don't want it to just stop dead in its tracks either. Stopping would block the
progress of session two's transaction. Session one also cannot just bail
out without any cleanup: it has already debited \$100 from account~1.
Debiting the source account without crediting the destination account
would violate atomicity and make customers angry besides.

Therefore, session one needs to abort its transaction, using the
\verb|rollback| command. Aborting will back out of the transaction's earlier debiting
of \$100 from account~1 and release its lock on that account. As a
result, session two's attempt to credit \$250 into account~1 can
finally stop hanging and complete. Continuing my earlier tradition
of showing session one at the left margin and session two indented
four spaces, the interaction would look like:
\begin{verbatim}
SQL> rollback;
Rollback complete.
1 row updated.
\end{verbatim}
Of course, whoever was trying to transfer \$100 from account~1 to
account~2 still wants to do so. Therefore, after aborting that
transaction, you should retry it:
\begin{verbatim}
SQL> update accounts set balance = balance - 100
where account_number = 1;
\end{verbatim}
This command will hang, because session two's transaction now has both
accounts locked. However, that transaction has nothing more it needs
to do, so it can commit, allowing session one to continue with its retry:
\begin{verbatim}
SQL> commit;
Commit complete.
1 row updated.
SQL> update accounts set balance = balance + 100
where account_number = 2;
1 row updated.
SQL> commit;
Commit complete.
SQL> select * from accounts;
ACCOUNT_NUMBER BALANCE
-------------- ----------
1 900
2 2100
\end{verbatim}
Notice that at the end, the two accounts have been updated correctly.
For example, account~1 does not look as though \$100 was debited from
it twice---the debiting done in the aborted transaction was wiped
away. Figure~\ref{scan-5-3} illustrates how the transactions recover
from the deadlock.
\begin{figure}
\centerline{\includegraphics{hail_f0503}}
%\centerline{\epsfysize=7in\epsfbox{scan-5-3.eps}}
\caption{Transactions recover from their deadlock when one rolls back,
releasing the lock it holds. As in the prior figure, the vertical
dimension represents the passage of time.}
\label{scan-5-3}
\end{figure}

In a large system with many accounts, there may be many concurrent
transfer transactions on different pairs of accounts. Only
rarely will a deadlock situation such as the preceding example arise.
However, it is nice to know that database systems have a clean way of
dealing with them. Any transaction can be aborted, due to deadlock
detection or any other reason, and retried later. Moreover, concurrent
transactions will never create incorrect results due to races; that was why
the database system locked the accounts, causing the temporary hanging
(and in one case, the deadlock) that you observed.
\subsection{Message-Queuing Systems}\label{transactions-message-queuing-systems-section}
\vocabindex{Message-queuing systems}{message-queuing system} form
another important class of middleware, and like database systems, they
support the transaction concept.
Developers of large-scale enterprise information systems normally use both forms of
middleware, although message-queuing systems are more avoidable than
database systems.

As with database systems, the
primary mission of message queuing is not the support of
transactions. Instead, message-queuing systems specialize in the
provision of communication services. As such, I will discuss them further
in Chapter~\ref{distmid-chapter}, as part of a discussion
of the broader family of middleware to which they belong:
\vocabs{messaging system} or
\foldvocab{message-oriented}{middleware} (\vocab{MOM}).
A straightforward application of messaging consists of a server
accessed through a request queue and a response queue.
As shown in Figure~\ref{scan-5-4}, the server
\begin{figure}
\centerline{\includegraphics{hail_f0504}}
%\centerline{\epsfbox{scan-5-4.eps}}
\caption{An analogy: (a) a server dequeues a message from its request
queue, processes the request, and enqueues a message into the response queue; (b) an
office worker takes paper from the In basket, processes the
paperwork, and puts it into the Out basket.}
\label{scan-5-4}
\end{figure}
dequeues a request message from the request queue, carries out the
required processing, and enqueues a response message into the response
queue. (Think about an office worker whose desk has two baskets,
labeled ``in'' and ``out,'' and who takes paper from one, processes it, and
puts it in the other.)

These three steps (dequeue, process, enqueue) are grouped together as
an atomic transaction. If any of the three steps fail, the request
message is left in the input queue, ready to be retried. No request
will be lost, nor will there ever be visible signs of repeated
processing, such as duplicated response messages. (Of course, some
causes of failure will affect retries as well. For that reason,
realistic systems generally keep count of retries and after a while
divert the request message, for example, into a human troubleshooter's
request queue.)
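A minimal single-process sketch of this pattern, in Python with a plain \texttt{deque} standing in for the message-queuing system (the names, the retry policy, and the \texttt{MAX\_RETRIES} limit are all illustrative, not taken from any real product):

```python
# Sketch: dequeue, process, enqueue as one atomic unit, with a retry
# limit that diverts hopeless requests to a troubleshooter's queue.

from collections import deque

MAX_RETRIES = 3

def serve_one(requests, responses, dead_letter, handler):
    request, retries = requests.popleft()
    try:
        response = handler(request)
    except Exception:
        # Abort: the dequeue is undone, so the request is not lost.
        if retries + 1 >= MAX_RETRIES:
            dead_letter.append(request)  # divert after repeated failures
        else:
            requests.appendleft((request, retries + 1))
        return
    responses.append(response)  # commit: dequeue and response together

requests = deque([("bad", 0), ("ok", 0)])
responses, dead_letter = deque(), deque()

def handler(req):
    if req == "bad":
        raise RuntimeError("processing failed")
    return req.upper()

while requests:
    serve_one(requests, responses, dead_letter, handler)

assert list(responses) == ["OK"]
assert list(dead_letter) == ["bad"]
```

Note that no response is ever duplicated: the response is enqueued only in the commit case, in the same step that makes the dequeue permanent.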
Message-queuing systems also provide durability, so that even if the system crashes and restarts,
each request will generate exactly one response.
In most systems, applications can opt out of durability in order to reduce persistent storage traffic and thereby
obtain higher performance.

To provide greater concurrency, a system may have several servers
dequeuing from the same request queue, as shown in
Figure~\ref{scan-5-5}.
\begin{figure}
\centerline{\includegraphics{hail_f0505}}
%\centerline{\epsfbox{scan-5-5.eps}}
\caption{Several message-driven servers in parallel can dequeue from a
common request queue and enqueue into a common response queue. To
allow concurrent operation, messages need not be provided in strict
first-in, first-out order.}
\label{scan-5-5}
\end{figure}

This configuration has an interesting
interaction with atomicity. If the dequeue action is interpreted
strictly as taking the message at the head of the queue, then you have
to wait for the first transaction to commit or abort before you can
know which message the second transaction should dequeue. (If the
first transaction aborts, the message it tried to dequeue is still at
the head of the queue and should be taken by the second
transaction.) This would prevent any concurrency. Therefore,
message-queuing systems generally relax queue ordering a little,
allowing the second message to be dequeued even before the fate of the
first message is known. In effect, the first message is provisionally
removed from the queue and so is out of the way of the second
message. If the transaction handling the first
message aborts, the first message is returned to the head of the queue, even though the
second message was already dequeued.
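The provisional-removal idea can be sketched as follows (Python, with a toy \texttt{TransactionalQueue}; the class and its interface are illustrative, not modeled on any particular product):

```python
# Sketch: relaxed queue ordering for concurrent consumers.  A dequeue
# provisionally removes a message; abort puts it back at the head,
# ahead of messages that were dequeued after it.

from collections import deque

class TransactionalQueue:
    def __init__(self, items=()):
        self._queue = deque(items)

    def dequeue(self):
        # Provisional removal: a second consumer can immediately take
        # the next message without waiting to learn this one's fate.
        return self._queue.popleft()

    def commit(self, message):
        pass  # the removal becomes permanent; nothing to do here

    def abort(self, message):
        # The message returns to the head, restoring its priority.
        self._queue.appendleft(message)

q = TransactionalQueue(["first", "second"])
m1 = q.dequeue()
m2 = q.dequeue()             # dequeued before m1's fate is known
q.abort(m1)                  # m1's transaction fails...
assert q.dequeue() == "first"  # ...so "first" is back at the head
```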
More advanced \vocab{workflow} systems may include
several processing steps, with each processing step connected to the
next by an intermediate message queue. In these systems, each
processing stage is treated as a separate transaction. If the
transaction commits, that stage's input is gone from its inbound
queue, and its output is in the outbound queue. Seen as a whole, the
workflow may not exhibit atomicity. For example, failure in a later
processing stage will not roll back an earlier stage.

Consider a sale of merchandise as an example workflow, as shown in
Figure~\ref{scan-5-6}.
\begin{figure}
\centerline{\includegraphics{hail_f0506}}
%\centerline{\epsfbox{scan-5-6.eps}}
\caption{In this simplified workflow for selling merchandise,
processing a single order produces three different responses. The
response queues from the order-processing step are request queues
for subsequent steps.}
\label{scan-5-6}
\end{figure}
One transaction might take an incoming order, check it for validity,
and generate three output messages, each into its own outbound queue:
an order confirmation (back to the customer), a billing record (to the
accounts receivable system), and a shipping request (to the shipping
system). Another transaction, operating in the shipping
system, might dequeue the shipping request and fulfill it. If
failure is detected in the shipping transaction, the system can no longer
abort the overall workflow; the order confirmation and
billing have already been sent. Instead, the shipping transaction has
no alternative but to drive the overall workflow forward, even if in
a somewhat different direction than hoped for. For example, the
shipping transaction could queue messages apologizing to the customer
and crediting the purchase price back to the customer's account.
Figure~\ref{scan-5-7} shows the workflow with these extra steps.
\begin{figure}
\centerline{\includegraphics{hail_f0507}}
%\centerline{\epsfbox{scan-5-7.eps}}
\caption{In this workflow, a failure in shipping must produce
compensating responses, as it cannot abort the
overall workflow. The compensating responses credit the
customer's account for the previously debited amount and send an
apology to the customer indicating that the previously confirmed
order will not be filled after all.}
\label{scan-5-7}
\end{figure}

Even in a system in which one transaction may bill the
customer only to have a later \vocab{compensating transaction} refund
the billed amount, using atomic transactions simplifies application
programming. Imagine how complex it would be to reason about a
large workflow if each individual processing stage could fail midway through or
could interact with other concurrently executing stages. By treating
each workflow stage as an atomic transaction, a messaging system
considerably reduces the application designer's cognitive burden. A
diagram, such as Figure~\ref{scan-5-7}, can provide an accurate abstraction of the system's observable
behaviors by showing the system as processing stages linked by message
queues.
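As a sketch of this forward-only recovery, the following Python fragment shows a shipping stage that, rather than aborting the workflow, drives it forward with compensating messages (queue contents and names are illustrative):

```python
# Sketch: a workflow stage that cannot abort the overall workflow, so
# a failure produces compensating messages instead of a rollback.

from collections import deque

def shipping_stage(shipping_requests, shipments, refunds, apologies,
                   in_stock):
    order = shipping_requests.popleft()
    if in_stock(order):
        shipments.append(order)            # normal forward progress
    else:
        # Compensation: the bill and confirmation were already sent,
        # so credit the customer back and apologize.
        refunds.append(("credit", order))
        apologies.append(("sorry", order))

reqs = deque(["widget"])
shipments, refunds, apologies = deque(), deque(), deque()
shipping_stage(reqs, shipments, refunds, apologies,
               in_stock=lambda order: False)
assert not shipments
assert list(refunds) == [("credit", "widget")]
```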
Finally, consider how the sales workflow keeps track of available
merchandise, customer account balances, and other information. You
should be able to see that individual processing stages of a workflow
will frequently have to use a database system. As such, transactions
will involve both message queues and databases. Atomicity needs to
cover both; if a transaction aborts, you want the database left
unchanged \emph{and} the request message left queued. In
Section~\ref{two-phase-commit-section}, I will explain how this
comprehensive atomicity can be achieved by coordinating the systems
participating in a transaction.
\subsection{Journaled File Systems}\label{jfs-transactions-section}
The transaction concept has been employed in middleware both longer
and more extensively than in operating systems. However, one
application in operating systems has become quite important. Most
contemporary operating systems provide file systems that employ atomic
transactions to at least maintain the structural consistency of the
file system itself, if not the consistency of the data stored in
files. These file systems are known as
\foldvocabs{journaled}{file system} (or
\foldvocabs{journaling}{file system}) in reference to their use of an
underlying mechanism known as a \vocab{journal}. I will discuss
journals in Sections \ref{failure-atomicity-subsection} and \ref{wal} under their alternative name,
\vocabs{log}. Examples of journaled file systems include NTFS, used
by Microsoft Windows; HFS Plus, used by Mac OS~X; and ext3fs, reiserfs, JFS, and XFS, used by Linux. (The latter two
originated in proprietary UNIX systems: JFS was developed by IBM for
AIX, and XFS was developed by SGI for IRIX.) File systems that are
not journaled need to use other techniques, which I describe in
Section~\ref{metadata-integrity-section}, to maintain the consistency
of their data structures.

File systems provide a more primitive form of data storage and access
than database systems. As you will see in Chapter~\ref{persistence-chapter}, contemporary
operating systems generally treat a file as an arbitrarily large,
potentially extensible sequence of bytes, accessed by way of a
textual name. The names are organized hierarchically into nested
directories or folders. Typical operations on files include create,
read, write, rename, and delete.

Underlying this simple abstraction are some largely invisible
data structures, known as \vocab{metadata}, that help locate and
organize the data. For example, because each file can grow in size,
the file system must be free to store different parts of a file in
different locations. As such, the file system must store
metadata for each file indicating where each portion of the
file is located. Moreover, the file system must store information concerning
what parts of the storage are in use, so that it can allocate unused
space for a file that is growing.

The existence of this metadata means that even simple file operations
can involve several updates to the information in persistent storage.
Extending a file, for example, must update both the information about
free space and the information about space allocated to that file.
These structures need to be kept consistent; it would be
disastrous if a portion of the storage were both used for storing a file
and made available for allocation to a second file. Thus, the updates
should be done as part of an atomic transaction.
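The following Python fragment sketches why the two metadata updates belong together; the structures shown (a free-block set and per-file block lists) are simplified stand-ins for real file system metadata, not any actual on-disk format.

```python
# Sketch: extending a file touches two metadata structures, which must
# stay consistent with one another.  A crash between the two updates
# could leave a block both free and allocated, or neither.

free_blocks = {10, 11, 12}
files = {"report.txt": [3, 7]}   # file name -> allocated block list

def extend_file(name, n):
    # Both updates below belong in one atomic transaction.
    new_blocks = [free_blocks.pop() for _ in range(n)]  # claim space
    files[name].extend(new_blocks)                      # record it

extend_file("report.txt", 2)
allocated = {b for blocks in files.values() for b in blocks}
assert not (allocated & free_blocks)  # no block both free and in use
```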
Some atomic transactions may even be visible to the user. Consider
the renaming of a file. A new directory entry needs to be created and an old
entry removed. The user wants these two changes done atomically,
without the possibility of the file having both names, or neither.
Some journaled file systems treat each operation requested by an
application program as an atomic and durable transaction. On such a system, if a
program asks the system to rename a file, and the rename operation
returns with an indication of success, the application program can be
sure that renaming has taken place. If the system crashes immediately
afterward and is rebooted, the file will have its new name. Said
another way, the rename operation includes commitment of the
transaction. The application program can tell that the transaction
committed and hence is guaranteed to be durable.
Other journaled file systems achieve higher performance by delaying
transaction commit. At the time the rename operation returns, the
transaction may not have committed yet. Every minute or so, the file
system will commit all transactions completed during that interval.
As such, when the system comes back from a crash, the file system will
be in some consistent state, but maybe not a completely up-to-date one.
A minute's worth of operations that appeared to complete
successfully may have vanished. In exchange for this risk, the system
has gained the ability to do fewer writes to persistent storage, which improves
performance. Notice that even in this version, transactions are
providing some value. The state found after reboot will be the result
of some sequence of operations (possibly a truncated
sequence), rather than being a hodgepodge of partial results from
incomplete and unordered operations.
Often, journaled file systems protect only metadata; the application
data stored in files may be left in an inconsistent state after a
crash.
In particular, some writes into the files may not have taken effect, and the writes that are lost in this way are not necessarily the ones performed most recently.
Even many journaled file systems that do better than
this offer only a guarantee that all write operations
that completed before a crash will be reflected in the state after the
crash. With this limited guarantee, if a program wants to do multiple writes in an atomic fashion
(so that all writes take place or none do), the file system will not
provide any assistance. However, a file system can also be designed to
fully support transactions, including allowing the programmer to
group multiple updates into a transaction. One example of such a fully
transactional file system is Transactional NTFS (TxF), which was added
to Microsoft Windows in the Vista version.
\section{Mechanisms to Ensure Atomicity}\label{atomicity-mechanisms-section}
Having seen how valuable atomic transactions are for middleware and
operating systems, you should be ready to consider how this value is
actually provided. In particular, how is the atomicity of each
transaction ensured? Atomicity has two aspects: the isolation of
concurrent transactions from one another and the assurance that
failed transactions have no visible effect. In
Section~\ref{two-phase-section}, you will see how isolation is
formalized as serializability and how a particular locking discipline,
two-phase locking, is used to ensure serializability. In
Section~\ref{failure-atomicity-subsection}, you will see how failure
atomicity is assured through the use of an undo log.
\subsection{Serializability: Two-Phase Locking}\label{two-phase-section}
Transactions may execute concurrently with one another, so long as
they don't interact in any way that makes the concurrency apparent.
That is, the execution must be equivalent to a \vocab{serial}
execution, in which one transaction runs at a time, committing or
aborting before the next transaction starts. Any execution
equivalent to a serial execution is called a \vocab{serializable} execution. In
this section, I will more carefully define what it means for two
executions to be equivalent and hence what it means for an execution
to be serializable. In addition, I will show some simple rules for using
readers/writers locks that guarantee serializability. These rules,
used in many transaction systems, are known as
\foldvocab{two-phase}{locking}.
Equivalence, and hence serializability, can be defined in several
somewhat different ways. The definitions I give are the simplest I
could find and suffice to justify two-phase locking, which is the
mechanism normally used to achieve serializability in practical
systems. However, you should be aware that more general definitions
are needed in order to accommodate more advanced concurrency control
mechanisms. The notes at the end of the chapter provide pointers to
some of these more sophisticated alternatives.
Each transaction executes a sequence of actions. I will focus on
those actions that read or write some stored entity (which might be a
row in a database table, for example) and those actions that lock or
unlock a readers/writers lock. Assume that each stored entity
has its own lock associated with it. I will use the following
notation:
\begin{itemize}
\item
$r_j(x)$ means a read of entity $x$ by transaction $T_j$; when I want
to show the value that was read, I use $r_j(x, v)$, with $v$ as the value.
\item
$w_j(x)$ means a write of entity $x$ by transaction $T_j$; when I want
to show the value being written, I use $w_j(x, v)$, with $v$ as the value.
\item
$s_j(x)$ means an acquisition of a shared (that is, reader) lock on entity $x$ by transaction $T_j$.
\item
$e_j(x)$ means an acquisition of an exclusive (that is, writer) lock on
entity $x$ by transaction $T_j$.
\item
$\overline{s}_j(x)$ means an unlocking of a shared lock on entity $x$ by transaction $T_j$.
\item
$\overline{e}_j(x)$ means an unlocking of an exclusive lock on entity
$x$ by transaction $T_j$.
\item
$u_j(x)$ means an upgrade by transaction $T_j$ of its hold on entity
$x$'s lock from shared status to exclusive status.
\end{itemize}
Each read returns the most recently written value. Later, in
Section~\ref{reduced-isolation}, I will revisit this assumption,
considering the possibility that writes might store each successive
value for an entity in a new location so that reads can choose
among the old values.
The sequence of actions executed by a transaction is called its
\vocab{history}.
Because the transactions execute concurrently, if you were to write
all their actions in the order they happen, the transactions'
histories would be interleaved.
This time-ordered interleaving of all the
transactions' histories is called the system's history. All locking actions
are shown at the time when the lock is granted, not at the possibly
earlier time when the lock is requested. Assume that the histories
include all the relevant actions. In particular, if a transaction
aborts and does some extra writes at that time to undo the effect of
earlier writes (as you will see in Section~\ref{failure-atomicity-subsection}),
those undo writes must be explicitly listed in the history. Note
also that I am implicitly assuming the transactions have no effects
other than on storage; in particular, they don't do any I/O.
Let's look at some examples. Suppose that $x$ and $y$ are two
variables that are initially both equal to 5.
Suppose that transaction $T_1$ adds 3 to each of the two
variables, and transaction $T_2$ doubles each of the two variables.
Each of these transactions preserves the invariant that
$x=y$.
One serial history would be as follows:
\begin{eqnarray*}
&&e_1(x), r_1(x, 5), w_1(x, 8), \overline{e}_1(x),
e_1(y), r_1(y, 5), w_1(y, 8), \overline{e}_1(y),\\
&&e_2(x), r_2(x, 8), w_2(x, 16), \overline{e}_2(x),
e_2(y), r_2(y, 8), w_2(y, 16), \overline{e}_2(y)
\end{eqnarray*}
Before you go any further, make sure you understand this notation; as directed in
Exercise~\ref{t2-first-ex}, write out another serial history in which transaction
$T_2$ happens before transaction $T_1$. (The sequence of steps within each
transaction should remain the same.)
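To see concretely what this serial history computes, here is a minimal Python sketch that replays just the reads and writes (the locks can be elided because the execution is serial):

```python
# Replay the serial history's reads and writes: T1 adds 3 to x and y,
# and only after T1 finishes does T2 double both.
store = {"x": 5, "y": 5}

def t1(store):  # T1: add 3 to each variable
    store["x"] += 3
    store["y"] += 3

def t2(store):  # T2: double each variable
    store["x"] *= 2
    store["y"] *= 2

t1(store)  # T1 runs to completion before T2 starts
t2(store)
print(store)  # {'x': 16, 'y': 16}; the invariant x == y holds
```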
In the serial history I showed, $x$ and $y$ both end up with the value
16. When you wrote out the other serial history for these two transactions,
you should have obtained a different final value for these variables. Although
the invariant $x=y$ again holds, the common numerical value of $x$ and
$y$ is not 16 if transaction $T_2$ goes first. This makes an important
point: transaction system designers do not insist on \vocab{deterministic} execution,
in which the scheduling cannot affect the result. Serializability is a
weaker condition.
Continuing with the scenario in which $T_1$ adds 3 to each variable
and $T_2$ doubles each variable, one serializable---but not serial---history follows:
\begin{eqnarray*}
&&e_1(x), r_1(x, 5), w_1(x, 8), \overline{e}_1(x),
e_2(x), r_2(x, 8), w_2(x, 16), \overline{e}_2(x),\\
&&e_1(y), r_1(y, 5), w_1(y, 8), \overline{e}_1(y),
e_2(y), r_2(y, 8), w_2(y, 16), \overline{e}_2(y)
\end{eqnarray*}
To convince others that this history is serializable, you could
persuade them that it is equivalent to the serial history shown
previously. Although transaction $T_2$ starts before transaction $T_1$ is
finished, each variable still is updated the same way as in the
serial history.
Because the example transactions unlock $x$ before locking $y$, they
can also be interleaved in a nonserializable fashion:
\begin{eqnarray*}
&&e_1(x), r_1(x, 5), w_1(x, 8), \overline{e}_1(x),
e_2(x), r_2(x, 8), w_2(x, 16), \overline{e}_2(x),\\
&&e_2(y), r_2(y, 5), w_2(y, 10), \overline{e}_2(y),
e_1(y), r_1(y, 10), w_1(y, 13), \overline{e}_1(y)
\end{eqnarray*}
Here, the invariant $x=y$ is broken: at the end, $x$ is equal to 16,
but $y$ is equal to 13. Thus, this history is not equivalent to
either of the two serial histories.
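The breakage can be replayed concretely. The following Python sketch executes the read and write steps of this interleaving in the order shown above:

```python
# Replay the nonserializable interleaving; each step is (transaction,
# entity, update function), in the order the actions occur above.
store = {"x": 5, "y": 5}

steps = [
    ("T1", "x", lambda v: v + 3),  # r1(x, 5), w1(x, 8)
    ("T2", "x", lambda v: v * 2),  # r2(x, 8), w2(x, 16)
    ("T2", "y", lambda v: v * 2),  # r2(y, 5), w2(y, 10)
    ("T1", "y", lambda v: v + 3),  # r1(y, 10), w1(y, 13)
]
for txn, entity, update in steps:
    store[entity] = update(store[entity])

print(store)  # {'x': 16, 'y': 13}; the invariant x == y is broken
```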
My primary goal in this section is to show how locks can be used in a
disciplined fashion that rules out nonserializable histories. (In
particular, you will learn that in the previous example, $x$ should not be
unlocked until after $y$ is locked.) First, though, I need to
formalize what it means for two histories to be equivalent, so that
the definition of serializability is rigorous.
I will make two assumptions about locks:
\begin{enumerate}
\item
Each transaction correctly pairs up lock and unlock operations. That
is, no transaction ever locks a lock it already holds (except
upgrading from shared to exclusive status), unlocks a lock it doesn't
hold, or leaves a lock locked at the end.
\item
The locks function correctly. No transaction will ever be granted a
lock in shared mode while it is held by another transaction in
exclusive mode, and no transaction will ever be granted a lock in
exclusive mode while it is held by another transaction in either
mode.
\end{enumerate}
Neither of these assumptions should be controversial.
Two system histories are equivalent if the first history can be turned
into the second by performing a succession of equivalence-preserving
swap steps. An equivalence-preserving swap reverses the order of
two adjacent actions, subject to the following constraints:
\begin{itemize}
\item
The two actions must be from different transactions. (Any
transaction's actions should be kept in their given order.)
\item
The two actions must not be any of the following seven \vocabindex{conflicting}{conflict} pairs:
\begin{enumerate}
\item
$\overline{e}_j(x), s_k(x)$
\item
$\overline{e}_j(x), e_k(x)$
\item
$\overline{s}_j(x), e_k(x)$
\item
$\overline{s}_j(x), u_k(x)$
\item
$w_j(x),r_k(x)$
\item
$r_j(x),w_k(x)$
\item
$w_j(x),w_k(x)$
\end{enumerate}
Forbidding swaps of the first four pairs ensures that the locks continue
to function properly: $T_k$ may not acquire $x$'s lock until after $T_j$ has
unlocked it. The next two conflicts ensure the read actions return the correct
values: swapping a read and a write would change which value the read action returns. The final conflict ensures that
$x$ is left storing the correct value.
\end{itemize}
Figure~\ref{scan-5-8} illustrates some of the constraints on
equivalence-preserving swaps.
\begin{figure}
\centerline{\includegraphics{hail_f0508}}
%\centerline{\epsfbox{scan-5-8.eps}}
\caption{Illegal and legal swaps: (a) illegal to swap steps from one
transaction; (b) illegal to swap two conflicting operations on the
same entity; (c) legal to swap operations on different entities by
different transactions; (d) legal to swap nonconflicting operations by
different transactions.}
\label{scan-5-8}
\end{figure}
Note that in all the conflicts, the two actions operate on the same
stored entity (shown as $x$); any two operations on different entities
by different transactions
can be reversed without harm. In Exercise~\ref{serializing-ex}, you can
show that this suffices to prove that the earlier example of a
serializable history is indeed equivalent to the example serial history.
Even if two actions by different transactions involve
the same entity, they may be reversed without harm if they are both reads.
Exercise~\ref{reads-commute-ex} includes a serializable
history where reads of an entity need to be reversed in order to
arrive at an equivalent serial history.
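The swap rules can also be made concrete in code. The following Python sketch is my own illustrative model (the action tuples and the bubble-sort strategy are not part of any real transaction system): it encodes the seven conflicting pairs and uses only legal swaps of adjacent actions to turn the earlier serializable history into the serial one:

```python
# Actions are (kind, transaction, entity); 'unlock_s'/'unlock_e' stand
# for the overlined unlock actions in the text's notation.
FORBIDDEN = {("unlock_e", "s"), ("unlock_e", "e"),
             ("unlock_s", "e"), ("unlock_s", "u"),
             ("w", "r"), ("r", "w"), ("w", "w")}

def can_swap(a, b):
    if a[1] == b[1]:
        return False  # actions of one transaction keep their order
    if a[2] != b[2]:
        return True   # operations on different entities never conflict
    return (a[0], b[0]) not in FORBIDDEN

def serialize(history):
    # Bubble actions of lower-numbered transactions leftward, using
    # only equivalence-preserving swaps of adjacent actions.
    hist = list(history)
    changed = True
    while changed:
        changed = False
        for i in range(len(hist) - 1):
            if hist[i][1] > hist[i + 1][1] and can_swap(hist[i], hist[i + 1]):
                hist[i], hist[i + 1] = hist[i + 1], hist[i]
                changed = True
    return hist

# The serializable-but-not-serial history from earlier in the section:
h = [("e", 1, "x"), ("r", 1, "x"), ("w", 1, "x"), ("unlock_e", 1, "x"),
     ("e", 2, "x"), ("r", 2, "x"), ("w", 2, "x"), ("unlock_e", 2, "x"),
     ("e", 1, "y"), ("r", 1, "y"), ("w", 1, "y"), ("unlock_e", 1, "y"),
     ("e", 2, "y"), ("r", 2, "y"), ("w", 2, "y"), ("unlock_e", 2, "y")]
serial = serialize(h)
print([a[1] for a in serial])  # all of T1's actions now precede T2's
```

Every swap the loop performs involves $T_1$'s operations on $y$ passing $T_2$'s operations on $x$; because the entities differ, none of these pairs conflict, which is why this particular history is serializable.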
I am now ready to state the two-phase locking rules, which suffice
to ensure serializability. For now, concentrate on understanding what
the rules say; afterward I will show that they suffice. A
transaction obeys two-phase locking if:
\begin{itemize}
\item
For any entity that it operates on, the transaction locks the
corresponding lock exactly once, sometime before it reads or writes
the entity the first time, and unlocks it exactly once, sometime after
it reads or writes the entity the last time.
\item
For any entity the transaction writes into, either the
transaction initially obtains the corresponding lock in exclusive mode,
or it upgrades the lock to exclusive mode sometime before writing.
\item
The transaction performs all its lock and upgrade actions before performing any of
its unlock actions.
\end{itemize}
Notice that the two-phase locking rules leave a modest amount of
flexibility regarding the use of locks. Consider the
example transactions that read and write $x$ and then read and write
$y$. Any of the following transaction histories for $T_1$ would obey
two-phase locking:
\begin{itemize}
\item \label{two-phase-examples}$e_1(x), r_1(x), w_1(x), e_1(y), \overline{e}_1(x), r_1(y), w_1(y),
\overline{e}_1(y)$
\item $e_1(x), e_1(y), r_1(x), w_1(x), r_1(y), w_1(y),
\overline{e}_1(y), \overline{e}_1(x)$
\item $s_1(x), r_1(x), u_1(x), w_1(x), s_1(y), r_1(y), u_1(y),
w_1(y), \overline{e}_1(x), \overline{e}_1(y)$
\end{itemize}
In Exercise~\ref{two-phase-orders-ex}, you can come up with several additional two-phase
possibilities for this transaction.
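As a rough illustration, here is a Python sketch of a checker for these rules applied to a single transaction's history. It is only a partial check: it tests the third rule and that each read or write holds an appropriate lock, but it does not catch every violation of the first rule (for example, relocking an entity before any unlock has occurred):

```python
# Each action is (kind, entity), using the section's notation with
# 'unlock_s'/'unlock_e' for the overlined unlock actions.
def obeys_two_phase(history):
    held = {}                 # entity -> 'shared' or 'exclusive'
    unlocking_started = False
    for kind, entity in history:
        if kind in ("s", "e", "u"):
            if unlocking_started:
                return False  # lock or upgrade after an unlock: not two-phase
            held[entity] = "shared" if kind == "s" else "exclusive"
        elif kind in ("unlock_s", "unlock_e"):
            unlocking_started = True
            held.pop(entity, None)
        elif kind == "r" and entity not in held:
            return False      # read without holding the lock
        elif kind == "w" and held.get(entity) != "exclusive":
            return False      # write without exclusive mode
    return not held           # every lock must be released by the end

# The third example history above passes; unlocking x before locking y fails.
good = [("s", "x"), ("r", "x"), ("u", "x"), ("w", "x"),
        ("s", "y"), ("r", "y"), ("u", "y"), ("w", "y"),
        ("unlock_e", "x"), ("unlock_e", "y")]
bad = [("e", "x"), ("r", "x"), ("w", "x"), ("unlock_e", "x"),
       ("e", "y"), ("r", "y"), ("w", "y"), ("unlock_e", "y")]
print(obeys_two_phase(good), obeys_two_phase(bad))  # True False
```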
If the programmer who writes a transaction explicitly includes the
lock and unlock actions, any of these possibilities would be valid.
More commonly, however, the programmer includes only the
reads and writes, without any explicit lock or unlock actions. An
underlying transaction processing system automatically inserts the
lock and unlock actions to make the programming simpler and
less error-prone. In this case, the system is likely to use
three very simple rules\label{simple-two-phase}:
\begin{enumerate}
\item
Immediately before any read action, acquire the corresponding lock in
shared mode if the transaction doesn't already hold it.
\item
Immediately before any write action, acquire the corresponding lock in
exclusive mode if the transaction doesn't already hold it. (If the
transaction holds the lock in shared mode, upgrade it.)
\item
At the very end of the transaction, unlock all the locks the
transaction has locked.
\end{enumerate}
You should be able to convince yourself that these rules are a special
case of two-phase locking. By holding all the locks until the end of
the transaction, the system need not predict the transaction's future
read or write actions.
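These three automatic rules can be sketched as a wrapper that records the lock actions a transaction processing system would insert. This is a toy model in Python (the class and method names are mine, not any real system's API), and it tracks only the lock bookkeeping, not the actual data access:

```python
class AutoLockingTransaction:
    """Records the lock actions the system would insert automatically."""

    def __init__(self):
        self.held = {}      # entity -> 'shared' or 'exclusive'
        self.history = []   # actions, including the inserted lock actions

    def read(self, entity):
        if entity not in self.held:            # rule 1: shared lock first
            self.held[entity] = "shared"
            self.history.append(("s", entity))
        self.history.append(("r", entity))

    def write(self, entity):
        if self.held.get(entity) == "shared":  # rule 2: upgrade if shared
            self.held[entity] = "exclusive"
            self.history.append(("u", entity))
        elif entity not in self.held:          # rule 2: else lock exclusive
            self.held[entity] = "exclusive"
            self.history.append(("e", entity))
        self.history.append(("w", entity))

    def commit(self):
        for entity, mode in self.held.items():  # rule 3: unlock at the end
            kind = "unlock_e" if mode == "exclusive" else "unlock_s"
            self.history.append((kind, entity))
        self.held = {}

# A transaction that reads and writes x, then reads and writes y,
# reproduces the third example history shown earlier (with upgrades):
t = AutoLockingTransaction()
t.read("x"); t.write("x"); t.read("y"); t.write("y")
t.commit()
print(t.history)
```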
I still need to prove that two-phase locking suffices to ensure
serializability. Recall that a history is serializable if it is
equivalent to a serial history. Thus, I need to show that so long as
two-phase locking is followed, you can find a sequence of
equivalence-preserving swaps that will transform the system history
into a serial one. Please understand that this transformation of the
history into a serial one is just a proof technique I am using to
help understand the system, not something that actually occurs during
the system's operation. Transaction systems are not in the business
of forcing transactions to execute serially; concurrency is good for
performance. If anything, the running transaction system is doing the
reverse transformation: the programmer may have thought in terms of
serial transactions, but the system's execution interleaves them. I
am showing that this interleaving is equivalence-preserving by
showing that you can back out of it.
To simplify the proof, I will use the following
vocabulary:
\begin{itemize}
\item
The portion of the system history starting with $T_j$'s first action
and continuing up to, but not including, $T_j$'s first unlock action
is \label{phase-one-definition}\vocab{phase one} of $T_j$.
\item
The portion of the system history starting with $T_j$'s first unlock
action and continuing up through $T_j$'s last action is \vocab{phase
two} of $T_j$.
\item
Any action performed by $T_k$ during $T_j$'s phase one (with $j \neq
k$) is a \foldvocab{phase one}{impurity} of $T_j$. Similarly, any
action performed by $T_k$ during $T_j$'s phase two (with $j \neq k$)
is a \foldvocab{phase two}{impurity} of $T_j$.
\item
If a transaction has no impurities of either kind, it is
\vocab{pure}. If all transactions are pure, then the system history
is serial.
\end{itemize}
My game plan for the proof is this. First, I will show how to use
equivalence-preserving swaps to purify any one transaction, say, $T_j$.
Second, I will show that if $T_k$ is already pure, purifying $T_j$
does not introduce any impurities into $T_k$. Thus, you can
purify the transactions one at a time, without having to worry about
wrecking the transactions purified earlier.
If $T_j$ is impure, you can purify it by first removing any phase one
impurities and then any phase two impurities. To remove the phase
one impurities, you can remove the leftmost one, and then repeat with
the new leftmost one, until all are gone. The leftmost phase one
impurity of $T_j$ must be preceded by an action of $T_j$. I will
show that those two actions can be reversed by an
equivalence-preserving swap. That moves the leftmost impurity
further to the left. If this swapping is done repeatedly, the
impurity will percolate its way further and further to the left until
it passes the first operation of $T_j$, at which point it will cease
to be an impurity of $T_j$. Phase two impurities can be removed
similarly, starting with the rightmost one, by percolating them to the
right until they pass the last operation of $T_j$.
I need to show that the leftmost phase one impurity of $T_j$ can be
swapped with its left-hand neighbor, and that the rightmost phase two
impurity can be swapped with its right-hand neighbor. Recall that to
legally swap two actions, they must be from different transactions,
and they must not be one of the seven forbidden conflicting pairs. In
order to be the leftmost impurity of $T_j$, an action must be
performed by some other transaction, $T_k$, and have an action from
$T_j$ as its left-hand neighbor. (A similar argument applies for the rightmost
impurity and its right-hand neighbor.) Thus, the actions are
definitely from different transactions, and the only remaining concern is
the seven conflicts.
For the leftmost phase one impurity and its left-hand neighbor, you
cannot have any of these conflicts:
\begin{enumerate}
\item $\overline{e}_j(x), s_k(x)$
\item $\overline{e}_j(x), e_k(x)$
\item $\overline{s}_j(x), e_k(x)$
\item $\overline{s}_j(x), u_k(x)$
\end{enumerate}
because transaction $T_j$ does not do any unlock actions in phase one.
(Recall the definition of phase one.) Nor can you have any of the
other three conflicts:
\begin{enumerate}[resume]
\item $w_j(x), r_k(x)$
\item $r_j(x), w_k(x)$
\item $w_j(x), w_k(x)$
\end{enumerate}
because the two-phase locking rules ensure that each read or write action is performed only with the
appropriate lock held. There is no way transactions $T_j$ and $T_k$
can both hold the lock on $x$, with at least one of them being in
exclusive mode. Similar arguments rule out any conflict between the
rightmost phase two impurity and its right-hand neighbor; in
Exercise~\ref{phase-two-purification-ex}, you can fill in the details.
You have now seen that equivalence-preserving swap steps suffice to
purify $T_j$ by percolating each of its phase one impurities out to
the left and each of its phase two impurities out to the right. The
goal is to serialize an arbitrary system history that complies with
the two-phase locking rules. I would like to pick one of its
transactions that is impure and purify it, then repeat with another
and keep going until all the transactions are pure, that is, until the system
history has become serial. For this plan to work, I need to be sure
that purifying one transaction doesn't wreck the purity of any already
pure transaction.
Purifying $T_j$ doesn't touch any actions that don't lie between
$T_j$'s first action and its last action. Thus, the only way
purifying $T_j$ could endanger the existing purity of $T_k$ is if
$T_k$ lies at least partly within $T_j$'s span. However, because $T_k$
is pure, either all of it lies within $T_j$'s span or none of it does,
so you need only consider the case that all of $T_k$ lies within
$T_j$'s span. In fact, you should be able to convince yourself of
something stronger: if any action of a pure transaction $T_k$ lies within $T_j$'s span,
then all of $T_k$ lies within a single one of $T_j$'s phases (either
all within phase one, or all within phase two).
If $T_k$'s actions occupy consecutive positions within phase one,
purifying $T_j$ will percolate all of $T_k$'s actions to the left and
leave them in consecutive positions preceding the start of $T_j$.
Similarly, if $T_k$ is within phase two, all its actions will