Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming was really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be: ``Why is parallel programming {\em
perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000 --- in 1991 US dollars.
In contrast, in 2006, Paul finds himself typing these words on a
dual-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
2GB of main memory, a 60GB disk drive, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
	Parallel programming is in some ways more difficult than
	sequential programming; for example, parallel validation
	is more difficult.
But no longer mind-crushingly difficult.}
How could parallel programming \emph{ever} be as easy
as sequential programming?
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
{\tt get\_input | grep "interesting" | sort}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
Oh, really???
What about correctness, maintainability, robustness, and so on?
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
Given that parallel programming is perceived to be much harder
than is sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL have been special purpose, hence generality must
also be added to the list.
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
What about just having fun?
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
Are there no cases where parallel programming is about something
other than performance?
There are certainly cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
Given how cheap parallel hardware has become, how can anyone
afford to pay people to program it?
There are a number of answers to this question:
\begin{enumerate}
\item	Given a large computational cluster of parallel machines,
	the aggregate cost of the cluster can easily justify
	substantial developer effort, because the development
	cost can be spread over the large number of machines.
\item	Popular software that is run by tens of millions of users
	can easily justify substantial developer effort,
	as the cost of this development can be spread over the tens
	of millions of users.
	Note that this includes things like kernels and system
	libraries.
\item	If the low-cost parallel machine is controlling the operation
	of a valuable piece of equipment, then the cost of this
	piece of equipment might easily justify substantial
	developer effort.
\item	If the software for the low-cost parallel machine produces an
	extremely valuable result (e.g., mineral exploration),
	then the valuable result might again justify substantial
	developer cost.
\item	Safety-critical systems protect lives, which can clearly
	justify very large developer effort.
\item	Hobbyists and researchers might seek knowledge, experience,
	fun, or glory rather than mere money.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who cling to generality will therefore fail to set
the productivity bar high enough to succeed in production.
What other bottlenecks might prevent additional CPUs from
providing additional performance?
There are any number of potential bottlenecks:
\begin{enumerate}
\item	Main memory. If a single thread consumes all available
	memory, additional threads will simply page themselves
	silly.
\item	Cache. If a single thread's cache footprint completely
	fills any shared CPU cache(s), then adding more threads
	will simply thrash the affected caches.
\item	Memory bandwidth. If a single thread consumes all available
	memory bandwidth, additional threads will simply
	result in additional queuing on the system interconnect.
\item	I/O bandwidth. If a single thread is I/O bound,
	adding more threads will simply result in them all
	waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems may have any number of additional
bottlenecks.
What besides CPU cache capacity might require limiting the
number of concurrent threads?
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item	Main memory. Each thread consumes some memory
	(for its stack if nothing else), so that excessive
	numbers of threads can exhaust memory, resulting
	in excessive paging or memory-allocation failures.
\item	I/O bandwidth. If each thread initiates a given
	amount of mass-storage I/O or networking traffic,
	excessive numbers of threads can result in excessive
	I/O queuing delays, again degrading performance.
	Some networking protocols may be subject to timeouts
	or other failures if there are so many threads that
	networking events cannot be responded to in a timely
	fashion.
\item	Synchronization overhead.
	For many synchronization protocols, excessive numbers
	of threads can result in excessive spinning, blocking,
	or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
Are there any other obstacles to parallel programming?
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item	The only known algorithms for a given project might
	be inherently sequential in nature.
	In this case, either avoid parallel programming
	(there being no law saying that your project \emph{has}
	to run in parallel) or invent a new parallel algorithm.
\item	The project allows binary-only plugins that share the same
	address space, such that no one developer has access to
	all of the source code for the project.
	Because many parallel bugs, including deadlocks, are
	global in nature, such binary-only plugins pose a severe
	challenge to current software development methodologies.
	This might well change, but for the time being, all
	developers of parallel code sharing a given address space
	need to be able to see \emph{all} of the code running in
	that address space.
\item	The project contains heavily used APIs that were designed
	without regard to parallelism.
	Some of the more ornate features of the System V
	message-queue API form a case in point.
	Of course, if your project has been around for a few
	decades, and if its developers did not have access to
	parallel hardware, your project undoubtedly has at least
	its share of such APIs.
\item	The project was implemented without regard to parallelism.
	Given that there are a great many techniques that work
	extremely well in a sequential environment, but that
	fail miserably in parallel environments, if your project
	ran only on sequential hardware for most of its lifetime,
	then your project undoubtedly has at least its share of
	parallel-unfriendly code.
\item	The project was implemented without regard to good
	software-development practice.
	The cruel truth is that shared-memory parallel
	environments are often much less forgiving of sloppy
	development practices than are sequential environments.
	You may be well-served to clean up the existing design
	and code prior to attempting parallelization.
\item	The people who originally did the development on your
	project have since moved on, and the people remaining,
	while well able to maintain it or add small features,
	are unable to make ``big animal'' changes.
	In this case, unless you can work out a very simple
	way to parallelize your project, you will probably
	be best off leaving it sequential.
	That said, there are a number of simple approaches that
	you might use to parallelize your project, including running multiple
	instances of it, using a parallel implementation of
	some heavily used library function, or making use of
	some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
Where are the answers to the Quick Quizzes found?
In Appendix~\ref{chp:Answers to Quick Quizzes} starting on
page~\pageref{chp:Answers to Quick Quizzes}.
Hey, I thought I owed you an easy one!
Some of the Quick Quiz questions seem to be from the viewpoint
of the reader rather than the author.
Is that really the intent?
Indeed it is!
Many are modeled after Paul---just ask anyone who has had the
misfortune of being assigned to teach him.
Others are quite similar to actual questions that have been asked
during conference presentations and lectures covering the
material in this book.
Still others are from the viewpoint of the author.
These Quick Quizzes are just not my cup of tea.
What do you recommend?
There are a number of alternatives available to you:
\begin{enumerate}
\item	Just ignore the Quick Quizzes and read the rest of
	the book.
	You might miss out on the interesting material in
	some of the Quick Quizzes, but the rest of the book
	has lots of good material as well.
\item	If you prefer a more academic and rigorous treatment of
	parallel programming,
	you might like Herlihy's and Shavit's textbook.
	This book starts with an interesting combination
	of low-level primitives at high levels of abstraction
	from the hardware, and works its way through locking
	and simple data structures including lists, queues,
	hash tables, and counters, culminating with transactional
	memory.
\item	If you would like an academic treatment of parallel
	programming from a programming-language-pragmatics viewpoint,
	you might be interested in the concurrency chapter from Scott's
	textbook on programming languages.
\item	If you are interested in an object-oriented patternist
	treatment of parallel programming focusing on C++,
	you might try Volumes~2 and 4 of Schmidt's POSA series.
	Volume 4 in particular has some interesting chapters
	applying this work to a warehouse application.
	The realism of this example is attested to by
	the section entitled ``Partitioning the Big Ball of Mud'',
	wherein the problems inherent in parallelism often
	take a back seat to the problems inherent in getting
	one's head around a real-world application.
\item	If your primary focus is scientific and technical computing,
	and you prefer a patternist approach,
	you might try Mattson et al.'s textbook.
	It covers Java, C/C++, OpenMP, and MPI.
	Its patterns are admirably focused first on design,
	then on implementation.
\item	If you are interested in POSIX Threads, you might take
	a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}.
\item	If you are interested in C++, but in a Windows environment,
	you might try Herb Sutter's ``Effective Concurrency''
	series in
	Dr.~Dobb's Journal~\cite{HerbSutter2008EffectiveConcurrency}.
	This series does a reasonable job of presenting a
	commonsense approach to parallelism.
\item	If you want to try out Intel Threading Building Blocks,
	then perhaps James Reinders's book~\cite{Reinders2007Textbook}
	is what you are looking for.
\item	Finally, those preferring to work in Java might be
	well-served by Doug Lea's textbook.
\end{enumerate}
In contrast, this book meshes real-world machines with real-world
algorithms.
If your sole goal is to find an optimal parallel queue, you might
be better served by one of the above books.
However, if you are interested in principles of parallel design
that allow multiple such queues to operate in parallel, read on!
\QuickQAC{chp:Hardware and its Habits}{Hardware and its Habits}
Why should parallel programmers bother learning low-level
properties of the hardware?
Wouldn't it be easier, better, and more general to remain at
a higher level of abstraction?
It might well be easier to ignore the detailed properties of
the hardware, but in most cases it would be quite foolish
to do so.
If you accept that the only purpose of parallelism is to
increase performance, and if you further accept that
performance depends on detailed properties of the hardware,
then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines.
Would \emph{you} want to use a bridge designed by an
engineer who did not understand the properties of
the concrete and steel making up that bridge?
If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least
\emph{some} understanding of the underlying hardware?
What types of machines would allow atomic operations on
multiple data elements?
One answer to this question is that it is often possible to
pack multiple elements of data into a single machine word,
which can then be manipulated atomically.
A more trendy answer would be machines supporting transactional
memory.
However, such machines were still (as of 2008) research
curiosities.
The jury is still out on the applicability of transactional memory.
This is a \emph{simplified} sequence of events?
How could it \emph{possibly} be any more complex?
This sequence ignored a number of possible complications,
including:
\begin{enumerate}
\item	Other CPUs might be concurrently attempting to perform
	CAS operations involving this same cacheline.
\item	The cacheline might have been replicated read-only in
	several CPUs' caches, in which case, it would need to
	be flushed from their caches.
\item	CPU~7 might have been operating on the cache line when
	the request for it arrived, in which case CPU~7 would
	need to hold off the request until its own operation
	completed.
\item	CPU~7 might have ejected the cacheline from its cache
	(for example, in order to make room for other data),
	so that by the time that the request arrived, the
	cacheline was on its way to memory.
\item	A correctable error might have occurred in the cacheline,
	which would then need to be corrected at some point before
	the data was used.
\end{enumerate}
Production-quality cache-coherence mechanisms are extremely
complicated due to these sorts of considerations.
Why is it necessary to flush the cacheline from CPU~7's cache?
If the cacheline was not flushed from CPU~7's cache, then
CPUs~0 and 7 might have different values for the same set
of variables in the cacheline.
This sort of incoherence would greatly complicate parallel
software, and so hardware architects have been convinced to
avoid it.
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
The hardware designers \emph{have} been working on this
problem, and have consulted with no less a luminary than
the physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems~\cite{BryanGardiner2007}:
\begin{enumerate}
\item	the finite speed of light, and
\item	the atomic nature of matter.
\end{enumerate}
\begin{table}
\centering
\begin{tabular}{l|r|r}
Operation & Cost (ns) & Ratio \\
\hline
Clock period & 0.4 & 1.0 \\
``Best-case'' CAS & 12.2 & 33.8 \\
Best-case lock & 25.6 & 71.2 \\
Single cache miss & 12.9 & 35.8 \\
CAS cache miss & 7.0 & 19.4 \\
\hline
Off-Core & & \\
Single cache miss & 31.2 & 86.6 \\
CAS cache miss & 31.2 & 86.5 \\
\hline
Off-Socket & & \\
Single cache miss & 92.4 & 256.7 \\
CAS cache miss & 95.9 & 266.4 \\
\hline
Comms Fabric & 4,500 & 7,500 \\
Global Comms & 195,000,000 & 324,000,000 \\
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\end{table}
The first problem limits raw speed, and the second limits
miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently holding production frequencies to well below
10 GHz.
Nevertheless, some progress is being made, as may be seen
by comparing
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
with
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}
on
page~\pageref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
Integration of hardware threads in a single core and multiple
cores on a die have improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.
Section~\ref{sec:cpu:Hardware Free Lunch?}
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
These numbers are insanely large!
How can I possibly get my head around them?
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around 350--500
sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single
CAS cache miss.
For the more-expensive inter-system communications latencies,
use several rolls (or multiple cases) of toilet paper to represent
the communications latency.
Important safety tip: make sure to account for the needs of
those you live with when appropriating toilet paper!
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with them?
There are a number of reasons:
\begin{enumerate}
\item	Shared-memory multiprocessor systems have strict size limits.
	If you need more than a few thousand CPUs, you have no
	choice but to use a distributed system.
\item	Extremely large shared-memory systems tend to be
	quite expensive and to have even longer cache-miss
	latencies than does the small four-CPU system
	shown in
	Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
\item	The distributed-systems communications latencies do
	not necessarily consume the CPU, which can often allow
	computation to proceed in parallel with message transfer.
\item	Many important problems are ``embarrassingly parallel'',
	so that extremely large quantities of processing may
	be enabled by a very small number of messages.
	SETI@home is but one example of such an application.
	These sorts of applications can make good use of networks
	of computers despite extremely long communications
	latencies.
\end{enumerate}
It is likely that continued work on parallel applications will
increase the number of embarrassingly parallel applications that
can run well on machines and/or clusters having long communications
latencies.
That said, greatly reduced hardware latencies would be an
extremely welcome development.
\QuickQAC{chp:Tools of the Trade}{Tools of the Trade}
But this silly shell script isn't a \emph{real} parallel program!
Why bother with such trivia???
Because you should \emph{never} forget the simple stuff!
Please keep in mind that the title of this book is
``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
One of the most effective things you can do about it is to
avoid forgetting the simple stuff!
After all, if you choose to do parallel programming the hard
way, you have no one but yourself to blame for it being hard.
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
One straightforward approach is the shell pipeline:
{\tt grep \$pattern1 | sed -e 's/a/b/' | sort}
For a sufficiently large input file,
\co{grep} will pattern-match in parallel with \co{sed}
editing and with the input processing of \co{sort}.
See the file \co{} for a demonstration of
shell-script parallelism and pipelining.
But if script-based parallel programming is so easy, why
bother with anything else?
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations:
\begin{enumerate}
\item	Creation of new processes is usually quite heavyweight,
	involving the expensive \co{fork()} and \co{exec()}
	system calls.
\item	Sharing of data, including pipelining, typically involves
	expensive file I/O.
\item	The reliable synchronization primitives available to
	scripts also typically involve expensive file I/O.
\end{enumerate}
These limitations require that script-based parallelism use
coarse-grained parallelism, with each unit of work having
execution time of at least tens of milliseconds, and preferably
much longer.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
Why does this \co{wait()} primitive need to be so complicated?
Why not just make it work like the shell-script \co{wait} does?
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
it is not hard to build a \co{waitall()} function out of
the \co{wait()} function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
Isn't there a lot more to \co{fork()} and \co{wait()}
than discussed here?
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as \co{mmap()} and \co{shmget()}).
In the meantime, there are any number of textbooks that cover
these primitives in great detail,
and the truly motivated can read manpages, existing parallel
applications using these primitives, as well as the
source code of the Linux-kernel implementations themselves.
If the \co{mythread()} function in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
can simply return, why bother with \co{pthread_exit()}?
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where \co{mythread()}
invokes other functions, possibly separately compiled.
In such a case, \co{pthread_exit()} allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to \co{mythread()}.
If the C language makes no guarantees in presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken?
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
What if I want several threads to hold the same lock at the
same time?
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use \co{pthread_create()} to create the other threads.
The question of why this would ever be a good idea is left
to the reader.
Why not simply make the argument to \co{lock_reader()}
on line~5 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
be a pointer to a \co{pthread_mutex_t}?
Because we will need to pass \co{lock_reader()} to
\co{pthread_create()}.
Although we could cast the function when passing it to
\co{pthread_create()}, function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
Writing four lines of code for each acquisition and release
of a \co{pthread_mutex_t} sure seems painful!
Isn't there a better way?
And for that reason, the \co{pthread_mutex_lock()} and
\co{pthread_mutex_unlock()} primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them with the Linux kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
If so, why?
If not, what other output could appear, and why?
The reason that ``x = 0'' was output was that \co{lock_reader()}
acquired the lock first.
Had \co{lock_writer()} instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started \co{lock_reader()} first
and because this run was performed on a multiprocessor,
one would normally expect \co{lock_reader()} to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
Using different locks could cause quite a bit of confusion,
what with threads seeing each others' intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'',
which is currently a research topic.
Transactional-memory semantics can be thought of as those
of a single global lock with optimizations permitted and
with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
In the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
is \co{lock_reader()} guaranteed to see all the values produced
by \co{lock_writer()}?
Why or why not?
On a busy system, \co{lock_reader()} might be preempted
for the entire duration of \co{lock_writer()}'s execution,
in which case it would not see \emph{any} of \co{lock_writer()}'s
intermediate states for \co{x}.
Wait a minute here!!!
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
didn't initialize shared variable \co{x},
so why does it need to be initialized in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
See line~3 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Because the code in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
ran first, it could rely on the compile-time initialization of
\co{x}.
The code in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
ran next, so it had to re-initialize \co{x}.
Isn't comparing against single-CPU throughput a bit harsh?
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
If the data being read \emph{never} changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per
thread, so that a thread read-acquires the larger aggregate
reader-writer lock by acquiring its own lock, and write-acquires
by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}.
This can work quite well for readers, but causes writers
to incur increasingly large overheads as the number of threads
increases.
Some other ways of handling very small critical sections are
described in Section~\ref{sec:defer:Read-Copy Update (RCU)}.
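The per-thread-lock approach might be sketched as follows.
This is a minimal illustration, not the cited implementation: the
\co{NTHREADS} bound and the function names are assumptions made
for the sake of the example.

```c
#include <pthread.h>

#define NTHREADS 4  /* illustrative fixed thread count */

/* One exclusive lock per thread. */
static pthread_mutex_t perthread_lock[NTHREADS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* A reader acquires only its own lock, avoiding contention. */
static void read_lock(int tid)
{
	pthread_mutex_lock(&perthread_lock[tid]);
}

static void read_unlock(int tid)
{
	pthread_mutex_unlock(&perthread_lock[tid]);
}

/* A writer acquires every thread's lock, always in index order
 * so that concurrent writers cannot deadlock. */
static void write_lock(void)
{
	for (int i = 0; i < NTHREADS; i++)
		pthread_mutex_lock(&perthread_lock[i]);
}

static void write_unlock(void)
{
	for (int i = NTHREADS - 1; i >= 0; i--)
		pthread_mutex_unlock(&perthread_lock[i]);
}
```

Note that \co{write_lock()} performs one lock acquisition per
thread, which is exactly the linearly growing writer overhead
described above.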
In Figure~\ref{fig:intro:Reader-Writer Lock Scalability},
all of the traces other than the 100M trace deviate gently
from the ideal line.
In contrast, the 100M trace breaks sharply from the ideal
line at 64 CPUs.
In addition, the spacing between the 100M trace and the 10M
trace is much smaller than that between the 10M trace and the
1M trace.
Why does the 100M trace behave so much differently than the
other traces?
Your first clue is that 64 CPUs is exactly half of the 128
CPUs on the machine.
The difference is an artifact of hardware threading.
This system has 64 cores with two hardware threads per core.
As long as fewer than 64 threads are running, each can run
in its own core.
But as soon as there are more than 64 threads, some of the threads
must share cores.
Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not
quite as high as that of two threads each in their own core.
So the performance of the 100M trace is limited not by the
reader-writer lock, but rather by the sharing of hardware resources
between hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from
the ideal line up to 64 threads, then breaks sharply down, parallel
to the 100M trace.
Up to 64 threads, the 10M trace is limited primarily by reader-writer
lock scalability, and beyond that, also by sharing of hardware
resources between hardware threads in a single core.
Power-5 is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
In general, newer hardware is improving.
However, it will need to improve by more than two orders of magnitude
to permit reader-writer locking to achieve ideal performance on
128 CPUs.
Worse yet, the greater the number of CPUs, the larger the
required performance improvement.
The performance problems of reader-writer locking are therefore
very likely to be with us for quite some time to come.
Is it really necessary to have both sets of primitives?
Strictly speaking, no.
One could implement any member of the second set using the
corresponding member of the first set.
For example, one could implement \co{__sync_nand_and_fetch()}
in terms of \co{__sync_fetch_and_nand()} as follows:
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~ret & tmp;
It is similarly possible to implement \co{__sync_fetch_and_add()},
\co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()}
in terms of their post-value counterparts.
However, the alternative forms can be quite convenient, both
for the programmer and for the compiler/library implementor.
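The same conversion works in the other direction as well.
For example, the ``new value'' form of atomic addition can be
built from the ``old value'' builtin; the function name
\co{my_add_and_fetch()} is an assumption of this sketch:

```c
/* Sketch: build the "new value" form from the "old value" builtin.
 * __sync_fetch_and_add() returns the pre-increment value, so adding
 * the increment back in yields the value that __sync_add_and_fetch()
 * would have returned. */
static long my_add_and_fetch(long *p, long v)
{
	return __sync_fetch_and_add(p, v) + v;
}
```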
Given that these atomic operations will often be able to
generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn't
they be the fastest possible way to get things done?
Unfortunately, no.
See Chapter~\ref{chp:Counting} for some stark counterexamples.
What happened to the Linux-kernel equivalents to \co{fork()}
and \co{join()}?
They don't really exist.
All tasks executing within the Linux kernel share memory,
at least unless you want to do a huge amount of memory-mapping
work by hand.
Why on earth should efficient and scalable counting be hard?
After all, computers have special hardware for the sole purpose
of doing counting,
addition, subtraction, and lots more besides, don't they???
Because the straightforward counting algorithms, for example,
atomic operations on a shared counter, are slow and scale
badly, as will be seen in
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
{ \bfseries Network-packet counting problem. }
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
Hint: the act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate---after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1\% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section~\ref{sec:count:Statistical Counters}.
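A design meeting these constraints --- blazingly fast updates,
slow and infrequent reads --- might be sketched as follows using
gcc's \co{__thread} storage class.
The \co{NR_THREADS} bound and the registration scheme are
illustrative assumptions; the cited section covers the real thing.

```c
#include <stddef.h>

#define NR_THREADS 8  /* illustrative fixed bound */

static __thread unsigned long counter = 0;   /* per-thread count */
static unsigned long *counterp[NR_THREADS];  /* pointers for readers */

/* Updater: a single non-atomic increment of a thread-local
 * variable, so there is no cache-line bouncing. */
static void inc_count(void)
{
	counter++;
}

/* Reader: sum all registered per-thread counters.  Slow, but it
 * runs only once every few seconds, and a slightly stale sum is
 * acceptable by the problem statement. */
static unsigned long read_count(void)
{
	unsigned long sum = 0;

	for (int t = 0; t < NR_THREADS; t++)
		if (counterp[t] != NULL)
			sum += *counterp[t];
	return sum;
}

/* Each thread registers its counter at startup (illustrative). */
static void register_thread(int tid)
{
	counterp[tid] = &counter;
}
```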
{ \bfseries Approximate structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
Hint: the act of updating the counter must be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values below the limit and values greater than or
equal to the limit.
See Section~\ref{sec:count:Approximate Limit Counters}.
{ \bfseries Exact structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(say, 10,000).
Suppose further that these structures are short-lived,
and that the limit is rarely exceeded, that there is almost
always at least one structure in use, and suppose further
still that it is necessary to know exactly when this counter reaches
zero, for example, in order to free up some memory
that is not required unless there is at least one structure
in use.
Hint: the act of updating the counter must be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section~\ref{sec:count:Exact Limit Counters}.
{ \bfseries Removable I/O device access-count problem. }
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to removed the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
Hint: the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between non-zero and zero values.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
But doesn't the \co{++} operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
Although the \co{++} operator \emph{could} be atomic, there
is no requirement that it be so.
Furthermore, the \co{ACCESS_ONCE()} primitive forces most
versions of {\sf gcc} to load the value to a register, increment
the register, then store the value to memory, which is
decidedly non-atomic.
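For reference, \co{ACCESS_ONCE()} can be defined as a volatile
cast, as in the Linux kernel of this era, and the resulting
increment is visibly a three-step sequence:

```c
/* Volatile cast forcing the compiler to emit a real load or store
 * each time, in the style of the classic Linux-kernel definition. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static unsigned long counter = 0;

/* The load, the increment, and the store are three separate steps,
 * so two threads running this concurrently can lose counts. */
static void inc_counter(void)
{
	ACCESS_ONCE(counter) = ACCESS_ONCE(counter) + 1;
}
```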
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
There are no trivial parallel programs, and most days I am
not so sure that there are trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested
it, it does not work.
And even if you have tested it, Murphy says there are at least a
few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their
place, they never will replace testing, including the
\url{counttorture.h} test setup used here.
After all, proofs can have bugs just as easily as programs can!
Why doesn't the dashed line on the x~axis meet the
diagonal line at $y=1$?
Because of the overhead of the atomic operation.
The dashed line on the x~axis represents the overhead of
a single \emph{non-atomic} increment.
After all, an \emph{ideal} algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good
enough for Linus Torvalds, it is good enough for you.
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me, after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessor systems.
In addition, counting provides an excellent view of the
issues encountered in shared-memory parallel programs.
But why can't CPU designers simply ship the operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
It might well be possible to do this in some cases.
However, there are a few complications:
\item If the value of the variable is required, then the
thread will be forced to wait for the operation
to be shipped to the data, and then for the result
to be shipped back.
\item If the atomic increment must be ordered with respect
to prior and/or subsequent operations, then the thread
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation
completed to be shipped back.
\item Shipping operations among CPUs will likely require
more signals, which will consume more die area and
more electrical power.
But what if neither of the first two conditions holds?
Then you should think carefully about the algorithms discussed
in Section~\ref{sec:count:Statistical Counters}, which achieve
near-ideal performance on commodity hardware.
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
If either or both of the first two conditions hold, there is
\emph{some} hope for improvement.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as $O(\log N)$,
where $N$ is the number of CPUs, as shown in
Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
This is a great improvement over the $O(N)$ performance
of current hardware shown in
Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
and it is possible that hardware latencies might decrease
somewhat if innovations such as three-dimensional fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do \emph{much} better.
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
No, because modulo addition is still commutative and associative,
at least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior (never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days).
That said, one potential source of additional complexity arises
when attempting to gather (say) a 64-bit sum from 32-bit
per-thread counters.
For the moment, dealing with this added complexity is left as
an exercise for the reader.
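The commutativity and associativity of modulo addition can be
demonstrated directly: each per-thread counter may wrap
independently, yet the sum of the wrapped counters still equals
the true total modulo $2^{32}$.

```c
#include <stdint.h>

/* Sum fixed-width per-thread counters.  Unsigned arithmetic has
 * well-defined wraparound, so the order in which wrapping happens
 * (in the counters or in the sum) does not affect the result
 * modulo 2^32. */
static uint32_t sum_counters(const uint32_t *ctr, int n)
{
	uint32_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += ctr[i];
	return sum;
}
```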
An array???
But doesn't that limit the number of threads?
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads.
However, this is left as an exercise for the reader.
What other choice does gcc have, anyway???
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a \co{long}.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
How does the per-thread \co{counter} variable in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
get initialized?
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of \co{counter}
will be zero.
That said, one often takes differences of consecutive reads
from statistical counters, in which case the initial value
is irrelevant.
How is the code in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
supposed to permit more than one counter?
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
Why does \co{inc_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
need to use atomic instructions?
If non-atomic instructions were used, counts could be lost.
Won't the single global thread in the function \co{eventual()} of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
be just as severe a bottleneck as a global lock would be?
In this case, no.
What will happen instead is that the estimate of the counter
value returned by \co{read_count()} will become more inaccurate.
Won't the estimate returned by \co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
become increasingly
inaccurate as the number of threads rises?
Yes.
If this proves problematic, one fix is to provide multiple
\co{eventual()} threads, each covering its own subset of
the other threads.
In even more extreme cases, a tree-like hierarchy of
\co{eventual()} threads might be required.
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a \co{per_thread()} interface, similar
to the Linux kernel's \co{per_cpu()} primitive, to allow
threads to more easily access each others' per-thread variables?
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel
gets to ignore.
When a user-level thread exits, its per-thread variables all
disappear, which complicates the problem of per-thread-variable
access, particularly before the advent of user-level RCU.
In contrast, in the Linux kernel, when a CPU goes offline,
that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its
per-thread variables suddenly come into existence.
In contrast, in the Linux kernel, all per-CPU variables are
mapped and initialized at boot time, regardless of whether
the corresponding CPU exists yet, or indeed, whether the
corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time
maximum limit on the number of CPUs, namely, \co{CONFIG_NR_CPUS}.
In contrast, in user space, there is no hard-coded upper limit
on the number of threads.
Of course, both environments must deal with dynamically loaded
code (dynamic libraries in user space, kernel modules in the
Linux kernel), which increases the complexity of per-thread
variables in both environments.
These complications make it significantly harder for user-space
environments to provide access to other threads' per-thread
variables.
Nevertheless, such access is highly useful, and it is hoped
that it will someday appear.
Why on earth do we need something as heavyweight as a \emph{lock}
guarding the summation in the function \co{read_count()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter~\ref{chp:Deferred Processing} will introduce even
lighter-weight mechanisms for implementing the required coordination.
Why on earth do we need to acquire the lock in
\co{count_register_thread()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, no need to get fancy.
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline --- even
if the corresponding CPU never existed and never will exist.
{ \scriptsize
  1 long __thread counter = 0;
  2 long *counterp[NR_THREADS] = { NULL };
  3 int finalthreadcount = 0;
  4 DEFINE_SPINLOCK(final_mutex);
  5
  6 void inc_count(void)
  7 {
  8   counter++;
  9 }
 10
 11 long read_count(void)
 12 {
 13   int t;
 14   long sum = 0;
 15
 16   for_each_thread(t)
 17     if (counterp[t] != NULL)
 18       sum += *counterp[t];
 19   return sum;
 20 }
 21
 22 void count_init(void)
 23 {
 24 }
 25
 26 void count_register_thread(void)
 27 {
 28   counterp[smp_thread_id()] = &counter;
 29 }
 30
 31 void count_unregister_thread(int nthreadsexpected)
 32 {
 33   spin_lock(&final_mutex);
 34   finalthreadcount++;
 35   spin_unlock(&final_mutex);
 36   while (finalthreadcount < nthreadsexpected)
 37     poll(NULL, 0, 1);
 38 }
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
One workaround is to ensure that each thread sticks around
until all threads are finished, as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}.
Analysis of this code is left as an exercise to the reader,
however, please note that it does not fit well into the
\url{counttorture.h} counter-evaluation scheme.
(Why not?)
Chapter~\ref{chp:Deferred Processing} will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter?
Because in the increment-by-one case, the value returned will
be exact in the sense that the counter must necessarily have
taken on that value at some point in time, even if it is impossible
to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might
return values that are inconsistent with any global ordering
of operations.
To see this, suppose that thread~0 adds the value three to its
counter, thread~1 adds the value five to its counter, and
threads~2 and 3 sum the counters.
If the system is ``weakly ordered'' or if the compiler
uses aggressive optimizations, thread~2 might find the
sum to be three and thread~3 might find the sum to be five.
The only possible global orders of the sequence of values
of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone.
Michael Scott used this question to stump Paul McKenney during Paul's
Ph.D. defense.
Given that the reader must sum all the threads' counters,
this could take a long time given large numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
One approach would be to maintain a global approximation
to the value.
Readers would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
The reader is encouraged to think up and try out other approaches,
for example, using a combining tree.
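The global-approximation approach might be sketched as follows.
The \co{FLUSH_LIMIT} threshold and the function names are
illustrative assumptions:

```c
#define FLUSH_LIMIT 1000  /* illustrative flush threshold */

static unsigned long global_count = 0;      /* global approximation */
static __thread unsigned long pending = 0;  /* per-thread portion */

/* Fast path: a thread-local increment.  An atomic operation on the
 * shared variable is needed only once per FLUSH_LIMIT updates,
 * amortizing its cost. */
static void inc_count(void)
{
	if (++pending >= FLUSH_LIMIT) {
		__sync_fetch_and_add(&global_count, pending);
		pending = 0;
	}
}

/* Reader: cheap, but may lag the true count by up to
 * (number of threads) * (FLUSH_LIMIT - 1). */
static unsigned long read_count(void)
{
	return __sync_fetch_and_add(&global_count, 0);
}
```

Raising \co{FLUSH_LIMIT} reduces average increment overhead but
increases the reader's maximum error, which is exactly the
tradeoff described above.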
What is with the strange form of the condition on line~3 of
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not the following more intuitive form of the fastpath?
3 if (counter + delta <= countermax){
4 counter += delta;
5 return 1;
6 }
Two words.
``Integer overflow.''
Try the above formulation with \co{counter} equal to 10 and
\co{delta} equal to \co{ULONG_MAX}.
Then try it again with the code shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
A good understanding of integer overflow will be required for
the rest of this example, so if you have never dealt with
integer overflow before, please try several examples to get
the hang of it.
Integer overflow can sometimes be more difficult to get right
than parallel algorithms!
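The hazard can be demonstrated directly.
The overflow-safe formulation below subtracts before comparing,
which cannot wrap as long as \co{counter} never exceeds
\co{countermax}; it is a sketch of the principle rather than the
figure's exact code:

```c
#include <limits.h>

/* Intuitive check: counter + delta can wrap around when delta is
 * huge, wrongly admitting the addition. */
static int naive_ok(unsigned long counter, unsigned long delta,
		    unsigned long countermax)
{
	return counter + delta <= countermax;
}

/* Overflow-safe check: countermax - counter cannot wrap as long as
 * counter <= countermax, so the comparison is trustworthy. */
static int safe_ok(unsigned long counter, unsigned long delta,
		   unsigned long countermax)
{
	return countermax - counter >= delta;
}
```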
Why does \co{globalize_count()} zero the per-thread variables,
only to later call \co{balance_count()} to refill them in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not just leave the per-thread variables non-zero?
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
Given that \co{globalreserve} counted against us in \co{add_count()},
why doesn't it count for us in \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
The \co{globalreserve} variable tracks the sum of all threads'
\co{countermax} variables.
The sum of these threads' \co{counter} variables might be anywhere
from zero to \co{globalreserve}.
We must therefore take a conservative approach, assuming that
all threads' \co{counter} variables are full in \co{add_count()}
and that they are all empty in \co{sub_count()}.
But remember this question, as we will come back to it later.
Why have both \co{add_count()} and \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not simply pass a negative number to \co{add_count()}?
Given that \co{add_count()} takes an \co{unsigned} \co{long}
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
In what way does line~7 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
violate the C standard?
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
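One standard-conforming alternative is to compute the word size
from \co{<limits.h>}'s \co{CHAR_BIT} rather than assuming eight
bits per byte.
The field-splitting helpers below are an illustrative sketch in
the spirit of \co{split_ctrandmax()}, not the figure's code:

```c
#include <limits.h>

/* Number of bits in an unsigned long, computed portably from
 * CHAR_BIT instead of hard-coding eight bits per byte. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Pack two half-word fields into a single unsigned long. */
static unsigned long merge_fields(unsigned long hi, unsigned long lo)
{
	return (hi << (ULONG_BITS / 2)) | lo;
}

static unsigned long upper_field(unsigned long v)
{
	return v >> (ULONG_BITS / 2);
}

static unsigned long lower_field(unsigned long v)
{
	return v & ((1UL << (ULONG_BITS / 2)) - 1);
}
```

The extra macro machinery is slightly less readable, which hints
at one of the drawbacks the question asks about.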
Given that there is only one \co{ctrandmax} variable,
why bother passing in a pointer to it on line~18 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
There is only one \co{ctrandmax} variable \emph{per thread}.
Later, we will see code that needs to pass other threads'
\co{ctrandmax} variables to \co{split_ctrandmax()}.
Why does \co{merge_ctrandmax()} in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
return an \co{int} rather than storing directly into an
\co{atomic_t}?
Later, we will see that we need the \co{int} return to pass
to the \co{atomic_cmpxchg()} primitive.
Why the ugly \co{goto} on line~11 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
Haven't you heard of the \co{break} statement???
Replacing the \co{goto} with a \co{break} would require keeping
a flag to determine whether or not line~15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the \co{goto} that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
ever fail?
After all, we picked up its old value on line~9 and have not
changed it!
Later, we will see how the \co{flush_local_count()} function in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions}
might update this thread's \co{ctrandmax} variable concurrently
with the execution of the fastpath on lines~8-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
What stops a thread from simply refilling its
\co{ctrandmax} variable immediately after
\co{flush_local_count()} on line 14 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions}
empties it?
This other thread cannot refill its \co{ctrandmax}
until the caller of \co{flush_local_count()} releases the
\co{gblcnt_mutex}.
By that time, the caller of \co{flush_local_count()} will have
finished making use of the counts, so there will be no problem
with this other thread refilling --- assuming that the value
of \co{globalcount} is large enough to permit a refill.
What prevents concurrent execution of the fastpath of either
\co{atomic_add()} or \co{atomic_sub()} from interfering with
the \co{ctrandmax} variable while
\co{flush_local_count()} is accessing it on line~27 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions}?
Consider the following three cases:
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
before the \co{split_ctrandmax()} of either fastpath,
then the fastpath will see a zero \co{counter} and
\co{countermax}, and will thus transfer to the slowpath
(unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{split_ctrandmax()} of either fastpath,
but before that fastpath's \co{atomic_cmpxchg()},
then the \co{atomic_cmpxchg()} will fail, causing the
fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{atomic_cmpxchg()} of either fastpath,
then the fastpath will (most likely) complete successfully
before \co{flush_local_count()} zeroes the thread's
\co{ctrandmax} variable.
Either way, the race is resolved correctly.
Given that the \co{atomic_set()} primitive does a simple
store to the specified \co{atomic_t}, how can line~53 of
\co{balance_count()} in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions}
work correctly in face of concurrent \co{flush_local_count()}
updates to this variable?
The caller of both \co{balance_count()} and
\co{flush_local_count()} hold \co{gblcnt_mutex}, so
only one may be executing at a given time.
But signal handlers can be migrated to some other
CPU while running.
Doesn't this possibility require that atomic instructions
and memory barriers be used to reliably communicate
between a thread and a signal handler that interrupts that
thread?
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
the REQ \co{theft} state colored blue?
To indicate that only the fastpath is permitted to change the
\co{theft} state.
In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
the point of having separate REQ and ACK \co{theft} states?
Why not simplify the state machine by collapsing
them into a single state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
\item The slowpath uses the REQ and ACK states to determine
whether the signal should be retransmitted.
If the states were collapsed, the slowpath would have
no choice but to send redundant signals, which would
have the unhelpful effect of slowing down the fastpath.
\item The following race would result:
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and
notes the REQACK state.
\item The thread receives the signal, which also notes
the REQACK state, and, because there is no fastpath
in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the
count, and sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling
further fastpath execution for this thread.
The basic problem here is that the combined REQACK state
can be referenced by both the signal handler and the
fastpath.
The clear separation maintained by the four-state
setup ensures orderly state transitions.
That said, you might well be able to make a three-state setup
work correctly.
If you do succeed, compare carefully to the four-state setup.
Is the three-state solution really preferable, and why or why not?
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
function \co{flush_local_count_sig()}, why are there
\co{ACCESS_ONCE()} wrappers around the uses of the
\co{theft} per-thread variable?
The first one (on line~11) can be argued to be unnecessary.
The last two (lines~14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines~14-17 as follows:
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
This would be fatal, as the slowpath might see the transient
value of \co{THEFT_READY}, and start stealing before the
corresponding thread was ready.
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why is it safe for line~28 to directly access the other thread's
\co{countermax} variable?
Because the other thread is not permitted to change the value
of its \co{countermax} variable unless it holds the
\co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its \co{countermax} variable.
We can therefore safely access it --- but not change it.
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why doesn't line~33 check for the current thread sending itself
a signal?
There is no need for an additional check.
The caller of \co{flush_local_count()} has already invoked
\co{globalize_count()}, so the check on line~28 will have
succeeded, skipping the later \co{pthread_kill()}.
The code in
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
The \co{theft} variable must be of type \co{sig_atomic_t}
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
\emph{Your} user application hanging!
What if you want an exact limit counter to be exact only for
its lower limit?
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
What else had you better have done when using a biased counter?
You had better have set the upper limit to be large enough to
accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
This is ridiculous!
We are \emph{read}-acquiring a reader-writer lock to
\emph{update} the counter?
What are you playing at???
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
What other issues would need to be accounted for in a real system?
A huge number!
Here are a few to start with:
\item There could be any number of devices, so that the
global variables are inappropriate, as are the
lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems.
In many cases, it is far better to have the last
completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely
need a return value.
\item If the device fails, the last I/O might never complete.
In such cases, there might need to be some sort of
timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can
fail, but their return values are not checked.
\item Reader-writer locks do not scale well.
One way of avoiding the high read-acquisition costs
of reader-writer locks is presented in
Chapter~\ref{chp:Deferred Processing}.
On the \url{count_stat.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-5},
we see that the update side scales linearly with the number of
threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding \co{__thread} variable.
Even on the last row of
Table~\ref{tab:count:Statistical Counter Performance on Power-5},
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
``Use the right tool for the job.''
As can be seen from
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table~\ref{tab:count:Statistical Counter Performance on Power-5}
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, a single atomically incremented
variable that can be read out using a single load.
Given the performance data shown in
Table~\ref{tab:count:Limit Counter Performance on Power-5},
we should always prefer update-side signals over read-side
atomic operations, right?
That depends on the workload.
Note that you need a million readers (with roughly
a 40-nanosecond performance gain) to make up for even one
writer (with almost a 40-\emph{millisecond} performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular workload.
In addition, although memory barriers have historically been
expensive compared to ordinary instructions, you should
check this on the specific hardware you will be running.
The properties of computer hardware do change over time,
and algorithms must change accordingly.
Can advanced techniques be applied to address the lock
contention for readers seen in
Table~\ref{tab:count:Limit Counter Performance on Power-5}?
There are a number of ways one might go about this, and these
are left as exercises for the reader.
The \co{++} operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
In the C++ language, you might well be able to use \co{++}
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism:
\item Only the most performance-critical portions of the
application must be partitioned, and such portions
are usually a small fraction of the application.
\item Although cache misses are quite slow compared to
individual register-to-register instructions,
they are typically considerably faster than
inter-process-communication primitives, which in
turn are considerably faster than things like
TCP/IP networking.
\item Shared-memory multiprocessors are readily available
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of
shared-memory parallelism.
As always, use the right tool for the job!
\QuickQAC{cha:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}
Is there a better solution to the Dining
Philosophers Problem?
\caption{Dining Philosophers Problem, Fully Partitioned}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis}
One such improved solution is shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned},
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, the improved disease control provided by this
approach should not be underestimated.
This solution can seem like cheating to some, but such
``cheating'' is key to finding good solutions to many
concurrency problems.
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
Inman was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
In this case, simply dequeue an item from the now-nonempty
queue, release both locks, and return.
Is the hashed double-ended queue a good solution?
Why or why not?
The best way to answer this is to run \url{lockhdeq.c} on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
% Getting about 500 nanoseconds per element when used as
% a queue on a 4.2GHz Power system. This is roughly the same as
% the version covered by a single lock. Sequential (unlocked)
% variant is more than an order of magnitude faster!
The first well-designed performance study will be cited.
Do not forget to compare to a sequential implementation!
Move \emph{all} the elements to the queue that became empty?
In what possible universe is this braindead solution in any
way optimal???
It is optimal in the case where data flow switches direction only rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises the question as to what possible universe
emptying from both ends concurrently would be a reasonable
thing to do...
Why can't the compound parallel double-ended queue
implementation be symmetric?
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem.
Why is it necessary to retry the right-dequeue operation
on line~29 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
the lock and the time that it reacquired the lock.
Surely the left-hand lock must \emph{sometimes} be available!!!
So why is it necessary that line~26 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
unconditionally release the right-hand lock?
It would be possible to use \co{spin_trylock()} to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple
concurrent operations at each end?
If so, how? If not, why not?
Is there a significantly better way of handling concurrency
for double-ended queues?
Transform the problem to be solved so that multiple double-ended
queues can be used in parallel, allowing the simpler single-lock
double-ended queue to be used, and perhaps also replace each
double-ended queue with a pair of conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can enable very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the deque
simply cannot provide strong ordering guarantees.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide the
guarantees, right?
% about twice as fast as hashed version on 4.2GHz Power.
What are some ways of preventing a structure from being freed while
its lock is being acquired?
Here are a few possible solutions to this \emph{existence guarantee} problem:
\item Provide a statically allocated lock that is held while
the per-structure lock is being acquired, which is an
example of hierarchical locking (see
Section~\ref{sec:SMPdesign:Hierarchical Locking}).
Of course, using a single global lock for this purpose
can result in unacceptably high levels of lock contention,
dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing
the structure's address to select the lock to be acquired,
as described in Chapter~\ref{chp:Locking}.
Given a hash function of sufficiently high quality, this
avoids the scalability limitations of the single global
lock, but in read-mostly situations, the lock-acquisition
overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing
them, so that a structure cannot be deallocated while being referenced.
This works very well, removing the existence-guarantee
burden (and much else besides) from the developer's
shoulders, but imposes the overhead of garbage collection
on the program.
Although garbage-collection technology has advanced
considerably in the past few decades, its overhead
may be unacceptably high for some applications.
In addition, some applications require that the developer
exercise more control over the layout and placement of
data structures than is permitted by most garbage-collected environments.
\item As a special case of a garbage collector, use a global
reference counter, or a global array of reference counters.
\item Use \emph{hazard pointers}~\cite{MagedMichael04a}, which
can be thought of as an inside-out reference count.
Hazard-pointer-based algorithms maintain a per-thread list of
pointers, so that the appearance of a given pointer on
any of these lists acts as a reference to the corresponding structure.
Hazard pointers are an interesting research direction, but
have not yet seen much use in production (written in 2008).
\item Use transactional memory
so that each reference and
modification to the data structure in question is
performed atomically.
Although TM has engendered much excitement in recent years,
and seems likely to be of some use in production software,
developers should exercise some caution,
particularly in performance-critical code.
In particular, existence guarantees require that the
transaction cover the full path from a global reference
to the data elements being updated.
\item Use RCU, which can be thought of as an extremely lightweight
approximation to a garbage collector.
Updaters are not permitted to free RCU-protected
data structures that RCU readers might still be referencing.
RCU is most heavily used for read-mostly data structures,
and is discussed at length in
Chapter~\ref{chp:Deferred Processing}.
For more on providing existence guarantees, see
Chapters~\ref{chp:Locking} and \ref{chp:Deferred Processing}.
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
Figure~\ref{fig:SMPdesign:Matrix Multiply Efficiency}
have efficiency of exactly 1.0 when running on only one thread?
The \texttt{matmul.c} program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
How are data-parallel techniques going to help with matrix multiply?
It is \emph{already} data parallel!!!
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the
size of the partitions.
For example, if you split a 64-by-64 matrix multiply across
64 threads, each thread gets only 64 floating-point multiplies.
The cost of a floating-point multiply is minuscule compared to
the overhead of thread creation.
Moral: If you have a parallel program with variable input,
always include a check for the input size being too small to
be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful
to spawn a single thread, now is it?
In what situation would hierarchical locking work well?
If the comparison on line~31 of
Figure~\ref{fig:SMPdesign:Hierarchical-Locking Hash Table Search}
were replaced by a much heavier-weight operation,
then releasing {\tt bp->bucket\_lock} \emph{might} reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of {\tt cur->node\_lock}.
In Figure~\ref{fig:SMPdesign:Allocator Cache Performance},
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-CPU target
pool size of three, what is the smallest allocation run
length at which failures can occur?
The exact solution to this problem is left as an exercise to
the reader.
The first solution received will be credited to its submitter.
As a rough rule of thumb, the global pool size should be at least
$m+2sn$, where
``m'' is the maximum number of elements allocated at a given time,
``s'' is the per-CPU pool size,
and ``n'' is the number of CPUs.
But the definition of deadlock only said that each thread
was holding at least one lock and waiting on another lock
that was held by some thread.
How do you know that there is a cycle?
Suppose that there is no cycle in the graph.
We would then have a directed acyclic graph (DAG), which would
have at least one leaf node.
If this leaf node was a lock, then we would have a thread
that was waiting on a lock that wasn't held by any thread,
which violates the definition.
(And in this case the thread would immediately acquire the lock.)
On the other hand, if this leaf node was a thread, then
we would have a thread that was not waiting on any lock,
again violating the definition.
(And in this case, the thread would either be running or
blocked on something that is not a lock.)
Therefore, given this definition of deadlock, there must
be a cycle in the corresponding graph.
Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and
the caller, even given that the library code never invokes
any of the caller's functions?
Indeed there can!
Here are a few of them:
\item If one of the library function's arguments is a pointer
to a lock that this library function acquires, and if
the library function holds one of its locks while
acquiring the caller's lock, then we could have a
deadlock cycle involving both caller and library locks.
\item If one of the library functions returns a pointer to
a lock that is acquired by the caller, and if the
caller acquires one of its locks while holding the
library's lock, we could again have a deadlock
cycle involving both caller and library locks.
\item If one of the library functions acquires a lock and
then returns while still holding it, and if the caller
acquires one of its locks, we have yet another way
to create a deadlock cycle involving both caller
and library locks.
\item If the caller has a signal handler that acquires
locks, then the deadlock cycle can involve both
caller and library locks.
In this case, however, the library's locks are
innocent bystanders in the deadlock cycle.
That said, please note that acquiring a lock from
within a signal handler is a no-no in most
environments---it is not just a bad idea, it
is unsupported.
But if \co{qsort()} releases all its locks before invoking
the comparison function, how can it protect against races
with other \co{qsort()} threads?
By privatizing the data elements being compared
(as discussed in Chapter~\ref{chp:Data Ownership})
or through use of deferral mechanisms such as
reference counting (as discussed in
Chapter~\ref{chp:Deferred Processing}).
Name one common exception where it is perfectly reasonable
to pass a pointer to a lock into a function.
Locking primitives, of course!
Doesn't the fact that \co{pthread_cond_wait()} first releases the
mutex and then re-acquires it eliminate the possibility of deadlock?
Absolutely not!
Consider a program that acquires \co{mutex_a}, and then
\co{mutex_b}, in that order, and then passes \co{mutex_a}
to \co{pthread_cond_wait}.
Now, \co{pthread_cond_wait} will release \co{mutex_a}, but
will re-acquire it before returning.
If some other thread acquires \co{mutex_a} in the meantime
and then blocks on \co{mutex_b}, the program will deadlock.
Can the transformation from
Figure~\ref{fig:locking:Protocol Layering and Deadlock} to
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
be applied universally?
Absolutely not!
This transformation assumes that the
\co{layer_2_processing()} function is idempotent, given that
it might be executed multiple times on the same packet when
the \co{layer_1()} routing decision changes.
Therefore, in real life, this transformation can become
arbitrarily complex.
But the complexity in
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
is well worthwhile given that it avoids deadlock, right?
If the routing decision in \co{layer_1()} changes often enough,
the code will always retry, never making forward progress.
This is termed ``livelock'' if no thread makes any forward progress,
or ``starvation'' if some threads make forward progress but others do not
(see Section~\ref{sec:locking:Livelock and Starvation}).
Why is it illegal to acquire a Lock~A that is acquired outside
of a signal handler without blocking signals while holding
a Lock~B that is acquired within a signal handler?
Because this would lead to deadlock.
Given that Lock~A is held outside of a signal
handler without blocking signals, a signal might be handled while
holding this lock.
The corresponding signal handler might then acquire
Lock~B, so that Lock~B is acquired while holding Lock~A.
Therefore, if we also acquire Lock~A while holding Lock~B
as called out in the question, we will have a deadlock cycle.
Therefore, it is illegal to acquire a lock that is acquired outside
of a signal handler without blocking signals while holding
another lock that is acquired within a signal handler.
How can you legally block signals within a signal handler?
One of the simplest and fastest ways to do so is to use
the \co{sa_mask} field of the \co{struct sigaction} that
you pass to \co{sigaction()} when setting up the signal.
Given an object-oriented application that passes control freely
among a group of objects such that there is no reasonable
locking hierarchy, layered or otherwise, how can this
application be parallelized?
There are a number of approaches:
\item In the case of parametric search via simulation,
where a large number of simulations will be run
in order to converge on (for example) a good design
for a mechanical or electrical device, leave the
simulation single-threaded, but run many instances
of the simulation in parallel.
This retains the object-oriented design, and
gains parallelism at a higher level.
\item Partition the objects into groups such that there
is no need to operate on objects in
more than one group at a given time.
Then associate a lock with each group.
This is an example of a single-lock-at-a-time
design, which is discussed in
Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}.
\item Partition the objects into groups such that threads
can all operate on objects in the groups in some
groupwise ordering.
Then associate a lock with each group, and impose a
locking hierarchy over the groups.
\item Impose an arbitrarily selected hierarchy on the locks,
and then use conditional locking if it is necessary
to acquire a lock out of order, as was discussed in
Section~\ref{sec:locking:Conditional Locking}.
\item Before carrying out a given group of operations, predict
which locks will be acquired, and attempt to acquire them
before actually carrying out any updates.
If the prediction turns out to be incorrect, drop
all the locks and retry with an updated prediction
that includes the benefit of experience.
This approach was discussed in
Section~\ref{sec:locking:Acquire Needed Locks First}.
\item Use transactional memory.
This approach has a number of advantages and disadvantages
which will be discussed in
Section~\ref{sec:future:Transactional Memory}.
\item Refactor the application to be more concurrency-friendly.
This would likely also have the side effect of making
the application run faster even when single-threaded, but might
also make it more difficult to modify the application.
\item Use techniques from later chapters in addition to locking.
How can the livelock shown in
Figure~\ref{fig:locking:Abusing Conditional Locking}
be avoided?
This is left as an exercise to the reader.
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
provides some good hints.
In many cases, livelocks are a hint that you should revisit your
locking design.
Or visit it in the first place if your locking design
``just grew''.
What problems can you spot in the code in
Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}?
Here are a couple:
\item A one-second wait is way too long for most uses.
Wait intervals should begin with roughly the time
required to execute the critical section, which will
normally be in the microsecond or millisecond range.
\item The code does not check for overflow.
On the other hand, this bug is nullified
by the previous bug: 32 bits worth of seconds is
more than 50 years.
Wouldn't it be better just to use a good parallel design
so that lock contention was low enough to avoid unfairness?
It would be better in some sense, but there are situations
where it can be appropriate to use
designs that sometimes result in high lock contentions.
For example, imagine a system that is subject to a rare error condition.
It might well be best to have a simple error-handling design
that has poor performance and scalability for the duration of
the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of
those rare error conditions is in effect.
That said, it is usually worth putting some effort into
attempting to produce a design that is both simple and
efficient during error conditions, for example by partitioning
the problem.
How might the lock holder be interfered with?
If the data protected by the lock is in the same cache line
as the lock itself, then attempts by other CPUs to acquire
the lock will result in expensive cache misses on the part
of the CPU holding the lock.
In contrast, if the lock is in a different cache line than
the data that it protects, the CPU holding the lock will
usually suffer a cache miss only on first access to a given variable.
This is a special case of false sharing, which can also occur
if a pair of variables protected by different locks happen
to share a cache line.
Is there any other way for the VAX/VMS DLM to emulate
a reader-writer lock?
There are in fact several.
One way would be to use the null, protected-read, and exclusive modes.
Another way would be to use the null, protected-read, and
concurrent-write modes.
A third way would be to use the null, concurrent-read, and
exclusive modes.
Why bother with the inner loop on lines~7-8 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
Why not simply repeatedly do the atomic exchange operation
on line~6?
Suppose that the lock is held and that several threads
are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic
exchange operation, they will ping-pong the cache line
containing the lock among themselves, imposing load
on the interconnect.
In contrast, if these threads are spinning in the inner
loop on lines~7-8, they will each spin within their own
caches, putting negligible load on the interconnect.
Why not simply store zero into the lock word on line 14 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
This can be a legitimate implementation, but only if
this store is preceded by a memory barrier.
The memory barrier is not required when the \co{xchg()}
operation is used because this operation implies a
full memory barrier due to the fact that it returns
a value.
How can relying on implicit existence guarantees result in
a bug?
Here are some bugs resulting from improper use of implicit
existence guarantees:
\item A program writes the address of a global variable to
a file, then a later instance of that same program
reads that address and attempts to dereference it.
This can fail due to address-space randomization,
to say nothing of recompilation of the program.
\item A module can record the address of one of its variables
in a pointer located in some other module, then attempt
to dereference that pointer after the module has
been unloaded.
\item A function can record the address of one of its on-stack
variables into a global pointer, which some other
function might attempt to dereference after that function
has returned.
I am sure that you can come up with additional possibilities.
What if the element we need to delete is not the first element
of the list on line~8 of
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
This is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is invited to adapt this example to a hash table with
full chaining.
What race condition can occur in
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
Consider the following sequence of events:
\item Thread~0 invokes \co{delete(0)}, and reaches line~10 of
the figure, acquiring the lock.
\item Thread~1 concurrently invokes \co{delete(0)}, and reaches
line~10, but spins on the lock because Thread~0 holds it.
\item Thread~0 executes lines~11-14, removing the element from
the hashtable, releasing the lock, and then freeing the element.
\item Thread~0 continues execution, and allocates memory, getting
the exact block of memory that it just freed.
\item Thread~0 then initializes this block of memory as some
other type of structure.
\item Thread~1's \co{spin_lock()} operation fails due to the
fact that what it believes to be \co{p->lock} is no longer
a spinlock.
Because there is no existence guarantee, the identity of the
data element can change while a thread is attempting to acquire
that element's lock on line~10!
\QuickQAC{chp:Data Ownership}{Data Ownership}
What form of data ownership is extremely difficult
to avoid using when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
What synchronization remains in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
The creation of the threads via the \co{sh} \co{&} operator
and the joining of threads via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
accessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
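For example, the \co{open()}-based mechanism in the list above can be sketched as follows. The lock-file path and helper names are illustrative assumptions, not an established API:

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Sketch of file-based mutual exclusion using open() with
 * O_CREAT | O_EXCL: at most one process can create the lock file,
 * so successful creation grants the "lock". */
static int lockfile_acquire(const char *path)
{
	/* Returns fd >= 0 if the lock is held, or -1 with
	 * errno == EEXIST if some other process holds it. */
	return open(path, O_CREAT | O_EXCL | O_RDWR, 0644);
}

static void lockfile_release(const char *path, int fd)
{
	close(fd);
	unlink(path);	/* removing the file releases the "lock" */
}
```

Note that this scheme is not robust against a lock holder dying without releasing, which is one reason it is pressed into service mainly for coarse-grained coordination.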
Is there any shared data in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent, scoring points
against hapless classmates or colleagues,
and (especially!) avoiding getting anything useful done.
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for the
reader, as is the task of identifying other algorithms with
similar ownership patterns.
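As a hint for that exercise, a single-slot version of such a mailbox scheme might be sketched with C11 atomics as follows. All names, and the one-message-per-mailbox limitation, are illustrative assumptions:

```c
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 4

struct msg {
	int payload;
};

/* Partial data ownership: thread i reads only mailbox[i], but any
 * thread may post into another thread's slot. */
static _Atomic(struct msg *) mailbox[NTHREADS];

/* Post m to thread t's mailbox; fails if the slot is occupied. */
static int demo_post(int t, struct msg *m)
{
	struct msg *expected = NULL;

	return atomic_compare_exchange_strong(&mailbox[t], &expected, m);
}

/* Thread t removes and returns its pending message, if any. */
static struct msg *demo_collect(int t)
{
	return atomic_exchange(&mailbox[t], (struct msg *)NULL);
}
```

A full message-passing system would of course need queues rather than single slots, which is where the exercise begins in earnest.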
What mechanisms other than POSIX signals may be used to ship
functions from one thread to another?
There is a very large number of such mechanisms, including:
\item System V message queues.
\item Shared-memory dequeue (see
Section~\ref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
But none of the data in the \co{eventual()} function shown on
lines~15-32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on line~1
of the figure.
This situation is similar to that described in
Section~\ref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
lines~17 and~18 of the figure.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but only
a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache-line bouncing for the common case of updating counters.
\QuickQAC{chp:Deferred Processing}{Deferred Processing}
Why not implement reference-acquisition using
a simple compare-and-swap operation that only
acquires a reference if the reference counter is
nonzero?
Although this can resolve the race between the release of
the last reference and acquisition of a new reference,
it does absolutely nothing to prevent the data structure
from being freed and reallocated, possibly as some completely
different type of structure.
It is quite likely that the ``simple compare-and-swap
operation'' would give undefined results if applied to the
differently typed structure.
In short, use of atomic operations such as compare-and-swap
absolutely requires either type-safety or existence guarantees.
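For concreteness, here is a userspace sketch of the compare-and-swap loop in question, using C11 atomics. As noted above, it closes the zero-to-nonzero race but provides no existence guarantee whatsoever; the function name is made up:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Acquire a reference only if the counter is nonzero.  If the count
 * has already reached zero, the object may be mid-teardown, so the
 * acquisition must fail.  Nothing here prevents the memory holding
 * *refcnt from being freed and reused out from under us, which is
 * why this pattern still requires type safety or existence
 * guarantees. */
static bool ref_get_not_zero(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	do {
		if (old == 0)
			return false;	/* last reference already gone */
	} while (!atomic_compare_exchange_weak(refcnt, &old, old + 1));
	return true;
}
```

The Linux kernel provides this operation as \co{atomic_inc_not_zero()}, typically used under RCU so that the counter's memory is guaranteed to still be a counter.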
Why isn't it necessary to guard against cases where one CPU
acquires a reference just after another CPU releases the last
reference?
Because a CPU must already hold a reference in order
to legally acquire another reference.
Therefore, if one CPU releases the last reference,
there cannot possibly be any CPU that is permitted
to acquire a new reference.
This same fact allows the non-atomic check in line~22
of Figure~\ref{fig:defer:Linux Kernel kref API}.
If the check on line~22 of
Figure~\ref{fig:defer:Linux Kernel kref API} fails, how
could the check on line~23 possibly succeed?
Suppose that {\tt kref\_put()} is protected by RCU, so
that two CPUs might be executing line~22 concurrently.
Both might see the value ``2'', causing both to then
execute line~23.
One of the two instances of {\tt atomic\_dec\_and\_test()}
will decrement the value to zero and thus return 1.
How can it possibly be safe to non-atomically check for equality
with ``1'' on line~22 of
Figure~\ref{fig:defer:Linux Kernel kref API}?
Remember that it is not legal to call either {\tt kref\_get()}
or {\tt kref\_put()} unless you hold a reference.
If the reference count is equal to ``1'', then there
can't possibly be another CPU authorized to change the
value of the reference count.
Why can't the check for a zero reference count be
made in a simple ``if'' statement with an atomic
increment in its ``then'' clause?
Suppose that the ``if'' condition completed, finding
the reference counter value equal to one.
Suppose that a release operation executes, decrementing
the reference counter to zero and therefore starting
cleanup operations.
But now the ``then'' clause can increment the counter
back to a value of one, allowing the object to be
used after it has been cleaned up.
Why isn't this sequence-lock discussion in Chapter~\ref{chp:Locking},
you know, the one on \emph{locking}?
The sequence-lock mechanism is really a combination of two
separate synchronization mechanisms, sequence counts and
locking.
In fact, the sequence-count mechanism is available separately
in the Linux kernel via the
\co{write_seqcount_begin()} and \co{write_seqcount_end()}
primitives.
However, the combined \co{write_seqlock()} and
\co{write_sequnlock()} primitives are used much more heavily
in the Linux kernel.
More importantly, many more people will understand what you
mean if you say ``sequence lock'' than if you say
``sequence count''.
So this section is entitled ``Sequence Locks'' so that people
will understand what it is about just from the title, and
it appears in the ``Deferred Processing'' chapter because (1) of the
emphasis on the ``sequence count'' aspect of ``sequence locks''
and (2) because a ``sequence lock'' is much more than merely
a lock.
Can you use sequence locks as the only synchronization
mechanism protecting a linked list supporting concurrent
addition, deletion, and search?
One trivial way of accomplishing this is to surround all
accesses, including the read-only accesses, with
\co{write_seqlock()} and \co{write_sequnlock()}.
Of course, this solution also prohibits all read-side
parallelism, and furthermore could just as easily be implemented
using simple locking.
If you do come up with a solution that uses \co{read_seqbegin()}
and \co{read_seqretry()} to protect read-side accesses, make
sure that you correctly handle the following sequence of events:
\item CPU~0 is traversing the linked list, and picks up a pointer
to list element~A.
\item CPU~1 removes element~A from the list and frees it.
\item CPU~2 allocates an unrelated data structure, and gets
the memory formerly occupied by element~A.
In this unrelated data structure, the memory previously
used for element~A's \co{->next} pointer is now occupied
by a floating-point number.
\item CPU~0 picks up what used to be element~A's \co{->next}
pointer, gets random bits, and therefore gets a
segmentation fault.
One way to protect against this sort of problem requires use
of ``type-safe memory'', which will be discussed in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}.
But in that case, you would be using some other synchronization
mechanism in addition to sequence locks!
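To make the read-side retry pattern concrete, here is a minimal userspace seqcount sketch using C11 atomics. It is deliberately simplified: it elides the memory barriers that a real implementation such as the Linux kernel's requires, and all names are made up:

```c
#include <stdatomic.h>

static atomic_ulong seq;	/* even: no writer; odd: writer active */
static int protected_a, protected_b;

static unsigned long demo_read_seqbegin(void)
{
	unsigned long s;

	while ((s = atomic_load(&seq)) & 1)
		continue;	/* writer in progress: wait for even */
	return s;
}

static int demo_read_seqretry(unsigned long s)
{
	return atomic_load(&seq) != s;	/* nonzero: must retry */
}

static void demo_write(int a, int b)
{
	atomic_fetch_add(&seq, 1);	/* odd: readers will retry */
	protected_a = a;
	protected_b = b;
	atomic_fetch_add(&seq, 1);	/* even again: readers proceed */
}

/* Typical read side: retry until a consistent snapshot is obtained. */
static void demo_read(int *a, int *b)
{
	unsigned long s;

	do {
		s = demo_read_seqbegin();
		*a = protected_a;
		*b = protected_b;
	} while (demo_read_seqretry(s));
}
```

Note that the reader's snapshot is discarded whenever a writer intervenes, which is exactly why this mechanism cannot by itself protect pointer traversal of a linked structure.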
Why bother with the check on line~19 of
\co{read_seqbegin()} in
Figure~\ref{fig:defer:Sequence-Locking Implementation}?
Given that a new writer could begin at any time, why not
simply incorporate the check into line~31 of
\co{read_seqretry()}?
That would be a legitimate implementation.
However, it would not save anything to move the check down
to \co{read_seqretry()}: There would be roughly the same number
of instructions.
Furthermore, the reader's accesses from its doomed read-side
critical section could inflict overhead on the writer in
the form of cache misses.
We can avoid these cache misses by placing the check in
\co{read_seqbegin()} as shown on line~19 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}.
What prevents sequence-locking updaters from starving readers?
This is one of the weaknesses of sequence locking, and as a
result, you should use sequence locking only in read-mostly
situations.
Unless of course read-side starvation is acceptable in your
situation, in which case, go wild with the sequence-locking updates!
What if something else serializes writers, so that the lock
is not needed?
In this case, the \co{->lock} field could be omitted, as it
is in \co{seqcount_t} in the Linux kernel.
Why isn't \co{seq} on line 2 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}
\co{unsigned} rather than \co{unsigned long}?
After all, if \co{unsigned} is good enough for the Linux
kernel, shouldn't it be good enough for everyone?
Not at all.
The Linux kernel has a number of special attributes that allow
it to ignore the following sequence of events:
\item Thread 0 executes \co{read_seqbegin()}, picking up
\co{->seq} in line~17, noting that the value is even,
and thus returning to the caller.
\item Thread 0 starts executing its read-side critical section,
but is then preempted for a long time.
\item Other threads repeatedly invoke \co{write_seqlock()} and
\co{write_sequnlock()}, until the value of \co{->seq}
overflows back to the value that Thread~0 fetched.
\item Thread 0 resumes execution, completing its read-side
critical section with inconsistent data.
\item Thread 0 invokes \co{read_seqretry()}, which incorrectly
concludes that Thread~0 has seen a consistent view of
the data protected by the sequence lock.
The Linux kernel uses sequence locking for things that are
updated rarely, with time-of-day information being a case
in point.
This information is updated at most once per millisecond,
so that seven weeks would be required to overflow the counter.
If a kernel thread was preempted for seven weeks, the Linux
kernel's soft-lockup code would be emitting warnings every two
minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries
would be required to overflow, even given an update every
nanosecond.
Therefore, this implementation uses a type for \co{->seq}
that is 64 bits on 64-bit systems.
But doesn't Section~\ref{sec:defer:Sequence Locks}'s seqlock
also permit readers and updaters to get work done concurrently?
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the {\tt read\_seqretry()}
primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently
with a seqlock updater will be discarded and redone.
So seqlock readers can \emph{run} concurrently with updaters,
but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence
of concurrent RCU updaters.
What prevents the {\tt list\_for\_each\_entry\_rcu()} from
getting a segfault if it happens to execute at exactly the same
time as the {\tt list\_add\_rcu()}?
On all systems running Linux, loads from and stores
to pointers are atomic, that is, if a store to a pointer occurs at
the same time as a load from that same pointer, the load will return
either the initial value or the value stored, never some bitwise
mashup of the two.
In addition, the {\tt list\_for\_each\_entry\_rcu()} always proceeds
forward through the list, never looking back.
Therefore, the {\tt list\_for\_each\_entry\_rcu()} will either see
the element being added by {\tt list\_add\_rcu()} or it will not,
but either way, it will see a valid well-formed list.
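The underlying publish-then-subscribe discipline can be sketched in userspace with C11 atomics. This is not the Linux-kernel implementation, merely an illustration of why fully initializing an element before a single atomic pointer store yields a valid list for concurrent readers; all names are made up:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int key;
	struct node *next;
};

static _Atomic(struct node *) head;

/* Analog of list_add_rcu(): the new element is completely
 * initialized *before* the single atomic store that makes it
 * reachable, so a concurrent reader sees either the old list or a
 * fully formed new element, never a half-built one. */
static void demo_add_rcu(struct node *n, int key)
{
	n->key = key;				/* initialize first... */
	n->next = atomic_load(&head);
	atomic_store_explicit(&head, n,		/* ...then publish */
			      memory_order_release);
}

/* Analog of the read-side traversal: forward-only, never looking
 * back, starting from a single atomic pointer load. */
static struct node *demo_find(int key)
{
	struct node *p = atomic_load_explicit(&head,
					      memory_order_consume);

	for (; p != NULL; p = p->next)
		if (p->key == key)
			return p;
	return NULL;
}
```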
Why do we need to pass two pointers into
{\tt hlist\_for\_each\_entry\_rcu()}
when only one is needed for {\tt list\_for\_each\_entry\_rcu()}?
Because in an hlist it is necessary to check for
NULL rather than for encountering the head.
(Try coding up a single-pointer {\tt hlist\_for\_each\_entry\_rcu()}.
If you come up with a nice solution, it would be a very good thing!)
How would you modify the deletion example to permit more than two
versions of the list to be active?
One way of accomplishing this is as shown in
Figure~\ref{fig:defer:Concurrent RCU Deletion}.
{ \centering
1 spin_lock(&mylock);
2 p = search(head, key);
3 if (p == NULL)
4 spin_unlock(&mylock);
5 else {
6 list_del_rcu(&p->list);
7 spin_unlock(&mylock);
8 synchronize_rcu();
9 kfree(p);
10 }
\caption{Concurrent RCU Deletion}
\label{fig:defer:Concurrent RCU Deletion}
Note that this means that multiple concurrent deletions might be
waiting in {\tt synchronize\_rcu()}.
How many RCU versions of a given list can be
active at any given time?
That depends on the synchronization design.
If a semaphore protecting the update is held across the grace period,
then there can be at most two versions, the old and the new.
However, if only the search, the update, and the
{\tt list\_replace\_rcu()} were protected by a lock, then
there could be an arbitrary number of versions active, limited only
by memory and by how many updates could be completed within a
grace period.
But please note that data structures that are updated so frequently
probably are not good candidates for RCU.
That said, RCU can handle high update rates when necessary.
How can RCU updaters possibly delay RCU readers, given that the
{\tt rcu\_read\_lock()} and {\tt rcu\_read\_unlock()}
primitives neither spin nor block?
The modifications undertaken by a given RCU updater will cause the
corresponding CPU to invalidate cache lines containing the data,
forcing the CPUs running concurrent RCU readers to incur expensive
cache misses.
(Can you design an algorithm that changes a data structure
without inflicting expensive cache misses on concurrent readers?
On subsequent readers?)
How the heck do you expect me to believe that RCU has a
100-femtosecond overhead when the clock period at 3GHz is more than
300 \emph{picoseconds}?
First, consider that the inner loop used to
take this measurement is as follows:
1 for (i = 0; i < CSCOUNT_SCALE; i++) {
2 rcu_read_lock();
3 rcu_read_unlock();
4 }
Next, consider the effective definitions of \co{rcu_read_lock()}
and \co{rcu_read_unlock()}:
1 #define rcu_read_lock() do { } while (0)
2 #define rcu_read_unlock() do { } while (0)
Consider also that the compiler does simple optimizations,
allowing it to eliminate the loop entirely.
So the ``measurement'' of 100 femtoseconds is simply the fixed
overhead of the timing measurements divided by the number of
passes through the inner loop containing the calls
to \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
And therefore, this measurement really is in error, in fact,
in error by an arbitrary number of orders of magnitude.
As you can see by the definition of \co{rcu_read_lock()}
and \co{rcu_read_unlock()} above, the actual overhead
is precisely zero.
It certainly is not every day that a timing measurement of
100 femtoseconds turns out to be an overestimate!
Why does both the variability and overhead of rwlock decrease as the
critical-section overhead increases?
Because the contention on the underlying
\co{rwlock_t} decreases as the critical-section overhead
increases.
However, the rwlock overhead will not quite drop to that on a single
CPU because of cache-thrashing overhead.
Is there an exception to this deadlock immunity, and if so,
what sequence of events could lead to deadlock?
One way to cause a deadlock cycle involving
RCU read-side primitives is via the following (illegal) sequence
of statements:
idx = srcu_read_lock(&srcucb);
synchronize_srcu(&srcucb);
srcu_read_unlock(&srcucb, idx);
The \co{synchronize_srcu()} cannot return until all
pre-existing SRCU read-side critical sections complete, but it
is enclosed in an SRCU read-side critical section that cannot
complete until the \co{synchronize_srcu()} returns.
The result is a classic self-deadlock---you get the same
effect when attempting to write-acquire a reader-writer lock
while read-holding it.
Note that this self-deadlock scenario does not apply to
RCU Classic, because the context switch performed by the
\co{synchronize_rcu()} would act as a quiescent state
for this CPU, allowing a grace period to complete.
However, this is if anything even worse, because data used
by the RCU read-side critical section might be freed as a
result of the grace period completing.
In short, do not invoke synchronous RCU update-side primitives
from within an RCU read-side critical section.
But wait!
This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking!
What gives?
This is an effect of the Law of Toy Examples:
beyond a certain point, the code fragments look the same.
The only difference is in how we think about the code.
However, this difference can be extremely important.
For but one example of the importance, consider that if we think
of RCU as a restricted reference counting scheme, we would never
be fooled into thinking that the updates would exclude the RCU
read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement
for reader-writer locking, for example, when you are replacing
reader-writer locking with RCU.
Why the dip in refcnt overhead near 6 CPUs?
Most likely NUMA effects.
However, there is substantial variance in the values measured for the
refcnt line, as can be seen by the error bars.
In fact, standard deviations range in excess of 10\% of measured
values in some cases.
The dip in overhead therefore might well be a statistical aberration.
What if the element we need to delete is not the first element
of the list on line~9 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}?
As with
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees},
this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is again invited to adapt this example to a hash table with
full chaining.
Why is it OK to exit the RCU read-side critical section on
line~15 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~17?
First, please note that the second check on line~14 is
necessary because some other
CPU might have removed this element while we were waiting
to acquire the lock.
However, the fact that we were in an RCU read-side critical section
while acquiring the lock guarantees that this element could not
possibly have been re-allocated and re-inserted into this
hash table.
Furthermore, once we acquire the lock, the lock itself guarantees
the element's existence, so we no longer need to be in an
RCU read-side critical section.
The question as to whether it is necessary to re-check the
element's key is left as an exercise to the reader.
Why not exit the RCU read-side critical section on
line~23 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~22?
Suppose we reverse the order of these two lines.
Then this code is vulnerable to the following sequence of
events:
\item CPU~0 invokes \co{delete()}, and finds the element
to be deleted, executing through line~15.
It has not yet actually deleted the element, but
is about to do so.
\item CPU~1 concurrently invokes \co{delete()}, attempting
to delete this same element.
However, CPU~0 still holds the lock, so CPU~1 waits
for it at line~13.
\item CPU~0 executes lines~16 and 17, and blocks at
line~18 waiting for CPU~1 to exit its RCU read-side
critical section.
\item CPU~1 now acquires the lock, but the test on line~14
fails because CPU~0 has already removed the element.
CPU~1 now executes line~22 (which we switched with line~23
for the purposes of this Quick Quiz)
and exits its RCU read-side critical section.
\item CPU~0 can now return from \co{synchronize_rcu()},
and thus executes line~19, sending the element to
the freelist.
\item CPU~1 now attempts to release a lock for an element
that has been freed, and, worse yet, possibly
reallocated as some other type of data structure.
This is a fatal memory-corruption error.
But what if there is an arbitrarily long series of RCU
read-side critical sections in multiple threads, so that at
any point in time there is at least one thread in the system
executing in an RCU read-side critical section?
Wouldn't that prevent any data from a \co{SLAB_DESTROY_BY_RCU}
slab ever being returned to the system, possibly resulting
in OOM events?
There could certainly be an arbitrarily long period of time
during which at least one thread is always in an RCU read-side
critical section.
However, the key words in the description in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}
are ``in-use'' and ``pre-existing''.
Keep in mind that a given RCU read-side critical section is
conceptually only permitted to gain references to data elements
that were in use at the beginning of that critical section.
Furthermore, remember that a slab cannot be returned to the
system until all of its data elements have been freed, in fact,
the RCU grace period cannot start until after they have all been
freed.
Therefore, the slab cache need only wait for those RCU read-side
critical sections that started before the freeing of the last element
of the slab.
This in turn means that any RCU grace period that begins after
the freeing of the last element will do---the slab may be returned
to the system after that grace period ends.
Suppose that the \co{nmi_profile()} function was preemptible.
What would need to change to make this example work correctly?
One approach would be to use
\co{rcu_read_lock()} and \co{rcu_read_unlock()}
in \co{nmi_profile()}, and to replace the
\co{synchronize_sched()} with \co{synchronize_rcu()},
perhaps as shown in
Figure~\ref{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
{ \tt \scriptsize
1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p;
11 rcu_read_lock();
12 p = rcu_dereference(buf);
13 if (p == NULL) {
14 rcu_read_unlock();
15 return;
16 }
17 if (pcvalue >= p->size) {
18 rcu_read_unlock();
19 return;
20 }
21 atomic_inc(&p->entry[pcvalue]);
22 rcu_read_unlock();
23 }
25 void nmi_stop(void)
26 {
27 struct profile_buffer *p = buf;
29 if (p == NULL)
30 return;
31 rcu_assign_pointer(buf, NULL);
32 synchronize_rcu();
33 kfree(p);
34 }
\caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\label{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
Why do some of the cells in
Table~\ref{tab:defer:RCU Wait-to-Finish APIs}
have exclamation marks (``!'')?
The API members with exclamation marks (\co{rcu_read_lock()},
\co{rcu_read_unlock()}, and \co{call_rcu()}) were the
only members of the Linux RCU API that Paul E. McKenney was aware
of back in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.
How do you prevent a huge number of RCU read-side critical
sections from indefinitely blocking a \co{synchronize_rcu()}
invocation?
There is no need to do anything to prevent RCU read-side
critical sections from indefinitely blocking a
\co{synchronize_rcu()} invocation, because the
\co{synchronize_rcu()} invocation need wait only for
\emph{pre-existing} RCU read-side critical sections.
So as long as each RCU read-side critical section is
of finite duration, there should be no problem.
The \co{synchronize_rcu()} API waits for all pre-existing
interrupt handlers to complete, right?
Absolutely not!
And especially not when using preemptible RCU!
You instead want \co{synchronize_irq()}.
Alternatively, you can place calls to \co{rcu_read_lock()}
and \co{rcu_read_unlock()} in the specific interrupt handlers that
you want \co{synchronize_rcu()} to wait for.
What happens if you mix and match?
For example, suppose you use \co{rcu_read_lock()} and
\co{rcu_read_unlock()} to delimit RCU read-side critical
sections, but then use \co{call_rcu_bh()} to post an
RCU callback?
If there happened to be no RCU read-side critical
sections delimited by \co{rcu_read_lock_bh()} and
\co{rcu_read_unlock_bh()} at the time \co{call_rcu_bh()}
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by \co{rcu_read_lock()}
and \co{rcu_read_unlock()} is vulnerable to this failure mode.
This vulnerability disappears in -rt kernels, where
RCU Classic and RCU BH both map onto a common implementation.
Hardware interrupt handlers can be thought of as being
under the protection of an implicit \co{rcu_read_lock_bh()},
right?
Absolutely not!
And especially not when using preemptible RCU!
If you need to access ``rcu\_bh''-protected data structures
in an interrupt handler, you need to provide explicit calls to
\co{rcu_read_lock_bh()} and \co{rcu_read_unlock_bh()}.
What happens if you mix and match RCU Classic and RCU Sched?
In a non-\co{PREEMPT} or a \co{PREEMPT} kernel, mixing these
two works ``by accident'' because in those kernel builds, RCU Classic
and RCU Sched map to the same implementation.
However, this mixture is fatal in \co{PREEMPT_RT} builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
\co{synchronize_sched()} to return before the
RCU read-side critical section reached its \co{rcu_read_unlock()}.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired
by the need for preemptible RCU read-side critical sections.
In general, you cannot rely on \co{synchronize_sched()} to
wait for all pre-existing interrupt handlers,
right?
That is correct!
Because -rt Linux uses threaded interrupt handlers, there can
be context switches in the middle of an interrupt handler.
Because \co{synchronize_sched()} waits only until each
CPU has passed through a context switch, it can return
before a given interrupt handler completes.
If you need to wait for a given interrupt handler to complete,
you should instead use \co{synchronize_irq()} or place
explicit RCU read-side critical sections in the interrupt
handlers that you wish to wait on.
Why do both SRCU and QRCU lack asynchronous \co{call_srcu()}
or \co{call_qrcu()} interfaces?
Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
\co{synchronize_srcu()} and \co{synchronize_qrcu()}
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.
Under what conditions can \co{synchronize_srcu()} be safely
used within an SRCU read-side critical section?
In principle, you can use
\co{synchronize_srcu()} with a given \co{srcu_struct}
within an SRCU read-side critical section that uses some other
\co{srcu_struct}.
In practice, however, doing this is almost certainly a bad idea.
In particular, the code shown in
Figure~\ref{fig:defer:Multistage SRCU Deadlocks}
could still result in deadlock.
{ \centering
1 idx = srcu_read_lock(&ssa);
2 synchronize_srcu(&ssb);
3 srcu_read_unlock(&ssa, idx);
5 /* . . . */
7 idx = srcu_read_lock(&ssb);
8 synchronize_srcu(&ssa);
9 srcu_read_unlock(&ssb, idx);
\caption{Multistage SRCU Deadlocks}
\label{fig:defer:Multistage SRCU Deadlocks}
Why doesn't \co{list_del_rcu()} poison both the \co{next}
and \co{prev} pointers?
Poisoning the \co{next} pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the \co{prev}
pointer, so it may safely be poisoned.
Normally, any pointer subject to \co{rcu_dereference()} \emph{must}
always be updated using \co{rcu_assign_pointer()}.
What is an exception to this rule?
One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single \co{rcu_assign_pointer()} is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
\co{rcu_assign_pointer()}, though any such assignments that
happen after the structure is globally visible \emph{must} use
\co{rcu_assign_pointer()}.
However, unless this initialization code is on an impressively hot
code-path, it is probably wise to use \co{rcu_assign_pointer()}
anyway, even though it is in theory unnecessary.
It is all too easy for a ``minor'' change to invalidate your cherished
assumptions about the initialization happening privately.
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
It can sometimes be difficult for automated
code checkers such as ``sparse'' (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the code shown in
Figure~\ref{fig:defer:Diverse RCU Read-Side Nesting}.
{ \centering
1 rcu_read_lock();
2 preempt_disable();
3 p = rcu_dereference(global_pointer);
5 /* . . . */
7 preempt_enable();
8 rcu_read_unlock();
\caption{Diverse RCU Read-Side Nesting}
\label{fig:defer:Diverse RCU Read-Side Nesting}
Is the \co{rcu_dereference()} primitive in an RCU Classic
or an RCU Sched critical section?
What would you have to do to figure this out?
Why wouldn't any deadlock in the RCU implementation in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
also be a deadlock in any other RCU implementation?
{ \scriptsize
1 void foo(void)
2 {
3 spin_lock(&my_lock);
4 rcu_read_lock();
5 do_something();
6 rcu_read_unlock();
7 do_something_else();
8 spin_unlock(&my_lock);
9 }
11 void bar(void)
12 {
13 rcu_read_lock();
14 spin_lock(&my_lock);
15 do_some_other_thing();
16 spin_unlock(&my_lock);
17 do_whatever();
18 rcu_read_unlock();
19 }
\caption{Deadlock in Lock-Based RCU Implementation}
\label{fig:defer:Deadlock in Lock-Based RCU Implementation}
Suppose the functions \co{foo()} and \co{bar()} in
Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation}
are invoked concurrently from different CPUs.
Then \co{foo()} will acquire \co{my_lock} on line~3,
while \co{bar()} will acquire \co{rcu_gp_lock} on line~13.
When \co{foo()} advances to line~4, it will attempt to
acquire \co{rcu_gp_lock}, which is held by \co{bar()}.
Then when \co{bar()} advances to line~14, it will attempt
to acquire \co{my_lock}, which is held by \co{foo()}.
Each function is then waiting for a lock that the other
holds, a classic deadlock.
Other RCU implementations neither spin nor block in
\co{rcu_read_lock()}, hence avoiding deadlocks.
Why not simply use reader-writer locks in the RCU implementation
shown in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
in order to allow RCU readers to proceed in parallel?
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel execution of readers.
On the other hand, use of a reader-writer lock that is
read-acquired in \co{rcu_read_lock()} would avoid the
deadlock condition noted above.
Wouldn't it be cleaner to acquire all the locks, and then
release them all in the loop from lines~15-18 of
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}?
After all, with this change, there would be a point in time
when there were no readers, simplifying things greatly.
Making this change would re-introduce the deadlock, so
no, it would not be cleaner.
Is the implementation shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
free from deadlocks?
Why or why not?
One deadlock is where a lock is
held across \co{synchronize_rcu()}, and that same lock is
acquired within an RCU read-side critical section.
However, this situation will deadlock any correctly designed
RCU implementation.
After all, the \co{synchronize_rcu()} primitive must wait for all
pre-existing RCU read-side critical sections to complete,
but if one of those critical sections is spinning on a lock
held by the thread executing the \co{synchronize_rcu()},
we have a deadlock inherent in the definition of RCU.
Another deadlock happens when attempting to nest RCU read-side
critical sections.
This deadlock is peculiar to this implementation, and might
be avoided by using recursive locks, or by using reader-writer
locks that are read-acquired by \co{rcu_read_lock()} and
write-acquired by \co{synchronize_rcu()}.
However, if we exclude the above two cases,
this implementation of RCU does not introduce any deadlock cycles.
This is because the only time some other thread's lock is acquired is when
executing \co{synchronize_rcu()}, and in that case, the lock
is immediately released, prohibiting a deadlock cycle that
does not involve a lock held across the \co{synchronize_rcu()},
which is the first case above.
Isn't one advantage of the RCU algorithm shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
that it uses only primitives that are widely available,
for example, in POSIX pthreads?
This is indeed an advantage, but do not forget that
\co{rcu_dereference()} and \co{rcu_assign_pointer()}
are still required, which means \co{volatile} manipulation
for \co{rcu_dereference()} and memory barriers for
\co{rcu_assign_pointer()}.
Of course, many Alpha CPUs require memory barriers for both.
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
Indeed, this would deadlock any legal RCU implementation.
But is \co{rcu_read_lock()} \emph{really} participating in
the deadlock cycle?
If you believe that it is, then please
ask yourself this same question when looking at the
RCU implementation in
Section~\ref{defer:RCU Based on Quiescent States}.
How can the grace period possibly elapse in 40 nanoseconds when
\co{synchronize_rcu()} contains a 10-millisecond delay?
The update-side test was run in absence of readers, so the
\co{poll()} system call was never invoked.
In addition, the actual code has this \co{poll()}
system call commented out, the better to evaluate the
true overhead of the update-side code.
Any production uses of this code would be better served by
using the \co{poll()} system call, but then again,
production uses would be even better served by other implementations
shown later in this section.
Why not simply make \co{rcu_read_lock()} wait when a concurrent
\co{synchronize_rcu()} has been waiting too long in
the RCU implementation in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}?
Wouldn't that prevent \co{synchronize_rcu()} from starving?
Although this would in fact eliminate the starvation, it would
also mean that \co{rcu_read_lock()} would spin or block waiting
for the writer, which is in turn waiting on readers.
If one of these readers is attempting to acquire a lock that
the spinning/blocking \co{rcu_read_lock()} holds, we again
have deadlock.
In short, the cure is worse than the disease.
See Section~\ref{defer:Starvation-Free Counter-Based RCU}
for a proper cure.
Why the memory barrier on line~5 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
given that there is a spin-lock acquisition immediately after?
The spin-lock acquisition only guarantees that the spin-lock's
critical section will not ``bleed out'' to precede the acquisition.
It in no way guarantees that code preceding the spin-lock
acquisition won't be reordered into the critical section.
Such reordering could cause a removal from an RCU-protected
list to be reordered to follow the complementing of
\co{rcu_idx}, which could allow a newly starting RCU
read-side critical section to see the recently removed
data element.
Exercise for the reader: use a tool such as Promela/spin
to determine which (if any) of the memory barriers in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
are really needed.
See Section~\ref{app:formal:Formal Verification}
for information on using these tools.
The first correct and complete response will be credited.
Why is the counter flipped twice in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}?
Shouldn't a single flip-and-wait cycle be sufficient?
Both flips are absolutely required.
To see this, consider the following sequence of events:
\item Line~8 of \co{rcu_read_lock()} in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
picks up \co{rcu_idx}, finding its value to be zero.
\item Line~8 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
complements the value of \co{rcu_idx}, setting its
value to one.
\item Lines~10-13 of \co{synchronize_rcu()} find that the
value of \co{rcu_refcnt[0]} is zero, and thus returns.
(Recall that the question is asking what happens if
lines~14-20 are omitted.)
\item Lines~9 and 10 of \co{rcu_read_lock()} store the
value zero to this thread's instance of \co{rcu_read_idx}
and increments \co{rcu_refcnt[0]}, respectively.
Execution then proceeds into the RCU read-side critical section.
\label{defer:rcu_rcgp:RCU Read Side Start}
\item Another instance of \co{synchronize_rcu()} again complements
\co{rcu_idx}, this time setting its value to zero.
Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()}
returns immediately.
(Recall that \co{rcu_read_lock()} incremented
\co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!)
\label{defer:rcu_rcgp:RCU Grace Period Start}
\item The grace period that started in
step~\ref{defer:rcu_rcgp:RCU Grace Period Start}
has been allowed to end, despite
the fact that the RCU read-side critical section
that started beforehand in
step~\ref{defer:rcu_rcgp:RCU Read Side Start}
has not completed.
This violates RCU semantics, and could allow the update
to free a data element that the RCU read-side critical
section was still referencing.
Exercise for the reader: What happens if \co{rcu_read_lock()}
is preempted for a very long time (hours!) just after fetching \co{rcu_idx}?
Does this implementation operate correctly in that case?
Why or why not?
The first correct and complete response will be credited.
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line~10 and a
non-atomic decrement on line~25 of
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}?
Using non-atomic operations would cause increments and decrements
to be lost, in turn causing the implementation to fail.
See Section~\ref{defer:Scalable Counter-Based RCU}
for a safe way to use non-atomic operations in
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
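The lost-update failure mode is easy to demonstrate directly. The sketch below (thread count and iteration count arbitrary) races a plain increment against \co{atomic_fetch_add()}; the atomic counter is exact by construction, while the plain counter may, and on most multicore machines will, come up short:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long plain_ctr;		/* non-atomic: increments can be lost */
static atomic_long atomic_ctr;	/* atomic: every increment survives  */

static void *incrementer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++) {
		plain_ctr++;			/* racy load-add-store (deliberately buggy) */
		atomic_fetch_add(&atomic_ctr, 1);	/* indivisible */
	}
	return NULL;
}

static void run_demo(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, incrementer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("atomic: %ld  plain: %ld  (expected %ld)\n",
	       atomic_load(&atomic_ctr), plain_ctr,
	       (long)NTHREADS * NITERS);
}
```

A reference-count-based \co{rcu_read_unlock()} built on the plain variant could thus "lose" a reader, allowing a grace period to end while that reader was still executing.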
Come off it!
We can see the \co{atomic_read()} primitive in \co{rcu_read_lock()}!!!
So why are you trying to pretend that \co{rcu_read_lock()}
contains no atomic operations???
The \co{atomic_read()} primitive does not actually execute
atomic machine instructions, but rather does a normal load
from an \co{atomic_t}.
Its sole purpose is to keep the compiler's type-checking happy.
If the Linux kernel ran on 8-bit CPUs, it would also need to
prevent ``store tearing'', which could happen due to the need
to store a 16-bit pointer with two eight-bit accesses on some
8-bit systems.
But thankfully, it seems that no one runs Linux on 8-bit systems.
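A user-level sketch of the idea, with the type and macro modeled on (but not copied from) the kernel's, shows why \co{atomic_read()} costs no more than an ordinary load:

```c
/* Wrapper type whose only job is to make the compiler complain when
 * someone applies ordinary arithmetic to a variable that is meant to
 * be manipulated atomically. */
typedef struct {
	volatile int counter;
} atomic_t;

/* "Reading" an atomic_t is just a volatile load: a single ordinary
 * machine load, with no lock prefix, no bus locking, and no memory
 * barrier.  The volatile qualifier merely prevents the compiler from
 * caching or tearing the access. */
#define atomic_read(v) ((v)->counter)

/* By contrast, modifying an atomic_t (atomic_inc() and friends in the
 * kernel) really does use atomic machine instructions; only the read
 * side is this cheap. */
```

This is why \co{rcu_read_lock()} in that implementation can legitimately be described as free of atomic operations despite the \co{atomic_read()} in its source.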
Great, if we have $N$ threads, we can have $2N$ ten-millisecond
waits (one set per \co{flip_counter_and_wait()} invocation),
and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete \emph{much} more quickly?
Keep in mind that we only wait for a given thread if that thread
is still in a pre-existing RCU read-side critical section,
and that waiting for one hold-out thread gives all the other
threads a chance to complete any pre-existing RCU read-side
critical sections that they might still be executing.
So the only way that we would wait for $2N$ intervals
would be if the last thread still remained in a pre-existing
RCU read-side critical section despite all the waiting for
all the prior threads.
In short, this implementation will not wait unnecessarily.
However, if you are stress-testing code that uses RCU, you
might want to comment out the \co{poll()} statement in
order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU
read-side critical section.
All of these toy RCU implementations have either atomic operations
in \co{rcu_read_lock()} and \co{rcu_read_unlock()},
or \co{synchronize_rcu()}
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic ($O\left(1\right)$) overheads and latencies?
Special-purpose uniprocessor implementations of RCU can attain
this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}.
If any even value is sufficient to tell \co{synchronize_rcu()}
to ignore a given task, why doesn't line~10 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
simply assign zero to \co{rcu_reader_gp}?
Assigning zero (or any other even-numbered constant)
would in fact work, but assigning the value of
\co{rcu_gp_ctr} can provide a valuable debugging aid,
as it gives the developer an idea of when the corresponding
thread last exited an RCU read-side critical section.
Why are the memory barriers on lines~17 and 29 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
needed?
Aren't the memory barriers inherent in the locking
primitives on lines~18 and 28 sufficient?
These memory barriers are required because the locking
primitives are only guaranteed to confine the critical section.
The locking primitives are under absolutely no obligation
to keep other code from bleeding in to the critical section.
The pair of memory barriers are therefore required to prevent
this sort of code motion, whether performed by the compiler
or by the CPU.
Couldn't the update-side optimization described in
Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods}
be applied to the implementation shown in
Figure~\ref{fig:defer:Free-Running Counter Using RCU}?
Indeed it could, with a few modifications.
This work is left as an exercise for the reader.
Is the possibility of readers being preempted in
line~3 of Figure~\ref{fig:defer:Free-Running Counter Using RCU}
a real problem, in other words, is there a real sequence
of events that could lead to failure?
If not, why not?
If so, what is the sequence of events, and how can the
failure be addressed?
It is a real problem: there is a sequence of events leading to
failure, and there are a number of possible ways of
addressing it.
For more details, see the Quick Quizzes near the end of
Section~\ref{defer:Nestable RCU Based on Free-Running Counter}.
The reason for locating the discussion there is (1) to give you
more time to think about it, and (2) that the nesting support
added in that section greatly reduces the time required to
overflow the counter.
Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having
all this complicated bit manipulation?
The apparent simplicity of the separate per-thread variable
is a red herring.
This approach incurs much greater complexity in the guise
of careful ordering of operations, especially if signal
handlers are to be permitted to contain RCU read-side
critical sections.
But don't take my word for it, code it up and see what you
end up with!
Given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
how could you double the time required to overflow the global
\co{rcu_gp_ctr}?
One way would be to replace the magnitude comparison on
lines~33 and 34 with an inequality check of the per-thread
\co{rcu_reader_gp} variable against the global \co{rcu_gp_ctr}.
Again, given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
It can indeed be fatal.
To see this, consider the following sequence of events:
\item Thread~0 enters \co{rcu_read_lock()}, determines
that it is not nested, and therefore fetches the
value of the global \co{rcu_gp_ctr}.
Thread~0 is then preempted for an extremely long time
(before storing to its per-thread \co{rcu_reader_gp} variable).
\item Other threads repeatedly invoke \co{synchronize_rcu()},
so that the new value of the global \co{rcu_gp_ctr} is now
less than it was when thread~0 fetched it, courtesy of counter wrap.
\item Thread~0 now starts running again, and stores into
its per-thread \co{rcu_reader_gp} variable.
The value it stores is
greater than that of the global \co{rcu_gp_ctr}.
\item Thread~0 acquires a reference to RCU-protected data element~A.
\item Thread 1 now removes the data element~A that thread~0
just acquired a reference to.
\item Thread 1 invokes \co{synchronize_rcu()}, which
increments the global \co{rcu_gp_ctr}.
It then checks all of the per-thread \co{rcu_reader_gp}
variables, but thread~0's value (incorrectly) indicates
that it started after thread~1's call to
\co{synchronize_rcu()}, so thread~1 does not wait
for thread~0 to complete its RCU read-side critical section.
\item Thread 1 then frees up data element~A, which thread~0
is still referencing.
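The heart of this failure is that, once the counter wraps, a stale snapshot can compare as "newer than" the current global counter. The toy below reproduces the sequence deterministically using a deliberately narrow 8-bit counter (the real implementation merely takes far longer to wrap; the specific values are arbitrary):

```c
#include <stdint.h>

/* An 8-bit counter wraps after only 256 grace periods instead of
 * 2^32 or 2^64, but the arithmetic is otherwise identical. */
typedef uint8_t gp_ctr_t;

/* Naive check in the style of the toy synchronize_rcu(): a reader
 * whose snapshot is >= the global counter is treated as having
 * started after the current grace period, and is not waited for. */
static int started_after(gp_ctr_t reader_snap, gp_ctr_t global)
{
	return reader_snap >= global;
}

/* Replay the quiz's sequence of events.  Returns nonzero if the
 * pre-existing reader is (incorrectly) skipped. */
static int wrap_causes_missed_reader(void)
{
	gp_ctr_t global = 200;
	gp_ctr_t snap = global;	/* thread 0 fetches rcu_gp_ctr ...   */

	for (int i = 0; i < 57; i++)	/* ... then sleeps while other   */
		global++;		/* threads run 57 grace periods: */
					/* 200 + 57 wraps around to 1    */

	/* Thread 0 wakes and stores snap into its per-thread variable;
	 * thread 1's synchronize_rcu() now concludes that this
	 * pre-existing reader started "after" it, and declines to wait. */
	return started_after(snap, global);
}
```

With an unwrapped counter the check behaves as intended; only the wrap turns the stale snapshot into an apparently new reader, which is why shrinking the overflow window (as the nesting support does) makes the bug practical rather than theoretical.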