# Scorpiion/Parallel-Programming-Book

 \QuickQAC{chp:Introduction}{Introduction} \QuickQ{} Come on now!!! Parallel programming has been known to be exceedingly hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you playing? \QuickA{} If you really believe that parallel programming is exceedingly hard, then you should have a ready answer to the question Why is parallel programming hard?'' One could list any number of reasons, ranging from deadlocks to race conditions to testing coverage, but the real answer is that {\em it is not really all that hard}. After all, if parallel programming was really so horribly difficult, how could a large number of open-source projects, ranging from Apache to MySQL to the Linux kernel, have managed to master it? A better question might be: ''Why is parallel programming {\em perceived} to be so difficult?'' To see the answer, let's go back to the year 1991. Paul McKenney was walking across the parking lot to Sequent's benchmarking center carrying six dual-80486 Sequent Symmetry CPU boards, when he suddenly realized that he was carrying several times the price of the house he had just purchased.\footnote{ Yes, this sudden realization {\em did} cause him to walk quite a bit more carefully. Why do you ask?} This high cost of parallel systems meant that parallel programming was restricted to a privileged few who worked for an employer who either manufactured or could afford to purchase machines costing upwards of \$100,000 --- in 1991 dollars US. In contrast, in 2006, Paul finds himself typing these words on a dual-core x86 laptop. Unlike the dual-80486 CPU boards, this laptop also contains 2GB of main memory, a 60GB disk drive, a display, Ethernet, USB ports, wireless, and Bluetooth. And the laptop is more than an order of magnitude cheaper than even one of those dual-80486 CPU boards, even before taking inflation into account. Parallel systems have truly arrived. They are no longer the sole domain of a privileged few, but something available to almost everyone. The earlier restricted availability of parallel hardware is the \emph{real} reason that parallel programming is considered so difficult. After all, it is quite difficult to learn to program even the simplest machine if you have no access to it. Since the age of rare and expensive parallel machines is for the most part behind us, the age during which parallel programming is perceived to be mind-crushingly difficult is coming to a close.\footnote{ Parallel programming is in some ways more difficult than sequential programming, for example, parallel validation is more difficult. But no longer mind-crushingly difficult.} \QuickQ{} How could parallel programming \emph{ever} be as easy as sequential programming? \QuickA{} It depends on the programming environment. SQL~\cite{DIS9075SQL92} is an underappreciated success story, as it permits programmers who know nothing about parallelism to keep a large parallel system productively busy. We can expect more variations on this theme as parallel computers continue to become cheaper and more readily available. For example, one possible contender in the scientific and technical computing arena is MATLAB*P, which is an attempt to automatically parallelize common matrix operations. Finally, on Linux and UNIX systems, consider the following shell command: {\tt get\_input | grep "interesting" | sort} This shell pipeline runs the \co{get_input}, \co{grep}, and \co{sort} processes in parallel. There, that wasn't so hard, now was it? \QuickQ{} Oh, really??? 
What about correctness, maintainability, robustness, and so on? \QuickA{} These are important goals, but they are just as important for sequential programs as they are for parallel programs. Therefore, important though they are, they do not belong on a list specific to parallel programming. \QuickQ{} And if correctness, maintainability, and robustness don't make the list, why do productivity and generality? \QuickA{} Given that parallel programming is perceived to be much harder than is sequential programming, productivity is tantamount and therefore must not be omitted. Furthermore, high-productivity parallel-programming environments such as SQL have been special purpose, hence generality must also be added to the list. \QuickQ{} Given that parallel programs are much harder to prove correct than are sequential programs, again, shouldn't correctness \emph{really} be on the list? \QuickA{} From an engineering standpoint, the difficulty in proving correctness, either formally or informally, would be important insofar as it impacts the primary goal of productivity. So, in cases where correctness proofs are important, they are subsumed under the productivity'' rubric. \QuickQ{} What about just having fun? \QuickA{} Having fun is important as well, but, unless you are a hobbyist, would not normally be a \emph{primary} goal. On the other hand, if you \emph{are} a hobbyist, go wild! \QuickQ{} Are there no cases where parallel programming is about something other than performance? \QuickA{} There are certainly cases where the problem to be solved is inherently parallel, for example, Monte Carlo methods and some numerical computations. Even in these cases, however, there will be some amount of extra work managing the parallelism. \QuickQ{} Why all this prattling on about non-technical issues??? And not just \emph{any} non-technical issue, but \emph{productivity} of all things? Who cares? \QuickA{} If you are a pure hobbyist, perhaps you don't need to care. But even pure hobbyists will often care about how much they can get done, and how quickly. After all, the most popular hobbyist tools are usually those that are the best suited for the job, and an important part of the definition of best suited'' involves productivity. And if someone is paying you to write parallel code, they will very likely care deeply about your productivity. And if the person paying you cares about something, you would be most wise to pay at least some attention to it! Besides, if you \emph{really} didn't care about productivity, you would be doing it by hand rather than using a computer! \QuickQ{} Given how cheap parallel hardware has become, how can anyone afford to pay people to program it? \QuickA{} There are a number of answers to this question: \begin{enumerate} \item Given a large computational cluster of parallel machines, the aggregate cost of the cluster can easily justify substantial developer effort, because the development cost can be spread over the large number of machines. \item Popular software that is run by tens of millions of users can easily justify substantial developer effort, as the cost of this development can be spread over the tens of millions of users. Note that this includes things like kernels and system libraries. \item If the low-cost parallel machine is controlling the operation of a valuable piece of equipment, then the cost of this piece of equipment might easily justify substantial developer effort. 
\item If the software for the low-cost parallel produces an extremely valuable result (e.g., mineral exploration), then the valuable result might again justify substantial developer cost. \item Safety-critical systems protect lives, which can clearly justify very large developer effort. \item Hobbyists and researchers might seek knowledge, experience, fun, or glory rather than mere money. \end{enumerate} So it is not the case that the decreasing cost of hardware renders software worthless, but rather that it is no longer possible to hide'' the cost of software development within the cost of the hardware, at least not unless there are extremely large quantities of hardware. \QuickQ{} This is a ridiculously unachievable ideal! Why not focus on something that is achievable in practice? \QuickA{} This is eminently achievable. The cellphone is a computer that can be used to make phone calls and to send and receive text messages with little or no programming or configuration on the part of the end user. This might seem to be a trivial example at first glance, but if you consider it carefully you will see that it is both simple and profound. When we are willing to sacrifice generality, we can achieve truly astounding increases in productivity. Those who cling to generality will therefore fail to set the productivity bar high enough to succeed in production environments. \QuickQ{} What other bottlenecks might prevent additional CPUs from providing additional performance? \QuickA{} There are any number of potential bottlenecks: \begin{enumerate} \item Main memory. If a single thread consumes all available memory, additional threads will simply page themselves silly. \item Cache. If a single thread's cache footprint completely fills any shared CPU cache(s), then adding more threads will simply thrash the affected caches. \item Memory bandwidth. If a single thread consumes all available memory bandwidth, additional threads will simply result in additional queuing on the system interconnect. \item I/O bandwidth. If a single thread is I/O bound, adding more threads will simply result in them all waiting in line for the affected I/O resource. \end{enumerate} Specific hardware systems may have any number of additional bottlenecks. \QuickQ{} What besides CPU cache capacity might require limiting the number of concurrent threads? \QuickA{} There are any number of potential limits on the number of threads: \begin{enumerate} \item Main memory. Each thread consumes some memory (for its stack if nothing else), so that excessive numbers of threads can exhaust memory, resulting in excessive paging or memory-allocation failures. \item I/O bandwidth. If each thread initiates a given amount of mass-storage I/O or networking traffic, excessive numbers of threads can result in excessive I/O queuing delays, again degrading performance. Some networking protocols may be subject to timeouts or other failures if there are so many threads that networking events cannot be responded to in a timely fashion. \item Synchronization overhead. For many synchronization protocols, excessive numbers of threads can result in excessive spinning, blocking, or rollbacks, thus degrading performance. \end{enumerate} Specific applications and platforms may have any number of additional limiting factors. \QuickQ{} Are there any other obstacles to parallel programming? \QuickA{} There are a great many other potential obstacles to parallel programming. 
Here are a few of them: \begin{enumerate} \item The only known algorithms for a given project might be inherently sequential in nature. In this case, either avoid parallel programming (there being no law saying that your project \emph{has} to run in parallel) or invent a new parallel algorithm. \item The project allows binary-only plugins that share the same address space, such that no one developer has access to all of the source code for the project. Because many parallel bugs, including deadlocks, are global in nature, such binary-only plugins pose a severe challenge to current software development methodologies. This might well change, but for the time being, all developers of parallel code sharing a given address space need to be able to see \emph{all} of the code running in that address space. \item The project contains heavily used APIs that were designed without regard to parallelism. Some of the more ornate features of the System V message-queue API form a case in point. Of course, if your project has been around for a few decades, and if its developers did not have access to parallel hardware, your project undoubtedly has at least its share of such APIs. \item The project was implemented without regard to parallelism. Given that there are a great many techniques that work extremely well in a sequential environment, but that fail miserably in parallel environments, if your project ran only on sequential hardware for most of its lifetime, then your project undoubtably has at least its share of parallel-unfriendly code. \item The project was implemented without regard to good software-development practice. The cruel truth is that shared-memory parallel environments are often much less forgiving of sloppy development practices than are sequential environments. You may be well-served to clean up the existing design and code prior to attempting parallelization. \item The people who originally did the development on your project have since moved on, and the people remaining, while well able to maintain it or add small features, are unable to make big animal'' changes. In this case, unless you can work out a very simple way to parallelize your project, you will probably be best off leaving it sequential. That said, there are a number of simple approaches that you might use to parallelize your project, including running multiple instances of it, using a parallel implementation of some heavily used library function, or making use of some other parallel project, such as a database. \end{enumerate} One can argue that many of these obstacles are non-technical in nature, but that does not make them any less real. In short, parallelization can be a large and complex effort. As with any large and complex effort, it makes sense to do your homework beforehand. \QuickQ{} Where are the answers to the Quick Quizzes found? \QuickA{} In Appendix~\ref{chp:Answers to Quick Quizzes} starting on page~\pageref{chp:Answers to Quick Quizzes}. Hey, I thought I owed you an easy one! \QuickQ{} Some of the Quick Quiz questions seem to be from the viewpoint of the reader rather than the author. Is that really the intent? \QuickA{} Indeed it is! Many are modeled after Paul---just ask anyone who has had the misfortune of being assigned to teach him. Others are quite similar to actual questions that have been asked during conference presentations and lectures covering the material in this book. Still others are from the viewpoint of the author. \QuickQ{} These Quick Quizzes just are not my cup of tea. 
What do you recommend? \QuickA{} There are a number of alternatives available to you: \begin{enumerate} \item Just ignore the Quick Quizzes and read the rest of the book. You might miss out on the interesting material in some of the Quick Quizzes, but the rest of the book has lots of good material as well. \item If you prefer a more academic and rigorous treatment of parallel programming, you might like Herlihy's and Shavit's textbook~\cite{HerlihyShavit2008Textbook}. This book starts with an interesting combination of low-level primitives at high levels of abstraction from the hardware, and works its way through locking and simple data structures including lists, queues, hash tables, and counters, culminating with transactional memory. \item If you would like an academic treatment of parallel programming from a programming-language-pragmatics viewpoint, you might be interested in the concurrency chapter from Scott's textbook~\cite{MichaelScott2006Textbook} on programming languages. \item If you are interested in an object-oriented patternist treatment of parallel programming focussing on C++, you might try Volumes~2 and 4 of Schmidt's POSA series~\cite{SchmidtStalRohnertBuschmann2000v2Textbook, BuschmannHenneySchmidt2007v4Textbook}. Volume 4 in particular has some interesting chapters applying this work to a warehouse application. The realism of this example is attested to by the section entitled Partitioning the Big Ball of Mud'', wherein the problems inherent in parallelism often take a back seat to the problems inherent in getting one's head around a real-world application. \item If your primary focus is scientific and technical computing, and you prefer a patternist approach, you might try Mattson et al.'s textbook~\cite{Mattson2005Textbook}. It covers Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design, then on implementation. \item If you are interested in POSIX Threads, you might take a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}. \item If you are interested in C++, but in a Windows environment, you might try Herb Sutter's Effective Concurrency'' series in Dr. Dobbs Journal~\cite{HerbSutter2008EffectiveConcurrency}. This series does a reasonable job of presenting a commonsense approach to parallelism. \item If you want to try out Intel Threading Building Blocks, then perhaps James Reinders's book~\cite{Reinders2007Textbook} is what you are looking for. \item Finally, those preferring to work in Java might be well-served by Doug Lea's textbooks~\cite{DougLea1997Textbook,Goetz2007Textbook}. \end{enumerate} In contrast, this book meshes real-world machines with real-world algorithms. If your sole goal is to find an optimal parallel queue, you might be better served by one of the above books. However, if you are interested in principles of parallel design that allow multiple such queues to operate in parallel, read on! \QuickQAC{chp:Hardware and its Habits}{Hardware and its Habits} \QuickQ{} Why should parallel programmers bother learning low-level properties of the hardware? Wouldn't it be easier, better, and more general to remain at a higher level of abstraction? \QuickA{} It might well be easier to ignore the detailed properties of the hardware, but in most cases it would be quite foolish to do so. 
If you accept that the only purpose of parallelism is to increase performance, and if you further accept that performance depends on detailed properties of the hardware, then it logically follows that parallel programmers are going to need to know at least a few hardware properties. This is the case in most engineering disciplines. Would \emph{you} want to use a bridge designed by an engineer who did not understand the properties of the concrete and steel making up that bridge? If not, why would you expect a parallel programmer to be able to develop competent parallel software without at least \emph{some} understanding of the underlying hardware? \QuickQ{} What types of machines would allow atomic operations on multiple data elements? \QuickA{} One answer to this question is that it is often possible to pack multiple elements of data into a single machine word, which can then be manipulated atomically. A more trendy answer would be machines supporting transactional memory~\cite{DBLomet1977SIGSOFT}. However, such machines are still (as of 2008) research curiosities. The jury is still out on the applicability of transactional memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT, ChistopherJRossbach2007a}. \QuickQ{} This is a \emph{simplified} sequence of events? How could it \emph{possibly} be any more complex? \QuickA{} This sequence ignored a number of possible complications, including: \begin{enumerate} \item Other CPUs might be concurrently attempting to perform CAS operations involving this same cacheline. \item The cacheline might have been replicated read-only in several CPUs' caches, in which case, it would need to be flushed from their caches. \item CPU~7 might have been operating on the cache line when the request for it arrived, in which case CPU~7 would need to hold off the request until its own operation completed. \item CPU~7 might have ejected the cacheline from its cache (for example, in order to make room for other data), so that by the time that the request arrived, the cacheline was on its way to memory. \item A correctable error might have occurred in the cacheline, which would then need to be corrected at some point before the data was used. \end{enumerate} Production-quality cache-coherence mechanisms are extremely complicated due to these sorts of considerations. \QuickQ{} Why is it necessary to flush the cacheline from CPU~7's cache? \QuickA{} If the cacheline was not flushed from CPU~7's cache, then CPUs~0 and 7 might have different values for the same set of variables in the cacheline. This sort of incoherence would greatly complicate parallel software, and so hardware architects have been convinced to avoid it. \QuickQ{} Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations? \QuickA{} The hardware designers \emph{have} been working on this problem, and have consulted with no less a luminary than the physicist Stephen Hawking. Hawking's observation was that the hardware designers have two basic problems~\cite{BryanGardiner2007}: \begin{enumerate} \item the finite speed of light, and \item the atomic nature of matter.
\end{enumerate} \begin{table} \centering \begin{tabular}{l||r|r} Operation & Cost (ns) & Ratio \\ \hline \hline Clock period & 0.4 & 1.0 \\ \hline Best-case'' CAS & 12.2 & 33.8 \\ \hline Best-case lock & 25.6 & 71.2 \\ \hline Single cache miss & 12.9 & 35.8 \\ \hline CAS cache miss & 7.0 & 19.4 \\ \hline Off-Core & & \\ \hline Single cache miss & 31.2 & 86.6 \\ \hline CAS cache miss & 31.2 & 86.5 \\ \hline Off-Socket & & \\ \hline Single cache miss & 92.4 & 256.7 \\ \hline CAS cache miss & 95.9 & 266.4 \\ \hline Comms Fabric & 4,500 & 7,500 \\ \hline Global Comms & 195,000,000 & 324,000,000 \\ \end{tabular} \caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} \label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} \end{table} The first problem limits raw speed, and the second limits miniaturization, which in turn limits frequency. And even this sidesteps the power-consumption issue that is currently holding production frequencies to well below 10 GHz. Nevertheless, some progress is being made, as may be seen by comparing Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} with Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System} on page~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}. Integration of hardware threads in a single core and multiple cores on a die have improved latencies greatly, at least within the confines of a single core or single die. There has been some improvement in overall system latency, but only by about a factor of two. Unfortunately, neither the speed of light nor the atomic nature of matter has changed much in the past few years. Section~\ref{sec:cpu:Hardware Free Lunch?} looks at what else hardware designers might be able to do to ease the plight of parallel programmers. \QuickQ{} These numbers are insanely large! How can I possibly get my head around them? \QuickA{} Get a roll of toilet paper. In the USA, each roll will normally have somewhere around 350-500 sheets. Tear off one sheet to represent a single clock cycle, setting it aside. Now unroll the rest of the roll. The resulting pile of toilet paper will likely represent a single CAS cache miss. For the more-expensive inter-system communications latencies, use several rolls (or multiple cases) of toilet paper to represent the communications latency. Important safety tip: make sure to account for the needs of those you live with when appropriating toilet paper! \QuickQ{} Given that distributed-systems communication is so horribly expensive, why does anyone bother with them? \QuickA{} There are a number of reasons: \begin{enumerate} \item Shared-memory multiprocessor systems have strict size limits. If you need more than a few thousand CPUs, you have no choice but to use a distributed system. \item Extremely large shared-memory systems tend to be quite expensive and to have even longer cache-miss latencies than does the small four-CPU system shown in Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}. \item The distributed-systems communications latencies do not necessarily consume the CPU, which can often allow computation to proceed in parallel with message transfer. 
\item Many important problems are ``embarrassingly parallel'', so that extremely large quantities of processing may be enabled by a very small number of messages. SETI@HOME~\cite{SETIatHOME2008} is but one example of such an application. These sorts of applications can make good use of networks of computers despite extremely long communications latencies. \end{enumerate} It is likely that continued work on parallel applications will increase the number of embarrassingly parallel applications that can run well on machines and/or clusters having long communications latencies. That said, greatly reduced hardware latencies would be an extremely welcome development. \QuickQAC{chp:Tools of the Trade}{Tools of the Trade} \QuickQ{} But this silly shell script isn't a \emph{real} parallel program! Why bother with such trivia??? \QuickA{} Because you should \emph{never} forget the simple stuff! Please keep in mind that the title of this book is ``Is Parallel Programming Hard, And, If So, What Can You Do About It?''. One of the most effective things you can do about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel programming the hard way, you have no one but yourself to blame for it being hard. \QuickQ{} Is there a simpler way to create a parallel shell script? If so, how? If not, why not? \QuickA{} One straightforward approach is the shell pipeline: \vspace{5pt} \begin{minipage}[t]{\columnwidth} \small \begin{verbatim} grep $pattern1 | sed -e 's/a/b/' | sort \end{verbatim} \end{minipage} \vspace{5pt} For a sufficiently large input file, \co{grep} will pattern-match in parallel with \co{sed} editing and with the input processing of \co{sort}. See the file \co{parallel.sh} for a demonstration of shell-script parallelism and pipelining. \QuickQ{} But if script-based parallel programming is so easy, why bother with anything else? \QuickA{} In fact, it is quite likely that a very large fraction of parallel programs in use today are script-based. However, script-based parallelism does have its limitations: \begin{enumerate} \item Creation of new processes is usually quite heavyweight, involving the expensive \co{fork()} and \co{exec()} system calls. \item Sharing of data, including pipelining, typically involves expensive file I/O. \item The reliable synchronization primitives available to scripts also typically involve expensive file I/O. \end{enumerate} These limitations require that script-based parallelism use coarse-grained parallelism, with each unit of work having execution time of at least tens of milliseconds, and preferably much longer. Those requiring finer-grained parallelism are well advised to think hard about their problem to see if it can be expressed in a coarse-grained form. If not, they should consider using other parallel-programming environments, such as those discussed in Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}. \QuickQ{} Why does this \co{wait()} primitive need to be so complicated? Why not just make it work like the shell-script \co{wait} does? \QuickA{} Some parallel applications need to take special action when specific children exit, and therefore need to wait for each child individually. In addition, some parallel applications need to detect the reason that the child died. As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive}, it is not hard to build a \co{waitall()} function out of the \co{wait()} function, but it would be impossible to do the reverse. Once the information about a specific child is lost, it is lost.
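To make the \co{waitall()} point concrete, the following is a minimal sketch (\emph{not} the code from the figure cited above) that builds a \co{waitall()} from \co{wait()} by reaping children until \co{wait()} reports, via \co{ECHILD}, that no children remain: \vspace{5pt} \begin{minipage}[t]{\columnwidth} \small \begin{verbatim}
/* Illustrative sketch only: reap children until none remain. */
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>

static void waitall(void)
{
  int status;

  for (;;) {
    if (wait(&status) < 0) {
      if (errno == ECHILD)
        break;          /* no children left to wait for */
      perror("wait");   /* some other error */
      break;
    }
    /* Per-child status (WIFEXITED(status), etc.)
       could be examined here. */
  }
}
\end{verbatim} \end{minipage} \vspace{5pt} The reverse construction is impossible for exactly the reason given above: once \co{wait()} has reaped a given child, that child's individual exit status is gone.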
\QuickQ{} Isn't there a lot more to \co{fork()} and \co{wait()} than discussed here? \QuickA{} Indeed there is, and it is quite possible that this section will be expanded in future versions to include messaging features (such as UNIX pipes, TCP/IP, and shared file I/O) and memory mapping (such as \co{mmap()} and \co{shmget()}). In the meantime, there are any number of textbooks that cover these primitives in great detail, and the truly motivated can read manpages, existing parallel applications using these primitives, as well as the source code of the Linux-kernel implementations themselves. \QuickQ{} If the \co{mythread()} function in Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory} can simply return, why bother with \co{pthread_exit()}? \QuickA{} In this simple example, there is no reason whatsoever. However, imagine a more complex example, where \co{mythread()} invokes other functions, possibly separately compiled. In such a case, \co{pthread_exit()} allows these other functions to end the thread's execution without having to pass some sort of error return all the way back up to \co{mythread()}. \QuickQ{} If the C language makes no guarantees in presence of a data race, then why does the Linux kernel have so many data races? Are you trying to tell me that the Linux kernel is completely broken??? \QuickA{} Ah, but the Linux kernel is written in a carefully selected superset of the C language that includes special gcc extensions, such as asms, that permit safe execution even in presence of data races. In addition, the Linux kernel does not run on a number of platforms where data races would be especially problematic. For an example, consider embedded systems with 32-bit pointers and 16-bit busses. On such a system, a data race involving a store to and a load from a given pointer might well result in the load returning the low-order 16 bits of the old value of the pointer concatenated with the high-order 16 bits of the new value of the pointer. \QuickQ{} What if I want several threads to hold the same lock at the same time? \QuickA{} The first thing you should do is to ask yourself why you would want to do such a thing. If the answer is because I have a lot of data that is read by many threads, and only occasionally updated'', then POSIX reader-writer locks might be what you are looking for. These are introduced in Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}. Another way to get the effect of multiple threads holding the same lock is for one thread to acquire the lock, and then use \co{pthread_create()} to create the other threads. The question of why this would ever be a good idea is left to the reader. \QuickQ{} Why not simply make the argument to \co{lock_reader()} on line~5 of Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks} be a pointer to a \co{pthread_mutex_t}? \QuickA{} Because we will need to pass \co{lock_reader()} to \co{pthread_create()}. Although we could cast the function when passing it to \co{pthread_create()}, function casts are quite a bit uglier and harder to get right than are simple pointer casts. \QuickQ{} Writing four lines of code for each acquisition and release of a \co{pthread_mutex_t} sure seems painful! Isn't there a better way? \QuickA{} Indeed! And for that reason, the \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()} primitives are normally wrapped in functions that do this error checking. Later on, we will wrapper them with the Linux kernel \co{spin_lock()} and \co{spin_unlock()} APIs. 
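As a concrete illustration of the preceding answer, here is a minimal sketch of such error-checking wrappers. The \co{spin_lock()}/\co{spin_unlock()} names follow the Linux-kernel-style API mentioned above, but the bodies are purely illustrative and are not claimed to match this book's actual wrappers: \vspace{5pt} \begin{minipage}[t]{\columnwidth} \small \begin{verbatim}
/* Illustrative error-checking wrappers for pthread_mutex_t. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef pthread_mutex_t spinlock_t;

static void spin_lock(spinlock_t *lp)
{
  int en = pthread_mutex_lock(lp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
    abort();
  }
}

static void spin_unlock(spinlock_t *lp)
{
  int en = pthread_mutex_unlock(lp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
    abort();
  }
}
\end{verbatim} \end{minipage} \vspace{5pt} Callers then acquire and release the lock with a single line apiece, and any error aborts the program rather than being silently ignored.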
\QuickQ{} Is x = 0'' the only possible output from the code fragment shown in Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}? If so, why? If not, what other output could appear, and why? \QuickA{} No. The reason that x = 0'' was output was that \co{lock_reader()} acquired the lock first. Had \co{lock_writer()} instead acquired the lock first, then the output would have been x = 3''. However, because the code fragment started \co{lock_reader()} first and because this run was performed on a multiprocessor, one would normally expect \co{lock_reader()} to acquire the lock first. However, there are no guarantees, especially on a busy system. \QuickQ{} Using different locks could cause quite a bit of confusion, what with threads seeing each others' intermediate states. So should well-written parallel programs restrict themselves to using a single lock in order to avoid this kind of confusion? \QuickA{} Although it is sometimes possible to write a program using a single global lock that both performs and scales well, such programs are exceptions to the rule. You will normally need to use multiple locks to attain good performance and scalability. One possible exception to this rule is transactional memory'', which is currently a research topic. Transactional-memory semantics can be thought of as those of a single global lock with optimizations permitted and with the addition of rollback~\cite{HansJBoehm2009HOTPAR}. \QuickQ{} In the code shown in Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}, is \co{lock_reader()} guaranteed to see all the values produced by \co{lock_writer()}? Why or why not? \QuickA{} No. On a busy system, \co{lock_reader()} might be preempted for the entire duration of \co{lock_writer()}'s execution, in which case it would not see \emph{any} of \co{lock_writer()}'s intermediate states for \co{x}. \QuickQ{} Wait a minute here!!! Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock} didn't initialize shared variable \co{x}, so why does it need to be initialized in Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}? \QuickA{} See line~3 of Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}. Because the code in Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock} ran first, it could rely on the compile-time initialization of \co{x}. The code in Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks} ran next, so it had to re-initialize \co{x}. \QuickQ{} Isn't comparing against single-CPU throughput a bit harsh? \QuickA{} Not at all. In fact, this comparison was, if anything, overly lenient. A more balanced comparison would be against single-CPU throughput with the locking primitives commented out. \QuickQ{} But 1,000 instructions is not a particularly small size for a critical section. What do I do if I need a much smaller critical section, for example, one containing only a few tens of instructions? \QuickA{} If the data being read \emph{never} changes, then you do not need to hold any locks while accessing it. If the data changes sufficiently infrequently, you might be able to checkpoint execution, terminate all threads, change the data, then restart at the checkpoint. Another approach is to keep a single exclusive lock per thread, so that a thread read-acquires the larger aggregate reader-writer lock by acquiring its own lock, and write-acquires by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}. 
This can work quite well for readers, but causes writers to incur increasingly large overheads as the number of threads increases. Some other ways of handling very small critical sections are described in Section~\ref{sec:defer:Read-Copy Update (RCU)}. \QuickQ{} In Figure~\ref{fig:intro:Reader-Writer Lock Scalability}, all of the traces other than the 100M trace deviate gently from the ideal line. In contrast, the 100M trace breaks sharply from the ideal line at 64 CPUs. In addition, the spacing between the 100M trace and the 10M trace is much smaller than that between the 10M trace and the 1M trace. Why does the 100M trace behave so differently from the other traces? \QuickA{} Your first clue is that 64 CPUs is exactly half of the 128 CPUs on the machine. The difference is an artifact of hardware threading. This system has 64 cores with two hardware threads per core. As long as fewer than 64 threads are running, each can run in its own core. But as soon as there are more than 64 threads, some of the threads must share cores. Because the pair of threads in any given core share some hardware resources, the throughput of two threads sharing a core is not quite as high as that of two threads each in their own core. So the performance of the 100M trace is limited not by the reader-writer lock, but rather by the sharing of hardware resources between hardware threads in a single core. This can also be seen in the 10M trace, which deviates gently from the ideal line up to 64 threads, then breaks sharply down, parallel to the 100M trace. Up to 64 threads, the 10M trace is limited primarily by reader-writer lock scalability, and beyond that, also by sharing of hardware resources between hardware threads in a single core. \QuickQ{} Power-5 is several years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow? \QuickA{} In general, newer hardware is improving. However, it will need to improve more than two orders of magnitude to permit reader-writer locks to achieve ideal performance on 128 CPUs. Worse yet, the greater the number of CPUs, the larger the required performance improvement. The performance problems of reader-writer locking are therefore very likely to be with us for quite some time to come. \QuickQ{} Is it really necessary to have both sets of primitives? \QuickA{} Strictly speaking, no. One could implement any member of the second set using the corresponding member of the first set. For example, one could implement \co{__sync_nand_and_fetch()} in terms of \co{__sync_fetch_and_nand()} as follows: \vspace{5pt} \begin{minipage}[t]{\columnwidth} \scriptsize \begin{verbatim} tmp = v; ret = __sync_fetch_and_nand(p, tmp); ret = ~ret & tmp; \end{verbatim} \end{minipage} \vspace{5pt} It is similarly possible to implement \co{__sync_fetch_and_add()}, \co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()} in terms of their post-value counterparts. However, the alternative forms can be quite convenient, both for the programmer and for the compiler/library implementor. \QuickQ{} Given that these atomic operations will often be able to generate single atomic instructions that are directly supported by the underlying instruction set, shouldn't they be the fastest possible way to get things done? \QuickA{} Unfortunately, no. See Chapter~\ref{chp:Counting} for some stark counterexamples. \QuickQ{} What happened to the Linux-kernel equivalents to \co{fork()} and \co{join()}? \QuickA{} They don't really exist.
All tasks executing within the Linux kernel share memory, at least unless you want to do a huge amount of memory-mapping work by hand. \QuickQAC{chp:Counting}{Counting} \QuickQ{} Why on earth should efficient and scalable counting be hard? After all, computers have special hardware for the sole purpose of doing counting, addition, subtraction, and lots more besides, don't they??? \QuickA{} Because the straightforward counting algorithms, for example, atomic operations on a shared counter, are slow and scale badly, as will be seen in Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}. \QuickQ{} { \bfseries Network-packet counting problem. } Suppose that you need to collect statistics on the number of networking packets (or total number of bytes) transmitted and/or received. Packets might be transmitted or received by any CPU on the system. Suppose further that this large machine is capable of handling a million packets per second, and that there is a systems-monitoring package that reads out the count every five seconds. How would you implement this statistical counter? \QuickA{} Hint: the act of updating the counter must be blazingly fast, but because the counter is read out only about once in five million updates, the act of reading out the counter can be quite slow. In addition, the value read out normally need not be all that accurate---after all, since the counter is updated a thousand times per millisecond, we should be able to work with a value that is within a few thousand counts of the true value'', whatever true value'' might mean in this context. However, the value read out should maintain roughly the same absolute error over time. For example, a 1\% error might be just fine when the count is on the order of a million or so, but might be absolutely unacceptable once the count reaches a trillion. See Section~\ref{sec:count:Statistical Counters}. \QuickQ{} { \bfseries Approximate structure-allocation limit problem. } Suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds a limit (say, 10,000). Suppose further that these structures are short-lived, that the limit is rarely exceeded, and that a sloppy'' approximate limit is acceptable. \QuickA{} Hint: the act of updating the counter must be blazingly fast, but the counter is read out each time that the counter is increased. However, the value read out need not be accurate \emph{except} that it absolutely must distinguish perfectly between values below the limit and values greater than or equal to the limit. See Section~\ref{sec:count:Approximate Limit Counters}. \QuickQ{} { \bfseries Exact structure-allocation limit problem. } Suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds an exact limit (say, 10,000). Suppose further that these structures are short-lived, and that the limit is rarely exceeded, that there is almost always at least one structure in use, and suppose further still that it is necessary to know exactly when this counter reaches zero, for example, in order to free up some memory that is not required unless there is at least one structure in use. \QuickA{} Hint: the act of updating the counter must be blazingly fast, but the counter is read out each time that the counter is increased. 
However, the value read out need not be accurate \emph{except} that it absolutely must distinguish perfectly between values between the limit and zero on the one hand, and values that either are less than or equal to zero or are greater than or equal to the limit on the other hand. See Section~\ref{sec:count:Exact Limit Counters}. \QuickQ{} { \bfseries Removable I/O device access-count problem. } Suppose that you need to maintain a reference count on a heavily used removable mass-storage device, so that you can tell the user when it is safe to remove the device. This device follows the usual removal procedure where the user indicates a desire to remove the device, and the system tells the user when it is safe to do so. \QuickA{} Hint: the act of updating the counter must be blazingly fast and scalable in order to avoid slowing down I/O operations, but because the counter is read out only when the user wishes to remove the device, the counter read-out operation can be extremely slow. Furthermore, there is no need to be able to read out the counter at all unless the user has already indicated a desire to remove the device. In addition, the value read out need not be accurate \emph{except} that it absolutely must distinguish perfectly between non-zero and zero values. However, once it has read out a zero value, it must act to keep the value at zero until it has taken some action to prevent subsequent threads from gaining access to the device being removed. See Section~\ref{sec:count:Applying Specialized Parallel Counters}. \QuickQ{} But doesn't the \co{++} operator produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic? \QuickA{} Although the \co{++} operator \emph{could} be atomic, there is no requirement that it be so. Furthermore, the \co{ACCESS_ONCE()} primitive forces most versions of {\sf gcc} to load the value to a register, increment the register, then store the value to memory, which is decidedly non-atomic. \QuickQ{} The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection? \QuickA{} There are no trivial parallel programs, and most days I am not so sure that there are trivial sequential programs, either. No matter how small or simple the program, if you haven't tested it, it does not work. And even if you have tested it, Murphy says there are at least a few bugs still lurking. Furthermore, while proofs of correctness certainly do have their place, they never will replace testing, including the \url{counttorture.h} test setup used here. After all, proofs can have bugs just as easily as programs can! \QuickQ{} Why doesn't the dashed line on the x~axis meet the diagonal line at $y=1$? \QuickA{} Because of the overhead of the atomic operation. The dashed line on the x~axis represents the overhead of a single \emph{non-atomic} increment. After all, an \emph{ideal} algorithm would not only scale linearly, it would also incur no performance penalty compared to single-threaded code. This level of idealism may seem severe, but if it is good enough for Linus Torvalds, it is good enough for you. \QuickQ{} But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me; after all, most of the program's execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster?
\QuickA{} In many cases, atomic increment will in fact be fast enough for you. In those cases, you should by all means use atomic increment. That said, there are many real-world situations where more elaborate counting algorithms are required. The canonical example of such a situation is counting packets and bytes in highly optimized networking stacks, where it is all too easy to find much of the execution time going into these sorts of accounting tasks, especially on large multiprocessors. In addition, counting provides an excellent view of the issues encountered in shared-memory parallel programs. \QuickQ{} But why can't CPU designers simply ship the operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented? \QuickA{} It might well be possible to do this in some cases. However, there are a few complications: \begin{enumerate} \item If the value of the variable is required, then the thread will be forced to wait for the operation to be shipped to the data, and then for the result to be shipped back. \item If the atomic increment must be ordered with respect to prior and/or subsequent operations, then the thread will be forced to wait for the operation to be shipped to the data, and for an indication that the operation completed to be shipped back. \item Shipping operations among CPUs will likely require more signals, which will consume more die area and more electrical power. \end{enumerate} But what if neither of the first two conditions holds? Then you should think carefully about the algorithms discussed in Section~\ref{sec:count:Statistical Counters}, which achieve near-ideal performance on commodity hardware. \begin{figure}[tb] \begin{center} \resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}} \end{center} \caption{Data Flow For Global Combining-Tree Atomic Increment} \label{fig:count:Data Flow For Global Combining-Tree Atomic Increment} \end{figure} If either or both of the first two conditions hold, there is \emph{some} hope for improvement. One could imagine the hardware implementing a combining tree, so that the increment requests from multiple CPUs are combined by the hardware into a single addition when the combined request reaches the hardware. The hardware could also apply an order to the requests, thus returning to each CPU the return value corresponding to its particular atomic increment. This results in instruction latency that varies as $O(\log N)$, where $N$ is the number of CPUs, as shown in Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}. This is a great improvement over the $O(N)$ performance of current hardware shown in Figure~\ref{fig:count:Data Flow For Global Atomic Increment}, and it is possible that hardware latencies might decrease somewhat if innovations such as three-D fabrication prove practical. Nevertheless, we will see that in some important special cases, software can do \emph{much} better. \QuickQ{} But doesn't the fact that C's ``integers'' are limited in size complicate things? \QuickA{} No, because modulo addition is still commutative and associative. At least as long as you use unsigned integers. Recall that in the C standard, overflow of signed integers results in undefined behavior (never mind the fact that machines that do anything other than wrap on overflow are quite rare these days). That said, one potential source of additional complexity arises when attempting to gather (say) a 64-bit sum from 32-bit per-thread counters.
For the moment, dealing with this added complexity is left as an exercise for the reader. \QuickQ{} An array??? But doesn't that limit the number of threads? \QuickA{} It can, and in this toy implementation, it does. But it is not that hard to come up with an alternative implementation that permits an arbitrary number of threads. However, this is left as an exercise for the reader. \QuickQ{} What other choice does gcc have, anyway??? \QuickA{} According to the C standard, the effects of fetching a variable that might be concurrently modified by some other thread are undefined. It turns out that the C standard really has no other choice, given that C must support (for example) eight-bit architectures which are incapable of atomically loading a \co{long}. An upcoming version of the C standard aims to fill this gap, but until then, we depend on the kindness of the gcc developers. \QuickQ{} How does the per-thread \co{counter} variable in Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters} get initialized? \QuickA{} The C standard specifies that the initial value of global variables is zero, unless they are explicitly initialized. So the initial value of all the instances of \co{counter} will be zero. That said, one often takes differences of consecutive reads from statistical counters, in which case the initial value is irrelevant. \QuickQ{} How is the code in Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters} supposed to permit more than one counter? \QuickA{} Indeed, this toy example does not support more than one counter. Modifying it so that it can provide multiple counters is left as an exercise to the reader. \QuickQ{} Why does \co{inc_count()} in Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters} need to use atomic instructions? \QuickA{} If non-atomic instructions were used, counts could be lost. \QuickQ{} Won't the single global thread in the function \co{eventual()} of Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters} be just as severe a bottleneck as a global lock would be? \QuickA{} In this case, no. What will happen instead is that the estimate of the counter value returned by \co{read_count()} will become more inaccurate. \QuickQ{} Won't the estimate returned by \co{read_count()} in Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters} become increasingly inaccurate as the number of threads rises? \QuickA{} Yes. If this proves problematic, one fix is to provide multiple \co{eventual()} threads, each covering its own subset of the other threads. In even more extreme cases, a tree-like hierarchy of \co{eventual()} threads might be required. \QuickQ{} Why do we need an explicit array to find the other threads' counters? Why doesn't gcc provide a \co{per_thread()} interface, similar to the Linux kernel's \co{per_cpu()} primitive, to allow threads to more easily access each others' per-thread variables? \QuickA{} Why indeed? To be fair, gcc faces some challenges that the Linux kernel gets to ignore. When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread-variable access, particularly before the advent of user-level RCU. In contrast, in the Linux kernel, when a CPU goes offline, that CPU's per-CPU variables remain mapped and accessible. Similarly, when a new user-level thread is created, its per-thread variables suddenly come into existence. 
In contrast, in the Linux kernel, all per-CPU variables are mapped and initialized at boot time, regardless of whether the corresponding CPU exists yet, or indeed, whether the corresponding CPU will ever exist. A key limitation that the Linux kernel imposes is a compile-time maximum limit on the number of CPUs, namely, \co{CONFIG_NR_CPUS}. In contrast, in user space, there is no hard-coded upper limit on the number of threads. Of course, both environments must deal with dynamically loaded code (dynamic libraries in user space, kernel modules in the Linux kernel), which increases the complexity of per-thread variables in both environments. These complications make it significantly harder for user-space environments to provide access to other threads' per-thread variables. Nevertheless, such access is highly useful, and it is hoped that it will someday appear. \QuickQ{} Why on earth do we need something as heavyweight as a \emph{lock} guarding the summation in the function \co{read_count()} in Figure~\ref{fig:count:Per-Thread Statistical Counters}? \QuickA{} Remember, when a thread exits, its per-thread variables disappear. Therefore, if we attempt to access a given thread's per-thread variables after that thread exits, we will get a segmentation fault. The lock coordinates summation and thread exit, preventing this scenario. Of course, we could instead read-acquire a reader-writer lock, but Chapter~\ref{chp:Deferred Processing} will introduce even lighter-weight mechanisms for implementing the required coordination. \QuickQ{} Why on earth do we need to acquire the lock in \co{count_register_thread()} in Figure~\ref{fig:count:Per-Thread Statistical Counters}? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right? \QuickA{} This lock could in fact be omitted, but better safe than sorry, especially given that this function is executed only at thread startup, and is therefore not on any critical path. Now, if we were testing on machines with thousands of CPUs, we might need to omit the lock, but on machines with only'' a hundred or so CPUs, no need to get fancy. \QuickQ{} Fine, but the Linux kernel doesn't have to acquire a lock when reading out the aggregate value of per-CPU counters. So why should user-space code need to do this??? \QuickA{} Remember, the Linux kernel's per-CPU variables are always accessible, even if the corresponding CPU is offline --- even if the corresponding CPU never existed and never will exist. 
\begin{figure}[tbp] { \scriptsize \begin{verbatim} 1 long __thread counter = 0; 2 long *counterp[NR_THREADS] = { NULL }; 3 int finalthreadcount = 0; 4 DEFINE_SPINLOCK(final_mutex); 5 6 void inc_count(void) 7 { 8 counter++; 9 } 10 11 long read_count(void) 12 { 13 int t; 14 long sum = 0; 15 16 for_each_thread(t) 17 if (counterp[t] != NULL) 18 sum += *counterp[t]; 19 return sum; 20 } 21 22 void count_init(void) 23 { 24 } 25 26 void count_register_thread(void) 27 { 28 counterp[smp_thread_id()] = &counter; 29 } 30 31 void count_unregister_thread(int nthreadsexpected) 32 { 33 spin_lock(&final_mutex); 34 finalthreadcount++; 35 spin_unlock(&final_mutex); 36 while (finalthreadcount < nthreadsexpected) 37 poll(NULL, 0, 1); 38 } \end{verbatim} } \caption{Per-Thread Statistical Counters With Lockless Summation} \label{fig:count:Per-Thread Statistical Counters With Lockless Summation} \end{figure} One workaround is to ensure that each thread sticks around until all threads are finished, as shown in Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}. Analysis of this code is left as an exercise to the reader, however, please note that it does not fit well into the \url{counttorture.h} counter-evaluation scheme. (Why not?) Chapter~\ref{chp:Deferred Processing} will introduce synchronization mechanisms that handle this situation in a much more graceful manner. \QuickQ{} What fundamental difference is there between counting packets and counting the total number of bytes in the packets, given that the packets vary in size? \QuickA{} When counting packets, the counter is only incremented by the value one. On the other hand, when counting bytes, the counter might be incremented by largish numbers. Why does this matter? Because in the increment-by-one case, the value returned will be exact in the sense that the counter must necessarily have taken on that value at some point in time, even if it is impossible to say precisely when that point occurred. In contrast, when counting bytes, two different threads might return values that are inconsistent with any global ordering of operations. To see this, suppose that thread~0 adds the value three to its counter, thread~1 adds the value five to its counter, and threads~2 and 3 sum the counters. If the system is weakly ordered'' or if the compiler uses aggressive optimizations, thread~2 might find the sum to be three and thread~3 might find the sum to be five. The only possible global orders of the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is consistent with the results obtained. If you missed this one, you are not alone. Michael Scott used this question to stump Paul McKenney during Paul's Ph.D. defense. \QuickQ{} Given that the reader must sum all the threads' counters, this could take a long time given large numbers of threads. Is there any way that the increment operation can remain fast and scalable while allowing readers to also enjoy reasonable performance and scalability? \QuickA{} One approach would be to maintain a global approximation to the value. Readers would increment their per-thread variable, but when it reached some predefined limit, atomically add it to a global variable, then zero their per-thread variable. This would permit a tradeoff between average increment overhead and accuracy of the value read out. The reader is encouraged to think up and try out other approaches, for example, using a combining tree. 
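The first approach described in the preceding answer can be sketched in a few lines of code. The following is purely illustrative: the \co{COUNT_FLUSH_LIMIT} constant and the variable names are made up for this sketch rather than taken from this book's counting implementations. Each thread accumulates counts in a \co{__thread} variable and atomically flushes them into a global variable once a threshold is reached, so that readers need only load the global variable, at the cost of a bounded inaccuracy. \vspace{5pt} \begin{minipage}[t]{\columnwidth} \small \begin{verbatim}
/* Illustrative flush-on-threshold statistical counter. */
#define COUNT_FLUSH_LIMIT 1000

unsigned long global_count;             /* approximate total */
static __thread unsigned long my_count; /* per-thread buffer */

static void inc_count(void)
{
  if (++my_count >= COUNT_FLUSH_LIMIT) {
    __sync_fetch_and_add(&global_count, my_count);
    my_count = 0;
  }
}

static unsigned long read_count_approx(void)
{
  /* May lag the true value by up to COUNT_FLUSH_LIMIT
     per thread; a production version would use
     ACCESS_ONCE() or an atomic load here. */
  return global_count;
}
\end{verbatim} \end{minipage} \vspace{5pt} Raising \co{COUNT_FLUSH_LIMIT} reduces the frequency of atomic operations on \co{global_count} but increases the error that readers may observe, which is exactly the tradeoff between increment overhead and read-out accuracy described above.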
\QuickQ{} What is with the strange form of the condition on line~3 of Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}? Why not the following more intuitive form of the fastpath?

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  3 if (counter + delta <= countermax) {
  4   counter += delta;
  5   return 1;
  6 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

\QuickA{} Two words. ``Integer overflow.''

Try the above formulation with \co{counter} equal to 10 and \co{delta} equal to \co{ULONG_MAX}. Then try it again with the code shown in Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.

A good understanding of integer overflow will be required for the rest of this example, so if you have never dealt with integer overflow before, please try several examples to get the hang of it. Integer overflow can sometimes be more difficult to get right than parallel algorithms!

\QuickQ{} Why does \co{globalize_count()} zero the per-thread variables, only to later call \co{balance_count()} to refill them in Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}? Why not just leave the per-thread variables non-zero?

\QuickA{} That is in fact what an earlier version of this code did. But addition and subtraction are extremely cheap, and handling all of the special cases that arise is quite complex. Again, feel free to try it yourself, but beware of integer overflow!

\QuickQ{} Given that \co{globalreserve} counted against us in \co{add_count()}, why doesn't it count for us in \co{sub_count()} in Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?

\QuickA{} The \co{globalreserve} variable tracks the sum of all threads' \co{countermax} variables. The sum of these threads' \co{counter} variables might be anywhere from zero to \co{globalreserve}. We must therefore take a conservative approach, assuming that all threads' \co{counter} variables are full in \co{add_count()} and that they are all empty in \co{sub_count()}.

But remember this question, as we will come back to it later.

\QuickQ{} Why have both \co{add_count()} and \co{sub_count()} in Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}? Why not simply pass a negative number to \co{add_count()}?

\QuickA{} Given that \co{add_count()} takes an \co{unsigned} \co{long} as its argument, it is going to be a bit tough to pass it a negative number. And unless you have some anti-matter memory, there is little point in allowing negative numbers when counting the number of structures in use!

\QuickQ{} In what way does line~7 of Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions} violate the C standard?

\QuickA{} It assumes eight bits per byte. This assumption does hold for all current commodity microprocessors that can be easily assembled into shared-memory multiprocessors, but certainly does not hold for all computer systems that have ever run C code. (What could you do instead in order to comply with the C standard? What drawbacks would it have?)

\QuickQ{} Given that there is only one \co{ctrandmax} variable, why bother passing in a pointer to it on line~18 of Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?

\QuickA{} There is only one \co{ctrandmax} variable \emph{per thread}. Later, we will see code that needs to pass other threads' \co{ctrandmax} variables to \co{split_ctrandmax()}.
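As background for the next few answers, here is a minimal sketch, not the code from the figure, of how a \co{counter} and its \co{countermax} might be packed into and extracted from a single machine word; the macro and function names and the half-and-half split are illustrative assumptions, and using \co{CHAR_BIT} avoids the eight-bits-per-byte assumption called out above.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <limits.h>

#define CM_BITS (sizeof(unsigned long) * CHAR_BIT / 2)
#define CM_MASK ((1UL << CM_BITS) - 1)

/* Extract counter (upper half) and countermax (lower half). */
static void split_cm(unsigned long cami, unsigned long *c,
                     unsigned long *cm)
{
  *c = cami >> CM_BITS;
  *cm = cami & CM_MASK;
}

/* Pack counter and countermax back into a single word. */
static unsigned long merge_cm(unsigned long c, unsigned long cm)
{
  return (c << CM_BITS) | (cm & CM_MASK);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

The code in the figure instead keeps the packed value in an \co{atomic_t}, so that it can be updated with \co{atomic_cmpxchg()}, as the following questions discuss.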
\QuickQ{} Why does \co{merge_ctrandmax()} in Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions} return an \co{int} rather than storing directly into an \co{atomic_t}?

\QuickA{} Later, we will see that we need the \co{int} return to pass to the \co{atomic_cmpxchg()} primitive.

\QuickQ{} Yecch! Why the ugly \co{goto} on line~11 of Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}? Haven't you heard of the \co{break} statement???

\QuickA{} Replacing the \co{goto} with a \co{break} would require keeping a flag to determine whether or not line~15 should return, which is not the sort of thing you want on a fastpath. If you really hate the \co{goto} that much, your best bet would be to pull the fastpath into a separate function that returned success or failure, with ``failure'' indicating a need for the slowpath. This is left as an exercise for goto-hating readers.

\QuickQ{} Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract} ever fail? After all, we picked up its old value on line~9 and have not changed it!

\QuickA{} Later, we will see how the \co{flush_local_count()} function in Figure~\ref{fig:count:Atomic Limit Counter Utility Functions} might update this thread's \co{ctrandmax} variable concurrently with the execution of the fastpath on lines~8-14 of Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.

\QuickQ{} What stops a thread from simply refilling its \co{ctrandmax} variable immediately after \co{flush_local_count()} on line 14 of Figure~\ref{fig:count:Atomic Limit Counter Utility Functions} empties it?

\QuickA{} This other thread cannot refill its \co{ctrandmax} until the caller of \co{flush_local_count()} releases the \co{gblcnt_mutex}. By that time, the caller of \co{flush_local_count()} will have finished making use of the counts, so there will be no problem with this other thread refilling --- assuming that the value of \co{globalcount} is large enough to permit a refill.

\QuickQ{} What prevents concurrent execution of the fastpath of either \co{atomic_add()} or \co{atomic_sub()} from interfering with the \co{ctrandmax} variable while \co{flush_local_count()} is accessing it on line 27 of Figure~\ref{fig:count:Atomic Limit Counter Utility Functions}?

\QuickA{} Nothing. Consider the following three cases:
\begin{enumerate}
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes before the \co{split_ctrandmax()} of either fastpath, then the fastpath will see a zero \co{counter} and \co{countermax}, and will thus transfer to the slowpath (unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes after the \co{split_ctrandmax()} of either fastpath, but before that fastpath's \co{atomic_cmpxchg()}, then the \co{atomic_cmpxchg()} will fail, causing the fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes after the \co{atomic_cmpxchg()} of either fastpath, then the fastpath will (most likely) complete successfully before \co{flush_local_count()} zeroes the thread's \co{ctrandmax} variable.
\end{enumerate}
In all three cases, the race is resolved correctly.

\QuickQ{} Given that the \co{atomic_set()} primitive does a simple store to the specified \co{atomic_t}, how can line~53 of \co{balance_count()} in Figure~\ref{fig:count:Atomic Limit Counter Utility Functions} work correctly in the face of concurrent \co{flush_local_count()} updates to this variable?
\QuickA{} The callers of both \co{balance_count()} and \co{flush_local_count()} hold \co{gblcnt_mutex}, so only one may be executing at a given time.

\QuickQ{} But signal handlers can be migrated to some other CPU while running. Doesn't this possibility mean that atomic instructions and memory barriers are required to reliably communicate between a thread and a signal handler that interrupts that thread?

\QuickA{} No. If the signal handler is migrated to another CPU, then the interrupted thread is also migrated along with it.

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft State Machine}, why is the REQ \co{theft} state colored blue?

\QuickA{} To indicate that only the fastpath is permitted to change the \co{theft} state.

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft State Machine}, what is the point of having separate REQ and ACK \co{theft} states? Why not simplify the state machine by collapsing them into a single state? Then whichever of the signal handler or the fastpath gets there first could set the state to READY.

\QuickA{} Reasons why collapsing the REQ and ACK states would be a very bad idea include:
\begin{enumerate}
\item The slowpath uses the REQ and ACK states to determine whether the signal should be retransmitted. If the states were collapsed, the slowpath would have no choice but to send redundant signals, which would have the unhelpful effect of slowing down the fastpath.
\item The following race would result:
\begin{enumerate}
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and notes the REQACK state.
\item The thread receives the signal, which also notes the REQACK state, and, because there is no fastpath in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the count, sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling further fastpath execution for this thread.
\end{enumerate}
The basic problem here is that the combined REQACK state can be referenced by both the signal handler and the fastpath. The clear separation maintained by the four-state setup ensures orderly state transitions.
\end{enumerate}
That said, you might well be able to make a three-state setup work correctly. If you do succeed, compare carefully to the four-state setup. Is the three-state solution really preferable, and why or why not?

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions} function \co{flush_local_count_sig()}, why are there \co{ACCESS_ONCE()} wrappers around the uses of the \co{theft} per-thread variable?

\QuickA{} The first one (on line~11) can be argued to be unnecessary. The last two (lines~14 and 16) are important. If these are removed, the compiler would be within its rights to rewrite lines~14-17 as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 14 theft = THEFT_READY;
 15 if (counting) {
 16   theft = THEFT_ACK;
 17 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

This would be fatal, as the slowpath might see the transient value of \co{THEFT_READY}, and start stealing before the corresponding thread was ready.

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why is it safe for line~28 to directly access the other thread's \co{countermax} variable?

\QuickA{} Because the other thread is not permitted to change the value of its \co{countermax} variable unless it holds the \co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible for the other thread to hold it, and therefore the other thread is not permitted to change its \co{countermax} variable. We can therefore safely access it --- but not change it.

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why doesn't line~33 check for the current thread sending itself a signal?

\QuickA{} There is no need for an additional check. The caller of \co{flush_local_count()} has already invoked \co{globalize_count()}, so the check on line~28 will have succeeded, skipping the later \co{pthread_kill()}.

\QuickQ{} The code in Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions} works with gcc and POSIX. What would be required to make it also conform to the ISO C standard?

\QuickA{} The \co{theft} variable must be of type \co{sig_atomic_t} to guarantee that it can be safely shared between the signal handler and the code interrupted by the signal.

\QuickQ{} In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?

\QuickA{} Because many operating systems over several decades have had the property of losing the occasional signal. Whether this is a feature or a bug is debatable, but irrelevant. The obvious symptom from the user's viewpoint will not be a kernel bug, but rather a user application hanging. \emph{Your} user application hanging!

\QuickQ{} What if you want an exact limit counter to be exact only for its lower limit?

\QuickA{} One simple solution is to overstate the upper limit by the desired amount. The limiting case of such overstatement results in the upper limit being set to the largest value that the counter is capable of representing.

\QuickQ{} What else had you better have done when using a biased counter?

\QuickA{} You had better have set the upper limit to be large enough to accommodate the bias, the expected maximum number of accesses, and enough ``slop'' to allow the counter to work efficiently even when the number of accesses is at its maximum.

\QuickQ{} This is ridiculous! We are \emph{read}-acquiring a reader-writer lock to \emph{update} the counter? What are you playing at???

\QuickA{} Strange, perhaps, but true! Almost enough to make you think that the name ``reader-writer lock'' was poorly chosen, isn't it?

\QuickQ{} What other issues would need to be accounted for in a real system?

\QuickA{} A huge number! Here are a few to start with:
\begin{enumerate}
\item There could be any number of devices, so that the global variables are inappropriate, as is the lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems. In many cases, it is far better to have the last completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely need a return value.
\item If the device fails, the last I/O might never complete. In such cases, there might need to be some sort of timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can fail, but their return values are not checked.
\item Reader-writer locks do not scale well. One way of avoiding the high read-acquisition costs of reader-writer locks is presented in Chapter~\ref{chp:Deferred Processing}.
\end{enumerate}

\QuickQ{} On the \url{count_stat.c} row of Table~\ref{tab:count:Statistical Counter Performance on Power-5}, we see that the update side scales linearly with the number of threads.
How is that possible given that the more threads there are, the more per-thread counters must be summed up? \QuickA{} The read-side code must scan the entire fixed-size array, regardless of the number of threads, so there is no difference in performance. In contrast, in the last two algorithms, readers must do more work when there are more threads. In addition, the last two algorithms interpose an additional level of indirection because they map from integer thread ID to the corresponding \co{__thread} variable. \QuickQ{} Even on the last row of Table~\ref{tab:count:Statistical Counter Performance on Power-5}, the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them? \QuickA{} Use the right tool for the job.'' As can be seen from Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}, single-variable atomic increment need not apply for any job involving heavy use of parallel updates. In contrast, the algorithms shown in Table~\ref{tab:count:Statistical Counter Performance on Power-5} do an excellent job of handling update-heavy situations. Of course, if you have a read-mostly situation, you should use something else, for example, a single atomically incremented variable that can be read out using a single load. \QuickQ{} Given the performance data shown in Table~\ref{tab:count:Limit Counter Performance on Power-5}, we should always prefer update-side signals over read-side atomic operations, right? \QuickA{} That depends on the workload. Note that you need a million readers (with roughly a 40-nanosecond performance gain) to make up for even one writer (with almost a 40-\emph{millisecond} performance loss). Although there are no shortage of workloads with far greater read intensity, you will need to consider your particular workload. In addition, although memory barriers have historically been expensive compared to ordinary instructions, you should check this on the specific hardware you will be running. The properties of computer hardware do change over time, and algorithms must change accordingly. \QuickQ{} Can advanced techniques be applied to address the lock contention for readers seen in Table~\ref{tab:count:Limit Counter Performance on Power-5}? \QuickA{} There are a number of ways one might go about this, and these are left as exercises for the reader. \QuickQ{} The \co{++} operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading??? \QuickA{} In the C++ language, you might well be able to use \co{++} on a 1,000-digit number, assuming that you had access to a class implementing such numbers. But as of 2010, the C language does not permit operator overloading. \QuickQ{} But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space? \QuickA{} Indeed, multiple processes with separate address spaces can be an excellent way to exploit parallelism, as the proponents of the fork-join methodology and the Erlang language would be very quick to tell you. However, there are also some advantages to shared-memory parallelism: \begin{enumerate} \item Only the most performance-critical portions of the application must be partitioned, and such portions are usually a small fraction of the application. 
\item Although cache misses are quite slow compared to individual register-to-register instructions, they are typically considerably faster than inter-process-communication primitives, which in turn are considerably faster than things like TCP/IP networking. \item Shared-memory multiprocessors are readily available and quite inexpensive, so, in stark contrast to the 1990s, there is little cost penalty for use of shared-memory parallelism. \end{enumerate} As always, use the right tool for the job! \QuickQAC{cha:Partitioning and Synchronization Design}{Partitioning and Synchronization Design} \QuickQ{} Is there a better solution to the Dining Philosophers Problem? \QuickA{} \begin{figure}[tb] \begin{center} \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM} \end{center} \caption{Dining Philosophers Problem, Fully Partitioned} \ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis} \end{figure} One such improved solution is shown in Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}, where the philosophers are simply provided with an additional five forks. All five philosophers may now eat simultaneously, and there is never any need for philosophers to wait on one another. In addition, the improved disease control provided by this approach should not be underestimated. This solution can seem like cheating to some, but such cheating'' is key to finding good solutions to many concurrency problems. \QuickQ{} And in just what sense can this horizontal parallelism'' be said to be horizontal''? \QuickA{} Inman was working with protocol stacks, which are normally depicted vertically, with the application on top and the hardware interconnect on the bottom. Data flows up and down this stack. Horizontal parallelism'' processes packets from different network connections in parallel, while vertical parallelism'' handles different protocol-processing steps for a given packet in parallel. Vertical parallelism'' is also called pipelining''. \QuickQ{} In this compound double-ended queue implementation, what should be done if the queue has become non-empty while releasing and reacquiring the lock? \QuickA{} In this case, simply dequeue an item from the now-nonempty queue, release both locks, and return. \QuickQ{} Is the hashed double-ended queue a good solution? Why or why not? \QuickA{} The best way to answer this is to run \url{lockhdeq.c} on a number of different multiprocessor systems, and you are encouraged to do so in the strongest possible terms. One reason for concern is that each operation on this implementation must acquire not one but two locks. % Getting about 500 nanoseconds per element when used as % a queue on a 4.2GHz Power system. This is roughly the same as % the version covered by a single lock. Sequential (unlocked % variant is more than an order of magnitude faster! The first well-designed performance study will be cited. Do not forget to compare to a sequential implementation! \QuickQ{} Move \emph{all} the elements to the queue that became empty? In what possible universe is this braindead solution in any way optimal??? \QuickA{} It is optimal in the case where data flow switches direction only rarely. It would of course be an extremely poor choice if the double-ended queue was being emptied from both ends concurrently. This of course raises the question as to what possible universe emptying from both ends concurrently would be a reasonable thing to do... 
\QuickQ{} Why can't the compound parallel double-ended queue implementation be symmetric?

\QuickA{} The need to avoid deadlock by imposing a lock hierarchy forces the asymmetry, just as it does in the fork-numbering solution to the Dining Philosophers Problem.

\QuickQ{} Why is it necessary to retry the right-dequeue operation on line~29 of Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?

\QuickA{} This retry is necessary because some other thread might have enqueued an element between the time that this thread dropped the lock and the time that it reacquired the lock.

\QuickQ{} Surely the left-hand lock must \emph{sometimes} be available!!! So why is it necessary that line~26 of Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation} unconditionally release the right-hand lock?

\QuickA{} It would be possible to use \co{spin_trylock()} to attempt to acquire the left-hand lock when it was available. However, the failure case would still need to drop the right-hand lock and then re-acquire the two locks in order. Making this transformation (and determining whether or not it is worthwhile) is left as an exercise for the reader.

\QuickQ{} The tandem double-ended queue runs about twice as fast as the hashed double-ended queue, even when I increase the size of the hash table to an insanely large number. Why is that?

\QuickA{} The hashed double-ended queue's locking design only permits one thread at a time at each end, and further requires two lock acquisitions for each operation. The tandem double-ended queue also permits one thread at a time at each end, and in the common case requires only one lock acquisition per operation. Therefore, the tandem double-ended queue should be expected to outperform the hashed double-ended queue.

Can you create a double-ended queue that allows multiple concurrent operations at each end? If so, how? If not, why not?

\QuickQ{} Is there a significantly better way of handling concurrency for double-ended queues?

\QuickA{} Transform the problem to be solved so that multiple double-ended queues can be used in parallel, allowing the simpler single-lock double-ended queue to be used, and perhaps also replace each double-ended queue with a pair of conventional single-ended queues. Without such ``horizontal scaling'', the speedup is limited to 2.0. In contrast, horizontal-scaling designs can enable very large speedups, and are especially attractive if there are multiple threads working either end of the queue, because in this multiple-thread case the deque simply cannot provide strong ordering guarantees. And if there are no guarantees, we may as well obtain the performance benefits that come with refusing to provide the guarantees, right?
% about twice as fast as hashed version on 4.2GHz Power.

\QuickQ{} What are some ways of preventing a structure from being freed while its lock is being acquired?

\QuickA{} Here are a few possible solutions to this \emph{existence guarantee} problem:
\begin{enumerate}
\item Provide a statically allocated lock that is held while the per-structure lock is being acquired, which is an example of hierarchical locking (see Section~\ref{sec:SMPdesign:Hierarchical Locking}). Of course, using a single global lock for this purpose can result in unacceptably high levels of lock contention, dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing the structure's address to select the lock to be acquired, as described in Chapter~\ref{chp:Locking}. Given a hash function of sufficiently high quality, this avoids the scalability limitations of the single global lock, but in read-mostly situations, the lock-acquisition overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing them, so that a structure cannot be deallocated while being referenced. This works very well, removing the existence-guarantee burden (and much else besides) from the developer's shoulders, but imposes the overhead of garbage collection on the program. Although garbage-collection technology has advanced considerably in the past few decades, its overhead may be unacceptably high for some applications. In addition, some applications require that the developer exercise more control over the layout and placement of data structures than is permitted by most garbage collected environments.
\item As a special case of a garbage collector, use a global reference counter, or a global array of reference counters.
\item Use \emph{hazard pointers}~\cite{MagedMichael04a}, which can be thought of as an inside-out reference count. Hazard-pointer-based algorithms maintain a per-thread list of pointers, so that the appearance of a given pointer on any of these lists acts as a reference to the corresponding structure. Hazard pointers are an interesting research direction, but have not yet seen much use in production (written in 2008).
\item Use transactional memory (TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95}, so that each reference and modification to the data structure in question is performed atomically. Although TM has engendered much excitement in recent years, and seems likely to be of some use in production software, developers should exercise some caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM}, particularly in performance-critical code. In particular, existence guarantees require that the transaction cover the full path from a global reference to the data elements being updated.
\item Use RCU, which can be thought of as an extremely lightweight approximation to a garbage collector. Updaters are not permitted to free RCU-protected data structures that RCU readers might still be referencing. RCU is most heavily used for read-mostly data structures, and is discussed at length in Chapter~\ref{chp:Deferred Processing}.
\end{enumerate}
For more on providing existence guarantees, see Chapters~\ref{chp:Locking} and \ref{chp:Deferred Processing}.

\QuickQ{} How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure~\ref{fig:SMPdesign:Matrix Multiply Efficiency} have efficiency of exactly 1.0 when running on only one thread?

\QuickA{} The \texttt{matmul.c} program creates the specified number of worker threads, so even the single-worker-thread case incurs thread-creation overhead. Making the changes required to optimize away thread-creation overhead in the single-worker-thread case is left as an exercise to the reader.

\QuickQ{} How are data-parallel techniques going to help with matrix multiply? It is \emph{already} data parallel!!!

\QuickA{} I am glad that you are paying attention!
This example serves to show that although data parallelism can be a very good thing, it is not some magic wand that automatically wards off any and all sources of inefficiency. Linear scaling at full performance, even to only'' 64 threads, requires care at all phases of design and implementation. In particular, you need to pay careful attention to the size of the partitions. For example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets only 64 floating-point multiplies. The cost of a floating-point multiply is miniscule compared to the overhead of thread creation. Moral: If you have a parallel program with variable input, always include a check for the input size being too small to be worth parallelizing. And when it is not helpful to parallelize, it is not helpful to spawn a single thread, now is it? \QuickQ{} In what situation would hierarchical locking work well? \QuickA{} If the comparison on line~31 of Figure~\ref{fig:SMPdesign:Hierarchical-Locking Hash Table Search} were replaced by a much heavier-weight operation, then releasing {\tt bp->bucket\_lock} \emph{might} reduce lock contention enough to outweigh the overhead of the extra acquisition and release of {\tt cur->node\_lock}. \QuickQ{} In Figure~\ref{fig:SMPdesign:Allocator Cache Performance}, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why? \QuickA{} This is due to the per-CPU target value being three. A run length of 12 must acquire the global-pool lock twice, while a run length of 13 must acquire the global-pool lock three times. \QuickQ{} Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-CPU target pool size of three, what is the smallest allocation run length at which failures can occur? \QuickA{} The exact solution to this problem is left as an exercise to the reader. The first solution received will be credited to its submitter. As a rough rule of thumb, the global pool size should be at least $m+2sn$, where m'' is the maximum number of elements allocated at a given time, s'' is the per-CPU pool size, and n'' is the number of CPUs. \QuickQAC{chp:Locking}{Locking} \QuickQ{} But the definition of deadlock only said that each thread was holding at least one lock and waiting on another lock that was held by some thread. How do you know that there is a cycle? \QuickA{} Suppose that there is no cycle in the graph. We would then have a directed acyclic graph (DAG), which would have at least one leaf node. If this leaf node was a lock, then we would have a thread that was waiting on a lock that wasn't held by any thread, which violates the definition. (And in this case the thread would immediately acquire the lock.) On the other hand, if this leaf node was a thread, then we would have a thread that was not waiting on any lock, again violating the definition. (And in this case, the thread would either be running or blocked on something that is not a lock.) Therefore, given this definition of deadlock, there must be a cycle in the corresponding graph. \QuickQ{} Are there any exceptions to this rule, so that there really could be a deadlock cycle containing locks from both the library and the caller, even given that the library code never invokes any of the caller's functions? \QuickA{} Indeed there can! 
Here are a few of them:
\begin{enumerate}
\item If one of the library function's arguments is a pointer to a lock that this library function acquires, and if the library function holds one of its locks while acquiring the caller's lock, then we could have a deadlock cycle involving both caller and library locks.
\item If one of the library functions returns a pointer to a lock that is acquired by the caller, and if the caller acquires one of its locks while holding the library's lock, we could again have a deadlock cycle involving both caller and library locks.
\item If one of the library functions acquires a lock and then returns while still holding it, and if the caller acquires one of its locks, we have yet another way to create a deadlock cycle involving both caller and library locks.
\item If the caller has a signal handler that acquires locks, then the deadlock cycle can involve both caller and library locks. In this case, however, the library's locks are innocent bystanders in the deadlock cycle. That said, please note that acquiring a lock from within a signal handler is a no-no in most environments---it is not just a bad idea, it is unsupported.
\end{enumerate}

\QuickQ{} But if \co{qsort()} releases all its locks before invoking the comparison function, how can it protect against races with other \co{qsort()} threads?

\QuickA{} By privatizing the data elements being compared (as discussed in Chapter~\ref{chp:Data Ownership}) or through use of deferral mechanisms such as reference counting (as discussed in Chapter~\ref{chp:Deferred Processing}).

\QuickQ{} Name one common exception where it is perfectly reasonable to pass a pointer to a lock into a function.

\QuickA{} Locking primitives, of course!

\QuickQ{} Doesn't the fact that \co{pthread_cond_wait()} first releases the mutex and then re-acquires it eliminate the possibility of deadlock?

\QuickA{} Absolutely not!

Consider a program that acquires \co{mutex_a}, and then \co{mutex_b}, in that order, and then passes \co{mutex_a} to \co{pthread_cond_wait()}. Now, \co{pthread_cond_wait()} will release \co{mutex_a}, but will re-acquire it before returning. If some other thread acquires \co{mutex_a} in the meantime and then blocks on \co{mutex_b}, the program will deadlock.

\QuickQ{} Can the transformation from Figure~\ref{fig:locking:Protocol Layering and Deadlock} to Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking} be applied universally?

\QuickA{} Absolutely not!

This transformation assumes that the \co{layer_2_processing()} function is idempotent, given that it might be executed multiple times on the same packet when the \co{layer_1()} routing decision changes. Therefore, in real life, this transformation can become arbitrarily complex.

\QuickQ{} But the complexity in Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking} is well worthwhile given that it avoids deadlock, right?

\QuickA{} Maybe.

If the routing decision in \co{layer_1()} changes often enough, the code will always retry, never making forward progress. This is termed ``livelock'' if no thread makes any forward progress or ``starvation'' if some threads make forward progress but others do not (see Section~\ref{sec:locking:Livelock and Starvation}).

\QuickQ{} Why is it illegal to acquire a Lock~A that is acquired outside of a signal handler without blocking signals while holding a Lock~B that is acquired within a signal handler?

\QuickA{} Because this would lead to deadlock.
Given that Lock~A is held outside of a signal handler without blocking signals, a signal might be handled while holding this lock. The corresponding signal handler might then acquire Lock~B, so that Lock~B is acquired while holding Lock~A. Therefore, if we also acquire Lock~A while holding Lock~B as called out in the question, we will have a deadlock cycle.

Therefore, it is illegal to acquire a lock that is acquired outside of a signal handler without blocking signals while holding another lock that is acquired within a signal handler.

\QuickQ{} How can you legally block signals within a signal handler?

\QuickA{} One of the simplest and fastest ways to do so is to use the \co{sa_mask} field of the \co{struct sigaction} that you pass to \co{sigaction()} when setting up the signal.

\QuickQ{} Given an object-oriented application that passes control freely among a group of objects such that there is no reasonable locking hierarchy, layered or otherwise, how can this application be parallelized?

\QuickA{} There are a number of approaches:
\begin{enumerate}
\item In the case of parametric search via simulation, where a large number of simulations will be run in order to converge on (for example) a good design for a mechanical or electrical device, leave the simulation single-threaded, but run many instances of the simulation in parallel. This retains the object-oriented design, and gains parallelism at a higher level.
\item Partition the objects into groups such that there is no need to operate on objects in more than one group at a given time. Then associate a lock with each group. This is an example of a single-lock-at-a-time design, which was discussed in Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}.
\item Partition the objects into groups such that threads can all operate on objects in the groups in some groupwise ordering. Then associate a lock with each group, and impose a locking hierarchy over the groups.
\item Impose an arbitrarily selected hierarchy on the locks, and then use conditional locking if it is necessary to acquire a lock out of order, as was discussed in Section~\ref{sec:locking:Conditional Locking}.
\item Before carrying out a given group of operations, predict which locks will be acquired, and attempt to acquire them before actually carrying out any updates. If the prediction turns out to be incorrect, drop all the locks and retry with an updated prediction that includes the benefit of experience. This approach was discussed in Section~\ref{sec:locking:Acquire Needed Locks First}.
\item Use transactional memory. This approach has a number of advantages and disadvantages which will be discussed in Section~\ref{sec:future:Transactional Memory}.
\item Refactor the application to be more concurrency-friendly. This would likely also have the side effect of making the application run faster even when single-threaded, but might also make it more difficult to modify the application.
\item Use techniques from later chapters in addition to locking.
\end{enumerate}

\QuickQ{} How can the livelock shown in Figure~\ref{fig:locking:Abusing Conditional Locking} be avoided?

\QuickA{} This is left as an exercise to the reader. Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking} provides some good hints. In many cases, livelocks are a hint that you should revisit your locking design. Or visit it in the first place if your locking design just ``grew''.
\QuickQ{} What problems can you spot in the code in Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}?

\QuickA{} Here are a couple:
\begin{enumerate}
\item A one-second wait is way too long for most uses. Wait intervals should begin with roughly the time required to execute the critical section, which will normally be in the microsecond or millisecond range.
\item The code does not check for overflow. On the other hand, this bug is nullified by the previous bug: 32 bits worth of seconds is more than 50 years.
\end{enumerate}

\QuickQ{} Wouldn't it be better just to use a good parallel design so that lock contention was low enough to avoid unfairness?

\QuickA{} It would be better in some sense, but there are situations where it can be appropriate to use designs that sometimes result in high lock contention.

For example, imagine a system that is subject to a rare error condition. It might well be best to have a simple error-handling design that has poor performance and scalability for the duration of the rare error condition, as opposed to a complex and difficult-to-debug design that is helpful only when one of those rare error conditions is in effect.

That said, it is usually worth putting some effort into attempting to produce a design that is both simple and efficient during error conditions, for example, by partitioning the problem.

\QuickQ{} How might the lock holder be interfered with?

\QuickA{} If the data protected by the lock is in the same cache line as the lock itself, then attempts by other CPUs to acquire the lock will result in expensive cache misses on the part of the CPU holding the lock. In contrast, if the lock is in a different cache line than the data that it protects, the CPU holding the lock will usually suffer a cache miss only on first access to a given variable.

This is a special case of false sharing, which can also occur if a pair of variables protected by different locks happen to share a cache line.

\QuickQ{} Is there any other way for the VAX/VMS DLM to emulate a reader-writer lock?

\QuickA{} There are in fact several. One way would be to use the null, protected-read, and exclusive modes. Another way would be to use the null, protected-read, and concurrent-write modes. A third way would be to use the null, concurrent-read, and exclusive modes.

\QuickQ{} Why bother with the inner loop on lines~7-8 of Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}? Why not simply repeatedly do the atomic exchange operation on line~6?

\QuickA{} Suppose that the lock is held and that several threads are attempting to acquire the lock. In this situation, if these threads all loop on the atomic exchange operation, they will ping-pong the cache line containing the lock among themselves, imposing load on the interconnect. In contrast, if these threads are spinning in the inner loop on lines~7-8, they will each spin within their own caches, putting negligible load on the interconnect.

\QuickQ{} Why not simply store zero into the lock word on line 14 of Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?

\QuickA{} This can be a legitimate implementation, but only if this store is preceded by a memory barrier. The memory barrier is not required when the \co{xchg()} operation is used because this operation implies a full memory barrier due to the fact that it returns a value.

\QuickQ{} How can relying on implicit existence guarantees result in a bug?
\QuickA{} Here are some bugs resulting from improper use of implicit existence guarantees:
\begin{enumerate}
\item A program writes the address of a global variable to a file, then a later instance of that same program reads that address and attempts to dereference it. This can fail due to address-space randomization, to say nothing of recompilation of the program.
\item A module can record the address of one of its variables in a pointer located in some other module, then attempt to dereference that pointer after the module has been unloaded.
\item A function can record the address of one of its on-stack variables into a global pointer, which some other function might attempt to dereference after that function has returned.
\end{enumerate}
I am sure that you can come up with additional possibilities.

\QuickQ{} What if the element we need to delete is not the first element of the list on line~8 of Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?

\QuickA{} This is a very simple hash table with no chaining, so the only element in a given bucket is the first element. The reader is invited to adapt this example to a hash table with full chaining.

\QuickQ{} What race condition can occur in Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?

\QuickA{} Consider the following sequence of events:
\begin{enumerate}
\item Thread~0 invokes \co{delete(0)}, and reaches line~10 of the figure, acquiring the lock.
\item Thread~1 concurrently invokes \co{delete(0)}, and reaches line~10, but spins on the lock because Thread~0 holds it.
\item Thread~0 executes lines~11-14, removing the element from the hashtable, releasing the lock, and then freeing the element.
\item Thread~0 continues execution, and allocates memory, getting the exact block of memory that it just freed.
\item Thread~0 then initializes this block of memory as some other type of structure.
\item Thread~1's \co{spin_lock()} operation fails due to the fact that what it believes to be \co{p->lock} is no longer a spinlock.
\end{enumerate}
Because there is no existence guarantee, the identity of the data element can change while a thread is attempting to acquire that element's lock on line~10!

\QuickQAC{chp:Data Ownership}{Data Ownership}

\QuickQ{} What form of data ownership is extremely difficult to avoid using when creating shared-memory parallel programs (for example, using pthreads) in C or C++?

\QuickA{} Use of auto variables in functions. By default, these are private to the thread executing the current function.

\QuickQ{} What synchronization remains in the example shown in Section~\ref{sec:owned:Multiple Processes}?

\QuickA{} The creation of the threads via the \co{sh} \co{&} operator and the joining of threads via the \co{sh} \co{wait} command.

Of course, if the processes explicitly share memory, for example, using the \co{shmget()} or \co{mmap()} system calls, explicit synchronization might well be needed when accessing or updating the shared memory. The processes might also synchronize using any of the following interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole host of others.
\item File locking.
\item Use of the \co{open()} system call with the \co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate} A complete list of possible synchronization mechanisms is left as an exercise to the reader, who is warned that it will be an extremely long list. A surprising number of unassuming system calls can be pressed into service as synchronization mechanisms. \QuickQ{} Is there any shared data in the example shown in Section~\ref{sec:owned:Multiple Processes}? \QuickA{} That is a philosophical question. Those wishing the answer no'' might argue that processes by definition do not share memory. Those wishing to answer yes'' might list a large number of synchronization mechanisms that do not require shared memory, note that the kernel will have some shared state, and perhaps even argue that the assignment of process IDs (PIDs) constitute shared data. Such arguments are excellent intellectual exercise, and are also a wonderful way of feeling intelligent, scoring points against hapless classmates or colleagues, and (especially!) avoiding getting anything useful done. \QuickQ{} Does it ever make sense to have partial data ownership where each thread reads only its own instance of a per-thread variable, but writes to other threads' instances? \QuickA{} Amazingly enough, yes. One example is a simple message-passing system where threads post messages to other threads' mailboxes, and where each thread is responsible for removing any message it sent once that message has been acted on. Implementation of such an algorithm is left as an exercise for the reader, as is the task of identifying other algorithms with similar ownership patterns. \QuickQ{} What mechanisms other than POSIX signals may be used to ship functions? \QuickA{} There is a very large number of such mechanisms, including: \begin{enumerate} \item System V message queues. \item Shared-memory dequeue (see Section~\ref{sec:SMPdesign:Double-Ended Queue}). \item Shared-memory mailboxes. \item UNIX-domain sockets. \item TCP/IP or UDP, possibly augmented by any number of higher-level protocols, including RPC, HTTP, XML, SOAP, and so on. \end{enumerate} Compilation of a complete list is left as an exercise to sufficiently single-minded readers, who are warned that the list will be extremely long. \QuickQ{} But none of the data in the \co{eventual()} function shown on lines~15-32 of Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters} is actually owned by the \co{eventual()} thread! In just what way is this data ownership??? \QuickA{} The key phrase is owns the rights to the data''. In this case, the rights in question are the rights to access the per-thread \co{counter} variable defined on line~1 of the figure. This situation is similar to that described in Section~\ref{sec:owned:Partial Data Ownership and pthreads}. However, there really is data that is owned by the \co{eventual()} thread, namely the \co{t} and \co{sum} variables defined on lines~17 and~18 of the figure. For other examples of designated threads, look at the kernel threads in the Linux kernel, for example, those created by \co{kthread_create()} and \co{kthread_run()}. \QuickQ{} Is it possible to obtain greater accuracy while still maintaining full privacy of the per-thread data? \QuickA{} Yes. One approach is for \co{read_count()} to add the value of its own per-thread variable. This maintains full ownership and performance, but only a slight improvement in accuracy, particulary on systems with very large numbers of threads. Another approach is for \co{read_count()} to use function shipping, for example, in the form of per-thread signals. 
This greatly improves accuracy, but at a significant performance cost for \co{read_count()}. However, both of these methods have the advantage of eliminating cache-line bouncing for the common case of updating counters.

\QuickQAC{chp:Deferred Processing}{Deferred Processing}

\QuickQ{} Why not implement reference-acquisition using a simple compare-and-swap operation that only acquires a reference if the reference counter is non-zero?

\QuickA{} Although this can resolve the race between the release of the last reference and acquisition of a new reference, it does absolutely nothing to prevent the data structure from being freed and reallocated, possibly as some completely different type of structure. It is quite likely that the ``simple compare-and-swap operation'' would give undefined results if applied to the differently typed structure.

In short, use of atomic operations such as compare-and-swap absolutely requires either type-safety or existence guarantees.

\QuickQ{} Why isn't it necessary to guard against cases where one CPU acquires a reference just after another CPU releases the last reference?

\QuickA{} Because a CPU must already hold a reference in order to legally acquire another reference. Therefore, if one CPU releases the last reference, there cannot possibly be any CPU that is permitted to acquire a new reference. This same fact allows the non-atomic check in line~22 of Figure~\ref{fig:defer:Linux Kernel kref API}.

\QuickQ{} If the check on line~22 of Figure~\ref{fig:defer:Linux Kernel kref API} fails, how could the check on line~23 possibly succeed?

\QuickA{} Suppose that {\tt kref\_put()} is protected by RCU, so that two CPUs might be executing line~22 concurrently. Both might see the value ``2'', causing both to then execute line~23. One of the two instances of {\tt atomic\_dec\_and\_test()} will decrement the value to zero and thus return 1.

\QuickQ{} How can it possibly be safe to non-atomically check for equality with ``1'' on line~22 of Figure~\ref{fig:defer:Linux Kernel kref API}?

\QuickA{} Remember that it is not legal to call either {\tt kref\_get()} or {\tt kref\_put()} unless you hold a reference. If the reference count is equal to ``1'', then there can't possibly be another CPU authorized to change the value of the reference count.

\QuickQ{} Why can't the check for a zero reference count be made in a simple ``if'' statement with an atomic increment in its ``then'' clause?

\QuickA{} Suppose that the ``if'' condition completed, finding the reference counter value equal to one. Suppose that a release operation executes, decrementing the reference counter to zero and therefore starting cleanup operations. But now the ``then'' clause can increment the counter back to a value of one, allowing the object to be used after it has been cleaned up.

\QuickQ{} Why isn't this sequence-lock discussion in Chapter~\ref{chp:Locking}, you know, the one on \emph{locking}?

\QuickA{} The sequence-lock mechanism is really a combination of two separate synchronization mechanisms, sequence counts and locking. In fact, the sequence-count mechanism is available separately in the Linux kernel via the \co{write_seqcount_begin()} and \co{write_seqcount_end()} primitives.

However, the combined \co{write_seqlock()} and \co{write_sequnlock()} primitives are used much more heavily in the Linux kernel. More importantly, many more people will understand what you mean if you say ``sequence lock'' than if you say ``sequence count''.
So this section is entitled ``Sequence Locks'' so that people will understand what it is about just from the title, and it appears in the ``Deferred Processing'' chapter because (1) of the emphasis on the ``sequence count'' aspect of ``sequence locks'' and (2) because a ``sequence lock'' is much more than merely a lock.

\QuickQ{} Can you use sequence locks as the only synchronization mechanism protecting a linked list supporting concurrent addition, deletion, and search?

\QuickA{} One trivial way of accomplishing this is to surround all accesses, including the read-only accesses, with \co{write_seqlock()} and \co{write_sequnlock()}. Of course, this solution also prohibits all read-side parallelism, and furthermore could just as easily be implemented using simple locking.

If you do come up with a solution that uses \co{read_seqbegin()} and \co{read_seqretry()} to protect read-side accesses, make sure that you correctly handle the following sequence of events:
\begin{enumerate}
\item CPU~0 is traversing the linked list, and picks up a pointer to list element~A.
\item CPU~1 removes element~A from the list and frees it.
\item CPU~2 allocates an unrelated data structure, and gets the memory formerly occupied by element~A. In this unrelated data structure, the memory previously used for element~A's \co{->next} pointer is now occupied by a floating-point number.
\item CPU~0 picks up what used to be element~A's \co{->next} pointer, gets random bits, and therefore gets a segmentation fault.
\end{enumerate}
One way to protect against this sort of problem requires use of ``type-safe memory'', which will be discussed in Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}. But in that case, you would be using some other synchronization mechanism in addition to sequence locks!

\QuickQ{} Why bother with the check on line~19 of \co{read_seqbegin()} in Figure~\ref{fig:defer:Sequence-Locking Implementation}? Given that a new writer could begin at any time, why not simply incorporate the check into line~31 of \co{read_seqretry()}?

\QuickA{} That would be a legitimate implementation. However, it would not save anything to move the check down to \co{read_seqretry()}: There would be roughly the same number of instructions. Furthermore, the reader's accesses from its doomed read-side critical section could inflict overhead on the writer in the form of cache misses. We can avoid these cache misses by placing the check in \co{read_seqbegin()} as shown on line~19 of Figure~\ref{fig:defer:Sequence-Locking Implementation}.

\QuickQ{} What prevents sequence-locking updaters from starving readers?

\QuickA{} Nothing. This is one of the weaknesses of sequence locking, and as a result, you should use sequence locking only in read-mostly situations. Unless of course read-side starvation is acceptable in your situation, in which case, go wild with the sequence-locking updates!

\QuickQ{} What if something else serializes writers, so that the lock is not needed?

\QuickA{} In this case, the \co{->lock} field could be omitted, as it is in \co{seqcount_t} in the Linux kernel.

\QuickQ{} Why isn't \co{seq} on line 2 of Figure~\ref{fig:defer:Sequence-Locking Implementation} \co{unsigned} rather than \co{unsigned long}? After all, if \co{unsigned} is good enough for the Linux kernel, shouldn't it be good enough for everyone?

\QuickA{} Not at all.
The Linux kernel has a number of special attributes that allow it to ignore the following sequence of events: \begin{enumerate} \item Thread 0 executes \co{read_seqbegin()}, picking up \co{->seq} in line~17, noting that the value is even, and thus returning to the caller. \item Thread 0 starts executing its read-side critical section, but is then preempted for a long time. \item Other threads repeatedly invoke \co{write_seqlock()} and \co{write_sequnlock()}, until the value of \co{->seq} overflows back to the value that Thread~0 fetched. \item Thread 0 resumes execution, completing its read-side critical section with inconsistent data. \item Thread 0 invokes \co{read_seqretry()}, which incorrectly concludes that Thread~0 has seen a consistent view of the data protected by the sequence lock. \end{enumerate} The Linux kernel uses sequence locking for things that are updated rarely, with time-of-day information being a case in point. This information is updated at most once per millisecond, so that seven weeks would be required to overflow the counter. If a kernel thread was preempted for seven weeks, the Linux kernel's soft-lockup code would be emitting warnings every two minutes for that entire time. In contrast, with a 64-bit counter, more than five centuries would be required to overflow, even given an update every \emph{nano}second. Therefore, this implementation uses a type for \co{->seq} that is 64 bits on 64-bit systems. \QuickQ{} But doesn't Section~\ref{sec:defer:Sequence Locks}'s seqlock also permit readers and updaters to get work done concurrently? \QuickA{} Yes and no. Although seqlock readers can run concurrently with seqlock writers, whenever this happens, the {\tt read\_seqretry()} primitive will force the reader to retry. This means that any work done by a seqlock reader running concurrently with a seqlock updater will be discarded and redone. So seqlock readers can \emph{run} concurrently with updaters, but they cannot actually get any work done in this case. In contrast, RCU readers can perform useful work even in presence of concurrent RCU updaters. \QuickQ{} What prevents the {\tt list\_for\_each\_entry\_rcu()} from getting a segfault if it happens to execute at exactly the same time as the {\tt list\_add\_rcu()}? \QuickA{} On all systems running Linux, loads from and stores to pointers are atomic, that is, if a store to a pointer occurs at the same time as a load from that same pointer, the load will return either the initial value or the value stored, never some bitwise mashup of the two. In addition, the {\tt list\_for\_each\_entry\_rcu()} always proceeds forward through the list, never looking back. Therefore, the {\tt list\_for\_each\_entry\_rcu()} will either see the element being added by {\tt list\_add\_rcu()} or it will not, but either way, it will see a valid well-formed list. \QuickQ{} Why do we need to pass two pointers into {\tt hlist\_for\_each\_entry\_rcu()} when only one is needed for {\tt list\_for\_each\_entry\_rcu()}? \QuickA{} Because in an hlist it is necessary to check for NULL rather than for encountering the head. (Try coding up a single-pointer {\tt hlist\_for\_each\_entry\_rcu()} If you come up with a nice solution, it would be a very good thing!) \QuickQ{} How would you modify the deletion example to permit more than two versions of the list to be active? \QuickA{} One way of accomplishing this is as shown in Figure~\ref{fig:defer:Concurrent RCU Deletion}. 
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
  1 spin_lock(&mylock);
  2 p = search(head, key);
  3 if (p == NULL)
  4   spin_unlock(&mylock);
  5 else {
  6   list_del_rcu(&p->list);
  7   spin_unlock(&mylock);
  8   synchronize_rcu();
  9   kfree(p);
 10 }
\end{verbatim}
}
\caption{Concurrent RCU Deletion}
\label{fig:defer:Concurrent RCU Deletion}
\end{figure}
Note that this means that multiple concurrent deletions might be waiting in {\tt synchronize\_rcu()}.

\QuickQ{} How many RCU versions of a given list can be active at any given time?

\QuickA{} That depends on the synchronization design. If a semaphore protecting the update is held across the grace period, then there can be at most two versions, the old and the new. However, if only the search, the update, and the {\tt list\_replace\_rcu()} were protected by a lock, then there could be an arbitrary number of versions active, limited only by memory and by how many updates could be completed within a grace period. But please note that data structures that are updated so frequently probably are not good candidates for RCU. That said, RCU can handle high update rates when necessary.

\QuickQ{} How can RCU updaters possibly delay RCU readers, given that the {\tt rcu\_read\_lock()} and {\tt rcu\_read\_unlock()} primitives neither spin nor block?

\QuickA{} The modifications undertaken by a given RCU updater will cause the corresponding CPU to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU readers to incur expensive cache misses. (Can you design an algorithm that changes a data structure \emph{without} inflicting expensive cache misses on concurrent readers? On subsequent readers?)

\QuickQ{} WTF? How the heck do you expect me to believe that RCU has a 100-femtosecond overhead when the clock period at 3GHz is more than 300 \emph{picoseconds}?

\QuickA{} First, consider that the inner loop used to take this measurement is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 for (i = 0; i < CSCOUNT_SCALE; i++) {
  2   rcu_read_lock();
  3   rcu_read_unlock();
  4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Next, consider the effective definitions of \co{rcu_read_lock()} and \co{rcu_read_unlock()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 #define rcu_read_lock()   do { } while (0)
  2 #define rcu_read_unlock() do { } while (0)
\end{verbatim}
\end{minipage}
\vspace{5pt}
Consider also that the compiler does simple optimizations, allowing it to replace the loop with:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
i = CSCOUNT_SCALE;
\end{verbatim}
\end{minipage}
\vspace{5pt}
So the ``measurement'' of 100 femtoseconds is simply the fixed overhead of the timing measurements divided by the number of passes through the inner loop containing the calls to \co{rcu_read_lock()} and \co{rcu_read_unlock()}. And therefore, this measurement really is in error; in fact, it is in error by an arbitrary number of orders of magnitude. As you can see by the definition of \co{rcu_read_lock()} and \co{rcu_read_unlock()} above, the actual overhead is precisely zero. It certainly is not every day that a timing measurement of 100 femtoseconds turns out to be an overestimate!

\QuickQ{} Why do both the variability and overhead of rwlock decrease as the critical-section overhead increases?

\QuickA{} Because the contention on the underlying \co{rwlock_t} decreases as the critical-section overhead increases.
However, the rwlock overhead will not quite drop to that on a single CPU because of cache-thrashing overhead.

\QuickQ{} Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?

\QuickA{} One way to cause a deadlock cycle involving RCU read-side primitives is via the following (illegal) sequence of statements:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
idx = srcu_read_lock(&srcucb);
synchronize_srcu(&srcucb);
srcu_read_unlock(&srcucb, idx);
\end{verbatim}
\end{minipage}
\vspace{5pt}
The \co{synchronize_srcu()} cannot return until all pre-existing SRCU read-side critical sections complete, but is enclosed in an SRCU read-side critical section that cannot complete until the \co{synchronize_srcu()} returns. The result is a classic self-deadlock--you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it. Note that this self-deadlock scenario does not apply to RCU Classic, because the context switch performed by the \co{synchronize_rcu()} would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is if anything even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing. In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section.

\QuickQ{} But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?

\QuickA{} This is an effect of the Law of Toy Examples: beyond a certain point, the code fragments look the same. The only difference is in how we think about the code. However, this difference can be extremely important. For but one example of the importance, consider that if we think of RCU as a restricted reference counting scheme, we would never be fooled into thinking that the updates would exclude the RCU read-side critical sections. It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU.

\QuickQ{} Why the dip in refcnt overhead near 6 CPUs?

\QuickA{} Most likely NUMA effects. However, there is substantial variance in the values measured for the refcnt line, as can be seen by the error bars. In fact, standard deviations exceed 10\% of measured values in some cases. The dip in overhead therefore might well be a statistical aberration.

\QuickQ{} What if the element we need to delete is not the first element of the list on line~9 of Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}?

\QuickA{} As with Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}, this is a very simple hash table with no chaining, so the only element in a given bucket is the first element. The reader is again invited to adapt this example to a hash table with full chaining.

\QuickQ{} Why is it OK to exit the RCU read-side critical section on line~15 of Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking} before releasing the lock on line~17?

\QuickA{} First, please note that the second check on line~14 is necessary because some other CPU might have removed this element while we were waiting to acquire the lock. However, the fact that we were in an RCU read-side critical section while acquiring the lock guarantees that this element could not possibly have been re-allocated and re-inserted into this hash table.
Furthermore, once we acquire the lock, the lock itself guarantees the element's existence, so we no longer need to be in an RCU read-side critical section. The question as to whether it is necessary to re-check the element's key is left as an exercise for the reader.

\QuickQ{} Why not exit the RCU read-side critical section on line~23 of Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking} before releasing the lock on line~22?

\QuickA{} Suppose we reverse the order of these two lines. Then this code is vulnerable to the following sequence of events:
\begin{enumerate}
\item CPU~0 invokes \co{delete()}, and finds the element to be deleted, executing through line~15. It has not yet actually deleted the element, but is about to do so.
\item CPU~1 concurrently invokes \co{delete()}, attempting to delete this same element. However, CPU~0 still holds the lock, so CPU~1 waits for it at line~13.
\item CPU~0 executes lines~16 and 17, and blocks at line~18 waiting for CPU~1 to exit its RCU read-side critical section.
\item CPU~1 now acquires the lock, but the test on line~14 fails because CPU~0 has already removed the element. CPU~1 now executes line~22 (which we switched with line~23 for the purposes of this Quick Quiz) and exits its RCU read-side critical section.
\item CPU~0 can now return from \co{synchronize_rcu()}, and thus executes line~19, sending the element to the freelist.
\item CPU~1 now attempts to release a lock for an element that has been freed, and, worse yet, possibly reallocated as some other type of data structure. This is a fatal memory-corruption error.
\end{enumerate}

\QuickQ{} But what if there is an arbitrarily long series of RCU read-side critical sections in multiple threads, so that at any point in time there is at least one thread in the system executing in an RCU read-side critical section? Wouldn't that prevent any data from a \co{SLAB_DESTROY_BY_RCU} slab ever being returned to the system, possibly resulting in OOM events?

\QuickA{} There could certainly be an arbitrarily long period of time during which at least one thread is always in an RCU read-side critical section. However, the key words in the description in Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory} are ``in-use'' and ``pre-existing''. Keep in mind that a given RCU read-side critical section is conceptually only permitted to gain references to data elements that were in use at the beginning of that critical section. Furthermore, remember that a slab cannot be returned to the system until all of its data elements have been freed; in fact, the RCU grace period cannot start until after they have all been freed. Therefore, the slab cache need only wait for those RCU read-side critical sections that started before the freeing of the last element of the slab. This in turn means that any RCU grace period that begins after the freeing of the last element will do---the slab may be returned to the system after that grace period ends.

\QuickQ{} Suppose that the \co{nmi_profile()} function was preemptible. What would need to change to make this example work correctly?

\QuickA{} One approach would be to use \co{rcu_read_lock()} and \co{rcu_read_unlock()} in \co{nmi_profile()}, and to replace the \co{synchronize_sched()} with \co{synchronize_rcu()}, perhaps as shown in Figure~\ref{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
\begin{figure}[tbp]
{ \tt \scriptsize
\begin{verbatim}
  1 struct profile_buffer {
  2   long size;
  3   atomic_t entry[0];
  4 };
  5 static struct profile_buffer *buf = NULL;
  6
  7 void nmi_profile(unsigned long pcvalue)
  8 {
  9   struct profile_buffer *p;
 10
 11   rcu_read_lock();
 12   p = rcu_dereference(buf);
 13   if (p == NULL) {
 14     rcu_read_unlock();
 15     return;
 16   }
 17   if (pcvalue >= p->size) {
 18     rcu_read_unlock();
 19     return;
 20   }
 21   atomic_inc(&p->entry[pcvalue]);
 22   rcu_read_unlock();
 23 }
 24
 25 void nmi_stop(void)
 26 {
 27   struct profile_buffer *p = buf;
 28
 29   if (p == NULL)
 30     return;
 31   rcu_assign_pointer(buf, NULL);
 32   synchronize_rcu();
 33   kfree(p);
 34 }
\end{verbatim}
}
\caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\label{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\end{figure}

\QuickQ{} Why do some of the cells in Table~\ref{tab:defer:RCU Wait-to-Finish APIs} have exclamation marks (``!'')?

\QuickA{} The API members with exclamation marks (\co{rcu_read_lock()}, \co{rcu_read_unlock()}, and \co{call_rcu()}) were the only members of the Linux RCU API that Paul E. McKenney was aware of back in the mid-90s. During this timeframe, he was under the mistaken impression that he knew all that there is to know about RCU.

\QuickQ{} How do you prevent a huge number of RCU read-side critical sections from indefinitely blocking a \co{synchronize_rcu()} invocation?

\QuickA{} There is no need to do anything to prevent RCU read-side critical sections from indefinitely blocking a \co{synchronize_rcu()} invocation, because the \co{synchronize_rcu()} invocation need wait only for \emph{pre-existing} RCU read-side critical sections. So as long as each RCU read-side critical section is of finite duration, there should be no problem.

\QuickQ{} The \co{synchronize_rcu()} API waits for all pre-existing interrupt handlers to complete, right?

\QuickA{} Absolutely not! And especially not when using preemptible RCU! You instead want \co{synchronize_irq()}. Alternatively, you can place calls to \co{rcu_read_lock()} and \co{rcu_read_unlock()} in the specific interrupt handlers that you want \co{synchronize_rcu()} to wait for.

\QuickQ{} What happens if you mix and match? For example, suppose you use \co{rcu_read_lock()} and \co{rcu_read_unlock()} to delimit RCU read-side critical sections, but then use \co{call_rcu_bh()} to post an RCU callback?

\QuickA{} If there happened to be no RCU read-side critical sections delimited by \co{rcu_read_lock_bh()} and \co{rcu_read_unlock_bh()} at the time \co{call_rcu_bh()} was invoked, RCU would be within its rights to invoke the callback immediately, possibly freeing a data structure still being used by the RCU read-side critical section! This is not merely a theoretical possibility: a long-running RCU read-side critical section delimited by \co{rcu_read_lock()} and \co{rcu_read_unlock()} is vulnerable to this failure mode. This vulnerability disappears in -rt kernels, where RCU Classic and RCU BH both map onto a common implementation.

\QuickQ{} Hardware interrupt handlers can be thought of as being under the protection of an implicit \co{rcu_read_lock_bh()}, right?

\QuickA{} Absolutely not! And especially not when using preemptible RCU! If you need to access ``rcu\_bh''-protected data structures in an interrupt handler, you need to provide explicit calls to \co{rcu_read_lock_bh()} and \co{rcu_read_unlock_bh()}.

\QuickQ{} What happens if you mix and match RCU Classic and RCU Sched?
\QuickA{} In a non-\co{PREEMPT} or a \co{PREEMPT} kernel, mixing these two works ``by accident'' because in those kernel builds, RCU Classic and RCU Sched map to the same implementation. However, this mixture is fatal in \co{PREEMPT_RT} builds using the -rt patchset, due to the fact that Realtime RCU's read-side critical sections can be preempted, which would permit \co{synchronize_sched()} to return before the RCU read-side critical section reached its \co{rcu_read_unlock()} call. This could in turn result in a data structure being freed before the read-side critical section was finished with it, which could in turn greatly increase the actuarial risk experienced by your kernel. In fact, the split between RCU Classic and RCU Sched was inspired by the need for preemptible RCU read-side critical sections.

\QuickQ{} In general, you cannot rely on \co{synchronize_sched()} to wait for all pre-existing interrupt handlers, right?

\QuickA{} That is correct! Because -rt Linux uses threaded interrupt handlers, there can be context switches in the middle of an interrupt handler. Because \co{synchronize_sched()} waits only until each CPU has passed through a context switch, it can return before a given interrupt handler completes. If you need to wait for a given interrupt handler to complete, you should instead use \co{synchronize_irq()} or place explicit RCU read-side critical sections in the interrupt handlers that you wish to wait on.

\QuickQ{} Why do both SRCU and QRCU lack asynchronous \co{call_srcu()} or \co{call_qrcu()} interfaces?

\QuickA{} Given an asynchronous interface, a single task could register an arbitrarily large number of SRCU or QRCU callbacks, thereby consuming an arbitrarily large quantity of memory. In contrast, given the current synchronous \co{synchronize_srcu()} and \co{synchronize_qrcu()} interfaces, a given task must finish waiting for a given grace period before it can start waiting for the next one.

\QuickQ{} Under what conditions can \co{synchronize_srcu()} be safely used within an SRCU read-side critical section?

\QuickA{} In principle, you can use \co{synchronize_srcu()} with a given \co{srcu_struct} within an SRCU read-side critical section that uses some other \co{srcu_struct}. In practice, however, doing this is almost certainly a bad idea. In particular, the code shown in Figure~\ref{fig:defer:Multistage SRCU Deadlocks} could still result in deadlock.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
  1 idx = srcu_read_lock(&ssa);
  2 synchronize_srcu(&ssb);
  3 srcu_read_unlock(&ssa, idx);
  4
  5 /* . . . */
  6
  7 idx = srcu_read_lock(&ssb);
  8 synchronize_srcu(&ssa);
  9 srcu_read_unlock(&ssb, idx);
\end{verbatim}
}
\caption{Multistage SRCU Deadlocks}
\label{fig:defer:Multistage SRCU Deadlocks}
\end{figure}

\QuickQ{} Why doesn't \co{list_del_rcu()} poison both the \co{next} and \co{prev} pointers?

\QuickA{} Poisoning the \co{next} pointer would interfere with concurrent RCU readers, who must use this pointer. However, RCU readers are forbidden from using the \co{prev} pointer, so it may safely be poisoned.

\QuickQ{} Normally, any pointer subject to \co{rcu_dereference()} \emph{must} always be updated using \co{rcu_assign_pointer()}. What is an exception to this rule?

\QuickA{} One such exception is when a multi-element linked data structure is initialized as a unit while inaccessible to other CPUs, and then a single \co{rcu_assign_pointer()} is used to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use \co{rcu_assign_pointer()}, though any such assignments that happen after the structure is globally visible \emph{must} use \co{rcu_assign_pointer()}. However, unless this initialization code is on an impressively hot code-path, it is probably wise to use \co{rcu_assign_pointer()} anyway, even though it is in theory unnecessary. It is all too easy for a ``minor'' change to invalidate your cherished assumptions about the initialization happening privately.

\QuickQ{} Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?

\QuickA{} It can sometimes be difficult for automated code checkers such as ``sparse'' (or indeed for human beings) to work out which type of RCU read-side critical section a given RCU traversal primitive corresponds to. For example, consider the code shown in Figure~\ref{fig:defer:Diverse RCU Read-Side Nesting}.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
  1 rcu_read_lock();
  2 preempt_disable();
  3 p = rcu_dereference(global_pointer);
  4
  5 /* . . . */
  6
  7 preempt_enable();
  8 rcu_read_unlock();
\end{verbatim}
}
\caption{Diverse RCU Read-Side Nesting}
\label{fig:defer:Diverse RCU Read-Side Nesting}
\end{figure}
Is the \co{rcu_dereference()} primitive in an RCU Classic or an RCU Sched critical section? What would you have to do to figure this out?

\QuickQ{} Why wouldn't any deadlock in the RCU implementation in Figure~\ref{fig:defer:Lock-Based RCU Implementation} also be a deadlock in any other RCU implementation?

\QuickA{}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 void foo(void)
  2 {
  3   spin_lock(&my_lock);
  4   rcu_read_lock();
  5   do_something();
  6   rcu_read_unlock();
  7   do_something_else();
  8   spin_unlock(&my_lock);
  9 }
 10
 11 void bar(void)
 12 {
 13   rcu_read_lock();
 14   spin_lock(&my_lock);
 15   do_some_other_thing();
 16   spin_unlock(&my_lock);
 17   do_whatever();
 18   rcu_read_unlock();
 19 }
\end{verbatim}
}
\caption{Deadlock in Lock-Based RCU Implementation}
\label{fig:defer:Deadlock in Lock-Based RCU Implementation}
\end{figure}
Suppose the functions \co{foo()} and \co{bar()} in Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation} are invoked concurrently from different CPUs. Then \co{foo()} will acquire \co{my_lock} on line~3, while \co{bar()} will acquire \co{rcu_gp_lock} on line~13. When \co{foo()} advances to line~4, it will attempt to acquire \co{rcu_gp_lock}, which is held by \co{bar()}. Then when \co{bar()} advances to line~14, it will attempt to acquire \co{my_lock}, which is held by \co{foo()}. Each function is then waiting for a lock that the other holds, a classic deadlock. Other RCU implementations neither spin nor block in \co{rcu_read_lock()}, hence avoiding deadlocks.

\QuickQ{} Why not simply use reader-writer locks in the RCU implementation in Figure~\ref{fig:defer:Lock-Based RCU Implementation} in order to allow RCU readers to proceed in parallel?

\QuickA{} One could in fact use reader-writer locks in this manner. However, textbook reader-writer locks suffer from memory contention, so that the RCU read-side critical sections would need to be quite long to actually permit parallel execution~\cite{McKenney03a}. On the other hand, use of a reader-writer lock that is read-acquired in \co{rcu_read_lock()} would avoid the deadlock condition noted above.
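The following is a minimal sketch of such a reader-writer-lock-based approach using POSIX pthreads. It is emphatically \emph{not} the implementation shown in Figure~\ref{fig:defer:Lock-Based RCU Implementation}, merely an illustration of the idea: readers read-acquire the lock, and can therefore run in parallel with one another, while \co{synchronize_rcu()} write-acquires it, and therefore waits for all pre-existing readers. The name \co{rcu_gp_rwlock} is specific to this sketch.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

/* Sketch only: a toy RCU built on a pthread rwlock. */
static pthread_rwlock_t rcu_gp_rwlock =
  PTHREAD_RWLOCK_INITIALIZER;

static void rcu_read_lock(void)
{
  /* Readers do not exclude one another. */
  pthread_rwlock_rdlock(&rcu_gp_rwlock);
}

static void rcu_read_unlock(void)
{
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}

static void synchronize_rcu(void)
{
  /* Write-acquisition waits for all pre-existing readers. */
  pthread_rwlock_wrlock(&rcu_gp_rwlock);
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Of course, this sketch inherits the memory contention of the underlying reader-writer lock, so its read-side overhead remains far from the zero-cost ideal.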
\QuickQ{} Wouldn't it be cleaner to acquire all the locks, and then release them all in the loop from lines~15-18 of Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}? After all, with this change, there would be a point in time when there were no readers, simplifying things greatly.

\QuickA{} Making this change would re-introduce the deadlock, so no, it would not be cleaner.

\QuickQ{} Is the implementation shown in Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation} free from deadlocks? Why or why not?

\QuickA{} One deadlock is where a lock is held across \co{synchronize_rcu()}, and that same lock is acquired within an RCU read-side critical section. However, this situation will deadlock any correctly designed RCU implementation. After all, the \co{synchronize_rcu()} primitive must wait for all pre-existing RCU read-side critical sections to complete, but if one of those critical sections is spinning on a lock held by the thread executing the \co{synchronize_rcu()}, we have a deadlock inherent in the definition of RCU. Another deadlock happens when attempting to nest RCU read-side critical sections. This deadlock is peculiar to this implementation, and might be avoided by using recursive locks, or by using reader-writer locks that are read-acquired by \co{rcu_read_lock()} and write-acquired by \co{synchronize_rcu()}. However, if we exclude the above two cases, this implementation of RCU does not introduce any deadlock situations. This is because the only time some other thread's lock is acquired is when executing \co{synchronize_rcu()}, and in that case, the lock is immediately released, prohibiting a deadlock cycle that does not involve a lock held across the \co{synchronize_rcu()}, which is the first case above.

\QuickQ{} Isn't one advantage of the RCU algorithm shown in Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation} that it uses only primitives that are widely available, for example, in POSIX pthreads?

\QuickA{} This is indeed an advantage, but do not forget that \co{rcu_dereference()} and \co{rcu_assign_pointer()} are still required, which means \co{volatile} manipulation for \co{rcu_dereference()} and memory barriers for \co{rcu_assign_pointer()}. Of course, many Alpha CPUs require memory barriers for both primitives. (A rough userspace sketch of both primitives appears below.)

\QuickQ{} But what if you hold a lock across a call to \co{synchronize_rcu()}, and then acquire that same lock within an RCU read-side critical section?

\QuickA{} Indeed, this would deadlock any legal RCU implementation. But is \co{rcu_read_lock()} \emph{really} participating in the deadlock cycle? If you believe that it is, then please ask yourself this same question when looking at the RCU implementation in Section~\ref{defer:RCU Based on Quiescent States}.

\QuickQ{} How can the grace period possibly elapse in 40 nanoseconds when \co{synchronize_rcu()} contains a 10-millisecond delay?

\QuickA{} The update-side test was run in the absence of readers, so the \co{poll()} system call was never invoked. In addition, the actual code has this \co{poll()} system call commented out, the better to evaluate the true overhead of the update-side code. Any production uses of this code would be better served by using the \co{poll()} system call, but then again, production uses would be even better served by other implementations shown later in this section.
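As promised in the answer above concerning POSIX pthreads primitives, the following is a rough userspace sketch of \co{rcu_assign_pointer()} and \co{rcu_dereference()}. These are illustrative definitions only, assuming gcc's \co{__sync_synchronize()} builtin and statement expressions; they are not the Linux kernel's definitions. The barrier in \co{rcu_dereference()} is strictly needed only on CPUs such as DEC Alpha that do not respect data dependencies, and a full memory barrier is used here purely for simplicity.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Sketch only: userspace approximations, not kernel code. */
#define rcu_assign_pointer(p, v) \
  do { \
    __sync_synchronize(); /* order initialization before publish */ \
    (p) = (v); \
  } while (0)

#define rcu_dereference(p) \
  ({ \
    __typeof__(p) p1_ = *(volatile __typeof__(p) *)&(p); \
    __sync_synchronize(); /* needed only on DEC Alpha */ \
    p1_; \
  })
\end{verbatim}
\end{minipage}
\vspace{5pt}
The \co{volatile} cast in \co{rcu_dereference()} prevents the compiler from refetching or caching the pointer, which is the \co{volatile} manipulation referred to above.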
\QuickQ{} Why not simply make \co{rcu_read_lock()} wait when a concurrent \co{synchronize_rcu()} has been waiting too long in the RCU implementation in Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}? Wouldn't that prevent \co{synchronize_rcu()} from starving?

\QuickA{} Although this would in fact eliminate the starvation, it would also mean that \co{rcu_read_lock()} would spin or block waiting for the writer, which is in turn waiting on readers. If one of these readers is attempting to acquire a lock that the spinning/blocking \co{rcu_read_lock()} holds, we again have deadlock. In short, the cure is worse than the disease. See Section~\ref{defer:Starvation-Free Counter-Based RCU} for a proper cure.

\QuickQ{} Why the memory barrier on line~5 of \co{synchronize_rcu()} in Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} given that there is a spin-lock acquisition immediately after?

\QuickA{} The spin-lock acquisition only guarantees that the spin-lock's critical section will not ``bleed out'' to precede the acquisition. It in no way guarantees that code preceding the spin-lock acquisition won't be reordered into the critical section. Such reordering could cause a removal from an RCU-protected list to be reordered to follow the complementing of \co{rcu_idx}, which could allow a newly starting RCU read-side critical section to see the recently removed data element. Exercise for the reader: use a tool such as Promela/spin to determine which (if any) of the memory barriers in Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} are really needed. See Section~\ref{app:formal:Formal Verification} for information on using these tools. The first correct and complete response will be credited.

\QuickQ{} Why is the counter flipped twice in Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}? Shouldn't a single flip-and-wait cycle be sufficient?

\QuickA{} Both flips are absolutely required. To see this, consider the following sequence of events:
\begin{enumerate}
\item Line~8 of \co{rcu_read_lock()} in Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair} picks up \co{rcu_idx}, finding its value to be zero.
\item Line~8 of \co{synchronize_rcu()} in Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} complements the value of \co{rcu_idx}, setting its value to one.
\item Lines~10-13 of \co{synchronize_rcu()} find that the value of \co{rcu_refcnt[0]} is zero, and thus return. (Recall that the question is asking what happens if lines~14-20 are omitted.)
\item Lines~9 and 10 of \co{rcu_read_lock()} store the value zero to this thread's instance of \co{rcu_read_idx} and increment \co{rcu_refcnt[0]}, respectively. Execution then proceeds into the RCU read-side critical section. \label{defer:rcu_rcgp:RCU Read Side Start}
\item Another instance of \co{synchronize_rcu()} again complements \co{rcu_idx}, this time setting its value to zero. Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()} returns immediately. (Recall that \co{rcu_read_lock()} incremented \co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!) \label{defer:rcu_rcgp:RCU Grace Period Start}
\item The grace period that started in step~\ref{defer:rcu_rcgp:RCU Grace Period Start} has been allowed to end, despite the fact that the RCU read-side critical section that started beforehand in step~\ref{defer:rcu_rcgp:RCU Read Side Start} has not completed.
This violates RCU semantics, and could allow the update to free a data element that the RCU read-side critical section was still referencing.
\end{enumerate}
Exercise for the reader: What happens if \co{rcu_read_lock()} is preempted for a very long time (hours!) just after line~8? Does this implementation operate correctly in that case? Why or why not? The first correct and complete response will be credited.

\QuickQ{} Given that atomic increment and decrement are so expensive, why not just use non-atomic increment on line~10 and a non-atomic decrement on line~25 of Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}?

\QuickA{} Using non-atomic operations would cause increments and decrements to be lost, in turn causing the implementation to fail. See Section~\ref{defer:Scalable Counter-Based RCU} for a safe way to use non-atomic operations in \co{rcu_read_lock()} and \co{rcu_read_unlock()}.

\QuickQ{} Come off it! We can see the \co{atomic_read()} primitive in \co{rcu_read_lock()}!!! So why are you trying to pretend that \co{rcu_read_lock()} contains no atomic operations???

\QuickA{} The \co{atomic_read()} primitive does not actually execute atomic machine instructions, but rather does a normal load from an \co{atomic_t}. Its sole purpose is to keep the compiler's type-checking happy. If the Linux kernel ran on 8-bit CPUs, it would also need to prevent ``store tearing'', which could happen due to the need to store a 16-bit pointer with two eight-bit accesses on some 8-bit systems. But thankfully, it seems that no one runs Linux on 8-bit systems.

\QuickQ{} Great, if we have $N$ threads, we can have $2N$ ten-millisecond waits (one set per \co{flip_counter_and_wait()} invocation), and even that assumes that we wait only once for each thread. Don't we need the grace period to complete \emph{much} more quickly?

\QuickA{} Keep in mind that we only wait for a given thread if that thread is still in a pre-existing RCU read-side critical section, and that waiting for one hold-out thread gives all the other threads a chance to complete any pre-existing RCU read-side critical sections that they might still be executing. So the only way that we would wait for $2N$ intervals would be if the last thread still remained in a pre-existing RCU read-side critical section despite all the waiting for all the prior threads. In short, this implementation will not wait unnecessarily. However, if you are stress-testing code that uses RCU, you might want to comment out the \co{poll()} statement in order to better catch bugs that incorrectly retain a reference to an RCU-protected data element outside of an RCU read-side critical section.

\QuickQ{} All of these toy RCU implementations have either atomic operations in \co{rcu_read_lock()} and \co{rcu_read_unlock()}, or \co{synchronize_rcu()} overhead that increases linearly with the number of threads. Under what circumstances could an RCU implementation enjoy light-weight implementations for all three of these primitives, all having deterministic ($O\left(1\right)$) overheads and latencies?

\QuickA{} Special-purpose uniprocessor implementations of RCU can attain this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}.

\QuickQ{} If any even value is sufficient to tell \co{synchronize_rcu()} to ignore a given task, why doesn't line~10 of Figure~\ref{fig:defer:Free-Running Counter Using RCU} simply assign zero to \co{rcu_reader_gp}?
\QuickA{} Assigning zero (or any other even-numbered constant) would in fact work, but assigning the value of \co{rcu_gp_ctr} can provide a valuable debugging aid, as it gives the developer an idea of when the corresponding thread last exited an RCU read-side critical section.

\QuickQ{} Why are the memory barriers on lines~17 and 29 of Figure~\ref{fig:defer:Free-Running Counter Using RCU} needed? Aren't the memory barriers inherent in the locking primitives on lines~18 and 28 sufficient?

\QuickA{} These memory barriers are required because the locking primitives are only guaranteed to confine the critical section. The locking primitives are under absolutely no obligation to keep other code from bleeding into the critical section. The pair of memory barriers is therefore required to prevent this sort of code motion, whether performed by the compiler or by the CPU.

\QuickQ{} Couldn't the update-side optimization described in Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods} be applied to the implementation shown in Figure~\ref{fig:defer:Free-Running Counter Using RCU}?

\QuickA{} Indeed it could, with a few modifications. This work is left as an exercise for the reader.

\QuickQ{} Is the possibility of readers being preempted in line~3 of Figure~\ref{fig:defer:Free-Running Counter Using RCU} a real problem, in other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed?

\QuickA{} It is a real problem, there is a sequence of events leading to failure, and there are a number of possible ways of addressing it. For more details, see the Quick Quizzes near the end of Section~\ref{defer:Nestable RCU Based on Free-Running Counter}. The reason for locating the discussion there is (1) to give you more time to think about it, and (2) because the nesting support added in that section greatly reduces the time required to overflow the counter.

\QuickQ{} Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?

\QuickA{} The apparent simplicity of the separate per-thread variable is a red herring. This approach incurs much greater complexity in the guise of careful ordering of operations, especially if signal handlers are to be permitted to contain RCU read-side critical sections. But don't take my word for it: code it up and see what you end up with!

\QuickQ{} Given the algorithm shown in Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}, how could you double the time required to overflow the global \co{rcu_gp_ctr}?

\QuickA{} One way would be to replace the magnitude comparison on lines~33 and 34 with an inequality check of the per-thread \co{rcu_reader_gp} variable against \co{rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT}.

\QuickQ{} Again, given the algorithm shown in Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}, is counter overflow fatal? Why or why not? If it is fatal, what can be done to fix it?

\QuickA{} It can indeed be fatal. To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~0 enters \co{rcu_read_lock()}, determines that it is not nested, and therefore fetches the value of the global \co{rcu_gp_ctr}. Thread~0 is then preempted for an extremely long time (before storing to its per-thread \co{rcu_reader_gp} variable).
\item Other threads repeatedly invoke \co{synchronize_rcu()}, so that the new value of the global \co{rcu_gp_ctr} is now \co{RCU_GP_CTR_BOTTOM_BIT} less than it was when thread~0 fetched it.
\item Thread~0 now starts running again, and stores into its per-thread \co{rcu_reader_gp} variable. The value it stores is \co{RCU_GP_CTR_BOTTOM_BIT+1} greater than that of the global \co{rcu_gp_ctr}.
\item Thread~0 acquires a reference to RCU-protected data element~A.
\item Thread~1 now removes the data element~A that thread~0 just acquired a reference to.
\item Thread~1 invokes \co{synchronize_rcu()}, which increments the global \co{rcu_gp_ctr} by \co{RCU_GP_CTR_BOTTOM_BIT}. It then checks all of the per-thread \co{rcu_reader_gp} variables, but thread~0's value (incorrectly) indicates that it started after thread~1's call to \co{synchronize_rcu()}, so thread~1 does not wait for thread~0 to complete its RCU read-side critical section.
\item Thread~1 then frees up data element~A, which thread~0 is still referencing.
\end{enumerate}
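To get a rough feel for the timescale involved, suppose (purely hypothetically, since the actual constants depend on the implementation in Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}) that \co{rcu_gp_ctr} is a 32-bit counter, that its low-order eight bits are reserved for read-side nesting depth, and that each grace period takes roughly the ten milliseconds of the \co{poll()} delay discussed in earlier answers, assuming this implementation retains it. The counter would then wrap after about $2^{32-8} = 2^{24}$ grace periods, or roughly $1.7 \times 10^{5}$ seconds, which is a bit under two days. This is short enough that a long-preempted or long-blocked reader, as in the sequence above, is a practical rather than a merely theoretical concern, and it also shows why the nesting support was said to greatly reduce the time required to overflow the counter: each additional bit reserved for nesting halves the wraparound time.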