Skip to content
Find file
Fetching contributors…
Cannot retrieve contributors at this time
527 lines (374 sloc) 17 KB
o Consider linking quick quizzes to the answers.
o Consider refactoring test code in CodeSamples directory tree.
Probably some opportunity for common measurement code and
also common argument-parsing code.
o Cite
for reference on CUDA, MPI, etc.
o Add discussion of reader-writer locking to the locking chapter
implementation discussion. Ticket rwlocks, brlock/lglock,
unfair locks (and reasons for them), forward reference to
seqlock, RCU lock, and barrier lock.
o Consider a barrier section in defer chapter.
o Consider expanding on dequeue discussion, especially on queues,
in data-structures chapter (e.g., limits of ordering guarantees).
Need hash tables there, of course!
o Consider using parallel maze solving as an example. See:
A better approach might assign one processor to each end of
the maze. Chopping the maze up into blocks might help, but
this would likely encounter pathological cases with lots
of boundary crossing.
Another approach is to start each CPU at a random point in
the maze (in addition to the one each at start and end).
Give each CPU a color, and when there is a path from beginning
to end, you are done. Possibly an application for parallel
union-find. ;-)
o Add the various livejournal posts on memory order, especially
the one on transitivity. Consider also adding Will Deacon's
January 6 2011 QoS story.
o Update TM section based on OSR "grass is greener" paper and
relevant paragraph from urcu paper.
o Move OSR RCU history paper to appendix.
o Change ASCII double quotes to TeX quotes. [Horst]
o Add a chapter on performance analysis and benchmarking.
o Convert bloatwatch RCU and add to appendix/rcuimpl.
o Use the existence problem as motivation for RCU. Show some
solutions in the locking chapter. See the day-2 ACACES slideset.
See also the RCU tutorial and the SC22WG14 email.
o spin_lock_irqsave(): Critical sections must guarantee forward
progress against everything except NMI handlers.
o raw_spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (including softirq)
and NMI handlers.
o spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (again including
softirq) and NMI handlers and (given CONFIG_PREEMPT_RT)
higher-priority realtime tasks.
o mutex_lock(): Critical sections need not guarantee
forward progress, as general blocking is permitted.
The other issue is the scope of the lock. The Linux kernel has
the following:
o BKL: global scope.
o Everything else: scope defined by the use of the underlying
lock variable.
One of the many reasons that we are trying to get rid of
BKL is because it combines global scope with relatively weak
forward-progress guarantees.
o Data-race detection [Bart Van Assche January 2010]
Helgrind, DRD, ThreadSanitizer, Intel Thread Checker,
Intel Parallel Studio, DIOTA, ...
Valgrind-based tools allow suppression patterns that
span multiple call stack frames, other tools reportedly
do not.
J. Maebe, M. Ronsse, and K. De Bosschere. *DIOTA: Dynamic
instrumentation, optimization and transformation
of applications*. In Proceedings of WBT-2002,
Charlottesville, Virginia, USA, September 2002.
o Consider adding discussion of Amdahl's Law and Gustafson's
Law to the introduction. [Kay Muller]
o Consider adding something about VLIW and SIMD to hwhabits?
Maybe something more about memory hierarchy? [Kay Muller]
o Need some list somewhere of environments that allow you to
avoid worrying about locking and threading: BOINC, map-reduce,
SQL, sequential-program optimizations, ...
o Move memory-allocation code from "Partitioning and Synchronization
Design" to "Data Ownership"?
o Update "SMP Synchronization Design" to cover parallelism in
general. Discuss partitioning up front. Describe two major
cases, pipelining and data parallelism. Code locking can be
analogous to pipelining, but probably needs a separate section.
o What is the form of the need for added performance?
o Throughput in face of many relatively small
units of work. (multiple instances.
data parallelism. DBMS. Web serving.)
o Throughput of large block(s) of work.
(pipelining, or "series". parallel decompostion,
or "parallel".)
o Latency of units of work that are large relative
to the communications overhead. (overlapped
pipelining, parallel decomposition.)
o Latency of an aggregate of many units of work
that are small relative to the communications
overhead. (batch the units of work into larger
units and then re-ask the question relative to
the batches)
o Latency of units of work that are small relative
to the communications overhead. (sequential
programming optimizations. full-up redesign.
approximations/heuristic techniques.
overclocking. implement in hardware, e.g.,
o Large programs or systems might have different needs
in different places -- thus might need combination of
o Move data-structures discussion to this chapter.
o Need to make a chapter on locking (as opposed to the current
one on synchronization). [IN PROGRESS]
o Add quote "weak ordering requires strong minds" and dissection
thereof to memory-ordering discussion.
o Add mechanism to automatically display authorhood.
o Add section to intro/hwhabits.tex giving intuitive notion of
how cache coherence protocol and write buffer makes atomic
instructions expensive. Draw from URochester slide set
(in paper/scalability/TR-TMeval or thereabouts).
Move intro/hwhabits.tex to cpu/. Add sections with increasingly
detailed pictures of what a CPU does, including speed-of-light
round-trip time (~3 cm, or a bit more than an inch) at 5GHz.
3D fabrication might help, but watch the fabrication cost and
the heat dissipation!!! Also, electrons travel from 3-30 times
more slowly in silicon than light does in a vacuum, and even
more slowly if there are levels of clocked logic in the way.
Include actual numbers/graphs from real machines. Show effects
of various forms of misbehavior.
o My local view adds a modified book.cls that adds the bibliography
to the table of contents. However, the license on book.cls forbids
distributing it without also distributing latex. Need to come up
with a better way!
In the meantime, edit your local copy of book.cls and remove the
"*" following the "chapter" directive in "thebibliography" macro.
o Add more terms to vocabulary. More TM stuff, for example.
Also RCU nomenclature [done].
o Find Herlihy's topology-analogy stuff and add hints to the
formal-methods section.
o Fill out material.
o Fill out "defer" section, starting with fork-join examples.
Relate to state-space complexity.
o Add material from "is parallel programming hard" position
paper. (After publication.)
o Pull in material from LJ2003 RCU and dissertation
demonstrating locking performance issues.
o Need citation for Matlab*p.
x Add Steve's and my dyntick paper to analysis as an example
of simpler examples.
o Add stuff from differential profiling paper.
o RCU implementations appendix:
x RCU requirements/desiderata. (See paper on
desk at home for start, along with the usual
papers.) Also look in the advsync/rcu.tex section
on RCU, which needs to be folded into the
sections in defer/
o Uniprocessor RCU implementation. Simplify
the current one so that the rcu and rcu_bh
implementations are identical, as appropriate
for a device not subject to networking DoS
x Toy implementations. Follow wikipedia, but add
reference-count approach.
x Performance and scalability for toy versions.
x User-level RCU implementations (after
o Non-realtime kernel implementations. Take from
dissertation and from LWN articles. Or, given
treercu, just cite dissertation.
x Realtime kernel implementations.
o RCU priority boosting. I suppose getting a good
boosting implementation would help...
x Hierarchical implementation (after November 20, 2008
due to LWN subscriber-only period).
Also propagate references into
Also add code walkthrough. You will need to do
the walkthrough to find rcutree bugs anyway...
o Derive RCU requirements from list of examples,
propagate list up to defer chapter.
o Crisply present advantages and disadvantages of
RCU in defer chapter.
o Memory barriers -- get principles lined out, then show
"bookend" implementations.
o Add appendix with mathematics for hard realtime response
using locks.
o Create code examples for SMPdesign stuff, fill out
Hierarchical Locking, Resource Allocation Caches,
Performance Summary.
o Refocus the SMPdesign chapter to partitioning.
o Fill out RCU patterns, also corresponding code and
performance summary.
o RCU and atomic list movement (Josh?)
o RCU and reference counts
o Converting rwlocks to RCU
o Consider adding history of RCU from OSR paper (low priority)
o Other stuff Paul McKenney has from previous publications.
o Other stuff that must be generated or harvested from
o Note that pthread_mutex permits static initialization.
(Windows apparently does not.)
x Finish whymemorybarriers.tex
o Add Makefile (and comments) to CodeSamples directories.
(defer and SMPdesign done.)
o Come up with suitable API for example code, so that a single
program can be used for each example, covering the pthreads,
Linux-kernel-module, and other C-language environments.
double dgettimeofday(void);
void spin_lock_init(spinlock_t *sp); /* Must be called. */
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);
thread_id_t create_thread(void *(*func), void *arg);
void *wait_thread(pthread_id_t tid);
-- See CodeSamples/pthreads/api-pthreads.h for a start.
Clearly Java will need its own code.
o Add support for real 64-bit x86 builds.
x Fix the references in defer/rcu*.tex to "green" and "red" cells
in the various tables. Change to refer to the symbols actually
x Verify versions of cartoons. (Need whippersnapper in intro)
x Upgrade Quick Quiz scripts to remove material in second parameter
when aggregating into "answers" section. Trivialize by changing
required format to put closing "}" before end tag. Then get rid
of the special cases for code formatting in QuickQuiz answers.
x Move API description to an appendix.
x Fill out material.
x Finish converting WhyRCU LWN series.
x Add cautions to analysis section. Steve's and
my "Integrating and Validating dynticks and Preemptable RCU"
could be a good start, comparing to the rcutree.c
dynticks implementation.
x Also add "what to do instead of parallel programming"
x Move the formal-verification sections to an appendix.
x Add the dyntick verification. (~/paper/KernelJanitor/RCU/dyntick/LWN)
x Fix rcutorture.h to correctly compute from time. Allow the
test time as a third argument.
x Reference counting: basic problem: make data structure stick
around while you are incrementing the reference count.
Strategies: lock, pre-existing reference, atomic check for
zero reference count coupled with GC/RCU, various
non-blocking-synchronization tricks.
x Copy rcutree full data-structure diagram to section describing
force_quiescent_state() callee traversal of the leaf rcu_node
structures. Decorate with right-facing arrow to show search
direction. Or perhaps copy diagram that shows initialization
x Add \RevisedFrom to origpub.tex to allow something like:
The ideas discussed in Section... were originally presented
in XXX\cite{}.
Allow ranges of sections for OriginallyFrom.
x Discuss other textbooks in intro.
x Figure out some way to get section headers between "Answers to
Quick Quizzes" for different chapters. Want to avoid any
section header for a chapter that has no "quick quizzes".
Not a big deal, nice to have.
x Add a "Tools of the Trade" chapter before "SMP Synchronization
Design" giving an overview of how locking and atomic operations
work. Need simple fork/join, along with foreshadowing of how
good design can simplify implementation.
x Add variables to rest of examples in toyrcu.tex.
x In the double-ended-queue code, consider renaming the
"enqueue" operations to "push" and the "dequeue" operations
to "pop".
x Need a chapter on counting.
x Add the separate-thread approach suggested unwittingly by
Christoph Lameter to the count.tex chapter.
x Embed fonts from .eps files:
See the sed script:
cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\
sed 's+Times-Roman+NimbusSanL-Regu+g' |\
sed 's+Times+NimbusSanL-Regu+g' |\
sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\
sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\
sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\
sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\
sed 's+Helvetica+NimbusSanL-Regu+g' |\
sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\
sed 's+Symbol+StandardSymL+g' > new_graphics.eps
And the additional conversions:
Courier NimbusMonL-Regu
Courier-Bold NimbusMonL-Bold
Courier-Oblique NimbusMonL-ReguObli
Courier-BoldOblique NimbusMonL-BoldObli
x Change calls to bzero() to memset(). [Horst]
x dvipdf is quite slow, especially on 1-up setups:
time dvips perfbook
real 0m3.405s
user 0m3.184s
sys 0m0.224s
time psnup -2
real 0m0.852s
user 0m0.248s
sys 0m0.308s
time ps2pdf perfbook-2up.pdf
real 0m25.383s
user 0m24.646s
sys 0m0.276s
time dvipdf perfbook
real 1m31.456s
user 0m48.347s
sys 0m45.599s
Could parallelism be profitably applied here? ;-)
Or perhaps just careful avoidance of quadratic algorithms?
Use pdflatex instead, much faster!
x Explicitly permit reformatted copies (single-column, ebook, etc.).
x Change from "preemptable" to "preemptible". [Horst]
x Produce an HTML version. One possibility is latex2html
and scripting, and tamasrepus suggests
Challenges include the scripting and special latex macros
that hendle the quick quizzes and the illustration credits.
x Add discussion of seqlock to defer chapter between refcnt and RCU.
Use trivial example showing consistency. Note inapplicability
to pointer-based data structures.
x Update chain of RCU implementations to add
x Line up reviewers. Need a range of viewpoints, and especially
a range of familiarity with SMP coding. [Now that the book is
public, just let them line themselves up!]
x Work out who else to invite to contribute. Possibilities include:
[Now that the book is public, just let them line themselves up!]
o Real-life use of RCU: Dipankar Sarma, Steve Hemminger,
Maneesh Soni, David Howells, ...
o Nick Piggin: radix tree.
o Robert Ohlsson: fib trie.
o Steve Hemminger: brlock replacement.
o SELinux guys.
o Extreme scalability: Christoph Lameter, Jesse Barnes,
other SGI and ex-SGI folks.
o Various architecture maintainers for issues on abstract
CPUs and other hardware issues.
o Chinese angle: guys from Tsinghua U. in Beijing.
o Stuff from Rusty's Unreliable Guides.
o Rusty Commentary version.
o Tom Hart on RCU performance measurement.
Perhaps also on Hazard Pointers (maybe Maged Michael
as well or instead).
o Josh Triplett on RCU error checking in sparse.
o Val Henson on academic research on parallel SMP.
o --
+1-706-542-3706. "Wiki-based" textbooks. This is
the Terry College of Business at U. of Georgia,
but what is wrong with a little cross-pollenation?
The usual experience is that somewhere between 20% and 50%
of the people who get excited about contributing really will
do so.
August 5, 2006: 76 pages
August 12, 2006: 84 pages
August 20, 2006: 90 pages
September 3, 2006: 113 pages
September 17, 2006: 123 pages
-- LJ paper, discussions on memory barriers --
October 11, 2006: 111 pages
November 5, 2006: 135 pages
October 20, 2008: 176 pages
October 30, 2008: 199 pages
November 7, 2008: 205 pages
Change page-layout parameters...
November 7, 2008: 168 pages
November 9, 2008: 200 pages
November 16, 2008: 220 pages
November 30, 2008: 232 pages
December 8, 2008: 250 pages
December 23, 2008: 266 pages
January 13, 2009: 276 pages
January 31, 2009: 288 pages
February 28, 2009: 292 pages
May 24, 2009: 302 pages
July 3, 2009: 308 pages
February 21, 2010: 342 pages
February 21, 2010: 354 pages
Change page-layout parameters...
February 20, 2011: 379 pages
March 23, 2011: Test edition posted on
Something went wrong with that request. Please try again.