In this homework you will start to build your own library of parallel
primitives. We will start from the basics and gradually implement more
interesting and sophisticated algorithms. For this first homework,
we'll implement efficient parallel algorithms for the reduce
primitive, the scan
(or prefix-sum) primitive, and the listrank
primitive. These primitives can be found in their respective
directories.
The rest of the instructions assume that you have cd'd into a given directory, and uses reduce as an example.
Create a new repository by clicking Use this template
->Create a new repository
and work on that.
Please submit your code on Gradescope (you can directly link your github repo). Also, please submit a pdf with answers for the questions in this README on Gradescope. We have put all of the questions in quotes, e.g.:
Make sure your UID and name are clearly visible on your submission.
There are 100 points in total.
- 10 points: auto-graded tests to verify correctness, but not efficiency
- 40 points: code satisfying the required work/depth bounds, breakdown
as follows:
- 10 points: scan code
- 15 points: wyllie code
- 15 points: sampling-based code
- 20 points: writeup for reduce
- 15 points: writeup for scan
- 15 points: writeup for list-ranking
Yes! Please go ahead. Just don't blindly copy code. If you found some resource useful, please include a link to it in your submission.
A Makefile is given in each directory, simply use make
to compile the code.
You can run the code with:
./reduce [num_elements] [num_rounds]
If not specified, the default values are num_elements=1000000000
and num_rounds=3
.
Note that num_elements
is smaller for listrank.
In your command line, set the environment variable PARLAY_NUM_THREADS
using export
. For example, set the number of threads to 4:
export PARLAY_NUM_THREADS=4
You can combine this with running your code, e.g.,:
PARLAY_NUM_THREADS=8 ./reduce 100000000 3
If your machine has 4 cores that are 2-way hyper-threaded, you can expect reasonable performance up to 8 threads. There are many ways to find out the number of threads supported by your machine, but one portable approach (in C++) is
#include <thread>
auto num_threads = std::thread::hardware_concurrency();
We have provided a tester (that we used in Gradescope) in case you wanted something to test your code locally. To set it up, cd to the tester
directory and run the following command (assuming you have python
installed):
pip3 install -r requirements.txt
To run the tester, just run the following command (after cd'ing to the tester
directory):
bash run
For this first primitive we have given a faithful implementation of
the reduce primitive that runs in
The algorithm uses a single parallel primitive, parlay::par_do(f, g)
, which simply runs the functions f
and g
in parallel, and
continues the main thread after both functions finish. This is exactly
what fork
does in the binary-forking model.
Try running reduce on some large inputs, e.g., n=1e8 or n=1e9. Try running on both 1 thread and all threads on your machine and report the running times in your writeup.
As a point of comparison, on a 2021 M1Pro, this serial version on n=1e8 elements takes 1.45 seconds, and the parallel version takes 0.238 seconds (6.1x speedup).
The speedup looks high (big numbers good!), but you may be rightfully skeptical.
You can measure the speedup this way, but is this the right serial baseline?
This 1-thread running time is the so-called "serial elision" of your
parallel program, i.e., the serial program that we get when for each
call to
par_do(f, g);
we simply run
f(); g();
There is still some (small) scheduling overheads incurred when we run
using PARLAY_NUM_THREADS=1 ./reduce ...
. You can make the serial
elision a little faster by ignoring the scheduler altogether and
compiling using make SERIAL=1
(the difference is about 18% on my
machine).
Instead of the recursive serial elision we're using, write a better serial algorithm for reduce which just goes over the input array once and computes the sum of all of the elements.
What is the running time of this serial implementation? Does your previous parallel code still get speedup?
Next, we'll see how to make the parallel algorithm run faster by performing granularity control, i.e., tuning the point at which we switch from forking, to using a serial implementation.
Edit the parallel reduce implementation to use a better serial
baseline. In particular, when
What is the running time of your parallel implementation now, and what is the speedup it obtains over the optimized serial implementation?
We will go through a similar set of steps as in reduce, but now with the
scan primitive (and a bit faster). You will implement the scan
algorithm we designed together in class which runs in
We have provided a serial implementation of an in-place scan in scan.h.
Your parallel scan implementation should be implemented in the
scan_inplace
function. Note that your implementation should mutate
the input array, A
. You may need to allocate some additional memory
in your implementation (you can see an example of this allocation in
scan.cpp).
Measure the self-speedup and the speedup with respect to the serial baseline on n=1e9. Are the two numbers close, or quite different?
If the performance of your parallel implementation was poor relative to the serial implementation, try to optimize your implementation further. Did you try applying granularity control?
In this last part of this homework, you will implement and evaluate three different list ranking algorithms. We have implemented the first one for you, which simply performs list-ranking on an input list sequentially.
The input to the problem is a single linked-list. A list node is defined as
struct ListNode {
ListNode* next;
size_t rank;
ListNode(ListNode* next) : next(next), rank(std::numeric_limits<size_t>::max()) {}
};
The problem is to write the correct value (distance) into each node's
rank
field.
We consider two different distributions.
- The first is a random cyclic permutation, which we generate using a variant of the Knuth Shuffle. (flag = 0)
- The second is a linked list where succ(node(i)) = node(i+1). (flag = 1)
The first list has essentially no locality, whereas the second list has very high locality. Unless specified otherwise, please use the first distribution (the default) for all tests. You can use the second distribution by specifying a command-line flag:
./serial 100000000 3 1
where the third argument sets the distribution flag to 1.
Run the serial algorithm on some large lists, e.g., n=1e7 or n=1e8. What is the largest input that you can solve using the serial implementation in under a minute?
First, please implement Wyllie's algorithm as we have described in class.
You may need to allocate some extra memory to save information (e.g., pointers, rank information) within a round. Make sure you free any temporary arrays that you have allocated.
What is the parallel running time of your Wyllie's implementation? What speedup does it get over serial list ranking for a reasonably large value of n?
Lastly, and as probably the most involved part of this homework, you
will implement a sampling-based method for list-ranking similar to the
one we discussed in class. Instead of aiming to get
Your algorithm should be similar to the algorithm we described in class, with a few changes:
- sample
$O(\sqrt n)$ points - ensure that the head and tail of the list are included in the sample
Now, the algorithm runs in three steps:
- Build a linked-list only on the sampled nodes, where each sample has a "weight" proportional to the number of non-sampled neighbors between it and the next sample
- Run a serial (weighted) list-ranking algorithm on the sample-only linked-list. At this point, the ranks for samples are correctly computed.
- Assign ranks for the non-sampled nodes.
You should carefully think about how to implement this algorithm so as
to ensure that your overall algorithm runs in
What is the parallel performance of your sampling-based implementation compared to your implementation of Wyllie's algorithm?
What is the self-speedup of your sampling based algorithm? When you run the serial elision, how fast is it relative to the sequential list-ranker? The ratio of these two running times empirically measures the work-inefficiency of your parallel algorithm.