
bitfield optimization of phase2 #120

Merged — 11 commits merged from bitfield-cleanup into master on Nov 9, 2020
Conversation

@arvidn (Contributor) commented Nov 4, 2020

This avoids some of the on-disk sorting in phase 2 (back-propagation) by recording which entries in table_index are used by table_index + 1. Phase 2 now looks like this (a rough code sketch follows the list below):

  • loop from table 7..1:
    • scan current table:
      • if table < 7, skip any entries not set in current_bitfield
      • record the indices we find in the current table by setting the corresponding bit in previous_bitfield
    • if table < 7: scan current table (again):
      • compact the table by skipping entries whose bit is not set in current_bitfield
      • write back each pos and offset with an adjusted position, based on the new (compacted) locations recorded in previous_bitfield
    • current_bitfield = previous_bitfield
    • clear all bits in previous_bitfield
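
Roughly, in code, one iteration of that loop looks like the sketch below. The names here (Bits, Entry, cur.read()/cur.size(), out.write()) are simplified stand-ins rather than the real chiapos classes, and the linear-scan rank() stands in for the O(1) bitfield_index lookup:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// stand-in for the real bitfield + bitfield_index pair
struct Bits {
    std::vector<bool> b;
    explicit Bits(uint64_t n) : b(n, false) {}
    void set(uint64_t i) { b[i] = true; }
    bool get(uint64_t i) const { return b[i]; }
    void clear() { std::fill(b.begin(), b.end(), false); }
    // number of set bits strictly below pos; the real code answers this in
    // O(1) via a bitfield_index of precomputed counts
    uint64_t rank(uint64_t pos) const {
        return std::count(b.begin(), b.begin() + pos, true);
    }
};

struct Entry { uint64_t pos; uint64_t offset; };

// one iteration of the "loop from table 7..1" above, for a table < 7
// (table 7 differs slightly: none of its entries are filtered out, and its
// write-back does not go through a SortManager)
template <typename Table, typename Sink>
void compact_one_table(Table& cur, Bits const& current, Bits& previous, Sink& out)
{
    // pass 1: mark which previous-table entries the surviving entries reference
    for (uint64_t i = 0; i < cur.size(); ++i) {
        if (!current.get(i)) continue;              // dropped entry
        Entry const e = cur.read(i);
        previous.set(e.pos);
        previous.set(e.pos + e.offset);
    }
    // pass 2: write back surviving entries with pos/offset adjusted to where
    // the previous table's entries will land once it, too, is compacted
    for (uint64_t i = 0; i < cur.size(); ++i) {
        if (!current.get(i)) continue;
        Entry e = cur.read(i);
        uint64_t const new_pos = previous.rank(e.pos);
        e.offset = previous.rank(e.pos + e.offset) - new_pos;
        e.pos = new_pos;
        out.write(e);                               // in this PR the sink is a SortManager
    }
}
```

After each table the caller swaps the bitfields (current_bitfield = previous_bitfield) and clears previous_bitfield, exactly as in the last two bullets.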

In order to hand the tables over in the correct order, sorted the same way the current phase 2 sorts them, the "write-back" step is actually done into a SortManager object. This object is then passed on to phase 3 as if it were a Disk.

Enabling a SortManager to implement the Disk interface required a few changes. The Disk interface now returns a pointer to the entry when reading from it. This is how SortManager returns data, and I envision it's also how we would return data if we transition to memory-mapped files.
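
A minimal sketch of what that interface change amounts to (the Read() signature matches the one quoted in the review below; the Write() signature here is just an assumption for illustration):

```cpp
#include <cstdint>

struct Disk {
    // returns a pointer to the requested bytes, valid until the next call,
    // instead of copying them into a caller-provided buffer
    virtual uint8_t const* Read(uint64_t begin, uint64_t length) = 0;
    virtual void Write(uint64_t begin, uint8_t const* src, uint64_t length) = 0;
    virtual ~Disk() = default;
};
```

Returning a pointer is what lets a SortManager hand out entries directly from its own buckets, and it maps naturally onto memory-mapped files later.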

To allow passing a SortManager from one phase to the next, they now heap-allocate their memory rather than relying on being passed a buffer that is later reused.

Another optimization of the algorithm described above is that the last table, table 1, isn't sorted; it just needs to be compacted. Instead of running a read and write-back pass over table 1, there's yet another abstraction implementing the Disk interface: FilteredDisk. It wraps a normal file along with a bitfield and provides an object that appears to be a compacted file, i.e. all the entries not set in the bitfield are skipped when reading from the file. This saves another read and write pass over table 1.
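
The idea, sketched with stand-in names and a fixed entry size (not the real class's interface):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// wraps an underlying file and a keep-bitfield; reads skip entries whose
// bit is unset, so callers see an already-compacted table
template <typename File>   // File::Read(offset, length) returns uint8_t const*
class FilteredFile {
public:
    FilteredFile(File& f, std::vector<bool> keep, uint64_t entry_size)
        : file_(f), keep_(std::move(keep)), entry_size_(entry_size) {}

    // returns the next surviving entry, or nullptr at end of table
    uint8_t const* ReadNext()
    {
        while (cursor_ < keep_.size() && !keep_[cursor_]) ++cursor_;
        if (cursor_ >= keep_.size()) return nullptr;
        return file_.Read(cursor_++ * entry_size_, entry_size_);
    }

private:
    File& file_;
    std::vector<bool> keep_;
    uint64_t entry_size_;
    uint64_t cursor_ = 0;
};
```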

The last change necessary to unify the interfaces of SortManager and Disk was to provide a BufferedDisk. It wraps a plain file on disk but implements a read and write cache. This simplifies the loops in phase 2 and phase 3, which no longer need to do their own read caching. Similarly, the SortManager no longer needs to do its own write caching when writing to buckets.
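
The read-cache half of that wrapper amounts to something like this (a sketch only; the 4 MiB size and the 7-byte head-room mirror numbers mentioned elsewhere in this PR, everything else is illustrative):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

class BufferedReader {
public:
    explicit BufferedReader(std::FILE* f) : file_(f), buf_(kCacheSize) {}

    // returns a pointer valid until the next Read() call
    uint8_t const* Read(uint64_t begin, uint64_t length)
    {
        // 7 bytes of head-room because SliceInt64FromBytes() may overrun
        if (begin < buf_start_ || begin + length + 7 > buf_start_ + buf_valid_) {
            // cache miss: refill starting at the requested offset
            std::fseek(file_, static_cast<long>(begin), SEEK_SET);
            buf_valid_ = std::fread(buf_.data(), 1, buf_.size(), file_);
            buf_start_ = begin;
        }
        return buf_.data() + (begin - buf_start_);
    }

private:
    static constexpr uint64_t kCacheSize = 4 * 1024 * 1024;  // 4 MiB read-ahead
    std::FILE* file_;
    std::vector<uint8_t> buf_;
    uint64_t buf_start_ = 0;
    uint64_t buf_valid_ = 0;
};
```

The write side would be symmetric: writes accumulate in a buffer of the same size and get flushed to the file when it fills.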

The intention of this patch is to reduce disk I/O. Enabling disk I/O logging and testing with:

./ProofOfSpace create -k 24 -s 8192

Yields this result:

total-read:        5863072979
total-write:       5307037029

This can be compared to the current master I/O, with the same command:

total-read:        5713090499
total-write:       6035790148

That's a 12.07% reduction in writes and a 2.63% increase in reads.
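
For reference, those percentages follow directly from the totals above:

writes: (6035790148 - 5307037029) / 6035790148 ≈ 12.07% fewer
reads:  (5863072979 - 5713090499) / 5713090499 ≈ 2.63% more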

When the full working set fits in RAM (i.e. no actual disk I/O is performed) this patch runs slower. This is most likely because the sorting of the buckets in phase 3 is done twice, since phase 3 performs two read passes over the tables. It's expected to be faster when running against a disk, plotting a file much larger than available RAM.

…fies loops that iterate files, by not having them read larger chunks, but considering one entry at a time.
struct bitfield_index
{
// cache the number of set bits evey n bits. This is n. For a bitfield of
// size 2^32, this means a 2 MiB index
Contributor:

Suggested comment change:

// Cache the number of set bits every kIndexBucket bits.
// For a bitfield of 2^32 entries, this means a 200 KiB index.

bitfield_index const idx(bitfield);
CHECK(idx.lookup(1048576 - 3, 1) == std::pair<uint64_t, uint64_t>{0,1});
CHECK(idx.lookup(1048576 - 2, 1) == std::pair<uint64_t, uint64_t>{1,1});
}
@aqk (Contributor) commented Nov 5, 2020:


Suggest boundary condition tests for bitfield_index:

// One past bucket
idx.lookup(pos=0, offset=kIndexBucket)

// One past bitfield.size()

// Init a bitfield_index with a bitfield that has .size == kIndexBucket-1, kIndexBucket, kIndexBucket+1
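
For illustration, one such boundary test might look roughly like the following. This is only a sketch: it assumes the bitfield can be constructed with its size and populated with set(), and that kIndexBucket is visible to the test; it is not taken from the PR itself.

```cpp
// hypothetical boundary test: an offset spanning exactly one index bucket,
// with both referenced bits set so the lookup is valid (see the reply below)
bitfield small(kIndexBucket + 1);
small.set(0);
small.set(kIndexBucket);
bitfield_index const small_idx(small);
CHECK(small_idx.lookup(0, kIndexBucket) == std::pair<uint64_t, uint64_t>{0, 1});
```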

@arvidn (Author):

I used to have tests for unset bits, checking that they would throw. But this is a pretty hot function, so I backed out that check. Now it's up to the caller to only use valid positions and offsets (hence the asserts).

uint64_t const bucket = pos / kIndexBucket;

assert(bucket < index_.size());
assert(pos < bitfield_.size());
Contributor:

assert(pos < bitfield_.size());

I see. So, it's okay for pos >= kIndexBucket?

@arvidn (Author):

Yes, they are different domains. kIndexBucket is the interval at which counts of set bits are precomputed; the size of the index of those precomputed counts is bitfield_.size() / kIndexBucket.

It's OK for pos to be greater than bucket * kIndexBucket, but only as long as it's < bitfield_.size().
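
For illustration, the way the per-bucket index and a count over the remainder combine is roughly this (a simplified free-function sketch, not the PR's exact implementation; the real code counts whole 64-bit words with popcnt instead of looping bit by bit):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// number of set bits strictly below pos, using index[b] = set bits before
// bucket b (one entry per kIndexBucket bits), then scanning the remainder
uint64_t count_below(std::vector<uint64_t> const& index,
                     std::vector<bool> const& bits,
                     uint64_t kIndexBucket,
                     uint64_t pos)
{
    uint64_t const bucket = pos / kIndexBucket;
    assert(bucket < index.size());
    assert(pos < bits.size());
    uint64_t count = index[bucket];
    for (uint64_t i = bucket * kIndexBucket; i < pos; ++i)
        count += bits[i] ? 1 : 0;
    return count;
}
```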

{
uint64_t counter = 0;
auto it = bitfield_.begin();
index_.reserve(bitfield_.size() / kIndexBucket);
Contributor:

index_.reserve(bitfield_.size() / kIndexBucket);

Should this be index_.reserve((bitfield_.size() / kIndexBucket)+1);, or is the index not used for the last bucket where bucket size < kIndexBucket ?

@arvidn (Author):

No, I think this is right. This provides an index of the number of set bits at the start of every kIndexBucket bits, so it rounds down.

@aqk (Contributor) left a comment:

This is large, so I'm sending my first wave of comments now.

src/phase2.hpp Outdated
// table 7, where we don't use the sort_manager; then it's used as a write
// cache for the table, as we update it.

// As the last step, we compact table 1. At that point the halfes are also
Contributor:

halfes -> halves or halfs

@arvidn (Author):

This comment is removed in a later commit, so I think I'll leave it misspelled. If I updated this commit, it would make the PR harder to review, since the changes wouldn't come in as commits on top of what's here.

src/phase2.hpp Outdated
@@ -149,100 +157,100 @@ std::vector<uint64_t> RunPhase2(
0,
strategy_t::quicksort);

if (table_index == 7) {
// we don't use the sort manager in this case, so we can use the
// memort as a write buffer instead
Contributor:

memort

@arvidn (Author):

Same thing here. It's supposed to say "memory", but this comment is removed in a later commit, when the memory is heap-allocated instead.

src/disk.hpp Outdated
NeedReadCache();
// all allocations need 7 bytes head-room, since
// SliceInt64FromBytes() may overrun by 7 bytes
if (read_buffer_start_ <= begin && read_buffer_start_ + read_buffer_size_ > begin + length + 7) {
// if the read is entirely inside the buffer, just memcopy it out
Contributor:

just memcopy it out -> just pass a pointer to it


void Read(uint64_t begin, uint8_t *memcache, uint64_t length) override
uint8_t const* Read(uint64_t begin, uint64_t length) override
Contributor:

What if the requested read is bigger than uint64_t read_ahead?
Since we pass a pointer out, a valid read can only be as large as that buffer.

@arvidn (Author):

We only read one entry at a time, and I think the largest entry is 7 bytes. I suppose there could be an assert, though.

@@ -42,11 +42,11 @@ FetchContent_Declare(
FetchContent_MakeAvailable(pybind11-src)

IF (CMAKE_BUILD_TYPE STREQUAL "RELEASE")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -mtune=native")
Contributor:

@arvidn isn't this a potential problem if you want to create binaries to distribute?

@arvidn (Author):

It would be, if we built such a binary distribution on an exotic machine. The problem with not tuning it at all is that GCC doesn't even think it has access to popcnt, which the bitfield_index relies on.

@arvidn (Author) commented Nov 6, 2020

@aqk I had a few typos in my additional unit test that catch (for some reason) didn't report, so I didn't notice until CI ran and detected a non-zero return code.

I've fixed that now. I want to squash all the fixup commits into the changes they belong to, but to make my changes easier to review, I left them in as new commits on top for now. Please take a look, then I'll squash.

@arvidn (Author) commented Nov 6, 2020

I ran some large benchmarks overnight (probably a lot larger than they had to be).

This is on macOS, on a fairly powerful laptop plotting against a slow spinning disk connected over USB, so it was most definitely I/O-bound. I wanted to avoid running a test so small that it would fit in RAM, and I don't have much free space on the internal SSD. But it would be nice to get an idea of what the numbers look like on a more reasonable setup with a large SSD as the plotting target.

I ran ./ProofOfSpace create -u 32 -b 4096 -k 31 --threads 8.

I timed these runs with time, but I also piped the output to tee to log it, along with 2>&1. I think this messed up the output from time and I lost the real time (I got the user and system CPU time as well as the total time).

total time with current master: 9:17:47.59
total time with this patch: 9:11:14.91

6 minutes and 33 seconds faster. Not a whole lot given a > 9 hour runtime.

This is the timing output from the runs, master first:

Time for phase 1 = 10833.506 seconds. CPU (116.590%) Fri Nov  6 03:50:20 2020
Time for phase 2 = 7490.284 seconds. CPU (36.240%) Fri Nov  6 05:55:10 2020
Time for phase 3 = 14189.635 seconds. CPU (44.280%) Fri Nov  6 09:51:39 2020
Time for phase 4 = 949.447 seconds. CPU (57.620%) Fri Nov  6 10:07:29 2020
Total time = 33462.874 seconds. CPU (66.270%) Fri Nov  6 10:07:29 2020

Now, for this patch:

Time for phase 1 = 9960.252 seconds. CPU (127.730%) Thu Nov  5 17:42:43 2020
Time for phase 2 = 6451.230 seconds. CPU (53.280%) Thu Nov  5 19:30:14 2020
Time for phase 3 = 15829.239 seconds. CPU (56.150%) Thu Nov  5 23:54:03 2020
Time for phase 4 = 826.582 seconds. CPU (66.470%) Fri Nov  6 00:07:49 2020
Total time = 33067.307 seconds. CPU (77.410%) Fri Nov  6 00:07:49 2020

That's only a 1.18% reduction in runtime. Given that it's still performing 12% fewer writes (at the expense of 2.63% more reads) and it's not making things slower (at least not dramatically), I think it's still a viable change.

I suspect that one reason it's not improving runtime all that much on a spinning disk is that it (probably) increases seek times since phase 3 now reads from two separate sort managers, meaning it will keep two read cursors on disk that it will have to alternate between.

Right now I set the disk read-ahead (and write) buffers to 4 MiB, as that's the minimum extent size on ext4 (iirc). I'll experiment with making that larger, say, 8 or 16 MiB.

@arvidn (Author) commented Nov 6, 2020

I ran another test of this patch with 16 MiB read and write caches. It took almost an hour longer than the two previous tests. I'm thinking there's probably a lot of noise in the way I'm testing this, most likely introduced by the spinning disk.

Interestingly, almost all of the additional time happened in phase 1, which should be unaffected by my patch anyway. There may be some opportunity for improvements in phase 1 on spinning disks.

@arvidn (Author) commented Nov 9, 2020

squashed fixups

@hoffmang9 merged commit 74b2fb2 into master Nov 9, 2020
@hoffmang9 deleted the bitfield-cleanup branch November 9, 2020 15:42