
Tombstones purge from hashtable: theory and practice


Given: a hashtable playing the role of a fixed-size LRU cache: on inserting a new entry, the oldest one is removed. DHash and QHash leave so-called tombstones in place of removed entries (such slots are treated as free for insertions and as full for lookup queries and removals). The hashtable therefore gets more and more polluted: free slots slowly run out. An insertion can displace either a free slot (the total free slot count decreases) or a tombstone (the total free slot count doesn't change). The effective load of the hashtable for lookup queries is (capacity - free slots) / capacity, not size / capacity, so queries become more and more expensive. At some point we should rehash the table to purge it of tombstones.
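
A minimal sketch of the bookkeeping described above (the class and field names are illustrative, not actual DHash/QHash internals):

```java
// Slots are either full, free, or tombstones; capacity = size + tombstones + freeSlots.
final class TableLoadState {
    final int capacity;   // c: total number of slots
    int size;             // s: number of full (live) slots
    int tombstones;       // t: slots left behind by removals
    int freeSlots;        // f: never-used slots

    TableLoadState(int capacity, int size) {
        this.capacity = capacity;
        this.size = size;
        this.freeSlots = capacity - size;
    }

    // Load as seen by lookup queries: tombstones count as occupied.
    double effectiveLoad() {
        return (double) (capacity - freeSlots) / capacity;
    }

    // Nominal load: only live entries count.
    double targetLoad() {
        return (double) size / capacity;
    }
}
```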

The problem: at what effective table load should we rehash, depending on the hashtable's target load?

Theory

All the following reasoning applies to the DHash model; however, the QHash model is not very different.

Let
c - the hashtable's capacity
s - the size (fixed)
l - target (and initial) hashtable load. l = s / c
t - current number of tombstone slots
f - current number of free slots. c = s + t + f
irp - total number of "insertion-removal pairs" since the previous rehash or hashtable construction

The probability of displacing a free slot on insertion of a new entry is (f / (t + f)) ^ 2.

Since the total f + t = c - s stays constant, this gives the differential equation dt/d(irp) = ((c - s - t) / (c - s))^2. Solving it with t(0) = 0, we have t(irp) = (c - s) * irp / (c - s + irp).
If irp = 0 then t = 0; if irp → ∞ then t → c - s and f → 0.
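
A quick Monte Carlo sketch to sanity-check the closed form, under the model above (each insertion-removal pair converts a free slot into a tombstone with probability (f / (t + f))^2, otherwise it reuses a tombstone); the class name and constants are made up for the illustration:

```java
import java.util.Random;

// Compares the simulated tombstone count with t(irp) = (c - s) * irp / (c - s + irp).
public class TombstoneGrowth {
    public static void main(String[] args) {
        int c = 1 << 20;          // capacity
        double l = 0.5;           // target load
        int s = (int) (l * c);    // size (fixed)
        int f = c - s;            // free slots
        int t = 0;                // tombstones
        Random random = new Random(42);
        int irpTotal = 2 * s;
        for (int irp = 1; irp <= irpTotal; irp++) {
            double pDisplaceFree = Math.pow((double) f / (t + f), 2);
            if (random.nextDouble() < pDisplaceFree) {
                // The insertion took a free slot; the paired removal leaves a tombstone.
                f--;
                t++;
            }
            if (irp % s == 0) {
                double predicted = (double) (c - s) * irp / ((c - s) + irp);
                System.out.printf("irp = %d: simulated t = %d, predicted t = %.0f%n",
                        irp, t, predicted);
            }
        }
    }
}
```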

Suppose irp_rehash = k * s, i.e. our strategy is to rehash every k * s insertion-removal pairs.
Let rl = (s + t_rehash) / c be the "rehash load": the load at which rehash will occur with this strategy.

Then rl = (l * (k - l + 1)) / (k * l - l + 1).
E.g. if k = 1, i.e. our strategy is to rehash every s insertion-removal pairs, then rl = 2 * l - l^2; for l = 0.5, rl = 0.75.
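
The same formula as a hypothetical helper, reproducing the example above (and the corresponding rows of the table in the next section):

```java
// Rehash load as a function of the target load l and the strategy parameter k:
// rl = l * (k - l + 1) / (k * l - l + 1).
public class RehashLoad {
    static double rehashLoad(double l, double k) {
        return l * (k - l + 1) / (k * l - l + 1);
    }

    public static void main(String[] args) {
        System.out.println(rehashLoad(0.5, 1.0)); // 0.75, matching the example above
        System.out.println(rehashLoad(0.3, 2.5)); // ~0.66, first row of the table below
    }
}
```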

Theory and practice

I made a series of experiments using this benchmark to determine really optimal rehash loads for DHash and QHash.

The first column contains the hashtable target loads.

The next two columns show the expected effective loads at rehash, if we use the proposed strategy of rehashing every k * s insertion-removal pairs. k = 2.5 seems to produce the closest approximation to the optimal rehash loads.

The last two columns contain the really optimal effective loads to rehash at, for the given target loads, for DHash and QHash respectively.

| l | Strategy, k = 1 | Strategy, k = 2.5 | DHash, optimal | QHash, optimal |
|---|---|---|---|---|
| .3 | .51 | .66 | .72-.76 | .75-.8 |
| .4 | .64 | .775 | .83 | .81-.85 |
| .5 | .75 | .86 | .84-.87 | .86-.88 |
| .6 | .84 | .915 | .88-.9 | .9 |
| .7 | .91 | .955 | .93-.94 | .93-.95 |
| .8 | .96 | .98 | .955 | .96 |
| .9 | .99 | .995 | .985 | .985 |
Apparently, the proposed strategy purges too often for low target loads (sparse hashtables) and too rarely for high target loads (dense hashtables).

Approximation of the DHash optimal rehash loads, straight from the DHash column of the table above:
rl = 0.54 + 0.816 * l - 0.363 * l^2
For the QHash column:
rl = 0.6 + 0.671 * l - 0.274 * l^2
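
The two fitted approximations as code, with the coefficients copied from the formulas above (the method names are illustrative):

```java
public class OptimalRehashLoads {
    static double dHashOptimalRehashLoad(double l) {
        return 0.54 + 0.816 * l - 0.363 * l * l;
    }

    static double qHashOptimalRehashLoad(double l) {
        return 0.6 + 0.671 * l - 0.274 * l * l;
    }

    public static void main(String[] args) {
        // e.g. l = 0.5: DHash ~0.86, QHash ~0.87 -- inside the measured ranges in the table
        System.out.println(dHashOptimalRehashLoad(0.5));
        System.out.println(qHashOptimalRehashLoad(0.5));
    }
}
```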

Explanation

In hashtables, the cost of any query consists of a constant base cost plus an additional cost multiplied by the number of probes beyond the first one. I.e. if a query requires 3 probes, its cost is base cost + 2 * additional cost. Of course, the base and additional costs differ for different types of queries (lookup, insertion, removal).
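
The same cost model as a tiny hypothetical helper; the cost units are abstract, not measured numbers:

```java
final class QueryCostModel {
    static double queryCost(double baseCost, double additionalCost, int probes) {
        return baseCost + (probes - 1) * additionalCost;
    }

    public static void main(String[] args) {
        // A 3-probe query costs base + 2 * additional, as in the example above.
        System.out.println(queryCost(1.0, 0.5, 3)); // 2.0
    }
}
```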

At low loads, most queries require only one probe; at very high loads, the average probe count per query rises to 10-20 or more.

During rehash, entries are reinserted into a newly created, clean table, therefore even at high loads at least half of the entries are reinserted at the base cost (the reinsertion requires only one probe), while all ordinary queries (including insertions and removals) require many additional probes. At low loads, regular queries and reinsertions both mostly require just one probe.

That is why at high loads a rehash (reinsertion of every entry) is relatively cheaper than at low loads. But the proposed strategy doesn't take into account that the ratio of regular query cost to reinsertion cost depends on the target load.

Why are the DHash optimal rehash loads lower than the QHash ones?
It definitely means that QHash reinsertions are more expensive relative to regular queries than DHash reinsertions are. But I don't understand why; I expected the opposite, because

  • The first probe cost for DHash includes two expensive integer divisions, while for QHash it includes only one. The costs of additional probes for both algorithms don't include integer divisions and should be similar. So for DHash the first probe is more expensive relative to additional probes than for QHash.
  • QHash average probe counts are 10-28% higher than DHash's at typical loads.

Therefore QHash reinsertions should be cheaper relative to ordinary queries than DHash reinsertions.

TODO: benchmark DHash vs. QHash purge exclusively.

Real practice

In real practice, apart from insertions and removals there are many lookups, which makes the rehash cost relatively less significant; therefore the optimal effective loads to rehash at should be lower than those in the DHash and QHash columns of the table above.

On the other hand, an LRU cache with linked entries keeps track of the oldest entry's slot, making removals very cheap (in my benchmark, removals are ordinary removals).

I failed to obtain optimal rehash loads with any precision when a portion of simple lookups is mixed in along with insertions and removals, because the measurement results vary too much from run to run. Estimation: rl_consider_lookups = rl - 0.05 * (1 - l). With it, the approximation of optimal rehash loads
for DHash is rl = 0.49 + 0.866 * l - 0.363 * l^2,
for QHash: rl = 0.55 + 0.721 * l - 0.274 * l^2.
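
The same lookup-adjusted approximations as code (coefficients from the formulas above, names illustrative):

```java
public class RehashLoadsWithLookups {
    // rl_consider_lookups = rl - 0.05 * (1 - l), folded into the fitted polynomials.
    static double dHashRehashLoadConsideringLookups(double l) {
        return 0.49 + 0.866 * l - 0.363 * l * l;
    }

    static double qHashRehashLoadConsideringLookups(double l) {
        return 0.55 + 0.721 * l - 0.274 * l * l;
    }

    public static void main(String[] args) {
        System.out.println(dHashRehashLoadConsideringLookups(0.5)); // ~0.83
        System.out.println(qHashRehashLoadConsideringLookups(0.5)); // ~0.84
    }
}
```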