faster string table #4088
Conversation
- use a hash table with open addressing and triangular (quadratic) probing for better cache locality (based on numbers from phobos compilation)
- triangular probing is guaranteed to visit every bucket of a power-of-2 sized hash table; this mitigates the clustering problem (with many hash collisions) while still preserving good cache locality
- a much better hash distribution leads to fewer collisions and thus to faster lookups
- allocate StringValues from pools in StringTable and reference them with a 32-bit index+offset (also faster, more cache hits)
- the hash was only 32-bit anyhow, so use uint32_t instead of hash_t
- call getValue(vptr) in findSlot only when the hash matches (saves a possibly unnecessary load); see the sketch after this list
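A minimal self-contained sketch of the lookup loop this describes, assuming hypothetical names (Slot, findSlot; a plain string key stands in for dmd's pooled StringValue) - an illustration, not dmd's actual stringtable code:

// Hypothetical sketch: open addressing with triangular probing in a
// power-of-2 table, comparing the cached 32-bit hash before touching
// the key (which would be the cache-missing StringValue load in dmd).
struct Slot
{
    uint hash;   // cached 32-bit hash of the key
    string key;  // stands in for the pooled StringValue; null == empty
}

size_t findSlot(Slot[] slots, uint hash, string s)
{
    immutable mask = slots.length - 1;  // slots.length is a power of 2
    for (size_t i = hash & mask, j = 1; ; ++j)
    {
        if (slots[i].key is null)       // empty bucket: s is not in the table
            return i;
        // cheap hash check first; only compare the key on a hash match
        if (slots[i].hash == hash && slots[i].key == s)
            return i;
        i = (i + j) & mask;             // increments 1, 2, 3, ... => triangular offsets
    }
}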
@MartinNowak I read about this algorithm.
|
Sweet!
This stuff adds up. Getting a full % speedup is a big deal. |
The nice thing about speedups like this is they affect everything - for example, now the build/test system will run faster! It keeps paying dividends. |
BTW, while I was poking around DMD's backend AAs for bug #8596 I noticed that some of the AA buckets were going linear & getting overpopulated because of a hash function casting a pointer to a |
That's quadratic probing and has the problem that you might fail to insert a value once your load factor is bigger than 0.5 [¹].
It's the load factor that determines the average amount of buckets you'll have to look at. |
What is the expected probe count when looking up a nonexistent value, until an empty bucket is found? And the main question: should we use this algorithm in our D AA implementation? |
There is no proof other than that you'll eventually find an empty bucket in at most N steps, where N is the table size.
It's even possible to reduce the effect of secondary clustering (reduce the variance of probing steps) with a method called Robin Hood hashing.
Well, if it's faster then yes :). |
It should perform 1/(1-a) probes on average, where a is the load factor. |
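For concreteness, this is the textbook uniform-hashing estimate for an unsuccessful search in an open-addressing table (a standard approximation, not a number measured in this PR):

E[\text{probes}] \approx \frac{1}{1-\alpha}, \qquad \alpha = 0.5 \Rightarrow 2, \quad \alpha = 0.75 \Rightarrow 4, \quad \alpha = 0.9 \Rightarrow 10

which is why the load factors discussed below stay well under 1.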
We may also store and check the max depth of a key.
We may simply mark a deleted entry with a special flag. When we are searching for an existing value we should ignore deleted entries; when we are inserting a value we may reuse a deleted entry for the new value. Of course, when we delete an entry, we should call the destructor for the key and the value if needed. |
You don't need to do that; any probe sequence ends when it hits an empty bucket.
Tombstones (buckets marked as deleted) have some performance pitfalls and actually deletion is fairly simple. You just have to backshift the values in your probe sequence. I still want to try out Robin Hood hashing, sounds promising because it mitigates primary and secondary clustering by swapping buckets during insertion. |
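For illustration, here is a minimal backshift-deletion sketch for the simpler linear-probing case (triangular probing needs more bookkeeping, since each key walks its own probe sequence); Bucket, distance, and removeAt are hypothetical names:

// Hypothetical sketch: backward-shift deletion for linear probing in a
// power-of-2 table. Instead of leaving a tombstone, entries displaced past
// the removed slot are shifted one step back toward their ideal bucket.
struct Bucket { uint hash; string key; } // key is null when the bucket is empty

size_t distance(size_t ideal, size_t actual, size_t mask)
{
    return (actual - ideal) & mask; // wraparound-safe probe distance
}

void removeAt(Bucket[] tab, size_t pos)
{
    immutable mask = tab.length - 1;
    for (;;)
    {
        immutable next = (pos + 1) & mask;
        // stop at an empty bucket or at an entry already in its ideal slot
        if (tab[next].key is null || distance(tab[next].hash & mask, next, mask) == 0)
            break;
        tab[pos] = tab[next]; // shift the displaced entry one step back
        pos = next;
    }
    tab[pos] = Bucket.init;   // the end of the shifted run becomes empty
}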
I introduced a bug with this pull, please see #4102. |
@IgorStepanov I tried Robin Hood hashing. While the distribution is really good, the resulting hash table is slower in all my benchmarks, most likely because the insert/update path becomes more complex. So I'd stick to quadratic (triangular) probing as the fastest implementation. |
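For reference, this is roughly the extra work a Robin Hood insert does compared to a plain probe - an illustrative linear-probing sketch reusing the hypothetical Bucket/distance helpers from the deletion sketch above, not the benchmarked code:

// Hypothetical sketch: Robin Hood insertion. A resident entry that is
// "richer" (closer to its ideal bucket) than the incoming one is swapped
// out, which equalizes probe distances at the cost of extra insert work.
void insertRobinHood(Bucket[] tab, Bucket e)
{
    immutable mask = tab.length - 1;
    size_t pos = e.hash & mask;
    size_t dist = 0; // how far e currently is from its ideal bucket
    for (;;)
    {
        if (tab[pos].key is null)
        {
            tab[pos] = e;    // empty bucket: done
            return;
        }
        immutable resident = distance(tab[pos].hash & mask, pos, mask);
        if (resident < dist)
        {
            // the resident is richer than e: swap them and continue
            // inserting the displaced entry instead
            auto tmp = tab[pos];
            tab[pos] = e;
            e = tmp;
            dist = resident;
        }
        pos = (pos + 1) & mask;
        ++dist;
    }
}

The swap keeps the variance of probe distances low, but it turns every colliding insert into a potential chain of moves, which matches the observation above that inserts got slower.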
@MartinNowak I've implemented this scheme for AA.
|
@MartinNowak |
Please |
I've tested removing, and then inserting new values and searching for them after a series of removals. Do you think we need to finish the new version or use the existing one? |
Ok. |
1.5 million entries is quite a lot; please optimize for common use cases.
Well, I'm not sure what you refer to when you say max depth, but note that while the number of probes for quadratic probing is smaller, the maximum distance is expected to be bigger as the step size increases. This helps to mitigate primary clustering, which is a problem of open addressing even with very good hashes.
Yes, for open addressing performance gets really bad when the load factor approaches 1, because there are almost no empty buckets left. Something between 0.6 and 0.8 should be fine for quadratic probing though.
Maybe rehashing is too expensive; what's your growth policy? (We should move that discussion somewhere else, how about the newsgroup?)
If correctly done, it's definitely many times faster than a hash table with buckets. |
What's simpler than iterating through an array? |
It doesn't affect the API, so let's do it later, but we should definitely do it. |
Old implementation:
Summary: N is the depth of the AA (which is better in the old implementation).
Summary: DIFF:
OK, they are almost equal, but the new implementation has a worse distribution (in the old implementation values are distributed uniformly across the table; in the new implementation a value has to search for a free slot). |
Where is your code for that? I think I can help out a bit; your buckets look too big (they are 8 bytes in StringTable). Asymptotic complexity is pretty irrelevant when it comes to cache misses (see). |
http://pastebin.com/uZEgTd3S
we may get …
About removing: …
P.S. I hope you understand my explanation, because I wrote it quite sloppily. :o) |
As I said before, you don't need that test, because triangular numbers guarantee that you'll visit each bucket without duplicates (see here). |
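That guarantee is easy to sanity-check; a hypothetical throwaway test (same probing loop as the sketch near the top of the thread):

// Quick check of the triangular-number property: in a table of size 2^k,
// the probe positions (h + j*(j+1)/2) mod 2^k visit every bucket exactly once.
unittest
{
    foreach (k; 1 .. 12)
    {
        immutable n = size_t(1) << k;
        auto seen = new bool[n];
        size_t i = 0;
        foreach (j; 1 .. n + 1)
        {
            seen[i] = true;
            i = (i + j) & (n - 1); // add 1, 2, 3, ... => triangular offsets
        }
        foreach (s; seen) assert(s); // all n buckets were visited
    }
}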
Yikes!!!

static struct Entry
{
    size_t hash;
    Key key;
    Value value;
    size_t _depth;
}

That thing is 48 bytes, no wonder that it performs so badly.

Here's what I have in mind for a generic AA:

struct Bucket
{
uint hash; // 32-bit hash suffices
uint idx; // index into entries
}
struct Entry
{
Key key;
Value value;
}
Bucket[] buckets;
Entry[] entries;
uint[] freeList;
uint allocEntry()
{
uint idx = void;
if (!freeList.length)
{
if (entries.length + 1 > growThreshold * buckets.length)
grow();
idx = cast(uint)entries.length; // narrowing is fine, entries use 32-bit indices
entries.length += 1;
}
else
{
idx = freeList[$-1];
freeList = freeList[0 .. $-1];
}
return idx;
}
void freeEntry(uint idx)
{
if (freeList.length + 1 > shrinkThreshold * buckets.length)
shrink();
freeList.assumeSafeAppend();
freeList ~= idx;
}
void grow()
{
// double buckets length and rehash
// probably compress entries and remap indices
}
void shrink()
{
// compress entries and remap indices
// halve buckets length and rehash
} |
It turns out I explained it poorly. The empty buckets appear after removing. Say we have two keys, Key1 and Key2, and a partially filled table: when we add Key1 we pass through indexes 0, 1, and 3 and store it at index 6 (i.e., the triangular probe sequence 0, +1, +2, +3).
On 11/09/2014 02:11 PM, IgorStepanov wrote:
Ah right, and it's fairly expensive to find an entry that can be |
And now we return to the question of how we should store max_depth and an empty/occupied flag for buckets. P.S. Are there still any objections against #3904? |
@MartinNowak The list in traits.c used a sentinel for a reason; this broke DDMD. |
Sorry for that, but if something is done for non-obvious reasons it needs a comment. |
No worries, I just comment so people have some idea of what breaks ddmd. Usually it's things like shadowing variables and comments in strange places, which I can't really document in the source. |
@MartinNowak I've made it as you suggested: if Bucket.index == EmptyBucket, the bucket is empty. When bucket[i] is removed, I set hashes[i] to uint.max (a very rare hash value).
I hoped that if
Tested on ~1000000 keys with … The first three tests ran on the new AA implementation with #3904 (but with the old algorithm), which is ~5% faster than the current AA implementation. |
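As I read the description above, the marking scheme could look roughly like this (hypothetical sketch; EmptyBucket, findOrMiss, markDeleted, and the parallel hashes array are illustrative names):

// Hypothetical sketch of the empty/deleted marking: a never-used slot has
// index == EmptyBucket and terminates the probe sequence; a removed slot
// keeps the probe sequence alive but gets the reserved hash uint.max, so
// it can never match a lookup (real keys must avoid that hash value).
enum uint EmptyBucket = uint.max;

struct Bucket { uint index; } // index into the entries array

bool findOrMiss(Bucket[] buckets, uint[] hashes, uint hash, out size_t pos)
{
    immutable mask = buckets.length - 1;
    for (size_t i = hash & mask, j = 1; ; i = (i + j) & mask, ++j)
    {
        if (buckets[i].index == EmptyBucket) { pos = i; return false; }
        if (hashes[i] == hash) { pos = i; return true; } // deleted slots never match
    }
}

void markDeleted(uint[] hashes, size_t i)
{
    hashes[i] = uint.max; // the reserved, "very rare" hash marks deletion
}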
This StringTable is much faster, but even though StringTable::search was high up in the profile, a change from 1.6% to 0.8% of accumulated CPU usage will go unnoticed by dmd users.