
faster string table #4088

Merged
merged 6 commits into dlang:master on Oct 24, 2014

Conversation

MartinNowak
Member

  • use a hash table with open addressing and quadratic probing (triangular numbers)
  • adapt the initial capacity (tuned for compiling phobos)
  • better hash function (MurmurHash2)
  • StringValue pool allocator for locality and 4-byte indices

This StringTable is much faster, but even though StringTable::search was high up in the profile, a change from 1.6% to 0.8% of accumulated CPU usage will go unnoticed by dmd users.

- use a hash table with open addressing and linear probing
  for better cache locality
- based on numbers from phobos compilation
- guaranteed to visit every bucket in a power of 2 sized hash table

- mitigates clustering problem (with many hash collisions) while still preserving
  good cache locality
- much better distribution leads to fewer collisions
  and thus to faster lookups

- also faster
- more cache hits

- allocate StringValues from pools in StringTable
  and reference them with a 32-bit index+offset

- hash was only 32-bit anyhow so use uint32_t instead of hash_t
- call getValue(vptr) in findSlot only when the hash
  matches (saves a possibly unnecessary load)
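
As a rough illustration of the commits above, this is the shape of the 8-byte hash bucket (a D sketch, not dmd's actual code; the real implementation in src/root/stringtable.c is C++ and the exact encoding of vptr is an implementation detail of the pool allocator):

// One 64-byte cache line holds 8 of these buckets during probing.
struct StringEntry
{
    uint hash;  // 32-bit MurmurHash2 of the string
    uint vptr;  // pool index + offset of the StringValue, resolved via getValue(vptr)
}
static assert(StringEntry.sizeof == 8);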
@IgorStepanov
Contributor

@MartinNowak I read about this algorithm.
In the following example, if the key doesn't exist in the table, we have to iterate j over all values from 0 to the table size, yes?
Is that OK?
What if the AA is very big?

int quadratic_probing_search(int *hashtable, int key, int *empty)
{
    /* Returns the index in the hashtable where the key is stored,
       or -1 if the key is not found.
       SIZE is the table size; empty[i] == 0 marks an occupied slot. */

    int j = 0, hk;
    hk = key % SIZE;
    while (j < SIZE)
    {
        if ((empty[hk] == 0) && (hashtable[hk] == key))
            return (hk);
        j++;
        hk = (key + j * j) % SIZE;   /* pure quadratic step: offset j * j */
    }
    return (-1);
}

@WalterBright
Member

Sweet!

a change from 1.6% to 0.8% of accumulated CPU usage will go unnoticed by dmd users.

This stuff adds up. Getting a full % speedup is a big deal.

WalterBright added a commit that referenced this pull request on Oct 24, 2014
@WalterBright merged commit 047d0de into dlang:master on Oct 24, 2014
@WalterBright
Member

The nice thing about speedups like this is they affect everything - for example, now the build/test system will run faster! It keeps paying dividends.

@Safety0ff
Contributor

BTW, while I was poking around DMD's backend AAs for bug #8596, I noticed that some of the AA buckets were going linear and getting overpopulated because of a hash function that casts a pointer to a size_t.
Despite that, I didn't find enough justification for changing the hash. I'm letting you know in case you have a use case where the fix is worthwhile.

@MartinNowak
Member Author

@MartinNowak I read about this algorithm.
In the following example, if the key doesn't exist in the table, we have to iterate j over all values from 0 to the table size, yes?
Is that OK?

That's quadratic probing, and it has the problem that you might fail to insert a value once your load factor is bigger than 0.5 [¹].
Triangular numbers as the probe sequence in a 2^N sized table are guaranteed to visit every bucket [²].
An important point here is that StringEntry is only 8 bytes, so you can load 8 entries in one cache line; that's why, in practice, quadratic probing works out better than cuckoo hashing.

What if the AA is very big?

It's the load factor that determines the average number of buckets you'll have to look at.
Notice that probing stops once it stumbles upon an empty bucket.
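
A small self-contained D check (not part of the PR) of the two claims above: pure quadratic offsets j^2 can miss most buckets of a power-of-two table, while triangular offsets j*(j+1)/2 visit every bucket exactly once.

import std.algorithm : count;
import std.stdio : writeln;

void main()
{
    enum size_t N = 16;                  // power-of-two table size
    bool[N] quad, tri;
    size_t slot = 0;                     // home slot h = 0, w.l.o.g.
    foreach (j; 0 .. N)
    {
        quad[(j * j) & (N - 1)] = true;  // pure quadratic probing: h + j^2
        tri[slot] = true;                // triangular probing: h + j*(j+1)/2
        slot = (slot + j + 1) & (N - 1); // next triangular step: +1, +2, +3, ...
    }
    writeln("quadratic j^2 reaches ", count(quad[], true), " of ", N, " buckets"); // prints 4
    writeln("triangular reaches ", count(tri[], true), " of ", N, " buckets");     // prints 16
}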

@MartinNowak MartinNowak deleted the fasterStringTable branch October 25, 2014 13:20
@IgorStepanov
Contributor

Notice that probing stops once it stumbles upon an empty bucket.

What is the expected number of probes for a value that doesn't exist, until an empty bucket is found?
I've checked your implementation and found that the miss-probe count is no larger than 2 when the table holds the maximum number of values before it is extended.
But how can we prove it?

And the main question: should we use this algorithm in our D AA implementation?

@MartinNowak
Member Author

But how can we prove it?

There is no proof beyond the fact that you'll eventually find an empty bucket in at most N steps, where N is the table size.
http://dpaste.dzfl.pl/0c6663ec5b62

I've checked your implementation and found that the miss-probe count is no larger than 2 when the table holds the maximum number of values before it is extended.

It's even possible to reduce the effect of secondary clustering (i.e. to reduce the variance of the number of probing steps) with a method called Robin Hood hashing.
This can also be used with triangular probing, but I think it requires a table to look up how many steps a certain value is away from its slot. I'll measure the variance in the lexer table and see whether that would help.

And main question: should we use this algorithm in our D AA implementation?

Well, if it's faster then yes :).
Deletion is a bit tricky, but it's much more cache friendly than using bucket lists.

@MartinNowak
Member Author

the miss-probe count is no larger than 2

It should perform 1/(1-a) probes on average, where a is the load factor.
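For example, with uniform hashing that is 1/(1-0.5) = 2 probes on average at a load factor of 0.5 and 1/(1-0.75) = 4 probes at 0.75, independently of the table size.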

@IgorStepanov
Contributor

There is no proof beyond the fact that you'll eventually find an empty bucket in at most N steps, where N is the table size.
http://dpaste.dzfl.pl/0c6663ec5b62

We could also store and check the maximum depth per key: when adding a new element, update the stored max_depth value if j is greater than max_depth; when searching for a value, limit j by max_depth.

Deletion is a bit tricky, but it's much more cache friendly than using bucket lists.

We could simply mark a deleted entry with a special flag. When searching for an existing value we ignore deleted entries; when inserting a value we may reuse a deleted entry for the new value.

Of course, when deleting an entry, we should call the destructors for the key and the value if needed.

@MartinNowak
Member Author

We could also store and check the maximum depth per key: when adding a new element, update the stored max_depth value if j is greater than max_depth; when searching for a value, limit j by max_depth.

You don't need to do that; any probe sequence ends when it hits an empty bucket.

We could simply mark a deleted entry with a special flag. When searching for an existing value we ignore deleted entries; when inserting a value we may reuse a deleted entry for the new value.

Tombstones (buckets marked as deleted) have some performance pitfalls, and deletion is actually fairly simple: you just have to backshift the values in your probe sequence.

I still want to try out Robin Hood hashing, sounds promising because it mitigates primary and secondary clustering by swapping buckets during insertion.
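
For the linear probing case, the backshift Martin mentions is essentially Knuth's tombstone-free deletion. A minimal D sketch (the stripped-down Bucket and the names are assumptions of the sketch; as the discussion below shows, doing the same with triangular probing is harder):

struct Bucket { uint hash; bool occupied; }    // stripped down for the sketch

// Remove the element at `hole`, assuming a power-of-two table that always
// keeps at least one empty bucket (load factor < 1).
void removeAt(Bucket[] buckets, size_t hole)
{
    immutable size_t mask = buckets.length - 1;
    size_t j = hole;
    while (true)
    {
        j = (j + 1) & mask;
        if (!buckets[j].occupied)
            break;                             // end of the run, nothing left to move
        immutable size_t home = buckets[j].hash & mask;
        // Shift buckets[j] back only if the hole lies on its probe path,
        // i.e. cyclically between its home slot and its current slot.
        if (((hole - home) & mask) < ((j - home) & mask))
        {
            buckets[hole] = buckets[j];        // backshift into the hole
            hole = j;                          // the hole moves forward
        }
    }
    buckets[hole].occupied = false;            // no tombstone needed
}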

@MartinNowak
Member Author

I introduced a bug with this pull, please see #4102.

@MartinNowak
Member Author

I still want to try out Robin Hood hashing, sounds promising because it mitigates primary and secondary clustering by swapping buckets during insertion.

@IgorStepanov I tried Robin Hood hashing. While the distribution is really good, the resulting hash table is slower in all my benchmarks, most likely because the insert/update method becomes more complex. So I'd stick with quadratic (triangular) probing as the fastest implementation.

@IgorStepanov
Contributor

@MartinNowak I've implemented this scheme for the AA.
It's not fully debugged yet, but adding and searching work correctly.
I've tested it and found that inserting is 4 times slower, and updating about equal, compared to the new AA implementation, which is itself ~5% faster with -O -inline -release than the original implementation.
I tested with string keys; the test table contains over 1,500,000 entries.
The max depth (the maximum j in the loop) of the "triangular" table is >55; the max depth of dlang/druntime#934 is near 10 (MurmurHash3 gives a very good distribution).
My "triangular" table layout:

struct AssociativeArray(Key, Value)
{
    static struct Impl
    {
        immutable(AssociativeArrayHandle)* handle = &AssociativeArray.handle;
        Entry[] buckets;
        size_t nodes;       // total number of entries
    }

    static struct Entry
    {
        size_t hash;
        Key key;
        Value value;
        size_t _depth;      // first two bits are Empty/Occupied/SomethingElse flags;
                            // the other bits are the maximum j (depth) if a search starts from this entry
    }

    Impl* impl;
}
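(For string keys and values on a 64-bit target this Entry works out to 8 + 16 + 16 + 8 = 48 bytes per bucket, a size that comes up again further down.)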

@IgorStepanov
Contributor

@MartinNowak
Hmm. When I changed LoadFactor from 0.8 to 0.6 and added a special init method that takes an expected table size, inserting became 8 times faster than the original AA.
I think, if we want to improve the D AA, we should add a way to set the initial table size, like
string[string] aa = new string[string](500);

@MartinNowak
Member Author

string[string] aa = new string[string](500);

Please use aa.reserve(500), like for arrays.
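
Side by side, the two calls being compared look like this; arr.reserve already exists in druntime for dynamic arrays, while the AA call is only the proposed, at that time hypothetical, counterpart:

int[] arr;
arr.reserve(500);      // existing druntime API: reserve capacity for a dynamic array
string[string] aa;
// aa.reserve(500);    // proposed counterpart for AAs (did not exist at the time)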

@IgorStepanov
Contributor

I've tested removal, and inserting and looking up new values after a series of removals.
This implementation is faster than the old AA when inserting new values (when the AA doesn't need resizing), because the old implementation has to allocate a dynamic structure when inserting a new Entry; the "triangular" table doesn't need that. Removal speeds are equal. Lookup in the old implementation is 10% faster than in the "triangular" version (as expected, because a search in the old implementation is much simpler and the old implementation has a better distribution).
The new implementation uses less memory, because it stores everything in a flat array.

Do you think we should finish the new version or use the existing one?

@IgorStepanov
Contributor

Please use aa.reserve(500), like for arrays.

Ok.

@MartinNowak
Member Author

I tested with string keys; the test table contains over 1,500,000 entries.

1.5 million entries is quite a lot; please optimize for the common use cases.
We should focus on a handful up to a few hundred entries.

The max depth (the maximum j in the loop) of the "triangular" table is >55; the max depth of dlang/druntime#934 is near 10 (MurmurHash3 gives a very good distribution).
My "triangular" table layout:

Well, I'm not sure what you refer to when you say max depth, but note that while the number of probes for quadratic probing is smaller, the maximum distance is expected to get bigger as the step size increases. This helps to mitigate primary clustering, which is a problem of open addressing even with very good hashes.
Take a look at this; I've plotted the distribution for different collision resolutions.
http://nbviewer.ipython.org/url/dawg.eu/OpenAddressing.ipynb
As you can see, linear probing has a very long tail.
Keep in mind that with bucket lists a depth of 10 means 10 cache misses (~1K cycles), while for linear probing 10 steps mean only ~2 cache misses (~200 cycles), since 8 adjacent 8-byte buckets share a cache line. For a realistic benchmark you need to pollute your cache during lookups (see CacheStomper).

When I changed LoadFactor from 0.8 to 0.6

Yes, for open addressing performance gets really bad when the load factor approaches 1, because there are almost no empty buckets left. Something between 0.6 and 0.8 should be fine for quadratic probing, though.

and added a special init method that takes an expected table size

Maybe rehashing is too expensive; what's your growth policy? (We should move that discussion somewhere else, how about the newsgroup?)

inserting became 8 times faster than the original AA.

If done correctly, it's definitely many times faster than a hash table with bucket lists.
It's all about the number of cache misses.

@MartinNowak
Member Author

because a search in the old implementation is much simpler

What's simpler than iterating through an array?
It took me just a few lines plus the value comparison (https://github.com/D-Programming-Language/dmd/blob/10451418ec52adb4bb6235fa7e56ae81e94d817a/src/root/stringtable.c#L149).

@MartinNowak
Member Author

Do you think we should finish the new version or use the existing one?

It doesn't affect the API, so let's do it later, but we should definitely do it.

@IgorStepanov
Contributor

What's simpler than iterating through an array?

Old implementation:

h = getHash(key);
idx = h % table_size;
e = buckets[idx];
while(e)
{
    if (e.hash == h && e.equals(key))
       return;
    e = e.next;
}

Summary:
1 hash calculation
1 mod
1 access by idx
N comparisons
3*N access by a pointer (e.hash, e.equals, e.next)

N is the probe depth (which is smaller in the old implementation).
New AA:

h = getHash(key);
idx = h & (impl.buckets.length - 1); // same cost as "%" for a power-of-two length

// presumably i starts at idx and j at 1, giving the probe sequence idx, idx+1, idx+3, ...
for (; j <= mdepth; i = (i + j) & (impl.buckets.length - 1), ++j)
{
    if (impl.buckets[i].occupation == Occupied &&
        impl.buckets[i].hash == h &&
        impl.buckets[i].key == pkey)
        return;
}

Summary:
1 hash calculation
1 and
2 * N comparisons (with Occupied and with the hash); the last comparison is almost always true if it is reached
N transitions to the pointer
N j++
N and
3 * N access by a pointer (occupation, hash, key)
3 * N access by an index // maybe this can be translated to a single instruction with access by address

DIFF (old vs. new):

  • old: 3 * N access by a pointer (e.hash, e.equals, e.next), N comparisons
  • new: 3 * N access by a pointer (occupation, hash, key), 3 * N access by an index, 2 * N comparisons, plus N j++ and N and

OK, they are almost equal, but the new implementation has a worse distribution (in the old implementation values are distributed uniformly across the table; in the new implementation a value has to search for a free slot).
However, we can improve the lookup code without changing the algorithm.

@MartinNowak
Member Author

N is the probe depth (which is smaller in the old implementation).
New AA:

Where is your code for that? I think I can help out a bit; your buckets look too big (they are 8 bytes in StringTable). Asymptotic complexity is pretty irrelevant when it comes to cache misses (see).

@IgorStepanov
Contributor

@MartinNowak

Where is your code for that?

http://pastebin.com/uZEgTd3S
This is a modification of [new AA] that uses the new algorithm.
It works fine with a string[string] table, but it isn't fully tested for other cases.
The main difference from your implementation is that I save a max_depth for each bucket.
For example, when we calculate

size_t initial_hash = hash & (impl.buckets.length - 1);

we can get the depth from bucket[initial_hash].
This depth means that when we iterate j in for (; j <= mdepth; i = (i + j) & (impl.buckets.length - 1), ++j) {...}, j can never be larger than that depth.

About removal:
When I find the bucket J1 that is to be removed, I keep iterating over j and search for J2: the deepest bucket that has the same initial_hash as J1 (the same first bucket in the search). If J1 == J2, then J1 is the deepest bucket and it can be removed without any additional actions.
If J2 != J1, I swap the J1 and J2 buckets. Now J2 is to be removed, and it is the deepest bucket.
However, when we later search for some other value with another initial_hash, we may need to go through the J2 bucket, which was removed in the previous step. But since I use the depth as the exclusive condition to break the loop, the empty slot is not a problem.

P.S. I hope you understand my explanation, because I wrote it quite sloppily. :o)

@MartinNowak
Member Author

This depth means that when we iterate j in for (; j <= mdepth; i = (i + j) & (impl.buckets.length - 1), ++j) {...}, j can never be larger than that depth.

As I said before, you don't need that test, because the triangular numbers guarantee that you'll visit each bucket without duplicates (see here).
You always stop probing when you hit the first empty bucket.

@MartinNowak
Member Author

Yikes!!!

    static struct Entry
    {
        size_t hash;
        Key key;
        Value value;
        size_t _depth;

That thing is 48 bytes, no wonder it performs so badly.
You have to make buckets very small to benefit from cache locality during probing. Please look closely at StringTable; I even used a separate allocator to shrink the bucket entries from 16 to 8 bytes.

Here's what I have in mind for a generic AA.

struct Bucket
{
    uint hash; // 32-bit hash suffices
    uint idx; // index into entries
}
struct Entry
{
    Key key;
    Value value;
}
Bucket[] buckets;
Entry[] entries;
uint[] freeList;

uint allocEntry()
{
    uint idx = void;
    if (!freeList.length)
    {
        if (entries.length + 1 > growThreshold * buckets.length)
            grow();
        idx = cast(uint)entries.length; // entries.length is size_t
        entries.length += 1;
    } 
    else
    {
        idx = freeList[$-1];
        freeList = freeList[0 .. $-1];
    }
    return idx;
}

void freeEntry(uint idx)
{
    if (freeList.length + 1 > shrinkThreshold * buckets.length)
        shrink();
    freeList.assumeSafeAppend();
    freeList ~= idx;
}

void grow()
{
    // double buckets length and rehash
    // probably compress entries and remap indices
}

void shrink()
{
    // compress entries and remap indices
    // halve buckets length and rehash
}
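
To connect the sketch above with the probing scheme from this PR, here is a hedged, self-contained D sketch of what lookup over such a Bucket[]/Entry[] layout could look like; the EMPTY sentinel and the 32-bit hash parameter are assumptions of the sketch, and the "load the Entry only on a hash match" trick mirrors StringTable::findSlot:

enum uint EMPTY = uint.max;                        // sentinel: bucket holds no entry

struct Bucket { uint hash; uint idx; }             // 8 bytes, as in the sketch above
struct Entry(Key, Value) { Key key; Value value; }

Entry!(Key, Value)* find(Key, Value)(Bucket[] buckets,
        Entry!(Key, Value)[] entries, uint hash, Key key)
{
    immutable size_t mask = buckets.length - 1;    // power-of-two table
    size_t i = hash & mask;
    // Terminates because the load factor stays below 1 and triangular steps
    // reach every bucket of a power-of-two table.
    for (size_t j = 1; ; ++j)
    {
        if (buckets[i].idx == EMPTY)
            return null;                           // probing stops at an empty bucket
        if (buckets[i].hash == hash && entries[buckets[i].idx].key == key)
            return &entries[buckets[i].idx];       // touch the Entry only on a hash match
        i = (i + j) & mask;                        // triangular probing: +1, +2, +3, ...
    }
}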

@IgorStepanov
Contributor

As I said before, you don't need that test, because the triangular numbers guarantee that you'll visit each bucket without duplicates (see here).
You always stop probing when you hit the first empty bucket.

It turns out I explained it poorly. The empty buckets appear after removal.
See the next example:
http://dpaste.dzfl.pl/fork/0c6663ec5b62

Suppose we have two keys, Key1 and Key2, with Hash(Key1) % buckets.length == 0 and Hash(Key2) % buckets.length == 3.

We have a partially filled table: when we add Key1 we pass indexes 0, 1, 3 and store it at index 6.
After that, when we add Key2, we pass 3, 4, 6 and store it at index 9.
Now let's remove Key1; buckets[6] is NULL now.
When we then search for Key2, we break at index 6 and never reach 9.

@MartinNowak
Member Author

On 11/09/2014 02:11 PM, IgorStepanov wrote:

We have a partially filled table: when we add Key1 we pass indexes 0, 1, 3 and store it at index 6.
After that, when we add Key2, we pass 3, 4, 6 and store it at index 9.
Now let's remove Key1; buckets[6] is NULL now.
When we then search for Key2, we break at index 6 and never reach 9.

Ah right, and it's fairly expensive to find an entry that can be backshifted to slot 6.

@IgorStepanov
Contributor

Ah right, and it's fairly expensive to find an entry that can be backshifted to slot 6.

And now we return to the question of how to store max_depth and the empty/occupied flag for the buckets.
I suggest storing max_depth in a separate array (we need to read that value once per lookup anyway) and the occupation flag in the zero bit of Bucket.idx. Is that OK?

P.S. Are there still any objections against #3904?

@yebblies
Member

@MartinNowak The list in traits.c used a sentinel for a reason; this broke DDMD.

@MartinNowak
Member Author

@MartinNowak The list in traits.c used a sentinel for a reason; this broke DDMD.

Sorry for that, but if something is done for non-obvious reasons it needs a comment.

@yebblies
Member

No worries, I just comment so people have some idea of what breaks ddmd. Usually it's things like shadowing variables and comments in strange places, which I can't really document in the source.

@IgorStepanov
Contributor

@MartinNowak I've done it as you said.
In the first stage I split the Entry struct into three parts: Entry (key, value), Bucket (uint hash, uint index), and uint[] depthes.

If Bucket.index == EmptyBucket, the bucket is empty.
After that I made the next optimization: I created uint[] hashes, and I now store the index and the max depth in Bucket.

When buckets[i] is removed, I set hashes[i] to uint.max (a very rare hash).
In the loop I do the checks in the following order:

for (; j <= mdepth; i = (i + j) & (impl.buckets.length - 1), ++j)
{
    if (impl.hashes[i] == hash &&
        impl.buckets[i].occupied &&
        impl.entries[impl.buckets[i].index].isEqualsKey(pkey))
    {
        // success
    }
}

I hoped that if impl.hashes[i] == hash, then probably impl.buckets[i].occupied == true (an empty bucket has a rare hash) and impl.entries[impl.buckets[i].index].isEqualsKey(pkey) == true (if the values are uniformly distributed and we have a good hash function, the probability of a hash collision is size / hash_max).
Now the results:

Old AA

inserting: 1.541
updating: 0.293
remove access: 0.255
second access: 0.675

inserting: 1.581
updating: 0.282
remove access: 0.257
second access: 0.715

inserting: 1.55
updating: 0.281
remove access: 0.257
second access: 0.714

=========
New AA

inserting: 0.282
updating: 0.237
remove access: 0.239
second access: 0.61

inserting: 0.269
updating: 0.231
remove access: 0.238
second access: 0.582

inserting: 0.282
updating: 0.25
remove access: 0.247
second access: 0.579

=========
With .hashes

inserting: 0.24
updating: 0.234
remove access: 0.231
second access: 0.568

inserting: 0.245
updating: 0.238
remove access: 0.238
second access: 0.57

inserting: 0.266
updating: 0.258
remove access: 0.249
second access: 0.579
=========

Tested on ~1,000,000 keys with the -release -inline -O flags.
Please don't pay attention to the old AA's inserting result: the old AA doesn't have a reserve function and has to rehash whenever the AA gets crowded.
The other results should be fair.

The first three runs were measured on the new AA implementation from #3904 (but with the old algorithm), which is ~5% faster than the current AA implementation.
The second three runs are the new AA algorithm with hashes and indices in the same array.
The third three runs are the new AA algorithm with hashes in a separate array.
