Make CollectionIDs a 32bit hash value of the collection name #412

hegner · 2023-05-04T16:24:34Z

BEGINRELEASENOTES

Using string hashes as CollectionID based on MurmurHash

ENDRELEASENOTES

This PR is work in progress. The frame interface works. However the legacy interface used the previous structure of IDs for some optimizations, which I have to remove.

src/MurmurHash2.cpp

tmadlener · 2023-05-05T09:45:20Z

How does this fare with reading files prior to this? It should work because we just replace the calculation of the ID, right? Can we easily verify that (in CI)?

hegner · 2023-05-06T21:29:33Z

@tmadlener - while fixing the undetected narrowing from 64 to 32 bits, I found quite a few signed/unsigned inconsistencies that hadn't been caught before. The collectionID is now uint64_t throughout.

jmcarcell · 2023-05-07T08:49:28Z

What's the final take on the hash size, 64 bits? With 64 bits there is some small collision probability with many collections, right?

hegner · 2023-05-07T16:08:25Z

Yes, 64-bit hash size. 32 bits were too few to be safe from collisions for a huge number of collections. With 64 bits we are in very safe territory.

hegner · 2023-05-07T16:10:31Z

See e.g. here https://preshing.com/20110504/hash-collision-probabilities/
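The size argument above can be made concrete with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)), for n names hashed uniformly into 2^bits values. The following is only a minimal sketch; the function name `collisionProbability` is made up for illustration and is not part of podio.

```cpp
#include <cmath>
#include <cstdint>

// Approximate probability of at least one collision among n names that are
// hashed uniformly into 2^bits values (birthday approximation).
inline double collisionProbability(std::uint64_t n, unsigned bits) {
  if (n < 2) {
    return 0.0; // fewer than two names cannot collide
  }
  const double space = std::ldexp(1.0, static_cast<int>(bits)); // 2^bits
  const double exponent = -static_cast<double>(n) * static_cast<double>(n - 1) / (2.0 * space);
  // expm1 keeps precision when the exponent is tiny, as it is for 64-bit hashes
  return -std::expm1(exponent);
}
```

Under this approximation, 1000 names give a probability of roughly 1.2e-4 with 32 bits and well below 1e-13 with 64 bits.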

tmadlener

Making the collectionID have 64 bits (and planning to use all of them) has potentially a few more implications, e.g. the id method will no longer be very useful:

podio/python/templates/macros/declarations.jinja2

Line 96 in 08117a1

    unsigned int id() const { return getObjectID().collectionID * 10000000 + getObjectID().index; }
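To illustrate why this packing stops being useful once collection IDs are hash values, here is a small sketch that mimics the arithmetic with explicit 32-bit unsigned integers. `ObjectIDSketch`, `legacyId` and `decodeCollectionID` are hypothetical stand-ins, not podio types, and the sketch assumes that `unsigned int` is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical stand-in for the (collectionID, index) pair; not podio::ObjectID
struct ObjectIDSketch {
  std::uint32_t collectionID;
  int index;
};

// Same arithmetic as the quoted id() method, done in 32-bit unsigned integers.
// Unsigned overflow wraps modulo 2^32, so large collection IDs are silently mangled.
inline std::uint32_t legacyId(const ObjectIDSketch& oid) {
  return oid.collectionID * 10000000u + static_cast<std::uint32_t>(oid.index);
}

// Attempt to recover the collection ID from the packed value
inline std::uint32_t decodeCollectionID(std::uint32_t id) {
  return id / 10000000u;
}
```

With small sequential IDs the round trip works, but since 430 * 10000000 already exceeds 2^32, any collection ID of 430 or larger wraps around. Hash values almost always fall into that range.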

include/podio/ObjectID.h

src/selection.xml

tmadlener · 2023-05-07T17:24:45Z

Regarding backwards compatibility, for SIO we would need to increase the version number for the SIOCollectionIDTableBlock (and fix the current inconsistency):

podio/include/podio/SIOBlock.h

Lines 101 to 108 in 705721d

    SIOCollectionIDTableBlock() : sio::block("CollectionIDs", sio::version::encode_version(0, 4)) {
    }
    SIOCollectionIDTableBlock(podio::EventStore* store);
    SIOCollectionIDTableBlock(std::vector<std::string>&& names, std::vector<int>&& ids, std::vector<std::string>&& types,
                              std::vector<short>&& isSubsetColl) :
        sio::block("CollectionIDs", sio::version::encode_version(0, 3)),

podio/src/SIOBlock.cc

Line 16 in 705721d

sio::block("CollectionIDs", sio::version::encode_version(0, 3)) {

This could then be handled accordingly on the reading side:

podio/src/SIOBlock.cc

Lines 35 to 42 in 705721d

    void SIOCollectionIDTableBlock::read(sio::read_device& device, sio::version_type version) {
      device.data(_names);
      device.data(_ids);
      device.data(_types);
      if (version >= sio::version::encode_version(0, 2)) {
        device.data(_isSubsetColl);
      }
    }
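A rough sketch of what version-dependent handling on the reading side could look like, written against a plain byte buffer rather than the real SIO API. Everything here is an assumption made for illustration: `encodeVersion`, `kFirstWideVersion` (set to 0.5) and `readIDs` are hypothetical names, and the packing of major and minor version is not claimed to match `sio::version::encode_version`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

using VersionT = std::uint32_t;

// Hypothetical (major, minor) packing, only for this sketch
constexpr VersionT encodeVersion(std::uint16_t major, std::uint16_t minor) {
  return (static_cast<VersionT>(major) << 16) | minor;
}

// Hypothetical first block version that stores IDs as 64-bit values
constexpr VersionT kFirstWideVersion = encodeVersion(0, 5);

// Decode nIDs collection IDs from a raw buffer in host byte order.
// Older block versions store 32-bit IDs, which are widened on read.
inline std::vector<std::uint64_t> readIDs(const std::vector<unsigned char>& buf, std::size_t nIDs,
                                          VersionT version) {
  const bool wide = version >= kFirstWideVersion;
  const std::size_t width = wide ? sizeof(std::uint64_t) : sizeof(std::uint32_t);
  if (buf.size() < nIDs * width) {
    throw std::runtime_error("buffer too small for the requested number of IDs");
  }
  std::vector<std::uint64_t> ids;
  ids.reserve(nIDs);
  for (std::size_t i = 0; i < nIDs; ++i) {
    if (wide) {
      std::uint64_t value = 0;
      std::memcpy(&value, buf.data() + i * width, width);
      ids.push_back(value);
    } else {
      std::uint32_t value = 0;
      std::memcpy(&value, buf.data() + i * width, width);
      ids.push_back(value); // widen legacy 32-bit ID
    }
  }
  return ids;
}
```

The point of the sketch is only that the block version decides how many bytes each stored ID occupies, so files written before the change remain readable.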

There is also a missing conversion to uint64_t here.

podio/src/sioUtils.h

Line 49 in 705721d

std::vector<int> ids;

hegner · 2023-05-07T18:48:47Z

Thanks. I really haven't tested the SIO part properly yet.

tmadlener · 2023-05-15T13:51:48Z

This looks like it needs a rebase (and potentially some conflict resolution).

For me this looks good. The main points that need discussion IMHO are:

- Are we OK with effectively increasing the ObjectID size by 50%?
- An alternative (potentially additional) approach would be to externalize the collection ID generation and make it possible to supply the Frame with a pre-populated CollectionIDTable, and then either fall back to hashing or fail on collection names that are not present in the supplied external table.
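The second point could look roughly like the following sketch: a table that is populated from the outside and that either falls back to a hash function or fails for unknown names. `ExternalIDTableSketch` and `idFor` are hypothetical names chosen for illustration and do not correspond to the podio CollectionIDTable interface.

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

class ExternalIDTableSketch {
public:
  using HashF = std::function<std::uint32_t(const std::string&)>;

  // An empty fallback means: fail on names that are missing from the supplied table
  explicit ExternalIDTableSketch(std::unordered_map<std::string, std::uint32_t> ids, HashF fallback = nullptr)
      : m_ids(std::move(ids)), m_fallback(std::move(fallback)) {}

  // Look up the externally supplied ID first, then apply the configured policy
  std::uint32_t idFor(const std::string& name) const {
    const auto it = m_ids.find(name);
    if (it != m_ids.end()) {
      return it->second;
    }
    if (!m_fallback) {
      throw std::out_of_range("no externally supplied ID for collection '" + name + "'");
    }
    return m_fallback(name);
  }

private:
  std::unordered_map<std::string, std::uint32_t> m_ids;
  HashF m_fallback;
};
```

Passing the hash function in from the outside keeps the sketch independent of the concrete hash and makes the two policies (fall back or fail) a matter of construction.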

hegner · 2023-05-16T07:53:36Z

Yes, that would be something for the next EDM4hep meeting to discuss.

tmadlener · 2023-05-25T19:22:14Z

There are some conflicts. I think they should resolve themselves with a rebase onto master.

hegner · 2023-05-25T19:24:51Z

rebase didn't do it

hegner · 2023-05-26T13:08:05Z

@tmadlener thanks

tmadlener · 2023-05-26T17:12:48Z

One of the sanitizer workflows seems to be picking up on #174 and clang-tidy is complaining about a few things in murmurhash3. They are in principle easily fixable, and since we are already changing it to make clang-format happy, I don't see a reason not to also fix these issues here: https://github.com/AIDASoft/podio/actions/runs/5093184726/jobs/9155485339?pr=412#step:4:659

tmadlener · 2023-05-30T11:14:50Z

I added another tag to the failing test to ignore it in UndefinedBehavior sanitizer runs since it is picking up on a known issue and the test will be obsolete once the EventStore is removed in any case.

tmadlener · 2023-05-30T12:30:11Z

> This PR is work in progress. The frame interface works. However the legacy interface used the previous structure of IDs for some optimizations, which I have to remove.

Just to confirm. This is no longer the case and this PR is ready as it is, right?

andresailer · 2023-05-30T12:37:10Z

It is 64 bits now, or was I mistaken in changing the title of the PR?

jmcarcell · 2023-05-30T12:42:20Z

During the last call one week ago it was agreed that it would be 32 bits and that collisions would be dealt with, due to concerns about increasing the sizes of classes (for example due to padding) and files (?). Personally, I would like to see some numbers first, in case dealing with collisions turns out to be some work...

tmadlener · 2023-05-30T14:24:13Z

I have added a small utility tool that takes a list of collection names and checks for collisions using the chosen hash function. It should be fairly straightforward to come up with a list of all currently used collection names and see if we have collisions in there already.
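The core of such a check can be sketched in a few lines. This is not the tool added in the PR: `findCollisions` is a hypothetical helper, and the hash used here is 32-bit FNV-1a as a compact stand-in for the MurmurHash3 function that the PR uses.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// 32-bit FNV-1a, used here only as a compact stand-in for MurmurHash3
inline std::uint32_t fnv1a32(const std::string& s) {
  std::uint32_t h = 2166136261u; // FNV offset basis
  for (const unsigned char c : s) {
    h ^= c;
    h *= 16777619u; // FNV prime
  }
  return h;
}

// Return all pairs of distinct names that map to the same hash value.
// Repeated occurrences of the same name are not reported.
template <typename HashF>
std::vector<std::pair<std::string, std::string>> findCollisions(const std::vector<std::string>& names,
                                                                HashF hash) {
  std::map<std::uint32_t, std::string> seen; // hash value -> first name seen with it
  std::vector<std::pair<std::string, std::string>> collisions;
  for (const auto& name : names) {
    const auto res = seen.emplace(hash(name), name);
    if (!res.second && res.first->second != name) {
      collisions.emplace_back(res.first->second, name);
    }
  }
  return collisions;
}
```

Taking the hash as a parameter makes it easy to run the same list of names through different candidate hash functions.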

Speaking of collisions, @hegner, I think we could also use this PR to make collisions more visible in the Frame. Currently this is effectively silent (and potentially incomplete):

podio/include/podio/Frame.h

Lines 318 to 327 in 76c98a6

    template <typename CollT, typename>
    const CollT& Frame::put(CollT&& coll, const std::string& name) {
      const auto* retColl = static_cast<const CollT*>(m_self->put(std::make_unique<CollT>(std::move(coll)), name));
      if (retColl) {
        return *retColl;
      }
      // TODO: Handle collision case
      static const auto emptyColl = CollT();
      return emptyColl;
    }

which calls

podio/include/podio/Frame.h

Lines 410 to 427 in 76c98a6

    template <typename FrameDataT>
    const podio::CollectionBase* Frame::FrameModel<FrameDataT>::put(std::unique_ptr<podio::CollectionBase> coll,
                                                                    const std::string& name) {
      {
        std::lock_guard lock{*m_mapMtx};
        auto [it, success] = m_collections.try_emplace(name, std::move(coll));
        if (success) {
          // TODO: Check whether this collection is already known to the idTable
          // -> What to do on collision?
          // -> Check before we emplace it into the internal map to prevent possible
          //    collisions from collections that are potentially present from rawdata?
          it->second->setID(m_idTable.add(name));
          return it->second.get();
        }
      }
      return nullptr;
    }
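One way to make such collisions visible, sketched under the assumption that throwing is acceptable at this point, is to let the ID table remember which name owns each ID and to reject a second, different name with the same hash. `HashedIDTableSketch` is a hypothetical class for illustration and not the podio CollectionIDTable.

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

class HashedIDTableSketch {
public:
  using HashF = std::function<std::uint32_t(const std::string&)>;

  explicit HashedIDTableSketch(HashF hash) : m_hash(std::move(hash)) {}

  // Return the ID for name. Adding the same name twice is harmless,
  // but a different name with the same hash value raises an error.
  std::uint32_t add(const std::string& name) {
    const std::uint32_t id = m_hash(name);
    const auto res = m_names.emplace(id, name);
    if (!res.second && res.first->second != name) {
      throw std::runtime_error("collection ID collision between '" + res.first->second + "' and '" + name + "'");
    }
    return id;
  }

private:
  HashF m_hash;
  std::unordered_map<std::uint32_t, std::string> m_names; // ID -> owning name
};
```

Whether to throw, return an error code, or log a warning is a design decision that the sketch does not try to settle. It only shows that the check needs the reverse mapping from ID to name.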


hegner · 2023-06-05T13:46:06Z

@tmadlener - thanks for the rebase

tmadlener · 2023-06-05T15:48:32Z

Alright, I collected this list of collection names (unique_coll_names.txt) from the following sources:

- a REC file from the ILD standard reconstruction Output_REC
- a REC file from the CLIC standard reconstruction Output_REC
- the list of collections that can be found in the miniDST files that were produced for the Snowmass tutorials of the ILC (slide 8 of this presentation)
- the output names of the default configuration for k4SimDelphes
- the list of EIC reconstruction output as well as the output of rootls -t on a ddsim file provided by @wdconinc
- the list of collections that can be produced by the EDM4hep output of guinea-pig
- a list of collection names currently in use by CEPC (provided by @mirguest)

This currently yields 458 unique collection names with no collisions, which puts us somewhere between 1:10000 and 1:100000 in collision probability according to this table here:

tmadlener

I think (and hope) we have now considered pretty much all the collection names that are currently in use. I didn't find any collisions among them, so at least for now 32 bits seem to be enough.

tmadlener reviewed May 4, 2023

src/MurmurHash2.cpp (outdated)

hegner force-pushed the hash branch from e75dd5a to b72c00f Compare May 5, 2023 08:25

tmadlener mentioned this pull request May 6, 2023

Writing multiple podio collections with the same name in DD4hep... AIDASoft/DD4hep#1111

Closed

1 task

tmadlener reviewed May 7, 2023

include/podio/ObjectID.h (outdated)

src/selection.xml (outdated)

hegner mentioned this pull request May 15, 2023

Cleanup / reorganize tests #233

Closed

tmadlener linked an issue May 19, 2023 that may be closed by this pull request

CollectionIDs should not depend on the insertion order #381

Closed

hegner force-pushed the hash branch 2 times, most recently from 53f29e6 to 6d86c3f Compare May 25, 2023 07:26

tmadlener force-pushed the hash branch from 63d3ad2 to 191999a Compare May 25, 2023 20:07

tmadlener changed the title ~~[WIP] add hashing feature to CollectionID table~~ Make CollectionIDs a 32bit hash value of the collection name May 30, 2023

andresailer changed the title ~~Make CollectionIDs a 32bit hash value of the collection name~~ Make CollectionIDs a 64bit hash value of the collection name May 30, 2023

tmadlener changed the title ~~Make CollectionIDs a 64bit hash value of the collection name~~ Make CollectionIDs a 32bit hash value of the collection name May 30, 2023

tmadlener and others added 19 commits June 5, 2023 15:42

add hashing feature to CollectionID table

46c72f9

fix fallthrough warnings

491131b

fix murmurhash fallthrough; remove colllectionID optimizations assumi…

83168c8

…ng non-hashing

include clang-format suggestions

da8bb26

include clang-format

401820c

move collectionID from mix of int/unsigned to consistent uint64_t

1ab3d2a

fix collectionID in SIOBlock

b18fc26

use 32 bit hash; use murmurhash3

97db917

clang format murmurhash3

b0da0d7

fix fallthrough warnings

3e2e8f2

[clang-format] Format MurmurHash3 after fallthrough fixes

a572486

Change SIO parts back to 32 bits

6a1119f

[clang-formt] A few more format fixes

e6cb480

[clang-tidy] NOLINT murmurhash3 include guards

b9ea64a

Make dictionary genertion go back to 32 bit IDs

9e55216

Fix clang-tidy warning by explicitly including necessary header

fee0d03

[clang-tidy] Fix warnings in MurmurHash3

320a19d

Ignore tests in UB sanitizer runs

9d1038b

Test that will be deprecated with EventStore, so should be OK

Add standalone executable for collision detection

aee75ee

tmadlener force-pushed the hash branch from b1d809e to aee75ee Compare June 5, 2023 13:43

protect frame against double insert

914155e

tmadlener approved these changes Jun 6, 2023


tmadlener merged commit ac086e0 into master Jun 8, 2023
17 of 18 checks passed

tmadlener deleted the hash branch June 8, 2023 11:12

tmadlener restored the hash branch June 8, 2023 11:12

tmadlener mentioned this pull request Jul 3, 2023

Generated id method for user facing types has become useless #438

Closed

tmadlener mentioned this pull request Sep 28, 2023

Make Object::id() return an ObjectID since unsigned has become useless #493

Merged

tmadlener mentioned this pull request Nov 13, 2023

Make sure that the ROOT writers enforce consistency for the Frame contents they write #513

Merged
