Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CollectionIDs a 32bit hash value of the collection name #412

Merged
merged 20 commits into from Jun 8, 2023
Merged

Conversation

hegner
Copy link
Collaborator

@hegner hegner commented May 4, 2023

BEGINRELEASENOTES

  • Using string hashes as CollectionID based on MurmurHash

ENDRELEASENOTES

This PR is work in progress. The frame interface works. However the legacy interface used the previous structure of IDs for some optimizations, which I have to remove.

src/MurmurHash2.cpp Outdated Show resolved Hide resolved
@tmadlener
Copy link
Collaborator

How does this fare with reading files prior to this? It should work because we just replace the calculation of the ID, right? Can we easily verify that (in CI)?

@hegner
Copy link
Collaborator Author

hegner commented May 6, 2023

@tmadlener - on fixing the undetected narrowing from 64 to 32 bit I found quite a few signed/unsigned inconsistencies which haven't been caught before. the collectionID is now uint64_t throughout

@jmcarcell
Copy link
Member

What's the final take on the hash size, 64 bits? With 64 bits there is some small collision probability with many collections, right?

@hegner
Copy link
Collaborator Author

hegner commented May 7, 2023

Yes. 64bit hash size. 32 bit was too little to be safe from collisions for a huge number of collections. With 64 bits we are in very safe territory

@hegner
Copy link
Collaborator Author

hegner commented May 7, 2023

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the collectionID have 64 bits (and planning to use all of them) has potentially a few more implications, e.g. the id method will no longer be very useful:

unsigned int id() const { return getObjectID().collectionID * 10000000 + getObjectID().index; }

include/podio/ObjectID.h Outdated Show resolved Hide resolved
src/selection.xml Outdated Show resolved Hide resolved
@tmadlener
Copy link
Collaborator

Regarding backwards compatibility, for SIO we would need to increase the version number for the SIOCollectionIDTableBlock (and fix the current inconsistency):

SIOCollectionIDTableBlock() : sio::block("CollectionIDs", sio::version::encode_version(0, 4)) {
}
SIOCollectionIDTableBlock(podio::EventStore* store);
SIOCollectionIDTableBlock(std::vector<std::string>&& names, std::vector<int>&& ids, std::vector<std::string>&& types,
std::vector<short>&& isSubsetColl) :
sio::block("CollectionIDs", sio::version::encode_version(0, 3)),

sio::block("CollectionIDs", sio::version::encode_version(0, 3)) {

This could then be handled accordingly on the reading side:

podio/src/SIOBlock.cc

Lines 35 to 42 in 705721d

void SIOCollectionIDTableBlock::read(sio::read_device& device, sio::version_type version) {
device.data(_names);
device.data(_ids);
device.data(_types);
if (version >= sio::version::encode_version(0, 2)) {
device.data(_isSubsetColl);
}
}

There is also a missing conversion to uint64_t here.

std::vector<int> ids;

@hegner
Copy link
Collaborator Author

hegner commented May 7, 2023

Thanks. The SIO I really didn't test properly yet.

@tmadlener
Copy link
Collaborator

This looks like it needs a rebase (and potentially some conflict resolution).

For me this looks good, the main points that we need discussion IMHO are:

  • Are we OK with effectively increasing the ObjectID size by 50%
  • An alternative (potentially additional) approach would be to effectively externalize the collection ID generation to the outside and make it possible to supply the Frame with a pre-populated CollectionIDTable and then either falling back to hashing or failing on collection names that are not present in the supplied external table.

@hegner
Copy link
Collaborator Author

hegner commented May 16, 2023

Yes, that would be something for the next EDM4hep meeting to discuss.

@tmadlener tmadlener linked an issue May 19, 2023 that may be closed by this pull request
@hegner hegner force-pushed the hash branch 2 times, most recently from 53f29e6 to 6d86c3f Compare May 25, 2023 07:26
@tmadlener
Copy link
Collaborator

There are some conflicts. I think they should resolve themselves with a rebase on to master

@hegner
Copy link
Collaborator Author

hegner commented May 25, 2023

rebase didn't do it

@hegner
Copy link
Collaborator Author

hegner commented May 26, 2023

@tmadlener thanks

@tmadlener
Copy link
Collaborator

One of the sanitizer workflows seems to be picking up on #174 and clang-tidy is complaining about a few things in murmurhash3. They are in principle easily fixable, and since we are already changing it to make clang-format happy, I don't see a reason not to also fix these issues here: https://github.com/AIDASoft/podio/actions/runs/5093184726/jobs/9155485339?pr=412#step:4:659

@tmadlener
Copy link
Collaborator

I added another tag to the failing test to ignore it in UndefinedBehavior sanitizer runs since it is picking up on a known issue and the test will be obsolete once the EventStore is removed in any case.

@tmadlener tmadlener changed the title [WIP] add hashing feature to CollectionID table Make CollectionIDs a 32bit hash value of the collection name May 30, 2023
@tmadlener
Copy link
Collaborator

This PR is work in progress. The frame interface works. However the legacy interface used the previous structure of IDs for some optimizations, which I have to remove.

Just to confirm. This is no longer the case and this PR is ready as it is, right?

@andresailer andresailer changed the title Make CollectionIDs a 32bit hash value of the collection name Make CollectionIDs a 64bit hash value of the collection name May 30, 2023
@andresailer
Copy link
Member

It is 64 bits now, or was I mistaken in changing the title of the PR?

@jmcarcell
Copy link
Member

jmcarcell commented May 30, 2023

During the last call one week ago it was agreed that it would be 32 bit and collisions would be dealt with due to concerns about increasing sizes of classes (for example due to padding) and files (?), personally I would like to see some numbers first in case dealing with collisions turns out to be some work...

@tmadlener
Copy link
Collaborator

I have added a small utility tool that takes a list of collection names and checks for collisions using the chosen hash function. It should be fairly straight forward to come up with a list of all currently used collection names and see if we have collisions in there already.

Speaking of collisions @hegner, I think we could also use this PR to make collisions more visible in the Frame, currently this is effectively silent (and potentially incomplete):

podio/include/podio/Frame.h

Lines 318 to 327 in 76c98a6

template <typename CollT, typename>
const CollT& Frame::put(CollT&& coll, const std::string& name) {
const auto* retColl = static_cast<const CollT*>(m_self->put(std::make_unique<CollT>(std::move(coll)), name));
if (retColl) {
return *retColl;
}
// TODO: Handle collision case
static const auto emptyColl = CollT();
return emptyColl;
}

which calls

podio/include/podio/Frame.h

Lines 410 to 427 in 76c98a6

template <typename FrameDataT>
const podio::CollectionBase* Frame::FrameModel<FrameDataT>::put(std::unique_ptr<podio::CollectionBase> coll,
const std::string& name) {
{
std::lock_guard lock{*m_mapMtx};
auto [it, success] = m_collections.try_emplace(name, std::move(coll));
if (success) {
// TODO: Check whether this collection is already known to the idTable
// -> What to do on collision?
// -> Check before we emplace it into the internal map to prevent possible
// collisions from collections that are potentially present from rawdata?
it->second->setID(m_idTable.add(name));
return it->second.get();
}
}
return nullptr;
}

@tmadlener tmadlener changed the title Make CollectionIDs a 64bit hash value of the collection name Make CollectionIDs a 32bit hash value of the collection name May 30, 2023
@hegner
Copy link
Collaborator Author

hegner commented Jun 5, 2023

@tmadlener - thanks for the rebase

@tmadlener
Copy link
Collaborator

tmadlener commented Jun 5, 2023

Alright, I collected this list of collection names unique_coll_names.txt from the following sources:

  • a REC file from the ILD standard reconstruction Output_REC
  • a REC file from the CLIC standard reconstruction Output_REC
  • the list of files that can be found in the miniDST files that were produced for the snowmass tutorials of the ILC (slide 8 of this presentation
  • The output names of the default configuration for k4SimDelphes
  • The list of EIC reconstruction output as well as the output of rootls -t on a ddsim file provided by @wdconinc.
  • The list of collections that can be produced by the EDM4hep output of guinea-pig
  • A list of collection names currently in use by CEPC (provided by @mirguest)

This currently yields 458 unique collection names which have no collisions, and puts us somewhere into the 1:10000 and 1:100000 collision probability range according to this table here:

image

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (and hope) we have now considered pretty much all the collection names that are currently in use. I didn't find any collisions among them, so at least for now 32 bits seem to be enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CollectionIDs should not depend on the insertion order
4 participants