Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving RDataSource closer to Podio/EDM4hep #593

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

kjvbrt
Copy link
Contributor

@kjvbrt kjvbrt commented May 3, 2024

BEGINRELEASENOTES

  • Adding ROOT RDataSource

ENDRELEASENOTES

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kjvbrt. I wasn't aware this was possible in this generality. Very nice. I am probably still missing quite a few things, but I tried to summarize my current understanding below and added and a few comments / questions inline

For my general understanding, what this does is basically:

  • Create a reader for each thread / slot
  • Keep a vector of Frames, again one per thread / slot
  • Populate a vector of collections per thread / slot, where
    • each collection always is at the same position, so that we can simply index into it

I am wondering whether we could do with just one reader instead of multiple, but I suppose then we would probably have to do our own locking around that, because the RDataSource doesn't do that for us?

If I understand correctly, we could in the end even replace the current ROOTReader with the Reader that is introduced in #522 and effectively support all backends + RDataFrame?

include/podio/ROOTDataSource.h Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
Comment on lines 28 to 29
template<typename T>
std::vector<T**> GetColumnReaders(std::string_view columnName);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the documentation of RDataSource correctly, we don't need this at all. Also this is not overriding anything from the RDataSource base class. I think all of the "magic" is happening in the GetColumnReadersImpl, right?

Comment on lines 154 to 105
m_frames.emplace_back(
std::make_unique<podio::Frame>(
podio::Frame(m_podioReaders[i]->readEntry("events", 0))));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we reading the same event nSlots times this way? As far as I can see we read the correct entry in any case later in SetEntry. Can we not just simply put an empty Frame here? Or even just populate the vector with a bunch of nullptr?

@tmadlener
Copy link
Collaborator

The sanitizer workflows are most likely failing because the tests that produce the input files are not running. You can disable them by adding the new tests also to the list in CTestCustom.cmake

Copy link

@Zehvogel Zehvogel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid the hardcoded strings :)

src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
Comment on lines 88 to 66
// Get collections stored in the files
std::vector<std::string> collNames = frame.getAvailableCollections();
// std::cout << "podio::ROOTDataSource: Found following collections:\n";
for (auto& collName : collNames) {
const podio::CollectionBase* coll = frame.get(collName);
if (coll->isValid()) {
m_columnNames.emplace_back(collName);
m_columnTypes.emplace_back(coll->getValueTypeName());
// std::cout << " - " << collName << "\n";
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also make the parameters of the frame available as columns, e.g. to access the cross section? :)

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few new minor suggestions for using the reader interface, now that it's available. I think I got all the places, but I might have missed some.

include/podio/ROOTDataSource.h Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
include/podio/ROOTDataSource.h Outdated Show resolved Hide resolved
include/podio/ROOTDataSource.h Outdated Show resolved Hide resolved
src/ROOTDataSource.cc Outdated Show resolved Hide resolved
Comment on lines 110 to 116
target_link_libraries(podioRootIO PUBLIC podio::podio
ROOT::Core
ROOT::RIO
ROOT::Tree
ROOT::ROOTVecOps
ROOT::ROOTDataFrame)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am contemplating whether we should move the ROOTDataSource into a separate library instead of making it part of the podioRootIO one. The main argument against it is that we will have one more library. However, if we really make use of the Reader interface internally, then we would need to link podioRootIO against podioIO (making a cyclic lib dependency) and potentially also against podioSioIO. Then we would have no separation of the libraries at all any longer.

Since it is probably easier to merge libraries down the line, than splitting them apart, I would suggest to leave the podioRootIO library untouched and instead make a new podioRDataSource library that links against podio::podioIO (and all the necessary things from ROOT).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had to link against podio::podioIO in couple of places. Will create separate library...

@kjvbrt
Copy link
Contributor Author

kjvbrt commented Jul 25, 2024

I continue with testing the changes made here in FCCAnalyses

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an issue with the installation now, because the EDM4hep tests are failing. As far as I can tell it has to do with header files that are no longer installed. Otherwise it's mainly minor comments.

Comment on lines +67 to +69
# Export compile commands --- used by the tools from clang family
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this in an alias for cmake. I don't think we want to change a default here and rather let the users decide whether they want this or not.

std::vector<void*> GetColumnReadersImpl(std::string_view name, const std::type_info& typeInfo) override;

std::string AsString() override {
return "Podio data source";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are spaces a problem here? Or the fact that this is not the same as the class name?

endif()
if (NOT ENABLE_RNTUPLE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (NOT ENABLE_RNTUPLE)
if (NOT ENABLE_DATASOURCE)

At least that would make sense for the files that get filtered.

Comment on lines +201 to +203
file(GLOB headers_necessary
RELATIVE ${PROJECT_SOURCE_DIR}
"${PROJECT_SOURCE_DIR}/install/podio/*.h")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is fishy with this construct here or the filtering below, because no headers are installed now: https://github.com/AIDASoft/podio/actions/runs/10090563011/job/27900250138?pr=593#step:6:600

unsigned int nEventsInFiles = 0;
auto podioReader = podio::makeReader(m_filePathList);
nEventsInFiles = podioReader.getEntries(podio::Category::Event);
frame = podio::Frame(podioReader.readFrame(podio::Category::Event, 0));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
frame = podio::Frame(podioReader.readFrame(podio::Category::Event, 0));
frame = podioReader.readFrame(podio::Category::Event, 0);

The Reader interface already gives us a fully usable Frame, we don't need to construct it any longer.

}

for (size_t i = 0; i < m_nSlots; ++i) {
m_frames.emplace_back(std::make_unique<podio::Frame>(podio::Frame{}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m_frames.emplace_back(std::make_unique<podio::Frame>(podio::Frame{}));
m_frames.emplace_back(std::make_unique<podio::Frame>());

make_unique forwards to c'tor, so there is no need to first create a frame and then make make_unique call the copy c'tor ?

};

///
/// Not used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not used, can you explain why it is needed to be here?

std::vector<std::string> m_filePathList = {};

/// Total number of events
unsigned int m_nEvents = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unsigned int m_nEvents = 0;
ULong64_t m_nEvents = 0;

Shouldn't this match with the ranges defined in the next lines?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants