[RFC] Add tests and rudimentary protections for default-constructed PortableCollections #44844

Open
wants to merge 4 commits into
base: master

Conversation

makortel
Contributor

@makortel makortel commented Apr 24, 2024

PR description:

The HCAL CUDA->Alpaka porting work by @kakwok demonstrated bad behavior with default-constructed PortableCollection, so I added some tests to demonstrate those (at one point the HCAL work seemed to point to some strange behavior also for zero-sized PortableCollection, but I was not able to replicate that, and the strange behavior also didn't repeat later with the HCAL Alpaka code).

A default-constructed PortableCollection is in a state where it has no buffer. The PortableCollection interface does not provide a way to check the state, and therefore a caller has no way to know if PortableCollection::buffer() leads to defined or undefined behavior (because std::optional<T>::operator*() leads to undefined behavior if the optional does not contain a value). This PR makes one attempt at adding a function PortableCollection::isValid() that allows checking the validity, and adds assert()s to all accessors (plus a size() function to be able to access the SoA size without the assert()). I'm not sure if this is the behavior we really want, but at least it is a starting point for a discussion.
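To make the idea concrete, here is a minimal sketch of the pattern described above. ToyCollection is a hypothetical stand-in that stores a std::vector in a std::optional; it is not the actual PortableCollection implementation, and the real buffer and layout types are assumed away.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for PortableCollection; not the CMSSW class.
class ToyCollection {
public:
  // Default construction leaves the optional empty, i.e. there is no buffer.
  ToyCollection() = default;

  // Construction with a size allocates a buffer, possibly of zero elements.
  explicit ToyCollection(std::size_t n) : buffer_(std::vector<float>(n)) {}

  // Lets the caller check the state before touching the buffer.
  bool isValid() const { return buffer_.has_value(); }

  // Unchecked: safe to call in any state, reports 0 when there is no buffer.
  std::size_t size() const { return buffer_ ? buffer_->size() : 0; }

  // Checked: dereferencing an empty optional would be undefined behavior,
  // so the precondition is asserted first.
  std::vector<float>& buffer() {
    assert(isValid());
    return *buffer_;
  }

private:
  std::optional<std::vector<float>> buffer_;
};
```

With this shape, a caller that cannot rule out default construction checks isValid() (or size()) first and only then calls buffer().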

The tests for 0-size SoA Layout and PortableCollection are for ensuring valid behavior when the SoA also has scalars in addition to columns. (and now I'm wondering what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?)

The PortableObject and PortableMultiCollection should also be treated consistently, but I wanted to get feedback on the direction we want to go first.

PR validation:

Unit tests pass

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Possibly to be backported to 14_0_X

…eCollections

Also add tests for zero-sized PortableCollections
@cmsbuild
Contributor

cmsbuild commented Apr 24, 2024

cms-bot internal usage

View& view() {
assert(isValid());
return view_;
}
Contributor Author

Alternatively the View accessors could be left unchecked, requiring users to explicitly check isValid() and/or that the column/scalar accessors of the View are non-nullptr whenever it is not clearly guaranteed that the PortableCollection is non-default-constructed.

(we could also throw an exception instead of assert(), but maybe the use of default-constructed PortableCollection could be more of a logic error rather than something that would depend e.g. on the data?)
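For comparison, the exception-based alternative could look roughly like the sketch below. ToyThrowingCollection and its error message are hypothetical; std::logic_error is used here only because the comment above frames the misuse as a logic error, and is not what the PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in; the accessor throws instead of asserting.
class ToyThrowingCollection {
public:
  ToyThrowingCollection() = default;
  explicit ToyThrowingCollection(std::size_t n) : buffer_(std::in_place, n) {}

  bool isValid() const { return buffer_.has_value(); }

  // Unlike assert(), this check is also active in builds with NDEBUG defined.
  std::vector<float>& buffer() {
    if (!isValid()) {
      throw std::logic_error("buffer() called on a default-constructed collection");
    }
    return *buffer_;
  }

private:
  std::optional<std::vector<float>> buffer_;
};
```

The trade-off is a branch on every access in optimized builds, in exchange for an error that can be reported instead of an abort or undefined behavior.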

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44844/40093

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • DataFormats/Portable (heterogeneous)
  • DataFormats/SoATemplate (heterogeneous)

@fwyzard, @cmsbuild, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

//REQUIRE(coll->num() == 42);

// CopyToDevice<PortableHostCollection<T>> is not defined
#ifndef ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
Contributor Author

These #ifdefs could be removed with #43969

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ef15bb/39082/summary.html
COMMIT: c30702c
CMSSW: CMSSW_14_1_X_2024-04-24-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44844/39082/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
24834.78 step 2
25088.203 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB is having errors in the relvals. The error does NOT come from this pull request.

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 23
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39717
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor

fwyzard commented Apr 25, 2024

Is there any use case for a default-constructed PortableCollection, other than the ROOT dictionary?

If that is the only use case, could we make the default constructor private, and somehow declare the ROOT dictionary stuff as a friend ?

@fwyzard
Contributor

fwyzard commented Apr 25, 2024

what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?

I think that a 0-size PortableCollection should be a well-defined object, with a device and a 0-size buffer associated to it.

The reason that in general we cannot do this for a default constructed PortableCollection is that we don't know what device to use.
Though, for the host case, we do know since there is only one possibility (so far), and we could do the same ?

@makortel
Contributor Author

Is there any use case for a default-constructed PortableCollection, other than the ROOT dictionary?

It's a good question, and I'd be tempted to answer "no". On the other hand, this issue came up with code along the lines of

// in class definition
device::EDPutToken<PortableCollection<...>> putToken_;

...

// in produce() function
if (inputVector.empty()) {
  iEvent.emplace(putToken_); // leads to default-constructed PortableCollection
}

It's not necessarily a good use case, but it is easy to do.
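The mechanism behind this can be illustrated with a small sketch. toyEmplace and ToyProduct are hypothetical names and do not reproduce the framework's emplace(); the only assumption is that emplace forwards its remaining arguments to the product's constructor, so an empty argument list selects the default constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical product type with the same two states as discussed above.
struct ToyProduct {
  ToyProduct() = default;                                          // no buffer
  explicit ToyProduct(std::size_t n) : buffer(std::in_place, n) {} // buffer of n elements

  bool isValid() const { return buffer.has_value(); }

  std::optional<std::vector<int>> buffer;
};

// Hypothetical emplace-like helper: forwards its arguments to T's constructor.
// With an empty argument pack this value-initializes T via its default constructor.
template <typename T, typename... Args>
T toyEmplace(Args&&... args) {
  return T(std::forward<Args>(args)...);
}
```

In this toy model, toyEmplace<ToyProduct>() yields an object without a buffer, while toyEmplace<ToyProduct>(std::size_t{0}) yields a valid object with an empty buffer.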

If that is the only use case, could we make the default constructor private, and somehow declare the ROOT dictionary stuff as a friend ?

I don't know how easy it would be to figure out the components in ROOT that need the default constructor (some of them might even be in anonymous namespaces). In addition, there are some code paths where edm::Wrapper<T> calls the default constructor of T. Furthermore, since there are classes that either inherit from PortableCollection or use it via composition, at least the default constructors of those classes must be able to call the default constructor of PortableCollection. Having Portable{Host,Device}Collection declare all of those as friends doesn't sound very scalable.

With @Dr15Jones we were not able to come up with hacks that would hide the default constructor from user code (or even report its use with a static analyzer) without blowing up in our faces in some way.

(in the long term it would be great to be able to move edm::Wrapper<T> to use std::optional<T> some day, but that won't be easy either)

@fwyzard
Contributor

fwyzard commented Apr 25, 2024

I see.

Can we prevent at least default-constructed PortableDeviceCollections ?

@makortel
Contributor Author

what should be the behavior of a 0-size PortableCollection for a SoA that has only columns?

I think that a 0-size PortableCollection should be a well-defined object, with a device and a 0-size buffer associated to it.

I agree that a 0-size PortableCollection should be defined at least to the extent of having a device and of the recommended way of asking the size returning 0.

What would be the meaning of the 0-size buffer? Whatever happens to be returned by the underlying allocator of the backend? E.g. for malloc() that would be implementation-defined (cppreference), and cudaMalloc() apparently returns NULL (NVIDIA forum).
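One way to sidestep the allocator dependence on the host side is to never forward a zero-size request to the allocator at all. The sketch below shows that option only as an illustration; allocateNormalized, ToyBuffer and FreeDeleter are hypothetical names, and nothing here reflects what Alpaka or the CMSSW caching allocator actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter that releases memory obtained from std::malloc.
struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};

// Hypothetical owning buffer handle.
using ToyBuffer = std::unique_ptr<void, FreeDeleter>;

// std::malloc(0) may return either a null pointer or a unique non-null pointer
// (implementation-defined). Normalizing here gives a zero-size buffer the same
// representation on every platform: a buffer object that holds nullptr.
ToyBuffer allocateNormalized(std::size_t bytes) {
  if (bytes == 0) {
    return ToyBuffer{nullptr};
  }
  return ToyBuffer{std::malloc(bytes)};
}
```

This corresponds to the "buffer object, but containing a nullptr" option raised below; whether that is the desired behavior is exactly the open question.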

Does Alpaka itself make any assumptions on whether the underlying allocator returns a nullptr or not?

Would we want a buffer object, but containing a nullptr? Or do we care?

The reason that in general we cannot do this for a default constructed PortableCollection is that we don't know what device to use. Though, for the host case, we do know since there is only one possibility (so far), and we could do the same ?

Right, I thought of the host case too. The device itself is known, but then we would not be able to make the "queue association" (i.e. cached allocation). I'm not sure if that would really matter though for this corner case.

@makortel
Contributor Author

Can we prevent at least default-constructed PortableDeviceCollections ?

The additional indirection (or wrapping) via edm::DeviceProduct<T> would at least technically make it straightforward. I need to think a bit and make some tests. It would be an asymmetry between PortableHostCollection and PortableDeviceCollection, but it probably would prevent the use of the default constructor in portable code (i.e. even if such code would compile for the Serial backend, it would not compile for GPU backends).

@cmsbuild
Contributor

cmsbuild commented May 1, 2024

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44844/40142

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@makortel
Contributor Author

makortel commented May 1, 2024

Can we prevent at least default-constructed PortableDeviceCollections ?

The additional indirection (or wrapping) via edm::DeviceProduct<T> would at least technically make it straightforward. I need to think a bit and make some tests. It would be an asymmetry between PortableHostCollection and PortableDeviceCollection, but it probably would prevent the use of the default constructor in portable code (i.e. even if such code would compile for the Serial backend, it would not compile for GPU backends).

Allowing edm::DeviceProduct<T> to be used with types that do not have a default constructor (7f2113b), and removing the default constructor from PortableDeviceCollection (b6a555e) turned out to work pretty well (and revealed some cases where the default constructor is currently used).
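As a rough illustration of how a wrapper can stay default-constructible while the wrapped type is not, consider the sketch below. ToyWrapper and NoDefault are hypothetical, and storing the value in a std::optional is an assumption made for the example; it is not a description of what commit 7f2113b does inside edm::DeviceProduct<T>.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>
#include <utility>

// Hypothetical product type without a default constructor.
struct NoDefault {
  explicit NoDefault(int n) : size(n) {}
  int size;
};

// Hypothetical wrapper: default-constructible (e.g. for I/O machinery)
// regardless of whether T is.
template <typename T>
class ToyWrapper {
public:
  ToyWrapper() = default;  // holds no T

  // Constructs the wrapped T in place from the given arguments.
  template <typename... Args>
  explicit ToyWrapper(std::in_place_t, Args&&... args)
      : value_(std::in_place, std::forward<Args>(args)...) {}

  bool isPresent() const { return value_.has_value(); }

  T const& get() const {
    assert(isPresent());
    return *value_;
  }

private:
  std::optional<T> value_;
};

// The wrapper satisfies the default-constructibility requirement on T's behalf.
static_assert(!std::is_default_constructible_v<NoDefault>);
static_assert(std::is_default_constructible_v<ToyWrapper<NoDefault>>);
```

User code that tries to default-construct NoDefault directly then fails to compile, while the wrapper can still be created empty.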

That made me wonder what to do with the moved-from state of PortableCollection (especially in conjunction with std::optional<PortableCollection<...>> module data members). I mean, after being moved from, a PortableCollection object is in a state where isValid() == true, metadata().size() returns the same value as before the move, but the Alpaka buffer is in a moved-from state (i.e. effectively null, although I got the impression from alpaka-group/alpaka#1426 and alpaka-group/alpaka#1445 that this is not really considered a valid state?)

In e4d2b53 I explored what it would entail to make PortableHostCollection's moved-from state well defined (isValid() == false, and metadata().size() returns 0). That made me wonder how much we really care about that state.

On one hand, the present moved-from state fulfills the standard's requirement of the object being destructible, and "minimal recommendation of the core guidelines" of the object being assignable. I.e. when used properly, things work.

On the other hand, e.g. std::vector and std::unique_ptr have a well-defined moved-from state, and perhaps it would be beneficial to be a little bit defensive to be able to catch at least some programming mistakes? (use-after-move would still lead to a crash in many cases, so the practical question could be how easy it would be to find the cause of the error)
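A sketch of what a well-defined moved-from state could look like is given below. ToyMovable is a hypothetical class, not the code in e4d2b53; it only mirrors the stated goal that a moved-from object reports isValid() == false and a size of 0.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical collection whose moved-from state is explicitly reset.
class ToyMovable {
public:
  explicit ToyMovable(std::size_t n) : buffer_(std::in_place, n), size_(n) {}

  ToyMovable(ToyMovable&& other) noexcept
      : buffer_(std::move(other.buffer_)), size_(std::exchange(other.size_, 0)) {
    // A moved-from std::optional still reports has_value() == true,
    // so it has to be reset explicitly to make the source invalid.
    other.buffer_.reset();
  }

  ToyMovable& operator=(ToyMovable&& other) noexcept {
    if (this != &other) {
      buffer_ = std::move(other.buffer_);
      size_ = std::exchange(other.size_, 0);
      other.buffer_.reset();
    }
    return *this;
  }

  bool isValid() const { return buffer_.has_value(); }
  std::size_t size() const { return size_; }

private:
  std::optional<std::vector<float>> buffer_;
  std::size_t size_;
};
```

With this, a use-after-move trips the same isValid() check as a default-constructed object, which should make such mistakes easier to trace.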

3 participants