
Migrate segments to SoA+PortableCollection #93

Merged
merged 10 commits into from
Oct 18, 2024

Conversation

ariostas
Member

This PR is a draft for a possible way to migrate our existing classes to SoA+PortableCollection. Since segments have columns of three different lengths, I had to introduce an extra struct. I still need to think about whether there is a better way to do this and whether the names can be improved. Also, I haven't tested whether it works yet.
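As a rough illustration of the constraint described above, here is a minimal plain-C++ sketch (hypothetical struct and member names, not the actual SoA/PortableCollection machinery): each SoA block ties all of its columns to one common length, so segments with per-segment, per-module, and per-pixel-segment columns end up as three blocks bundled into one owning collection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: each SoA block has one common column length.
struct SegmentsSoA {           // per-segment columns (length = nSegmentsMax)
  std::vector<unsigned> mdIndices;
  std::vector<float> dPhis;
};

struct SegmentsOccupancySoA {  // per-module columns (length = nModules + 1)
  std::vector<unsigned> nSegments;
};

struct SegmentsPixelSoA {      // per-pixel-segment columns (length = nPixelMax)
  std::vector<float> ptIn;
  std::vector<int> charge;
};

// A collection bundling the three differently sized blocks, analogous in
// spirit to a multi-layout portable collection.
struct SegmentsCollection {
  SegmentsSoA segments;
  SegmentsOccupancySoA occupancy;
  SegmentsPixelSoA pixel;

  SegmentsCollection(std::size_t nSeg, std::size_t nMod, std::size_t nPix) {
    segments.mdIndices.resize(nSeg);
    segments.dPhis.resize(nSeg);
    occupancy.nSegments.resize(nMod + 1);
    pixel.ptIn.resize(nPix);
    pixel.charge.resize(nPix);
  }
};
```

In the actual PR the blocks are SoA layouts inside a single portable collection, with the per-block sizes passed at construction time.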

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.4    321.7    112.3     44.1     92.9    496.9    125.5    148.6    102.2      1.7    1479.4     949.1+/- 255.2     406.1   explicit_cache[s=4] (target branch)
   avg     43.2    320.4    110.7     63.0     97.4    502.7    129.4    171.0    101.1      2.0    1540.9     995.0+/- 273.9     418.3   explicit_cache[s=4] (this PR)

@slava77

slava77 commented Sep 23, 2024

Here is a timing comparison:

Is the T3 increase from 44 to 63 real?
It would be nice to re-check locally on GPU to see if this is reproduced.

@ariostas
Member Author

Here's a timing comparison on cgpu-1.

This PR (9e2a402)
Total Timing Summary
Average time for map loading = 469.029 ms
Average time for input loading = 7456.79 ms
Average time for lst::Event creation = 0.0362964 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     16.2      1.3      0.4      1.4      1.5      0.4      0.8      0.6      1.2      0.1      23.8       7.2+/-  1.4      25.8   explicit_cache[s=1]
   avg      5.1      1.4      0.6      1.9      1.9      0.5      1.1      0.7      1.7      0.2      15.2       9.6+/-  2.0       8.7   explicit_cache[s=2]
   avg      8.6      1.9      0.9      3.3      3.1      0.6      2.3      1.3      2.9      0.3      25.3      16.1+/-  3.4       7.0   explicit_cache[s=4]
   avg     13.7      2.4      1.4      4.7      4.5      0.7      3.6      2.0      4.4      0.4      37.7      23.3+/-  5.2       6.8   explicit_cache[s=6]
   avg     19.8      3.2      1.7      6.2      6.0      0.9      4.8      2.6      5.4      0.6      51.2      30.5+/-  7.1       6.8   explicit_cache[s=8]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 468.89 ms
Average time for input loading = 7462.57 ms
Average time for lst::Event creation = 0.0350249 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     16.6      1.2      0.4      1.4      1.5      0.4      0.7      0.4      1.2      0.1      23.9       6.9+/-  1.4      26.1   explicit_cache[s=1]
   avg      5.5      1.4      0.5      1.8      1.9      0.4      1.1      0.6      1.7      0.2      15.2       9.3+/-  2.0       8.7   explicit_cache[s=2]
   avg      8.7      1.7      0.8      3.1      2.8      0.5      2.2      1.2      2.7      0.4      24.1      14.9+/-  3.1       6.7   explicit_cache[s=4]
   avg     14.2      2.2      1.3      4.2      4.2      0.7      3.3      1.8      4.0      0.5      36.4      21.5+/-  5.1       6.5   explicit_cache[s=6]
   avg     21.0      2.9      1.5      5.2      5.4      0.8      4.6      2.6      5.0      0.6      49.4      27.6+/-  6.4       6.6   explicit_cache[s=8]

It seems like it might actually be a bit slower. I'll check whether initializing the whole buffer is causing this, or if it's unrelated.

I'm still working on getting the CI to run the CMSSW comparison.

@ariostas
Member Author

/run cmssw


There was a problem while building and running with CMSSW. The logs can be found here.


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas
Member Author

ariostas commented Sep 25, 2024

Pending the CI test, it seems like initializing only the required columns fixed the timing.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 464.98 ms
Average time for input loading = 7410.43 ms
Average time for lst::Event creation = 0.0355244 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     15.3      1.2      0.4      1.4      1.5      0.4      0.7      0.5      1.2      0.1      22.6       6.9+/-  1.4      24.5   explicit_cache[s=1]
   avg      4.7      1.4      0.5      1.9      1.8      0.5      1.1      0.7      1.7      0.1      14.4       9.2+/-  2.0       8.3   explicit_cache[s=2]
   avg      8.2      1.7      0.9      3.1      2.9      0.5      2.2      1.2      2.8      0.3      23.7      15.0+/-  3.3       6.6   explicit_cache[s=4]
   avg     13.4      2.2      1.3      4.3      4.3      0.7      3.4      1.9      4.1      0.4      35.9      21.9+/-  5.8       6.5   explicit_cache[s=6]
   avg     19.6      2.9      1.5      5.6      5.7      0.8      4.7      2.7      5.2      0.6      49.1      28.7+/-  7.5       6.5   explicit_cache[s=8]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 461.071 ms
Average time for input loading = 7381.67 ms
Average time for lst::Event creation = 0.0382719 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     17.6      1.2      0.4      1.4      1.5      0.4      0.7      0.4      1.2      0.1      24.9       6.9+/-  1.4      27.0   explicit_cache[s=1]
   avg      5.5      1.4      0.5      1.9      1.9      0.4      1.1      0.6      1.7      0.2      15.3       9.3+/-  1.9       8.8   explicit_cache[s=2]
   avg      8.8      1.7      0.8      3.0      2.9      0.5      2.1      1.2      2.8      0.4      24.2      14.8+/-  3.0       6.7   explicit_cache[s=4]
   avg     14.4      2.2      1.2      4.1      4.3      0.7      3.3      1.8      4.0      0.5      36.5      21.4+/-  5.4       6.6   explicit_cache[s=6]
   avg     21.1      3.0      1.5      5.4      5.8      0.8      4.6      2.6      5.0      0.8      50.5      28.6+/-  7.5       6.7   explicit_cache[s=8]

/run standalone


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.7    323.4    113.3     45.9     94.6    500.1    127.2    148.9    102.3      3.0    1492.4     958.7+/- 250.9     408.1   explicit_cache[s=4] (target branch)
   avg     34.5    323.6    118.3     62.6     98.3    504.3    128.5    171.9    101.6      1.4    1545.0    1006.2+/- 271.4     420.7   explicit_cache[s=4] (this PR)

@ariostas
Member Author

ariostas commented Sep 25, 2024

Hmm, it's weird that the T3 timing is still significantly higher in the CI.

@ariostas
Member Author

/run cmssw


There was a problem while building and running with CMSSW. The logs can be found here.


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas
Member Author

This is the timing comparison for CPU on cgpu-1. It must be that the new memory layout is not very favorable for CPU, but at least it seems like it's not too bad on GPU.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 436.833 ms
Average time for input loading = 7381.81 ms
Average time for lst::Event creation = 0.00952539 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     25.6    210.9     79.7     50.6     76.2    263.4     69.3    116.3     56.0      0.5     948.4     659.4+/- 172.5     949.3   explicit_cache[s=1]
   avg     23.6    210.5     79.8     47.5     70.0    261.7     69.6    105.6     55.7      1.2     925.2     639.9+/- 170.4     237.7   explicit_cache[s=4]
   avg     25.5    233.4     88.1     51.6     78.9    291.4     76.9    114.1     61.7      1.7    1023.5     706.5+/- 184.3      76.3   explicit_cache[s=16]
   avg     30.1    239.7    110.3     65.5     89.1    294.4     81.6    119.5     63.2     11.8    1105.0     780.5+/- 191.8      45.0   explicit_cache[s=32]
   avg     45.1    259.1    102.9     76.0     99.4    309.3     85.1    124.2     66.5      7.7    1175.5     821.0+/- 220.9      27.5   explicit_cache[s=64]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 428.306 ms
Average time for input loading = 7442.49 ms
Average time for lst::Event creation = 0.0342869 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     24.3    229.5     88.7     37.8     74.1    260.3     67.4     84.7     55.8     10.4     932.8     648.3+/- 153.6     934.3   explicit_cache[s=1]
   avg     22.7    214.4     79.5     30.8     70.3    258.0     68.5     85.5     56.2      2.8     888.7     608.0+/- 157.3     231.3   explicit_cache[s=4]
   avg     26.7    237.3     90.9     36.3     81.6    286.3     76.5     95.1     62.3      3.7     996.7     683.7+/- 164.7      73.5   explicit_cache[s=16]
   avg     33.0    240.5     94.2     37.1     82.7    287.2     76.9     95.8     62.6      4.7    1014.8     694.6+/- 165.4      40.8   explicit_cache[s=32]
   avg     68.2    254.0    108.0     53.0     94.4    297.7     83.5    101.9     64.6     11.2    1136.4     770.5+/- 185.7      26.6   explicit_cache[s=64]


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas ariostas changed the title Moved segments to SoA+PortableCollection Migrate segments to SoA+PortableCollection Sep 30, 2024
Comment on lines +249 to +261
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.ptIn(), size), ptIn, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.ptErr(), size), ptErr, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.px(), size), px, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.py(), size), py, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.pz(), size), pz, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.etaErr(), size), etaErr, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.isQuad(), size), isQuad, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.eta(), size), eta, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.phi(), size), phi, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.charge(), size), charge, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.seedIdx(), size), seedIdx, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.superbin(), size), superbin, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.pixelType(), size), pixelType, size);

If I remember correctly, the final size argument can be omitted and will be assumed to match the size of the copy's destination.

Member Author

That usually works, but not when combining std::vector and Alpaka buffers. You end up getting this error:

In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/buf/Traits.hpp:10,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/dev/DevCpu.hpp:11,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/acc/AccCpuOmp2Blocks.hpp:36,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/alpaka.hpp:13,
                 from /home/users/anriosta/CMSSW_tests/CMSSW_14_2_0_pre1/src/RecoTracker/LSTCore/standalone/../../../HeterogeneousCore/AlpakaInterface/interface/memory.h:6,
                 from ../../src/alpaka/Event.dev.cc:1:
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp: In instantiation of 'auto alpaka::createTaskMemcpy(TViewDstFwd&&, const TViewSrc&, const TExtent&) [with TExtent = Vec<std::integral_constant<long unsigned int, 1>, long unsigned int>; TViewSrc = std::vector<float>; TViewDstFwd = ViewPlainPtr<DevCpu, float, std::integral_constant<long unsigned int, 1>, unsigned int>]':
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:312:40:   required from 'void alpaka::memcpy(TQueue&, TViewDstFwd&&, const TViewSrc&) [with TViewSrc = std::vector<float>; TViewDstFwd = ViewPlainPtr<DevCpu, float, std::integral_constant<long unsigned int, 1>, unsigned int>; TQueue = QueueGenericThreadsBlocking<DevCpu>]'
../../src/alpaka/Event.dev.cc:249:17:   required from here
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:277:58: error: static assertion failed: The destination view and the extent are required to have compatible index types!
  277 |             meta::IsIntegralSuperset<DstIdx, ExtentIdx>::value,
      |                                                          ^~~~~
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:277:58: note: 'std::integral_constant<bool, false>::value' evaluates to false


Thanks for the error message - I will try to reproduce it and open an issue with alpaka.
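The static assertion above boils down to an index-type compatibility check between the destination view (indexed with unsigned int) and the extent (long unsigned int). A simplified standalone stand-in for that kind of trait (not the actual alpaka meta::IsIntegralSuperset, which handles more cases) illustrates the failure mode:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Simplified stand-in: Super is an "integral superset" of Sub when both are
// integral, have the same signedness, and every Sub value fits into Super.
template <typename Super, typename Sub>
struct IsIntegralSuperset
    : std::bool_constant<std::is_integral_v<Super> && std::is_integral_v<Sub> &&
                         std::is_signed_v<Super> == std::is_signed_v<Sub> &&
                         sizeof(Super) >= sizeof(Sub)> {};

// Widening the index type is fine; a destination view indexed with a 32-bit
// type cannot absorb a 64-bit extent, which is what the assertion rejects.
constexpr bool okWiden = IsIntegralSuperset<std::uint64_t, std::uint32_t>::value;
constexpr bool badNarrow = IsIntegralSuperset<std::uint32_t, std::uint64_t>::value;
```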

@fwyzard

fwyzard commented Oct 2, 2024

@ariostas @slava77 could you tell me how to interpret these timing results?

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.7    323.4    113.3     45.9     94.6    500.1    127.2    148.9    102.3      3.0    1492.4     958.7+/- 250.9     408.1   explicit_cache[s=4] (target branch)
   avg     34.5    323.6    118.3     62.6     98.3    504.3    128.5    171.9    101.6      1.4    1545.0    1006.2+/- 271.4     420.7   explicit_cache[s=4] (this PR)

Is it a before/after comparison of the impact of these changes?


What about these ones?

This is the timing comparison for CPU on cgpu-1. It must be that the new memory layout is not very favorable for CPU, but at least it seems like it's not too bad on GPU.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 436.833 ms
Average time for input loading = 7381.81 ms
Average time for lst::Event creation = 0.00952539 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     25.6    210.9     79.7     50.6     76.2    263.4     69.3    116.3     56.0      0.5     948.4     659.4+/- 172.5     949.3   explicit_cache[s=1]
   avg     23.6    210.5     79.8     47.5     70.0    261.7     69.6    105.6     55.7      1.2     925.2     639.9+/- 170.4     237.7   explicit_cache[s=4]
   avg     25.5    233.4     88.1     51.6     78.9    291.4     76.9    114.1     61.7      1.7    1023.5     706.5+/- 184.3      76.3   explicit_cache[s=16]
   avg     30.1    239.7    110.3     65.5     89.1    294.4     81.6    119.5     63.2     11.8    1105.0     780.5+/- 191.8      45.0   explicit_cache[s=32]
   avg     45.1    259.1    102.9     76.0     99.4    309.3     85.1    124.2     66.5      7.7    1175.5     821.0+/- 220.9      27.5   explicit_cache[s=64]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 428.306 ms
Average time for input loading = 7442.49 ms
Average time for lst::Event creation = 0.0342869 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     24.3    229.5     88.7     37.8     74.1    260.3     67.4     84.7     55.8     10.4     932.8     648.3+/- 153.6     934.3   explicit_cache[s=1]
   avg     22.7    214.4     79.5     30.8     70.3    258.0     68.5     85.5     56.2      2.8     888.7     608.0+/- 157.3     231.3   explicit_cache[s=4]
   avg     26.7    237.3     90.9     36.3     81.6    286.3     76.5     95.1     62.3      3.7     996.7     683.7+/- 164.7      73.5   explicit_cache[s=16]
   avg     33.0    240.5     94.2     37.1     82.7    287.2     76.9     95.8     62.6      4.7    1014.8     694.6+/- 165.4      40.8   explicit_cache[s=32]
   avg     68.2    254.0    108.0     53.0     94.4    297.7     83.5    101.9     64.6     11.2    1136.4     770.5+/- 185.7      26.6   explicit_cache[s=64]

What do the five rows indicate?


Looking at the columns, what do T3 and pT3 refer to, in terms of code?

@ariostas
Member Author

ariostas commented Oct 2, 2024

could you tell me how to interpret these timing results?
Is it a before/after comparison of the impact of these changes?

Yeah, the first row is before and the second row is after (it's not very clear, but they are labeled at the end of each line). The columns are the average time to complete each step of the algorithm.

What about these ones?
What do the five rows indicate?

These are the same, but testing different numbers of streams, indicated by [s=*] at the end of the line.

Looking at the columns, what do T3 and pT3 refer to, in terms of code?

What is being timed are these functions (the other columns time equivalent functions):

void Event::createTriplets() {

void Event::createPixelTriplets() {

@makortel

makortel commented Oct 2, 2024

The columns are the (average) time to complete each step of the algorithm.

How exactly do you measure time? Is it CPU time? Wall clock time? Kernel time?

Comment on lines 914 to 916
// This is not used, but it is needed for compilation
template <typename T0, typename... Args>
struct CopyToHost<PortableHostMultiCollection<T0, Args...>> {

Could you elaborate? Having to copy a "host collection" to host seems wrong.

Member Author

I wanted to ask you about this. I needed to add it so that this part compiles (even with the if constexpr).

template <typename TSoA>
typename TSoA::ConstView Event::getSegments(bool sync) {
if constexpr (std::is_same_v<Device, DevHost>)
return segmentsDev_->const_view<TSoA>();
if (!segmentsHost_) {
segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
if (sync)
alpaka::wait(queue_); // host consumers expect filled data
}
return segmentsHost_->const_view<TSoA>();
}
template SegmentsConst Event::getSegments<SegmentsSoA>(bool);
template SegmentsOccupancyConst Event::getSegments<SegmentsOccupancySoA>(bool);
template SegmentsPixelConst Event::getSegments<SegmentsPixelSoA>(bool);


The if constexpr allows the non-taken branch to be non-compilable only when the condition depends on a template argument (e.g. if getSegments() took the Queue as an argument, though this could be achieved in many other ways). I'd want to avoid specializing CopyToHost for host-specific data types.
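A minimal standalone example of that rule (hypothetical types, unrelated to the LST code): each branch of the if constexpr calls a member that exists on only one of the two types, and this compiles precisely because the condition depends on the template parameter and the branches sit inside the if/else of the if constexpr.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag types standing in for host/device collections.
struct Host {
  std::string name() const { return "host"; }
};
struct Device {
  std::string copyToHost() const { return "copied"; }
};

// Each branch calls a member that exists on only one of the two types. This
// compiles because the condition depends on the template parameter T, so the
// discarded branch is never instantiated.
template <typename T>
std::string fetch(const T& t) {
  if constexpr (std::is_same_v<T, Host>) {
    return t.name();        // would not compile for Device
  } else {
    return t.copyToHost();  // would not compile for Host
  }
}
```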


On a similar note: is it intended to make a copy of the data when running on the host?


In fact, what happens if you avoid the copy at line 1404 when running on the CPU?
Does the code require that the device buffer is still valid after the copy? Does it assume that the "host" and "device" buffers can be modified independently of each other?


Ah, ignore me, I missed the if on line 1401.

Member Author

@makortel I tried this

template <typename TSoA, typename TDev = Device>
typename TSoA::ConstView Event::getSegments(bool sync) {
  if constexpr (std::is_same_v<TDev, DevHost>)
    return segmentsDev_->const_view<TSoA>();
  if (!segmentsHost_) {
    segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
    if (sync)
      alpaka::wait(queue_);  // host consumers expect filled data
  }
  return segmentsHost_->const_view<TSoA>();
}

and this

template <typename TSoA, typename TQueue>
typename TSoA::ConstView Event::getSegments(bool sync, TQueue& queue) {
  if constexpr (std::is_same_v<alpaka::Dev<TQueue>, DevHost>)
    return segmentsDev_->const_view<TSoA>();
  if (!segmentsHost_) {
    segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
    if (sync)
      alpaka::wait(queue_);  // host consumers expect filled data
  }
  return segmentsHost_->const_view<TSoA>();
}

but neither of them compiles.


It seems you need to put the non-compilable code into an explicit if or else branch, along the lines of:

template <typename TSoA, typename TQueue>
typename TSoA::ConstView Event::getSegments(bool sync, TQueue& queue) {
  if constexpr (std::is_same_v<alpaka::Dev<TQueue>, DevHost>) {
    return segmentsDev_->const_view<TSoA>();
  } else {
    if (!segmentsHost_) {
      segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
      if (sync)
        alpaka::wait(queue_);  // host consumers expect filled data
    }
    return segmentsHost_->const_view<TSoA>();
  }
}

Member Author

@ariostas ariostas Oct 4, 2024

Thank you @makortel. For some reason that still didn't work for me, but I got it to work by additionally involving the templated type explicitly in CopyToHost.

@ariostas
Member Author

ariostas commented Oct 2, 2024

How exactly do you measure time? Is it CPU time? Wall clock time? Kernel time?

Just wall time:

my_timer.Start();
event->createTriplets();
event->wait(); // device side event calls are asynchronous: wait to measure time or print
float t3_elapsed = my_timer.RealTime();
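For reference, the same wall-clock measurement pattern can be sketched with standard C++ alone (my_timer above appears to be a ROOT-style stopwatch; this version uses std::chrono::steady_clock and a stand-in workload in place of the event calls):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Wall-clock timing around one step. In the real harness the workload would
// be event->createTriplets() followed by event->wait(), since device-side
// calls are asynchronous and must finish before the timer is stopped.
double timeStepMs() {
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // stand-in workload
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```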

@ariostas
Member Author

I resolved the merge conflicts and changed things to match the conventions from previous PRs. If the CI plots look fine, then this is ready for review.

/run all


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     37.7    320.1    138.7     68.3    108.5    505.5    128.4    169.0    100.6      2.7    1579.6    1036.5+/- 283.0     428.6   explicit_cache[s=4] (target branch)
   avg     43.0    319.9    140.7     82.4    111.0    545.9    130.0    190.6    107.5      2.7    1673.9    1085.0+/- 300.5     454.9   explicit_cache[s=4] (this PR)

@ariostas ariostas marked this pull request as ready for review October 16, 2024 15:42

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.


@slava77 slava77 left a comment


I have a few rather technical comments; these can be addressed later.

segmentsDC_.emplace(segments_sizes, queue_);

auto nSegments_view =
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), nLowerModules_ + 1);

Suggested change
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), nLowerModules_ + 1);
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), segments_sizes[1]);

same below

Member Author

Sorry I missed these. I'll fix them and finish the SoA migration next week.


// Create source views for size and mdSize
auto src_view_size = alpaka::createView(cms::alpakatools::host(), &size, (Idx)1u);
auto src_view_mdSize = alpaka::createView(cms::alpakatools::host(), &mdSize, (Idx)1u);

auto dst_view_segments = alpaka::createSubView(segmentsBuffers_->nSegments_buf, (Idx)1u, (Idx)pixelModuleIndex);
SegmentsOccupancy segmentsOccupancy = segmentsDC_->view<SegmentsOccupancySoA>();
auto nSegments_view = alpaka::createView(devAcc_, segmentsOccupancy.nSegments(), (Idx)nLowerModules_ + 1);

Better to use segmentsOccupancy.metadata().size() here explicitly (same below).

@slava77 slava77 merged commit cb84b9c into CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 Oct 18, 2024
3 checks passed
@slava77 slava77 mentioned this pull request Oct 28, 2024