
Migrate segments to SoA+PortableCollection #93

Merged
merged 10 commits into from
Oct 18, 2024

Conversation

ariostas
Member

This PR is a draft for a possible way to migrate our existing classes to SoA+PortableCollection. Since segments have columns of three different lengths, I had to introduce an extra struct. I still need to think about whether there is a better way to do this and whether the names can be improved. Also, I haven't tested whether it works yet.
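As a rough illustration of the constraint described above, here is a minimal plain-C++ sketch (hypothetical struct and member names, not the actual SoA/PortableCollection machinery): each SoA block ties all of its columns to one common length, so segments with per-segment, per-module, and per-pixel-segment columns end up as three blocks bundled into one owning collection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: each SoA block has one common column length.
struct SegmentsSoA {           // per-segment columns (length = nSegmentsMax)
  std::vector<unsigned> mdIndices;
  std::vector<float> dPhis;
};

struct SegmentsOccupancySoA {  // per-module columns (length = nModules + 1)
  std::vector<unsigned> nSegments;
};

struct SegmentsPixelSoA {      // per-pixel-segment columns (length = nPixelMax)
  std::vector<float> ptIn;
  std::vector<int> charge;
};

// A collection bundling the three differently sized blocks, analogous in
// spirit to a multi-layout portable collection.
struct SegmentsCollection {
  SegmentsSoA segments;
  SegmentsOccupancySoA occupancy;
  SegmentsPixelSoA pixel;

  SegmentsCollection(std::size_t nSeg, std::size_t nMod, std::size_t nPix) {
    segments.mdIndices.resize(nSeg);
    segments.dPhis.resize(nSeg);
    occupancy.nSegments.resize(nMod + 1);
    pixel.ptIn.resize(nPix);
    pixel.charge.resize(nPix);
  }
};
```

In the actual PR the blocks are SoA layouts inside a single portable collection, with the per-block sizes passed at construction time.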

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.

@ariostas
Member Author

/run standalone


There was a problem while building and running in standalone mode. The logs can be found here.


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.4    321.7    112.3     44.1     92.9    496.9    125.5    148.6    102.2      1.7    1479.4     949.1+/- 255.2     406.1   explicit_cache[s=4] (target branch)
   avg     43.2    320.4    110.7     63.0     97.4    502.7    129.4    171.0    101.1      2.0    1540.9     995.0+/- 273.9     418.3   explicit_cache[s=4] (this PR)

@slava77

slava77 commented Sep 23, 2024

Here is a timing comparison:

Is the T3 increase from 44 to 63 real?
It would be nice to re-check locally on GPU to see if this is reproduced.

@ariostas
Member Author

Here's a timing comparison on cgpu-1.

This PR (9e2a402)
Total Timing Summary
Average time for map loading = 469.029 ms
Average time for input loading = 7456.79 ms
Average time for lst::Event creation = 0.0362964 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     16.2      1.3      0.4      1.4      1.5      0.4      0.8      0.6      1.2      0.1      23.8       7.2+/-  1.4      25.8   explicit_cache[s=1]
   avg      5.1      1.4      0.6      1.9      1.9      0.5      1.1      0.7      1.7      0.2      15.2       9.6+/-  2.0       8.7   explicit_cache[s=2]
   avg      8.6      1.9      0.9      3.3      3.1      0.6      2.3      1.3      2.9      0.3      25.3      16.1+/-  3.4       7.0   explicit_cache[s=4]
   avg     13.7      2.4      1.4      4.7      4.5      0.7      3.6      2.0      4.4      0.4      37.7      23.3+/-  5.2       6.8   explicit_cache[s=6]
   avg     19.8      3.2      1.7      6.2      6.0      0.9      4.8      2.6      5.4      0.6      51.2      30.5+/-  7.1       6.8   explicit_cache[s=8]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 468.89 ms
Average time for input loading = 7462.57 ms
Average time for lst::Event creation = 0.0350249 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     16.6      1.2      0.4      1.4      1.5      0.4      0.7      0.4      1.2      0.1      23.9       6.9+/-  1.4      26.1   explicit_cache[s=1]
   avg      5.5      1.4      0.5      1.8      1.9      0.4      1.1      0.6      1.7      0.2      15.2       9.3+/-  2.0       8.7   explicit_cache[s=2]
   avg      8.7      1.7      0.8      3.1      2.8      0.5      2.2      1.2      2.7      0.4      24.1      14.9+/-  3.1       6.7   explicit_cache[s=4]
   avg     14.2      2.2      1.3      4.2      4.2      0.7      3.3      1.8      4.0      0.5      36.4      21.5+/-  5.1       6.5   explicit_cache[s=6]
   avg     21.0      2.9      1.5      5.2      5.4      0.8      4.6      2.6      5.0      0.6      49.4      27.6+/-  6.4       6.6   explicit_cache[s=8]

It seems like it might actually be a bit slower. I'll check whether initializing the whole buffer is causing this, or if it's unrelated.

I'm still working on getting the CI to run the CMSSW comparison.

@ariostas
Member Author

/run cmssw


There was a problem while building and running with CMSSW. The logs can be found here.


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas
Member Author

ariostas commented Sep 25, 2024

Pending the CI test, it seems like initializing only the required columns fixed the timing.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 464.98 ms
Average time for input loading = 7410.43 ms
Average time for lst::Event creation = 0.0355244 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     15.3      1.2      0.4      1.4      1.5      0.4      0.7      0.5      1.2      0.1      22.6       6.9+/-  1.4      24.5   explicit_cache[s=1]
   avg      4.7      1.4      0.5      1.9      1.8      0.5      1.1      0.7      1.7      0.1      14.4       9.2+/-  2.0       8.3   explicit_cache[s=2]
   avg      8.2      1.7      0.9      3.1      2.9      0.5      2.2      1.2      2.8      0.3      23.7      15.0+/-  3.3       6.6   explicit_cache[s=4]
   avg     13.4      2.2      1.3      4.3      4.3      0.7      3.4      1.9      4.1      0.4      35.9      21.9+/-  5.8       6.5   explicit_cache[s=6]
   avg     19.6      2.9      1.5      5.6      5.7      0.8      4.7      2.7      5.2      0.6      49.1      28.7+/-  7.5       6.5   explicit_cache[s=8]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 461.071 ms
Average time for input loading = 7381.67 ms
Average time for lst::Event creation = 0.0382719 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     17.6      1.2      0.4      1.4      1.5      0.4      0.7      0.4      1.2      0.1      24.9       6.9+/-  1.4      27.0   explicit_cache[s=1]
   avg      5.5      1.4      0.5      1.9      1.9      0.4      1.1      0.6      1.7      0.2      15.3       9.3+/-  1.9       8.8   explicit_cache[s=2]
   avg      8.8      1.7      0.8      3.0      2.9      0.5      2.1      1.2      2.8      0.4      24.2      14.8+/-  3.0       6.7   explicit_cache[s=4]
   avg     14.4      2.2      1.2      4.1      4.3      0.7      3.3      1.8      4.0      0.5      36.5      21.4+/-  5.4       6.6   explicit_cache[s=6]
   avg     21.1      3.0      1.5      5.4      5.8      0.8      4.6      2.6      5.0      0.8      50.5      28.6+/-  7.5       6.7   explicit_cache[s=8]

/run standalone


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.7    323.4    113.3     45.9     94.6    500.1    127.2    148.9    102.3      3.0    1492.4     958.7+/- 250.9     408.1   explicit_cache[s=4] (target branch)
   avg     34.5    323.6    118.3     62.6     98.3    504.3    128.5    171.9    101.6      1.4    1545.0    1006.2+/- 271.4     420.7   explicit_cache[s=4] (this PR)

@ariostas
Member Author

ariostas commented Sep 25, 2024

Hmm, it's weird that the T3 timing is still significantly higher in the CI.

@ariostas
Member Author

/run cmssw


There was a problem while building and running with CMSSW. The logs can be found here.


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas
Member Author

This is the timing comparison for CPU on cgpu-1. It must be that the new memory layout is not very favorable for CPU, but at least it seems like it's not too bad on GPU.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 436.833 ms
Average time for input loading = 7381.81 ms
Average time for lst::Event creation = 0.00952539 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     25.6    210.9     79.7     50.6     76.2    263.4     69.3    116.3     56.0      0.5     948.4     659.4+/- 172.5     949.3   explicit_cache[s=1]
   avg     23.6    210.5     79.8     47.5     70.0    261.7     69.6    105.6     55.7      1.2     925.2     639.9+/- 170.4     237.7   explicit_cache[s=4]
   avg     25.5    233.4     88.1     51.6     78.9    291.4     76.9    114.1     61.7      1.7    1023.5     706.5+/- 184.3      76.3   explicit_cache[s=16]
   avg     30.1    239.7    110.3     65.5     89.1    294.4     81.6    119.5     63.2     11.8    1105.0     780.5+/- 191.8      45.0   explicit_cache[s=32]
   avg     45.1    259.1    102.9     76.0     99.4    309.3     85.1    124.2     66.5      7.7    1175.5     821.0+/- 220.9      27.5   explicit_cache[s=64]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 428.306 ms
Average time for input loading = 7442.49 ms
Average time for lst::Event creation = 0.0342869 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     24.3    229.5     88.7     37.8     74.1    260.3     67.4     84.7     55.8     10.4     932.8     648.3+/- 153.6     934.3   explicit_cache[s=1]
   avg     22.7    214.4     79.5     30.8     70.3    258.0     68.5     85.5     56.2      2.8     888.7     608.0+/- 157.3     231.3   explicit_cache[s=4]
   avg     26.7    237.3     90.9     36.3     81.6    286.3     76.5     95.1     62.3      3.7     996.7     683.7+/- 164.7      73.5   explicit_cache[s=16]
   avg     33.0    240.5     94.2     37.1     82.7    287.2     76.9     95.8     62.6      4.7    1014.8     694.6+/- 165.4      40.8   explicit_cache[s=32]
   avg     68.2    254.0    108.0     53.0     94.4    297.7     83.5    101.9     64.6     11.2    1136.4     770.5+/- 185.7      26.6   explicit_cache[s=64]


The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas ariostas changed the title Moved segments to SoA+PortableCollection Migrate segments to SoA+PortableCollection Sep 30, 2024
Comment on lines +249 to +261
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.ptIn(), size), ptIn, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.ptErr(), size), ptErr, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.px(), size), px, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.py(), size), py, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.pz(), size), pz, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.etaErr(), size), etaErr, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.isQuad(), size), isQuad, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.eta(), size), eta, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.phi(), size), phi, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.charge(), size), charge, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.seedIdx(), size), seedIdx, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.superbin(), size), superbin, size);
alpaka::memcpy(queue_, alpaka::createView(devAcc_, segmentsPixel.pixelType(), size), pixelType, size);

If I remember correctly, the final size argument can be omitted and will be assumed to match the size of the copy's destination.

Member Author

That usually works, but not when combining std::vector and Alpaka buffers. You end up getting this error:

In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/buf/Traits.hpp:10,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/dev/DevCpu.hpp:11,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/acc/AccCpuOmp2Blocks.hpp:36,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/alpaka.hpp:13,
                 from /home/users/anriosta/CMSSW_tests/CMSSW_14_2_0_pre1/src/RecoTracker/LSTCore/standalone/../../../HeterogeneousCore/AlpakaInterface/interface/memory.h:6,
                 from ../../src/alpaka/Event.dev.cc:1:
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp: In instantiation of 'auto alpaka::createTaskMemcpy(TViewDstFwd&&, const TViewSrc&, const TExtent&) [with TExtent = Vec<std::integral_constant<long unsigned int, 1>, long unsigned int>; TViewSrc = std::vector<float>; TViewDstFwd = ViewPlainPtr<DevCpu, float, std::integral_constant<long unsigned int, 1>, unsigned int>]':
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:312:40:   required from 'void alpaka::memcpy(TQueue&, TViewDstFwd&&, const TViewSrc&) [with TViewSrc = std::vector<float>; TViewDstFwd = ViewPlainPtr<DevCpu, float, std::integral_constant<long unsigned int, 1>, unsigned int>; TQueue = QueueGenericThreadsBlocking<DevCpu>]'
../../src/alpaka/Event.dev.cc:249:17:   required from here
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:277:58: error: static assertion failed: The destination view and the extent are required to have compatible index types!
  277 |             meta::IsIntegralSuperset<DstIdx, ExtentIdx>::value,
      |                                                          ^~~~~
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-4d4f1220bfca9be4c4149ab758d15463/include/alpaka/mem/view/Traits.hpp:277:58: note: 'std::integral_constant<bool, false>::value' evaluates to false


Thanks for the error message - I will try to reproduce it and open an issue with alpaka.
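The static assertion above boils down to an index-type compatibility check between the destination view (indexed with unsigned int) and the extent (long unsigned int). A simplified standalone stand-in for that kind of trait (not the actual alpaka meta::IsIntegralSuperset, which handles more cases) illustrates the failure mode:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Simplified stand-in: Super is an "integral superset" of Sub when both are
// integral, have the same signedness, and every Sub value fits into Super.
template <typename Super, typename Sub>
struct IsIntegralSuperset
    : std::bool_constant<std::is_integral_v<Super> && std::is_integral_v<Sub> &&
                         std::is_signed_v<Super> == std::is_signed_v<Sub> &&
                         sizeof(Super) >= sizeof(Sub)> {};

// Widening the index type is fine; a destination view indexed with a 32-bit
// type cannot absorb a 64-bit extent, which is what the assertion rejects.
constexpr bool okWiden = IsIntegralSuperset<std::uint64_t, std::uint32_t>::value;
constexpr bool badNarrow = IsIntegralSuperset<std::uint32_t, std::uint64_t>::value;
```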

@fwyzard

fwyzard commented Oct 2, 2024

@ariostas @slava77 could you tell me how to interpret these timing results?

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     33.7    323.4    113.3     45.9     94.6    500.1    127.2    148.9    102.3      3.0    1492.4     958.7+/- 250.9     408.1   explicit_cache[s=4] (target branch)
   avg     34.5    323.6    118.3     62.6     98.3    504.3    128.5    171.9    101.6      1.4    1545.0    1006.2+/- 271.4     420.7   explicit_cache[s=4] (this PR)

Is it a before/after comparison of the impact of these changes?


What about these ones?

This is the timing comparison for CPU on cgpu-1. It must be that the new memory layout is not very favorable for CPU, but at least it seems like it's not too bad on GPU.

This PR (96be4f7)
Total Timing Summary
Average time for map loading = 436.833 ms
Average time for input loading = 7381.81 ms
Average time for lst::Event creation = 0.00952539 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     25.6    210.9     79.7     50.6     76.2    263.4     69.3    116.3     56.0      0.5     948.4     659.4+/- 172.5     949.3   explicit_cache[s=1]
   avg     23.6    210.5     79.8     47.5     70.0    261.7     69.6    105.6     55.7      1.2     925.2     639.9+/- 170.4     237.7   explicit_cache[s=4]
   avg     25.5    233.4     88.1     51.6     78.9    291.4     76.9    114.1     61.7      1.7    1023.5     706.5+/- 184.3      76.3   explicit_cache[s=16]
   avg     30.1    239.7    110.3     65.5     89.1    294.4     81.6    119.5     63.2     11.8    1105.0     780.5+/- 191.8      45.0   explicit_cache[s=32]
   avg     45.1    259.1    102.9     76.0     99.4    309.3     85.1    124.2     66.5      7.7    1175.5     821.0+/- 220.9      27.5   explicit_cache[s=64]

CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 (3858cf3)
Total Timing Summary
Average time for map loading = 428.306 ms
Average time for input loading = 7442.49 ms
Average time for lst::Event creation = 0.0342869 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     24.3    229.5     88.7     37.8     74.1    260.3     67.4     84.7     55.8     10.4     932.8     648.3+/- 153.6     934.3   explicit_cache[s=1]
   avg     22.7    214.4     79.5     30.8     70.3    258.0     68.5     85.5     56.2      2.8     888.7     608.0+/- 157.3     231.3   explicit_cache[s=4]
   avg     26.7    237.3     90.9     36.3     81.6    286.3     76.5     95.1     62.3      3.7     996.7     683.7+/- 164.7      73.5   explicit_cache[s=16]
   avg     33.0    240.5     94.2     37.1     82.7    287.2     76.9     95.8     62.6      4.7    1014.8     694.6+/- 165.4      40.8   explicit_cache[s=32]
   avg     68.2    254.0    108.0     53.0     94.4    297.7     83.5    101.9     64.6     11.2    1136.4     770.5+/- 185.7      26.6   explicit_cache[s=64]

What do the five rows indicate?


Looking at the columns, what do T3 and pT3 refer to, in terms of code?

@ariostas
Member Author

ariostas commented Oct 2, 2024

could you tell me how to interpret these timing results?
Is it a before/after comparison of the impact of these changes?

Yeah, the first row is before and the second row is after (it's not very clear, but they are labeled at the end of each line). The columns are the average time to complete each step of the algorithm.

What about these ones?
What do the five rows indicate?

These are the same, but testing different numbers of streams, indicated by [s=*] at the end of the line.

Looking at the columns, what do T3 and pT3 refer to, in terms of code?

What is being timed are these functions (the other columns time equivalent functions):

void Event::createTriplets() {

void Event::createPixelTriplets() {

@makortel

makortel commented Oct 2, 2024

The columns are the (average) time to complete each step of the algorithm.

How exactly do you measure time? Is it CPU time? Wall clock time? Kernel time?

Comment on lines 914 to 916
// This is not used, but it is needed for compilation
template <typename T0, typename... Args>
struct CopyToHost<PortableHostMultiCollection<T0, Args...>> {

Could you elaborate? Having to copy a "host collection" to host seems wrong.

Member Author

I wanted to ask you about this. I needed to add it so that this part compiles (even with the if constexpr).

template <typename TSoA>
typename TSoA::ConstView Event::getSegments(bool sync) {
if constexpr (std::is_same_v<Device, DevHost>)
return segmentsDev_->const_view<TSoA>();
if (!segmentsHost_) {
segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
if (sync)
alpaka::wait(queue_); // host consumers expect filled data
}
return segmentsHost_->const_view<TSoA>();
}
template SegmentsConst Event::getSegments<SegmentsSoA>(bool);
template SegmentsOccupancyConst Event::getSegments<SegmentsOccupancySoA>(bool);
template SegmentsPixelConst Event::getSegments<SegmentsPixelSoA>(bool);


The if constexpr allows the non-taken branch to be non-compilable only when the condition depends on a template argument (e.g. if getSegments() took the Queue as an argument, though this could be achieved in many other ways). I'd want to avoid specializing CopyToHost for host-specific data types.
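A minimal standalone example of that rule (hypothetical types, unrelated to the LST code): each branch of the if constexpr calls a member that exists on only one of the two types, and this compiles precisely because the condition depends on the template parameter and the branches sit inside the if/else of the if constexpr.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag types standing in for host/device collections.
struct Host {
  std::string name() const { return "host"; }
};
struct Device {
  std::string copyToHost() const { return "copied"; }
};

// Each branch calls a member that exists on only one of the two types. This
// compiles because the condition depends on the template parameter T, so the
// discarded branch is never instantiated.
template <typename T>
std::string fetch(const T& t) {
  if constexpr (std::is_same_v<T, Host>) {
    return t.name();        // would not compile for Device
  } else {
    return t.copyToHost();  // would not compile for Host
  }
}
```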


On a similar note: is it intended to make a copy of the data when running on the host?


In fact, what happens if you avoid the copy at line 1404 when running on the CPU?
Does the code require that the device buffer is still valid after the copy? Does it assume that the "host" and "device" buffers can be modified independently of each other?


Ah, ignore me, I missed the if on line 1401.

Member Author

@makortel I tried this

template <typename TSoA, typename TDev = Device>
typename TSoA::ConstView Event::getSegments(bool sync) {
  if constexpr (std::is_same_v<TDev, DevHost>)
    return segmentsDev_->const_view<TSoA>();
  if (!segmentsHost_) {
    segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
    if (sync)
      alpaka::wait(queue_);  // host consumers expect filled data
  }
  return segmentsHost_->const_view<TSoA>();
}

and this

template <typename TSoA, typename TQueue>
typename TSoA::ConstView Event::getSegments(bool sync, TQueue& queue) {
  if constexpr (std::is_same_v<alpaka::Dev<TQueue>, DevHost>)
    return segmentsDev_->const_view<TSoA>();
  if (!segmentsHost_) {
    segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
    if (sync)
      alpaka::wait(queue_);  // host consumers expect filled data
  }
  return segmentsHost_->const_view<TSoA>();
}

but neither of them compiles.


It seems you need to put the non-compilable code into an explicit if or else branch, along the lines of:

template <typename TSoA, typename TQueue>
typename TSoA::ConstView Event::getSegments(bool sync, TQueue& queue) {
  if constexpr (std::is_same_v<alpaka::Dev<TQueue>, DevHost>) {
    return segmentsDev_->const_view<TSoA>();
  } else {
    if (!segmentsHost_) {
      segmentsHost_.emplace(cms::alpakatools::CopyToHost<SegmentsDeviceCollection>::copyAsync(queue_, *segmentsDev_));
      if (sync)
        alpaka::wait(queue_);  // host consumers expect filled data
    }
    return segmentsHost_->const_view<TSoA>();
  }
}

Member Author

@ariostas ariostas Oct 4, 2024

Thank you @makortel. For some reason that still didn't work for me, but I got it to work by additionally involving the templated type explicitly in CopyToHost.

@ariostas
Member Author

ariostas commented Oct 2, 2024

How exactly do you measure time? Is it CPU time? Wall clock time? Kernel time?

Just wall time:

my_timer.Start();
event->createTriplets();
event->wait(); // device side event calls are asynchronous: wait to measure time or print
float t3_elapsed = my_timer.RealTime();
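For reference, the same wall-clock measurement pattern can be sketched with standard C++ alone (my_timer above appears to be a ROOT-style stopwatch; this version uses std::chrono::steady_clock and a stand-in workload in place of the event calls):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Wall-clock timing around one step. In the real harness the workload would
// be event->createTriplets() followed by event->wait(), since device-side
// calls are asynchronous and must finish before the timer is stopped.
double timeStepMs() {
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // stand-in workload
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```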

@ariostas
Member Author

I resolved the merge conflicts and changed things to match the conventions from previous PRs. If the CI plots look fine, then this is ready for review.

/run all


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     37.7    320.1    138.7     68.3    108.5    505.5    128.4    169.0    100.6      2.7    1579.6    1036.5+/- 283.0     428.6   explicit_cache[s=4] (target branch)
   avg     43.0    319.9    140.7     82.4    111.0    545.9    130.0    190.6    107.5      2.7    1673.9    1085.0+/- 300.5     454.9   explicit_cache[s=4] (this PR)

@ariostas ariostas marked this pull request as ready for review October 16, 2024 15:42

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.


@slava77 slava77 left a comment


I have a few rather technical comments; these can be addressed later.

segmentsDC_.emplace(segments_sizes, queue_);

auto nSegments_view =
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), nLowerModules_ + 1);

Suggested change
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), nLowerModules_ + 1);
alpaka::createView(devAcc_, segmentsDC_->view<SegmentsOccupancySoA>().nSegments(), segments_sizes[1]);

same below

Member Author

Sorry I missed these. I'll fix them and finish the SoA migration next week.


// Create source views for size and mdSize
auto src_view_size = alpaka::createView(cms::alpakatools::host(), &size, (Idx)1u);
auto src_view_mdSize = alpaka::createView(cms::alpakatools::host(), &mdSize, (Idx)1u);

auto dst_view_segments = alpaka::createSubView(segmentsBuffers_->nSegments_buf, (Idx)1u, (Idx)pixelModuleIndex);
SegmentsOccupancy segmentsOccupancy = segmentsDC_->view<SegmentsOccupancySoA>();
auto nSegments_view = alpaka::createView(devAcc_, segmentsOccupancy.nSegments(), (Idx)nLowerModules_ + 1);

Better to use segmentsOccupancy.metadata().size() here explicitly (same below).

@slava77 slava77 merged commit cb84b9c into CMSSW_14_1_0_pre3_LST_X_LSTCore_realfiles_batch7 Oct 18, 2024
3 checks passed
@slava77 slava77 mentioned this pull request Oct 28, 2024